Regex behavior differences between Windows and Linux

Dust · January 17, 2017, 11:17am

Hi there,

I have this script:

parenthesized parts 7 of matches (regex"^([^\s]*)\s*([^\s]*)\s*([^\s]*)\s*([^\s]*)\s*([^\s]*)\s*([^\s]*)\s*([^\s]*)\s") of ((if it contains "&&" then it;("* * * * root "&it) of (matches (regex"&&.*")of it) else it ) of "28 0 * * * root test -x /etc/cron.daily/popularity-contest && /etc/cron.daily/popularity-contest --crond ")

It takes the string and returns only the commands (checking for multiple commands joined with “&&”) in parenthesized part 7. It works fine on Windows, with this output:

A: test
A: /etc/cron.daily/popularity-contest
T: 0.506 ms
I: plural substring

But when I try it on Linux, this is the output:

A:
A:
T: 1022

I saw that something is returned in parenthesized part 1:

A: 28 0 * * * root te
A: * * * * root && /etc/cron.daily/popularity-conte
T: 1384

Why are the last words cut off, and why does the behavior differ?

Thank you in advance.

JasonWalker · January 17, 2017, 1:58pm

Yes, there are regex differences between platforms due to the underlying libraries provided on each platform. There’s a long discussion in the thread "Digit \d matching in regex".

It seems the safest form is to use POSIX-compliant regex syntax such as [[:space:]].
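
A possible explanation for the truncated Linux results, offered as an inference from the output and not something confirmed in this thread: inside a POSIX bracket expression a backslash is a literal character, so [^\s] reads as "any character except a backslash or the letter s". That would be consistent with both Linux results stopping just before an "s" ("te" from "test", "conte" from "contest"). A minimal sketch of the original pattern with every \s swapped for [[:space:]] follows. It drops the "&&" handling for brevity, has not been run on either platform here, and is meant only to show the substitution:

```
Q: parenthesized parts 7 of matches (regex "^([^[:space:]]*)[[:space:]]*([^[:space:]]*)[[:space:]]*([^[:space:]]*)[[:space:]]*([^[:space:]]*)[[:space:]]*([^[:space:]]*)[[:space:]]*([^[:space:]]*)[[:space:]]*([^[:space:]]*)[[:space:]]") of "28 0 * * * root test -x /etc/cron.daily/popularity-contest && /etc/cron.daily/popularity-contest --crond "
```

Note that the negated class is written [^[:space:]], with the POSIX class name nested inside the outer brackets.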

Dust · January 17, 2017, 4:54pm

Thank you. I saw that it also works using
regex"[^ ]*"

that is, with a literal " " as the space.

jgstew · January 17, 2017, 5:50pm

In general, I recommend narrowing the scope of the lines / strings you are actually parsing with RegEx by using "lines containing", "preceding text", "following text", and similar inspectors, as well as "whose" filters. Often you don’t end up needing RegEx at all.

There are a few different reasons I recommend this approach over regex alone:

A regex that needs to handle any input and find just the string pattern you are looking for in particular is going to be much more complicated and error prone.
Regex can take an indeterminate amount of time to evaluate that varies based upon the input text. It is not possible to predict how long it will take with certainty. This is partially mitigated by having simplified regex that only operates on limited input text.
It can be hard to tell what a RegEx is supposed to do by just looking at the RegEx itself. It is non-obvious.
Every RegEx engine can have slightly different quirks, which can be annoying.
It is possible in some cases to have a RegEx that effectively results in an infinite loop that never resolves, which could have bad consequences. This usually depends on the RegEx engine, so I don’t think it applies to BigFix.

Benefits of RegEx over other methods alone:

Sometimes it is the only option to accomplish the task.
You may have existing RegEx that you are already using in other languages
Easier to find RegEx examples online.

Even in the cases where RegEx is the best method, you are almost always better off using a hybrid approach that combines string parsing, filtering, and RegEx.

RegEx is valuable for input validation. It is not hard to parse out words that are likely email addresses from an arbitrary string, but it is very hard to validate that they are “valid” email addresses without RegEx. This is less needed in cases where you are reasonably sure that the input only contains valid items already.

Example:

Q: substrings separated by " " whose(it contains "@" AND it contains ".") of "ab.c abc def abc@def.com abc@abc@def.com @.com xyz"
A: abc@def.com
A: abc@abc@def.com
A: @.com
T: 2556

Only 1 of the results is actually a valid email address, but many other possibilities have been eliminated, which means that using RegEx for validation only needs to run on a more limited set of possible inputs.
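
As a sketch of that second step, the same filter could be followed by a regex check that only ever sees the pre-filtered candidates. The pattern below is a deliberately loose, hypothetical one (one "@", no whitespace, at least one "." after the "@") and is not a full email validator. It uses [.] and [[:space:]] in place of backslash escapes to stay clear of the platform differences discussed above, and it has not been run here:

```
Q: substrings separated by " " whose (it contains "@" AND it contains "." AND exists matches (regex "^[^@[:space:]]+@[^@[:space:]]+[.][^@[:space:]]+$") of it) of "ab.c abc def abc@def.com abc@abc@def.com @.com xyz"
```

The intent is that the cheap "contains" tests discard most substrings first, so the regex only evaluates the few that remain.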

This would get only the first word of the 2nd command from a multi-command line, without RegEx (a partial example, not a complete solution):

Q: (preceding text of first " " of it | it) of following texts of firsts " && " of it whose(it contains " && ") of "28 0 * * * root test -x /etc/cron.daily/popularity-contest && /etc/cron.daily/popularity-contest --crond "
A: /etc/cron.daily/popularity-contest
T: 2484

This would give the 7th word from the string:

Q: tuple string items 6 of concatenations ", " of substrings separated by " " of "28 0 * * * root test -x /etc/cron.daily/popularity-contest && /etc/cron.daily/popularity-contest --crond "
A: test
T: 2457

This combines both, but only works properly for exactly 2 commands:

Q: ( ( tuple string items 6 of concatenations ", " of substrings separated by " " of it ) ; ( (preceding text of first " " of it | it) of following texts of firsts " && " of it whose(it contains " && ") of it ) ) of "28 0 * * * root test -x /etc/cron.daily/popularity-contest && /etc/cron.daily/popularity-contest --crond "
A: test
A: /etc/cron.daily/popularity-contest
T: 2451

My point isn’t that this is the right solution, or the best solution, but just an example of how it is possible to do some things without regex. It is hard to do this without regex and handle between 1 and an unknown number of commands.
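
One possible direction for the variable-count case, as an untested sketch: strip the schedule and user fields first, then split what is left on " && " and take the first word of each piece. It assumes the user field is literally "root" and that commands are always joined by " && " with single spaces, neither of which holds for every crontab line:

```
Q: (preceding text of first " " of it | it) of substrings separated by " && " of following text of first " root " of "28 0 * * * root test -x /etc/cron.daily/popularity-contest && /etc/cron.daily/popularity-contest --crond "
```

The intent is one result per command, whether the line holds one command or several. The hard-coded " root " is the weak point and would need replacing with something that skips the first six fields generically.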