Regex vs contains

So my compadre in crime and I had a bet… and I think I lost, and I’m sad. :frowning: I like having a defined pattern for checking things.

q: setting "PropName" of client as string as lowercase contains "-xx" as lowercase
A: True
**T: 0.188 ms**

q: value of setting "PropName" of client as lowercase = regex "([0-9]{3}-xx[0-9]{3})"
A: True
**T: 0.486 ms**

Can anyone think of a way to get a good pattern match without taking such a performance hit?

I’m a fan of regex for its accuracy more than for its performance. As @jgstew has rightly pointed out on some of my other regex postings, it’s certainly possible to incur a significant penalty, especially when the data is not well defined; but ill-defined data is usually where regex is the most useful to me.

I’d point out that in your example, the “contains -xx” is not nearly as precise as the regex; the regex calls for a three-digit number followed by -xx followed by another three-digit number. “foobar-xx-myfakestring” will match the “contains -xx” while it will not match the regex.
**Edit:** On a second read-through I see your regex is actually looking for a string that is exactly equal to three digits, then “-xx”, then three more digits. My “exists matches” below looks for a string that contains that pattern but may have additional text before or after it. You can change the “exists matches” to “=regex” and the logic still holds.
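
To make the three behaviors concrete, here is a rough analogy in Python rather than relevance. This is an assumption-laden sketch: Python's `re` module is not the engine BigFix uses, and `classify` is a made-up helper name. It only illustrates the logic of "contains" versus "pattern found somewhere" versus "whole string equals the pattern".

```python
import re

# Same pattern as in the thread: three digits, "-xx", three digits.
PATTERN = re.compile(r"[0-9]{3}-xx[0-9]{3}")

def classify(value):
    """Return (contains, found_somewhere, whole_string_matches) for one string."""
    return (
        "-xx" in value,                        # roughly: it contains "-xx"
        PATTERN.search(value) is not None,     # roughly: exists matches (regex ...) of it
        PATTERN.fullmatch(value) is not None,  # roughly: it = regex ...
    )

for sample in ("123-xx456", "host-123-xx456-a", "foobar-xx-myfakestring"):
    print(sample, classify(sample))
```

Only the first sample passes all three checks; the second contains the pattern but is not equal to it, and the third passes nothing but the plain substring test.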

Using a simpler regex (without the preceding/following number) incurs less penalty:

q: (it contains "-xx") of "asdflkjadfglkj9h4t8hzunvjksdbniuo-xx-asdkfjasdioufhnasdfh"
A: True
T: 0.033 ms
I: singular boolean


q: exists matches(regex("-xx")) of "asdflkjadfglkj9h4t8hzunvjksdbniuo-xx-asdkfjasdioufhnasdfh"
A: True
T: 0.069 ms
I: singular boolean

When you look at the added complexity of the relevance query to match all three conditions - that there is an “-xx” in the string; that there is a three-digit number before it; and that there is a three-digit number following it, I think the regex more than makes up for its overhead:

q:  (lasts 3 of substrings before "-xx" of it, substrings "-xx" of it, firsts 3 of substrings after "-xx" of it) of "123-xx456"
A: 123, -xx, 456
T: 0.089 ms
I: plural ( substring, substring, substring )

q: exists (items 0 of it as integer, items 2 of it as integer) of (lasts 3 of substrings before "-xx" of it, substrings "-xx" of it, firsts 3 of substrings after "-xx" of it) of "123-xx456"
A: True
T: 0.069 ms
I: singular boolean

q: exists (items 0 of it as integer, items 2 of it as integer) of (lasts 3 of substrings before "-xx" of it, substrings "-xx" of it, firsts 3 of substrings after "-xx" of it) of "asdflkjadfglkj9h4t8hzunvjksdbniuo-xx-asdkfjasdioufhnasdfh"
A: False
T: 0.101 ms
I: singular boolean

q: exists matches(regex("([0-9]{3}-xx[0-9]{3})")) of "123-xx456"
A: True
T: 0.134 ms
I: singular boolean

q: exists matches(regex("([0-9]{3}-xx[0-9]{3})")) of "asdflkjadfglkj9h4t8hzunvjksdbniuo-xx-asdkfjasdioufhnasdfh"
A: False
T: 0.148 ms
I: singular boolean

And, by the way, the non-regex version still gives a false positive when the string following the “-xx” itself starts with a dash, because “-45” is also valid as an integer:

q:  (lasts 3 of substrings before "-xx" of it, substrings "-xx" of it, firsts 3 of substrings after "-xx" of it) of "123-xx-456"
A: 123, -xx, -45
T: 0.085 ms
I: plural ( substring, substring, substring )

q: exists (items 0 of it as integer, items 2 of it as integer) of (lasts 3 of substrings before "-xx" of it, substrings "-xx" of it, firsts 3 of substrings after "-xx" of it) of "123-xx-456"
A: True
T: 0.079 ms
I: singular boolean
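
The same trap exists outside relevance. As a rough Python analogy (not BigFix behavior; `split_around`, `naive_check` and `strict_check` are hypothetical names, and only the first "-xx" in the string is considered), a signed-integer parse happily accepts "-45", while a digits-only test rejects it:

```python
def split_around(value, marker="-xx"):
    """Return (last 3 chars before marker, first 3 chars after it), or None."""
    before, found, after = value.partition(marker)
    if not found:
        return None
    return before[-3:], after[:3]

def parses_as_int(text):
    """True if int() accepts the text; note that a leading sign is allowed."""
    try:
        int(text)
        return True
    except ValueError:
        return False

def naive_check(value):
    # Mirrors the "as integer" approach: anything int() accepts counts.
    parts = split_around(value)
    return parts is not None and all(parses_as_int(p) for p in parts)

def strict_check(value):
    # Requires exactly three ASCII digits on each side of the marker.
    parts = split_around(value)
    return parts is not None and all(
        len(p) == 3 and p.isascii() and p.isdigit() for p in parts
    )

print(naive_check("123-xx-456"), strict_check("123-xx-456"))
```

Getting the non-regex version right takes this extra digits-only step, which is part of why the regex ends up being the more compact way to state the rule.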

I’ll repeat here a little bit of what @JasonWalker is referring to about what I have said before.

RegEx is incredibly powerful, and sometimes it is the only option to use.

With RegEx you can write a query that takes an indeterminately long time to find an answer, far more so than with something like contains. It may even be possible to construct a regex so complicated that it never determines an answer.

Also, I find RegEx to be less human readable, which I think is a significant negative.

In some cases it might be better to use contains AND RegEx. If the string satisfies the contains statement, only then check whether it satisfies the more complicated RegEx. This will save execution time on all strings that fail the contains.

(it as string) whose(it as lowercase contains "-xx" AND exists matches(regex("([0-9]{3}-xx[0-9]{3})")) of it) of values of settings "PropName" of clients
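
The two-stage idea is not specific to relevance. Here is a minimal Python sketch of the same check, assuming a hypothetical list of setting values (`sample_values` and `matching_values` are made up for illustration, and Python's `re` is not the BigFix regex engine):

```python
import re

PATTERN = re.compile(r"[0-9]{3}-xx[0-9]{3}")

def matching_values(values):
    """Keep values that pass the cheap substring test and then the regex."""
    return [
        v for v in values
        # Python's `and` short-circuits: the regex only runs when the
        # substring test has already passed.
        if "-xx" in v.lower() and PATTERN.search(v)
    ]

sample_values = [
    "123-xx456",
    "foobar-xx-myfakestring",
    "no-marker-here",
    "site 987-xx654 rack 2",
]
print(matching_values(sample_values))
```

As in the relevance above, the substring test is done on the lowercased value while the regex sees the original string, so a value like "123-XX456" gets past the first stage and is then rejected by the regex.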

All awesome info. Thanks for the clarity.