Extracting version strings from a log file

atlauren · October 4, 2021, 12:51am

Hi all,

I’d like to extract the version numbers from a daemon’s log file, for use in tracking versions and upgrades. On Unix I’d do this with awk and regex matching. Some simple relevance pulls just the applicable lines:

Q: ((lines containing "nessusd" of file "/Library/NessusAgent/run/var/nessus/logs/nessusd.messages") whose (it contains "build"))
A: [Wed May 26 13:10:49 2021][2466.1] nessusd 6.11.1 (build M20101) started 
A: [Wed May 26 21:33:09 2021][258.1] nessusd 6.11.1 (build M20101) started 
A: [Wed Jun  2 14:59:29 2021][298.1] nessusd 6.11.1 (build M20101) started 
A: [Tue Jun  8 11:21:22 2021][274.1] nessusd 6.11.1 (build M20101) started 
A: [Thu Jun 10 19:01:47 2021][272.1] nessusd 6.11.1 (build M20101) started 
A: [Wed Jun 16 21:08:08 2021][278.1] nessusd 6.11.1 (build M20101) started 
A: [Fri Jul  2 19:57:55 2021][287.1] nessusd 6.11.1 (build M20101) started 
A: [Wed Jul 21 12:03:02 2021][269.1] nessusd 6.11.1 (build M20101) started 
A: [Wed Jul 21 12:33:47 2021][317.1] nessusd 6.11.1 (build M20101) started 
A: [Mon Jul 26 11:59:34 2021][301.1] nessusd 6.11.1 (build M20101) started 
A: [Mon Jul 26 12:07:18 2021][351.1] nessusd 6.11.1 (build M20101) started 
A: [Fri Jul 30 03:54:22 2021][276.1] nessusd 6.11.1 (build M20101) started 
A: [Mon Aug  9 11:47:09 2021][272.1] nessusd 6.11.1 (build M20101) started 
A: [Fri Aug 20 14:37:49 2021][293.1] nessusd 6.11.1 (build M20101) started 
A: [Tue Sep  7 13:52:10 2021][295.1] nessusd 6.11.1 (build M20101) started 
A: [Mon Sep 13 12:21:49 2021][308.1] nessusd 6.11.1 (build M20101) started 
A: [Fri Sep 17 14:30:06 2021][288.1] nessusd 6.11.1 (build M20101) started 
A: [Tue Sep 21 13:16:04 2021][277.1] nessusd 6.11.1 (build M20101) started 
A: [Mon Sep 27 14:51:14 2021][277.1] nessusd 6.11.1 (build M20101) started 
A: [Thu Sep 30 10:52:46 2021][257.1] nessusd 6.11.1 (build M20101) started 
A: [Sat Oct  2 11:28:59 2021][274.1] nessusd 6.11.1 (build M20101) started 
T: 1187

I’m unsure how to extract just the version strings in relevance.

To get further, I resorted to regex matching in relevance:

Q: (matches (regex("[0-9]{1,3}[.][0-9]{1,3}[.][0-9]{1,3}")) of ((lines containing "nessusd" of file "/Library/NessusAgent/run/var/nessus/logs/nessusd.messages") whose (it contains "build")))
A: 6.11.1
A: 6.11.1
A: 6.11.1
A: 6.11.1
A: 6.11.1
A: 6.11.1
A: 6.11.1
A: 6.11.1
A: 6.11.1
A: 6.11.1
A: 6.11.1
A: 6.11.1
A: 6.11.1
A: 6.11.1
A: 6.11.1
A: 6.11.1
A: 6.11.1
A: 6.11.1
A: 6.11.1
A: 6.11.1
A: 6.11.1
T: 1704

From there it’s pretty simple to convert to versions and get the maximum:

Q: maximum of ((matches (regex("[0-9]{1,2}[.][0-9]{1,2}[.][0-9]{1,2}")) of ((lines containing "nessusd" of file "/Library/NessusAgent/run/var/nessus/logs/nessusd.messages") whose (it contains "build"))) as version)
A: 6.11.1
T: 1512
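
As an aside on why the cast to version matters before taking the maximum: plain strings compare character by character, so "6.9.0" would sort above "6.11.1". Here is a rough Python analogue, not relevance; the helper `as_version` and the extra sample versions are made up for illustration (only 6.11.1 appears in my log).

```python
def as_version(text):
    """Parse a dotted version such as '6.11.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))

# Hypothetical sample values; only 6.11.1 comes from the real log.
found = ["6.9.0", "6.11.1", "6.10.3"]

string_max = max(found)                   # character-by-character comparison
version_max = max(found, key=as_version)  # numeric, field-by-field comparison

print(string_max)   # 6.9.0
print(version_max)  # 6.11.1
```

The relevance `as version` cast plays the role of `as_version` here, which is why the maximum is taken after the cast rather than on the raw strings.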

How might you have solved this?

Thanks,
Andrew

atlauren · October 4, 2021, 1:14am

Also, I’m not particularly thrilled with relying only on this regex to isolate the right part of the line:
matches (regex("[0-9]{1,3}[.][0-9]{1,3}[.][0-9]{1,3}"))

If I were using awk, that would be something like:
% awk '/nessusd [0-9.]+ \(build/{print $7}' /Library/NessusAgent/run/var/nessus/logs/nessusd.messages

But using only relevance, how do I break the line into space-delimited fields and return just the seventh one? The first part is easy enough:
substrings separated by " " of ((lines containing "nessusd" of file "/Library/NessusAgent/run/var/nessus/logs/nessusd.messages") whose (it contains "build"))
…but then what? Jam them into elements of sets?

JasonWalker · October 4, 2021, 1:41am

Not at debugger now, but I might try

maximum of (it as version) of preceding texts of firsts " (build " of following texts of firsts "nessusd " of lines of files "/Library/NessusAgent/run/var/nessus/logs/nessusd.messages"

I think the use of plurals for ‘preceding texts of’ and ‘following texts of’ should allow skipping any lines that don’t contain those strings, so we don’t need to filter with ‘lines containing’. Only the lines containing both strings should match.

There may be some efficiency in whether to start with the ‘preceding’ or ‘following’ texts, depending on how many lines contain one but not the other.
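
For anyone who thinks more easily in a scripting language, here is a rough Python analogue of that expression. It is only a sketch of the logic, not BigFix code; the helper name `version_between` is made up, and the sample lines just follow the format of the log excerpt above. The point is that a line missing either marker yields nothing, which is what I expect the plural forms to do.

```python
def version_between(line, start="nessusd ", end=" (build "):
    """Return the text after the first `start` and before the next `end`.

    Returns None when either marker is missing, mirroring how the plural
    'following texts of firsts' / 'preceding texts of firsts' simply
    produce no result for such a line.
    """
    _, found_start, rest = line.partition(start)
    if not found_start:
        return None
    candidate, found_end, _ = rest.partition(end)
    return candidate if found_end else None


# Sample lines in the same shape as the log; the second one has no " (build ".
sample_lines = [
    "[Wed May 26 13:10:49 2021][2466.1] nessusd 6.11.1 (build M20101) started ",
    "[Wed May 26 13:10:49 2021][2466.1] nessusd 6.11.1 This line does not match",
    "this line mentions neither marker",
]

results = [version_between(line) for line in sample_lines]
matches = [r for r in results if r is not None]
print(matches)
```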

JasonWalker · October 4, 2021, 1:45am

I can think of two other methods: one using ‘parenthesized parts’ of a larger regex, and another coercing the space-delimited strings into a tuple string, so we can retrieve ‘tuple string item 6 of it’.

…now I gotta go get the debugger up

atlauren · October 4, 2021, 1:57am

[head smack] Of course! I don’t know why I didn’t think of preceding texts using a longer string. Curiously, my regex blend seems to be appreciably faster:

Q: maximum of (it as version) of preceding texts of firsts " (build " of following texts of firsts "nessusd " of lines of files "/Library/NessusAgent/run/var/nessus/logs/nessusd.messages"
A: 6.11.1
T: 2155

Q: maximum of ((matches (regex("[0-9]{1,2}[.][0-9]{1,2}[.][0-9]{1,2}")) of ((lines containing "nessusd" of file "/Library/NessusAgent/run/var/nessus/logs/nessusd.messages") whose (it contains "build"))) as version)
A: 6.11.1
T: 1511

JasonWalker · October 4, 2021, 1:59am

Ok, the debugger exposed some additional complexities. I built your test file as “d:\temp\nessus.txt” and added in a few non-matching lines.

q: lines of file "d:\temp\nessus.txt"
A: [Wed May 26 13:10:49 2021][2466.1] nessusd 6.11.1 (build M20101) started 
A: [Wed May 26 13:10:49 2021][2466.1] nessusd 6.11.1 This line does not match
A: [Wed May 26 21:33:09 2021][258.1] nessusd 6.11.1 (build M20101) started 
A: [Wed Jun  2 14:59:29 2021][298.1] nessusd This line does not match either
A: [Tue Jun  8 11:21:22 2021][274.1] nessusd 6.11.1 (build M20101) started 
A: [Thu Jun 10 19:01:47 2021][272.1] nessusd 6.11.1 (build M20101) started 
A: [Wed Jun 16 21:08:08 2021][278.1] nessusd 6.11.1 (build M20101) started 
T: 4.435 ms

The good news is, my first case is borne out correctly:

q: maximum of (it as version) of preceding texts of firsts " (build " of following texts of firsts "nessusd " of lines of files "d:\temp\nessus.txt"
A: 6.11.1
T: 3.981 ms

It is also possible to set up a regex to match three parenthesized parts: “start of the line through nessusd ”, “version”, and “from (build through the end of the line”, and then take the second parenthesized part:

q: matches(regex("(^.*nessusd[[:space:]])(.+)([(]build .*$)")) of lines of files "d:\temp\nessus.txt"
A: [Wed May 26 13:10:49 2021][2466.1] nessusd 6.11.1 (build M20101) started 
A: [Wed May 26 21:33:09 2021][258.1] nessusd 6.11.1 (build M20101) started 
A: [Tue Jun  8 11:21:22 2021][274.1] nessusd 6.11.1 (build M20101) started 
A: [Thu Jun 10 19:01:47 2021][272.1] nessusd 6.11.1 (build M20101) started 
A: [Wed Jun 16 21:08:08 2021][278.1] nessusd 6.11.1 (build M20101) started 
T: 3.374 ms

q: parenthesized parts 2 of matches(regex("(^.*nessusd[[:space:]])(.+)([(]build .*$)")) of lines of files "d:\temp\nessus.txt"
A: 6.11.1 
A: 6.11.1 
A: 6.11.1 
A: 6.11.1 
A: 6.11.1 
T: 2.736 ms
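
The same three-group idea can be sketched in Python, purely as an illustration. Python's `re` module does not support the POSIX `[[:space:]]` class, so `\s` stands in for it here, and the helper name `version_part` is made up. Note that the second group keeps the space before "(build", which matches the trailing space in the answers above.

```python
import re

# \s replaces [[:space:]] because Python's re has no POSIX character classes.
PATTERN = re.compile(r"(^.*nessusd\s)(.+)([(]build .*$)")


def version_part(line):
    """Return the second parenthesized part, or None if the line does not match."""
    match = PATTERN.search(line)
    return match.group(2) if match else None


good = "[Wed May 26 13:10:49 2021][2466.1] nessusd 6.11.1 (build M20101) started "
bad = "[Wed May 26 13:10:49 2021][2466.1] nessusd 6.11.1 This line does not match"

print(repr(version_part(good)))  # the captured text still ends with a space
print(version_part(bad))
```

If the trailing space is a problem, it can be trimmed afterwards or moved into the third group of the pattern.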

Building the tuple string is a bit more complex:

q: (tuple string of substrings separated by " " of it) of lines of file "d:\temp\nessus.txt"
A: [Wed, May, 26, 13:10:49, 2021][2466.1], nessusd, 6.11.1, (build, M20101), started, 
A: [Wed, May, 26, 13:10:49, 2021][2466.1], nessusd, 6.11.1, This, line, does, not, match
A: [Wed, May, 26, 21:33:09, 2021][258.1], nessusd, 6.11.1, (build, M20101), started, 
A: [Wed, Jun, , 2, 14:59:29, 2021][298.1], nessusd, This, line, does, not, match, either
A: [Tue, Jun, , 8, 11:21:22, 2021][274.1], nessusd, 6.11.1, (build, M20101), started, 
A: [Thu, Jun, 10, 19:01:47, 2021][272.1], nessusd, 6.11.1, (build, M20101), started, 
A: [Wed, Jun, 16, 21:08:08, 2021][278.1], nessusd, 6.11.1, (build, M20101), started,

That’s mostly right, but the fact that “Jun 10” has one space between the month and the day, while “Jun 2” has two spaces, kind of throws it all off.

We can mostly get around that by throwing out the “tuple string” parts that evaluate to an empty string. This works around the “extra space”, but some of the otherwise non-matching lines may still be included in the result:

q: tuple string items 6 of (tuple string of substrings separated by " " whose (it as trimmed string != "") of it) of lines of file "d:\temp\nessus.txt"
A: 6.11.1
A: 6.11.1
A: 6.11.1
A: This
A: 6.11.1
A: 6.11.1
A: 6.11.1
T: 2.220 ms

So we have to go back to filtering for only the lines we want to inspect:

q: tuple string items 6 of (tuple string of substrings separated by " " whose (it as trimmed string != "") of it) of lines whose (it contains "nessusd" and it contains "(build ") of file "d:\temp\nessus.txt"
A: 6.11.1
A: 6.11.1
A: 6.11.1
A: 6.11.1
A: 6.11.1
T: 1.024 ms
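
To make the double-space problem concrete, here is a small Python sketch of the same token logic. It is not relevance, and the variable names are made up. Index 6 corresponds to "tuple string items 6" because both count from zero, and the line is one of the single-digit-day entries from the original log.

```python
# A line with a single-digit day, so there are two spaces after "Jun".
line = "[Wed Jun  2 14:59:29 2021][298.1] nessusd 6.11.1 (build M20101) started "

# Splitting on a single space keeps an empty token where the double space was.
raw = line.split(" ")

# Dropping empty tokens mirrors: whose (it as trimmed string != "")
tokens = [t for t in raw if t.strip() != ""]

print(raw[6])     # nessusd (every field after the double space shifts by one)
print(tokens[6])  # 6.11.1
```

The filter on "nessusd" and "(build " is still needed, since item 6 of an unrelated line is just whatever happens to sit in that position.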

JasonWalker · October 4, 2021, 2:01am

How’s the speed if you filter for " (build " first?

Q: maximum of (it as version) of following texts of firsts "nessusd " of preceding texts of firsts " (build " of lines of files "/Library/NessusAgent/run/var/nessus/logs/nessusd.messages"

(Supposing “nessusd” appears more frequently in the file than " (build" does, it may be worth limiting to the lines containing ‘build’ first.)

atlauren · October 4, 2021, 2:01am

Combining your preceding with my lines filter:

Q: maximum of (it as version) of preceding texts of firsts " (build " of following texts of firsts "nessusd " of (lines containing "nessusd" of file "/Library/NessusAgent/run/var/nessus/logs/nessusd.messages") whose (it contains "build")
A: 6.11.1
T: 1297

atlauren · October 4, 2021, 2:06am

Q: maximum of (tuple string items 6 of it as version) of (tuple string of substrings separated by " " whose (it as trimmed string != "") of it) of ((lines of file "/Library/NessusAgent/run/var/nessus/logs/nessusd.messages") whose (it contains "build" and it contains "nessusd"))
A: 6.11.1
T: 1664

atlauren · October 4, 2021, 2:11am

Q: maximum of (it as version) of following texts of firsts "nessusd " of preceding texts of firsts " (build " of lines of files "/Library/NessusAgent/run/var/nessus/logs/nessusd.messages"
A: 6.11.1
T: 1841

JasonWalker · October 4, 2021, 2:11am

One more approach that may be worthwhile: I don’t really expect much better performance, but it’s worth checking. Instead of finding “the highest version in the file”, you could read it from “the last line in the file that contains a version”:

q: (it as version) of preceding texts of firsts " (build " of following texts of firsts "nessusd " of lines (maximum of line numbers of lines whose (it contains "(build" and it contains "nessusd") of it) of files "d:\temp\nessus.txt"
A: 6.11.1
T: 5.197 ms

On my very small, limited file, this is about 0.6 ms faster than

q: maximum of (it as version) of preceding texts of firsts " (build " of following texts of firsts "nessusd " of lines of files "d:\temp\nessus.txt"

A: 6.11.1
T: 5.797 ms
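
One caveat, sketched in Python rather than relevance: the two approaches answer slightly different questions. "Last matching line" gives the most recently started version, while "maximum" gives the highest version ever logged, and they would differ if the agent were ever downgraded. The helper names and the downgrade line below are hypothetical and only there to show the difference.

```python
def as_version(text):
    """Parse a dotted version such as '6.11.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def extract(line):
    """Text between the first 'nessusd ' and the next ' (build ', else None."""
    after = line.partition("nessusd ")[2]
    version, sep, _ = after.partition(" (build ")
    return version if sep else None


lines = [
    "[Mon Sep 27 14:51:14 2021][277.1] nessusd 6.11.1 (build M20101) started ",
    "[Thu Sep 30 10:52:46 2021][257.1] nessusd This line does not match",
    # Hypothetical downgrade entry, invented for this example only.
    "[Sat Oct  2 11:28:59 2021][274.1] nessusd 6.10.0 (build M20099) started ",
]

versions = [v for v in map(extract, lines) if v is not None]

latest = versions[-1]                     # last line that contains a version
highest = max(versions, key=as_version)   # highest version anywhere in the file

print(latest, highest)
```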

AlanM · October 4, 2021, 4:01pm

Remember that your times will be much longer when running in the agent because of the sleeping, so even that small difference might be significant.

atlauren · October 4, 2021, 10:26pm

FWIW, my testing found that no other variant was faster than this one:

maximum of (parenthesized parts 2 of matches(regex("(^.*nessusd[[:space:]])(.+)([[:space:]][(]build .*$)")) of lines of file "/Library/NessusAgent/run/var/nessus/logs/nessusd.messages" as version)

jgstew · October 21, 2021, 2:58pm

I don’t think I’ve ever seen this before, neat:

I have a love/hate/love relationship with RegEx.

I’m curious about the speed of this (the version inspector has built-in string parsing for version numbers):

maxima of (it as version) of following texts of firsts " nessusd " of lines containing "build" of files "/Library/NessusAgent/run/var/nessus/logs/nessusd.messages"

I ran both against a test file that I added a ton of lines to, and I can’t tell whether the regex or this one is faster.