Relevance Optimization (2)

ANaik · May 30, 2021, 3:15pm

Hi,

This is a continuation of the thread below:

https://forum.bigfix.com/t/relevance-optimization/37987

The relevance worked fine until there were multiple entries matching Test1*, that is, Test1, Test11, Test123, etc. All of these entries were picked up.

I have modified the relevance slightly to fetch only the exact keyword.

Below is the relevance:

q: (elements of item 0 of it, elements of item 1 of it) whose (item 0 of it contains (following text of first "RESULT:" of item 1 of it as trimmed string) & " ") of (set of lines of file "test.txt" of it, set of lines of file "test.log" of it) of folder "/tmp"

I added " " to differentiate between the multiple entries. In this case, it picks only “Test1”.
However, the evaluation time becomes too long when there are more than 70 entries of “Test1*”, and “Inspector Interrupted” messages are displayed.

Evaluation Time : "T: 12516639"

Below is the content of /tmp/test.log:

Testing
RESULT: Test1
Data1
Data2

Content of /tmp/test.txt

Test1 abc
Test2 123
Test3 !@#
Test4 ,./
Test12 abcd
Test13 abce
Test16 abcf
Test17 abcg

Can someone help me in optimizing this? It would be a great help.
Thanks in advance!!

JasonWalker · May 30, 2021, 3:48pm

Comparing every line of one file against every line of another is intrinsically an expensive operation. This may be better left to something that runs periodically in an Action, outputting only the matching results into a file you could retrieve in an Analysis later.
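As a rough illustration of that Action-based approach, here is a Python sketch of a script an Action could run on a schedule. Everything in it is an assumption for illustration: the function name write_matches, the output path, and the use of Python are hypothetical, not BigFix features. It also swaps the pairwise comparison for a set lookup, which is a different technique from the relevance discussed in this thread.

```python
from pathlib import Path


def write_matches(txt_path, log_path, out_path):
    """Write the lines of txt_path whose first word is named on a RESULT line of log_path."""
    prefix = "RESULT: "
    # Collect the keywords from the "RESULT: " lines once.
    wanted = {line[len(prefix):].strip()
              for line in Path(log_path).read_text().splitlines()
              if line.startswith(prefix)}
    wanted.discard("")  # ignore a bare "RESULT: " line with no keyword
    # Keep only the text-file lines whose first word is one of those keywords.
    matches = [line for line in Path(txt_path).read_text().splitlines()
               if line.split(" ", 1)[0] in wanted]
    Path(out_path).write_text("\n".join(matches) + ("\n" if matches else ""))
    return matches
```

An Analysis could then read only the small output file, for example with a plain "lines of file" property, instead of comparing the two source files on every evaluation.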

That said, I only see a couple of further optimizations that can be made. In my admittedly small test set, the updated query evaluates about 25% faster (1.741 ms versus 2.290 ms).

The math on this is that there will be (size of set 0 X size of set 1) comparisons to make. The options available to us are to reduce the sizes of the sets, or to speed up each individual comparison.

I’ve increased the sample a bit to two files of ten lines each:

q: lines of file "d:\temp\test.log"
A: Testing
A: RESULT: Test1
A: Data1
A: Data2
A: RESULT: Test2
A: Data3
A: Data4
A: RESULT: Test12
A: Data5
A: Data6
T: 0.449 ms

q: lines of file "d:\temp\test.txt"
A: Test1 abc
A: Test2 123
A: Test3 !@#
A: Test4 ,./
A: Test5 ,./
A: Test6 ,./
A: Test12 abcd
A: Test13 abce
A: Test16 abcf
A: Test17 abcg
T: 0.458 ms

This would result in 10 X 10 comparisons, or 100 comparisons to evaluate.
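To make the count concrete, here is a small Python model of the pairing (Python rather than relevance, purely to show the arithmetic). The variable names are mine; the data is the sample above.

```python
from itertools import product

# Sample data copied from the two files shown above.
log_lines = ["Testing", "RESULT: Test1", "Data1", "Data2", "RESULT: Test2",
             "Data3", "Data4", "RESULT: Test12", "Data5", "Data6"]
txt_lines = ["Test1 abc", "Test2 123", "Test3 !@#", "Test4 ,./", "Test5 ,./",
             "Test6 ,./", "Test12 abcd", "Test13 abce", "Test16 abcf", "Test17 abcg"]

# Every line of one file is paired with every line of the other.
pairs = list(product(txt_lines, log_lines))
print(len(pairs))  # 100
```

The count grows with the product of the two sizes, so doubling both files would quadruple the work.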

The optimization so far has sped up the comparison by reducing the number of file reads. We read the lines of each file one time, bring the results into string sets, and use those for the comparisons. So at least we don’t read each file one hundred times, but we still have to make 100 comparisons.

One thing we can do to reduce the size of one of the string sets is to filter the read on “test.log”, keeping only the lines that start with "RESULT: ":

q: set of (lines starting with "RESULT: " of file "test.log" of it) of folder "d:\temp"

Rather than keeping (and comparing) all ten lines, the string set will instead have only 3 elements to compare against. The three elements of test.log X the ten elements of test.txt now give 30 comparisons to make, rather than a hundred.
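The same reduction can be modeled in Python (again only as an illustration of the counting, with names of my own choosing):

```python
# Sample data copied from the two files shown above.
log_lines = ["Testing", "RESULT: Test1", "Data1", "Data2", "RESULT: Test2",
             "Data3", "Data4", "RESULT: Test12", "Data5", "Data6"]
txt_lines = ["Test1 abc", "Test2 123", "Test3 !@#", "Test4 ,./", "Test5 ,./",
             "Test6 ,./", "Test12 abcd", "Test13 abce", "Test16 abcf", "Test17 abcg"]

# Keep only the log lines that can ever match, before pairing anything.
result_lines = [line for line in log_lines if line.startswith("RESULT: ")]
comparisons = len(result_lines) * len(txt_lines)

print(result_lines)  # ['RESULT: Test1', 'RESULT: Test2', 'RESULT: Test12']
print(comparisons)   # 30
```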

One more thing we can do to increase the comparison speed is to avoid the “contains”, “begins with”, or “ends with” comparisons. An exact match is a faster comparison than “begins with” or “ends with”, and it looks like exact match comparisons are appropriate here. It appears to me the comparison could be

( preceding text of first " " of (text file)  =  following text of first "RESULT: " of (log file) )
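A Python model of that comparison shows why the exact match also removes the need for the trailing " " workaround. The helper names txt_key and log_key are hypothetical, and str.partition only approximates the relevance string inspectors (it returns the whole line or an empty string where relevance would fail to find the delimiter).

```python
def txt_key(line):
    # Roughly: preceding text of first " " of a test.txt line.
    return line.partition(" ")[0]


def log_key(line):
    # Roughly: following text of first "RESULT: " of a test.log line.
    return line.partition("RESULT: ")[2]


# A bare substring test wrongly matches Test1 against the Test12 line.
print("Test1" in "Test12 abcd")                            # True
# Appending " " fixes that, but it is still a substring search.
print(log_key("RESULT: Test1") + " " in "Test12 abcd")     # False
# Comparing the extracted keywords for equality gives the same answers.
print(txt_key("Test12 abcd") == log_key("RESULT: Test1"))  # False
print(txt_key("Test1 abc") == log_key("RESULT: Test1"))    # True
```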

Putting that all together I end up with

// Original query
q: (elements of item 0 of it, elements of item 1 of it) whose (item 0 of it contains (following text of first "RESULT:" of item 1 of it as trimmed string) & " ") of (set of lines of file "test.txt" of it, set of lines of file "test.log" of it) of folder "d:\temp"
A: Test1 abc, RESULT: Test1
A: Test12 abcd, RESULT: Test12
A: Test2 123, RESULT: Test2
T: 2.290 ms

// Updated
q: (elements of item 0 of it, elements of item 1 of it) whose (preceding text of first " " of item 0 of it = following text of first " " of item 1 of it) of (set of lines of file "test.txt" of it, set of lines starting with "RESULT: " of file "test.log" of it) of folder "d:\temp"
A: Test1 abc, RESULT: Test1
A: Test12 abcd, RESULT: Test12
A: Test2 123, RESULT: Test2
T: 1.741 ms
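
For anyone who wants to check the logic of the updated query outside the debugger, the following Python sketch models it end to end on the same sample data. It is only a model of the logic under my own naming, not a measure of how the BigFix client evaluates it; the sets are sorted simply to get a stable order.

```python
# Sample data copied from the two files shown above.
log_lines = ["Testing", "RESULT: Test1", "Data1", "Data2", "RESULT: Test2",
             "Data3", "Data4", "RESULT: Test12", "Data5", "Data6"]
txt_lines = ["Test1 abc", "Test2 123", "Test3 !@#", "Test4 ,./", "Test5 ,./",
             "Test6 ,./", "Test12 abcd", "Test13 abce", "Test16 abcf", "Test17 abcg"]

# Build each set once; filter the log to its "RESULT: " lines up front.
txt_set = sorted(set(txt_lines))
result_set = sorted({line for line in log_lines if line.startswith("RESULT: ")})

# Pair the sets and keep the pairs whose keywords are exactly equal:
# text before the first " " of the txt line, text after the first " " of the log line.
matches = [(t, r) for t in txt_set for r in result_set
           if t.partition(" ")[0] == r.partition(" ")[2]]

for t, r in matches:
    print(f"{t}, {r}")
# Test1 abc, RESULT: Test1
# Test12 abcd, RESULT: Test12
# Test2 123, RESULT: Test2
```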

ANaik · June 1, 2021, 9:33am

Thanks @JasonWalker… I shall try this one