Retrieving and Grouping multiple strings from 1 line of a text file

I have a number of analyses that are able to read lines form text files fine. I then had a challenge where i need to return the line of a text file that was a number of lines after a specific string but searching these forums gave me the answer to that but now i appear to be stuck and am hoping someone can advise?

The below correctly returns all the relevant lines from the set of log files in my folder.

concatenation ", " of (it as trimmed string) of (unique values of lines (elements of set of (it + 11) of line numbers of lines whose (it contains “* 1 ( APPROVED )”) of it) of files whose (name of it as string as lowercase contains “mylog.log”) of folder “c:\myfolder”)

Result:
22-Nov-2021 15:13:33,471 [ReqProc] ERROR [T:123456] C:/myfolder/logs - Output Responses (output.txt) :7=12345678 29=X12345678910111 6=abcd abcde 5=123456XXXXXX1234 4=12345 14=1 1=1 22=1 Street , Streeville , XX1 1XX 12=12345678 23=1234 - xxxxxx 13=12345678 8=12345678 3=1 21=123 9=123456 2=1 34=XXXXXX XXXXX XX XXXXXXX 33=XXXXXX XXXX XXXX XXXXXXX XXX XXXX XXXXXXX 36=12345 37=1234 38=1234 41=1 30=1 31=X1 12 32=12 34 56 78 90 59=0000000000000000000000000000000000000000 76=123456789 60=123 80=123456789101 98=12345678912345678912 99=1

Naturally if there is more than one result it concatenates as expected. Where there are lots of results this results in a massive string of text that then can’t be returned in the console.

I only actually need a few strings from each line

preceding texts of firsts " 41=" of following texts of firsts "38=“
preceding texts of firsts " 4=” of following texts of firsts "5=“
preceding texts of firsts " 99=” of following texts of firsts “98=”

but no idea how to get these 3 items from each line and then concatenate all the lines together.

Thanks in advance

A really interesting query. There are several ways to do this, and my first inclination was to use regular expressions but I think we can build on what you already have to get there a bit more clearly.

First, a sample of the text file I’m using for testing…

q: lines of files whose (name of it as string as lowercase contains "test.txt") of folder "c:\temp"
A: line 1 * 1 ( APPROVED )
A: line 
A: line 
A: some line
A: some line
A: some line
A: some line
A: some line
A: some line
A: some line
A: some line
A: 2-Nov-2021 15:13:33,471 [ReqProc] ERROR [T:123456] C:/myfolder/logs - Output Responses (output.txt) :7=12345678 29=X12345678910111 6=abcd abcde 5=123456XXXXXX5678 4=12345 14=1 1=1 22=1 Street , Streeville , XX1 1XX 12=12345678 23=1234 - xxxxxx 13=12345678 8=12345678 3=1 21=123 9=123456 2=1 34=XXXXXX XXXXX XX XXXXXXX 33=XXXXXX XXXX XXXX XXXXXXX XXX XXXX XXXXXXX 36=12345 37=1234 38=9876 41=1 30=1 31=X1 12 32=12 34 56 78 90 59=0000000000000000000000000000000000000000 76=123456789 60=123 80=123456789101 98=98765432 99=1
A: some other line
A: line 1 * 1 ( APPROVED )
A: line 
A: line 
A: some line
A: some line
A: some line
A: some line
A: some line
A: some line
A: some line
A: some line
A: 3-Nov-2021 15:13:33,471 [ReqProc] ERROR [T:123456] C:/myfolder/logs - Output Responses (output.txt) :7=12345678 29=X12345678910111 6=abcd abcde 5=123456XXXXXX1234 4=12345 14=1 1=1 22=1 Street , Streeville , XX1 1XX 12=12345678 23=1234 - xxxxxx 13=12345678 8=12345678 3=1 21=123 9=123456 2=1 34=XXXXXX XXXXX XX XXXXXXX 33=XXXXXX XXXX XXXX XXXXXXX XXX XXXX XXXXXXX 36=12345 37=1234 38=1234 41=1 30=1 31=X1 12 32=12 34 56 78 90 59=0000000000000000000000000000000000000000 76=123456789 60=123 80=123456789101 98=12345678912345678912 99=1
A: some other line
T: 23.538 ms
I: plural file line

In your query, you’re pulling ‘unique values of lines’; I don’t think it’s worth pulling the unique values, since the lines appear to have timestamps on them I expect every line to be unique anyway.

I’d start by filtering down to the two lines that we want - that appear 11 lines after your “Approval” string

q: lines ((it + 11) of line numbers of lines whose (it contains "* 1 ( APPROVED )") of it) of files whose (name of it as string as lowercase contains "test.txt") of folder "c:\temp"
A: 2-Nov-2021 15:13:33,471 [ReqProc] ERROR [T:123456] C:/myfolder/logs - Output Responses (output.txt) :7=12345678 29=X12345678910111 6=abcd abcde 5=123456XXXXXX5678 4=12345 14=1 1=1 22=1 Street , Streeville , XX1 1XX 12=12345678 23=1234 - xxxxxx 13=12345678 8=12345678 3=1 21=123 9=123456 2=1 34=XXXXXX XXXXX XX XXXXXXX 33=XXXXXX XXXX XXXX XXXXXXX XXX XXXX XXXXXXX 36=12345 37=1234 38=9876 41=1 30=1 31=X1 12 32=12 34 56 78 90 59=0000000000000000000000000000000000000000 76=123456789 60=123 80=123456789101 98=98765432 99=1
A: 3-Nov-2021 15:13:33,471 [ReqProc] ERROR [T:123456] C:/myfolder/logs - Output Responses (output.txt) :7=12345678 29=X12345678910111 6=abcd abcde 5=123456XXXXXX1234 4=12345 14=1 1=1 22=1 Street , Streeville , XX1 1XX 12=12345678 23=1234 - xxxxxx 13=12345678 8=12345678 3=1 21=123 9=123456 2=1 34=XXXXXX XXXXX XX XXXXXXX 33=XXXXXX XXXX XXXX XXXXXXX XXX XXXX XXXXXXX 36=12345 37=1234 38=1234 41=1 30=1 31=X1 12 32=12 34 56 78 90 59=0000000000000000000000000000000000000000 76=123456789 60=123 80=123456789101 98=12345678912345678912 99=1
T: 20.534 ms
I: plural file line

Next, you’re pulling the strings between one “XX=” and the next “XX=”. I’d start by making sure I can correctly find the start of the strings. Here are a couple of queries that demonstrate finding the starting marks, and finding their numerical positions within the string…I ended up not using the positions, but maybe useful to think about it conceptually. The grouping here demonstrates pulling multiple “firsts X of it” for each line of input…I get 3 matches on each of 2 lines, for six results total.

q: ((firsts "38=" of it; firsts "5=" of it; firsts "98=" of it)) of lines ((it + 11) of line numbers of lines whose (it contains "* 1 ( APPROVED )") of it) of files whose (name of it as string as lowercase contains "test.txt") of folder "c:\temp"
A: 38=
A: 5=
A: 98=
A: 38=
A: 5=
A: 98=
T: 17.311 ms
I: plural substring

q: (starts of (firsts "38=" of it; firsts "5=" of it; firsts "98=" of it)) of lines ((it + 11) of line numbers of lines whose (it contains "* 1 ( APPROVED )") of it) of files whose (name of it as string as lowercase contains "test.txt") of folder "c:\temp"
A: 380
A: 144
A: 505
A: 380
A: 144
A: 505
T: 14.004 ms
I: plural string position

Next, we can pull “everything after” those starting points…again, six strings total, and each of these matches goes “from the string we want, to the end of the line”

q: (following texts of (firsts "38=" of it; firsts "5=" of it; firsts "98=" of it)) of lines ((it + 11) of line numbers of lines whose (it contains "* 1 ( APPROVED )") of it) of files whose (name of it as string as lowercase contains "test.txt") of folder "c:\temp"
A: 9876 41=1 30=1 31=X1 12 32=12 34 56 78 90 59=0000000000000000000000000000000000000000 76=123456789 60=123 80=123456789101 98=98765432 99=1
A: 123456XXXXXX5678 4=12345 14=1 1=1 22=1 Street , Streeville , XX1 1XX 12=12345678 23=1234 - xxxxxx 13=12345678 8=12345678 3=1 21=123 9=123456 2=1 34=XXXXXX XXXXX XX XXXXXXX 33=XXXXXX XXXX XXXX XXXXXXX XXX XXXX XXXXXXX 36=12345 37=1234 38=9876 41=1 30=1 31=X1 12 32=12 34 56 78 90 59=0000000000000000000000000000000000000000 76=123456789 60=123 80=123456789101 98=98765432 99=1
A: 98765432 99=1
A: 1234 41=1 30=1 31=X1 12 32=12 34 56 78 90 59=0000000000000000000000000000000000000000 76=123456789 60=123 80=123456789101 98=12345678912345678912 99=1
A: 123456XXXXXX1234 4=12345 14=1 1=1 22=1 Street , Streeville , XX1 1XX 12=12345678 23=1234 - xxxxxx 13=12345678 8=12345678 3=1 21=123 9=123456 2=1 34=XXXXXX XXXXX XX XXXXXXX 33=XXXXXX XXXX XXXX XXXXXXX XXX XXXX XXXXXXX 36=12345 37=1234 38=1234 41=1 30=1 31=X1 12 32=12 34 56 78 90 59=0000000000000000000000000000000000000000 76=123456789 60=123 80=123456789101 98=12345678912345678912 99=1
A: 12345678912345678912 99=1
T: 10.565 ms
I: plural substring

I notice from the input field order than in each of your cases above, you are looking for “the next field number”, but I think it’s easier to approach it by looking for “the next space after the field I’m looking at”. That way we don’t need to know the number of the next field, we just know that this field ends with a space.

q: (following texts of (firsts "38=" of it; firsts "5=" of it; firsts "98=" of it)) of lines ((it + 11) of line numbers of lines whose (it contains "* 1 ( APPROVED )") of it) of files whose (name of it as string as lowercase contains "test.txt") of folder "c:\temp"
A: 9876 41=1 30=1 31=X1 12 32=12 34 56 78 90 59=0000000000000000000000000000000000000000 76=123456789 60=123 80=123456789101 98=98765432 99=1
A: 123456XXXXXX5678 4=12345 14=1 1=1 22=1 Street , Streeville , XX1 1XX 12=12345678 23=1234 - xxxxxx 13=12345678 8=12345678 3=1 21=123 9=123456 2=1 34=XXXXXX XXXXX XX XXXXXXX 33=XXXXXX XXXX XXXX XXXXXXX XXX XXXX XXXXXXX 36=12345 37=1234 38=9876 41=1 30=1 31=X1 12 32=12 34 56 78 90 59=0000000000000000000000000000000000000000 76=123456789 60=123 80=123456789101 98=98765432 99=1
A: 98765432 99=1
A: 1234 41=1 30=1 31=X1 12 32=12 34 56 78 90 59=0000000000000000000000000000000000000000 76=123456789 60=123 80=123456789101 98=12345678912345678912 99=1
A: 123456XXXXXX1234 4=12345 14=1 1=1 22=1 Street , Streeville , XX1 1XX 12=12345678 23=1234 - xxxxxx 13=12345678 8=12345678 3=1 21=123 9=123456 2=1 34=XXXXXX XXXXX XX XXXXXXX 33=XXXXXX XXXX XXXX XXXXXXX XXX XXXX XXXXXXX 36=12345 37=1234 38=1234 41=1 30=1 31=X1 12 32=12 34 56 78 90 59=0000000000000000000000000000000000000000 76=123456789 60=123 80=123456789101 98=12345678912345678912 99=1
A: 12345678912345678912 99=1
T: 10.565 ms
I: plural substring

q: (preceding texts of firsts " " of following texts of (firsts "38=" of it; firsts "5=" of it; firsts "98=" of it)) of lines ((it + 11) of line numbers of lines whose (it contains "* 1 ( APPROVED )") of it) of files whose (name of it as string as lowercase contains "test.txt") of folder "c:\temp"
A: 9876
A: 123456XXXXXX5678
A: 98765432
A: 1234
A: 123456XXXXXX1234
A: 12345678912345678912
T: 7.195 ms
I: plural substring

And finally, we want to group all three results from each line together. We need to wrap our “line inspection” in another set of parentheses so the concatenation applies to “all the matches within this line” instead of “to all the matches found across all lines”.

q: (concatenation "; " of (preceding texts of firsts " " of following texts of (firsts "38=" of it; firsts "5=" of it; firsts "98=" of it))) of lines ((it + 11) of line numbers of lines whose (it contains "* 1 ( APPROVED )") of it) of files whose (name of it as string as lowercase contains "test.txt") of folder "c:\temp"
A: 9876; 123456XXXXXX5678; 98765432
A: 1234; 123456XXXXXX1234; 12345678912345678912
T: 3.523 ms
I: plural string

What if we wanted to keep the field labels with the results? Well, going way back to that “starts of ‘X’” query above, illustrating that the “firsts X” and “lasts X” actually returns string positions, we can modify this very slightly…

following texts of preceding texts of (firsts "38=" of it; firsts "5=" of it; firsts "98=" of it)
The preceding text of (firsts "X") gives the substring from the beginning of the string up to “38=”; the following text of that is the string that begins with “38=”, so we can check from there up to the next space to get the field value including the “38=” string.

q: (concatenation "; " of (preceding texts of firsts " " of following texts of preceding texts of (firsts "38=" of it; firsts "5=" of it; firsts "98=" of it))) of lines ((it + 11) of line numbers of lines whose (it contains "* 1 ( APPROVED )") of it) of files whose (name of it as string as lowercase contains "test.txt") of folder "c:\temp"
A: 38=9876; 5=123456XXXXXX5678; 98=98765432
A: 38=1234; 5=123456XXXXXX1234; 98=12345678912345678912
T: 3.556 ms
I: plural string
4 Likes

Wow! Thank you so much for such a clear explanation and the end result works perfectly.

Would it be feasible to add the date/time stamp from the line as well?

@JasonWalker

I’ve been able to add the date by adding firsts 11 of it to this section

(firsts 11 of it; firsts “38=” of it; firsts “5=” of it; firsts “98=” of it)

as it’s working up to the first space in the line but i can’t seem to figure out how to get the time stamp, would really appreciate any advice so i can end up with

A: 02-Nov-2021 15:13:33; 38=9876; 5=123456XXXXXX5678; 98=98765432

thanks again

Maybe

Preceding text of first " ["

Hi Jason,

Thanks for replying, unfortunately that still just gives me the date without the time, i think the space between the date and time is causing the problem but I can’t work out how to get around it

haha, i found it

following texts of position 12 of it;

concatenation ", " of (concatenation "; " of (preceding texts of firsts " " of following texts of preceding texts of (firsts 11 of it; following texts of position 12 of it; firsts “38=” of it; firsts “5=” of it; firsts “98=” of it))) of lines ((it + 11) of line numbers of lines whose (it contains “* 1 ( APPROVED )”) of it) of files whose (name of it as string as lowercase contains “mylog.log”) of folder “c:\myfolder”

Amazing, thank you so much for your help @JasonWalker I could definitely not have figured that out without you! :grinning:

1 Like