Relevance Challenge December 2019 BONUS: Parsing Paragraphs (answer provided)

JasonWalker · December 11, 2019, 10:10pm

This particular challenge is for some advanced file line parsing. This has come up in the context of the recent HP SSD Firmware issue, but could be generalized to apply to a number of file parsing scenarios.

Suppose one is presented with the following line output (in this case, a result of running the hpacucli utility on a Linux machine):

	Logical Drive: 1
	         Size: 1.5 TB
	         Fault Tolerance: 1
	         Heads: 255
	         Sectors Per Track: 32
	         Cylinders: 65535
	         Strip Size: 256 KB
	         Full Stripe Size: 256 KB
	         Status: OK
	         Caching:  Enabled
	         Unique Identifier: 600508B1001C266BD863ABFCB8EE253A
	         Disk Name: /dev/sda
	         Mount Points: /boot 500 MB Partition Number 2
	         OS Status: LOCKED
	         Logical Drive Label: A97762A550123456789ABCDEC78E
	         Mirror Group 1:
	            physicaldrive 1I:1:1 (port 1I:box 1:bay 1, Solid State SATA, 1600.3 GB, OK)
	         Mirror Group 2:
	            physicaldrive 1I:1:2 (port 1I:box 1:bay 2, Solid State SATA, 1600.3 GB, OK)
	         Drive Type: Data
	         LD Acceleration Method: Controller Cache
	
	      physicaldrive 1I:1:1
	         Port: 1I
	         Box: 1
	         Bay: 1
	         Status: OK
	         Drive Type: Data Drive
	         Interface Type: Solid State SATA
	         Size: 1600.3 GB
	         Native Block Size: 4096
	         Firmware Revision: G2010141
	         Serial Number: BTWD535000H51P6HGN
	         Model: ATA     INTEL SSDSC2BB02
	         SATA NCQ Capable: True
	         SATA NCQ Enabled: True
	         Current Temperature (C): 26
	         Maximum Temperature (C): 53
	         SSD Smart Trip Wearout: Not Supported
	         PHY Count: 1
	         PHY Transfer Rate: 3.0Gbps
	
	      physicaldrive 1I:1:2
	         Port: 1I
	         Box: 1
	         Bay: 2
	         Status: OK
	         Drive Type: Data Drive
	         Interface Type: Solid State SATA
	         Size: 1600.3 GB
	         Native Block Size: 4096
	         Firmware Revision: G2010140
	         Serial Number: BTWD531503EN1P6HGN
	         Model: ATA     INTEL SSDSC2BB01
	         SATA NCQ Capable: True
	         SATA NCQ Enabled: True
	         Current Temperature (C): 25
	         Maximum Temperature (C): 52
	         SSD Smart Trip Wearout: Not Supported
	         PHY Count: 1
	         PHY Transfer Rate: 3.0Gbps

We are only interested in the data for a “physicaldrive” entry, and there may be any number of “physicaldrive” entries. So we need to find each “stanza”, where a “stanza” begins with a “physicaldrive” entry and ends with a blank line (or, potentially, the end of the file).

Within the stanza, we wish to retrieve the “Status”, “Model” and “Firmware Revision” entries. Ideally our output would look like the following:

A: physicaldrive 1I:1:1, Status: OK, Model: ATA INTEL SSDSC2BB02, Firmware Revision: G2010141
A: physicaldrive 1I:1:2, Status: OK, Model: ATA INTEL SSDSC2BB01, Firmware Revision: G2010140

I have one solution to this, but it was a challenge to me and I’m interested in whether there are any simpler solutions. I’ve manipulated the data slightly to show a different drive model and a different firmware version for the first entry (the first entry is not a real revision; in the real data the two were the same). This is to demonstrate the need to keep the correct model & firmware associated with each physicaldrive entry.

Different versions of the ‘hpacucli’ utility might include more or fewer properties of the physicaldrive, so we cannot necessarily rely on the “Model” entry being 11 lines after “physicaldrive”, nor can we rely on the “Firmware Revision” entry being 9 lines after the “physicaldrive” entry.

There is a lot to unwind in parsing this one, so do please be patient as you try to work through it. I know that there are at least two different solutions that can be generalized to a variety of file parsing needs, so I hope the solutions we come up with will be useful for all of us for a long time to come!

Results Post

Several great answers were given for this challenge. I always like to see when there are multiple solutions to a problem. It would be very interesting to compare these for performance on large vs small files, and files with more or fewer matching “stanzas”. In the original case for which I was working on this, the output of the HP RAID command-line utility has as many as 20 matching stanzas on some of my machines.

Click the “Details” items below to expand each section. I have expanded each of the Relevance clauses below - to use them in the debugger you should either evaluate in the “Single Clause” tab, or at least paste it there and hit the “Collapse Relevance” button to bring it back into a single line.

Solution 1: Make it a Big String

From @MattPeterson:

 (
   "physicaldrive " & preceding text of first "%09" of it & ", Status: " & preceding text of first "%09" of following text of first "Status: " of it & ", Model: " & preceding text of first "%09" of following text of first "Model: " of it & ", Firmware Revision: " & preceding text of first "%09" of following text of first "Firmware Revision: " of it
 )
 of 
 (
   substrings separated by "physicaldrive " 
   whose
   (
 it contains "Port: "
   )
   of concatenation of lines of file "c:\temp\test.txt"
 )

A: physicaldrive 1I:1:1, Status: OK, Model: ATA     INTEL SSDSC2BB02, Firmware Revision: G2010141
A: physicaldrive 1I:1:2, Status: OK, Model: ATA     INTEL SSDSC2BB01, Firmware Revision: G2010140
T: 1.420 ms
I: plural string

This is an interesting approach. First, all of the lines of the file are concatenated into a single long string.
Next, we split into separate strings based on the token “physicaldrive”, and keep only the substrings that contain a “Port:” entry - assuming that each PhysicalDrive stanza must contain a ‘Port’ entry, this helps reduce spurious matches. At this point we have two results in the sample data -

q: substrings separated by "physicaldrive " whose (it contains "Port: ") of concatenation of lines of file "c:\temp\test.txt" 
A: 1I:1:1%09         Port: 1I%09         Box: 1%09         Bay: 1%09         Status: OK%09         Drive Type: Data Drive%09         Interface Type: Solid State SATA%09         Size: 1600.3 GB%09         Native Block Size: 4096%09         Firmware Revision: G2010141%09         Serial Number: BTWD535000H51P6HGN%09         Model: ATA     INTEL SSDSC2BB02%09         SATA NCQ Capable: True%09         SATA NCQ Enabled: True%09         Current Temperature (C): 26%09         Maximum Temperature (C): 53%09         SSD Smart Trip Wearout: Not Supported%09         PHY Count: 1%09         PHY Transfer Rate: 3.0Gbps%09%09      
A: 1I:1:2%09         Port: 1I%09         Box: 1%09         Bay: 2%09         Status: OK%09         Drive Type: Data Drive%09         Interface Type: Solid State SATA%09         Size: 1600.3 GB%09         Native Block Size: 4096%09         Firmware Revision: G2010140%09         Serial Number: BTWD531503EN1P6HGN%09         Model: ATA     INTEL SSDSC2BB01%09         SATA NCQ Capable: True%09         SATA NCQ Enabled: True%09         Current Temperature (C): 25%09         Maximum Temperature (C): 52%09         SSD Smart Trip Wearout: Not Supported%09         PHY Count: 1%09         PHY Transfer Rate: 3.0Gbps

The next step is selecting the fields of interest. That selection on each of the substrings is performed via

     (
       "physicaldrive " & preceding text of first "%09" of it & ", Status: " & preceding text of first "%09" of following text of first "Status: " of it & ", Model: " & preceding text of first "%09" of following text of first "Model: " of it & ", Firmware Revision: " & preceding text of first "%09" of following text of first "Firmware Revision: " of it
     )

Basically this relies on the fact that each line in the sample output started with a TAB character ( %09 ). We can find each field that we want by its name, and know that its value is the next portion of the string up to the TAB character that began the next line of output.

Nice solution, Matt
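
For readers who want to experiment outside Relevance, the same idea can be roughly sketched in Python. This is only an approximation of the approach described above, not Matt's query. The shortened sample data and the helper names `field` and `parse_big_string` are made up for illustration, and it assumes (as the original does) that every line starts with a TAB.

```python
# Hypothetical, shortened sample in the same shape as the hpacucli output:
# every line starts with a TAB, and a TAB-only line separates stanzas.
SAMPLE_LINES = [
    "\tLogical Drive: 1",
    "\t         Status: OK",
    "\t            physicaldrive 1I:1:1 (port 1I:box 1:bay 1, Solid State SATA, 1600.3 GB, OK)",
    "\t",
    "\t      physicaldrive 1I:1:1",
    "\t         Port: 1I",
    "\t         Status: OK",
    "\t         Firmware Revision: G2010141",
    "\t         Model: ATA     INTEL SSDSC2BB02",
    "\t         PHY Count: 1",
    "\t",
    "\t      physicaldrive 1I:1:2",
    "\t         Port: 1I",
    "\t         Status: OK",
    "\t         Firmware Revision: G2010140",
    "\t         Model: ATA     INTEL SSDSC2BB01",
    "\t         PHY Count: 1",
]


def field(chunk, name):
    """Return the text after the first `name`, up to the next TAB (or the end)."""
    after = chunk.split(name, 1)[1]
    return after.split("\t", 1)[0]


def parse_big_string(lines):
    big = "".join(lines)                   # one long string, line breaks gone
    results = []
    for chunk in big.split("physicaldrive "):
        if "Port: " not in chunk:          # skip chunks that are not drive stanzas
            continue
        drive = chunk.split("\t", 1)[0]    # text before the first TAB is the drive id
        results.append(
            "physicaldrive " + drive
            + ", Status: " + field(chunk, "Status: ")
            + ", Model: " + field(chunk, "Model: ")
            + ", Firmware Revision: " + field(chunk, "Firmware Revision: ")
        )
    return results


for row in parse_big_string(SAMPLE_LINES):
    print(row)
```

The "Port: " test is case-sensitive here, which is what keeps the Mirror Group line (lower-case "port 1I:box 1") out of the results.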

Solution 2: Regex to the Rescue

What if you can’t depend on the TABs separating the lines? Well, separate them with your own delimiter, as in the next solution.

From @strawgate:

 (
   (
     it as trimmed string
   )
   of 
   (
     parenthesized part 1 of 
     (
       first match 
       (
         regex "\s*([^;]*);;"
       )
       of it
     )
   )
   , 
   (
     parenthesized part 1 of 
     (
       first match 
       (
         regex "(Status: \w*);;"
       )
       of it
     )
   )
   , 
   (
     (
       parenthesized part 1 of 
       (
         first match 
         (
           regex "(Model: [^;]*);;"
         )
         of it
       )
     )
   )
   , 
   (
     parenthesized part 1 of 
     (
       first match 
       (
         regex "(Firmware Revision: [^;]*);;"
       )
       of it
     )
   )
 )
 of substrings after ";;;;" of concatenations ";;" of 
 (
   lines of file "C:\temp\test.txt" as trimmed string
 )
 
physicaldrive 1I:1:1, Status: OK, Model: ATA     INTEL SSDSC2BB02, Firmware Revision: G2010141
physicaldrive 1I:1:2, Status: OK, Model: ATA     INTEL SSDSC2BB01, Firmware Revision: G2010140

This is another interesting solution. Like the first one, this begins by joining together all the lines of the file into one big string. In this case, though, we trim the lines first, and when concatenating them together we insert our own delimiter between them: a double semicolon ( ;; ).

concatenations ";;" of 
     (
       lines of file "C:\temp\test.txt" as trimmed string
     )

(note - when we join values this way we want to ensure we use a delimiter that is not likely to appear in the original data. A single semicolon in the data is very common, but a double semicolon is much less likely. I often prefer to use the CR/LF combination - I know I have stripped out those characters when I joined the lines, so I can safely put them back in and then look for substrings separated by “%0d%0a” afterward. But if you figure out how to match the “%0d%0a” characters in a Regular Expression in the BES Client, please do let me know…)

At this point, the blank lines can be found where we have four semicolons together, so we have our two drive entries in separate results.

    q: substrings after ";;;;" of concatenations ";;" of (lines of file "C:\temp\test.txt" as trimmed string) 
    A: physicaldrive 1I:1:1;;Port: 1I;;Box: 1;;Bay: 1;;Status: OK;;Drive Type: Data Drive;;Interface Type: Solid State SATA;;Size: 1600.3 GB;;Native Block Size: 4096;;Firmware Revision: G2010141;;Serial Number: BTWD535000H51P6HGN;;Model: ATA     INTEL SSDSC2BB02;;SATA NCQ Capable: True;;SATA NCQ Enabled: True;;Current Temperature (C): 26;;Maximum Temperature (C): 53;;SSD Smart Trip Wearout: Not Supported;;PHY Count: 1;;PHY Transfer Rate: 3.0Gbps
    A: physicaldrive 1I:1:2;;Port: 1I;;Box: 1;;Bay: 2;;Status: OK;;Drive Type: Data Drive;;Interface Type: Solid State SATA;;Size: 1600.3 GB;;Native Block Size: 4096;;Firmware Revision: G2010140;;Serial Number: BTWD531503EN1P6HGN;;Model: ATA     INTEL SSDSC2BB01;;SATA NCQ Capable: True;;SATA NCQ Enabled: True;;Current Temperature (C): 25;;Maximum Temperature (C): 52;;SSD Smart Trip Wearout: Not Supported;;PHY Count: 1;;PHY Transfer Rate: 3.0Gbps

Finally, within each of the two results, we can use Regular Expressions to perform text matches, such as

 parenthesized part 1 of 
         (
           first match 
           (
             regex "(Firmware Revision: [^;]*);;"
           )
           of it
         )

This finds the literal string "Firmware Revision: ", followed by any number of characters except a semicolon, followed by two semicolons. Since we include parentheses around the portion we want to keep, and the double-semicolons are outside those parentheses, our regex matches the whole string (including the double semicolons), but we discard the double semicolons by looking only at ‘parenthesized parts 1’ of the regex match.

This is a good case for Regular Expressions, and a good example of concatenating data together with a known delimiter when we know we’ll need to split it on that delimiter later in the query. This is much simpler than some of the regex approaches I took, so thanks very much @strawgate !
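
As a rough illustration outside Relevance, the delimiter-plus-regex idea can be sketched in Python. This is an approximation of the approach, not @strawgate's query. The shortened sample and the names `first_group` and `parse_with_delimiter` are hypothetical, and the stanza split simply drops everything before the first blank line to mirror the two results shown above.

```python
import re

# Hypothetical, shortened sample in the same shape as the hpacucli output.
SAMPLE_LINES = [
    "\tLogical Drive: 1",
    "\t         Status: OK",
    "\t",
    "\t      physicaldrive 1I:1:1",
    "\t         Status: OK",
    "\t         Firmware Revision: G2010141",
    "\t         Model: ATA     INTEL SSDSC2BB02",
    "\t         PHY Count: 1",
    "\t",
    "\t      physicaldrive 1I:1:2",
    "\t         Status: OK",
    "\t         Firmware Revision: G2010140",
    "\t         Model: ATA     INTEL SSDSC2BB01",
    "\t         PHY Count: 1",
]


def first_group(pattern, text):
    """Return capture group 1 of the first match, or "" when nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else ""


def parse_with_delimiter(lines):
    # Trim every line, then join with a delimiter unlikely to occur in the data.
    joined = ";;".join(line.strip() for line in lines)
    # A blank line is an empty string between two delimiters, i.e. ";;;;".
    stanzas = joined.split(";;;;")[1:]
    results = []
    for stanza in stanzas:
        parts = (
            first_group(r"\s*([^;]*);;", stanza).strip(),
            first_group(r"(Status: \w*);;", stanza),
            first_group(r"(Model: [^;]*);;", stanza),
            first_group(r"(Firmware Revision: [^;]*);;", stanza),
        )
        results.append(", ".join(parts))
    return results


for row in parse_with_delimiter(SAMPLE_LINES):
    print(row)
```

Note that each pattern requires a trailing ";;", so a field that happens to be the very last line of the file would not match. In this sample the wanted fields are always followed by another line.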

Solution 3: For a Few Lines More...

This solution from @jgstew takes a different approach. Rather than concatenating all of the lines of the file into one string, JGStew builds a ‘set’ of lines, and preserves the line numbers:

 (
   concatenations ", " of 
   (
     (
       it as trimmed string
     )
     of tuple string items 0 of it; 
     (
       it as trimmed string
     )
     of 
     (
       following texts of firsts ": " of it
     )
     of tuple string items 
     whose
     (
       it contains "Status: " 
      OR
       it contains "Firmware Revision: " 
      OR
       it contains "Model: "
     )
     of it
   )
 )
 of substrings separated by "physicaldrive" 
 whose
 (
   it contains ","
 )
 of concatenations ", " of following texts of firsts ", " of items 0 of 
 (
   elements of it, elements 
   whose
   (
     it contains "physicaldrive" 
    and
     it does not contain "("
   )
   of it
 )
 whose
 (
   tuple string item 0 of item 0 of it as integer >= tuple string item 0 of item 1 of it as integer 
  AND
   tuple string item 0 of item 0 of it as integer <= tuple string item 0 of item 1 of it as integer + 18
 )
 of sets of 
 (
   line number of it as string & ", " & it
 )
 of lines of files "C:\temp\test.txt"

1I:1:1, OK, G2010141, ATA     INTEL SSDSC2BB02
1I:1:2, OK, G2010140, ATA     INTEL SSDSC2BB01

That’s a long query, so let’s break it down.

 sets of 
     (
       line number of it as string & ", " & it
     )
     of lines of files "C:\temp\test.txt"

This takes the file’s lines and puts them into a set. Using a ‘set’ would discard duplicates, and sort the results in text sort order; but, because JG prefixes each line with its unique line number, there are no duplicates. Note the ‘set’ of lines is not in the same order as the lines in the file though, because in text “2” sorts after “10”. If we look at the ‘elements of set’ for this so far, we have

1, %09Logical Drive: 1
10, %09         Caching:  Enabled
11, %09         Unique Identifier: 600508B1001C266BD863ABFCB8EE253A
12, %09         Disk Name: /dev/sda
13, %09         Mount Points: /boot 500 MB Partition Number 2
14, %09         OS Status: LOCKED
15, %09         Logical Drive Label: A97762A550123456789ABCDEC78E
16, %09         Mirror Group 1:
17, %09            physicaldrive 1I:1:1 (port 1I:box 1:bay 1, Solid State SATA, 1600.3 GB, OK)
18, %09         Mirror Group 2:
19, %09            physicaldrive 1I:1:2 (port 1I:box 1:bay 2, Solid State SATA, 1600.3 GB, OK)
2, %09         Size: 1.5 TB
20, %09         Drive Type: Data
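
The ordering in that listing is ordinary text ordering, and it is easy to reproduce outside Relevance. The small Python sketch below uses made-up entries in the same "line number, text" shape. It only illustrates why "10" lands before "2" when numbers are compared as strings.

```python
# Made-up entries in the same "<line number>, <text>" shape as the set elements.
entries = ["1, Logical Drive", "2, Size", "10, Caching", "20, Drive Type"]

# Sorting as text compares character by character, so "10, ..." sorts before "2, ...".
text_order = sorted(entries)

# Sorting on the numeric prefix restores the original file order.
numeric_order = sorted(entries, key=lambda entry: int(entry.split(", ", 1)[0]))

print(text_order)
print(numeric_order)
```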

Next, JGStew iterates through the elements in his set -

 (
   elements of it, elements 
   whose
   (
     it contains "physicaldrive" 
    and
     it does not contain "("
   )
   of it
 )
 whose
 (
   tuple string item 0 of item 0 of it as integer >= tuple string item 0 of item 1 of it as integer 
  AND
   tuple string item 0 of item 0 of it as integer <= tuple string item 0 of item 1 of it as integer + 18
 )

Here, ‘item 0’ will be each item in the set, while ‘item 1’ will be those items in the set that have a ‘physicaldrive’ entry and no “(” (which excludes the Mirror Group lines). We keep only the results where the original line number, here represented by ‘tuple string item 0’, is greater than or equal to the line number that contained ‘physicaldrive’, and no more than 18 greater than that line number.
This works, but has a limitation in that a given stanza can have at most 18 lines after its ‘physicaldrive’ line.
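
To make the window test and its limitation concrete, here is a small Python sketch of just this filtering step. It is not @jgstew's query. The function name `lines_in_window`, the `window` parameter, and the tiny sample are hypothetical and exist only for illustration.

```python
def lines_in_window(lines, window=18):
    """Group lines under each 'physicaldrive' header using a fixed line window."""
    numbered = list(enumerate(lines, start=1))
    # Header lines: contain "physicaldrive" but no "(" (skips Mirror Group lines).
    headers = [(n, text) for n, text in numbered
               if "physicaldrive" in text and "(" not in text]
    kept = {}
    for header_no, header_text in headers:
        # Keep lines numbered header_no .. header_no + window, inclusive.
        kept[header_text.strip()] = [
            text.strip() for n, text in numbered
            if header_no <= n <= header_no + window
        ]
    return kept


# Hypothetical stanza with three property lines after the header.
SAMPLE_LINES = [
    "\t      physicaldrive 1I:1:1",
    "\t         Status: OK",
    "\t         Firmware Revision: G2010141",
    "\t         Model: ATA     INTEL SSDSC2BB02",
]

print(lines_in_window(SAMPLE_LINES))            # whole stanza fits in the window
print(lines_in_window(SAMPLE_LINES, window=2))  # "Model" falls outside the window
```

With a window smaller than the stanza, the trailing fields are silently lost, which is the risk if a newer version of the utility prints more properties per drive.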

Next, JG keeps only the ‘items 0’ - the lines that came after ‘physicaldrive’, and concatenates them together with the ", " characters; at this point he would have one long string, that he separates again by finding the “physicaldrive” text.

substrings separated by "physicaldrive" 
     whose
     (
       it contains ","
     )
     of concatenations ", " of following texts of firsts ", " of items 0 of

So now we’re down to two blobs of text - one for each physicaldrive entry.
Finally, JG grabs the portions of the text that we care about - the “Status”, “Firmware revision”, and “Model”, by using the ‘tuple string items’ property to automatically split on the comma-space (", ") combination, and keep only the ones that match a field we want to keep. This discards the extraneous entries like “Box” and “Bay”.

tuple string items 
         whose
         (
           it contains "Status: " 
          OR
           it contains "Firmware Revision: " 
          OR
           it contains "Model: "
         )
         of it

This is a good example of using the ‘tuple string items’ inspector that was so helpful in the previous challenge. Great work @JGStew !

Solution 4 - Counting Lines

This one came from yours truly, @JasonWalker. This solution is based on finding the line numbers of a section's start, finding the line numbers of a section's end, and then reading the lines between the start and end of each section.

  (
   concatenation ", " of 
   (
 items 2 
 whose
 (
   it contains "physicaldrive" 
  or
   it contains "Status" 
  or
   it contains "Firmware" 
  or
   it contains "Model"
 )
 of it as trimmed string
   )
   of 
   (
 item 0 of it, item 1 of it, lines of item 2 of it
   )
   whose
   (
 line number of item 2 of it >= item 0 of it 
and
 line number of item 2 of it < item 1 of it
   )
 )
 of 
 (
   (
 item 0 of it, minimum of items 1 of 
 (
   item 0 of it, elements of item 1 of it
 )
 whose
 (
   item 0 of it < item 1 of it
 )
 , item 2 of it
   )
   of 
   (
 elements of items 0 of it, item 1 of it, item 2 of it
   )
   of 
   (
 /* start of stanza */ set of line numbers of lines 
 whose
 (
   it as trimmed string starts with "physicaldrive" 
  and
   it does not contain "("
 )
 of it, 
 
   /* end of stanza, or end of file */  set of (line numbers of lines 
   whose
   (
     it as trimmed string = ""
   )
   of it; number of lines of it)
 
 , it
   )
   of it
 )
 of 
 (
   files "c:\temp\test.txt"
 )

I define the “beginning” of a stanza as a line that, when trimmed, begins with “physicaldrive” and does not contain “(”. I define the “end” of a stanza as a line that is either a blank line, or the last line of the file.

We can list those two sets to get the list of potential starting and potential ending line numbers:

(elements of item 0 of it, elements of item 1 of it) of ((set of line numbers of lines whose (it as trimmed string starts with "physicaldrive" and it does not contain "(") of it, set of (line numbers of lines whose (it as trimmed string = "") of it; number of lines of it) , it) of it) of (files "c:\temp\test.txt")

23, 22
23, 42
23, 61
43, 22
43, 42
43, 61

Next, I iterate through all of the “starting” line numbers (elements of item 0 of it), and for each “start” I find the matching “end” line number - which is the lowest line number that is greater than the starting line.

 (
 item 0 of it, minimum of items 1 of 
 (
   item 0 of it, elements of item 1 of it
 )
 whose
 (
   item 0 of it < item 1 of it
 )
 , item 2 of it
   )
   of 
   (
 elements of items 0 of it, item 1 of it, item 2 of it
   )

At this point the result is

23, 42, "test.txt" "" "" "" ""
43, 61, "test.txt" "" "" "" ""

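So I know we have one stanza beginning on line 23 and ending on line 42, and another starting on line 43 and ending on line 61. From there, I can retrieve the lines of the file whose line numbers are equal to or greater than the start of the stanza, and lower than the line number of the end of the stanza. (Because the upper comparison is strict, when the end marker is the last line of the file rather than a blank line, that last line itself is left out. That is harmless here, since line 61 is “PHY Transfer Rate”, which we don’t keep.) The filter is: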

    (
      item 0 of it, item 1 of it, lines of item 2 of it
    )
    whose
    (
      line number of item 2 of it >= item 0 of it 
     and
      line number of item 2 of it < item 1 of it
    )

Finally, I only keep the lines (item 2 of it), discarding the start & ending line numbers, and additionally filter with whose() to include only the fields that are needed:

  concatenation ", " of 
   (
     items 2 
     whose
     (
       it contains "physicaldrive" 
      or
       it contains "Status" 
      or
       it contains "Firmware" 
      or
       it contains "Model"
     )
     of it as trimmed string
   )

I like this form of query because it is easy to swap out the values of “what the start of a stanza looks like” and “what the end of a stanza looks like”, but I also dislike this query because it could be less efficient on large files with lots of matching stanzas - because we have to re-read the lines of the file for each matching paragraph, and each time we throw out all the lines except those that are in the paragraph.

In this example we read the entire file once to find the starting line numbers, re-read it to find the ending line numbers, re-read it a third time to find all the lines that are in the first paragraph, and then re-read it a fourth time to find all the lines that are in the second paragraph.
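
The start/end matching described above can also be sketched in Python, for anyone who wants to try the logic outside the BES Client. This is an approximation, not the Relevance query itself. The names `stanza_ranges` and `parse_by_line_numbers` and the shortened sample are hypothetical. Because the lines are held in a list, the file content is only read once here, whereas the cost of the Relevance version depends on how the client evaluates it.

```python
# Hypothetical, shortened sample in the same shape as the hpacucli output.
SAMPLE_LINES = [
    "\tLogical Drive: 1",
    "\t         Status: OK",
    "\t            physicaldrive 1I:1:1 (port 1I:box 1:bay 1, Solid State SATA, 1600.3 GB, OK)",
    "\t",
    "\t      physicaldrive 1I:1:1",
    "\t         Port: 1I",
    "\t         Status: OK",
    "\t         Firmware Revision: G2010141",
    "\t         Model: ATA     INTEL SSDSC2BB02",
    "\t",
    "\t      physicaldrive 1I:1:2",
    "\t         Port: 1I",
    "\t         Status: OK",
    "\t         Firmware Revision: G2010140",
    "\t         Model: ATA     INTEL SSDSC2BB01",
    "\t         PHY Count: 1",
]


def stanza_ranges(lines):
    """Pair each stanza start with the lowest end line number greater than it."""
    numbered = list(enumerate(lines, start=1))
    starts = [n for n, text in numbered
              if text.strip().startswith("physicaldrive") and "(" not in text]
    # Ends: every blank line, plus the last line number of the file.
    ends = [n for n, text in numbered if text.strip() == ""] + [len(lines)]
    return [(s, min((e for e in ends if e > s), default=len(lines)))
            for s in starts]


def parse_by_line_numbers(lines):
    wanted = ("physicaldrive", "Status", "Firmware", "Model")
    results = []
    for start, end in stanza_ranges(lines):
        # Same bounds as the query: start <= line number < end.
        stanza = [text.strip() for n, text in enumerate(lines, start=1)
                  if start <= n < end]
        results.append(", ".join(t for t in stanza
                                 if any(w in t for w in wanted)))
    return results


print(stanza_ranges(SAMPLE_LINES))
for row in parse_by_line_numbers(SAMPLE_LINES):
    print(row)
```

Swapping the two list comprehensions in `stanza_ranges` is all it takes to change what the start and end of a stanza look like, which is the same property that makes the Relevance version easy to adapt.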

Many thanks to everyone who took a stab at this challenge; it was definitely not an easy one. Now that some of the solutions are out, please do give your comments, questions, and discussions on these, and on any new inspired ideas you may have on how to approach this problem - there do seem to be a variety of use-cases where this can be helpful.

Jonathan · April 3, 2020, 12:21am

I don’t see how to do this…

Since I’m buggin’ you anyway I’m on the hunt for what the percent encoding value is that I can plug into a query to isolate “DEBUG” in the following:

<configuration>
	<configSections>
		<section name="log4net" type="log4net.Config.Log4NetConfigurationSectionHandler, log4net"/>
	</configSections>
	<log4net>
		<root>
			<level value="DEBUG"/>
			<appender-ref ref="ServerAppender" />

where there is only one “<appender-ref ref="ServerAppender" />” in the file.

Whitespace at the front of a given line are tabs.

Mightn’t you offer any words of wisdom por favor?

JasonWalker · April 3, 2020, 1:22am

On a phone (as I usually browse the forum) I tap on the word “Details” in the post, and it expands the…details. like so

For your case, that looks like XML…I’d use the XML inspectors (I’ll need to get back to a computer for that though)

JasonWalker · April 3, 2020, 1:27am

Some links that may help