Inspector performance for different data types (CSV, XML, JSON)

atlauren · February 7, 2025, 8:13pm

Has anyone interrogated inspector performance for the various methods of extracting data from files?

I have a usecase in which I will be storing data, distributing it to agents, and then parsing the data via relevance. I have the freedom of choosing the data format. This inspires questions of, of all of BigFix relevance’s data munging inspectors, which are notably faster or slower than others?

Of the structured data inspectors, there are:

For somewhat less structured data, we always have:

strings

Has anyone done bakeoffs?

Thanks,
Andrew

atlauren · February 7, 2025, 8:20pm

Personal experience informs that xml, plist, and regex are slower.

Strings is plenty fast. On brief experiments with json, it is also fast.

But this is just informal impressions from working with the product over the years.

orbiton · February 7, 2025, 9:10pm

Most of the time I’ve used to extract data from XML files or Text files.

So I’ve used XML, Regex and Strings.

Having the same experience as you mentioned

atlauren · February 8, 2025, 11:42pm

Playing around a bit, I’m using the student-scores.[ext] files available here:

If nothing else, they are well-structured examples of the same data across different formats, and of a decent size.

atlauren · February 8, 2025, 11:44pm

1. Count the total items

CSV

q: (number of lines of file "c:\users\atlauren\downloads\sling-academy\student-scores.csv") - 1
A: 2000
T: 19.845 ms
I: singular integer

XML

q: number of (xpaths "/students/item" of xml documents of file "c:\users\atlauren\downloads\sling-academy\student-scores.xml") 
A: 2000
T: 42.530 ms
I: singular integer

JSON

q: number of (elements of jsons of file "c:\users\atlauren\downloads\sling-academy\student-scores.json")
A: 2000
T: 99.210 ms
I: singular integer

Speed advantage: Strings (CSV)

atlauren · February 9, 2025, 12:00am

2. Count a filtered subset

CSV

q: (number of (lines of file "c:\users\atlauren\downloads\sling-academy\student-scores.csv") whose (it contains "Lawyer"))
A: 138
T: 20.234 ms
I: singular integer

Incremental speed cost: minimal

Structured CSV
Borrowing from our great saint @JasonWalker in 2016, we use regex, concatenation, and tuples to turn the CSV into something more useful, and a structured filter.

q: number of (it) whose (tuple string item 9 of it = "Lawyer") of ((( "( " & concatenation " ), ( " of parenthesized parts 3 of matches (regex "((^|,)((%22([^%22]|%22%22)*%22)|[^%22,]*)(,|$))((%22([^%22]|%22%22)*%22)|[^,]*)(,|$)") of it & " )")) of lines of file "c:\users\atlauren\downloads\sling-academy\student-scores.csv") 
A: 138
T: 568.094 ms
I: singular integer

Incremental speed cost (structured): ~2800%

XML

q: number of xpaths "/students/item/career_aspiration/text()" whose (node value of it = "Lawyer") of (xml documents of file "c:\users\atlauren\downloads\sling-academy\student-scores.xml")
A: 138
T: 47.386 ms
I: singular integer

Incremental speed cost: ~10%

JSON

q: number of (elements of jsons of file "c:\users\atlauren\downloads\sling-academy\student-scores.json") whose (value of key "career_aspiration" of it as string = "Lawyer")
A: 138
T: 116.858 ms
I: singular integer

Incremental speed ocst: ~18%

Raw speed advantage: Strings (CSV)
Incremental speed advantage: Strings (CSV)
Holy cow the speed degredation: CSV (structured)

atlauren · February 9, 2025, 12:10am

However, the above examples are optimized only for speed, not further flexibility of actually using the data. (There might also be still-faster variations that I have not found.)

atlauren · February 9, 2025, 12:26am

3. Extract fields from structured filter

CSV
(No entry)

Structured CSV

q: (tuple string item 3 of it) of (it) whose (tuple string item 9 of it = "Lawyer") of ((( "( " & concatenation " ), ( " of parenthesized parts 3 of matches (regex "((^|,)((%22([^%22]|%22%22)*%22)|[^%22,]*)(,|$))((%22([^%22]|%22%22)*%22)|[^,]*)(,|$)") of it & " )")) of lines of file "c:\users\atlauren\downloads\sling-academy\student-scores.csv") 
[SNIP]
T: 605.300 ms
I: plural string

Incremental speed cost: 25%

XML

q: node values of xpaths "email/text()" of parent nodes of parent nodes of xpaths "/students/item/career_aspiration/text()" whose (node value of it = "Lawyer") of (xml documents of file "c:\users\atlauren\downloads\sling-academy\student-scores.xml")
[SNIP]
T: 60.292 ms
I: plural string

Incremental speed cost: 42%
(There’s probably a more effiicent way of wrangling the XML.)

JSON

q: (value of keys "email" of it as string) of (elements of jsons of file "student-scores.json" of folder "c:\users\atlauren\downloads\sling-academy\") whose (value of key "career_aspiration" of it as string = "Lawyer")
[SNIP]
T: 117.187 ms
I: plural string

Incremental speed cost: insignificant

atlauren · February 9, 2025, 3:11am

Lessons:

CSV: No one should ever do this. Run away.

XML: Fastest, but annoying.

JSON: Decently fast, winner for simplicity.

atlauren · February 9, 2025, 3:16am

Having run down this rabbit hole, I’m curious if others can wrangle notably different results?

Now in 2025, can pcre regex be faster? (I spent a great deal of time searching for PCRE options, but didn’t succeed in smashing them into BigFix.)

Instead of concatentation and tuples, maybe sets would be more efficient? ( @brolly33 ?)

Is there a more efficient way of handling XML?

JasonWalker · February 9, 2025, 5:24am

Be aware of a weird behavior in the Fixlet Debugger - when you run multiple queries in one QNA tab, the time seems to “add up” from the bottom to the top.

I.e. the time listed for a query, is the sum of that query’s time and every query that falls below it on the page. I regard it as a bug, but I’ve heard it argued that it’s expected behavior.

In any case, when comparing times of queries be sure to have each query by itself on separate tabs.

For the XML query, its possible to express the whole thing as an XPath expression, but I found the time difference negligible (42 ms for pure XPath vs 47 ms for your original query on my system):

q: node values of xpaths "/students/item/career_aspiration[text()='Lawyer']/../email/text()"  of (xml documents of file "C:\temp\student-scores.xml")
<snip>
T: 42.764 ms

q: node values of xpaths "email/text()" of parent nodes of parent nodes of xpaths "/students/item/career_aspiration/text()" whose (node value of it = "Lawyer") of (xml documents of file "C:\temp\student-scores.xml")
<snip>
T: 47.115 ms

I found a couple of other ways to express the XPath after I started writing this, but they all had similar evaluation times; I’m posting them here just in case they’re helpful refs

Original:
q: node values of xpaths "/students/item/career_aspiration[text()='Lawyer']/../email/text()" of (xml documents of file "C:\temp\student-scores.xml")

Explanation: Start at studens/item/career_aspiration with a text value of “Lawyer”, then move up one node (to ‘item’) and then down to ‘email’ and return the text() value.

q: node values of xpaths "/students/item/email[../career_aspiration/text()='Lawyer']/text()" of (xml documents of file "C:\temp\student-scores.xml")

Explanation: Find the stuends/item/email node, where the parent node has a career_aspiration node with a text() value of ‘Lawyer’; from that email node return the text() value.

q: node values of xpaths "/students/item[career_aspiration/text()='Lawyer']/email/text()" of (xml documents of file "C:\temp\student-scores.xml")
Explanation: Find the item nodes which contain child node of ‘career_aspiration’ with a text() value of ‘Lawyer’; from that item node, find the email child node and return the text() value of it.

This last form I think is the cleanest but it’s all a matter of preference and different queries can make more sense in different situations.

atlauren · February 9, 2025, 10:56pm

atlauren:

Structured CSV

q: (tuple string item 3 of it) of (it) whose (tuple string item 9 of it = "Lawyer") of ((( "( " & concatenation " ), ( " of parenthesized parts 3 of matches (regex "((^|,)((%22([^%22]|%22%22)*%22)|[^%22,]*)(,|$))((%22([^%22]|%22%22)*%22)|[^,]*)(,|$)") of it & " )")) of lines of file "c:\users\atlauren\downloads\sling-academy\student-scores.csv") 
[SNIP]
T: 605.300 ms
I: plural string

Incremental speed cost: 25%

Swapping in Jason’s 2025 regex pcre, I see a performance gain of about 5%.

q: (tuple string item 3 of it) of (it) whose (tuple string item 9 of it = "Lawyer") of ((( "( " & concatenation " ), ( " of parenthesized parts 2 of matches (perl regex "(?:^|,)(?=[^%22]|(%22)?)%22?((?(1)[^%22]*|[^,%22]*))%22?(?=,|$)") of it & " )")) of lines of file "c:\users\atlauren\downloads\sling-academy\student-scores.csv")

atlauren · February 9, 2025, 11:06pm

FWIW, I find that all variations of putting the query/node navigation inside of xpath are about 20% faster than my original parent nodes of parent nodes relevance.

Thanks for this! I had a feeling that there were useful depths to xpath. Glad to see them in action.