There have been lots of postings over the years about parsing CSV content in Relevance, several of them by me. Up to now I never had a satisfactory answer; the Regular Expressions that have been posted seemed to choke on formatting items like embedded quotes or commas in the CSV content, as did the repetitions on "following text of first “, " of precedings of following of preceding” did not lend itself to changes in the CSV columns list.
I think I’ve finally come up with a Regular Expression that handles CSV content (except for embedded newlines in a field, but I’m going to ignore that edge case for now). I’d like to get some community feedback to see if I’m missing anything, and wanted to share this in case it proves helpful for anyone else.
Given a CSV-formatted computer inventory, we might have lines like
123456,Hostname1,"COMPUTER, LAPTOP",INTERNATIONAL BUSINESS MACHINE,9545-308,SN123456,Location1,"Public, John Q"
234567,Hostname2,"COMPUTER, Desktop",HP,xw6400,SN234567,Location2,"Doe, Jane "
Which I’ll represent in this relevance with the string set
("123456,Hostname1,%22COMPUTER, LAPTOP%22,INTERNATIONAL BUSINESS MACHINE,9545-308,SN123456,Location1,%22Public, John Q%22"; "234567,Hostname2,%22COMPUTER, Desktop%22,HP,xw6400,SN234567,Location2,%22Doe, Jane %22" )
Step 1: Use a Regular Expression to split out the CSV content into separate matches. The trick to not-matching on an embedded comma (as in “Computer, Laptop” or “Doe, Jane”), that I’ve always missed, is that a CSV field should be preceded by a valid CSV field.
I ended up with the regex
((^|,)((%22([^%22]|%22%22)*%22)|[^%22,]*)(,|$))((%22([^%22]|%22%22)*%22)|[^,]*)(,|$)
Described as
(
(^|,)((%22([^%22]|%22%22)*%22)|[^%22,]*)(,|$)
/* From the start of a line or a comma, matching either a quote, followed by anything thats not a quote or doubled set of quotes, followed by a quote; OR anything that's not a quote, until a comma or end of line (A single CSV field) - 0 or more times */
)
((%22([^%22]|%22%22)*%22)|[^,]*)
/* Followed by a second CSV field (Quote followed by any number of (non-quote OR doubled quotes), OR any other characters except comma) - a *second* matching CSV field */
(,|$)
/* Followed by a comma or end-of-line */
From this, we’ll take parenthesized match 3 - the second CSV field match for each regex match. Seen in action, this gives
q: (parenthesized parts 3 of matches (regex "((^|,)((%22([^%22]|%22%22)*%22)|[^%22,]*)(,|$))((%22([^%22]|%22%22)*%22)|[^,]*)(,|$)") of it) of ("123456,Hostname1,%22COMPUTER, LAPTOP%22,INTERNATIONAL BUSINESS MACHINE,9545-308,SN123456,Location1,%22Public, John Q%22"; "234567,Hostname2,%22COMPUTER, Desktop%22,HP,xw6400,SN234567,Location2,%22Doe, Jane %22" )
A: 123456
A: Hostname1
A: "COMPUTER, LAPTOP"
A: INTERNATIONAL BUSINESS MACHINE
A: 9545-308
A: SN123456
A: Location1
A: "Public, John Q"
A: 234567
A: Hostname2
A: "COMPUTER, Desktop"
A: HP
A: xw6400
A: SN234567
A: Location2
A: "Doe, Jane "
T: 0.220 ms
I: plural substring
/* Format the results into Tuple Strings by escaping them - enclosing each item in "( )" and separating by ", " -- so an embedded comma doesn't resolve as an excess tuple string item */
q: (("( " & concatenation " ), ( " of parenthesized parts 3 of matches (regex "((^|,)((%22([^%22]|%22%22)*%22)|[^%22,]*)(,|$))((%22([^%22]|%22%22)*%22)|[^,]*)(,|$)") of it & " )")) of ("123456,Hostname1,%22COMPUTER, LAPTOP%22,INTERNATIONAL BUSINESS MACHINE,9545-308,SN123456,Location1,%22Public, John Q%22"; "234567,Hostname2,%22COMPUTER, Desktop%22,HP,xw6400,SN234567,Location2,%22Doe, Jane %22" )
A: ( 123456 ), ( Hostname1 ), ( "COMPUTER, LAPTOP" ), ( INTERNATIONAL BUSINESS MACHINE ), ( 9545-308 ), ( SN123456 ), ( Location1 ), ( "Public, John Q")
A: ( 234567 ), ( Hostname2 ), ( "COMPUTER, Desktop" ), ( HP ), ( xw6400 ), ( SN234567 ), ( Location2 ), ( "Doe, Jane ")
T: 0.234 ms
I: plural string
/* Read the results as tuple string items */
q: tuple string items 0 of (("( " & concatenation " ), ( " of parenthesized parts 3 of matches (regex "((^|,)((%22([^%22]|%22%22)*%22)|[^%22,]*)(,|$))((%22([^%22]|%22%22)*%22)|[^,]*)(,|$)") of it & " )")) of ("123456,Hostname1,%22COMPUTER, LAPTOP%22,INTERNATIONAL BUSINESS MACHINE,9545-308,SN123456,Location1,%22Public, John Q%22"; "234567,Hostname2,%22COMPUTER, Desktop%22,HP,xw6400,SN234567,Location2,%22Doe, Jane %22" )
A: 123456
A: 234567
T: 0.277 ms
I: plural string
Find the asset tag ...
q: it whose (tuple string item 0 of it = "234567" ) of (("( " & concatenation " ), ( " of parenthesized parts 3 of matches (regex "((^|,)((%22([^%22]|%22%22)*%22)|[^%22,]*)(,|$))((%22([^%22]|%22%22)*%22)|[^,]*)(,|$)") of it & " )")) of ("123456,Hostname1,%22COMPUTER, LAPTOP%22,INTERNATIONAL BUSINESS MACHINE,9545-308,SN123456,Location1,%22Public, John Q%22"; "234567,Hostname2,%22COMPUTER, Desktop%22,HP,xw6400,SN234567,Location2,%22Doe, Jane %22" )
A: ( 234567 ), ( Hostname2 ), ( "COMPUTER, Desktop" ), ( HP ), ( xw6400 ), ( SN234567 ), ( Location2 ), ( "Doe, Jane ")
T: 0.301 ms
I: plural string
Put into practice, I can retrieve all of the asset tag numbers from a CSV of 8,249 lines in about a half-second:
q: number of tuple string items 0 of (("( " & concatenation " ), ( " of parenthesized parts 3 of matches (regex "((^|,)((%22([^%22]|%22%22)*%22)|[^%22,]*)(,|$))((%22([^%22]|%22%22)*%22)|[^,]*)(,|$)") of it & " )")) of (lines of file "c:\temp\computers.csv" )
A: 8249
q: tuple string items 0 of (("( " & concatenation " ), ( " of parenthesized parts 3 of matches (regex "((^|,)((%22([^%22]|%22%22)*%22)|[^%22,]*)(,|$))((%22([^%22]|%22%22)*%22)|[^,]*)(,|$)") of it & " )")) of (lines of file "c:\temp\computers.csv" )
A: ECN
A: 0038904
A: 0038939
A: 0039632
A: 0040535
A: 0076434
A: 0086809
/* snipped */
T: 532.061 ms
I: plural string
Edit: The closing parenthesis that is appended needs a space before it.