CSV, Regular Expression, and Tuple String Items

JasonWalker · March 8, 2016, 4:31pm

There have been lots of postings over the years about parsing CSV content in Relevance, several of them by me. Up to now I never had a satisfactory answer; the Regular Expressions that have been posted seemed to choke on formatting items like embedded quotes or commas in the CSV content, as did the repetitions on "following text of first “, " of precedings of following of preceding” did not lend itself to changes in the CSV columns list.

I think I’ve finally come up with a Regular Expression that handles CSV content (except for embedded newlines in a field, but I’m going to ignore that edge case for now). I’d like to get some community feedback to see if I’m missing anything, and wanted to share this in case it proves helpful for anyone else.

Given a CSV-formatted computer inventory, we might have lines like

123456,Hostname1,"COMPUTER, LAPTOP",INTERNATIONAL BUSINESS MACHINE,9545-308,SN123456,Location1,"Public, John Q"

234567,Hostname2,"COMPUTER, Desktop",HP,xw6400,SN234567,Location2,"Doe, Jane "

Which I’ll represent in this relevance with the string set

("123456,Hostname1,%22COMPUTER, LAPTOP%22,INTERNATIONAL BUSINESS MACHINE,9545-308,SN123456,Location1,%22Public, John Q%22"; "234567,Hostname2,%22COMPUTER, Desktop%22,HP,xw6400,SN234567,Location2,%22Doe, Jane %22" )

Step 1: Use a Regular Expression to split out the CSV content into separate matches. The trick to not-matching on an embedded comma (as in “Computer, Laptop” or “Doe, Jane”), that I’ve always missed, is that a CSV field should be preceded by a valid CSV field.

I ended up with the regex

((^|,)((%22([^%22]|%22%22)*%22)|[^%22,]*)(,|$))((%22([^%22]|%22%22)*%22)|[^,]*)(,|$)

Described as

(
   (^|,)((%22([^%22]|%22%22)*%22)|[^%22,]*)(,|$)
          /* From the start of a line or a comma, matching either a quote, followed by anything thats not a quote or doubled set of quotes, followed by a quote; OR anything that's not a quote, until a comma or end of line (A single CSV field) - 0 or more times */
 )
 ((%22([^%22]|%22%22)*%22)|[^,]*)
           /* Followed by a second CSV field (Quote followed by any number of (non-quote OR doubled quotes), OR any other characters except comma) - a *second* matching CSV field */
(,|$)   
         /* Followed by a comma or end-of-line */

From this, we’ll take parenthesized match 3 - the second CSV field match for each regex match. Seen in action, this gives

q: (parenthesized parts 3 of matches (regex "((^|,)((%22([^%22]|%22%22)*%22)|[^%22,]*)(,|$))((%22([^%22]|%22%22)*%22)|[^,]*)(,|$)") of it) of ("123456,Hostname1,%22COMPUTER, LAPTOP%22,INTERNATIONAL BUSINESS MACHINE,9545-308,SN123456,Location1,%22Public, John Q%22"; "234567,Hostname2,%22COMPUTER, Desktop%22,HP,xw6400,SN234567,Location2,%22Doe, Jane %22" )
A: 123456
A: Hostname1
A: "COMPUTER, LAPTOP"
A: INTERNATIONAL BUSINESS MACHINE
A: 9545-308
A: SN123456
A: Location1
A: "Public, John Q"
A: 234567
A: Hostname2
A: "COMPUTER, Desktop"
A: HP
A: xw6400
A: SN234567
A: Location2
A: "Doe, Jane "
T: 0.220 ms
I: plural substring
    
  /*  Format the results into Tuple Strings by escaping them - enclosing each item in "( )" and separating by ", "  -- so an embedded comma doesn't resolve as an excess tuple string item */
    q:  (("( " & concatenation " ), ( " of parenthesized parts 3 of matches (regex "((^|,)((%22([^%22]|%22%22)*%22)|[^%22,]*)(,|$))((%22([^%22]|%22%22)*%22)|[^,]*)(,|$)") of it & " )")) of ("123456,Hostname1,%22COMPUTER, LAPTOP%22,INTERNATIONAL BUSINESS MACHINE,9545-308,SN123456,Location1,%22Public, John Q%22"; "234567,Hostname2,%22COMPUTER, Desktop%22,HP,xw6400,SN234567,Location2,%22Doe, Jane %22" )
    A: ( 123456 ), ( Hostname1 ), ( "COMPUTER, LAPTOP" ), ( INTERNATIONAL BUSINESS MACHINE ), ( 9545-308 ), ( SN123456 ), ( Location1 ), ( "Public, John Q")
    A: ( 234567 ), ( Hostname2 ), ( "COMPUTER, Desktop" ), ( HP ), ( xw6400 ), ( SN234567 ), ( Location2 ), ( "Doe, Jane ")
    T: 0.234 ms
    I: plural string

/* Read the results as tuple string items */ 
q:  tuple string items 0 of (("( " & concatenation " ), ( " of parenthesized parts 3 of matches (regex "((^|,)((%22([^%22]|%22%22)*%22)|[^%22,]*)(,|$))((%22([^%22]|%22%22)*%22)|[^,]*)(,|$)") of it & " )")) of ("123456,Hostname1,%22COMPUTER, LAPTOP%22,INTERNATIONAL BUSINESS MACHINE,9545-308,SN123456,Location1,%22Public, John Q%22"; "234567,Hostname2,%22COMPUTER, Desktop%22,HP,xw6400,SN234567,Location2,%22Doe, Jane %22" )
A: 123456
A: 234567
T: 0.277 ms
I: plural string

Find the asset tag ...
q: it whose (tuple string item 0 of it = "234567" ) of (("( " & concatenation " ), ( " of parenthesized parts 3 of matches (regex "((^|,)((%22([^%22]|%22%22)*%22)|[^%22,]*)(,|$))((%22([^%22]|%22%22)*%22)|[^,]*)(,|$)") of it & " )")) of ("123456,Hostname1,%22COMPUTER, LAPTOP%22,INTERNATIONAL BUSINESS MACHINE,9545-308,SN123456,Location1,%22Public, John Q%22"; "234567,Hostname2,%22COMPUTER, Desktop%22,HP,xw6400,SN234567,Location2,%22Doe, Jane %22" )
A: ( 234567 ), ( Hostname2 ), ( "COMPUTER, Desktop" ), ( HP ), ( xw6400 ), ( SN234567 ), ( Location2 ), ( "Doe, Jane ")
T: 0.301 ms
I: plural string

Put into practice, I can retrieve all of the asset tag numbers from a CSV of 8,249 lines in about a half-second:

q:  number of tuple string items 0 of (("( " & concatenation " ), ( " of parenthesized parts 3 of matches (regex "((^|,)((%22([^%22]|%22%22)*%22)|[^%22,]*)(,|$))((%22([^%22]|%22%22)*%22)|[^,]*)(,|$)") of it & " )")) of (lines of file "c:\temp\computers.csv" )
A: 8249

q:  tuple string items 0 of (("( " & concatenation " ), ( " of parenthesized parts 3 of matches (regex "((^|,)((%22([^%22]|%22%22)*%22)|[^%22,]*)(,|$))((%22([^%22]|%22%22)*%22)|[^,]*)(,|$)") of it & " )")) of (lines of file "c:\temp\computers.csv" )
A: ECN
A: 0038904
A: 0038939
A: 0039632
A: 0040535
A: 0076434
A: 0086809
/* snipped */
T: 532.061 ms
I: plural string

Edit: The closing parenthesis that is appended needs a space before it.

JasonWalker · March 8, 2016, 5:58pm

And finally…fix the quoting back to human-readable

q:  (("( " & concatenation " ), ( " of (if it starts with "%22" then (concatenation "%22" of substrings separated by "%22%22" of first (length of it - 1) of last (length of it - 1) of it) else it ) of parenthesized parts 3 of matches (regex "((^|,)((%22([^%22]|%22%22)*%22)|[^%22,]*)(,|$))((%22([^%22]|%22%22)*%22)|[^,]*)(,|$)") of it & " )")) of ("123456,Hostname1,%22COMPUTER, LAPTOP%22,INTERNATIONAL BUSINESS MACHINE,9545-308,SN123456,Location1,%22Public, John Q%22"; "234567,Hostname2,%22COMPUTER, Desktop%22,HP,xw6400,SN234567,Location2,%22Doe, Jane %22%22Jen%22%22 %22" )
A: ( 123456 ), ( Hostname1 ), ( COMPUTER, LAPTOP ), ( INTERNATIONAL BUSINESS MACHINE ), ( 9545-308 ), ( SN123456 ), ( Location1 ), ( Public, John Q)
A: ( 234567 ), ( Hostname2 ), ( COMPUTER, Desktop ), ( HP ), ( xw6400 ), ( SN234567 ), ( Location2 ), ( Doe, Jane "Jen" )
T: 0.436 ms
I: plural string

Edit: The closing parenthesis that is appended needs a space before it.

JasonWalker · March 9, 2016, 8:46pm

One more revision. If a CSV field contains an unbalanced set of parentheses, the “tuple string item” inspector chokes. This version will check for unbalanced parentheses within a field, and if there is an imbalance the parentheses will be replaced with a period:

tuple string items of (("( " & concatenation " ), ( " of (if it starts with "%22" then (concatenation "%22" of substrings separated by "%22%22" of first (length of it - 1) of last (length of it - 1) of it) else it ) of (if (number of characters whose (it="(") of it = number of characters whose (it=")") of it) then it else concatenation "." of substrings separated by "(" of concatenation "." of substrings separated by ")" of it) of parenthesized parts 3 of matches (regex "((^|,)((%22([^%22]|%22%22)*%22)|[^%22,]*)(,|$))((%22([^%22]|%22%22)*%22)|[^,]*)(,|$)") of it & " )")) of lines of file "c:\temp\computers.csv"

Sean · March 11, 2016, 4:35pm

Nice work, Jason. This is awesome. I tested out on a csv I had to struggle with in the past and it worked great. Thanks for the contribution!

jgstew · March 24, 2016, 9:45pm

I agree, excellent work.

JasonWalker · March 25, 2016, 2:50am

Glad to hear it’s helpful for others.

This was difficult enough that I want to make sure it works with all the cases I might throw at it, so please, if you find any examples where this doesn’t work the way I expect, post here so we can all improve on it.

JasonWalker · March 29, 2016, 1:39am

I’m checking out a slightly modified version. It’s slightly more readable, but performs marginally worse ( ~50 ms slower) on my test of two strings. I’ll try it against real data tomorrow. Posting this partly because I took the time to comment it better, partly because I haven’t seen much “concatenation of characters (X) of ‘y’”, and partly so I don’t forget it

/* Could also check "tuple string item 0", "tuple string item 1", or "number of tuple string items of it" */
tuple string items of 
 (
   (    
/* Format results as tuple string items by surrounding each in parentheses, and separating with comma-space combination */
     "( " & concatenation " ), ( " of 
     (
       if 
/* De-quote the field if it's a quoted field - remove the wrapping quotes, and change embedded doubled doublequotes to single doublequotes */
         it starts with "%22" 
       then
         (
           concatenation "%22" of substrings separated by "%22%22" of concatenation of characters
           (
/* Remove the wrapping quotes - take all but the first and last character of the string */
             integers in 
             (
               1, length of it - 2
             )
           )
           of it
         )
       else
/* wasn't a quoted field, no need to unwrap quotes */
         it
     )
     of 
     (
       if
/* Check for unbalanced parentheses in the field, and if present replace them; the 'tuple string item' inspector would fail */
         (
           number of characters 
           whose
           (
             it="("
           )
           of it = number of characters 
           whose
           (
             it=")"
           )
           of it
         )
       then
/* no parentheses imbalance - keep it */
         it 
       else
/* parentheses imbalanced, replace them */
         concatenation of 
         (
           (
             if
               (
                 it="(" 
                or
                 it=")"
               )
             then
               "." 
             else
               it
           )
           of characters of it
         )
     )
/* Regular expression to find CSV fields */
     of parenthesized parts 3 of matches 
     (
       regex "((^|,)((%22([^%22]|%22%22)*%22)|[^%22,]*)(,|$))((%22([^%22]|%22%22)*%22)|[^,]*)(,|$)"
     )
     of it & " )"
   )
 )
 of 
 (
/* sample data */

   "123456,Hostname1,%22COMPUTER, LAPTOP%22,INTERNATIONAL BUSINESS MACHINE,9545-308,SN123456,Location1,%22(Public, John Q%22"; "234567,Hostname2,%22COMPUTER, Desktop%22,HP,xw6400,SN234567,Location2,%22Doe, Jane %22%22Jen%22%22 %22"
 )

JasonWalker · February 8, 2017, 3:26am

Update: A forum poster has come across a case where this doesn’t work with more than a single comma embedded within a field.

q: tuple string items of (("( " & concatenation " ), ( " of (if it starts with "%22" then (concatenation "%22" of substrings separated by "%22%22" of first (length of it - 1) of last (length of it - 1) of it) else it ) of (if (number of characters whose (it="(") of it = number of characters whose (it=")") of it) then it else concatenation "." of substrings separated by "(" of concatenation "." of substrings separated by ")" of it) of parenthesized parts 3 of matches (regex "((^|,)((%22([^%22]|%22%22)*%22)|[^%22,]*)(,|$))((%22([^%22]|%22%22)*%22)|[^,]*)(,|$)") of it & " )")) of "1,%223,4,5%22,6"
A: 1
A: 3,4,5
A: 4
A: 6
T: 0.580 ms
I: plural string

Ideally this should have returned

A: 1
A: 3,4,5
A: 6

Not sure whether there is any way to fix this. A field with a single embedded comma works ok, but with more than one embedded comma, the field data also makes its own match to the regex.

gupnitin · June 15, 2017, 8:09am

One concern on which i require help.

When i am trying to pull this data from Web Report it is getting extracted in one cell.

Is there a way to get each data in each cell.

JasonWalker · June 15, 2017, 5:42pm

What is a “cell”? An HTML Table cell? If that’s the case, you should be able to do something like
table of trs of (concatenation of tds of (tuple string items ...

atlauren · February 9, 2025, 2:51am

Greetings from 2025! This is exactly what I’m looking for!

JasonWalker:

I ended up with the regex

((^|,)((%22([^%22]|%22%22)*%22)|[^%22,]*)(,|$))((%22([^%22]|%22%22)*%22)|[^,]*)(,|$)

[snip]

  /*  Format the results into Tuple Strings by escaping them - enclosing each item in "( )" and separating by ", "  -- so an embedded comma doesn't resolve as an excess tuple string item */

[snip]

Put into practice, I can retrieve all of the asset tag numbers from a CSV of 8,249 lines in about a half-second:

    q:  number of tuple string items 0 of (("( " & concatenation " ), ( " of parenthesized parts 3 of matches (regex "((^|,)((%22([^%22]|%22%22)*%22)|[^%22,]*)(,|$))((%22([^%22]|%22%22)*%22)|[^,]*)(,|$)") of it & " )")) of (lines of file "c:\temp\computers.csv" )

See application over here.

JasonWalker · February 9, 2025, 4:51am

Greetings, 2025!

With the march of progress we can now do a slightly better expression using the ‘perl regex’ inspector that doesn’t suffer from some of those edge cases I noted above. Both of these complex cases are handled with the perl-compatible regex.

Given the regex

perl regex "(?:^|,)(?=[^%22]|(%22)?)%22?((?(1)[^%22]*|[^,%22]*))%22?(?=,|$)"

the second matching group, i.e. ‘parenthesized parts 2’, should contain the CSV fields - whether they’re quoted, contain commas, etc.

q: parenthesized parts 2 of matches(perl regex "(?:^|,)(?=[^%22]|(%22)?)%22?((?(1)[^%22]*|[^,%22]*))%22?(?=,|$)") of "123,2.99,AMO024,Title,%22Description, more info%22,,123987564"
A: 123
A: 2.99
A: AMO024
A: Title
A: Description, more info
A:  more info
A: 
A: 123987564
T: 0.317 ms



q: parenthesized parts 2 of matches(perl regex "(?:^|,)(?=[^%22]|(%22)?)%22?((?(1)[^%22]*|[^,%22]*))%22?(?=,|$)") of "1,%223,4,5%22,6"
A: 1
A: 3,4,5
A: 4
A: 5
A: 6
T: 0.162 ms

There’s still another case I haven’t been able to solve though, where a field contains embedded quotes. Seems that it’s always something…
I also found another expression that has different edge cases, but so far it’s still a ‘pick your poison’ issue.

perl regex "(?:,%22|^%22)(%22%22|[\w\W]*?)(?=%22,|%22$)|(?:,(?!%22)|^(?!%22))([^,]*?)(?=$|,)|(\r\n|\n)"

Refs Regex for parsing Microsoft-style CSV data · GitHub and regex - Split a CSV where some entries have double quotes - Stack Overflow