LPeg for CSV parsing

Ken Smith-2
I have been using LPeg happily for quite a while for parsing comma
separated value (CSV) data and would like to share a bit of experience
I've gained which resulted in a few small but important tweaks to the
example grammar given on the LPeg page [1].  The first is a
recommendation Duncan Cross made about a year ago [2], which I'll
reiterate: change the definition of record listed at [1] to this.

local record = lpeg.Ct(field * (',' * field)^0) * (lpeg.P'\n' + -1)

This ensures the fields are returned as a list.  To enable parsing of
tab separated value (TSV) data, both field and record change to this.

local field = '"' * lpeg.Cs(((lpeg.P(1) - '"') + lpeg.P'""' / '"')^0) * '"' +
             lpeg.C((1 - lpeg.S',\t\n"')^0)
local record = lpeg.Ct(field * ((lpeg.P(',') + lpeg.P('\t')) * field)^0)
             * (lpeg.P'\n' + -1)
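As a quick check of the two changes so far, here is a minimal sketch that runs one comma separated and one tab separated line through the same record pattern. The sample strings and the names csv_row and tsv_row are made up for illustration, and the `local lpeg = require('lpeg')` form is an assumption that suits LPeg builds which do not set a global.

```lua
local lpeg = require('lpeg')

-- Quoted field (a doubled "" unescapes to a single ") or an unquoted field
-- that stops at a comma, tab, newline, or quote.
local field = '"' * lpeg.Cs(((lpeg.P(1) - '"') + lpeg.P'""' / '"')^0) * '"'
            + lpeg.C((1 - lpeg.S',\t\n"')^0)

-- lpeg.Ct gathers every field capture into one table; either a comma
-- or a tab is accepted as the separator.
local record = lpeg.Ct(field * ((lpeg.P(',') + lpeg.P('\t')) * field)^0)
             * (lpeg.P'\n' + -1)

local csv_row = lpeg.match(record, 'a,"b,c",d')
local tsv_row = lpeg.match(record, 'x\ty\tz')

print(#csv_row, csv_row[2])
print(#tsv_row, tsv_row[3])
```

Both calls return a single table, so the number of fields is available with the # operator instead of having to count multiple return values.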

Finally, to enable space around quoted fields, change field to this.

local field = lpeg.P(' ')^0
             * '"' * lpeg.Cs(((lpeg.P(1) - '"') + lpeg.P'""' / '"')^0) * '"'
             * lpeg.P(' ')^0
             + lpeg.C((1 - lpeg.S',\t\n"')^0)

Here's an example that puts it all together including three sample
strings that the original grammar doesn't recognize.  (Well, it
recognizes the first one but it doesn't break it into a table.)

local lpeg = require('lpeg')

local field =
  lpeg.P(' ')^0
  * '"' * lpeg.Cs(((lpeg.P(1) - '"') + lpeg.P'""' / '"')^0) * '"'
  * lpeg.P(' ')^0
  + lpeg.C((1 - lpeg.S',\t\n"')^0)

local record =
  lpeg.Ct(field * ((lpeg.P(',') + lpeg.P('\t')) * field)^0)
  * (lpeg.P'\n' + -1)

function csv (s)
 return lpeg.match(record, s)
end

local ex1 = '"a field, containing a comma",123,3.14,2.717'
local ex2 = 'val1\tval2\tval3\t'
local ex3 = '1, "a field, containing two commas, surrounded by space" , 3, 4'

local line = csv(ex1)
assert(line[1] == 'a field, containing a comma')

line = csv(ex2)
assert(line[1] == 'val1')

line = csv(ex3)
assert(line[2] == 'a field, containing two commas, surrounded by space')
EOF (tested with Lua 5.1.3)
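
The asserts above each check one field per sample. The sketch below reuses the same grammar and looks at the whole tables, which shows two behaviours that are easy to miss: a trailing separator yields a final empty field, and the optional spaces are only consumed around quoted fields, so unquoted fields keep theirs. The second sample string and the variable names are made up for illustration.

```lua
local lpeg = require('lpeg')

local field =
  lpeg.P(' ')^0
  * '"' * lpeg.Cs(((lpeg.P(1) - '"') + lpeg.P'""' / '"')^0) * '"'
  * lpeg.P(' ')^0
  + lpeg.C((1 - lpeg.S',\t\n"')^0)

local record =
  lpeg.Ct(field * ((lpeg.P(',') + lpeg.P('\t')) * field)^0)
  * (lpeg.P'\n' + -1)

-- The trailing tab is followed by an empty unquoted field.
local tsv = lpeg.match(record, 'val1\tval2\tval3\t')
print(#tsv, '[' .. tsv[4] .. ']')

-- Spaces around the quoted field are dropped; the unquoted " 3"
-- keeps its leading space.
local spaced = lpeg.match(record, '1, "x, y" , 3')
for i, v in ipairs(spaced) do
  print(i, '[' .. v .. ']')
end
```

If the leading space on unquoted fields is unwanted, trimming the values after the match is probably simpler than complicating the pattern.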

In the definition for field that I use every day, where the value for
field has this,

lpeg.Cs(((lpeg.P(1) - '"') + lpeg.P'""' / '"')^0)

I have lpeg.C instead of lpeg.Cs.  I don't remember why I did this but
it seems to work either way.  Feel free to update the LPeg site with
these amendments.

   Ken

1. http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html
2. http://lua-users.org/lists/lua-l/2007-11/msg00358.html

Re: LPeg for CSV parsing

Roberto Ierusalimschy
> I have been using LPeg happily for quite a while for parsing comma
> separated value (CSV) data and would like to share a bit of experience
> I've gained which resulted in a few small but important tweaks to the
> example grammar given on the LPeg page [1]. [...]

Many thanks for your suggestions. (My implementation follows
RFC 4180, which does not allow tabs as separators or spaces around
quoted fields.)
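
To illustrate the difference, here is a rough sketch of a stricter grammar with a comma as the only separator and no space handling. It is a reconstruction for comparison only, not the exact text of the LPeg page: the names strict_field and strict_record are hypothetical, the lpeg.Ct wrapper is kept so the results are easy to inspect, and the second sample string is made up.

```lua
local lpeg = require('lpeg')

-- Comma is the only separator, so a tab is an ordinary field character.
local strict_field = '"' * lpeg.Cs(((lpeg.P(1) - '"') + lpeg.P'""' / '"')^0) * '"'
                   + lpeg.C((1 - lpeg.S',\n"')^0)

local strict_record = lpeg.Ct(strict_field * (',' * strict_field)^0)
                    * (lpeg.P'\n' + -1)

-- The tabs stay inside a single field.
local tabbed = lpeg.match(strict_record, 'val1\tval2\tval3\t')
print(#tabbed)

-- A space before the opening quote leaves the quote unconsumed,
-- so the record as a whole fails to match.
local spaced = lpeg.match(strict_record, '1, "quoted, with spaces" , 3')
print(spaced)
```

So the tab separated line is accepted but comes back as one field, while the line with spaces around a quoted field is rejected outright.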


> In the definition for field that I use every day, where the value for
> field has this,
> 
> lpeg.Cs(((lpeg.P(1) - '"') + lpeg.P'""' / '"')^0)
> 
> I have lpeg.C instead of lpeg.Cs.  I don't remember why I did this but
> it seems to work either way.

The lpeg.Cs substitutes single quotes for double quotes, therefore
unescaping quotes. Try this example:

local ex1 = '"a ""field"" containing quotes",123,3.14,2.717'

-- Roberto
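
The effect is easy to see by running the quoted part of that example through both captures. This is a minimal sketch; the names body, with_cs and with_c are made up for illustration.

```lua
local lpeg = require('lpeg')

-- Body of a quoted field: any non-quote character, or a doubled quote
-- whose capture value is a single quote.
local body = ((lpeg.P(1) - '"') + lpeg.P'""' / '"')^0

local with_cs = '"' * lpeg.Cs(body) * '"'  -- substitution capture
local with_c  = '"' * lpeg.C(body) * '"'   -- simple capture

local input = '"a ""field"" containing quotes"'

-- Cs replaces each "" with the value of the inner capture.
local unescaped = lpeg.match(with_cs, input)
print(unescaped)

-- C returns the matched text untouched, followed by the value of each
-- inner capture as an extra result.
local raw = lpeg.match(with_c, input)
local count = select('#', lpeg.match(with_c, input))
print(raw, count)
```

With lpeg.C the doubled quotes survive in the field, and the extra results would show up as additional entries once the field is wrapped in lpeg.Ct. That would explain why both versions seem to work as long as the data contains no escaped quotes.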

Re: LPeg for CSV parsing

Ken Smith-2
On Thu, Oct 30, 2008 at 9:37 AM, Roberto Ierusalimschy <[hidden email]> wrote:

> > In the definition for field that I use every day, where the value for
> > field has this,
> >
> > lpeg.Cs(((lpeg.P(1) - '"') + lpeg.P'""' / '"')^0)
> >
> > I have lpeg.C instead of lpeg.Cs.  I don't remember why I did this but
> > it seems to work either way.
>
> The lpeg.Cs substitutes single quotes for double quotes, therefore
> unescaping quotes. Try this example:
>
> local ex1 = '"a ""field"" containing quotes",123,3.14,2.717'

Ah, thanks for the explanation.  I've added the lpeg.Cs to my parser.

   Ken