Replace specific comma's in a string.


Replace specific comma's in a string.

Russell Haley
Since my understanding of matching and captures in Lua is somewhat weak,
I am looking for opportunities to improve it. There is a Stack Exchange
question about parsing a file here:

https://unix.stackexchange.com/questions/422526/remove-comma-outside-quotes

The crux of the question is to leave the commas within quoted items
and replace all the outer "separator" commas with a tilde (~). So parse
this:

123,"ABC, DEV 23",345,534.202,NAME

and return this:

123~"ABC, DEV 23"~345~534.202~NAME.

I've put together a couple of pieces but can't find any way to make it
into an answer without resorting to loops.

s = '123,"ABC, DEV 23",345,534.202,NAME'

print(s:match('".*,.*"'))
"ABC, DEV 23"

print(s:gsub('".*(,).*"','~'))
123,~,345,534.202,NAME  1


From what I can tell, there is no way to add exclusions to patterns, so
at this point I'm stumped. Instead of asking for the solution and
studying it, I'd like to first ask for a hint from the mailing list.
My questions are:

- Is it possible to do this in a single call to gsub (I'm hoping yes)?
If not, I will first look at one or two calls (i.e. match and then
gsub) and then at using a loop.
- Is this something that would be better done with LPEG?

I'll likely ask if people can share possible answers later to see if
my solution is at all reasonable.

Thanks

Russ
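
On the second attempt in the post above: gsub replaces the whole match, not
the capture, which is why the entire quoted field collapsed to a single ~.
A minimal sketch of one way to avoid an explicit loop, assuming fields are
delimited by plain double quotes only (no backslash escapes) and that the
byte "\1" never occurs in the input. The function name and the placeholder
byte are illustrative choices, not something from the question:

```lua
-- Assumptions: only double quotes delimit fields, and "\1" is free to use
-- as a temporary placeholder for commas that must be preserved.
local function tilde_separators_simple(s)
  return (s:gsub('"[^"]*"', function(quoted)
            -- hide the commas inside each quoted field
            return (quoted:gsub(",", "\1"))
          end)
          :gsub(",", "~")     -- commas still visible are separators
          :gsub("\1", ","))   -- restore the hidden commas
end

print(tilde_separators_simple('123,"ABC, DEV 23",345,534.202,NAME'))
```

Quotes doubled inside a field, as in "a ""b"", c", are seen as adjacent
quoted pieces, so the commas inside them stay protected as well.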


Re: Replace specific comma's in a string.

Sean Conner
It was thus said that the Great Russell Haley once stated:
> Since my match and capture understanding in Lua is somewhat weak I am
> looking for opportunities to improve my understanding. There is a SO
> question about parsing a file here:
>
> https://unix.stackexchange.com/questions/422526/remove-comma-outside-quotes
>
> The crux of the question is to leave the commas within a quoted items
> and replace all the outer "separator" commas with tilde (~).

  [ snip ]

> - Is it possible to do this in a single call to gsub (I'm hoping yes)?
> If not, I will look first at one or two calls (i.e. match and then
> gsub) and using a loop.
> - Is this something that would be better done with LPEG?

  Yes.  Also, a few more examples that probably need to be addressed:

        123,"ABC, DEV 23",345,,202,NAME
        ,"ABC, DEV 23",345,534,202,NAME
        123,"ABC, DEV 23",345,534.202,
        123 , "ABC, DEV 23" , 345 , 534 , 202 , NAME
        123,"ABC, \"hello\", DEV 23",,,,NAME
        123,'ABC, "hello", DEV 23',342,534,202,NAME
        ,,,,,,

  Which are allowed?  Which are disallowed?  Are these errors?  What?

  -spc


Re: Replace specific comma's in a string.

Coda Highland
In reply to this post by Russell Haley
On Fri, Feb 9, 2018 at 5:08 PM, Russell Haley <[hidden email]> wrote:

> Since my match and capture understanding in Lua is somewhat weak I am
> looking for opportunities to improve my understanding. There is a SO
> question about parsing a file here:
>
> https://unix.stackexchange.com/questions/422526/remove-comma-outside-quotes
>
> The crux of the question is to leave the commas within a quoted items
> and replace all the outer "separator" commas with tilde (~). So parse
> this:
>
> 123,"ABC, DEV 23",345,534.202,NAME
>
> and return this:
>
> 123~"ABC, DEV 23"~345~534.202~NAME.
>
> I've put together a couple of pieces but can't find any way to make it
> into an answer without resorting to loops.
>
> s = '123,"ABC, DEV 23",345,534.202,NAME'
>
> print(s:match('".*,.*"'))
> "ABC, DEV 23"
>
> print(s:gsub('".*(,).*"','~'))
> 123,~,345,534.202,NAME  1
>
>
> >From what i can tell, there is no way to add exclusions to patterns so
> at this point I'm stumped. Instead of asking for the solution and
> studying it, I'd like to first ask for a hint from the mailing list.
> My questions are:
>
> - Is it possible to do this in a single call to gsub (I'm hoping yes)?
> If not, I will look first at one or two calls (i.e. match and then
> gsub) and using a loop.
> - Is this something that would be better done with LPEG?
>
> I'll likely ask is people can share possible answers later to see if
> my solution is at all reasonable.
>
> Thanks
>
> Russ

LPEG could do it pretty trivially, yes.

Your discovery that it can't be done without loops is also fairly
accurate. CSV parsing is one of the classic examples of "you really
shouldn't try to do that with a regexp". If it's possible for values
to CONTAIN quotes (i.e. by escaping) instead of just being DELIMITED
by them, it's actually impossible (unless you use some Perlisms that
go beyond the technical formalism of regular expressions). Meanwhile,
gsub is LESS expressive than regexps.
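
A small illustration of that last point, as a sketch rather than a full
account of the differences: Lua patterns have no alternation, and a
quantifier applies only to a single character class, never to a group.

```lua
-- In Lua patterns "|" is an ordinary character, and a "*" that follows a
-- closing parenthesis is a literal star, not a repetition of the group.
print(("a"):match("a|b"))       -- no match: there is no alternation
print(("a|b"):match("a|b"))     -- matches: the bar is taken literally
print(("abab"):match("(ab)*"))  -- no match: a literal "*" is required
print(("ab*"):match("(ab)*"))   -- matches, returning the capture "ab"
```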

/s/ Adam


Re: Replace specific comma's in a string.

Paul K-2
In reply to this post by Russell Haley
Hi Russ,

> The crux of the question is to leave the commas within a quoted items
> and replace all the outer "separator" commas with tilde (~). So parse this:
> 123,"ABC, DEV 23",345,534.202,NAME
> and return this:
> 123~"ABC, DEV 23"~345~534.202~NAME.

Something like the following may work:

local list = {
  '123,"ABC, DEV 23",345,,202,NAME',
  ',"ABC, DEV 23",345,534,202,NAME',
  '123,"ABC, DEV 23",345,534.202,',
  '123 , "ABC, DEV 23" , 345 , 534 , 202 , NAME',
  [[123,"ABC, \"he,llo\", DEV 23",,,,NAME]],
  [[123,"ABC, \\",llo\", DEV 23,,,,NAME]],
  [[123,'ABC, "hello", DEV 23',342,534,202,NAME]],
}
local escaped, instring, pos = false, false, -1
for _,s in ipairs(list) do
  s = s:gsub([[()(["',\\])]], function(p, s)
      if p > pos+1 then escaped = false end
      if s == '\\' then escaped, pos = not escaped, p end
      if (s == '"' or s == "'") and not escaped then instring = not instring end
      if s == ',' and not instring then s = '~' end
      return s
    end)
  print(s)
end

This prints:

123~"ABC, DEV 23"~345~~202~NAME
~"ABC, DEV 23"~345~534~202~NAME
123~"ABC, DEV 23"~345~534.202~
123 ~ "ABC, DEV 23" ~ 345 ~ 534 ~ 202 ~ NAME
123~"ABC, \"he,llo\", DEV 23"~~~~NAME
123~"ABC, \\"~llo\"~ DEV 23~~~~NAME
123~'ABC, "hello", DEV 23'~342~534~202~NAME
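
The same state machine can also be sketched as a function that keeps its
state local to one call, so nothing carries over from one line to the next
(in the loop above, escaped, instring and pos are shared by all lines). The
name tilde_separators is hypothetical, and this variant remembers which
quote character opened the string, which the code above does not do:

```lua
-- Hypothetical helper: per-call state, one character at a time.
local function tilde_separators(line)
  local escaped = false  -- previous char was an unescaped backslash
  local quote = nil      -- quote char that opened the current string, if any
  local out = line:gsub(".", function(c)
    local was_escaped = escaped
    escaped = (c == "\\") and not was_escaped
    if (c == '"' or c == "'") and not was_escaped then
      if quote == nil then
        quote = c        -- opening quote
      elseif quote == c then
        quote = nil      -- matching closing quote
      end
    elseif c == "," and quote == nil then
      return "~"         -- separator comma
    end
    -- returning nothing keeps the character unchanged
  end)
  return out
end

print(tilde_separators('123,"ABC, DEV 23",345,534.202,NAME'))
```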

Paul.


Re: Replace specific comma's in a string.

Coda Highland
On Fri, Feb 9, 2018 at 5:40 PM, Paul K <[hidden email]> wrote:

> Hi Russ,
>
>> The crux of the question is to leave the commas within a quoted items
>> and replace all the outer "separator" commas with tilde (~). So parse this:
>> 123,"ABC, DEV 23",345,534.202,NAME
>> and return this:
>> 123~"ABC, DEV 23"~345~534.202~NAME.
>
> Something like the following may work:
>
> local list = {
>   '123,"ABC, DEV 23",345,,202,NAME',
>   ',"ABC, DEV 23",345,534,202,NAME',
>   '123,"ABC, DEV 23",345,534.202,',
>   '123 , "ABC, DEV 23" , 345 , 534 , 202 , NAME',
>   [[123,"ABC, \"he,llo\", DEV 23",,,,NAME]],
>   [[123,"ABC, \\",llo\", DEV 23,,,,NAME]],
>   [[123,'ABC, "hello", DEV 23',342,534,202,NAME]],
> }
> local escaped, instring, pos = false, false, -1
> for _,s in ipairs(list) do
>   s = s:gsub([[()(["',\\])]], function(p, s)
>       if p > pos+1 then escaped = false end
>       if s == '\\' then escaped, pos = not escaped, p end
>       if (s == '"' or s == "'") and not escaped then instring = not instring end
>       if s == ',' and not instring then s = '~' end
>       return s
>     end)
>   print(s)
> end
>
> This prints:
>
> 123~"ABC, DEV 23"~345~~202~NAME
> ~"ABC, DEV 23"~345~534~202~NAME
> 123~"ABC, DEV 23"~345~534.202~
> 123 ~ "ABC, DEV 23" ~ 345 ~ 534 ~ 202 ~ NAME
> 123~"ABC, \"he,llo\", DEV 23"~~~~NAME
> 123~"ABC, \\"~llo\"~ DEV 23~~~~NAME
> 123~'ABC, "hello", DEV 23'~342~534~202~NAME
>
> Paul.

Indeed, the trivial little state machine has ALWAYS been my preferred
way to solve this class of problem.

/s/ Adam


Re: Replace specific comma's in a string.

Paul K-2
> Indeed, the trivial little state machine has ALWAYS been my preferred
> way to solve this class of problem.

One case it doesn't handle is a mix of ' and " in the string (a
string should only be closed by the same quote it was opened with,
assuming both " and ' are allowed).

This code should handle this case (also added examples):

local list = {
  '123,"ABC, DEV 23",345,,202,NAME',
  ',"ABC, DEV 23",345,534,202,NAME',
  '123,"ABC, DEV 23",345,534.202,',
  '123 , "ABC, DEV 23" , 345 , 534 , 202 , NAME',
  [[123,"ABC, \"he,llo\", DEV 23",,,,NAME]],
  [[123,"ABC, \\",llo\", DEV 23,,,,NAME]],
  [[123,'ABC, "hel,lo", DEV 23',342,534,202,NAME]],
  [[123,"ABC, 'hel,lo, DEV 23",342,534,202,NAME]],
}
local escaped, instring, pos, start = false, false, -1, ""
for _,s in ipairs(list) do
  s = s:gsub([[()(["',\\])]], function(p, s)
      if p > pos+1 then escaped = false end
      if s == '\\' then escaped, pos = not escaped, p end
      if (s == '"' or s == "'") and not escaped
      and (not instring or s == start) then
        instring, start = not instring, s
      end
      if s == ',' and not instring then s = '~' end
      return s
    end)
  print(s)
end

Paul.


Re: Replace specific comma's in a string.

Andrew Gierth
In reply to this post by Coda Highland
>>>>> "Coda" == Coda Highland <[hidden email]> writes:

 Coda> Your discovery that it can't be done without loops is also fairly
 Coda> accurate. CSV parsing is one of the classic examples of "you
 Coda> really shouldn't try to do that with a regexp". If it's possible
 Coda> for values to CONTAIN quotes (i.e. by escaping) instead of just
 Coda> being DELIMITED by them, it's actually impossible (unless you use
 Coda> some Perlisms that go beyond the technical formalism of regular
 Coda> expressions).

Nonsense; CSV is clearly a regular language even when allowing quotes
inside the values.

Here is the definition from RFC4180 (excluding the obvious terminals):

  file = [header CRLF] record *(CRLF record) [CRLF]
  header = name *(COMMA name)
  record = field *(COMMA field)
  name = field
  field = (escaped / non-escaped)
  escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
  non-escaped = *TEXTDATA

which corresponds to this regexp (assuming negated classes such as [^"]
match newlines except where explicitly excluded):

^(("([^"]|"")*"|[^",\r\n]*)(,"([^"]|"")*"|,[^",\r\n]*)*(\r\n|$))*$

 Coda> Meanwhile, gsub is LESS expressive than regexps.

Indeed.

--
Andrew.


Re: Replace specific comma's in a string.

Sean Conner
It was thus said that the Great Andrew Gierth once stated:

> >>>>> "Coda" == Coda Highland <[hidden email]> writes:
>
>  Coda> Your discovery that it can't be done without loops is also fairly
>  Coda> accurate. CSV parsing is one of the classic examples of "you
>  Coda> really shouldn't try to do that with a regexp". If it's possible
>  Coda> for values to CONTAIN quotes (i.e. by escaping) instead of just
>  Coda> being DELIMITED by them, it's actually impossible (unless you use
>  Coda> some Perlisms that go beyond the technical formalism of regular
>  Coda> expressions).
>
> Nonsense; CSV is clearly a regular language even when allowing quotes
> inside the values.
>
> Here is the definition from RFC4180 (excluding the obvious terminals):
>
>   file = [header CRLF] record *(CRLF record) [CRLF]
>   header = name *(COMMA name)
>   record = field *(COMMA field)
>   name = field
>   field = (escaped / non-escaped)
>   escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
>   non-escaped = *TEXTDATA

  Well, TEXTDATA wasn't all that obvious to me---turns out it excludes the
double quote and comma.

  But the above could be dropped in nearly verbatim (some minimal
translation required) into re (a part of LPeg) making it more readable than:

> ^(("([^"]|"")*"|[^",\r\n]*)(,"([^"]|"")*"|,[^",\r\n]*)*(\r\n|$))*$

  Gesundheit [1].

  -spc

[1] For non-US readers, it is customary when someone sneezes to say
        "gesundheit" [2].  No, I don't know why or where that comes from.

[2] German for "healthiness" with the meaning "good health".


Re: Replace specific comma's in a string.

Egor Skriptunoff-2
In reply to this post by Paul K-2
On Sat, Feb 10, 2018 at 4:18 AM, Paul K wrote:

> One case it doesn't handle is the mix of ' and " in the string (as the
> string should only be closed with the same quote is was opened with,
> assuming both " and ' are allowed).
>
> This code should handle this case (also added examples):
>
> local list = {
>   '123,"ABC, DEV 23",345,,202,NAME',
>   ',"ABC, DEV 23",345,534,202,NAME',
>   '123,"ABC, DEV 23",345,534.202,',
>   '123 , "ABC, DEV 23" , 345 , 534 , 202 , NAME',
>   [[123,"ABC, \"he,llo\", DEV 23",,,,NAME]],
>   [[123,"ABC, \\",llo\", DEV 23,,,,NAME]],
>   [[123,'ABC, "hel,lo", DEV 23',342,534,202,NAME]],
>   [[123,"ABC, 'hel,lo, DEV 23",342,534,202,NAME]],
> }
> local escaped, instring, pos, start = false, false, -1, ""
> for _,s in ipairs(list) do
>   s = s:gsub([[()(["',\\])]], function(p, s)
>       if p > pos+1 then escaped = false end
>       if s == '\\' then escaped, pos = not escaped, p end
>       if (s == '"' or s == "'") and not escaped
>       and (not instring or s == start) then instring, start = not instring, s end
>       if s == ',' and not instring then s = '~' end
>       return s
>     end)
>   print(s)
> end


Exactly the same could be done using Lua patterns:

for _,s in ipairs(list) do
  s = s:gsub("\\?.", {['"']='\0\0"', ["'"]="\0\0'", [","]="\0,"})
       :gsub("((%z%z.).-)%2", function(a,b) return a:gsub("%z,", ",")..b end)
       :gsub("%z,", "~")
       :gsub("%z", "")
  print(s)
end
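
For readers who want to follow the chain, here is the same sequence of gsub
calls split into named steps. This is only an annotated sketch: the function
name is hypothetical, and it assumes the input contains no NUL bytes of its
own. Note that %z is deprecated since Lua 5.2, where "\0" may appear in a
pattern directly.

```lua
local function tilde_separators_marked(s)
  -- 1. Mark every unescaped quote with two NUL bytes and every unescaped
  --    comma with one. "\\?." consumes a backslash together with the
  --    character it escapes, so escaped characters find no table entry
  --    and are left alone.
  s = s:gsub("\\?.", { ['"'] = '\0\0"', ["'"] = "\0\0'", [","] = "\0," })
  -- 2. Between a marked quote and the next identical marked quote, remove
  --    the mark from commas: they belong to a quoted field.
  s = s:gsub("((%z%z.).-)%2", function(body, closing)
    return body:gsub("%z,", ",") .. closing
  end)
  -- 3. Commas that are still marked are separators.
  s = s:gsub("%z,", "~")
  -- 4. Drop the remaining NUL marks around the quotes.
  s = s:gsub("%z", "")
  return s
end

print(tilde_separators_marked('123,"ABC, DEV 23",345,534.202,NAME'))
```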

Re: Replace specific comma's in a string.

albertmcchan
In reply to this post by Andrew Gierth
On Feb 10, 2018, at 3:56 AM, Andrew Gierth <[hidden email]> wrote:

>>>>>> "Coda" == Coda Highland <[hidden email]> writes:
>
> Coda> Your discovery that it can't be done without loops is also fairly
> Coda> accurate. CSV parsing is one of the classic examples of "you
> Coda> really shouldn't try to do that with a regexp". If it's possible
> Coda> for values to CONTAIN quotes (i.e. by escaping) instead of just
> Coda> being DELIMITED by them, it's actually impossible (unless you use
> Coda> some Perlisms that go beyond the technical formalism of regular
> Coda> expressions).
>
> Nonsense; CSV is clearly a regular language even when allowing quotes
> inside the values.
>
> Here is the definition from RFC4180 (excluding the obvious terminals):
>
>  file = [header CRLF] record *(CRLF record) [CRLF]
>  header = name *(COMMA name)
>  record = field *(COMMA field)
>  name = field
>  field = (escaped / non-escaped)
>  escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
>  non-escaped = *TEXTDATA
>
> which corresponds to this regexp (assuming newlines match [^] except
> where explicitly excluded):
>
> ^(("([^"]|"")*"|[^",\r\n]*)(,"([^"]|"")*"|,[^",\r\n]*)*(\r\n|$))*$
>
> Code> Meanwhile, gsub is LESS expressive than regexps.
>
> Indeed.
>
> --
> Andrew.

Just curious, what would the LPeg re pattern look like?

Re: Replace specific comma's in a string.

Andrew Gierth
>>>>> "albertmcchan" == albertmcchan  <[hidden email]> writes:

 albertmcchan> just curious, what will lpeg re pattern look like ?

This version is given as an example in the re docs:

record = re.compile[[
  record <- {| field (',' field)* |} (%nl / !.)
  field <- escaped / nonescaped
  nonescaped <- { [^,"%nl]* }
  escaped <- '"' {~ ([^"] / '""' -> '"')* ~} '"'
]]

though of course this processes only one record at a time (and assumes
that newline conversion has already been done; the RFC's definition is
for use in MIME bodies with canonical CRLF newlines).

--
Andrew.


Re: Replace specific comma's in a string.

dyngeccetor8
In reply to this post by Egor Skriptunoff-2

On 02/10/2018 12:30 PM, Egor Skriptunoff wrote:

> Exactly the same could be done using Lua patterns:
>
> for _,s in ipairs(list) do
>   s = s:gsub("\\?.", {['"']='\0\0"', ["'"]="\0\0'", [","]="\0,"})
>        :gsub("((%z%z.).-)%2", function(a,b) return a:gsub("%z,", ",")..b
> end)
>        :gsub("%z,", "~")
>        :gsub("%z", "")
>   print(s)
> end
>

Looks to me like an entry for an International Obfuscated Lua Code Contest.

-- Martin


Re: Replace specific comma's in a string.

Russell Haley
In reply to this post by Egor Skriptunoff-2
On Sat, Feb 10, 2018 at 1:30 AM, Egor Skriptunoff
<[hidden email]> wrote:

> On Sat, Feb 10, 2018 at 4:18 AM, Paul K wrote:
>>
>>
>> One case it doesn't handle is the mix of ' and " in the string (as the
>> string should only be closed with the same quote is was opened with,
>> assuming both " and ' are allowed).
>>
>> This code should handle this case (also added examples):
>>
>> local list = {
>>   '123,"ABC, DEV 23",345,,202,NAME',
>>   ',"ABC, DEV 23",345,534,202,NAME',
>>   '123,"ABC, DEV 23",345,534.202,',
>>   '123 , "ABC, DEV 23" , 345 , 534 , 202 , NAME',
>>   [[123,"ABC, \"he,llo\", DEV 23",,,,NAME]],
>>   [[123,"ABC, \\",llo\", DEV 23,,,,NAME]],
>>   [[123,'ABC, "hel,lo", DEV 23',342,534,202,NAME]],
>>   [[123,"ABC, 'hel,lo, DEV 23",342,534,202,NAME]],
>> }
>> local escaped, instring, pos, start = false, false, -1, ""
>> for _,s in ipairs(list) do
>>   s = s:gsub([[()(["',\\])]], function(p, s)
>>       if p > pos+1 then escaped = false end
>>       if s == '\\' then escaped, pos = not escaped, p end
>>       if (s == '"' or s == "'") and not escaped
>>       and (not instring or s == start) then instring, start = not
>> instring, s end
>>       if s == ',' and not instring then s = '~' end
>>       return s
>>     end)
>>   print(s)
>> end
>>
>
> Exactly the same could be done using Lua patterns:
>
> for _,s in ipairs(list) do
>   s = s:gsub("\\?.", {['"']='\0\0"', ["'"]="\0\0'", [","]="\0,"})
>        :gsub("((%z%z.).-)%2", function(a,b) return a:gsub("%z,", ",")..b
> end)
>        :gsub("%z,", "~")
>        :gsub("%z", "")
>   print(s)
> end

Ugh, I'm so intellectually lazy I had to read your answers. (rolling
of eyes at myself). This is where I was going with it last night but
couldn't figure it out. I agree with Martin that it's not very pretty.

Russ


Re: Replace specific comma's in a string.

Andrew Gierth
>>>>> "Russell" == Russell Haley <[hidden email]> writes:

 Russell> Ugh, I'm so intellectually lazy I had to read your answers.
 Russell> (rolling of eyes at myself). This is where I was going with it
 Russell> last night but couldn't figure it out. I agree with Martin
 Russell> that it's not very pretty.

This is the LPeg version, though it's stricter about following the
actual CSV spec than the solutions above:

  local re = require 're'

  local csv_sub_re = re.compile [[
  file       <- {~ record (%nl record)* (%nl / !.) ~}
  record     <- field ((',' -> '~') field)*
  field      <- escaped / nonescaped
  nonescaped <- [^,"%nl]*
  escaped    <- '"' ([^"] / '""' )* '"'
  ]]

  newdata = csv_sub_re:match(data)
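
A usage sketch for the grammar above, assuming LPeg (which provides the re
module) is installed. The sample data is made up for illustration. Input
that does not follow the grammar, such as spaces around a quoted field,
makes the match fail and return nil instead of a rewritten string:

```lua
local re = require "re"   -- part of LPeg, which must be installed

local csv_sub_re = re.compile [[
  file       <- {~ record (%nl record)* (%nl / !.) ~}
  record     <- field ((',' -> '~') field)*
  field      <- escaped / nonescaped
  nonescaped <- [^,"%nl]*
  escaped    <- '"' ([^"] / '""' )* '"'
]]

-- Two records; the second one has doubled quotes inside a quoted field.
local data = '123,"ABC, DEV 23",345,534.202,NAME\n1,"say ""hi"", ok",2'
print(csv_sub_re:match(data))

-- Not valid under this grammar (spaces outside the quotes): prints nil.
print(csv_sub_re:match('123 , "ABC, DEV 23" , 345'))
```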

--
Andrew.


Re: Replace specific comma's in a string.

Russell Haley
In reply to this post by Paul K-2
On Fri, Feb 9, 2018 at 5:18 PM, Paul K <[hidden email]> wrote:

>> Indeed, the trivial little state machine has ALWAYS been my preferred
>> way to solve this class of problem.
>
> One case it doesn't handle is the mix of ' and " in the string (as the
> string should only be closed with the same quote is was opened with,
> assuming both " and ' are allowed).
>
> This code should handle this case (also added examples):
>
> local list = {
>   '123,"ABC, DEV 23",345,,202,NAME',
>   ',"ABC, DEV 23",345,534,202,NAME',
>   '123,"ABC, DEV 23",345,534.202,',
>   '123 , "ABC, DEV 23" , 345 , 534 , 202 , NAME',
>   [[123,"ABC, \"he,llo\", DEV 23",,,,NAME]],
>   [[123,"ABC, \\",llo\", DEV 23,,,,NAME]],
>   [[123,'ABC, "hel,lo", DEV 23',342,534,202,NAME]],
>   [[123,"ABC, 'hel,lo, DEV 23",342,534,202,NAME]],
> }
> local escaped, instring, pos, start = false, false, -1, ""
> for _,s in ipairs(list) do
>   s = s:gsub([[()(["',\\])]], function(p, s)
>       if p > pos+1 then escaped = false end
>       if s == '\\' then escaped, pos = not escaped, p end
>       if (s == '"' or s == "'") and not escaped
>       and (not instring or s == start) then instring, start = not
> instring, s end
>       if s == ',' and not instring then s = '~' end
>       return s
>     end)
>   print(s)
> end
>
> Paul.

This is great, thanks Paul. I'll have to study the match pattern.
