Matching sequences of identical characters with lpeg.

Magicks M
Hi, I'm having some trouble with this.
I want to capture sequences of identical characters in a string: "aaabbbcccd" -> "aaa" "bbb" "ccc" "d"
I attempted to use the regular pattern "((.)%1*)" and found that quantifiers didn't work. I then switched to lpeg and attempted to use:
Cg(P(1), "char") * Cb'char'^0
Which errors, and thus I cannot think of a good way to match this. 
Re: Matching sequences of identical characters with lpeg.

Andrew Gierth
>>>>> "Magicks" == Magicks M <[hidden email]> writes:

 Magicks> Hi, I'm having some trouble with this.
 Magicks> I want to capture sequences of identical characters in a
 Magicks> string: "aaabbbcccd" -> "aaa" "bbb" "ccc" "d"

 Magicks> I attempted to use the regular pattern "((.)%1*)" and found
 Magicks> that quantifiers didn't work. I then switched to lpeg and
 Magicks> attempted to use: Cg(P(1), "char") * Cb'char'^0 Which errors,
 Magicks> and thus I cannot think of a good way to match this.

Doing this in lpeg will, I believe, either require Cmt or pre-generating
a pattern for every possible character. (Your attempt above seems to
misunderstand what Cb does - it just fetches and returns the value from
Cg, it does not attempt to match it against the subject string;
backreference matches (as with =foo in the lpeg "re" module, or the "Lua
long strings" example in the lpeg docs) require Cmt.)

It's not immediately clear that doing it with Cmt would be in any way
better than just open-coding the search in lua, since you'd be calling
the capture function at every character position. With pre-generated
patterns it might look like this:

local lpeg = require "lpeg"
local P, C = lpeg.P, lpeg.C
local subpat = P(false)
for i = 0,255 do
  subpat = subpat + P(string.char(i))^2    -- 2 or more occurrences
end
local pat = (C(subpat) + P(1))^0

print(pat:match("abcaabbccaaabbbcccd"))
-- output: aa  bb  cc  aaa  bbb  ccc
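
For comparison, the open-coded search mentioned above can be sketched in plain Lua without LPeg. This is only a sketch under a few assumptions: split_runs is a hypothetical helper name, it compares bytes rather than UTF-8 characters, and unlike the pattern above it also returns runs of length one (which is what the original question asked for).

```lua
-- Split a string into maximal runs of identical bytes, e.g.
-- "aaabbbcccd" -> { "aaa", "bbb", "ccc", "d" }.
-- split_runs is a hypothetical helper name, not part of Lua or LPeg.
local function split_runs(s)
  local runs = {}
  local start = 1                      -- index where the current run begins
  for i = 2, #s + 1 do
    -- A run ends at the end of the string or when the byte changes.
    if i > #s or s:byte(i) ~= s:byte(start) then
      runs[#runs + 1] = s:sub(start, i - 1)
      start = i
    end
  end
  return runs
end

print(table.concat(split_runs("aaabbbcccd"), " "))
-- output: aaa bbb ccc d
```

The loop touches each byte once and calls no function per character beyond s:byte, which is the trade-off against a Cmt-based pattern.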

--
Andrew.

Re: Matching sequences of identical characters with lpeg.

Gabriel Bertilson
In reply to this post by Magicks M
Here's a solution using Cmt. As Andrew pointed out, Cb just inserts a
capture into the list of captures returned by the current pattern, it
doesn't match anything.

Cg(1, "char") * Cmt(C(1) * Cb'char', function (_, _, char1, char2)
return char1 == char2 end)^0

It matches one character, labels it as "char", then matches further
characters if they are equal to "char". To use it on "aaabbbcccd" in
the Lua interpreter (with LPeg functions available as variables):

> patt = Cg(1, "char") * Cmt(C(1) * Cb'char', function (_, _, cur, prev) return cur == prev end)^0
> (C(patt)^1):match "aaabbbcccd"
aaa    bbb    ccc    d

— Gabriel

On Tue, Nov 20, 2018 at 4:17 PM Magicks M <[hidden email]> wrote:
>
> Hi, I'm having some trouble with this.
> I want to capture sequences of identical characters in a string: "aaabbbcccd" -> "aaa" "bbb" "ccc" "d"
> I attempted to use the regular pattern "((.)%1*)" and found that quantifiers didn't work. I then switched to lpeg and attempted to use:
> Cg(P(1), "char") * Cb'char'^0
> Which errors, and thus I cannot think of a good way to match this.

Re: Matching sequences of identical characters with lpeg.

Sean Conner
It was thus said that the Great Gabriel Bertilson once stated:

> Here's a solution using Cmt. As Andrew pointed out, Cb just inserts a
> capture into the list of captures returned by the current pattern, it
> doesn't match anything.
>
> Cg(1, "char") * Cmt(C(1) * Cb'char', function (_, _, char1, char2)
> return char1 == char2 end)^0
>
> It matches one character, labels it as "char", then matches further
> characters if they are equal to "char". To use it on "aaabbbcccd" in
> the Lua interpreter (with LPeg functions available as variables):
>
> > patt = Cg(1, "char") * Cmt(C(1) * Cb'char', function (_, _, cur, prev) return cur == prev end)^0
> > (C(patt)^1):match "aaabbbcccd"
> aaa    bbb    ccc    d

  And it can be further extended to UTF-8:

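local lpeg = require "lpeg"
local R, C, Cg, Cb, Cmt = lpeg.R, lpeg.C, lpeg.Cg, lpeg.Cb, lpeg.Cmt

-- one UTF-8 character: an ASCII byte, or a lead byte plus continuation bytes
local char = R"\1\127"
           + R"\194\244" * R"\128\191"^1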
local seq  = Cg(char,'char')
           * Cmt(
                  C(char) * Cb'char',
                  function(_,_,cur,prev)
                    return cur == prev
                  end
             )^0
local patt = C(seq)^1

print(patt:match "aaabbbcccd")
print(patt:match "###aaaa©©©©####bbbbb####")
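
  The same grouping can also be sketched without LPeg, using a plain Lua pattern built from the same byte ranges as above. This is a rough sketch, not a replacement for the LPeg version: split_utf8_runs and UTF8_CHAR are hypothetical names, the input is assumed to be valid UTF-8 without NUL bytes, and the source file is assumed to be saved as UTF-8.

```lua
-- One UTF-8 character: an ASCII byte (1-127), or a lead byte (194-244)
-- followed by continuation bytes (128-191). Assumes valid UTF-8 input.
local UTF8_CHAR = "[\1-\127\194-\244][\128-\191]*"

-- Split a string into runs of identical UTF-8 characters.
-- split_utf8_runs is a hypothetical helper name.
local function split_utf8_runs(s)
  local runs = {}
  local prev, count = nil, 0           -- current character and its run length
  for ch in s:gmatch(UTF8_CHAR) do
    if ch == prev then
      count = count + 1
    else
      if prev then runs[#runs + 1] = prev:rep(count) end
      prev, count = ch, 1
    end
  end
  if prev then runs[#runs + 1] = prev:rep(count) end   -- flush the last run
  return runs
end

print(table.concat(split_utf8_runs("###aaaa©©©©####bbbbb####"), " "))
-- output: ### aaaa ©©©© #### bbbbb ####
```

  Whole characters are compared as strings, so a multi-byte character such as © is treated as one unit, just as with the Cmt comparison.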

  -spc