LPEG documentation needs more clarification


Sean Conner

  I managed to generate a segfault with LPEG and I can reproduce the issue
with this code [1]:

local lpeg = require "lpeg"
local Cg = lpeg.Cg
local Cc = lpeg.Cc
local Cb = lpeg.Cb
local P  = lpeg.P

local cnt = Cg(Cc(0),'count')
          * (P(1) * Cg(Cb'count' / function(c) return c + 1 end,'count'))^0
          * Cb'count'

print(cnt:match(string.rep("x",512)))
print(cnt:match(string.rep("x",512+128))) -- CRASH at some point past this line
print(cnt:match(string.rep("x",512+128+32)))
print(cnt:match(string.rep("x",512+128+32+16)))
print(cnt:match(string.rep("x",512+128+32+16+4)))
print(cnt:match(string.rep("x",512+128+32+16+8)))
print(cnt:match(string.rep("x",512+128+64)))
print(cnt:match(string.rep("x",512+256)))
print(cnt:match(string.rep("x",1024)))

What I did not expect was the 2,900 C callstack entries this produced (thus
causing the crash).  The LPEG documentation is terse on the subject.  My
intent was to count a series of captures while avoiding an external
variable.  Yes, I could use a state table with lpeg.Carg() but I was trying
to avoid that [2].  When reading the LPEG documentation trying to find a
warning about this (especially the large call stack) I didn't find much.
There is this line:

        Therefore, captures should avoid side effects.

but the previous parenthetical gives an example of a "side effect":

        (As an example, consider the pattern lpeg.P"a" / func / 0. Because
        the "division" by 0 instructs LPeg to throw away the results from
        the pattern, LPeg may or may not call func.)

So I'm taking this to mean the normal types of side effects in programming.
Then there's this bit about lpeg.Cb():

        Creates a *back capture*. This pattern matches the empty string and
        produces the values produced by the *most recent* group capture
        named name (where name can be any Lua value).

        *Most recent* means the last *complete outermost group* capture with
        the given name. A *Complete* capture means that the entire pattern
        corresponding to the capture has matched. An *Outermost* capture
        means that the capture is not inside another complete capture.

        In the same way that LPeg does not specify when it evaluates
        captures, it does not specify whether it reuses values previously
        produced by the group or re-evaluates them.

That last bit *could* mean (in the sample code above) that lpeg.Cb()
re-evaluates the lpeg.Cc(0) each instance and thus, my attempt to count
using lpeg.Cg() and lpeg.Cb() could return 0 or 1 just as well as the actual
count [4].

  Yes, I admit to not reading the manual with a magnifying glass and
tweezers, but this terseness of the manual language (of both Lua and LPEG)
is a recurring theme on this list.  Perhaps a few examples could help
clarify things?

  -spc (Will do stupid things in code ... )

[1] This just reproduces the issue---I know there are better ways to get
        the length of a string.  

[2] It requires additional parameters to lpeg.match(), thus complicating
        the call.

[3] Usually with a pattern like:

                pattern = Cg(Cc"unknown",'name')
                        * Cg(R("az","AZ")^1,'name')^-1
                parse   = Ct(pattern)

        to define a default value in case a pattern doesn't exist.

[4] I would be interested in knowing of other LPEG implementations.

Re: LPEG documentation needs more clarification

Pierre-Yves Gérardy
On Wed, Jan 30, 2019 at 4:58 AM Sean Conner <[hidden email]> wrote:

>
>
>   I managed to generate a segfault with LPEG and I can reproduce the issue
> with this code [1]:
>
> local lpeg = require "lpeg"
> local Cg = lpeg.Cg
> local Cc = lpeg.Cc
> local Cb = lpeg.Cb
> local P  = lpeg.P
>
> local cnt = Cg(Cc(0),'count')
>           * (P(1) * Cg(Cb'count' / function(c) return c + 1 end,'count'))^0
>           * Cb'count'
>
> print(cnt:match(string.rep("x",512+128))) -- CRASH at some point past this line

LuLPeg also crashes, but for a larger string (between 512 * 51 and 512
* 52 with Lua 5.3).

Just in case, your problem can be solved with a folding capture and no
temp Lua variable:

    local cnt = Cf(
      Cp() * P(1)^0 * Cp(),
      function(first, last) return last - first end
    )

—Pierre-Yves

Re: LPEG documentation needs more clarification

Pierre-Yves Gérardy
Actually, I read it too fast :-)

Still, the folding capture would also handle counting arbitrary
captures more cleanly:

    cnt = Cf(Cc(0) * C(1)^0 , function(total) return total +1  end)
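
For anyone who wants to try the fold, here is a minimal sketch, assuming LPeg is installed and still provides lpeg.Cf (newer releases mark it as deprecated). The names `count` and `inc` are only illustrative. The fold starts from the Cc(0) value and calls the function once per C(1) capture, so evaluation is a loop rather than a chain of nested back captures:

```lua
local lpeg = require "lpeg"
local C, Cc, Cf = lpeg.C, lpeg.Cc, lpeg.Cf

-- Accumulator function: ignores the captured character, bumps the total.
local function inc(total) return total + 1 end

-- Cc(0) seeds the accumulator; each C(1) capture triggers one call to inc.
local count = Cf(Cc(0) * C(1)^0, inc)

print(count:match("xxx"))                  --> 3
print(count:match(string.rep("x", 1024)))  --> 1024
```

To count only some characters, one option is to capture just those and skip the rest, e.g. `(C(wanted) + 1)^0` in place of `C(1)^0`, where `wanted` is whatever pattern should be counted.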

—Pierre-Yves

Re: LPEG documentation needs more clarification

Sean Conner
In reply to this post by Pierre-Yves Gérardy
It was thus said that the Great [hidden email] once stated:

> Just in case, your problem can be solved with a folding capture and no
> temp Lua variable:
>
>     local cnt = Cf(
>       Cp() * P(1)^0 * Cp(),
>       function(first, last) return last - first end
>     )

  Cool solution, but it won't work for my use case.  Sigh.

  So here's the actual issue I was trying to solve.  I'm dealing with UTF-8
text in a terminal (xterm).  I want to trim a line of text to fit a line.
utf8.len() won't work because it counts code points and not the actual
number of characters that will be drawn.  For example, for the string

        x = "Spin̈al Tap"

string.len(x) returns 12, utf8.len(x) returns 11, but it takes 10 character
positions (that's a "Combining Diaeresis" over the 'n' character).  So there
are certain Unicode codepoints I want to skip counting---I want a "display
length", not a "codepoint length".  The example I gave was a bad example in
this case.
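
  The three lengths can be checked with a small sketch, assuming Lua 5.3 or
later (for the utf8 library and the \u{...} escape).  display_len is a
hypothetical helper that skips only U+0300 to U+036F, which is far from a
complete list of zero-width code points:

```lua
-- Count display columns by skipping combining diacritical marks only.
-- Assumption: every other code point occupies exactly one column.
local function display_len(s)
  local n = 0
  for _, cp in utf8.codes(s) do
    if not (cp >= 0x0300 and cp <= 0x036F) then
      n = n + 1
    end
  end
  return n
end

local x = "Spin\u{0308}al Tap"   -- U+0308 is the Combining Diaeresis

print(string.len(x))   --> 12  (bytes)
print(utf8.len(x))     --> 11  (code points)
print(display_len(x))  --> 10  (columns, under the assumption above)
```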

  Anyway, I do have code that works using lpeg.Carg():

local cutf8 = R" ~"                               -- ASCII minus C0 control set
            + lpeg.P"\194"     * lpeg.R"\160\191" -- UTF minus C1 control set [1]
            + lpeg.R"\195\223" * lpeg.R"\128\191"
            + lpeg.P"\224"     * lpeg.R"\160\191" * lpeg.R"\128\191"
            + lpeg.R"\225\236" * lpeg.R"\128\191" * lpeg.R"\128\191"
            + lpeg.P"\237"     * lpeg.R"\128\159" * lpeg.R"\128\191"
            + lpeg.R"\238\239" * lpeg.R"\128\191" * lpeg.R"\128\191"
            + lpeg.P"\240"     * lpeg.R"\144\191" * lpeg.R"\128\191" * lpeg.R"\128\191"
            + lpeg.R"\241\243" * lpeg.R"\128\191" * lpeg.R"\128\191" * lpeg.R"\128\191"
            + lpeg.P"\244"     * lpeg.R"\128\143" * lpeg.R"\128\191" * lpeg.R"\128\191"

local nc   = P"\204"     * R"\128\191" -- combining chars
           + P"\205"     * R"\128\175" -- combining chars
           + P"\225\170" * R"\176\190" -- combining chars
           + P"\225\183" * R"\128\191" -- combining chars
           + P"\226\131" * R"\144\176" -- combining chars
           + P"\239\184" * R"\160\175" -- combining chars
           + P"\u{00AD}"               -- shy hyphen
           + P"\u{1806}"               -- Mongolian TODO soft hyphen
           + P"\u{200B}"               -- zero width space
           + P"\u{200C}"               -- zero-width nonjoiner space
           + P"\u{200D}"               -- zero-width joiner space
local cnt  = (nc + cutf8 * Carg(1) / function(s) s.cnt = s.cnt + 1 end)^0
           * Carg(1) / function(s) return s.cnt end
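
  A cut-down sketch of how such a pattern is called: the state table goes in
as the third argument to match, which is what Carg(1) picks up.  Only
printable ASCII and the first two combining ranges are kept here to keep it
short; `ascii` and `state` are illustrative names, and the counting still
relies on a side effect inside a capture:

```lua
local lpeg = require "lpeg"
local P, R, Carg = lpeg.P, lpeg.R, lpeg.Carg

local ascii = R" ~"                     -- printable ASCII only (reduced set)
local nc    = P"\204" * R"\128\191"     -- U+0300..U+033F
            + P"\205" * R"\128\175"     -- U+0340..U+036F
local cnt   = (nc + ascii * Carg(1) / function(s) s.cnt = s.cnt + 1 end)^0
            * Carg(1) / function(s) return s.cnt end

local state = { cnt = 0 }               -- use a fresh table for every call
print(cnt:match("Spin\204\136al Tap", 1, state))  --> 10
```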

  It's not 100% perfect [2][3] but for what I'm doing, it works.

  -spc (I should mention that the strings have had all control codes and
        sequences removed, so I do not need concern myself with that ...)

[1] Definition of UTF-8 I'm using comes from RFC-3629

[2] Doesn't handle RtL text; and there are other 0-width characters I'm
        missing.

[3] I could replace the \u{hhh} construction with something that works
        for Lua 5.1.
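
  One possible replacement, as a sketch: a hypothetical helper utf8_char()
that encodes a code point by hand using only arithmetic available in Lua
5.1.  It does no range checking (surrogates and values above U+10FFFF are
not rejected), so it is only meant for building fixed patterns:

```lua
-- Encode a Unicode code point as UTF-8 without the utf8 library
-- or \u{...} escapes.
local function utf8_char(cp)
  local floor, char = math.floor, string.char
  if cp < 0x80 then
    return char(cp)
  elseif cp < 0x800 then
    return char(0xC0 + floor(cp / 0x40),
                0x80 + cp % 0x40)
  elseif cp < 0x10000 then
    return char(0xE0 + floor(cp / 0x1000),
                0x80 + floor(cp / 0x40) % 0x40,
                0x80 + cp % 0x40)
  else
    return char(0xF0 + floor(cp / 0x40000),
                0x80 + floor(cp / 0x1000) % 0x40,
                0x80 + floor(cp / 0x40) % 0x40,
                0x80 + cp % 0x40)
  end
end

print(utf8_char(0x00AD) == "\194\173")      --> true  (soft hyphen)
print(utf8_char(0x200B) == "\226\128\139")  --> true  (zero width space)
```

With that, a line such as P"\u{200B}" could be written as P(utf8_char(0x200B)).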


Re: LPEG documentation needs more clarification

Sean Conner
In reply to this post by Pierre-Yves Gérardy
It was thus said that the Great [hidden email] once stated:
> Actually, I read it too fast :-)
>
> Still, the folding capture would also handle counting arbitrary
> captures more cleanly:
>
>     cnt = Cf(Cc(0) * C(1)^0 , function(total) return total +1  end)

  I think *that* would work.

  -spc


Re: LPEG documentation needs more clarification

Dirk Laurie-2
In reply to this post by Sean Conner
Op Do. 14 Feb. 2019 om 21:37 het Sean Conner <[hidden email]> geskryf:

>            + P"\u{1806}"               -- Mongolian TODO soft hyphen

Ah, we're in the company of uF9F3 and u101Dx now :-)

Re: LPEG documentation needs more clarification

Xavier Wang
In reply to this post by Sean Conner


On Fri, Feb 15, 2019 at 03:37, Sean Conner <[hidden email]> wrote:

> string.len(x) returns 12, utf8.len(x) returns 11, but it takes 10 character
> positions (that's a "Combining Diaeresis" over the 'n' character).  So there
> are certain Unicode codepoints I want to skip counting---I want a "display
> length", not a "codepoint length".  The example I gave was a bad example in
> this case.

Maybe you just need a wcwidth routine in the luautf8 module 😃

--
regards,
Xavier Wang.

Re: LPEG documentation needs more clarification

Sam Putman
It would seem there is a pure Lua wcwidth already: https://github.com/aperezdc/lua-wcwidth

Sam Atman
Principal Agent, Special Circumstances
~wep


Re: LPEG documentation needs more clarification

Adrian Perez de Castro
Hello!

On Fri, 15 Feb 2019 21:27:19 +0800, Sam Atman <[hidden email]> wrote:

> It would seem there is a pure Lua wcwidth already: https://github.com/aperezdc/lua-wcwidth

Author of lua-wcwidth here! Reading this thread I just remembered that the
module needed an update to the latest Unicode version, so I just pushed
version 0.3 earlier today :)

If you end up using it, feel free to ask about it. I mostly lurk
in lua-l nowadays, but I do read most of the threads anyway and I am
happy to help out.

Cheers,

-Adrián

Re: LPEG documentation needs more clarification

Sean Conner
It was thus said that the Great Adrian Perez de Castro once stated:

> Hello!
>
> On Fri, 15 Feb 2019 21:27:19 +0800, Sam Atman <[hidden email]> wrote:
>
> > It would seem there is a pure Lua wcwidth already: https://github.com/aperezdc/lua-wcwidth
>
> Author of lua-wcwidth here! Reading this thread I just remembered that the
> module needed an update to the latest Unicode version, so I just pushed
> version 0.3 earlier today :)
>
> If you end up using it, feel free to ask about it. I mostly lurk
> in lua-l nowadays, but I do read most of the threads anyway and I am
> happy to help out.

  What's the license?  It seems to be lacking one ...

  -spc


Re: LPEG documentation needs more clarification

Roberto Ierusalimschy
In reply to this post by Sean Conner
> local lpeg = require "lpeg"
> local Cg = lpeg.Cg
> local Cc = lpeg.Cc
> local Cb = lpeg.Cb
> local P  = lpeg.P
>
> local cnt = Cg(Cc(0),'count')
>           * (P(1) * Cg(Cb'count' / function(c) return c + 1 end,'count'))^0
>           * Cb'count'
>
> print(cnt:match(string.rep("x",512)))
> print(cnt:match(string.rep("x",512+128))) -- CRASH at some point past this line
> [...]
>
> What I did not expect was the 2,900 C callstack entries this produced (thus
> causing the crash).

You are doing a kind of lazy evaluation. Each new definition of 'count'
depends on the previous one. When LPeg goes on to evaluate the last
definition, it needs to evaluate (using recursion in the C code)
the previous one, and so on. So, you have many recursion levels
multiplied by the number of intermediate functions inside the indirect
recursion.
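
As a rough analogy in plain Lua (not LPeg's actual C code), the situation
resembles a chain of closures where each one calls the previous one. The
names `lazy_count`, `depth` and `max_depth` are made up for this sketch:

```lua
local depth, max_depth = 0, 0

-- Build n "definitions" of a counter, each depending on the previous one.
-- Nothing is computed until the last definition is finally called.
local function lazy_count(n)
  local latest = function() return 0 end
  for _ = 1, n do
    local prev = latest
    latest = function()
      depth = depth + 1
      if depth > max_depth then max_depth = depth end
      local v = prev() + 1      -- must evaluate the previous definition first
      depth = depth - 1
      return v
    end
  end
  return latest
end

local result = lazy_count(1000)()
print(result)     --> 1000
print(max_depth)  --> 1000 nested calls to produce one number
```

Lua copes with this depth because these are Lua-to-Lua calls; inside LPeg the
equivalent nesting happens on the C stack, which is much less forgiving.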

Nevertheless, this is a bug in LPeg. It should limit its recursion level
and give a decent error. (This is always a nightmare in C.)

As a side note, folding was created exactly for that kind of use:

  local cnt = lpeg.Cf(Cc(0) * lpeg.C(1)^0, function(c) return c + 1 end)


> When reading the LPEG documentation trying to find a
> warning about this (especially the large call stack) I didn't find much.
> There is this line:
>
> Therefore, captures should avoid side effects.
>
> but the previous parenthetical gives an example of a "side effect":
>
> (As an example, consider the pattern lpeg.P"a" / func / 0. Because
> the "division" by 0 instructs LPeg to throw away the results from
> the pattern, LPeg may or may not call func.)

That is exactly the point: to show that if there is a side effect, the
final result may be undefined.


> Then there's this bit about lpeg.Cb():
>
> Creates a *back capture*. This pattern matches the empty string and
> produces the values produced by the *most recent* group capture
> named name (where name can be any Lua value).
>
> *Most recent* means the last *complete outermost group* capture with
> the given name. A *Complete* capture means that the entire pattern
> corresponding to the capture has matched. An *Outermost* capture
> means that the capture is not inside another complete capture.
>
> In the same way that LPeg does not specify when it evaluates
> captures, it does not specify whether it reuses values previously
> produced by the group or re-evaluates them.
>
> That last bit *could* mean (in the sample code above) that lpeg.Cb()
> re-evaluates the lpeg.Cc(0) each instance and thus, my attempt to count
> using lpeg.Cg() and lpeg.Cb() could return 0 or 1 just as well as the actual
> count [4].

Note the "previously produced by *the* group". Two groups named 'count'
are not the same group, so there is nothing previously produced by each
group there. Each new group will use the previous one, which is the most
recent from its point of view. Your original pattern has a well-defined
behavior, without any side effect. The problem is that it creates a long
list of group captures, all called 'count' and each one depending on the
previous one.
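
On a short input the chain is easy to follow. This sketch reuses the original
pattern (assuming LPeg is installed) on three characters, so four groups named
'count' exist and the final back capture has to walk through all of them:

```lua
local lpeg = require "lpeg"
local Cg, Cc, Cb, P = lpeg.Cg, lpeg.Cc, lpeg.Cb, lpeg.P

local cnt = Cg(Cc(0), 'count')
          * (P(1) * Cg(Cb'count' / function(c) return c + 1 end, 'count'))^0
          * Cb'count'

-- Groups created: count=0, then one new 'count' group per character.
-- The last Cb'count' evaluates the newest group, which asks for the one
-- before it, and so on back to Cc(0).
print(cnt:match("xxx"))  --> 3
```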

-- Roberto

LPEG bugs (was Re: LPEG documentation needs more clarification)

Sean Conner
Back in November there was this from you:

> > I replied:
> > It was thus said that the Great Andrew Gierth once stated:
> > > >>>>> "Andrew" == Andrew Gierth <[hidden email]> writes:
> > >
> > >  Andrew> This code thinks it can create an anonymous group capture with
> > >  Andrew> no valid .s pointer (to contain the results of a match-time
> > >  Andrew> capture). I _think_ the failure is normally hidden by the fact
> > >  Andrew> that base[0] is _usually_ a reuse of the same Capture that was
> > >  Andrew> used for the match-time capture itself - except that the array
> > >  Andrew> of Capture structures might get reallocated between the two
> > >  Andrew> uses, without preserving the old value of the (now unused)
> > >  Andrew> Capture.
> > >
> > > Further analysis shows that my original idea was wrong; correct
> > > substitution _requires_ that the value of base[0].s actually be
> > > preserved rather than initialized in adddyncaptures. So the fix seems to
> > > be to ensure that it is preserved across expansion of the captures
> > > array:
> > >
> > > --- lpvm.c.orig 2018-11-10 09:37:29 UTC
> > > +++ lpvm.c
> > > @@ -311,7 +311,7 @@ const char *match (lua_State *L, const c
> > >            if (fr + n >= SHRT_MAX)
> > >              luaL_error(L, "too many results in match-time capture");
> > >            if ((captop += n + 2) >= capsize) {
> > > -            capture = doublecap(L, capture, captop, n + 2, ptop);
> > > +            capture = doublecap(L, capture, captop, n + 1, ptop);
> > >              capsize = 2 * captop;
> > >            }
> > >            /* add new captures to 'capture' list */
> > >
> > > By telling doublecap that one less Capture structure is "new", we make
> > > it preserve one more of the existing values, which is the one we need.
> > >
> > > This patch fixes both my testcase and the original report, as far as I
> > > can tell.
> >
> >   Just in case Roberto & Co. did not see this thread---is this an actual bug
> > in LPeg?
>
> It seems so.

and now this:

> > What I did not expect was the 2,900 C callstack entries this produced (thus
> > causing the crash).
>
> You are doing a kind of lazy evaluation. Each new definition of 'count'
> depends on the previous one. When LPeg goes on to evaluate the last
> definition, it needs to evaluate (using recursion in the C code)
> the previous one, and so on. So, you have many recursion levels
> multiplied by the number of intermediate functions inside the indirect
> recursion.
>
> Nevertheless, this is a bug in LPeg. It should limit its recursion level
> and give a decent error. (This is always a nightmare in C.)

  Any chances for these bugs to be fixed and included in a new version of
LPEG?

  -spc

Re: LPEG bugs (was Re: LPEG documentation needs more clarification)

Roberto Ierusalimschy
>   Any chances for these bugs to be fixed and included in a new version of
> LPEG?

Sure.

-- Roberto