Elegant design for creating error messages in LPEG parser

classic Classic list List threaded Threaded
20 messages Options
Reply | Threaded
Open this post in threaded view
|

Elegant design for creating error messages in LPEG parser

joy mondal
What is the idiomatic way to create error messages in LPEG ?

To be specific I would like to create error messages for incomplete bracket completion.

Example :

> [ 1 2 3     -- ERROR ! MISSING ] AT LINE 20

> ( 1 2 3     -- ERROR ! MISSING ) AT LINE 35

> { 1 2 3     -- ERROR ! MISSING } AT LINE 12

>  1 2 3 ]    -- ERROR ! EXTRA   ] AT LINE 42

> [ ( 1 2 3 ] -- ERROR ! MISSING ) AT LINE 71

I have figured out that I need to use lpeg.Carg to pass a table to log line number but how can a pattern fail in a certain way as to let its error be know ?

```lua

local show = function(...)

  {tab,cap} = arg

  print (cap)

end


local inner = C(1 - P(']'))^0

local patt = P('[') * inner * P(']')

local test = Cmt((patt * (Carg(1))), show)

local out = test:match('[hello', 1, { })

return log(out) -- nil ??

```

The problem is the function `show` does not run if the pattern fails.

It also does not not run if I use `Cmt`, but I think I am missing the larger picture.

I also created a markdown gist to make the code easier to read

best wishes,
Joy
Reply | Threaded
Open this post in threaded view
|

Re: Elegant design for creating error messages in LPEG parser

Sean Conner
It was thus said that the Great joy mondal once stated:
> What is the idiomatic way to create error messages in LPEG ?

  I don't think there *is* an idiomatic way to create error messages.

> To be specific I would like to create error messages for incomplete
> bracket completion.
>
> Example :
>
> > [ 1 2 3     -- ERROR ! MISSING ] AT LINE 20
>
> > ( 1 2 3     -- ERROR ! MISSING ) AT LINE 35
>
> > { 1 2 3     -- ERROR ! MISSING } AT LINE 12
>
> >  1 2 3 ]    -- ERROR ! EXTRA   ] AT LINE 42
>
> > [ ( 1 2 3 ] -- ERROR ! MISSING ) AT LINE 71
>
> I have figured out that I need to use lpeg.Carg to pass a table to log
> line number but how can a pattern fail in a certain way as to let its
> error be know ?

  Okay, so we're parsing stuff like:

        [ 1 2 3 4 ]
        ( 1 2 3 4 )
        [ 1 2 ( 3 4 ) 5 6 ]

The idea (I think) is to keep track of not only which bracket we're using,
but the depth as well and report back any errors.  The trick to this is to
include code that explicitely checks for the error conditions and record
them.  Here's a solution:

local lpeg = require "lpeg"
 
local WS = lpeg.P" "
         + lpeg.P"\t"
         + lpeg.Cmt(lpeg.P"\n" * lpeg.Carg(1),function(_,position,state)
             -- ---------------------------------------------------------
             -- Track what line we're on.  At match time, bump the line
             -- number stored in our state data.  We return the position
             -- to carry on but not actually capture anything as we don't
             -- want to capture the newline.
             -- ---------------------------------------------------------

             state.line = state.line + 1
             return position
           end)

-- -----------------------------------------------------------------------
-- For each opening brace (square or paren), track the stating line in a
-- stack.  As we close each brace, the stack is popped, but when we expect
-- an ending brace and don't get one, we know the starting line. Again, this
-- is done at match time.
-- ----------------------------------------------------------------------

local SO = lpeg.Cmt(lpeg.P"[" * lpeg.Carg(1),function(_,position,state)
  table.insert(state.stack,state.line)
  return position,{}
end)

local PO = lpeg.Cmt(lpeg.P"(" * lpeg.Carg(1),function(_,position,state)
  table.insert(state.stack,state.line)
  return position,{}
end)

-- -------------------------------------------------------------------------
-- For each closing brace, pop the line-tracking stack.  If we get the wrong
-- ending brace, record the error in our state table.  Multiple errors won't
-- be reported, but it would be easy enough to stack error messages in a
-- stack if need be.  Again, all done at match time.
-- -------------------------------------------------------------------------

local SC = lpeg.Cmt(lpeg.P"]" * lpeg.Carg(1),function(_,position,state)
             table.remove(state.stack)
             return position
           end)
         + lpeg.Cmt(lpeg.Carg(1),function(_,position,state)
             state.error = string.format("Missing ']' at line %d (start line %d)",state.line,state.stack[#state.stack])
             table.remove(state.stack)
             return position
           end)

local PC = lpeg.Cmt(lpeg.P")" * lpeg.Carg(1),function(_,position,state)
             table.remove(state.stack)
             return position
           end)
         + lpeg.Cmt(lpeg.Carg(1),function(_,position,state)
             state.error = string.format("Missing ')' at line %d (start line %d)",state.line,state.stack[#state.stack])
             table.remove(state.stack)
             return position
           end)
 
local number = lpeg.R"09"^1 / tonumber

local function accumulate(acc,i)
  table.insert(acc,i)
  return acc
end

local item = lpeg.P {
  'item',
  item   = WS^0 * (lpeg.V"square" + lpeg.V"paren" + number) * WS^0,
  square = lpeg.Cf(SO * lpeg.V"item"^0,accumulate) * SC,  
  paren  = lpeg.Cf(PO * lpeg.V"item"^0,accumulate) * PC,          
}

local list = lpeg.Cf(lpeg.Ct"" * item^0,accumulate)

local state = { line = 1 , stack = {}}
local test  = [[
[ 1 2 3 4 ]
( 1 2 3 4 )
[ 1 2 ( 3 4 ) 5 6 ]
1 2 3 4
]]

local x = list:match(test,1,state)                    

state = { line = 1 , stack = {}}
test  = [[    
[1 2 3 4]
[1 2 3
[1 2 3 4]
]]

x = list:match(test,1,state)

After each match, you can check if state.error exists.  So for that last
example, it would be:

        Missing ']' at line 4 (start line 2)

x may not necessarily be nil, so the best bet is to check for state.error.

  Yes, it's kind of annoying to add proper error checking, but it can be
done.

  -spc (Yes, there's some repetition that could be removed, but I wanted to
        make the example as explicit as possible)

Reply | Threaded
Open this post in threaded view
|

Re: Elegant design for creating error messages in LPEG parser

Hugo Musso Gualandi
In reply to this post by joy mondal
> Example :
>
> > [ 1 2 3     -- ERROR ! MISSING ] AT LINE 20
>
> > ( 1 2 3     -- ERROR ! MISSING ) AT LINE 35
>
> > { 1 2 3     -- ERROR ! MISSING } AT LINE 12
>
> >  1 2 3 ]    -- ERROR ! EXTRA   ] AT LINE 42
>
> > [ ( 1 2 3 ] -- ERROR ! MISSING ) AT LINE 71

You may want to check out LPegLabel, a fork of LPeg with additional
error-handling features:

https://github.com/sqmedeiros/lpeglabel/


Reply | Threaded
Open this post in threaded view
|

Re: Elegant design for creating error messages in LPEG parser

joy mondal

The reason I bring up the question of good design is because you (Sean) have mentioned that there are

infinite ways to design parser.

I though about the state variable approach but it feels working against the design principals of LPEG.

Since when you try to create a parser by hand you end up in state machine hell.

Later I understood that LPEG was created to not have to deal with creating a state machine by hand.

unless the author's of LPEG want error to be handled like that !

For example an alternative approach:

```

inner = (.....) -- inner recursive

matched = (.....) -- function for matching

unmatched = (.....) -- function for dealing with error

patt = ((P '[')*inner*(P ']'))/matched + ((P '[')*inner)/unmatched

```

> the trick to this is to include code that explicitly checks for the error conditions and record them.

Both your approach and mine requires thinking ahead of time all the ways the user can create an error.

What about situations of unknown unknown ? where we haven't explicitly modeled for an error ?

What about the minimum viable error method ? that just prints line number ?

would that just be to record line number as you go along using Carg(1) and just print out the line where your parser fails ?

I feel like the minimum viable error would handle unknown unknowns without being completely useless like nil, while keeping

parser code simple. ( I do not want to end up in a situation where the code is 50% error handling ).

Sorry about my noob questions.

best wishes,

Joy

On Wed, Apr 3, 2019 at 2:32 AM Hugo Musso Gualandi <[hidden email]> wrote:
> Example :
>
> > [ 1 2 3     -- ERROR ! MISSING ] AT LINE 20
>
> > ( 1 2 3     -- ERROR ! MISSING ) AT LINE 35
>
> > { 1 2 3     -- ERROR ! MISSING } AT LINE 12
>
> >  1 2 3 ]    -- ERROR ! EXTRA   ] AT LINE 42
>
> > [ ( 1 2 3 ] -- ERROR ! MISSING ) AT LINE 71

You may want to check out LPegLabel, a fork of LPeg with additional
error-handling features:

https://github.com/sqmedeiros/lpeglabel/


Reply | Threaded
Open this post in threaded view
|

Re: Elegant design for creating error messages in LPEG parser

Sean Conner
It was thus said that the Great joy mondal once stated:
> The reason I bring up the question of good design is because you (Sean)
> have mentioned that there are
>
> infinite ways to design parser.

  Pretty much.

> I though about the state variable approach but it feels working against the
> design principals of LPEG.

  It's supported, so I don't think it's necessarily against the principles
of LPEG.  It may look weird because it's not used often, but I'm glad it
does exist.

> Both your approach and mine requires thinking ahead of time all the ways
> the user can create an error.
>
> What about situations of unknown unknown ? where we haven't explicitly
> modeled for an error ?
>
> What about the minimum viable error method ? that just prints line number ?

  Unfortunately, LPEG can be used to parse binary data so a "line number"
might not make any sense.

  I'm thinking that the minimum to return on a failed match is both nil
*and* the position of the failure.  So something like:

        pattern = lpeg.P"a"^1 * lpeg.P"b"^1 * lpeg.P(-1)
        print(pattern:match "aabb")
        print(pattern:match "aacbb")

would output:

        4
        nil 3

  Any failed pattern would return nil,position, but as long as there are
alternatives, it won't matter.  But on a failure, at least you get the
offset into the string being parsed where the error was.

  That's about the minimum I could see for LPEG.  How about it Roberto?

> would that just be to record line number as you go along using Carg(1) and
> just print out the line where your parser fails ?
>
> I feel like the minimum viable error would handle unknown unknowns without
> being completely useless like nil, while keeping
>
> parser code simple. ( I do not want to end up in a situation where the code
> is 50% error handling ).

  For the majority of my LPEG programs, I've been able to get away with
parsed vs. failed, as there wasn't much I could do about a failed parse
(especially when parsing SIP messages---log the failure, drop it and move on
to the next message).  But having a position of failure would be nice.

> Sorry about my noob questions.

  Not a noob question at all.  It's been something I've thought about for
some time, which is why I was able to answer your question with code.

  -spc

Reply | Threaded
Open this post in threaded view
|

Re: Elegant design for creating error messages in LPEG parser

joy mondal
Just a side question.

Under what design consideration do we use 'Cmt(patt,f) ' vs 'patt/f' ?

I noticed you used Cmt , why not 'patt/f' ?

On Wed, Apr 3, 2019 at 10:56 PM Sean Conner <[hidden email]> wrote:
It was thus said that the Great joy mondal once stated:
> The reason I bring up the question of good design is because you (Sean)
> have mentioned that there are
>
> infinite ways to design parser.

  Pretty much.

> I though about the state variable approach but it feels working against the
> design principals of LPEG.

  It's supported, so I don't think it's necessarily against the principles
of LPEG.  It may look weird because it's not used often, but I'm glad it
does exist.

> Both your approach and mine requires thinking ahead of time all the ways
> the user can create an error.
>
> What about situations of unknown unknown ? where we haven't explicitly
> modeled for an error ?
>
> What about the minimum viable error method ? that just prints line number ?

  Unfortunately, LPEG can be used to parse binary data so a "line number"
might not make any sense.

  I'm thinking that the minimum to return on a failed match is both nil
*and* the position of the failure.  So something like:

        pattern = lpeg.P"a"^1 * lpeg.P"b"^1 * lpeg.P(-1)
        print(pattern:match "aabb")
        print(pattern:match "aacbb")

would output:

        4
        nil 3

  Any failed pattern would return nil,position, but as long as there are
alternatives, it won't matter.  But on a failure, at least you get the
offset into the string being parsed where the error was.

  That's about the minimum I could see for LPEG.  How about it Roberto?

> would that just be to record line number as you go along using Carg(1) and
> just print out the line where your parser fails ?
>
> I feel like the minimum viable error would handle unknown unknowns without
> being completely useless like nil, while keeping
>
> parser code simple. ( I do not want to end up in a situation where the code
> is 50% error handling ).

  For the majority of my LPEG programs, I've been able to get away with
parsed vs. failed, as there wasn't much I could do about a failed parse
(especially when parsing SIP messages---log the failure, drop it and move on
to the next message).  But having a position of failure would be nice.

> Sorry about my noob questions.

  Not a noob question at all.  It's been something I've thought about for
some time, which is why I was able to answer your question with code.

  -spc

Reply | Threaded
Open this post in threaded view
|

lpeg.cut? (Re: Elegant design for creating error messages in LPEG parser)

nobody
In reply to this post by Sean Conner
On 03/04/2019 20.56, Sean Conner wrote:

>    Any failed pattern would return nil,position, but as long as there are
> alternatives, it won't matter.  But on a failure, at least you get the
> offset into the string being parsed where the error was.
>
>    That's about the minimum I could see for LPEG.  How about it Roberto?
>
>> would that just be to record line number as you go along using Carg(1) and
>> just print out the line where your parser fails ?
>>
>> I feel like the minimum viable error would handle unknown unknowns without
>> being completely useless like nil, while keeping
>>
>> parser code simple. ( I do not want to end up in a situation where the code
>> is 50% error handling ).
>
>    For the majority of my LPEG programs, I've been able to get away with
> parsed vs. failed, as there wasn't much I could do about a failed parse
> (especially when parsing SIP messages---log the failure, drop it and move on
> to the next message).  But having a position of failure would be nice.

## background

So far, almost every single time I used LPEG, I spent upwards of an hour
(sometimes 6+ hours) on debugging.  If you have a grammar and "only"
have to translate it to LPEG, identifying a problem is usually
manageable.  But if you're trying to incrementally reconstruct a grammar
from a bunch of known samples, this is really really painful.  With very
large or otherwise hard to inspect files (binary etc.), if the 1234th
repetition of some structure has an extra field, the only way I know to
identify the problem is to do lots of match time print()ing…

As far as I can tell, part of the problem is that all branches are tried
recursively – i.e. match failures at any point are expected and don't
mean there's actually a problem, and so there's no hard information
available that could be printed after all branches failed.

My observation is that very often, there are points in the grammar /
pattern, where trying alternatives is known to be useless. (If there was
a match for '<entity ' and now 'id=' is expected but 'ref=' is found, it
doesn't make sense to backtrack and try '<message ' etc. – they
certainly won't match – but LPEG doesn't know that.)


## proposal / question

Would it make sense to add a way to tell LPEG "do not backtrack past
this point" – e.g. by 'lpeg.cut( )'? (I'm taking the name from Prolog –
maybe there's a better name?)  With cuts, there *would* be hard known
information that could be printed:  The position in the input when LPEG
attempted to backtrack over the cut.

Going a step further, `lpeg.cut( [name] )` could (maybe) be used to
produce something like a stack trace?  (I haven't looked at the LPEG
internals, don't know how hard/easy this would be.)

With cut, I could have

   entity = lpeg.P "<entity" * WS^1 * lpeg.cut( "tag:entity" ) * "id=" …

and then LPEG has enough information to tell me that in 'tag:entity' at
position 12345 (in 'tag:state' at position 123 in 'tag:savegame' at
position 10) no alternative matched, and by using the position I can
grab the next couple of lexemes (or bytes) from the file, and then I
know that there was 'ref=' instead of 'id=' and debugging would be *so*
much easier.


At least that's the dream…  Would that actually work?  And is this
sufficiently compatible with LPEG's internals?  Or is that maybe
possible already?

-- nobody

Reply | Threaded
Open this post in threaded view
|

Re: lpeg.cut? (Re: Elegant design for creating error messages in LPEG parser)

joy mondal
I have zero knowlegde about LPEG internals.

BUT dont you think there is some beauty is having 0,1 signal ( matched , not matched ( nil ) ) ?

Even if personally I would  really love it if LPEG had more functions to do more complicated stuff. Its simplicity is what drew me towards it.

If it was more complicated than that, as a new user I might have not used LPEG.

Now that I am more comfortable with LPEG constructs it makes sense to find out more complicated error handling.

IMO it should happen at a library level - as in module written on top of LPEG. If nobody has done it properly then that is just a lack of sufficient demand from the rest of the community.

My biggest issue is lack of good example docs - which I hope to someday fix, once I deal with my own problems.

Just my opinion which shouldn't be taken seriously.

best wishes,

Joy

Reply | Threaded
Open this post in threaded view
|

Re: Elegant design for creating error messages in LPEG parser

Sean Conner
In reply to this post by joy mondal
It was thus said that the Great joy mondal once stated:
> Just a side question.
>
> Under what design consideration do we use 'Cmt(patt,f) ' vs 'patt/f' ?
>
> I noticed you used Cmt , why not 'patt/f' ?

  The docs say this of lpeg.Cmt():

        Creates a match-time capture. Unlike all other captures, this one is
        evaluated immediately when a match occurs (***even if it is part of
        a larger pattern that fails later***). It forces the immediate
        evaluation of all its nested captures and then calls function.

        (emphasis added)

  If I didn't, the error checking wouldn't work as expected.  Changing the
WS production to:

        local WS = lpeg.P" "
                 + lpeg.P"\t"
                 + lpeg.P"\n" * lpeg.Carg(1)
                 / function(state)
                     print(">>>",state.line)
                     state.line = state.line + 1
                   end

causes my code to produce the following output as an error:

        Missing ']' at line 1 (start line 1)

even though I see the diagnostic print() statements printed three times:

        >>>     1
        >>>     2
        >>>     3

  I think it has something to do with lazy evaluation in LPEG in this case.

  I've also used it to fail a pattern that would otherwise match.  It is
possible to come up with a pattern that matches the numbers between 0 and
255 but it's quite involved and part of it looks like:

        dec_octet = P"25" * R"05"
                  + P"2" * P"04" * P"09"
                  + P"1" * ...

I found it easier to use Cmt() instead:

        local dec_octet = Cmt(DIGIT^1,function(_,position,capture)
          local n = tonumber(capture)
          if n < 256 then
            return position
          end
        end)

When the string of digits is greater than 255, this returns nil, which
causes the match to fail.  Doing this:

        dec_octet = DIGIT^1
                  / function(c)
                      local n = tonumber(c)
                      if n < 256 then
                        return c
                      end
                    end

won't cause the match to fail---instead it will return nil as the captured
data.

  -spc


Reply | Threaded
Open this post in threaded view
|

Re: Elegant design for creating error messages in LPEG parser

joy mondal
Hi Spc,

So essentially what you are saying is the '/' function syntax is just syntax sugar ? without having much value to creating a parser ?

I got introduced to '/' first which caused a lot of frustration, I only understood how to use Cmt later on.

Cmt is so much better since it behaves the way you want it to. What is the return to '/' ?

Is there a function equivalent to '/' ?

When would you use / INSTEAD of Cmt ?

I am really enjoying your stack technique using Cmt , thanks for showing me !

I was stuck trying to use Cb ( back referencing ) and Cg - which are confusing.

Then I read that Cb is experimental.

Cmt seems to be the main way to run function in your LPEG code.

Cheers !! and thanks



On Thu, Apr 4, 2019 at 4:51 AM Sean Conner <[hidden email]> wrote:
It was thus said that the Great joy mondal once stated:
> Just a side question.
>
> Under what design consideration do we use 'Cmt(patt,f) ' vs 'patt/f' ?
>
> I noticed you used Cmt , why not 'patt/f' ?

  The docs say this of lpeg.Cmt():

        Creates a match-time capture. Unlike all other captures, this one is
        evaluated immediately when a match occurs (***even if it is part of
        a larger pattern that fails later***). It forces the immediate
        evaluation of all its nested captures and then calls function.

        (emphasis added)

  If I didn't, the error checking wouldn't work as expected.  Changing the
WS production to:

        local WS = lpeg.P" "
                 + lpeg.P"\t"
                 + lpeg.P"\n" * lpeg.Carg(1)
                 / function(state)
                     print(">>>",state.line)
                     state.line = state.line + 1
                   end

causes my code to produce the following output as an error:

        Missing ']' at line 1 (start line 1)

even though I see the diagnostic print() statements printed three times:

        >>>     1
        >>>     2
        >>>     3

  I think it has something to do with lazy evaluation in LPEG in this case.

  I've also used it to fail a pattern that would otherwise match.  It is
possible to come up with a pattern that matches the numbers between 0 and
255 but it's quite involved and part of it looks like:

        dec_octet = P"25" * R"05"
                  + P"2" * P"04" * P"09"
                  + P"1" * ...

I found it easier to use Cmt() instead:

        local dec_octet = Cmt(DIGIT^1,function(_,position,capture)
          local n = tonumber(capture)
          if n < 256 then
            return position
          end
        end)

When the string of digits is greater than 255, this returns nil, which
causes the match to fail.  Doing this:

        dec_octet = DIGIT^1
                  / function(c)
                      local n = tonumber(c)
                      if n < 256 then
                        return c
                      end
                    end

won't cause the match to fail---instead it will return nil as the captured
data.

  -spc


Reply | Threaded
Open this post in threaded view
|

Re: Elegant design for creating error messages in LPEG parser

Sean Conner
It was thus said that the Great joy mondal once stated:
> Hi Spc,

  Hi.

> So essentially what you are saying is the '/' function syntax is just
> syntax sugar ? without having much value to creating a parser ?

  Not necessarily.

  First off, the documentation for LPEG [1] does document all of LPEG but
like the Lua documentation, it can be terse.  

  Second, '/' is documented in the Capture subsection, so the result of '/'
is to produce a capture.  The expression:

        num = lpeg.R"09"^1 / tonumber

will match digits, then those digits are passed to the function tonumber(),
which converts a string to a number.  It's this number that is returned.  An
example:

        num  = lpeg.R"09"^1
        SP   = lpeg.P" "
        patt = lpeg.Ct((num * SP^-1)^0)

        dump('result',patt:match"1 2 3 4") -- just dumps a table
        result =
        {
        }

num doesn't return any captures, so nothing is captured into the table
returned by lpeg.Ct().  Now, let's capture the output of num (I'm only
changing the rule for num---the rest stays the same, except for the output
which I'm showing):

        num = lpeg.C(lpeg.R"09"^1)

        result =
        {
          [1] = "1",
          [2] = "2",
          [3] = "3",
          [4] = "4",
        }

This captures the digits as strings.  If we wanted to convert these to
numbers, that's when '/' comes in:

        num = lpeg.R"09"^1 / tonumber

        result =
        {
          [1] = 1.000000,
          [2] = 2.000000,
          [3] = 3.000000,
          [4] = 4.000000,
        }

We now get actual numbers.  You *can* do the same thing with lpeg.Cmt():

        num = lpeg.Cmt(lpeg.R"09"^1,function(_,position,capture)
          return position,tonumber(capture)
        end)

        result =
        {
          [1] = 1.000000,
          [2] = 2.000000,
          [3] = 3.000000,
          [4] = 4.000000,
        }

but you aren't really buying anything in this example, other than being a
bit more verbose (or explicit).

  Here's another example of using '/':

        char = lpeg.P"\n" / "\\n"
             + lpeg.P"\t" / "\\t"
             + lpeg.P(1)
        safe = lpeg.Cs(char^0)

  Here I'm doing a substitution capture on the input string.  For each
character in the string, if it's a newline character, replace it with the
escaped version '\n'; the same for the tab character.  Here, the newline
character is replaced with a string using the '/' operator.  Again, you
could do this with lpeg.Cmt() but it would lose some clarity:

        char = lpeg.Cmt(lpeg.P"\n",function(_,position) return position,"\\n" end)
             + lpeg.Cmt(lpeg.P"\t",function(_,position) return position,"\\t" end)
             + lpeg.P(1)
        safe = lpeg.Cs(char^0)

  So I suppose you could say that '/' is syntatic surgar for lpeg.Cmt(), in
that everything you can do with '/' you can do with lpeg.Cmt().  But I find
using '/' clearer than using lpeg.Cmt().  It's not to say I don't use
lpeg.Cmt(), but only when I need to do some other processing at match time.

> I was stuck trying to use Cb ( back referencing ) and Cg - which are
> confusing.
>
> Then I read that Cb is experimental.

  It was at one point, but that doesn't seem to be the case anymore.  I
generally use Cg() in conjunction with Ct(); I think I've used Cb() once
when parsing text that had variable delimeters.

  -spc

[1] http://www.inf.puc-rio.br/~roberto/lpeg/

Reply | Threaded
Open this post in threaded view
|

Re: lpeg.cut? (Re: Elegant design for creating error messages in LPEG parser)

Sean Conner
In reply to this post by joy mondal
It was thus said that the Great joy mondal once stated:
>
> My biggest issue is lack of good example docs - which I hope to someday
> fix, once I deal with my own problems.

  There are examples out there for varying complexity.  You have Dirk
Laurie's notes about LPEG:

        https://github.com/dlaurie/lua-notes/blob/master/lpeg-brief.md

Then daurnimator has a collection of LPEG patterns:

        https://github.com/daurnimator/lpeg_patterns

as well as myself:

        https://github.com/spc476/LPeg-Parsers

There are probably others as well.  If you search this mailing list archive,
you can find some pretty hairy LPEG examples.

  -spc

Reply | Threaded
Open this post in threaded view
|

Re: lpeg.cut? (Re: Elegant design for creating error messages in LPEG parser)

Philipp Janda
In reply to this post by nobody
Am 04.04.19 um 00:28 schröbte nobody:

>
> Would it make sense to add a way to tell LPEG "do not backtrack past
> this point" – e.g. by 'lpeg.cut( )'? (I'm taking the name from Prolog –
> maybe there's a better name?)  With cuts, there *would* be hard known
> information that could be printed:  The position in the input when LPEG
> attempted to backtrack over the cut.
>
> Going a step further, `lpeg.cut( [name] )` could (maybe) be used to
> produce something like a stack trace?  (I haven't looked at the LPEG
> internals, don't know how hard/easy this would be.)
>
> With cut, I could have
>
>    entity = lpeg.P "<entity" * WS^1 * lpeg.cut( "tag:entity" ) * "id=" …
>
> and then LPEG has enough information to tell me that in 'tag:entity' at
> position 12345 (in 'tag:state' at position 123 in 'tag:savegame' at
> position 10) no alternative matched, and by using the position I can
> grab the next couple of lexemes (or bytes) from the file, and then I
> know that there was 'ref=' instead of 'id=' and debugging would be *so*
> much easier.
>
>
> At least that's the dream…  Would that actually work?  And is this
> sufficiently compatible with LPEG's internals?  Or is that maybe
> possible already?


I don't know whether I understand your proposal correctly, but I think I
have used something similar before to raise parse errors: A custom LPeg
operator that, when matched, raises an error with line number
information and source code snippet.

     `E( [error-message] )`

You can find the implementation on github[1]. (I hope it still works, I
haven't used it in a while ...).

>
> -- nobody
>


Philipp

   [1]:  http://siffiejoe.github.io/lua-luaepnf/



Reply | Threaded
Open this post in threaded view
|

Re: Elegant design for creating error messages in LPEG parser

Soni "They/Them" L.
In reply to this post by Sean Conner


On 2019-04-03 9:50 p.m., Sean Conner wrote:

[snip]

>    I've also used it to fail a pattern that would otherwise match.  It is
> possible to come up with a pattern that matches the numbers between 0 and
> 255 but it's quite involved and part of it looks like:
>
> dec_octet = P"25" * R"05"
>          + P"2" * P"04" * P"09"
>          + P"1" * ...
>
> I found it easier to use Cmt() instead:
>
> local dec_octet = Cmt(DIGIT^1,function(_,position,capture)
>  local n = tonumber(capture)
>  if n < 256 then
>    return position
>  end
> end)
>
> When the string of digits is greater than 255, this returns nil, which
> causes the match to fail.  Doing this:
>
> dec_octet = DIGIT^1
>                    / function(c)
>                        local n = tonumber(c)
>                        if n < 256 then
>                          return c
>                        end
>                      end
>
> won't cause the match to fail---instead it will return nil as the captured
> data.

Unrelated, but, those aren't strictly equivalent:

tonumber("000001") --> 1
1 < 256, but "000001" doesn't follow the
(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9]) pattern.

This may or may not be a problem depending on your use-case. (e.g. \000
escapes in Lua take 3 digits and fail on >255, but some hypothetical
language could take up to 3 digits less than 255 such that e.g. \999
would be \99 followed by an literal 9)

>
>    -spc
>
>


Reply | Threaded
Open this post in threaded view
|

Re: Elegant design for creating error messages in LPEG parser

joy mondal
Hi Sean,

It seems everything lpeg.Cf can do can be done with lpeg.Cmt.

Under what circumstances would using lpeg.Cmt INSTEAD of lpeg.Cf be considered a severe design failure ?

I tried using lpeg.Cf recursively and its quite convoluted.

For parsing thing like this:

[[[]]]

Its quite a bit easier with Cmt since I can create an empty table ( state ) at the start of the loop. with Cf you are not sure if you at the start,middle or end of the loop.

I had a look at the moon-script code base ( written using LPEG ) and there seems no usage of lpeg.Cf.

What I am trying to find is the minimum number of functions that is needed for LPEG.

Up until now I haven't found a function that cannot be used uniquely in a given situation, so I am quite curious to be proven wrong.

best wishes,

Joy

On Fri, Apr 5, 2019 at 1:16 AM Soni "They/Them" L. <[hidden email]> wrote:


On 2019-04-03 9:50 p.m., Sean Conner wrote:

[snip]
>    I've also used it to fail a pattern that would otherwise match.  It is
> possible to come up with a pattern that matches the numbers between 0 and
> 255 but it's quite involved and part of it looks like:
>
>       dec_octet = P"25" * R"05"
>                 + P"2" * P"04" * P"09"
>                 + P"1" * ...
>
> I found it easier to use Cmt() instead:
>
>       local dec_octet = Cmt(DIGIT^1,function(_,position,capture)
>         local n = tonumber(capture)
>         if n < 256 then
>           return position
>         end
>       end)
>
> When the string of digits is greater than 255, this returns nil, which
> causes the match to fail.  Doing this:
>
>       dec_octet = DIGIT^1
>                    / function(c)
>                        local n = tonumber(c)
>                        if n < 256 then
>                          return c
>                        end
>                      end
>
> won't cause the match to fail---instead it will return nil as the captured
> data.

Unrelated, but, those aren't strictly equivalent:

tonumber("000001") --> 1
1 < 256, but "000001" doesn't follow the
(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9]) pattern.

This may or may not be a problem depending on your use-case. (e.g. \000
escapes in Lua take 3 digits and fail on >255, but some hypothetical
language could take up to 3 digits less than 255 such that e.g. \999
would be \99 followed by an literal 9)

>
>    -spc
>
>


Reply | Threaded
Open this post in threaded view
|

Re: Elegant design for creating error messages in LPEG parser

Dirk Laurie-2
Cf is very specialized. Here is a toy example of how I understood it when I first learnt LPEG (string-to-number coercion was still OK then):
 
Cf(p,fct), p:Cf(fct) (Fold by Function)
fct must be a function of two variables. Think of it as a left-associative binary operator applied between the captures of p. The result is the capture of Cf(p,fct). (C(1)^3):match"361" returns '3','6','1' but Cf(C(1)^3,math.max):match"361" returns 6.


Reply | Threaded
Open this post in threaded view
|

Re: Elegant design for creating error messages in LPEG parser

Sean Conner
In reply to this post by joy mondal
It was thus said that the Great joy mondal once stated:
> Hi Sean,
>
> It seems everything lpeg.Cf can do can be done with lpeg.Cmt.

  I think everything with LPEG can be done with lpeg.Cmt() but I haven't
proved that to myself yet (it's a conjecture).  But I don't use lpeg.Cmt()
all that much actually.  

> Under what circumstances would using lpeg.Cmt INSTEAD of lpeg.Cf be
> considered a severe design failure ?

  I don't know.  Here's one recent instance I've used lpeg.Cf():

local cutf8 = ... -- LPEG code to parse a single UTF-8 character
local nc    = ... -- LPEG code of certain UTF-8 characters to not count

local cnt = lpeg.Cf(
               lpeg.Cc(0) * (nc + cutf8 * lpeg.Cc(1))^0,
               function(c) return c + 1 end
             )

So I start by capturing a 0 (which is the accumulated value), then cycle
through the string.  A "character" matching cutf8 then returns a 1, which is
added to the running accumulator; otherwise, the "character" is ignored and
we skip to the next character.

  I've also used lpeg.Cf() in code that parses the format string given to
os.date(), such as:

        "%A, %d %B %Y @ %H:%M:%S"

and return another LPEG expression that will parse a string of that format,
like:

        "Monday, 02 July 2018 @ 16:02:48"

into a Lua table.  Code:

        https://github.com/spc476/LPeg-Parsers/blob/master/strftime.lua

  I use lpeg.Cf() when accumulating a result into something *other* than a
table.  To answer your question, when would using lpeg.Cmt() over lpeg.Cf()
be a design failure?  When you aren't taking advantage of the match time
capability of lpeg.Cmt().  I gave an example earlier of:

        dec_octet = lpeg.Cmt(lpeg.R"09"^1,function(_,pos,capture)
          local v = tonumber(capture)
          if v < 256 then
            return pos,v
          end
        end)

But some other examples can be found in the strftime.lua module referenced
above, such as:

        function chkrange(min,max)
          return function(_,position,capture)
            local val = tonumber(capture)
            if val < min or val > max then
              return false
            else
              return position,val
            end
          end
        end

        dday = Cmt(digit * digit, chkrange(1, 31))

  I could do this stuff with patterns only, but the expressions can become
unweildy and hard to follow.  This makes it pretty easy to see what's being
checked.  And the reason I'm doing it this way with lpeg.Cmt() and not like:

        dday = (digit * digit)
             / function(capture)
                 local val = tonumber(capture)
                 if val >= 1 and val <= 31 then
                   return val
                 end
               end

Is that the pattern will *NOT* fail to match a day of "45" but instead
return nil, which is *NOT* what I want.  I want the pattern to fail.  I'm
using lpeg.Cmt() because I'd rather not write a complicated LPEG expression
to parse numbers from 1 to 31 (or 01 to 31).  

> I tried using lpeg.Cf recursively and its quite convoluted.
>
> For parsing thing like this:
>
> [[[]]]

  What, exactly are you trying to parse there?  Something like:

        [ 1 2 3 [ 4 5 6 [ 7 8 9 ] 10 11 ] 12 13 ] --?

> Its quite a bit easier with Cmt since I can create an empty table ( state )
> at the start of the loop. with Cf you are not sure if you at the
> start,middle or end of the loop.

  I rarely use lpeg.Cf() to fold captures into a table.  I'd rather use
lpeg.Ct() with a capture pattern inside:

        list = lpeg.P"[" * lpeg.Ct((SPACE * C(number))^0) * lpeg.P"]"

If it's recursive, then an LPEG grammar would be appropriate:

        list = lpeg.P {
                'list',
                list = lpeg.P'[' * lpeg.Ct(lpeg.V"data") * lpeg.P']',
                data = SPACE * C(number)
                     + lpeg.V"list"
        }

This would be enough to parse the example I'm asking about (given an
appropriate definition for SPACE and number).

> I had a look at the moon-script code base ( written using LPEG ) and there
> seems no usage of lpeg.Cf.
>
> What I am trying to find is the minimum number of functions that is needed
> for LPEG.
>
> Up until now I haven't found a function that cannot be used uniquely in a
> given situation, so I am quite curious to be proven wrong.

  I don't understand what you mean here.

  -spc

Reply | Threaded
Open this post in threaded view
|

Re: Elegant design for creating error messages in LPEG parser

Sean Conner
It was thus said that the Great Sean Conner once stated:

> It was thus said that the Great joy mondal once stated:
> > I tried using lpeg.Cf recursively and its quite convoluted.
> >
> > For parsing thing like this:
> >
> > [[[]]]
>
>   What, exactly are you trying to parse there?  Something like:
>
> [ 1 2 3 [ 4 5 6 [ 7 8 9 ] 10 11 ] 12 13 ] --?
>
> > Its quite a bit easier with Cmt since I can create an empty table ( state )
> > at the start of the loop. with Cf you are not sure if you at the
> > start,middle or end of the loop.
>
>   I rarely use lpeg.Cf() to fold captures into a table.  I'd rather use
> lpeg.Ct() with a capture pattern inside:
>
> list = lpeg.P"[" * lpeg.Ct((SPACE * C(number))^0) * lpeg.P"]"
>
> If it's recursive, then an LPEG grammar would be appropriate:
>
> list = lpeg.P {
> 'list',
> list = lpeg.P'[' * lpeg.Ct(lpeg.V"data") * lpeg.P']',
> data = SPACE * C(number)
>                      + lpeg.V"list"
> }
>
> This would be enough to parse the example I'm asking about (given an
> appropriate definition for SPACE and number).

  I just reread your message, and perhaps I should present this definition
of list:

        list = lpeg.P {
                'list',
                list = lpeg.P'['
                     * lpeg.Cf(lpeg.Ct"" * lpeg.V"data"^0,function(a,b) table.insert(a,b) return a end)
                     * lpeg.P']',
                data = SPACE * C(number)
                     + lpeg.V"list"
        }

This uses lpeg.Cf() recursively (it also uses lpeg.Ct(), but only to
generate the table (generated by matching the empty string) used to
accumulate the results.  As you can see, it's a bit longer than just using
lpeg.Ct() around a pattern.

  -spc

Reply | Threaded
Open this post in threaded view
|

Re: lpeg.cut? (Re: Elegant design for creating error messages in LPEG parser)

Sérgio Medeiros-2
In reply to this post by nobody
On Wed, Apr 3, 2019 at 7:28 PM nobody <[hidden email]> wrote:
>
> ## proposal / question
>
> Would it make sense to add a way to tell LPEG "do not backtrack past
> this point" – e.g. by 'lpeg.cut( )'? (I'm taking the name from Prolog –
> maybe there's a better name?)  With cuts, there *would* be hard known
> information that could be printed:  The position in the input when LPEG
> attempted to backtrack over the cut.

I think lpeglabel.T does what you want:

--
Sérgio
Reply | Threaded
Open this post in threaded view
|

Re: Elegant design for creating error messages in LPEG parser

joy mondal
In reply to this post by Sean Conner
Hi Sean !

Thanks for your patience, really appreciate it.

1     list = lpeg.P {
2                'list',
3                list = lpeg.P'['
4                     * lpeg.Cf(lpeg.Ct"" * lpeg.V"data"^0,function(a,b) table.insert(a,b) return a end)
5                     * lpeg.P']',
6                data = SPACE * C(number)
7                     + lpeg.V"list"
8        }

In line 4 are you passing lpeg.Ct"" to lpeg.Cf , that is key I think.

I was under the impression this was an anti-pattern, that LPEG operations should be stateless ( you should not maintain local state )

Yes, Cf should be used when you are constructing something that is NOT a table.

I was so confused because this is what I had in mind:


Cn(.....Cf(Cf(Cf(Cf(Cf(C1,C2),C3),C4),C5),C6),....Cn)


If you change the notation to:

T1 -> C1 C2
T2 -> T1 C3
T3 -> T2 C4
T4 -> T3 C5
...........
...........
...........

You can see the algorithm moves linearly like a snake through every capture.

Why is this important ? Errors Handling !

You can terminate the entire algorithm at ANY wrong capture regardless of how deeply its nested.

However it gets problematic when you also nest Cf, since you cannot create LOCAL STATE in your nest ( without passing Ct '')

I was trying really hard to solve the problem using Carg(1) which is the global state.

I was a fool in making such useless assumptions.

best wishes,

Joy

On Mon, Apr 8, 2019 at 12:55 PM Sean Conner <[hidden email]> wrote:
It was thus said that the Great Sean Conner once stated:
> It was thus said that the Great joy mondal once stated:
> > I tried using lpeg.Cf recursively and its quite convoluted.
> >
> > For parsing thing like this:
> >
> > [[[]]]
>
>   What, exactly are you trying to parse there?  Something like:
>
>       [ 1 2 3 [ 4 5 6 [ 7 8 9 ] 10 11 ] 12 13 ] --?
>
> > Its quite a bit easier with Cmt since I can create an empty table ( state )
> > at the start of the loop. with Cf you are not sure if you at the
> > start,middle or end of the loop.
>
>   I rarely use lpeg.Cf() to fold captures into a table.  I'd rather use
> lpeg.Ct() with a capture pattern inside:
>
>       list = lpeg.P"[" * lpeg.Ct((SPACE * C(number))^0) * lpeg.P"]"
>
> If it's recursive, then an LPEG grammar would be appropriate:
>
>       list = lpeg.P {
>               'list',
>               list = lpeg.P'[' * lpeg.Ct(lpeg.V"data") * lpeg.P']',
>               data = SPACE * C(number)
>                      + lpeg.V"list"
>       }
>
> This would be enough to parse the example I'm asking about (given an
> appropriate definition for SPACE and number).

  I just reread your message, and perhaps I should present this definition
of list:

        list = lpeg.P {
                'list',
                list = lpeg.P'['
                     * lpeg.Cf(lpeg.Ct"" * lpeg.V"data"^0,function(a,b) table.insert(a,b) return a end)
                     * lpeg.P']',
                data = SPACE * C(number)
                     + lpeg.V"list"
        }

This uses lpeg.Cf() recursively (it also uses lpeg.Ct(), but only to
generate the table (generated by matching the empty string) used to
accumulate the results.  As you can see, it's a bit longer than just using
lpeg.Ct() around a pattern.

  -spc