lpeg could infer the validity of this grammar.

classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

lpeg could infer the validity of this grammar.

Sam Putman
This is my first post to Lua list, so let me say hello and thank you.

Eliminating left recursion is a sometimes frustrating part of writing a PEG grammar. 

One way to do this rewrite is shown below. Sadly, lpeg doesn't allow this.

```lua
require "lpeg"

match, P, V, R, S = lpeg.match, lpeg.P, lpeg.V, lpeg.R, lpeg.S

space = P" "
digit = R"09"
operand = S"+-*/"
token = space + P"(" + digit^1 + P")" + operand

tokenizer = token^1

group = P {
"groups";
groups = #token * V"group" + token ;
group = P"(" * V"groups"^0 * P")";
}
---[[
expr = P {
"expr";
expr = token + ( V"group" + V"factor" + V"term") 
group = P"(" * V"expr"^0 * P")";
factor = #token * V"expr" * (P"*" + P"/") * V"expr";
term = token; -- we do not reach this point. 
}
--]] -- The following asserts are valid if expr is commented:

test_s = "(1 + 2 / 3) + 4 * 5 + (6)"
group_s = "(1 (2))"
invalid_token =  "(1 + 2 / ; 3) + 4 * 5 + (6)"
assert(match(tokenizer,test_s) == #test_s+1)
assert(match(tokenizer,invalid_token) ~= #invalid_token+1)
assert(match(group,group_s))
```

This program fails with "rule 'expr' may be left recursive".

I assure you, this is not possible. This program is somewhat of a logical proof of that. 

The # predicate means that, if token is called by the rule, it will consume part of the string. Because we put it at the start of expr, it is given a *chance*to*fail* before any infinite recursion can take place. 

lpeg does need to follow the call chain and ensure that token is an inevitable call. This is static inference of the grammar with no runtime cost. It is moderately complex, we must assure that it's impossible for the call to expr to succeed on empty without first calling token. PEGs are deterministic, so this is guaranteed to be a possible analysis. 

This was designed so that it should succeed deterministically. The naive state machine will always check for token in expr before checking the grouped terms, and factor uses #token to assure that this first parse will succeed before expr is allowed to be called. 

This grammar will never recurse, on valid or invalid input. It will behave as specified. It's badly broken as a real expression parser, intended only to demonstrate the principle. 

 I am willing to accept a grammar that shows that it's impossible to statically infer whether #pattern propagation can or cannot be performed, as a counterargument. Otherwise, I would very much like to see lpeg support this inference.

Thank you for your consideration. 

cheers,
-Sam Putman. 
Reply | Threaded
Open this post in threaded view
|

Re: lpeg could infer the validity of this grammar.

Dirk Laurie-2
2014-12-14 21:39 GMT+02:00 Sam Putman <[hidden email]>:

> expr = token + ( V"group" + V"factor" + V"term")
-- semicolon needed here
> group = P"(" * V"expr"^0 * P")";
> factor = #token * V"expr" * (P"*" + P"/") * V"expr";

> This program fails with "rule 'expr' may be left recursive".

If `expr` is not `token` or `group`, it tries `factor`.
The rule for `factor` invokes V"expr" before any
input has been consumed.

Reply | Threaded
Open this post in threaded view
|

Re: lpeg could infer the validity of this grammar.

Sam Putman


On Sun, Dec 14, 2014 at 12:56 PM, Dirk Laurie <[hidden email]> wrote:
2014-12-14 21:39 GMT+02:00 Sam Putman <[hidden email]>:

> expr = token + ( V"group" + V"factor" + V"term")
-- semicolon needed here
> group = P"(" * V"expr"^0 * P")";
> factor = #token * V"expr" * (P"*" + P"/") * V"expr";

> This program fails with "rule 'expr' may be left recursive".

If `expr` is not `token` or `group`, it tries `factor`.
The rule for `factor` invokes V"expr" before any
input has been consumed.

Hi Dirk,

I do understand the cause for the failure to infer. The grammar is valid, but not valid lpeg, as lpeg stands.

I'm not looking for help rewriting grammars, I can get around this particular hitch. 

Because factor calls #token, you and I know that the next call to token will succeed, if it is reached. So the call to expr, which calls token first, will succeed. So we can infer statically that it's safe to call factor. No loop will in fact occur.

#token fails with an illegal character or end of string, so factor will not call expr in that case.  
Reply | Threaded
Open this post in threaded view
|

Re: lpeg could infer the validity of this grammar.

Sam Putman


On Sun, Dec 14, 2014 at 1:50 PM, Sam Putman <[hidden email]> wrote:


On Sun, Dec 14, 2014 at 12:56 PM, Dirk Laurie <[hidden email]> wrote:
2014-12-14 21:39 GMT+02:00 Sam Putman <[hidden email]>:

> expr = token + ( V"group" + V"factor" + V"term")
-- semicolon needed here
> group = P"(" * V"expr"^0 * P")";
> factor = #token * V"expr" * (P"*" + P"/") * V"expr";

> This program fails with "rule 'expr' may be left recursive".

If `expr` is not `token` or `group`, it tries `factor`.
The rule for `factor` invokes V"expr" before any
input has been consumed.


I do understand the cause for the failure to infer. The grammar is valid, but not valid lpeg, as lpeg stands.


To elaborate, I understand that #token returns the empty string, and it isn't always safe to call "expr" within "expr". When hand-writing recursive descent, it's nice when possible to do a few tokens of lookahead to ensure completion, then recur. This grammar attempts to mimic that idiom. 
Reply | Threaded
Open this post in threaded view
|

Re: lpeg could infer the validity of this grammar.

Sean Conner
It was thus said that the Great Sam Putman once stated:

> On Sun, Dec 14, 2014 at 1:50 PM, Sam Putman <[hidden email]> wrote:
> > On Sun, Dec 14, 2014 at 12:56 PM, Dirk Laurie <[hidden email]>
> > wrote:
> >> 2014-12-14 21:39 GMT+02:00 Sam Putman <[hidden email]>:
> >>
> >> > expr = token + ( V"group" + V"factor" + V"term")
> >> -- semicolon needed here
> >> > group = P"(" * V"expr"^0 * P")";
> >> > factor = #token * V"expr" * (P"*" + P"/") * V"expr";
> >>
> >> > This program fails with "rule 'expr' may be left recursive".
> >>
> >> If `expr` is not `token` or `group`, it tries `factor`.
> >> The rule for `factor` invokes V"expr" before any
> >> input has been consumed.
> >
> > I do understand the cause for the failure to infer. The grammar is valid,
> > but not valid lpeg, as lpeg stands.
>
> To elaborate, I understand that #token returns the empty string, and it
> isn't always safe to call "expr" within "expr". When hand-writing recursive
> descent, it's nice when possible to do a few tokens of lookahead to ensure
> completion, then recur. This grammar attempts to mimic that idiom.

  The rules I use for this type of thing look like:

re = require "re" -- it's still part of LPeg.
                  -- and I find it easier to follow the grammar

parser = re.compile [[
        expr    <- term   (termop   term)*
        term    <- factor (factorop factor)*
        factor  <- number / open expr close

        space           <- %s*
        number          <- space {'-'^-1 [0-9]^+1} space
        termop          <- space {[+-]} space
        factorop        <- space {[*/]} space
        open            <- space '('    space
        close           <- space ')'    space

]]

  It should be straightforward to translate this to LPeg (seeing how re can
do that).

  -spc

Reply | Threaded
Open this post in threaded view
|

Re: lpeg could infer the validity of this grammar.

Sam Putman


On Sun, Dec 14, 2014 at 6:14 PM, Sean Conner <[hidden email]> wrote:

  The rules I use for this type of thing look like:

re = require "re" -- it's still part of LPeg.
                  -- and I find it easier to follow the grammar

parser = re.compile [[
        expr    <- term   (termop   term)*
        term    <- factor (factorop factor)*
        factor  <- number / open expr close

        space           <- %s*
        number          <- space {'-'^-1 [0-9]^+1} space
        termop          <- space {[+-]} space
        factorop        <- space {[*/]} space
        open            <- space '('    space
        close           <- space ')'    space

]]

I'm glad you posted this! Someone, somewhere, is going to search for 'left recursion lpeg' and solve their problem. 

I've been merrily rewriting my grammar and am pleased to find it's working. My issue is that lpeg won't pass a valid grammar because empty-return propagation doesn't account for the predicate in #patt. I think this should change, the additional complexity to the library is at compile-time. 

The grammar, as written, will not loop, and the compiler can reason through the call chain, just like we do, to confirm that.  
Reply | Threaded
Open this post in threaded view
|

Re: lpeg could infer the validity of this grammar.

Dirk Laurie-2
2014-12-15 4:49 GMT+02:00 Sam Putman <[hidden email]>:

> The grammar, as written, will not loop, and the compiler can reason through
> the call chain, just like we do, to confirm that.

In this case, it can. But a mechanism that can in general determine whether
a given grammar will terminate for all input --- didn't Alan Turing prove that
to be impossible?

Reply | Threaded
Open this post in threaded view
|

Re: lpeg could infer the validity of this grammar.

Sam Putman


On Sun, Dec 14, 2014 at 8:24 PM, Dirk Laurie <[hidden email]> wrote:
2014-12-15 4:49 GMT+02:00 Sam Putman <[hidden email]>:

> The grammar, as written, will not loop, and the compiler can reason through
> the call chain, just like we do, to confirm that.

In this case, it can. But a mechanism that can in general determine whether
a given grammar will terminate for all input --- didn't Alan Turing prove that
to be impossible?


^_^  

rule : patt / func

func = function() return func() end

Clearly we cannot infer all non-termination. 

I would accept, a la Turing the Great, a grammar for which this #patt business cannot be sorted into "may be left recursive" or "is not left recurisve". 
Reply | Threaded
Open this post in threaded view
|

Re: lpeg could infer the validity of this grammar.

Sean Conner
In reply to this post by Sam Putman
It was thus said that the Great Sam Putman once stated:

> On Sun, Dec 14, 2014 at 12:56 PM, Dirk Laurie <[hidden email]> wrote:
> >
> > 2014-12-14 21:39 GMT+02:00 Sam Putman <[hidden email]>:
> >
> > > expr = token + ( V"group" + V"factor" + V"term")
> > -- semicolon needed here
> > > group = P"(" * V"expr"^0 * P")";
> > > factor = #token * V"expr" * (P"*" + P"/") * V"expr";
> >
> > > This program fails with "rule 'expr' may be left recursive".
> >
> > If `expr` is not `token` or `group`, it tries `factor`.
> > The rule for `factor` invokes V"expr" before any
> > input has been consumed.
> >
>
> Hi Dirk,
>
> I do understand the cause for the failure to infer. The grammar is valid,
> but not valid lpeg, as lpeg stands.

  LPeg is a Lua implementation of Parsing Expresson Grammars, which are
*NOT* (I repeat, *NOT*) the same as LR() or LL() parsers.  There's a bit
more information about this on the Wikipedia:

http://en.wikipedia.org/wiki/Parsing_expression_grammar#Indirect_left_recursion

  -spc


Reply | Threaded
Open this post in threaded view
|

Re: lpeg could infer the validity of this grammar.

Sam Putman


On Sun, Dec 14, 2014 at 10:56 PM, Sean Conner <[hidden email]> wrote:

  LPeg is a Lua implementation of Parsing Expresson Grammars, which are
*NOT* (I repeat, *NOT*) the same as LR() or LL() parsers.  There's a bit
more information about this on the Wikipedia:

http://en.wikipedia.org/wiki/Parsing_expression_grammar#Indirect_left_recursion

  -spc


Absolutely correct, now, let's focus.

The lpeg engine has to detect potential left recursion, and reject it. 

How does it do this? 

It looks at a call to "patt", and sees whether another call to "patt" can be reached without consuming any input.  We check the rules, one at a time, to assure that the second "patt" call either consumes or fails before it reaches a third. 

The only type of capture which can modify this *logic* in a way that *still prevents infinite loops* is #patt, and only under this deterministic circumstance: If we have a pattern 'token', a looping pattern "expr", and a calling pattern "factor", then if factor looks like "factor = #token * expr" and expr looks like "expr = token + factor", we're okay.

Specifically, any #patt encountered before a potential left-recursion should be followed to see if a later call to `patt` is encountered. This must happen before the cursor moves, and before the recursive call to the "factor" type rule. 

Think of #patt as a promise to the grammar compiler that 'patt' will be encountered. The compiler may statically confirm that this happens. Currently, it treats #patt as any other pattern that doesn't consume input, but in fact, it adds a new constraint which may be propagated. 

One more time: I am making a logical claim about an algorithm. I would like to see demonstration code, and I'll tell you if the algorithm would say "cannot loop infinitely" or "might loop infinitely". I am especially interested in grammars for which these categories cannot be assigned: I haven't proven that it's impossible, because I don't have the time, but I suspect it is.  
Reply | Threaded
Open this post in threaded view
|

Re: lpeg could infer the validity of this grammar.

Pierre-Yves Gérardy
Here's the code that detects left recursion:

https://github.com/lua/lpeg/blob/2960f1cf68af916a095625dfd3e39263dac5f38c/lptree.c#L934
https://github.com/lua/lpeg/blob/2960f1cf68af916a095625dfd3e39263dac5f38c/lptree.c#L915

The function walks the grammar tree and if it loops past a given
threshold, an error is raised.

LPeg could support left-recursion. Sérgio Medeiros, Fabio Mascarenhas
and Roberto Ierusalimschy wrote a paper that describes an extension to
the engine, but they did not publish the code, for some reason.

http://arxiv.org/abs/1207.0443

Rostislav Sacek added left recursion support in LPegLJ, but it only
works with LuaJIT.

https://github.com/sacek/LPegLJ

Regards,
—Pierre-Yves Gérardy
—Pierre-Yves


On Mon, Dec 15, 2014 at 4:53 PM, Sam Putman <[hidden email]> wrote:

>
>
> On Sun, Dec 14, 2014 at 10:56 PM, Sean Conner <[hidden email]> wrote:
>>
>>
>>   LPeg is a Lua implementation of Parsing Expresson Grammars, which are
>> *NOT* (I repeat, *NOT*) the same as LR() or LL() parsers.  There's a bit
>> more information about this on the Wikipedia:
>>
>>
>> http://en.wikipedia.org/wiki/Parsing_expression_grammar#Indirect_left_recursion
>>
>>   -spc
>>
>
> Absolutely correct, now, let's focus.
>
> The lpeg engine has to detect potential left recursion, and reject it.
>
> How does it do this?
>
> It looks at a call to "patt", and sees whether another call to "patt" can be
> reached without consuming any input.  We check the rules, one at a time, to
> assure that the second "patt" call either consumes or fails before it
> reaches a third.
>
> The only type of capture which can modify this *logic* in a way that *still
> prevents infinite loops* is #patt, and only under this deterministic
> circumstance: If we have a pattern 'token', a looping pattern "expr", and a
> calling pattern "factor", then if factor looks like "factor = #token * expr"
> and expr looks like "expr = token + factor", we're okay.
>
> Specifically, any #patt encountered before a potential left-recursion should
> be followed to see if a later call to `patt` is encountered. This must
> happen before the cursor moves, and before the recursive call to the
> "factor" type rule.
>
> Think of #patt as a promise to the grammar compiler that 'patt' will be
> encountered. The compiler may statically confirm that this happens.
> Currently, it treats #patt as any other pattern that doesn't consume input,
> but in fact, it adds a new constraint which may be propagated.
>
> One more time: I am making a logical claim about an algorithm. I would like
> to see demonstration code, and I'll tell you if the algorithm would say
> "cannot loop infinitely" or "might loop infinitely". I am especially
> interested in grammars for which these categories cannot be assigned: I
> haven't proven that it's impossible, because I don't have the time, but I
> suspect it is.