Problems with lpeg.Cmt captures

Glenn McAllister
I've got an awkward grammar that requires me to reach back to a previously parsed element to see what it is (is it a newline or not) before I can accept the match. Because of the way I need to implement things, the newline is actually captured in a previous match, and there should be a NewlineChunk object as the capture. That object isn't available for me to use in the next match, so I decided to try using Cmt and Cb to figure out what I need to do. My function is getting called correctly, but after I return true (or i), LPEG re-evaluates the same text but provides me with a different back reference to compare, which fails and then sends me down the wrong path.

Some examples will clear this up I hope. I need to implement what amounts to an if statement, with very specific rules about what whitespace I must strip. Specifically if I have

'one$if(foo)$ two \n$endif$\n three'

The newlines before and after the $endif$ have to be stripped. The resulting text would be 'one two three'.

If I have

'one$if(foo)$ two $endif$\n three'

the resulting text would be 'one two\n three'. Note that I'm not supposed to chew the newline after $endif$, as it isn't on a line by itself.
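A minimal sketch of these two cases in isolation, assuming an LPeg version that provides the lookbehind pattern lpeg.B. This is not the approach used in the grammar below: it inspects the subject text directly instead of the previously captured chunk. The names EndifLine and classify, and the labels "strip" and "keep", are made up for the illustration.

```lua
local lpeg = require "lpeg"
local P, S, B, Cc, V = lpeg.P, lpeg.S, lpeg.B, lpeg.Cc, lpeg.V

local NEWLINE = S"\n\r"
local ENDIF = P"$endif$"

-- "strip" when $endif$ sits on a line by itself (a newline on both sides),
-- "keep" otherwise. B(NEWLINE) looks one character behind without consuming.
local EndifLine = B(NEWLINE) * ENDIF * NEWLINE * Cc"strip"
                + ENDIF * Cc"keep"

-- Hypothetical helper: scan forward to the first $endif$ and classify it.
local function classify(subject)
  return P{ EndifLine + 1 * V(1) }:match(subject)
end

print(classify("one$if(foo)$ two \n$endif$\n three"))  -- strip
print(classify("one$if(foo)$ two $endif$\n three"))    -- keep
```

Because the lookbehind reads the subject itself, it does not matter which earlier rule consumed the newline.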

The 'two' bit can obviously be more than just text, and in particular it can be more $if(..)$ blah $endif$ statements, meaning I need to treat blah as an arbitrary series of chunks to be parsed. Because of the way I need to handle newlines, they are treated as a separate flyweight object.

So in the first case where I have to chew that newline object, I can't just do (ignoring captures, etc.):

EndifExpr = s.NEWLINE * ExprStart * s.ENDIF * ExprEnd * s.NEWLINE

The first newline is captured in the parsing of the chunks. So the rule I came up with is (sorry for any bad wrapping):

 EndifExpr = Cmt(Cb(1) * ExprStart * s.ENDIF * ExprEnd * s.NEWLINE,
                    function(s,i,a)
                        print('checking if we need to kill a newline in \'' .. s ..
                              '\' at position ' .. i)
                        if a.isA then
                            if a:isA(NewlineChunk) then
                                print('really need to kill newline')
                                return i, "kill"
                            else
                                print('not a NewlineChunk class')
                                return false
                            end
                        else
                            print('not a ST class (not expected)', a)
                            return false
                        end
                    end) +
                Cs((ExprStart * C(s.ENDIF) * ExprEnd) / "dontkill"),

Basically I want to see "kill" or "dontkill" to decide what to do with the last chunk in my table of chunks. When the overall match is determined, a function is called to create the if chunk, which is where I make my decision.

And yes, this match does actually execute. The problem for me is that it executes twice. Before you ask, if I replace 'return i, "kill"' with 'return i+1, "kill"' in an attempt to advance the match position, I still see this same behavior. The following is debug output from LPEG:

|| s: |$endif$
||  three| stck: 9 c: 14  195: choice -> 198 (0)
|| s: |$endif$
||  three| stck: 10 c: 14  196: call -> 69
|| s: |$endif$
||  three| stck: 11 c: 14  69: choice -> 90 (0)
|| s: |$endif$
||  three| stck: 12 c: 14  70: opencapture runtime(n = 0)  (off = 8)
|| s: |$endif$
||  three| stck: 12 c: 15  71: emptycapture backref(n = 0)  (off = 1)
|| s: |$endif$
||  three| stck: 12 c: 16  72: call -> 309
|| s: |$endif$
||  three| stck: 13 c: 16  309: set [(24)]
|| s: |endif$
||  three| stck: 13 c: 16  318: ret
|| s: |endif$
||  three| stck: 12 c: 16  73: char 'e'
|| s: |ndif$
||  three| stck: 12 c: 16  74: char 'n'
|| s: |dif$
||  three| stck: 12 c: 16  75: char 'd'
|| s: |if$
||  three| stck: 12 c: 16  76: char 'i'
|| s: |f$
||  three| stck: 12 c: 16  77: char 'f'
|| s: |$
||  three| stck: 12 c: 16  78: call -> 307
|| s: |$
||  three| stck: 13 c: 16  307: char '$'
|| s: |
||  three| stck: 13 c: 16  308: ret
|| s: |
||  three| stck: 12 c: 16  79: set [(0a)(0d)]
|| s: | three| stck: 12 c: 16  88: closeruntime close(n = 0)  (off = 0)
|| checking if we need to kill a newline in 'one$if(foo)$ two
|| $endif$
||  three' at position 27
|| really need to kill newline
|| s: | three| stck: 12 c: 16  89: commit -> 102
|| s: | three| stck: 11 c: 16  102: ret
|| s: | three| stck: 10 c: 16  197: failtwice
|| s: |$endif$
||  three| stck: 7 c: 14  435: closecapture close(n = 0)  (off = 0)
|| s: |$endif$
||  three| stck: 7 c: 15  436: call -> 69
|| s: |$endif$
||  three| stck: 8 c: 15  69: choice -> 90 (0)
|| s: |$endif$
||  three| stck: 9 c: 15  70: opencapture runtime(n = 0)  (off = 8)
|| s: |$endif$
||  three| stck: 9 c: 16  71: emptycapture backref(n = 0)  (off = 1)
|| s: |$endif$
||  three| stck: 9 c: 17  72: call -> 309
|| s: |$endif$
||  three| stck: 10 c: 17  309: set [(24)]
|| s: |endif$
||  three| stck: 10 c: 17  318: ret
|| s: |endif$
||  three| stck: 9 c: 17  73: char 'e'
|| s: |ndif$
||  three| stck: 9 c: 17  74: char 'n'
|| s: |dif$
||  three| stck: 9 c: 17  75: char 'd'
|| s: |if$
||  three| stck: 9 c: 17  76: char 'i'
|| s: |f$
||  three| stck: 9 c: 17  77: char 'f'
|| s: |$
||  three| stck: 9 c: 17  78: call -> 307
|| s: |$
||  three| stck: 10 c: 17  307: char '$'
|| s: |
||  three| stck: 10 c: 17  308: ret
|| s: |
||  three| stck: 9 c: 17  79: set [(0a)(0d)]
|| s: | three| stck: 9 c: 17  88: closeruntime close(n = 0)  (off = 0)
|| checking if we need to kill a newline in 'one$if(foo)$ two
|| $endif$
||  three' at position 27
|| not a ST class (not expected)        table: 0x809ad88

So for some reason, despite returning the fact that I said the match succeeded, it appears to have failed and is calling it again. At this point, I'm not sure what captured element is being provided, unless it's the entire table of chunks.

I'm stumped. Anyone have any pointers as to why my Cmt capture isn't working as I expected?

--
Glenn McAllister     <[hidden email]>      +1 416 348 1594
SOMA Networks, Inc.  http://www.somanetworks.com/  +1 416 977 1414


Re: Problems with lpeg.Cmt captures

Roberto Ierusalimschy
> So for some reason, despite returning the fact that I said the match 
> succeeded, it appears to have failed and is calling it again.  At this 
> point, I'm not sure what captured element is being provided, unless it's the 
> entire table of chunks.

In your trace there is this sequence, just after the Cmt execution:

|| s: | three| stck: 12 c: 16  89: commit -> 102
|| s: | three| stck: 11 c: 16  102: ret
|| s: | three| stck: 10 c: 16  197: failtwice

The commit seems consistent with the choice in your rule (" + Cs...");
The return seems to finish the rule EndifExpr. But then there is this
failtwice. This instruction is generated only by a not predicate ('a-b'
or '-b'). Are you sure your pattern does not have this predicate?

-- Roberto
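
A small sketch of the behaviour the not predicate has here, using a made-up stand-in for EndifExpr (the names Endif, NotEndif and calls are hypothetical, not from the grammar in this thread). It assumes only the documented rule that a match-time capture runs its function as soon as its inner pattern matches, even when the enclosing pattern later fails.

```lua
local lpeg = require "lpeg"
local P, Cmt = lpeg.P, lpeg.Cmt

local calls = 0

-- Hypothetical stand-in for EndifExpr: matches "$endif$" and counts
-- how often the match-time function is invoked.
local Endif = Cmt(P"$endif$", function(subject, pos)
  calls = calls + 1
  return pos        -- report success at the current position
end)

-- Not predicate: succeeds only where Endif fails, and consumes no input.
local NotEndif = -Endif

-- Endif succeeds here, so the predicate as a whole fails (match returns nil),
-- yet the function has already run once.
local on_endif = NotEndif:match("$endif$")
local calls_after_endif = calls

-- Endif fails here before reaching the function, so the predicate succeeds
-- and returns position 1 (nothing consumed).
local on_text = NotEndif:match("text")

print(on_endif, calls_after_endif, on_text)
```

In other words, a function returning success inside `-p` makes the predicate fail, which is consistent with a failtwice following the commit and ret in the trace.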


Re: Problems with lpeg.Cmt captures

Glenn McAllister
Roberto Ierusalimschy wrote:
>> So for some reason, despite returning the fact that I said the match succeeded, it appears to have failed and is calling it again. At this point, I'm not sure what captured element is being provided, unless it's the entire table of chunks.
>
> In your trace there is this sequence, just after the Cmt execution:
>
> || s: | three| stck: 12 c: 16  89: commit -> 102
> || s: | three| stck: 11 c: 16  102: ret
> || s: | three| stck: 10 c: 16  197: failtwice
>
> The commit seems consistent with the choice in your rule (" + Cs...");
> The return seems to finish the rule EndifExpr. But then there is this
> failtwice. This instruction is generated only by a not predicate ('a-b'
> or '-b'). Are you sure your pattern does not have this predicate?
>
> -- Roberto

I probably should have started this thread by stating up front that I'm porting the Java StringTemplate template library/engine to Lua. It's the StringTemplate spec that is determining what I need to do. Specifically regarding whitespace, you can see what I'm working from here: http://tinyurl.com/65l87h (the 'Whitespace in Conditionals' section at the bottom of the page).

I think this line is the culprit:

Expr = -EndifExpr * (EscapeExpr + CommentExpr + IfExpr + TemplateRefExpr + AttrRefExpr)

I was finding that if I didn't put the -EndifExpr then $endif$ was being treated as an attribute reference (AttrRefExpr) that effectively translates into an empty string. When I just have

Expr = EscapeExpr + CommentExpr + IfExpr + TemplateRefExpr + AttrRefExpr

I get a bunch of failures where the IfExpr isn't getting parsed correctly.

Here's the entire grammar so far. I'm new to PEGs as I've previously pointed out, so I'm probably doing something wrong; any advice would be appreciated. Sorry if this wraps badly for someone.

-- Grammar terminals
local s = {
    N = R'09',
    AZ = R('__','az','AZ','\127\255'),
    NEWLINE = S'\n\r',
    SPACE = S' \t',
    WS = S' \t\n\r',
    SEMI = S';',
    COMMA = S',',
    PERIOD = S'.',
    SEPARATOR = P'separator',
    EQUALS = P'=',
    DQUOTE = S'"',
    NULL = P'null',
    EPSILON = P(true),
    ESCAPE = S'\\',
    BANG = P'!',
    EXPR_START_DOLLAR = P'$' - S'\\',
    EXPR_END_DOLLAR = P'$',
    EXPR_START_BRACKET = P'<' - S'\\',
    EXPR_END_BRACKET = P'>',
    LBRACE = P'(',
    RBRACE = P')',
    IF = P'if',
    ENDIF = P'endif'
}

-- A bunch of V'foo' assignments happen around here so the grammar
-- looks a little cleaner (from my pov).

local grammar = {
    "Template",

    Template = TemplateBody * -1,

    TemplateBody = Ct(Chunk^1),

    Chunk = Newline + Literal + Expr,

    Expr = -EndifExpr * (EscapeExpr + CommentExpr + IfExpr +
                         TemplateRefExpr + AttrRefExpr),
    --Expr = IfExpr + EscapeExpr + CommentExpr + TemplateRefExpr + AttrRefExpr,

    Newline = C(s.NEWLINE) / newNewline,

    Literal = Cs(((LiteralEscape + 1) - (ExprStart + Newline))^1) / newLiteral,

    LiteralEscape = (s.ESCAPE * S'$<') / literalEscapes,

    AttrRef =  C((1 - (ExprEnd + s.SEMI + s.PERIOD))^1),

    AttrProp = (s.PERIOD * C((1 - (ExprEnd + s.SEMI))^1)) +
               C(s.EPSILON),

    AttrOpts = (s.SEMI * s.SPACE^0 *
                    Ct(AttrOpt * (s.COMMA * s.SPACE^0 * AttrOpt)^0)) +
               C(s.EPSILON), -- ensures we always get something for the options

    AttrOpt = Ct(C((1 - (s.EQUALS + s.COMMA))^1) * s.EQUALS * s.DQUOTE *
                Cs(((((s.ESCAPE * S'ntr ')/exprEscapes) + 1) - s.DQUOTE)^0)
                * s.DQUOTE) +
              Ct(C((1 - (s.COMMA + s.SPACE + ExprEnd))^1) * C(s.EPSILON)),

    AttrRefExpr = ExprStart *
                    (AttrRef * AttrProp * AttrOpts / newAttrRefExpr) *
                    ExprEnd,

    CommentExpr = ExprStart * s.BANG * (1 - s.BANG)^0 * s.BANG * ExprEnd,

    EscapeExpr = ExprStart *
                    Ct(Cs(((s.ESCAPE * S'ntr ') / exprEscapes))^1) / newEscapeExpr *
                    ExprEnd,

    TemplateRef = C((1 - s.LBRACE)^1),

    TemplateParamList = TemplateParam *
                        (s.WS^0 * s.COMMA * s.WS^0 * TemplateParam)^0,

    TemplateParam = Name * s.WS^0 * s.EQUALS * s.WS^0 * Name,

    Name = C(s.AZ * (s.AZ + s.N)^0),

    TemplateRefExpr = ExprStart * (TemplateRef *
                        s.LBRACE * s.WS^0 *
                        (Ct(TemplateParamList) + s.EPSILON) * s.WS^0 *
                        s.RBRACE) / newTemplateRef *
                        ExprEnd,

    EndifExpr = Cmt(Cb(1) * ExprStart * s.ENDIF * ExprEnd * s.NEWLINE,
                    function(s,i,a)
                        if a.isA then
                            if a:isA(NewlineChunk) then
                                return i, "kill"
                            else
                                return false
                            end
                        else
                            return false
                        end
                    end) +
                Cs((ExprStart * C(s.ENDIF) * ExprEnd) / "dontkill"),

    -- Because we are creating an anonymous embedded template, we need to
    -- pass in options (scanner and auto_indent) that the template cares about
    IfExpr = ExprStart * s.IF * s.LBRACE * IfExprAttr * IfExprProp * s.RBRACE * ExprEnd *
                --s.NEWLINE^0 * C((1 - (ExprStart * s.ENDIF))^0) * s.NEWLINE^0 *
                s.NEWLINE^0 * Ct(Chunk^1) *
                EndifExpr *
                Carg(1) * Carg(2) / newIf,

    IfExprAttr = C((1 - (s.PERIOD + s.RBRACE + ExprEnd))^1),

    -- This one is tricky, as we need to deal with indirect property
    -- references, which are offset by braces, which also happen to delimit the
    -- attribute/property of the if expression.
    IfExprProp = (s.PERIOD * C(s.LBRACE * (1 - (ExprEnd + s.RBRACE))^1 * s.RBRACE)) +
                 (s.PERIOD * C((1 - (ExprEnd + s.RBRACE))^1)) +
                 C(s.EPSILON),
}

--
Glenn McAllister     <[hidden email]>      +1 416 348 1594
SOMA Networks, Inc.  http://www.somanetworks.com/  +1 416 977 1414


Re: Problems with lpeg.Cmt captures

Roberto Ierusalimschy
>>> So for some reason, despite returning the fact that I said the match 
>>> succeeded, it appears to have failed and is calling it again.
>>> [...]
> I think this line is the culprit:
>
> Expr = -EndifExpr * (EscapeExpr + CommentExpr + IfExpr + TemplateRefExpr + 
> AttrRefExpr)

Clearly it is the culprit. As you said, EndifExpr succeeds, so
-EndifExpr fails. A later match against IfExpr tries EndifExpr again.

-- Roberto
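
The double evaluation can be reproduced in a few lines. The sketch below uses hypothetical names (Endif, guarded_by_rule, guarded_by_text) and a much smaller pattern than the grammar in this thread. The second variant is only one possible way around the problem, under the assumption that the guard needs to recognise just the `$endif$` text and not make the newline decision.

```lua
local lpeg = require "lpeg"
local P, Cmt = lpeg.P, lpeg.Cmt

local calls = 0

-- Hypothetical stand-in for EndifExpr with a match-time function.
local Endif = Cmt(P"$endif$", function(subject, pos)
  calls = calls + 1
  return pos
end)

-- Same shape as Expr = -EndifExpr * (...): the guard runs the whole rule,
-- function included, and then the closing Endif runs it again.
local guarded_by_rule = (-Endif * 1)^0 * Endif
calls = 0
local end_rule = guarded_by_rule:match("two $endif$")
local calls_rule = calls

-- Possible alternative: guard on the literal text only, so the function
-- runs just once, when the closing Endif is actually matched.
local guarded_by_text = (-P"$endif$" * 1)^0 * Endif
calls = 0
local end_text = guarded_by_text:match("two $endif$")
local calls_text = calls

print(calls_rule, calls_text)  -- 2   1
```

Both patterns stop at the same position; only the number of times the match-time function runs differs. Whether a text-only guard fits the real grammar depends on what AttrRefExpr would otherwise accept, which this sketch does not model.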