Generating a pretty AST with LPeg

classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

Generating a pretty AST with LPeg

Matthias Kluwe
Hi!

I've nearly finished my first parsing task using LPeg (which was parsing
SAS-code to visualize the "data flow") and it was fun.

As I'm lacking some experience my "AST" resulting from the parsing step
is not too pretty. Let me give you a simplified example of what my "AST"
looks now. Consider parsing "function calls" in some hypothetical
language:

Using

    local WS    = lpeg.S' \n\t'^0
    local NAME  = lpeg.C( lpeg.R'az'^1 )
    local PLIST = WS * ( NAME * ( WS * ',' * WS * NAME )^0 )^-1
    local FCALL = lpeg.Cc'FCALL'
                  * lpeg.Ct( WS * NAME * '(' * PLIST * WS * ')' )
    local ALL   = lpeg.Ct( FCALL^0 )

with

    lpeg.match( ALL, [[
    foo( bar, baz )
    asdf( qwer )]] )

results in the table

    {
      'FCALL', { 'foo', 'bar', 'baz' },
      'FCALL', { 'asdf', 'qwer' }
    }

which can be interpreted as the list of expression types ('FCALL'),
followed by a data structure containing the function name and the
arguments.

For better readability and "type safety" I'd rather like it to be

    {
      'FCALL', { name = 'foo', args = { 'bar', 'baz' } },
      'FCALL', { name = 'asdf', args = { 'qwer' } }
    }

Currently, I'm achieving this by a second transformation step.

I'm pretty sure this can be directly done by some clever trick in the
LPeg expressions above.

Please drop me a hint on how to do it.

Any other suggestions are welcome, even a "This is not the way to do it,
usually you do [...]".

Regards,
Matthias



Reply | Threaded
Open this post in threaded view
|

Re: Generating a pretty AST with LPeg

Fabien-3
If getting a robust AST is a means to an end rather than fun for the sake of it, I'd suggest that you look at Metalua:
  • The AST generator is now packaged as a separate Rock. Te parser itself is compatible with Lua 5.2 (although the rockspec might claim otherwise);
  • It's mature, deployed in several largely used projects;
  • A lot of trial / error / feedback /improvement has gone in the definition of the exact AST format;
  • It supports nice features such as good AST / source mapping, or access to metadata such as comments attached to nodes.
$ luarocks install metalua-parser --local
[...]
$ lua -l luarocks.loader
Lua 5.1.4  Copyright (C) 1994-2008 Lua.org, PUC-Rio
> mlc = require 'metalua.compiler'.new() -- parser
> pp  = require 'metalua.pprint' -- pretty-printer
> ast = mlc:src_to_ast [[ for i=1, 10 do print(i) end ]]
> pp.print(ast,{hide_hash=true,metalua_tag=true})
{ `Fornum{ `Id "i", `Number "1", `Number "10", { `Call{ `Id "print", `Id "i" } } } }

Even if your goal is to have fun with LPeg, I believe it's wise to stick with Metalua's AST definition, unless you've got a good reason: there are many fixed mistakes in it, which could avoid being repeated :)

A complete description of the AST format is available here: http://git.eclipse.org/c/koneki/org.eclipse.koneki.metalua.git/tree/README-parser.md

 
 
Reply | Threaded
Open this post in threaded view
|

Re: Generating a pretty AST with LPeg

Duncan Cross
In reply to this post by Matthias Kluwe
On Wed, Jan 29, 2014 at 10:41 AM, Matthias Kluwe <[hidden email]> wrote:
>
>     {
>       'FCALL', { 'foo', 'bar', 'baz' },
>       'FCALL', { 'asdf', 'qwer' }
>     }
>

>
> For better readability and "type safety" I'd rather like it to be
>
>     {
>       'FCALL', { name = 'foo', args = { 'bar', 'baz' } },
>       'FCALL', { name = 'asdf', args = { 'qwer' } }
>     }
>

It looks like the feature you want is lpeg.Cg(patt, name) [1].

For your example, here is an alternative version of FCALL using Cg:

>    local WS    = lpeg.S' \n\t'^0
>    local NAME  = lpeg.C( lpeg.R'az'^1 )
>    local PLIST = WS * ( NAME * ( WS * ',' * WS * NAME )^0 )^-1

  local NAME_f = lpeg.Cg(NAME, 'name')
  local ARGS_f = lpeg.Cg(lpeg.Ct(PLIST), 'args')

  local FCALL = lpeg.Cc'FCALL'
                 * lpeg.Ct( WS * NAME_f * '(' * ARGS_f * WS * ')' )


[1] http://www.inf.puc-rio.br/~roberto/lpeg/#cap-g

-Duncan

Reply | Threaded
Open this post in threaded view
|

Re: Generating a pretty AST with LPeg

Matthias Kluwe
Hi!

Duncan Cross <duncan.cross <at> gmail.com> writes:

> On Wed, Jan 29, 2014 at 10:41 AM, Matthias Kluwe <mkluwe <at> gmail.com>
wrote:

> >
> >     {
> >       'FCALL', { 'foo', 'bar', 'baz' },
> >       'FCALL', { 'asdf', 'qwer' }
> >     }
> >
>
> >
> > For better readability and "type safety" I'd rather like it to be
> >
> >     {
> >       'FCALL', { name = 'foo', args = { 'bar', 'baz' } },
> >       'FCALL', { name = 'asdf', args = { 'qwer' } }
> >     }
> >
>
> It looks like the feature you want is lpeg.Cg(patt, name) [1].

Thank you. So easy now that I see it :-)

Regards,
Matthias



Reply | Threaded
Open this post in threaded view
|

Re: Generating a pretty AST with LPeg

Roberto Ierusalimschy
In reply to this post by Fabien-3
> A complete description of the AST format is available here:
> http://git.eclipse.org/c/koneki/org.eclipse.koneki.metalua.git/tree/README-parser.md

One small comment about this format: is there a good reason why the rule

        block: { stat* }

is not

        block: `Block{ stat* }

?
These seem to be the only tables that appear in a tree without a proper
tag.

(Another small comment: The grammar uses something called "ident" that
is not defined.)

-- Roberto

Reply | Threaded
Open this post in threaded view
|

Re: Generating a pretty AST with LPeg

Fabien-3
is there a good reason why the rule block: { stat* } block: `Block{ stat* } ?

Several actually.

First, readability: one of the main design goals was to keep AST as human-readable as possible, and the main limiting factor proves to be verbosity. If we can spare a node or tag without affecting readability, we try to.

In a canonical AST, blocks only appear in fixed locations: as a chunk's root, and in predefined positions in loops, functions and conditionals (`Forin `Fornum `Function `If). You know simply by the node's position that it must be a block, you never have to test it with "if x.tag=='Block' then ... end".

For instance, the AST for "function(x, y) if y then return y else return x end" is:

`Function{ { `Id "x", `Id "y" }
           { `If{ `Id "y"
                   { `Return{ `Id "y" } }
                   { `Return{ `Id "x" } } } }

This is more readable, IMO, than:

`Function{ { `Id "x", `Id "y" }
           `Block{ `If{ `Id "y"
                        `Block{ `Return{ `Id "y" } }
                        `Block{ `Return{ `Id "x" } } } }

(Actually, if you add a 'Block' tag on your blocks, most tools won't even notice, since we always know that the node is a block without looking at its tag)


As a counter-example, there is a `Pair{ } tag around key/value pairs in literal tables, which could have been avoided (since it's illegal to have blocks as table elements). { "a", b="c" } is encoded as:

`Table{ `String "a", `Pair{ `String "b", `String "c" } }

That's because:
  1. code often needs to test whether a literal table element AST is a hash pair or a list value; to do so, "if x.tag=='Pair' then ... end" is a nice and straightforward idiom;
  2. when a human being reads an AST, he tends to skip braces and rely on indentation, as with most Lisp-ish languages. Tag-less pairs were too easy to misinterpret as two consecutive values, misreading the AST of { "a", b="c" } as that of { "a", "b", "c" }.


Moreover, tag-less blocks are not the same as `Do{ } blocks: they don't delimit a scope (although the surrounding loop/function/if might create one). For the compiler, `Do{ stat1, stat2, stat3, stat4 } is semantically equivalent to `Do{ stat1, { stat2, stat3 }, stat4 }, and the latter can be generated from `Do{ stat1, stat4 } with table.insert(x, 2, {stat2, stat3}). This proves very practical for AST manipulations, such as splicing argument in quasi-quotes. With multi-values discarded by Lua in non-final elements of a table, we'd otherwise miss the equivalent to Lisp's unquote-splicing operator ",@".
Reply | Threaded
Open this post in threaded view
|

Re: Generating a pretty AST with LPeg

Roberto Ierusalimschy
> Several actually.
>
> [...]

Thanks.


> > (Another small comment: The grammar uses something called "ident" that
> > is not defined.)

And what is "ident"?

-- Roberto

Reply | Threaded
Open this post in threaded view
|

Re: Generating a pretty AST with LPeg

David Favro
In reply to this post by Matthias Kluwe
Use lpeg.Cg() around NAME and PLIST, like this:

local WS    = lpeg.S' \n\t'^0
local NAME  = lpeg.C( lpeg.R'az'^1 )
local PLIST = lpeg.Ct( WS * ( NAME * ( WS * ',' * WS * NAME )^0 )^-1 )
local FCALL = lpeg.Cc'FCALL' * lpeg.Ct( WS * lpeg.Cg(NAME,'name')
               * '(' * lpeg.Cg(PLIST,'args') * WS * ')' )
local ALL   = lpeg.Ct( FCALL^0 )


Reply | Threaded
Open this post in threaded view
|

Re: Generating a pretty AST with LPeg

Fabien-3
In reply to this post by Roberto Ierusalimschy



On Wed, Jan 29, 2014 at 5:38 PM, Roberto Ierusalimschy <[hidden email]> wrote:
And what is "ident"?

Identifiers: id ::= `Ident{ <string> }