minor LPEG doubt

classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

minor LPEG doubt

joy mondal
Hi !

Lets say you have a language that looks as follows :

hello = 
  (a,b) ->
    print 'hello world'
    1 + 2
( It looks a lot like moonscript if that helps ) 

hello = \n  (a,b) ->\n    print 'hello world'\n    (1 + 2)\n

Would the token stream out of your LPEG grammer look ideally like this ?

[var hello]

[assign]

[indent 2]

[function start a b]

[indent 4]

[call print 'hello world']

[indent 4]

[call add 1 2]
 
[indent 2]

One final issue, when using -- lpeg.Ct -- How would I iterate over the table since I cannot call .length on it ?
Can I pass a metatable to lpeg.Ct that has a length operator ?

Hopefully I am in the right path, thanks for your help.
Cheers !
Reply | Threaded
Open this post in threaded view
|

Re: minor LPEG doubt

Sean Conner
It was thus said that the Great joy mondal once stated:

> Hi !
>
> Lets say you have a language that looks as follows :
>
> hello =
>   (a,b) ->
>     print 'hello world'
>     1 + 2
>
> ( It looks a lot like moonscript if that helps )
>
> hello = \n  (a,b) ->\n    print 'hello world'\n    (1 + 2)\n
>
>
> Would the token stream out of your LPEG grammer look ideally like this ?

  Are you asking a particular person or just anyone in general?  

  LPeg doesn't automatically parse text---it's a library of code to generate
code that will parse text.

> [var hello]
>
>
> [assign]
>
> [indent 2]
>
> [function start a b]
>
> [indent 4]
>
> [call print 'hello world']
>
> [indent 4]
>
> [call add 1 2]
>
> [indent 2]

  Yes, the output of an LPeg parser *could* output that, but someone would
have to write the code to do so.  The LPeg I might write might return the
following Lua table:

        {
          nodetype = 'assignment',
          to = 'hello',
          value =
          {
            nodetype = 'function',
            input = { 'a' , 'b' },
            code =
            {
              {
                nodetype = 'call',
                id = 'print',
                parameters = { 'hello world' }
              },
              {
                nodetype = 'expression',
                code =
                {
                  nodetype = '+',
                  parameters =
                  {
                    {
                      exprtype = 'immediate',
                      type = 'number',
                      value = 1
                    },
                    {
                      exprtype = 'immediate',
                      type = 'number',
                      value = 2
                    }
                  }
                }
              }
            }
          }
        }

  It all depends upon the LPeg code.

> One final issue, when using -- lpeg.Ct -- How would I iterate over the
> table since I cannot call .length on it ?

  The table returned from lpeg.Ct() is a Lua table and can be iterated over
like any other Lua table.

        parse = lpeg.Ct(lpeg.C("a")^0)

        x = parse:match "aaaaaaaa"
        print(#x,x)
        for i = 1 , #x do print(x[i]) end

  If you mean "How do I obtain the length of the table returned by lpeg.Ct()
during parsing?" then something like this would work:

        parse = lpeg.Cf(lpeg.Ct"" * lpeg.C"a"^0,function(t,a)
          if #t < 5 then
            table.insert(t,a)
          end
          return t
        end)

        x = parse:match "aaaaaaaa"
        print(#x,x)
        for i = 1 , #x do print(x[i]) end

  There are other ways to accomplish the same thing, but this is better
served by this:

        parse = lpeg.Ct(lpeg.C"a"^-5) * lpeg.P"a"^0
       
        x = parse:match "aaaaaaaa"
        print(#x,x)
        for i = 1 , #x do print(x[i]) end

where you look for the maximum number of items you want in the table.

  It might help us to know what exactly you are trying to accomplish with
LPeg.

  -spc (And do try the examples and work out how they work)


Reply | Threaded
Open this post in threaded view
|

Re: minor LPEG doubt

joy mondal
Hi Sean Conner,

Thanks for the quick reply !

You have answered both my questions

( I need to try and write the first example since I am still unsure about the LPEG code for capturing indentation - left curly and right curly are simple enough but I am thrown off regarding indentation ).

For the first issue, yes I need to create a hierarchical tree instead of a flat output, normally examples of lexer output online show a stream of tokens, but LPEG creates an AST directly.

The second one is my fault, I was unaware that #table was the syntax for getting table length.

Cheers !

On Sat, Sep 15, 2018 at 3:47 AM Sean Conner <[hidden email]> wrote:
It was thus said that the Great joy mondal once stated:
> Hi !
>
> Lets say you have a language that looks as follows :
>
> hello =
>   (a,b) ->
>     print 'hello world'
>     1 + 2
>
> ( It looks a lot like moonscript if that helps )
>
> hello = \n  (a,b) ->\n    print 'hello world'\n    (1 + 2)\n
>
>
> Would the token stream out of your LPEG grammer look ideally like this ?

  Are you asking a particular person or just anyone in general? 

  LPeg doesn't automatically parse text---it's a library of code to generate
code that will parse text.

> [var hello]
>
>
> [assign]
>
> [indent 2]
>
> [function start a b]
>
> [indent 4]
>
> [call print 'hello world']
>
> [indent 4]
>
> [call add 1 2]
>
> [indent 2]

  Yes, the output of an LPeg parser *could* output that, but someone would
have to write the code to do so.  The LPeg I might write might return the
following Lua table:

        {
          nodetype = 'assignment',
          to = 'hello',
          value =
          {
            nodetype = 'function',
            input = { 'a' , 'b' },
            code =
            {
              {
                nodetype = 'call',
                id = 'print',
                parameters = { 'hello world' }
              },
              {
                nodetype = 'expression',
                code =
                {
                  nodetype = '+',
                  parameters =
                  {
                    {
                      exprtype = 'immediate',
                      type = 'number',
                      value = 1
                    },
                    {
                      exprtype = 'immediate',
                      type = 'number',
                      value = 2
                    }
                  }
                }
              }
            }
          }
        }

  It all depends upon the LPeg code.

> One final issue, when using -- lpeg.Ct -- How would I iterate over the
> table since I cannot call .length on it ?

  The table returned from lpeg.Ct() is a Lua table and can be iterated over
like any other Lua table.

        parse = lpeg.Ct(lpeg.C("a")^0)

        x = parse:match "aaaaaaaa"
        print(#x,x)
        for i = 1 , #x do print(x[i]) end

  If you mean "How do I obtain the length of the table returned by lpeg.Ct()
during parsing?" then something like this would work:

        parse = lpeg.Cf(lpeg.Ct"" * lpeg.C"a"^0,function(t,a)
          if #t < 5 then
            table.insert(t,a)
          end
          return t
        end)

        x = parse:match "aaaaaaaa"
        print(#x,x)
        for i = 1 , #x do print(x[i]) end

  There are other ways to accomplish the same thing, but this is better
served by this:

        parse = lpeg.Ct(lpeg.C"a"^-5) * lpeg.P"a"^0

        x = parse:match "aaaaaaaa"
        print(#x,x)
        for i = 1 , #x do print(x[i]) end

where you look for the maximum number of items you want in the table.

  It might help us to know what exactly you are trying to accomplish with
LPeg.

  -spc (And do try the examples and work out how they work)


Reply | Threaded
Open this post in threaded view
|

Re: minor LPEG doubt

Sean Conner
It was thus said that the Great joy mondal once stated:
> Hi Sean Conner,
>
> Thanks for the quick reply !
>
> You have answered both my questions
>
> ( I need to try and write the first example since I am still unsure about
> the LPEG code for capturing indentation - left curly and right curly are
> simple enough but I am thrown off regarding indentation ).

  Easy enough:

        indent = lpeg.P"\n" * lpeg.P" "^0

Or, if your gammar ends on a newline, then skip the initial lpeg.P"\n".  To
save the current level, you could do something like (untested):

        indent = lpeg.P"\n" * (lpeg.C(lpeg.P" "^0) * lpeg.Carg(1))
               / function(indent,info)
                  -- ------------------
                  -- if the length of indent is larger than the current
                  -- indent level, we have a new indent level
                  -- -----------------
                 
                   if #indent > #info.identlevels[#info.identlevels] then
                     table.insert(info.identlevels,ident)
                     return "{" -- simulate an opening bracket or
                             -- whatever you use to indicate new indent
                 
                   elseif #indent < #info.indentlevels[#info.identlevels] then
                     table.remove(info.identlevels)
                     return "}"
                   
                   else
                     return " " -- just return something neutral
                   end
                 end



  This treats indent levels as either an opening brace, closing brace or
space (change to suite your needs).  But you do need to call the top level
parsing rule to:

        ast = parser:match(text,1,{ indentlevels = { "" } })
       
paramters to lpeg.match past the initial position argument are available via
lpeg.Carg(), and I'm using that here to keep track of some addtional
information during the parse (you could skip this and keep this info in
globals, but I dislike globals as much as possible).  Here, the indentlevels
array is just a stack of seen indents.  If we get an indent that is longer
than the current one, we have a new level, and if it's shorter, we've ended
a level, and if it matches, we're still in the current level.

  If you want to handle tabs as spaces, you can do it, but it can get
complicated.

> For the first issue, yes I need to create a hierarchical tree instead of a
> flat output, normally examples of lexer output online show a stream of
> tokens, but LPEG creates an AST directly.

  Oh, LPeg can create a stream of tokens---I've had to do that type of stuff
in certain circumstances.  It's not hard, but you do have to track a bit
more information:

        local parser = lpeg.C( --[[ LPeg code ]]-- ) * lpeg.Cp()
       
        local text = " ... code to parse here ... "
        local pos  = 1
        local info = { --[[ additional information used for parsing ]]-- }
       
        while pos <= #text do
          local token,newpos = parser:match(text,pos,info)
          if not token then
            error "Error parsing"
          end
          -- process token
          pos = newpos
        end
       
Basically, the parser bit will return the next logical token and the
position to resume parsing the text for the next token.

  -spc

Reply | Threaded
Open this post in threaded view
|

Re: minor LPEG doubt

joy mondal
Hi Sean !

I am still unsure how to deal with situations such as these:

""Hello""

"Hel \"\" lo" -- using \ for escaping

I would like to parse these as (Lua Tables) :

{"string",""Hello""}

{"string","Hel \"\" lo"}

I see it would be similar to the indent logic you have illustrated above, but i  cannot seem to piece together the hints.

cheers !

On Sat, Sep 15, 2018 at 11:05 AM Sean Conner <[hidden email]> wrote:
It was thus said that the Great joy mondal once stated:
> Hi Sean Conner,
>
> Thanks for the quick reply !
>
> You have answered both my questions
>
> ( I need to try and write the first example since I am still unsure about
> the LPEG code for capturing indentation - left curly and right curly are
> simple enough but I am thrown off regarding indentation ).

  Easy enough:

        indent = lpeg.P"\n" * lpeg.P" "^0

Or, if your gammar ends on a newline, then skip the initial lpeg.P"\n".  To
save the current level, you could do something like (untested):

        indent = lpeg.P"\n" * (lpeg.C(lpeg.P" "^0) * lpeg.Carg(1))
               / function(indent,info)
                   -- ------------------
                   -- if the length of indent is larger than the current
                   -- indent level, we have a new indent level
                   -- -----------------

                   if #indent > #info.identlevels[#info.identlevels] then
                     table.insert(info.identlevels,ident)
                     return "{" -- simulate an opening bracket or
                                -- whatever you use to indicate new indent

                   elseif #indent < #info.indentlevels[#info.identlevels] then
                     table.remove(info.identlevels)
                     return "}"

                   else
                     return " " -- just return something neutral
                   end
                 end



  This treats indent levels as either an opening brace, closing brace or
space (change to suite your needs).  But you do need to call the top level
parsing rule to:

        ast = parser:match(text,1,{ indentlevels = { "" } })

paramters to lpeg.match past the initial position argument are available via
lpeg.Carg(), and I'm using that here to keep track of some addtional
information during the parse (you could skip this and keep this info in
globals, but I dislike globals as much as possible).  Here, the indentlevels
array is just a stack of seen indents.  If we get an indent that is longer
than the current one, we have a new level, and if it's shorter, we've ended
a level, and if it matches, we're still in the current level.

  If you want to handle tabs as spaces, you can do it, but it can get
complicated.

> For the first issue, yes I need to create a hierarchical tree instead of a
> flat output, normally examples of lexer output online show a stream of
> tokens, but LPEG creates an AST directly.

  Oh, LPeg can create a stream of tokens---I've had to do that type of stuff
in certain circumstances.  It's not hard, but you do have to track a bit
more information:

        local parser = lpeg.C( --[[ LPeg code ]]-- ) * lpeg.Cp()

        local text = " ... code to parse here ... "
        local pos  = 1
        local info = { --[[ additional information used for parsing ]]-- }

        while pos <= #text do
          local token,newpos = parser:match(text,pos,info)
          if not token then
            error "Error parsing"
          end
          -- process token
          pos = newpos
        end

Basically, the parser bit will return the next logical token and the
position to resume parsing the text for the next token.

  -spc

Reply | Threaded
Open this post in threaded view
|

Re: minor LPEG doubt

joy mondal
Hi Sean !

After reading my mail I think I wasn't clear enough.

How do you deal with situations where you have a matching end character in between your match ?

It particularly problematic for characters such as  " '

Is there a LPEG general best practice for such cases ?

best wishes.




On Tue, Sep 18, 2018 at 6:10 PM joy mondal <[hidden email]> wrote:
Hi Sean !

I am still unsure how to deal with situations such as these:

""Hello""

"Hel \"\" lo" -- using \ for escaping

I would like to parse these as (Lua Tables) :

{"string",""Hello""}

{"string","Hel \"\" lo"}

I see it would be similar to the indent logic you have illustrated above, but i  cannot seem to piece together the hints.

cheers !

On Sat, Sep 15, 2018 at 11:05 AM Sean Conner <[hidden email]> wrote:
It was thus said that the Great joy mondal once stated:
> Hi Sean Conner,
>
> Thanks for the quick reply !
>
> You have answered both my questions
>
> ( I need to try and write the first example since I am still unsure about
> the LPEG code for capturing indentation - left curly and right curly are
> simple enough but I am thrown off regarding indentation ).

  Easy enough:

        indent = lpeg.P"\n" * lpeg.P" "^0

Or, if your gammar ends on a newline, then skip the initial lpeg.P"\n".  To
save the current level, you could do something like (untested):

        indent = lpeg.P"\n" * (lpeg.C(lpeg.P" "^0) * lpeg.Carg(1))
               / function(indent,info)
                   -- ------------------
                   -- if the length of indent is larger than the current
                   -- indent level, we have a new indent level
                   -- -----------------

                   if #indent > #info.identlevels[#info.identlevels] then
                     table.insert(info.identlevels,ident)
                     return "{" -- simulate an opening bracket or
                                -- whatever you use to indicate new indent

                   elseif #indent < #info.indentlevels[#info.identlevels] then
                     table.remove(info.identlevels)
                     return "}"

                   else
                     return " " -- just return something neutral
                   end
                 end



  This treats indent levels as either an opening brace, closing brace or
space (change to suite your needs).  But you do need to call the top level
parsing rule to:

        ast = parser:match(text,1,{ indentlevels = { "" } })

paramters to lpeg.match past the initial position argument are available via
lpeg.Carg(), and I'm using that here to keep track of some addtional
information during the parse (you could skip this and keep this info in
globals, but I dislike globals as much as possible).  Here, the indentlevels
array is just a stack of seen indents.  If we get an indent that is longer
than the current one, we have a new level, and if it's shorter, we've ended
a level, and if it matches, we're still in the current level.

  If you want to handle tabs as spaces, you can do it, but it can get
complicated.

> For the first issue, yes I need to create a hierarchical tree instead of a
> flat output, normally examples of lexer output online show a stream of
> tokens, but LPEG creates an AST directly.

  Oh, LPeg can create a stream of tokens---I've had to do that type of stuff
in certain circumstances.  It's not hard, but you do have to track a bit
more information:

        local parser = lpeg.C( --[[ LPeg code ]]-- ) * lpeg.Cp()

        local text = " ... code to parse here ... "
        local pos  = 1
        local info = { --[[ additional information used for parsing ]]-- }

        while pos <= #text do
          local token,newpos = parser:match(text,pos,info)
          if not token then
            error "Error parsing"
          end
          -- process token
          pos = newpos
        end

Basically, the parser bit will return the next logical token and the
position to resume parsing the text for the next token.

  -spc

Reply | Threaded
Open this post in threaded view
|

Re: minor LPEG doubt

Sean Conner
It was thus said that the Great joy mondal once stated:

> Hi Sean !
>
> After reading my mail I think I wasn't clear enough.
>
> How do you deal with situations where you have a matching end character in
> between your match ?
>
> It particularly problematic for characters such as  " '
>
> Is there a LPEG general best practice for such cases ?

  I don't know about best practices, I tend to use what works for me.

  With that said, your example:

        Hello Hel "" lo

(that's what it would appear like to LPeg---a stream of characters)

I would ask---what's the rule?  What is that situation?  Beacuse that's a
situation I haven't had to deal with.  About the closest situation I can
think of is something like:

        'She said she can't come to the party.'

  Here, whether the ' is a string marker or part of a word is context
dependent.  You can either require an escape character:

        'She said she can\'t come to the party.'

or you can define parsing rules like (in English):

        * If a "space" character preceeds a single quote, treat it as a start
          of a string.  

        * If, in a string, you come across a single quote preceeded by an
          alphabetic character (A-Z, a-z) and immediately followed by a an
          alphabetic characeter, treat it as part of the word (or string).

        * If, in a string, you come across a single quote, followed by a
          "space" or "punctuation" character, treat it as the end of the
          string.

  There are other cases you might have to consider:

        ''Tis but a stratch!'

        'We ain't got nuttin' to say, copper!'

  Now the question becomes, do you want a context-free parse, or a
context-aware parse?  A context-aware parse might have to include quite a
number of exceptions.  There is no right or wrong answer here.

  -spc


Reply | Threaded
Open this post in threaded view
|

Re: minor LPEG doubt

William Ahern
In reply to this post by joy mondal
On Tue, Sep 18, 2018 at 07:40:05PM +0600, joy mondal wrote:

> Hi Sean !
>
> After reading my mail I think I wasn't clear enough.
>
> How do you deal with situations where you have a matching end character in
> between your match ?
>
> It particularly problematic for characters such as  " '
>
> Is there a LPEG general best practice for such cases ?

How would you do it if you wrote a parser by hand? You'd change the set of
characters that you matched when inside a quoted strong, either by (a) using
a new map of allowed characters to consume or (b) calling a new function
which accepted a narrower class of characters.

You do the same thing, logically, with a PEG.

  local blank = P"\n\r \v\t"
  local special = P"\\" + P'"' + blank
  local escaped = P"\\" * C(P(1)) -- explicit capture to elide blackslash
  local literal = (P(1) - special) + escaped
  local simplestring = C(literal^1)
  local quotedstring = P'"' * C((literal + (special - P'"')))^0) * P'"'
  local string = (simplestring + quotedstring)^1

In the above, simplestring captures anything that isn't special (quote,
blackslash or blank). A quotedstring captures a different class of
characters--a superset of simplestring that is everything except an
unescaped quote. (There are many ways to reformulate the above, which
happens to be untested, FWIW.)

Notice the use of predicates: the minus operator creates an expression that
matches the first operand iff the second operand does NOT match, effectively
creating a new character class by subtraction.

And notice the use of ordered choice: PEGs are evaulated left-to-right. The
first matching choice short-circuits the subsequent choices. quotedstring
wouldn't work if we swapped it like so

  P'"' * C(((special - P'"') + literal))^0) * P'"'

because backslashes would be consumed by `special - P'"'` instead of being
captured and transformed by `literal`. Ordered choice is often a convenience
that allows us to be more concise (we don't need to explicitly exclude
blackslash-escaped sequences from the second choice), but sometimes
necessary. (Perhaps there's some deep equivalency between ordered choice and
predicates?)

Predicates and ordered choice are integral to PEGs. It's why they work and
why subexpressions can be composed consistently, permitting LPeg's powerful
capturing features. Extensions to regular expression engines support
zero-width assertions, which are similar to predicates. But ordered choice
is the opposite behavior of regular expressions, which will either (a)
backtrack and choose the "best fit" subexpression or (b) compose in
unpredictable ways when subexpressions match overlapping segments.

Ordered choice also reflects how greediness is subtly different between PEGs
and regular expressions--for regular expressions greediness is a property of
the entire match, whereas for PEGs greediness is a property of each
individual subexpression. If a subexpression's greediness would cause the
entire match to fail, the PEG will simply fail, whereas a regular expression
will behave as-if a subexpression is only as greedy as possible without
preventing a successful match. This effort to make greediness "best fit" is
why when you join two subexpressions in a regular expression you can get
surprising results that seemingly change the matching behavior of a
subexpression.

Once you tweak regular expressions to become as powerful as PEGs, you've
simply reinvented PEGs, and in particular necessarily implemented predicates
and ordered choice.

Reply | Threaded
Open this post in threaded view
|

Re: minor LPEG doubt

Albert Chan


On Sep 18, 2018, at 2:34 PM, William Ahern <[hidden email]> wrote:

> On Tue, Sep 18, 2018 at 07:40:05PM +0600, joy mondal wrote:
>
> local blank = P"\n\r \v\t"
> local special = P"\\" + P'"' + blank

definition of blank is wrong, should be lpeg.S


> Once you tweak regular expressions to become as powerful as PEGs, you've
> simply reinvented PEGs, and in particular necessarily implemented predicates
> and ordered choice.

Where is your question ?

BTW, LPEG and regular expression is totally different.
You can not tweak a regular expression, and turn to LPEG ...




Reply | Threaded
Open this post in threaded view
|

Re: minor LPEG doubt

Roberto Ierusalimschy
> BTW, LPEG and regular expression is totally different.
> You can not tweak a regular expression, and turn to LPEG ...

Yes, you can. (But you really have to tweak it...)

-- Roberto