LPeg indent/whitespace based grammar

LPeg indent/whitespace based grammar

Matthias Dörfelt-2

attachment0 (12 bytes) Download Attachment
encrypted.asc (3K) Download Attachment

Re: LPeg indent/whitespace based grammar

Matthias Dörfelt-2
For some reason the initial email shows up as encrypted in my mail client (which I am pretty sure is a bug), can you guys see it? :)

Sorry about that. Thanks!

On Jul 6, 2017, at 17:11, Matthias Dörfelt <[hidden email]> wrote:

Hi All,

I am trying to write a lexer using LPeg for a very simple markup language that is closely inspired by https://pugjs.org/ . It uses indentation to represent a hierarchy, which as far as I can tell is not super easily dealt with in LPeg. Here is a basic working prototype just to see if I can get the hierarchical aspect to work.

It parses the following input:

el
    child1
        child11
    child2

and puts its basic AST representation into a lua table that looks something like this:

{
    children =
    {
        name = el
        children =
        {
            name = child1
            children =
            {
                name = child11
            }
        },
        {
            name = child2
        }
    }
}

While my implementation works, I feel like I am circumventing a lot of the actual capturing mechanisms of LPeg in order to deal with back-referencing to build the hierarchy. I’d like to get some feedback on whether there are any better ways to approach this kind of grammar using LPeg. I am new to it in general so I might be missing something obvious here. Thank you very much!! Here is my Lua code; it should just run if you have LPeg installed:

local lpeg = require("lpeg")

-- some basic lpeg helpers
local Newline = lpeg.S("\n\r")
local Indent = lpeg.P("    ") -- indent is just four spaces for now

-- helpers for back referencing and building
-- the final hierarchy.
local root = {children = {}}
local currentParent = root
local lastNode = root
local currentIndent = 0

-- helper to do all the indentation trickery that needs
-- to be applied to each line of the input
local function handleLine(_str)
    -- compute the current indent level of the line
    local idx = (Indent ^ 0):match(_str)
    local id = _str:sub(1, idx - 1)
    local level = id:len() / 4

    -- check if the parent changed based on the indent level
    local parent = currentParent
    if level > currentIndent then
        -- we went one level deeper
        parent = lastNode
    elseif level < currentIndent then
        -- we returned to an earlier level; step up once per level dropped
        for _ = 1, currentIndent - level do
            parent = parent.parent
        end
    end

    -- set the current parent and level
    currentParent = parent
    currentIndent = level
    assert(currentParent)

    -- create the new node
    local name = _str:sub(idx)
    local node = {parent = currentParent, children = {}, name = name}
    -- and insert it into the hierarchy
    table.insert(currentParent.children, node)
    lastNode = node

    return name
end

-- pattern to detect individual lines and call handleLine on the resulting line
local Line = ((Newline ^ 0 * lpeg.C((1 - Newline) ^ 1) / handleLine)) ^ 0

-- the test string to lex
local testString = "el\n    child1\n        child11\n    child2"

-- perform the lexing
-- NOTE: we are not using any captures here right now as we basically
-- do all the capturing manually in handleLine :(
Line:match(testString)

-- helper to print the hierarchy
local function printHierarchy(_node, _indent)
    local indentStr = string.rep("    ", _indent)
    print(indentStr .. "{")
    if _node.name then print(indentStr .. "    name = " .. _node.name) end
    if #_node.children > 0 then
        print(indentStr .. "    children = ")
        for _, v in ipairs(_node.children) do
            printHierarchy(v, _indent + 1)
        end
    end
    print(indentStr .. "},")
end

-- print the resulting hierarchy
printHierarchy(root, 0)

Matthias
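
The same line-by-line bookkeeping can also be done with an explicit stack of open nodes, which copes with indentation dropping by several levels in one step. This is only a minimal sketch in plain Lua (no LPeg), assuming four spaces per level as in the post; buildTree and stack are names made up for this illustration and are not part of the original code.

```lua
-- Build a tree from indented lines using a stack of open nodes.
-- Assumes four spaces per level; buildTree is a hypothetical helper.
local function buildTree(text)
    local root = {children = {}}
    -- stack[i] is the most recent node at depth i - 1; stack[1] is the root
    local stack = {root}
    for line in text:gmatch("[^\r\n]+") do
        local spaces, name = line:match("^( *)(.*)$")
        local level = math.floor(#spaces / 4)
        -- a line may be at most one level deeper than the previous line
        assert(level + 1 <= #stack, "unexpected indent: " .. line)
        -- close every node that sits at the same depth or deeper
        while #stack > level + 1 do
            table.remove(stack)
        end
        local node = {name = name, children = {}}
        table.insert(stack[#stack].children, node)
        stack[#stack + 1] = node
    end
    return root
end

local tree = buildTree("el\n    child1\n        child11\nother")
print(tree.children[1].children[1].children[1].name) -- child11
print(tree.children[2].name)                         -- other
```

Because the stack is popped in a loop, the last line above returns from depth 2 straight to depth 0 and still ends up under the root.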



[OT] Re: LPeg indent/whitespace based grammar

Petite Abeille

> On Jul 7, 2017, at 6:23 AM, Matthias Dörfelt <[hidden email]> wrote:
>
> For some reason the initial email shows up as encrypted in my mail client (which I am pretty sure is a bug), can you guys see it? :)

Indeed:

Content-Type: multipart/encrypted;
 boundary="Apple-Mail=_5EB37EA8-AE23-4B38-B3F5-14F07FBEF99D";
 protocol="application/pgp-encrypted"

Content-Type: application/pgp-encrypted

Rogue GPG on your side?




Re: LPeg indent/whitespace based grammar

Duncan Cross
In reply to this post by Matthias Dörfelt-2
On Fri, Jul 7, 2017 at 5:23 AM, Matthias Dörfelt <[hidden email]> wrote:
> While my implementation works, I feel like I am circumventing a lot of the
> actual capturing mechanisms of LPeg in order to deal with back-referencing
> to build the hierarchy. I’d like to get some feedback on whether there are
> any better ways to approach this kind of grammar using LPeg. I am new to it
> in general so I might be missing something obvious here.

Here's how I would approach it.

First, some helper patterns:


--
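local lpeg = require('lpeg')

local initial_indent = lpeg.Cg(lpeg.P(''), 'indent')

local match_same_indent = lpeg.Cmt(
  lpeg.Cb('indent'),
  function(s, p, indent)
    -- compare against the back-referenced "indent" value (this line was
    -- originally posted with the undefined name current_indent; see the
    -- correction later in the thread)
    if s:sub(p, p + #indent - 1) ~= indent then
      return false
    end
    return p + #indent
  end)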

local capture_deeper_indent = lpeg.Cg(
  match_same_indent * lpeg.S' \t'^1,
  'indent')
--


initial_indent captures an empty string and stores it as the named
group "indent".

match_same_indent checks that the immediate next set of characters is
identical to the current contextual value of the named group "indent".

capture_deeper_indent requires that the current indent matches at this
position, and is immediately followed by at least one more whitespace
character. If it succeeds, the full new indent is stored as a new
named group capture for "indent".
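
To see the Cg/Cb/Cmt mechanism in isolation, here is a minimal sketch that is separate from the hierarchy grammar. It assumes a toy input of lowercase words, one per line; capture_indent, same_indent, word and block are names invented for this illustration.

```lua
local lpeg = require("lpeg")

-- store the leading whitespace of the first line as the named group "indent"
local capture_indent = lpeg.Cg(lpeg.S(" \t")^0, "indent")

-- succeed only if the text at position p starts with the stored indent
local same_indent = lpeg.Cmt(lpeg.Cb("indent"), function(s, p, indent)
    if s:sub(p, p + #indent - 1) ~= indent then
        return false
    end
    return p + #indent
end)

local word = lpeg.C(lpeg.R("az")^1)

-- a block: one line that fixes the indent, then lines that repeat it exactly
local block = lpeg.Ct(capture_indent * word * ("\n" * same_indent * word)^0) * -1

local t = block:match("  a\n  b\n  c")
print(t and #t)                   -- 3
print(block:match("  a\n    b"))  -- nil (second line is indented deeper)
```

Since the named group sits inside the table capture, the stored indent also shows up as the field t.indent.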

To create the actual hierarchy document grammar, I would turn to the
"re" module, passing in the helper patterns via the "defs" table
parameter:


--
local re = require('re')

local hierarchy = re.compile([[

  hierarchy <- %initial_indent {|
    {:children: {| (<first_root> <next_root>*)? |} :}
  |} !.

  first_root <- <entry>
  next_root <- %nl <entry>

  entry <- {|
    {:name: <name> :}
    {:children: {| (<first_child> <next_child>*)? |} :}
  |}

  first_child <- %nl %capture_deeper_indent <entry>
  next_child <- %nl %match_same_indent <entry>

  name <- (!%nl .)+

]], {
  initial_indent = initial_indent;
  match_same_indent = match_same_indent;
  capture_deeper_indent = capture_deeper_indent;
})
--


To explain the elements of this grammar in English:

<hierarchy>:
  - capture the empty string as an initial base value for "indent"
  - create a table with a "children" sub-table
  - populate "children" with either no values, or <first_root>
followed by zero or more <next_root>s
  - make sure there is no more input

<first_root>: same as <entry>

<next_root>: newline character followed by <entry>

<entry>: a table containing
  - a <name> captured as field "name"
  - a sub-table "children" containing either no values, or
<first_child> followed by zero or more <next_child>ren

<first_child>:
  - newline character
  - capture a new, deeper indent than the current contextual indent
  - <entry>

<next_child>:
  - newline character
  - match the current contextual indent
  - <entry>

<name>: one or more non-newline characters


A couple of quirks to note:

1. This implementation does not enforce indentation in sets of four
spaces, but instead any arbitrary set of spaces and tabs. Hopefully it
should not be too difficult to modify capture_deeper_indent to be
strict about this, if you want.

2. The "indent" value ends up retained as a field inside "children"
tables in the final result. For example:

--
print(hierarchy:match('a\n    b').children[1].children.indent)
--

...will print four spaces. If this is an undesirable side-effect, a
function capture could be used to remove this field once the
"children" table is fully captured.
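
Both quirks could be addressed roughly as follows. This is a sketch under stated assumptions, written in plain LPeg rather than re: strip_indent and children_of are hypothetical helper names, and the rule that a name may not start with a blank is an addition not found in the grammar above (without it, a badly indented line would be accepted as a root entry whose name begins with spaces).

```lua
local lpeg = require("lpeg")
local P, S, V = lpeg.P, lpeg.S, lpeg.V
local Cg, Cb, Ct, Cmt = lpeg.Cg, lpeg.Cb, lpeg.Ct, lpeg.Cmt

local initial_indent = Cg(P(""), "indent")

local match_same_indent = Cmt(Cb("indent"), function(s, p, indent)
  if s:sub(p, p + #indent - 1) ~= indent then
    return false
  end
  return p + #indent
end)

-- quirk 1: accept exactly four more spaces instead of any run of blanks
local capture_deeper_indent = Cg(match_same_indent * P("    "), "indent")

-- quirk 2: drop the bookkeeping field once a "children" table is complete
local function strip_indent(children)
  children.indent = nil
  return children
end

-- hypothetical helper: a "children" field holding zero or more entries
local function children_of(first, next_)
  return Cg(Ct((first * next_^0)^-1) / strip_indent, "children")
end

local hierarchy = P{
  "hierarchy";
  hierarchy = initial_indent
    * Ct(children_of(V"entry", "\n" * V"entry"))
    * P(-1);
  entry = Ct(
    Cg(V"name", "name")
    * children_of(
        "\n" * capture_deeper_indent * V"entry",
        "\n" * match_same_indent * V"entry"));
  -- assumption: a name must not begin with a space, tab or newline
  name = (1 - S(" \t\n")) * (1 - P("\n"))^0;
}

local tree = hierarchy:match("el\n    child1\n        child11\n    child2")
print(tree.children[1].children[2].name)  -- child2
print(tree.children[1].children.indent)   -- nil
print(hierarchy:match("a\n  b"))          -- nil (two-space indent rejected)
```

The function capture only runs after the whole match has succeeded, so removing the field does not interfere with the back references used while matching.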


-Duncan


Re: LPeg indent/whitespace based grammar

Matthias Dörfelt-2
Awesome, thanks that is super helpful!
Haven’t looked at the “re” portion of lpeg, yet. I was initially gonna stick with pure LPeg, but let’s see!

Is it pretty straightforward to express your re code in pure Lua & LPeg?

Thanks!


Re: LPeg indent/whitespace based grammar

Duncan Cross
On Fri, Jul 7, 2017 at 9:09 PM, Matthias Dörfelt <[hidden email]> wrote:
> Awesome, thanks that is super helpful!
> Haven’t looked at the “re” portion of lpeg, yet. I was initially gonna stick with pure LPeg, but let’s see!
>
> Is it pretty straightforward to express your re code in pure Lua & LPeg?
>
> Thanks!

Sure, I think this should be the equivalent:


--
local hierarchy = lpeg.P {

  'hierarchy';

  hierarchy = initial_indent
    *
    lpeg.Ct(
      lpeg.Cg(
        lpeg.Ct(
          (
            lpeg.V('first_root')
            *
            lpeg.V('next_root')^0
          )^-1
        ),
        'children'
      )
    )
    *
    lpeg.P(-1);

  first_root = lpeg.V('entry');

  next_root = '\n' * lpeg.V('entry');

  entry = lpeg.Ct(
    lpeg.Cg(lpeg.V('name'), 'name')
    *
    lpeg.Cg(
      lpeg.Ct(
        (
          lpeg.V('first_child')
          *
          lpeg.V('next_child')^0
        )^-1
      ),
      'children'
    )
  );

  first_child = '\n' * capture_deeper_indent * lpeg.V('entry');
  next_child = '\n' * match_same_indent * lpeg.V('entry');

  name = (1 - lpeg.P('\n'))^1; -- ^1 (not ^0) to match "(!%nl .)+" in the re version

}
--


-Duncan


Re: LPeg indent/whitespace based grammar

Matthias Dörfelt-2
Thank you so much! The re module looks very nice; I think I will give it a shot once I have it working with pure LPeg!

Best
Matthias


Re: LPeg indent/whitespace based grammar

Matthias Dörfelt-2
In reply to this post by Duncan Cross
Hi Duncan,

I just tried your implementation and I can’t really get it to work. My main question is where current_indent inside match_same_indent comes from. I tried some different things based on guessing without success.

Thanks,
Matthias


Re: LPeg indent/whitespace based grammar

Duncan Cross
On Sat, Jul 8, 2017 at 2:34 AM, Matthias Dörfelt <[hidden email]> wrote:
> I just tried your implementation and I can’t really get it to work. My main question is where current_indent inside match_same_indent comes from. I tried some different things based on guessing without success.

Oops -- "current_indent" should have been "indent" there. Sorry about that.

-Duncan


Re: LPeg indent/whitespace based grammar

Matthias Dörfelt-2
Awesome, I think that did the trick! Thanks! Still trying to wrap my head around a lot of what's going on, but that should get me started.

Matthias


Re: LPeg indent/whitespace based grammar

Matthias Dörfelt-2
I actually do have a follow up question:

I am now trying to wrap my head around the re module and tried making a simple grammar to parse something like this

a = 'b'

and put it into a dictionary style table a la:

{ a = 'b'}

I tried to do so using table and named group capture, but I can’t for the life of me figure out how to dynamically set the name of the group capture. Will I have to write a custom helper function to achieve this or is there a pure re way?

Here is the code:

local attribute = re.compile([[
    attr <- {| {:<identifier>: <attributeValue> :} |}
    identifier <- (!%nl [a-zA-Z][a-zA-Z0-9_]*)+
    attributeValue <- (" "+ "=" " "+ "'" {<identifier>} "'")
    ]])

This is the line where I try to capture the first identifier as the name for the group capture: attr <- {| {:<identifier>: <attributeValue> :} |}

Which fails with: pattern error near ': <attributeValue> :}…'

Any thoughts welcome!

Thanks
Matthias


Re: LPeg indent/whitespace based grammar

Sean Conner
It was thus said that the Great Matthias Dörfelt once stated:

> I actually do have a follow up question:
>
> I am now trying to wrap my head around the re module and tried making a
> simple grammar to parse something like this
>
> a = 'b'
>
> and put it into a dictionary style table a la:
>
> { a = 'b'}
>
> I tried to do so using table and named group capture, but I can’t for the
> life of me figure out how to dynamically set the name of the group
> capture. Will I have to write a custom helper function to achieve this or
> is there a pure re way?

  I found *a* way.  I'm not sure if it's the *best* way, and there's no way
to translate it to use only the re module.  So, the code:

lpeg = require "lpeg"

local function doset(tab,name,value)
  if tab[name] == nil then
    tab[name] = value
  elseif type(tab[name]) == 'table' then
    table.insert(tab[name],value)
  else
    tab[name] = { tab[name] , value }
  end
  return tab
end

local token = lpeg.C(lpeg.R"!~"^1)
local pair  = lpeg.Cg(token * lpeg.P" "^1 * token) * lpeg.P"\n"
local list  = lpeg.Cf(lpeg.Ct"" * pair^1,doset)

test = [[
field1 value1
field2 value2
field3 value3a
field3 value3b
field3 value3e
field4 value4
]]

x = list:match(test)

lpeg.Cf() is a folding capture, which is used to accumulate a single result
from a stream of pattern matches.  I first use lpeg.Ct() with an empty string
to obtain a table to use.  Then, pair will capture two tokens (sequences of
characters separated by spaces) and group them with lpeg.Cg().  The initial
table and the pair of captures are passed to the function doset() which
accumulates the results.  The rather convoluted nature of doset() is to
handle the case of the same field name being repeated (a case I had to
handle when I came up with this).  When you run the example above, you'll
end up with a table like:

x =
{
  field1 = "value1",
  field3 =
  {
    [1] = "value3a",
    [2] = "value3b",
    [3] = "value3e",
  },
  field4 = "value4",
  field2 = "value2",
}

  I do wish re had a folding capture syntax, but I can work around it.

  -spc
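
For the a = 'b' syntax from the question, the same fold could look roughly like the sketch below. It assumes single-quoted values with no escape sequences and swaps doset() for Lua's built-in rawset, so a repeated key simply overwrites the earlier value; space, key, value, pair and attrs are names chosen for this illustration.

```lua
local lpeg = require("lpeg")
local P, R, S = lpeg.P, lpeg.R, lpeg.S
local C, Cg, Cf, Ct = lpeg.C, lpeg.Cg, lpeg.Cf, lpeg.Ct

local space = S(" \t")^0
local key   = C(R("az", "AZ") * R("az", "AZ", "09", "__")^0)
local value = "'" * C((1 - P("'"))^0) * "'"

-- each pair is grouped so that the fold function receives (table, key, value)
local pair  = Cg(key * space * "=" * space * value) * S(" \t\n")^0

-- start from an empty table and let rawset store every pair in it
local attrs = Cf(Ct("") * pair^0, rawset) * -1

local t = attrs:match("a = 'b'\nname = 'el'")
print(t.a)     -- b
print(t.name)  -- el
```

rawset returns the table it was given, which is what lets it serve directly as the folding function.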



Re: LPeg indent/whitespace based grammar

Duncan Cross
In reply to this post by Matthias Dörfelt-2
On Sat, Jul 8, 2017 at 8:12 PM, Matthias Dörfelt <[hidden email]> wrote:
> I tried to do so using table and named group capture, but I can’t for the
> life of me figure out how to dynamically set the name of the group capture.
> Will I have to write a custom helper function to achieve this or is there a
> pure re way?

You do need to use a function for this -- re is only a thin layer on
top of the same basic primitives as lpeg itself, and lpeg.Cg() takes a
constant value, not a pattern capture, for the group name specifier.

One approach is to capture an array of key/value pair objects, and
then use a function capture to transform this array into an object
with named fields. Something like this:

--
local re = require 're'

local fields = re.compile([[
  fields <- {| <kvpair>* !. |} -> kvpairs_to_fields

  kvpair <- {| %s* {:key: <key> :} %s* '=' %s* {:value: <value> :} %s* |}
  key <- [a-zA-Z_] [a-zA-Z_0-9]*
  value <- <single_quoted> / <double_quoted>
  single_quoted <- ['] { [^']* } [']
  double_quoted <- ["] { [^"]* } ["]
]], {
  kvpairs_to_fields = function(kvpairs)
    local attrs = {}
    for _, kv in ipairs(kvpairs) do
      attrs[kv.key] = kv.value
    end
    return attrs
  end;
})

for k,v in pairs(fields:match[[ a="one" b = 'two' c="three" ]]) do
  print(k, v)
end
--

-Duncan
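
For anyone staying with plain LPeg, the approach above (collect key/value tables, then convert them in a function capture) might be written roughly as follows. This is a sketch rather than a verified translation of the re grammar: ws stands in for %s and is assumed to cover spaces, tabs and line breaks only, and the pattern names are chosen for this illustration.

```lua
local lpeg = require("lpeg")
local P, R, S = lpeg.P, lpeg.R, lpeg.S
local C, Cg, Ct = lpeg.C, lpeg.Cg, lpeg.Ct

local ws    = S(" \t\r\n")^0
local key   = R("az", "AZ", "__") * R("az", "AZ", "09", "__")^0
local value = "'" * C((1 - P("'"))^0) * "'"
            + '"' * C((1 - P('"'))^0) * '"'

-- one {key = ..., value = ...} table per pair; the group names are constants
local kvpair = Ct(ws * Cg(key, "key") * ws * "=" * ws * Cg(value, "value") * ws)

-- function capture: turn the array of pairs into a table keyed by name
local function kvpairs_to_fields(kvpairs)
  local attrs = {}
  for _, kv in ipairs(kvpairs) do
    attrs[kv.key] = kv.value
  end
  return attrs
end

local fields = (Ct(kvpair^0) * -1) / kvpairs_to_fields

local f = fields:match([[ a="one" b = 'two' ]])
print(f.a)  -- one
print(f.b)  -- two
```

The dynamic part (using a captured string as a table key) happens entirely inside kvpairs_to_fields, in line with the point that the name given to lpeg.Cg() is fixed when the pattern is built.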


Re: LPeg indent/whitespace based grammar

Matthias Dörfelt-2
In reply to this post by Sean Conner
Thanks, that looks like one way of getting this to work!


Re: LPeg indent/whitespace based grammar

Matthias Dörfelt-2
In reply to this post by Duncan Cross
Awesome, thanks! Reading your examples really helps me grasp most of the core ideas of LPeg and re!
