tables as parsers

tables as parsers

Soni "They/Them" L.
I'm rolling with this weird idea: is it possible to make a simple parser
out of tables? I tried some stuff but ran into a few issues. I don't
wanna use LPeg or anything more complicated than string.find(foo, bar,
1, true).

e.g.:

local ltk = {} -- lua_tokens
ltk.string = {}
ltk['"'] = "string"
ltk["'"] = "string" -- how to handle end of string correctly?
ltk.string['\\'] = "escape"
ltk.string.escape = {}
ltk.string.escape['z'] = {}
ltk.string.escape['z'].next = ltk.string.escape['z']
for _, v in ipairs({" ", "\t", "\n", "\r", "\v", "\f"}) do -- whitespace \z skips
   ltk.string.escape['z'][v] = "next"
end
ltk.string.escape['z'][''] = ltk.string -- '' as a fallback edge; not sure
                                        -- if there'd be a better way to do this
-- etc
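
A minimal driver loop for a table in this shape might look like the
sketch below. It only recognizes input, it doesn't emit tokens, and it
assumes a few things that aren't settled above: states are tables keyed
by single characters, a string value names a sibling entry in the same
state table, and '' is the fallback edge.

local function run(states, input)
   local state = states -- start in the root state
   for i = 1, #input do
      local c = input:sub(i, i)
      local nxt = state[c] or state[''] -- '' acts as the else-edge
      if type(nxt) == "string" then
         nxt = state[nxt] -- a string value names a sibling state
      end
      if nxt == nil then
         return nil, "no transition at position " .. i
      end
      state = nxt
   end
   return state -- the state the input ended in
end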

Re: tables as parsers

Sean Conner
It was thus said that the Great Soni They/Them L. once stated:
> I'm rolling with this weird idea. is it possible to make a simple parser
> out of tables?

  Yes.  There's an entire book dedicated to this topic---_Compilers:
Principles, Techniques, and Tools_, [1] also known as The Dragon Book
(because of the cover).  It doesn't use Lua tables per se, but the lexing
portion is driven completely by a 2D array that represents a state machine.

  The other book I'm familiar with is _Compiler Design in C_ [2].  It too,
uses a table to drive the lexing.
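
  In Lua terms that 2D array is just a nested table,
delta[state][class]. A tiny fragment in that style (an integer
recognizer; the states and character classes are made up for
illustration, not taken from either book):

local DIGIT, OTHER = 1, 2
local accept = { [2] = true }
local delta = {
  -- [state] = { [DIGIT], [OTHER] }
  [1] = { 2, 3 }, -- start: a digit begins a number
  [2] = { 2, 3 }, -- in a number: more digits keep it going
  [3] = { 3, 3 }, -- dead state: the input is not an integer
}

local function is_integer(s)
  local state = 1
  for i = 1, #s do
    local b = s:byte(i)
    local class = (b >= 48 and b <= 57) and DIGIT or OTHER -- '0'..'9'
    state = delta[state][class]
  end
  return #s > 0 and accept[state] ~= nil
end

assert(is_integer("42") and not is_integer("4x2"))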

> I tried some stuff but ran into a few issues. I don't
> wanna use LPeg or anything more complicated than string.find(foo, bar,
> 1, true).

  You might be better off learning LPEG.

  -spc (But that's just my opinion)

[1] https://en.wikipedia.org/wiki/Compilers:_Principles,_Techniques,_and_Tools

[2] https://holub.com/compiler/

Re: tables as parsers

Soni "They/Them" L.


On 2019-03-27 9:03 p.m., Sean Conner wrote:

> [book recommendations snipped]
>
>    You might be better off learning LPEG.

It's not that I don't know LPEG (I could write an LPEG re grammar for
this in maybe an hour at most; no need to write something to parse the
table structures and stuff).

I just can't use it.

(Besides, this is a lot cuter than an LPEG re, IMO.)

Re: tables as parsers

Coda Highland
On Wed, Mar 27, 2019 at 6:47 PM Soni "They/Them" L. <[hidden email]> wrote:
> I'm rolling with this weird idea: is it possible to make a simple parser
> out of tables? [table sketch snipped]

Not a bad start for someone who's never done this formally before.

There are a zillion different ways to define a parser. The Dragon Book recommendation is well-given.

I think the insight you're missing is that the table really represents the structure of a state machine. Your top-level table shouldn't contain things like ["'"] = "string" directly; rather, you'd want something like ltk.base["'"]. Then you wouldn't have ltk.string.escape, you'd just have ltk.string and ltk.escape. The nesting doesn't belong in the table itself; instead, your parser will maintain a stack of states, and you push a state when you start into a parsing rule and you pop it off when you've finished that rule.
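
A rough sketch of that flattened layout. The state names, the push/pop
edge format, and the driver below are all made up for illustration, not
a finished design:

local ltk = {}
ltk.base   = { ['"']  = { push = "string" } }
ltk.string = { ['\\'] = { push = "escape" },
               ['"']  = { pop = true } }  -- closing quote ends the rule
ltk.escape = { default = { pop = true } } -- consume one escaped character

local function lex(states, input)
  local stack = { "base" }
  for i = 1, #input do
    local c = input:sub(i, i)
    local st = states[stack[#stack]]
    local edge = st[c] or st.default
    if edge and edge.push then
      stack[#stack + 1] = edge.push -- entering a nested rule
    elseif edge and edge.pop then
      stack[#stack] = nil           -- finished this rule, back to the outer one
    end
    -- characters with no edge just stay in the current state
  end
  return stack -- back to { "base" } iff the input closed everything it opened
end

(A real lexer would also record where each token starts and ends; this
only tracks the state stack.)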

The place where this is suddenly going to get complicated is if your grammar contains any prefix ambiguities (e.g. "fork" vs "forward"), because then you have to implement backtracking. For most simpler grammars it's best to just break it into two passes -- one that makes simple lexical tokens out of a string of characters, and one that evaluates the syntax out of a list of lexical tokens.
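
Schematically, the two passes might look like this (the token shapes
and the toy grammar are hypothetical):

-- pass 1: characters -> tokens
local function tokenize(s)
  local tokens = {}
  for word in s:gmatch("%S+") do -- crude: split on whitespace
    if word:match("^%d+$") then
      tokens[#tokens + 1] = { type = "number", value = tonumber(word) }
    else
      tokens[#tokens + 1] = { type = "word", value = word }
    end
  end
  return tokens
end

-- pass 2: tokens -> syntax; accept a list of "word number" pairs
local function parse(tokens)
  for i = 1, #tokens, 2 do
    if tokens[i].type ~= "word"
       or not tokens[i + 1] or tokens[i + 1].type ~= "number" then
      return nil, "expected a word followed by a number"
    end
  end
  return true
end

assert(parse(tokenize("repeat 3 skip 1")))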

I don't mind offering assistance with this. I've actually built a complete standards-compliant parser for C++ from first principles before. I think it's pretty fun.

/s/ Adam

Re: tables as parsers

Soni "They/Them" L.


On 2019-03-27 10:10 p.m., Coda Highland wrote:

> [snip]
> I think the insight you're missing is that the table really represents
> the structure of a state machine. Your top-level table shouldn't
> contain things like ["'"] = "string" directly; rather, you'd want
> something like ltk.base["'"]. Then you wouldn't have
> ltk.string.escape, you'd just have ltk.string and ltk.escape. The
> nesting doesn't belong in the table itself; instead, your parser will
> maintain a stack of states, and you push a state when you start into a
> parsing rule and you pop it off when you've finished that rule.

Yeah, but then I wouldn't have self-referential tables. Besides, I
wasn't planning on having a stack. FSMs (the same class as regexes)
don't have a stack.

>
> The place where this is suddenly going to get complicated is if your
> grammar contains any prefix ambiguities (e.g. "fork" vs "forward"),
> because then you have to implement backtracking. For most simpler
> grammars it's best to just break it into two passes -- one that makes
> simple lexical tokens out of a string of characters, and one that
> evaluates the syntax out of a list of lexical tokens.

"fork" vs "forward" is unambiguous and doesn't require backtracking.

Re: tables as parsers

dyngeccetor8
On 28/03/2019 02.39, Soni "They/Them" L. wrote:

> I'm rolling with this weird idea: is it possible to make a simple parser
> out of tables? [sketch snipped]

I agree with Coda Highland: a nice try.

Here is how I use Lua tables to parse Lua strings: [1]. It is not a
finite-state-machine approach; it uses tables as lists. But that was
enough to describe the Lua 5.3 syntax and to create a parser with a
fixed number of base functions.

FSMs are still a good way to parse less complicated things, like quoted
CSV strings or Lua numbers. You might try your approach on those first.
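
Very roughly, the flavor of the tables-as-lists approach (the
combinator names and the match function below are illustrative, not the
actual code from [1]):

local function seq(...) return { op = "seq", ... } end
local function alt(...) return { op = "alt", ... } end

-- match(rule, s, pos) -> position after the match, or nil
local function match(rule, s, pos)
  if type(rule) == "string" then -- literal string
    return s:sub(pos, pos + #rule - 1) == rule and pos + #rule or nil
  elseif rule.op == "seq" then   -- every element, in order
    for _, r in ipairs(rule) do
      pos = match(r, s, pos)
      if not pos then return nil end
    end
    return pos
  else                           -- "alt": first alternative that fits
    for _, r in ipairs(rule) do
      local p = match(r, s, pos)
      if p then return p end
    end
    return nil
  end
end

-- e.g. a quoted CSV field terminator: ',' or a line break
local field_end = alt(",", seq("\r", "\n"), "\n")
assert(match(field_end, "\r\n", 1) == 3)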

[1]:
https://github.com/martin-eden/workshop/blob/master/formats/lua/syntax/type_string.lua

-- Martin

Re: tables as parsers

Coda Highland

On Wed, Mar 27, 2019 at 9:06 PM Soni "They/Them" L. <[hidden email]> wrote:


> Yeah, but then I wouldn't have self-referential tables. Besides, I
> wasn't planning on having a stack. FSMs (the same class as regexes)
> don't have a stack.

FSMs can't handle parentheses. String escapes are about the limit of what they can manage.
 
> "fork" vs "forward" is unambiguous and doesn't require backtracking.

Derp. That's my mistake. You're right, that one's good.

/s/ Adam 

Re: tables as parsers

Viacheslav Usov
On Thu, Mar 28, 2019 at 2:30 PM Coda Highland <[hidden email]> wrote:

> FSMs can't handle parentheses.

This is not true unless you qualify this further. (Hint: any general-purpose computer is an FSM)

Cheers,
V.

Re: tables as parsers

Coda Highland


On Thu, Mar 28, 2019 at 10:13 AM Viacheslav Usov <[hidden email]> wrote:
> This is not true unless you qualify this further. (Hint: any
> general-purpose computer is an FSM)

No, general-purpose computers are Finite Turing Machines, not Finite State Machines. These two formalisms are TWO tiers of computing power apart, with Push-Down Automata existing in between.

/s/ Adam

Re: tables as parsers

Soni "They/Them" L.


On 2019-03-28 10:21 a.m., Coda Highland wrote:

> FSMs can't handle parentheses. String escapes are about the limit of
> what they can manage.

FSMs can't handle an arbitrary number of parentheses. But you can very
much create an FSM with a nesting depth limit. It's just gonna be... a
*little* big...
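
A sketch of generating such a machine, one state per nesting level up
to a fixed limit (parens only; any other character is rejected):

local function paren_fsm(max_depth)
   local states = {}
   for d = 0, max_depth do
      states[d] = {
         ['('] = d < max_depth and d + 1 or nil, -- go deeper, if allowed
         [')'] = d > 0 and d - 1 or nil,         -- come back up, if possible
      }
   end
   return states
end

local function balanced(input, max_depth)
   local states, d = paren_fsm(max_depth), 0
   for i = 1, #input do
      d = states[d][input:sub(i, i)]
      if d == nil then return false end -- unmatched ')' or over the limit
   end
   return d == 0
end

assert(balanced("(())", 2) and not balanced("((()))", 2))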

Re: tables as parsers

Roberto Ierusalimschy
In reply to this post by Viacheslav Usov
> This is not true unless you qualify this further. (Hint: any
> general-purpose computer is an FSM)

256^(2^33) states (one for each configuration of a machine with 2^33
bytes of memory) might be finite in theory, but not in practice.

-- Roberto

Re: tables as parsers

Viacheslav Usov
On Thu, Mar 28, 2019 at 8:32 PM Coda Highland <[hidden email]> wrote:

> No, general-purpose computers are Finite Turing Machines, not Finite State Machines.

I am afraid "Finite Turing Machine" is not a commonly used term, so I
am not sure what you are saying here. Regardless, any general-purpose
computer is trivially an FSM, given its finite alphabet and state set.

> These two formalisms are TWO tiers of computing power apart, with Push-Down Automata existing in between.

The latter have infinite stacks, a luxury that current general-purpose computers cannot afford.

Coming back to the subject: it is possible to parse just about any
grammar with an FSM, given that the input string is no longer than some
fixed N characters. But it may be (far) more practical to use a
different formalism in many cases.

Cheers,
V.

Re: tables as parsers

Coda Highland
On Fri, Mar 29, 2019 at 6:54 AM Viacheslav Usov <[hidden email]> wrote:
> I am afraid "Finite Turing Machine" is not a commonly used term, so I
> am not sure what you are saying here. Regardless, any general-purpose
> computer is trivially an FSM, given its finite alphabet and state set.

Er, sorry, I meant deterministic Turing machine, not finite. However, a
general-purpose computer is not an FSM, because an FSM only has one
symbol's worth of state (the currently selected state). A PDA adds a
last-in, first-out stack. A TM replaces the stack with a seekable tape.
A general-purpose computer has random-access storage.
 
> The latter have infinite stacks, a luxury that current
> general-purpose computers cannot afford.

General-purpose computers can simulate infinite stacks/tapes using arbitrary attachable storage. If it runs out, it prompts the user to provide more storage. This means that any algorithm that can be evaluated by a TM can be evaluated by a modern general-purpose computer, with no fixed limits.
 
> Coming back to the subject: it is possible to parse just about any
> grammar with an FSM, given that the input string is no longer than
> some fixed N characters. But it may be (far) more practical to use a
> different formalism in many cases.

This is trivially true, as you can always construct an FSM with O(k^n)
states (k being the alphabet size) that hard-codes every possible valid
input of length n or less. However, that disgusting complexity means
that defining the FSM requires exponential space, which isn't STRICTLY
in contradiction with the intent of working within bounded space, but
it does a valiant job of TRYING to violate it.

/s/ Adam