Pattern matching among multiple chunks

classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

Pattern matching among multiple chunks

pocomane
I am writing a parser for a very simple configuration language. I know
that LPEG is designed for this purpose, but since it is a very simple
language, I would like to use the lua patterns only (please, do not
judge me :) ).

I have a working parser that acts on a single string, but now I would
like to extend it to get the input splitted among several chunks,
without waiting for the whole data unless it is necessary.

We consider, for example, the pattern
^{%a+}
to match against the following input:
{foobar}
We suppose it is splitted in two chunks right in the middle.

On the first chunk the match fails. But it may match if I wait for
another chunk. How I can check for this? [1]

Obviously I can use another pattern. In the example, on a fail I can
just check ^[^{] to know if I need another chunk. But in this way,
each pattern in the application should be treated separately.

Is there a generic way to solve this issue? For example, if the lua
API exposed the point where the matching stops on failure, I could
easly know if I need to wait for another chunk. But that information
is not avaiable (I think)...

pocomane.
Ciao.

[1] And, just to be clear, I want to immediately stop the parsing in
other cases, e.g. when matchin against:
fo}obar

Reply | Threaded
Open this post in threaded view
|

Re: Pattern matching among multiple chunks

Dirk Laurie-2
Have you studied how lua.c solves precisely this problem?

Op Wo. 16 Jan. 2019 om 10:16 het pocomane <[hidden email]> geskryf:
I am writing a parser for a very simple configuration language. I know
that LPEG is designed for this purpose, but since it is a very simple
language, I would like to use the lua patterns only (please, do not
judge me :) ).

I have a working parser that acts on a single string, but now I would
like to extend it to get the input splitted among several chunks,
without waiting for the whole data unless it is necessary.

We consider, for example, the pattern
^{%a+}
to match against the following input:
{foobar}
We suppose it is splitted in two chunks right in the middle.

On the first chunk the match fails. But it may match if I wait for
another chunk. How I can check for this? [1]

Obviously I can use another pattern. In the example, on a fail I can
just check ^[^{] to know if I need another chunk. But in this way,
each pattern in the application should be treated separately.

Is there a generic way to solve this issue? For example, if the lua
API exposed the point where the matching stops on failure, I could
easly know if I need to wait for another chunk. But that information
is not avaiable (I think)...

pocomane.
Ciao.

[1] And, just to be clear, I want to immediately stop the parsing in
other cases, e.g. when matchin against:
fo}obar

Reply | Threaded
Open this post in threaded view
|

Re: Pattern matching among multiple chunks

Sean Conner
In reply to this post by pocomane
It was thus said that the Great pocomane once stated:
> I am writing a parser for a very simple configuration language.

  I've used Lua itself as the configuration language (for personal and work
related projects).  It's pretty easy to do:

        configuration = {}
        local f = loadfile("myconfigfile.txt","t",configuration)
        if f then
          f()
        else
          error()
        end

        if configuration.blah ...

  And if you need access to some data, you can always make it available in
the variable 'configuration':

        configuration = { HOME = os.getenv "HOME" }

then your configuration file can reference HOME.  (spoilers) If the input
comes in piecemeal, check out load(), which can deal with that situation.
Just a thought.

> I know
> that LPEG is designed for this purpose, but since it is a very simple
> language, I would like to use the lua patterns only (please, do not
> judge me :) ).

  No judgement here, but be aware that LPeg has the same issues you are
encountering.

> I have a working parser that acts on a single string, but now I would
> like to extend it to get the input splitted among several chunks,
> without waiting for the whole data unless it is necessary.
>
> We consider, for example, the pattern
> ^{%a+}
> to match against the following input:
> {foobar}
> We suppose it is splitted in two chunks right in the middle.
>
> On the first chunk the match fails. But it may match if I wait for
> another chunk. How I can check for this? [1]
>
> [1] And, just to be clear, I want to immediately stop the parsing in
> other cases, e.g. when matchin against:
> fo}obar

  So, you are expecting

        {foobar}

in the input.  The input

        {fo)obar}

is expected to be rejected.  But when starting, all you have is

        {fo

  Tough problem.  

> Obviously I can use another pattern. In the example, on a fail I can
> just check ^[^{] to know if I need another chunk. But in this way,
> each pattern in the application should be treated separately.
>
> Is there a generic way to solve this issue? For example, if the lua
> API exposed the point where the matching stops on failure, I could
> easly know if I need to wait for another chunk. But that information
> is not avaiable (I think)...

  Reading up on Lua patterns, I don't think so.  And I don't think there's a
generic solution.  I know *of* a solution for LPeg but it requires knowing
up front that you might not have all the data at one go.  Something like:

        local lpeg = require "lpeg"
        local P , R , Cc , Cp = lpeg.P , lpeg.R , lpeg.Cc , lpeg.Cp

        local rest    = R"az"^1 * P"}"  * Cc'full-match' * Cp()
                      + R"az"^1 * P(-1) * Cc'more') * Cp()
                      + Cc'bad-match' * Cp()
        local pattern = P"{" * P(-1) * Cc('more') * Cp()
                      + P"{" * rest

        -- '*' is pattern AND pattern
        -- '+' is pattern OR pattern
        -- P"string" will match the given string literal;
        -- P(-1) will match end of string
        -- R"az"^1 matches the range "a" through "z" one or more times
        -- Cc(value) will return (as a capture) the given value
        -- Cp() will return the current position.

  So, if you call pattern:match(string), the following strings will return:

        {foobar} -> 'full-match' 9
        {fo)bar} -> 'bad-match' 4
        {fo -> 'more' 4

if you get back 'more', then you need to follow up with

        rest:match(string,position)

which will tell you if you have a full-match, bad-match or it needs more
input.

  I'm not saying you have to use LPeg, but it is the tool I reach for when
parsing, because it's more flexible (but more complex) in dealing with
input.

  -spc

Reply | Threaded
Open this post in threaded view
|

Re: Pattern matching among multiple chunks

Sean Conner
In reply to this post by pocomane
It was thus said that the Great pocomane once stated:
> I have a working parser that acts on a single string, but now I would
> like to extend it to get the input splitted among several chunks,
> without waiting for the whole data unless it is necessary.

  This is currently more interesting than anything going on at work, so let
me ask:  why can't you have the entire config file in memory?  Or at least a
line at a time?  This seems like a weird, if arbitrary, limitation.

  -spc

Reply | Threaded
Open this post in threaded view
|

Re: Pattern matching among multiple chunks

dyngeccetor8
In reply to this post by pocomane
On 1/16/19 11:16 AM, pocomane wrote:

> I have a working parser that acts on a single string, but now I would
> like to extend it to get the input splitted among several chunks,
> without waiting for the whole data unless it is necessary.
>
> We consider, for example, the pattern
> ^{%a+}
> to match against the following input:
> {foobar}
> We suppose it is splitted in two chunks right in the middle.
>
> On the first chunk the match fails. But it may match if I wait for
> another chunk. How I can check for this? [1]

I think you need to implement finite state automata if
you need to know parsing state at any character of input.

Alternatively, you may limit maximum size of "valid" caption.
Say, to 8K. And then implement sliding window buffer of that
size with stock pattern matching.

-- Martin

Reply | Threaded
Open this post in threaded view
|

Re: Pattern matching among multiple chunks

Sam Pagenkopf
Random ideas:
 - Use a heuristic to figure out if you match hit a buffer boundary, if possible. A good example would be checking for unmatched braces.
 - Chunk the file by a delimiter, if you can, that guarantees it will be parse-able. The "lines" method might solve your problem. This could impose format limits.
 - Is this for parsing something being sent over a network? Try parsing it before sending it over.
 - Make your own finite-state machine that can work on chunks of input.
 - Make a library that generates the above automatically using lua-style regex.

On Wed, Jan 16, 2019, 2:52 PM dyngeccetor8 <[hidden email] wrote:
On 1/16/19 11:16 AM, pocomane wrote:
> I have a working parser that acts on a single string, but now I would
> like to extend it to get the input splitted among several chunks,
> without waiting for the whole data unless it is necessary.
>
> We consider, for example, the pattern
> ^{%a+}
> to match against the following input:
> {foobar}
> We suppose it is splitted in two chunks right in the middle.
>
> On the first chunk the match fails. But it may match if I wait for
> another chunk. How I can check for this? [1]

I think you need to implement finite state automata if
you need to know parsing state at any character of input.

Alternatively, you may limit maximum size of "valid" caption.
Say, to 8K. And then implement sliding window buffer of that
size with stock pattern matching.

-- Martin

Reply | Threaded
Open this post in threaded view
|

Re: Pattern matching among multiple chunks

pocomane
On Wed, Jan 16, 2019 at 11:24 AM Dirk Laurie <[hidden email]> wrote:
> Have you studied how lua.c solves precisely this problem?

Can you be more specific? It does not seem to me that lua uses pattern
matching (nor on multiple chunks). If you are talking about the lua
lexer... well, I do not want to implement a proper lexer. Here I was
just searching for a way to detect if a failing pattern have some
chance to match when data is appended to the input

On Wed, Jan 16, 2019 at 8:02 PM Sean Conner <[hidden email]> wrote:
>   I've used Lua itself as the configuration language (for personal and work
> related projects).  It's pretty easy to do:

Me to... but now I need something more "Ini" like...

On Wed, Jan 16, 2019 at 9:16 PM Sean Conner <[hidden email]> wrote:
>   This is currently more interesting than anything going on at work, so let
> me ask:  why can't you have the entire config file in memory?  Or at least a
> line at a time?  This seems like a weird, if arbitrary, limitation.

Yes, I agree with you that it is quite arbitrary for the described
scenario. However, I am investigating on the possibility to use the
same format for data insertion/transfer/storage. So keep all in memory
could not be feasible. Splitting the chunks on specific token (e,g,
new line) could work, but I have some patterns that spawn across
multiple lines, so at least that ones must be handled in a special
way.

On Thu, Jan 17, 2019 at 4:59 AM Sam Pagenkopf <[hidden email]> wrote:
>  - Make a library that generates the above automatically using lua-style regex.

If I have to write a C library, probably the quickest solution is just
to modify the lua pattern matching library in order to return a the
end position in case of failure. I think it should be enough.

Reply | Threaded
Open this post in threaded view
|

Re: Pattern matching among multiple chunks

Hugo Musso Gualandi
> Here I was
> just searching for a way to detect if a failing pattern have some
> chance to match when data is appended to the input

The default implementation of Lua's pattern matching library just does
a simple search to find the pattern. However, it doesn't try to reason
about the grammar that you are trying to match or record what paths it
could have tried to take had there been more input available, so I
don't think you can quite do what you are asking.

One option that may or may not work is to be optimistic on failure. If
a match fails you save the remainder and try again when more input is
available.

The more ambitious alternative would be to create a fork of the pattern
matching library that implements this feature but I think at this pointyou may be better off using an existing lexer generator.

> If I have to write a C library, probably the quickest solution is
> just
> to modify the lua pattern matching library in order to return a the
> end position in case of failure. I think it should be enough.

One thing that might work, given that Lua patterns don't support
complex alternation, is to split your pattern into a serries or smaller
patterns and try one after the other. You can use string.find to know
when each match ends and where the next should begin.

For example, if your original pattern is "%S+ %d+" you can first try to
match "%S+" then, if it succeeds, try matching "%d+" from where the
previous match stopped.

To keep the code sequential and easy to read you can use coroutines to
implement that "yield if there is still unread input" approach that we
mentioned earlier.


Reply | Threaded
Open this post in threaded view
|

Re: Pattern matching among multiple chunks

Sean Conner
In reply to this post by pocomane
It was thus said that the Great pocomane once stated:

> On Wed, Jan 16, 2019 at 11:24 AM Dirk Laurie <[hidden email]> wrote:
> > Have you studied how lua.c solves precisely this problem?
>
> Can you be more specific? It does not seem to me that lua uses pattern
> matching (nor on multiple chunks). If you are talking about the lua
> lexer... well, I do not want to implement a proper lexer. Here I was
> just searching for a way to detect if a failing pattern have some
> chance to match when data is appended to the input
>
> On Wed, Jan 16, 2019 at 8:02 PM Sean Conner <[hidden email]> wrote:
> >   I've used Lua itself as the configuration language (for personal and work
> > related projects).  It's pretty easy to do:
>
> Me to... but now I need something more "Ini" like...

  Well, if it's ini-like, I have a parser for that:

https://github.com/spc476/LPeg-Parsers/blob/master/ini.lua

  But it requires the data to be in memory.

> On Wed, Jan 16, 2019 at 9:16 PM Sean Conner <[hidden email]> wrote:
> >   This is currently more interesting than anything going on at work, so let
> > me ask:  why can't you have the entire config file in memory?  Or at least a
> > line at a time?  This seems like a weird, if arbitrary, limitation.
>
> Yes, I agree with you that it is quite arbitrary for the described
> scenario. However, I am investigating on the possibility to use the
> same format for data insertion/transfer/storage. So keep all in memory
> could not be feasible. Splitting the chunks on specific token (e,g,
> new line) could work, but I have some patterns that spawn across
> multiple lines, so at least that ones must be handled in a special
> way.

  I also have a JSON parser that can handle streaming:

https://github.com/spc476/LPeg-Parsers/blob/master/jsons.lua

        json = require "org.conman.parsers.jsons

        blob = json:match(input)

If input is a string, it decodes it.  If input is a function, it will
continue to call it until input returns nil (to indicate no more input).  I
can understand not wanting to use JSON, but it might be good to look at how
I handle it.

  -spc