lpeg.U ?

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
12 messages Options
Reply | Threaded
Open this post in threaded view
|

lpeg.U ?

Albert Chan
pat = re.compile "{g <- .g / &'and' }"   -- lua pattern "(.*)and"
= lpeg.pcode( pat )                             -- using debug version of lpeg

i noticed its pcode has a "behind 3" instruction to not consume the last 'and'

there is a lpeg.B function to do look-behind, but how to go back to it if B matched ?

Is there a lpeg.U(n) (for undo n characters) or something similar ?

As an example of its usefulness, say # is lpeg re for undo 1 character

REDO above re pattern, but without backtrack stack overflow problem:
NOTE: I want to capture ALL except LAST 'and'

pat = re.compile "{ (g <- 'and' / . [^a]* g)+ ### }"

Without UNDO, I have to do this (likely much slower):

pat = re.compile( "(g <- 'and' / . [^a]* g)+ -> drop3", {drop3 = function(s) return s:sub(1,-4) end} )



Reply | Threaded
Open this post in threaded view
|

Re: lpeg.U ?

Sean Conner

  It is at this point I'm going to ask what you are trying to accomplish
here, because it's coming across as an XY problem [1].  I also saw this you
posted on the forum linked to earlier [2]:

        All of my code is for learning lpeg re, since many claimed that lpeg
        can do everything lua pattern can do.

  I do believe that LPeg can do everything that patterns can do, but nothing
in that statement says anything about speed or ease.

  With that said ...

It was thus said that the Great albertmcchan once stated:
> pat = re.compile "{g <- .g / &'and' }"   -- lua pattern "(.*)and"
> = lpeg.pcode( pat )                             -- using debug version of lpeg
>
> i noticed its pcode has a "behind 3" instruction to not consume the last 'and'

  I didn't find this lpeg.pcode() function.  There is an lpeg.ptree()
function, and the re expression above generates:
       
        [1 = g  ]
        capture kind: 'simple'  key: 0
          grammar 1
            rule n: 0  key: 1
              choice
                seq
                  any
                  call key: 1  (rule: 0)
                and
                  seq
                    char 'a'
                    seq
                      char 'n'
                      char 'd'

  I'm not sure what you mean by "behind 3" instruction.

> there is a lpeg.B function to do look-behind, but how to go back to it if B matched ?
>
> Is there a lpeg.U(n) (for undo n characters) or something similar ?

  I think that lpeg.Cmt() can do something like that, but as I wrote in a
previous message [3] to Jonathan Goble:

>   I think you are thinking about LPeg with the wrong mindset---yes, you can
> look for patters in text with LPeg [1] but it's for *parsing*---pulling out
> semantic information from text, rather than just patterns.  I've written a
> lot of LPeg code [2], and not once have I needed a greedy, non-possessive
> repetition to parse text.

  Now, back to your message:
 

> As an example of its usefulness, say # is lpeg re for undo 1 character
>
> REDO above re pattern, but without backtrack stack overflow problem:
> NOTE: I want to capture ALL except LAST 'and'
>
> pat = re.compile "{ (g <- 'and' / . [^a]* g)+ ### }"
>
> Without UNDO, I have to do this (likely much slower):
>
> pat = re.compile( "(g <- 'and' / . [^a]* g)+ -> drop3", {drop3 = function(s) return s:sub(1,-4) end} )

  If you are looking for a final "and" (which ends the input), then this
works:

        last_and = P"and" * P(-1)
        char     = R("\0\96","b\255")^1
                 + -last_and * P"a"
        pat      = C((char)^0) * last_and

        print(pat:match(string.rep("this and that land",400) .. "and"))

  The wierd production of 'char' is to burn through large sequences of
charaters that don't contain the letter 'a'.  It's probably faster than this
version:

        last_and = P"and" * P(-1)
        pat      = C((P(1) - last_and)^0) * last_and

but I did not bother to benchmark it.  Personally, I would probably use the
above version since it's easier to understand.  If it became an issue, then
I might go with the version with 'char' and if that was still slow, then I
would take stock with what I'm really trying to accomplish and adjust
accordingly.

  -spc (So, what is it you are really trying to do?)

[1] http://xyproblem.info/

[2] http://www.gammon.com.au/forum/?id=14149&reply=31#reply31

[3] http://lua-users.org/lists/lua-l/2017-10/msg00143.html

Reply | Threaded
Open this post in threaded view
|

Re: lpeg.U ?

Albert Chan


On Jan 25, 2018, at 8:03 PM, Sean Conner <[hidden email]> wrote:

>
>  It is at this point I'm going to ask what you are trying to accomplish
> here, because it's coming across as an XY problem [1].  I also saw this you
> posted on the forum linked to earlier [2]:
>
>    All of my code is for learning lpeg re, since many claimed that lpeg
>    can do everything lua pattern can do.
>
>  I do believe that LPeg can do everything that patterns can do, but nothing
> in that statement says anything about speed or ease.
>
>  With that said ...
>
> It was thus said that the Great albertmcchan once stated:
>> pat = re.compile "{g <- .g / &'and' }"   -- lua pattern "(.*)and"
>> = lpeg.pcode( pat )                             -- using debug version of lpeg
>>
>> i noticed its pcode has a "behind 3" instruction to not consume the last 'and'
>
>  I didn't find this lpeg.pcode() function.  There is an lpeg.ptree()
> function, and the re expression above generates:
>    
>    [1 = g  ]
>    capture kind: 'simple'  key: 0
>      grammar 1
>        rule n: 0  key: 1
>          choice
>            seq
>              any
>              call key: 1  (rule: 0)
>            and
>              seq
>                char 'a'
>                seq
>                  char 'n'
>                  char 'd'
>
>  I'm not sure what you mean by "behind 3" instruction.
>
>> there is a lpeg.B function to do look-behind, but how to go back to it if B matched ?
>>
>> Is there a lpeg.U(n) (for undo n characters) or something similar ?
>
>  I think that lpeg.Cmt() can do something like that, but as I wrote in a
> previous message [3] to Jonathan Goble:
>
>>  I think you are thinking about LPeg with the wrong mindset---yes, you can
>> look for patters in text with LPeg [1] but it's for *parsing*---pulling out
>> semantic information from text, rather than just patterns.  I've written a
>> lot of LPeg code [2], and not once have I needed a greedy, non-possessive
>> repetition to parse text.
>
>  Now, back to your message:
>
>> As an example of its usefulness, say # is lpeg re for undo 1 character
>>
>> REDO above re pattern, but without backtrack stack overflow problem:
>> NOTE: I want to capture ALL except LAST 'and'
>>
>> pat = re.compile "{ (g <- 'and' / . [^a]* g)+ ### }"
>>
>> Without UNDO, I have to do this (likely much slower):
>>
>> pat = re.compile( "(g <- 'and' / . [^a]* g)+ -> drop3", {drop3 = function(s) return s:sub(1,-4) end} )
>
>  If you are looking for a final "and" (which ends the input), then this
> works:
>
>    last_and = P"and" * P(-1)
>    char     = R("\0\96","b\255")^1
>                 + -last_and * P"a"
>    pat      = C((char)^0) * last_and
>
>    print(pat:match(string.rep("this and that land",400) .. "and"))
>
>  The wierd production of 'char' is to burn through large sequences of
> charaters that don't contain the letter 'a'.  It's probably faster than this
> version:
>
>    last_and = P"and" * P(-1)
>    pat      = C((P(1) - last_and)^0) * last_and
>
> but I did not bother to benchmark it.  Personally, I would probably use the
> above version since it's easier to understand.  If it became an issue, then
> I might go with the version with 'char' and if that was still slow, then I
> would take stock with what I'm really trying to accomplish and adjust
> accordingly.
>
>  -spc (So, what is it you are really trying to do?)
>
> [1]    http://xyproblem.info/
>
> [2]    http://www.gammon.com.au/forum/?id=14149&reply=31#reply31
>
> [3]    http://lua-users.org/lists/lua-l/2017-10/msg00143.html
>

Reply | Threaded
Open this post in threaded view
|

Re: lpeg.U ?

Albert Chan
In reply to this post by Sean Conner
On Jan 25, 2018, at 8:03 PM, Sean Conner <[hidden email]> wrote:

>  I didn't find this lpeg.pcode() function.  There is an lpeg.ptree()
> function, and the re expression above generates:
>    
>    [1 = g  ]
>    capture kind: 'simple'  key: 0
>      grammar 1
>        rule n: 0  key: 1
>          choice
>            seq
>              any
>              call key: 1  (rule: 0)
>            and
>              seq
>                char 'a'
>                seq
>                  char 'n'
>                  char 'd'
>  I'm not sure what you mean by "behind 3" instruction.

i am using lpeg 1.0.1, and it has lpeg.code, which call lp_printcode in lptree.c

pat = re.compile "{g <- .g / &'and'}"
= lpeg.pcode(pat)

[1 = g ]
00 opencapture simple (idx = 0)
01 call -> 5
03 jmp -> 19
05 testany -> 14
07 choice -> 14
09 any
10 call -> 5
12 commit -> 18
14 char 'a'
15 char 'n'
16 char 'd'
17 behind 3          -- undo consumed 'and'
18 ret
19 closecapture
20 end

behind == codebehind() in lpcode.c (line 659 - 663)

>  If you are looking for a final "and" (which ends the input), then this
> works:
>
>    last_and = P"and" * P(-1)
>    char     = R("\0\96","b\255")^1
>                 + -last_and * P"a"
>    pat      = C((char)^0) * last_and

the code is just an example of the usefulness of undo

FYI, the last 'and' may not be anchored in end-of-string.
If you remove the anchor, you got the first 'and', not the last

i was just curious if undo is possible ...



Reply | Threaded
Open this post in threaded view
|

Re: lpeg.U ?

Albert Chan
In reply to this post by Sean Conner

On Jan 25, 2018, at 8:03 PM, Sean Conner <[hidden email]> wrote:

>  If you are looking for a final "and" (which ends the input), then this
> works:
>
>    last_and = P"and" * P(-1)
>    char     = R("\0\96","b\255")^1
>                 + -last_and * P"a"
>    pat      = C((char)^0) * last_and
>
>    print(pat:match(string.rep("this and that land",400) .. "and"))
>

your example also shows usefulness of undo lpeg.U (if exist):

pat = C( P(1)^3 * U(3) * #P'and' )


Reply | Threaded
Open this post in threaded view
|

Re: lpeg.U ?

Sean Conner
In reply to this post by Albert Chan
It was thus said that the Great albertmcchan once stated:
> i am using lpeg 1.0.1, and it has lpeg.code, which call lp_printcode in lptree.c

  Ah, I did not see it when I scanned the source code.  

> the code is just an example of the usefulness of undo

  I still think you approaching LPeg as a replacement for patterns instead
of a tool for semantically parsing text.  

> FYI, the last 'and' may not be anchored in end-of-string.
> If you remove the anchor, you got the first 'and', not the last
>
> i was just curious if undo is possible ...

  "Undo" what?  If you have an expression:

        A * ( B + C )

A is matched.  Then B is tried.  If that fails, LPeg will "undo" B and try
C.  That's an "undo".  You also have look-ahead with the '#' operator (and
to some degree with the '-' operator).

I know, you are trying to do:

        (.*)and(.*)

and have it replicate a Lua pattern.  Yes, you can do that.  There have been
several versions doing so.  But (and this is something I can't repeat
enough) you need to think differently with LPeg---think "structure" and not
just "pattern".  Compare this:

        https://github.com/spc476/LPeg-talk/blob/master/email-addr.lua

with this:

        http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html

to parse email addresses as defined in RFC-822 (the LPeg one can handle
comments; the regex one can't).

  -spc (So I ask again, what exactly are you trying to do?)


Reply | Threaded
Open this post in threaded view
|

Re: lpeg.U ?

Sean Conner
In reply to this post by Albert Chan
It was thus said that the Great albertmcchan once stated:

>
> On Jan 25, 2018, at 8:03 PM, Sean Conner <[hidden email]> wrote:
>
> >  If you are looking for a final "and" (which ends the input), then this
> > works:
> >
> >    last_and = P"and" * P(-1)
> >    char     = R("\0\96","b\255")^1
> >                 + -last_and * P"a"
> >    pat      = C((char)^0) * last_and
> >
> >    print(pat:match(string.rep("this and that land",400) .. "and"))
> >
>
> your example also shows usefulness of undo lpeg.U (if exist):
>
> pat = C( P(1)^3 * U(3) * #P'and' )

  Even if lpeg.U() existed, I don't think this would do what you expect it
do to.  

        read and understand this

The P(1)^3 will consume past the word "this".  

        read and understand this
                                ^

Then what?  the U(3) would back up three characters:

        read and understand this
                            ^

And then it would fail because the expression #P'and' fails.

This brings up another point---the Lua pattern

        (.*)and(.*)

applied to "read and understand this" will return

        "read and underst"
        " this"

If this is what you want, then the Lua pattern is probably what you want.
If this isn't what you want, and what you really wanted was

        "read"
        "understand this"

Then you need to rethink what you are attempting to do (and adjust the
patterns or LPeg code accordingly).

  -spc


Reply | Threaded
Open this post in threaded view
|

Re: lpeg.U ?

Dirk Laurie-2
In reply to this post by Albert Chan
2018-01-26 0:19 GMT+02:00 albertmcchan <[hidden email]>:

> Is there a lpeg.U(n) (for undo n characters) or something similar ?

You have two options.

   #p matches what p matches, but consumes no input.

   A Cg/Cb allows you to store a capture and use it later.

Actually, all the answers I could give to questions like yours are in my
lpeg-brief article on https://github.com/dlaurie/lua-notes.

Reply | Threaded
Open this post in threaded view
|

Re: lpeg.U ?

Albert Chan
In reply to this post by Sean Conner

> On Jan 25, 2018, at 11:01 PM, Sean Conner <[hidden email]> wrote:
>
> It was thus said that the Great albertmcchan once stated:
>>
>>> On Jan 25, 2018, at 8:03 PM, Sean Conner <[hidden email]> wrote:
>>>
>>> If you are looking for a final "and" (which ends the input), then this
>>> works:
>>>
>>>   last_and = P"and" * P(-1)
>>>   char     = R("\0\96","b\255")^1
>>>                + -last_and * P"a"
>>>   pat      = C((char)^0) * last_and
>>>
>>>   print(pat:match(string.rep("this and that land",400) .. "and"))
>>
>> your example also shows usefulness of undo lpeg.U (if exist):
>>
>> pat = C( P(1)^3 * U(3) * #P'and' )
>
>  Even if lpeg.U() existed, I don't think this would do what you expect

Sorry for the mis-understanding.
I was referring to YOUR example, a final "and" (which ends the input).

The natural way of matching is to move to position -3, then check for 'and'
If I use lua string library, i would do this:

function string_match_and_at_endofstring(s)
   if string.sub(s, -3) ~= 'and' then return nil end
   return string.sub(s, 1, -4)
end

Above code, literally translate to lpeg (if #s is available)

function lpeg_match_and_at_endofstring(s)
   local pat = C(P(#s - 3) * #P'and')
   return lpeg.match(pat, s)
end

Without #s and if U exist then pat = C(P(1)^3 * U(3) * #P'and')


Reply | Threaded
Open this post in threaded view
|

Re: lpeg.U ?

Albert Chan
In reply to this post by Dirk Laurie-2
>
> On Jan 26, 2018, at 2:14 AM, Dirk Laurie <[hidden email]> wrote:
>
> 2018-01-26 0:19 GMT+02:00 albertmcchan <[hidden email]>:
>
>> Is there a lpeg.U(n) (for undo n characters) or something similar ?
>
> You have two options.
>
>   #p matches what p matches, but consumes no input.
>
>   A Cg/Cb allows you to store a capture and use it later.
>
> Actually, all the answers I could give to questions like yours are in my
> lpeg-brief article on https://github.com/dlaurie/lua-notes.
>

Thank you, Dirk !
The official lpeg documentation is much too terse.
I like the Cg/Cb surname example, which is similar to perl back reference.

I had guessed that undo is impossible, since lpeg matching is possessive.
I was just checking if anyone did a patch for lpeg.U (or similar)

= re.match('hi', '.?..')       -- actually match 3+ characters
nil





Reply | Threaded
Open this post in threaded view
|

Re: lpeg.U ?

Albert Chan
In reply to this post by Sean Conner

On Jan 25, 2018, at 8:03 PM, Sean Conner <[hidden email]> wrote:

>  If you are looking for a final "and" (which ends the input), then this
> works:
>
>    last_and = P"and" * P(-1)
>    char     = R("\0\96","b\255")^1
>                 + -last_and * P"a"
>    pat      = C((char)^0) * last_and
>
>    print(pat:match(string.rep("this and that land",400) .. "and"))

Just for fun I disable lpeg.B so it only produce behind n instruction
without doing matching:

comment out line 662 in lpcode.c and recompile (YES, just 1 line)
// codegen(compst, sib1(tree), 0, NOINST, fullset)

-- should use P(1)^3, but I want to see what happen with bad input
pat = C(P(1)^0 * B(3)) * P'and'

= lpeg.pcode(pat)
00 opencapture simple (idx = 0)
01 span [(00-ff)]
10 behind 3
11 closecapture
12 char 'a'
13 char 'n'
14 char 'd'
15 end

for _, v in pairs {'', 'a', 'an', 'and', 'aand'} do print( pat:match(v) ) end
nil
nil
nil

a

Again, just for fun, try string.rep("this and that land",400) .. "and")),
and this patched_B version speed is 15X of last_and version above.

Question: is B(-n) has any meaning ?, if no, this could be use for U(n)


Reply | Threaded
Open this post in threaded view
|

Re: lpeg.U ?

Albert Chan
In reply to this post by Albert Chan
Does lpeg.B(-n) has any real meaning ?
I did a patch that use B(-n) for the behind n instruction
NOTE: %b == lpge.B(-1) == behind 1

Example: lua patten "(.*)and(.*)"

pat = re.compile "{(g <- 'and' / . [^a]* g)+ %b%b%b} ... {.*}"
= pat:match 'this and that and whatever'
this and that
whatever

Example: capture chars in groups of 3's

pat = re.compile "({...} %b%b)+"
= pat:match "123456789"
123     234     345     456     567     678     789

----
Follwing is the patches if anyone interested (ONLY 4 lines !)

>diff lpeg\lptree.c lpeg-1.0.1\lptree.c
430d429
<       tree->u.n = n;                        // TNot, possibly UNDO
689d687
<   if (sib1(tree)->tag == TNot) n = sib1(tree)->u.n;

>diff lpeg\lpcode.c lpeg-1.0.1\lpcode.c
661d661
<   if (sib1(tree)->tag != TNot)        // TNot == UNDO

>diff lpeg\re.lua lpeg-1.0.1\re.lua
23c27,28
< local Predef = { nl = m.P"\n", b = m.B(-1) }
---
> local Predef = { nl = m.P"\n" }