Questions about opcodes used in LPEG's vm implementation

6 messages

Questions about opcodes used in LPEG's vm implementation

lua.greatwolf
Hi all,

I'm attempting to add JIT functionality (using DynASM to help with this) to the reference
LPeg 0.12 implementation, but I don't have any prior experience doing this, so I'm not sure how far
I'll get on this project. Maybe I'll learn something at the end.

As a starting point, I'm currently studying the implementation of the VM parsing machine in lpvm.c.
Additionally, I'm using the peg documentation published by Roberto:

http://www.inf.puc-rio.br/~roberto/docs/peg.pdf

Anyway, on to my questions.

I'm looking at the 'ISet' and 'ITestSet' opcodes, but I'm not really understanding how they check
whether the current character is part of the set or not. Here's a code excerpt for 'ISet':

       case ISet: {
         int c = (byte)*s;
         if (testchar((p+1)->buff, c) && s < e)
           { p += CHARSETINSTSIZE; s++; }

"testchar" is a macro defined in lptypes.h:143 as:

     #define testchar(st,c)    (((int)(st)[((c) >> 3)] & (1 << ((c) & 7))))

Now according to the peg.pdf:

> ... Sets are represented as bit sets, with one bit for
> each possible value of a character. Each such instruction
> uses 256 extra bits, or 16 bytes, to represent its set.

There's probably a mistake in there, because 16 bytes is actually 128 bits, while 256 bits would be
32 bytes. Also, LPeg has had a rewrite since this doc was published, so I don't know if all the info
here is still accurate.

So my question is: when there's an "ISet", "ITestSet", or possibly "ISpan" opcode in the compiled
pattern, what does the opcode listing for it actually look like? Is the set of characters being
checked also encoded into this listing -- e.g. similar to a line offset? If so, how many bytes
would that take?

Here's the union definition for opcode instructions in lpvm.h:

   typedef union Instruction {
     struct Inst {
       byte code;
       byte aux;
       short key;
     } i;
     int offset;
     byte buff[1];
   } Instruction;

So from this I can infer that each opcode instruction entry in the listing can be either i (an
actual opcode instruction), an offset (for the opcode above it), or a byte buffer with one element
in it.

The two fields I'm unclear about are "key" and "buff". "key" doesn't seem to be used during the
interpreting phase, so I'm guessing it's used during pattern construction? In which case I can
probably ignore it?

Now the "buff" field is being used above in "testchar". Why is buff an array of byte with one
element in it? Even if say this decayed into a pointer so it takes 4 bytes I still can't see how it
fits in. Let say the pattern had a really big set. How would that translate into the opcode and what
would the listing look like?

Can anyone shed some light and help clear this up?

Thanks!



Re: Questions about opcodes used in LPEG's vm implementation

Pierre-Yves Gérardy
On Fri, Sep 13, 2013 at 11:01 PM,  <[hidden email]> wrote:

> [original message quoted in full -- snipped]

The documentation is correct, except for the 16 bytes, which should read 32.

A set is thus represented in the opcode stream by an ISet instruction
followed by 256 consecutive bits, addressed as an array of 32 bytes,
each bit encoding the presence or absence of a given character in
the set.

>     #define testchar(st,c)    (((int)(st)[((c) >> 3)] & (1 << ((c) & 7))))

This test looks for the presence of the "c"-th bit in "st", the byte
array that starts at the buff address.

    (st)[((c) >> 3)]    // select the correct byte [0-31]
    & (1 << ((c) & 7))  // bitwise AND with the correct bit in this byte [0-7]

It can be done safely because the compiler always allocates the full
256-bit set.

I'm not sure why the end result is cast to an int, nor why the buffer
is defined as a one-element array rather than a pointer... I suppose it
is meant to keep the compiler happy; I'm not a C expert at all.

Hope it helps.
-- Pierre-Yves


Re: Questions about opcodes used in LPEG's vm implementation

Pierre-Yves Gérardy

> It can be done safely because the compiler always allocates the full
> 256-bit set.

Just to be clear: "compiler" here refers to the LPeg bytecode generator.



Re: Questions about opcodes used in LPEG's vm implementation

Roberto Ierusalimschy
In reply to this post by Pierre-Yves Gérardy
> I'm not sure why the end result is cast to an int, nor why the buffer
> is defined as a one-element array rather than a pointer... I suppose it
> is meant to keep the compiler happy; I'm not a C expert at all.

The buffer is not a pointer; it is the buffer itself. As you explained,
the LPeg compiler allocates the correct space in the opcode array. If
buff were declared with its correct size, all instructions would need
those additional 32 bytes. So it is declared with size 1 (C89 does
not allow zero-sized arrays), just as a placeholder. This is a common
technique in C (see [1], for instance).

[1] http://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html

-- Roberto


Re: Questions about opcodes used in LPEG's vm implementation

lua.greatwolf
In reply to this post by lua.greatwolf
Thank you Roberto and Pierre-Yves for clearing this up! After looking at the source some more and
thinking about it, I think I have a more or less visual idea of what the structure looks like in
memory.

Now I just have to figure out how that's going to map to assembly. My initial idea is to embed
the bit set right into the assembly code as data bytes. I'm not sure of the consequences and
performance implications of doing it this way, but it seems like a natural approach to mimic
what's currently being done in the VM.

> The buffer is not a pointer, it is the buffer itself. As you explained,
> the LPeg compiler allocates the correct space in the opcode array. If
> buff was declared with its correct size, all instructions would need
> those additional 32 bytes...

I understood that part, though when I first looked at the definition I was a bit confused about
making it `buff[1]`. Since this is a union, the size of `Instruction` is going to be *at least*
sizeof(int), plus any alignment padding added by the compiler; on all the platforms that matter
today, that usually means at least 4 bytes. It made me wonder for a moment why not declare buff
as something more like `byte buff[sizeof(int)]`: even though buff is declared with one element,
in reality it will always have space for 3 more bytes, because it's not the largest field in this
union.

On an unrelated side note, I noticed LPeg defining `BITSPERCHAR` as 8 in one of the headers instead
of using `CHAR_BIT` from the standard, which I found a bit unusual.



Re: Questions about opcodes used in LPEG's vm implementation

Roberto Ierusalimschy
> On an unrelated side note, I noticed LPeg defining `BITSPERCHAR` as 8
> in one of the headers instead of using `CHAR_BIT` from the standard,
> which I found a bit unusual.

The definitions of 'setchar' and 'testchar' only work for 8-bit chars,
and their method only works for char sizes that are powers of two. Note
that ANSI says that CHAR_BIT must be at least 8, so the whole thing
should work on any machine (wasting the extra bits).

-- Roberto