how does lua arrange vmcase?


Xianfu Pan
AFAIK, cases are order-sensitive in C. switch (?) case XXX: break;
case YYY: break;

vmcase(OP_MOVE) comes before vmcase(OP_LOADK), before vmcase(OP_LOADKX),
then followed by vmcase(OP_LOADBOOL)?

Why this order? Frequency order, or some other consideration?

luaV_execute is a hot spot of Lua performance. If the order is based on
frequency, why not use a Huffman tree's binary search instead of
vmcase's linear search?

best regards to all.

Re: how does lua arrange vmcase?

Soni "They/Them" L.


On 2017-04-04 12:51 PM, Xianfu Pan wrote:

> AFAIK, cases are order-sensitive in C. switch (?) case XXX: break;
> case YYY: break;
>
> vmcase(OP_MOVE) comes before vmcase(OP_LOADK), before vmcase(OP_LOADKX),
> then followed by vmcase(OP_LOADBOOL)?
>
> Why this order? Frequency order, or some other consideration?
>
> luaV_execute is a hot spot of Lua performance. If the order is based on
> frequency, why not use a Huffman tree's binary search instead of
> vmcase's linear search?
>
> best regards to all.

Linear?

It should be O(1) (aka constant), not O(n) (aka linear)...

--
Disclaimer: these emails may be made public at any given time, with or without reason. If you don't agree with this, DO NOT REPLY.



Re: how does lua arrange vmcase?

Luiz Henrique de Figueiredo
In reply to this post by Xianfu Pan
> AFAIK, cases are order-sensitive in C. switch (?) case XXX: break;
> case YYY: break;
>
> vmcase(OP_MOVE) comes before vmcase(OP_LOADK), before vmcase(OP_LOADKX),
> then followed by vmcase(OP_LOADBOOL)?
>
> Why this order? Frequency order, or some other consideration?

It's mostly the order the opcodes are listed in the enum defined in lopcodes.h.
A switch whose cases are in increasing order should be easily handled by
the C compiler, which can generate a simple jump table. More generally,
if the cases form an interval [1,n] with n small, then a simple jump
table works, even if the cases are not ordered.

Frequency of execution is handled by the branch predictions in the CPU.

The order in lopcodes.h and lvm.c does differ in a few places:

               17,18d16
               < OP_MOD
               < OP_POW

               20d17
               < OP_IDIV

               25a23,25
               > OP_MOD
               > OP_IDIV
               > OP_POW

where < is lopcodes.h and > is lvm.c.


Threaded vm core loop (Was: Re: how does lua arrange vmcase?)

Gunnar Zötl
On 2017-04-04 18:10, Luiz Henrique de Figueiredo wrote:

> It's mostly the order the opcodes are listed in the enum defined in
> lopcodes.h. A switch whose cases are in increasing order should be
> easily handled by the C compiler, which can generate a simple jump
> table. More generally, if the cases form an interval [1,n] with n
> small, then a simple jump table works, even if the cases are not
> ordered.

And the compiler does so quite efficiently. After stumbling upon this
old paper:

https://www.jilp.org/vol5/v5paper12.pdf

I had a go at making the core VM loop of Lua 5.3.4 into a direct
threaded interpreter. The change itself was pretty easy: basically it
involved turning the big switch statement in luaV_execute() into a
jump table and a goto. This did result in a speedup, but the results
are less than spectacular. On average, a speedup of about 3% was
gained across a selection of benchmarks from here:
http://benchmarksgame.alioth.debian.org/u64q/lua.html (with output and
gc turned off), but in some cases the modified interpreter was even a
little slower than the vanilla Lua interpreter. An interesting
observation was that the distribution of speedups across the
benchmarks was quite different on the two processors I could test this
on, although the average was the same.

So, in conclusion, this was an interesting exercise, but compilers and
processors have advanced quite a bit since the paper was written, and so
the results are not worth maintaining a custom code base.

Gunnar


Re: Threaded vm core loop (Was: Re: how does lua arrange vmcase?)

Daurnimator
On 5 April 2017 at 22:28,  <[hidden email]> wrote:

> I had a go at making the core VM loop of Lua 5.3.4 into a direct
> threaded interpreter. [...] This did result in a speedup, but the
> results are less than spectacular. On average, a speedup of about 3%
> was gained across a selection of benchmarks [...] but in some cases
> the modified interpreter was even a little slower than the vanilla
> Lua interpreter.

I'd be interested in seeing your patch. Have you uploaded it anywhere?


> So, in conclusion, this was an interesting exercise, but compilers and
> processors have advanced quite a bit since the paper was written, and so the
> results are not worth maintaining a custom code base.

As far as I understand it, this should have been a relatively easy
change to make and maintain; lvm.c has #defines ready for you to make
such a modification (vmdispatch, vmcase, vmbreak).
https://www.lua.org/source/5.3/lvm.c.html#vmdispatch


Re: Threaded vm core loop (Was: Re: how does lua arrange vmcase?)

Roberto Ierusalimschy
> As far as I understand it, this should have been a relatively easy
> change to make and maintain; lvm.c has #defines ready for you to make
> such a modification (vmdispatch, vmcase, vmbreak).
> https://www.lua.org/source/5.3/lvm.c.html#vmdispatch

Last time I tested, the following definitions would do the trick.

-- Roberto

/*---------------------------------------------------------------------*/
#undef vmdispatch
#undef vmcase
#undef vmbreak

#define vmdispatch(x)     goto *disptab[x];

#define vmcase(l)     L_##l:

#define vmbreak vmfetch(); vmdispatch(GET_OPCODE(i));


static void *disptab[] = {
  &&L_OP_MOVE,
  &&L_OP_LOADK,
  &&L_OP_LOADKX,
  &&L_OP_LOADBOOL,
  &&L_OP_LOADNIL,
  &&L_OP_GETUPVAL,
  &&L_OP_GETTABUP,
  &&L_OP_GETTABLE,
  &&L_OP_SETTABUP,
  &&L_OP_SETUPVAL,
  &&L_OP_SETTABLE,
  &&L_OP_NEWTABLE,
  &&L_OP_SELF,
  &&L_OP_ADD,
  &&L_OP_SUB,
  &&L_OP_MUL,
  &&L_OP_MOD,
  &&L_OP_POW,
  &&L_OP_DIV,
  &&L_OP_IDIV,
  &&L_OP_BAND,
  &&L_OP_BOR,
  &&L_OP_BXOR,
  &&L_OP_SHL,
  &&L_OP_SHR,
  &&L_OP_UNM,
  &&L_OP_BNOT,
  &&L_OP_NOT,
  &&L_OP_LEN,
  &&L_OP_CONCAT,
  &&L_OP_JMP,
  &&L_OP_EQ,
  &&L_OP_LT,
  &&L_OP_LE,
  &&L_OP_TEST,
  &&L_OP_TESTSET,
  &&L_OP_CALL,
  &&L_OP_TAILCALL,
  &&L_OP_RETURN,
  &&L_OP_FORLOOP,
  &&L_OP_FORPREP,
  &&L_OP_TFORCALL,
  &&L_OP_TFORLOOP,
  &&L_OP_SETLIST,
  &&L_OP_CLOSURE,
  &&L_OP_VARARG,
  &&L_OP_EXTRAARG,
};


Re: Threaded vm core loop

Gunnar Zötl
In reply to this post by Daurnimator
On 2017-04-05 14:52, Daurnimator wrote:
> On 5 April 2017 at 22:28,  <[hidden email]> wrote:
>> I had a go at making the core vm loop of lua 5.3.4 into a direct
>> threaded
>> interpreter. The change itself was pretty easy, basically it involved
>
> I'd be interested in seeing your patch. Have you uploaded it anywhere?

I'll attach it to this mail. The patch is against lvm.c from Lua 5.3.4.

> As far as I understand it, this should have been a relatively easy
> change to make and maintain; lvm.c has #defines ready for you to make
> such a modification (vmdispatch, vmcase, vmbreak).

Yes, but I didn't use those, as this was basically just a hack to see
what benefits it would bring.

Gunnar

lvm-threaded.patch (49K) Download Attachment

Re: Threaded vm core loop

Gunnar Zötl
In reply to this post by Roberto Ierusalimschy
On 2017-04-05 15:16, Roberto Ierusalimschy wrote:
>
> #define vmbreak vmfetch(); vmdispatch(GET_OPCODE(i));

My vmbreak was continue; this variant shaves another percent off and
brings it to a bit more than a 4% speed improvement on average.

Gunnar


Re: Threaded vm core loop (Was: Re: how does lua arrange vmcase?)

云风 Cloud Wu
In reply to this post by Roberto Ierusalimschy

Roberto Ierusalimschy <[hidden email]> wrote on Wednesday, 2017-04-05 at 9:16 PM:
>> As far as I understand it, this should have been a relatively easy
>> change to make and maintain; lvm.c has #defines ready for you to make
>> such a modification (vmdispatch, vmcase, vmbreak).
>> https://www.lua.org/source/5.3/lvm.c.html#vmdispatch
>
> Last time I tested, the following definitions would do the trick.
>
> #undef vmdispatch
> #undef vmcase
> #undef vmbreak
>
> #define vmdispatch(x)     goto *disptab[x];
>
> #define vmcase(l)     L_##l:
>
> #define vmbreak         vmfetch(); vmdispatch(GET_OPCODE(i));

I think gcc already optimizes switch/case into a jump table.

I tested it with gcc 4.9.2:

gcc -std=gnu99 -O2 -S -Wall -c lvm.c

.L358:
movl %ebx, %r15d
movl %ebx, %eax
shrl $6, %r15d
andl $63, %eax
movzbl %r15b, %edx
movq %rdx, %r9
movq %rdx, %r15
salq $4, %r9
cmpl $45, %eax
leaq (%rdi,%r9), %rsi
ja .L357
movslq 0(%r13,%rax,4), %rax
addq %r13, %rax
jmp *%rax
.section .rdata,"dr"
.align 4
.L360:
.long .L359-.L360
.long .L361-.L360
.long .L362-.L360
        .....

It uses jmp *%rax for the switch/case. The trick only removes the unused
default case.


Re: Threaded vm core loop (Was: Re: how does lua arrange vmcase?)

Roberto Ierusalimschy
> It uses jmp *%rax for the switch/case. The trick only removes the unused
> default case.

The real goal of the trick would be to have different copies of 'jmp
*%rax' after each opcode (given the expansion of 'vmdispatch' at the end
of each instruction).

In a typical loop, the VM hardly ever executes the same opcode twice in
sequence, so the prediction of a single indirect jump is very poor.
However, we can expect the frequency of a given instruction being
followed by another particular fixed instruction to be high. For
instance, consider the following loop:

        5 [2] FORPREP   1 1 ; to 7
        6 [3] ADD       0 0 4
        7 [2] FORLOOP   1 -2 ; to 6

ADD is always followed by FORLOOP, which is always followed by ADD. So,
if we have different jump instructions ('jmp *%rax') at the end of
each opcode, they would have a 100% hit rate. By contrast, a unique
'jmp *%rax' for both instructions (which happens when we use a switch)
would have a 0% hit rate.

Unfortunately, many compilers "optimize" this kind of code, unifying
the different copies of 'vmdispatch' into one to reduce code size,
and therefore they kill that kind of gain. :-(

-- Roberto


Re: Threaded vm core loop

sur-behoffski
In reply to this post by Gunnar Zötl
Way back in 1997, I used the x86 instruction sequence "lodsb; xlat; jmp ax"
(all of 4 bytes) to do an unrolled threaded finite-state machine.  I
published an article in Dr. Dobb's Journal ("High-Speed Finite-State
Machines"), showing this technique in use for a word-counting utility, a
static Huffman decoder, and a stripped-down non-deterministic (sigh)
regular expression search engine (incorporating a Boyer-Moore string
search).

One of the benefits was that the entire code page for the state machine
fitted into 256 bytes, which gave it high cache coherency (even on a 386,
I think the instruction cache was L1).  I could also analyse relationships
between state tables and sometimes substitute cheaper opcodes where I
could see that the next state, if not successful, would not have a
backtracking path.

However, the project only met with muted interest.  I found that the
Pentium was becoming the dominant PC processor, and it was moving towards
a load-store architecture, with branch prediction, with microcode being
used to support instructions less relevant as compiler technology also
evolved.  I suspect that lodsb and xlat were candidates for microcode.

Where I saw potential with the threaded code I was playing with, but
never really got to look at closely, was in places where a programmer
could apply "vertical merging", e.g. combining a UTF-8 decoder with a
word count algorithm.  However, this also has been made less relevant
in recent times, with speculative execution and multiple execution
units meaning that sequential operation could be kept front and centre...

... this is because an execution pipeline stall, such as would almost
certainly occur from a jump-table branch, is really expensive, as most of
the tricks noted above become unavailable.  This is where "vertical
merging" is my hope:  You cram various vertical abstract layers into a
single finite-state machine, such that the state machine does enough work
with each table-lookup dispatch, that you gain enough performance to
afford the significant performance hit of the pipeline stall.

The likely biggest downside of such merging is that state tables may very
quickly become too numerous to provide good performance, first in cache
coherency, and then perhaps in exceeding available memory.

--

My gut-feel reaction to the performance differences on different machines
would be to look at cache coherency, perhaps microcode support, and then
branch-prediction efficiency. A JIT is very likely to write moderately
longer machine-code sequences than the tight opcode versions of code
segments, and so may suffer a higher rate of cache misses, but I suspect
this is a fairly minor consideration on modern hardware.

--

Anyway, hope that you found this threaded code/VM loop message useful.

cheers,

sur-behoffski