Implementation of Lua and direct/context threaded code

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Implementation of Lua and direct/context threaded code

Grellier, Thierry

Hello,

 

I was reading the article: The Implementation of Lua 5.0 and went through the usage of switch/case instruction dispatch preferred to direct threaded code techniques (bound to gcc usage) for portability reasons. I thought that conditional compilation was also key to portability more than language... I also guess that a lot of us are building our lua interpreter with gcc.

It is hard to fully understand how much it improves a real application in the end, so I was wondering if anyone has experimented with using these techniques instead of default lua implementation. I wished I could have had time to do so, but…

So when this proves to be a major speed improvement why not using luaconf to detect if gcc is used, and hardware thus select the most efficient dispatch scheme. I see in this article that trade off is actually depending upon target hardware (http://citeseer.ist.psu.edu/berndl05context.html).

This shall have limited impact on lua core size, and affect mainly lvm.* and parser I guess, limiting thus the maintenance overhead…

 

regards

 

 

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Implementation of Lua and direct/context threaded code

Mike Pall-5-2
Hi,

Grellier, Thierry wrote:

> I was reading the article: The Implementation of Lua 5.0 and went
> through the usage of switch/case instruction dispatch preferred to
> direct threaded code techniques (bound to gcc usage) for portability
> reasons. I thought that conditional compilation was also key to
> portability more than language... I also guess that a lot of us are
> building our lua interpreter with gcc.
>
> It is hard to fully understand how much it improves a real application
> in the end, so I was wondering if anyone has experimented with using
> these techniques instead of default lua implementation. I wished I could
> have had time to do so, but...

http://lua-users.org/lists/lua-l/2004-09/msg00610.html

Summary: not worth it -- at least not on x86.

There's a reason: Lua uses a one-opcode + three-operand bytecode
and operates on a virtual (caller/callee-overlapping) register
file. This means the machine code implementing each opcode is
much "fatter" (compared to a stack VM) and some of the operand
decoding can be moved before the opcode dispatch. This offers
more opportunities for out-of-order scheduling and filling the
pipeline bubbles caused by the branch mispredictions (note that
the direct threaded code technique does not remove all branch
mispredictions either).

Shameless plug: if you want faster execution (at the expense of
portability) then try LuaJIT: http://luajit.luaforge.net/

Bye,
     Mike
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

RE: Implementation of Lua and direct/context threaded code

Grellier, Thierry
In reply to this post by Grellier, Thierry
Ok I missed the thread, because searching with "direct threaded"
keywords.
Well 3% gain... I wasn't expecting much for the reasons you mentioned,
but, but I still think this is little effort to support it (cf silly
example below) so why not?
And supporting it is NOT at the expense of portability though; I contest
that idea, because I think this is more at the expense of
maintainability. And yes I would consider using JIT when speed matters,
but not sure I will only make programs for x86...
It is true that branch misprediction matters, this is why I was quoting
the article claiming they can improve it (with context threading instead
of direct threading) though I haven't looked in depth... I guess that
opcode ordering in switch may also have an influence...

enum { SETR1, SETR2, INCR_R1, COMP_R1_R2, BRANCH_TRUE, NEG_BRANCH,
END_OF_PROGRAM };
#include <stdio.h>

#define GET_OPCODE(i) i

#ifdef DIRECT_THREADED
# define SWITCH(e) goto *dispatches[e];
# define CASE
# define BREAK     goto *dispatches[GET_OPCODE(*pc++)]
#else
# define SWITCH(e) switch(e)
# define CASE      case
# define BREAK     continue
#endif

typedef char Instruction ;

void interprete(Instruction pc[]) {
  int r0 = 0, r1 = 0, r2 = 0;

#ifdef DIRECT_THREADED
  static void* dispatches[] = { &&SETR1, &&SETR2, &&INCR_R1,
&&COMP_R1_R2, &&BRANCH_TRUE, &&NEG_BRANCH, &&END_OF_PROGRAM };
#endif

  for(;;) {
    Instruction i = *pc++;
    SWITCH(GET_OPCODE(i)) {
    CASE SETR1:
      r1 = (int) *pc++;
      printf("[pc=%xd, r1=%d, r2=%d] NSETR1\n", pc, r1, r2);
      BREAK;
    CASE SETR2:
      r2 = (int) *pc++;
      printf("[pc=%xd, r1=%d, r2=%d] NSETR2\n", pc, r1, r2);
      BREAK;
    CASE INCR_R1:
      r1++;
      printf("[pc=%xd, r1=%d, r2=%d] NINCR_R1\n", pc, r1, r2);
      BREAK;
    CASE COMP_R1_R2:
      r0 = r1 == r2;
      printf("[pc=%xd, r1=%d, r2=%d, r0=%d] NCOMP_R1_R2\n", pc, r1, r2,
r0);
      BREAK;
    CASE BRANCH_TRUE:
      if (r0) pc += *pc;
      else pc++;
      printf("[pc=%xd, r1=%d, r2=%d, r0=%d] NBRANCH_TRUE\n", pc, r1, r2,
r0);
      BREAK;
    CASE NEG_BRANCH:
      pc -= *pc;
      printf("[pc=%xd, r1=%d, r2=%d] NEG_BRANCH\n", pc, r1, r2);
      BREAK;
    CASE END_OF_PROGRAM:
     printf("[pc=%xd, r1=%d, r2=%d] END_OF_PROGRAM\n", pc, r1, r2);
     return;
    }
  }
}

int main() {
  static Instruction program[] = {
    SETR1, 1,
    SETR2, 125,
    COMP_R1_R2,
    BRANCH_TRUE, 4,
    INCR_R1,
    NEG_BRANCH, 5,
    END_OF_PROGRAM
    };
  int i;
  for (i = 0; i < 10000; i++)
    interprete(program);
  return 0;
}
-----Original Message-----
From: [hidden email]
[mailto:[hidden email]] On Behalf Of Mike Pall
Sent: Monday, May 29, 2006 6:33 PM
To: Lua list
Subject: Re: Implementation of Lua and direct/context threaded code

Hi,

Grellier, Thierry wrote:

> I was reading the article: The Implementation of Lua 5.0 and went
> through the usage of switch/case instruction dispatch preferred to
> direct threaded code techniques (bound to gcc usage) for portability
> reasons. I thought that conditional compilation was also key to
> portability more than language... I also guess that a lot of us are
> building our lua interpreter with gcc.
>
> It is hard to fully understand how much it improves a real application
> in the end, so I was wondering if anyone has experimented with using
> these techniques instead of default lua implementation. I wished I
could
> have had time to do so, but...

http://lua-users.org/lists/lua-l/2004-09/msg00610.html

Summary: not worth it -- at least not on x86.

There's a reason: Lua uses a one-opcode + three-operand bytecode
and operates on a virtual (caller/callee-overlapping) register
file. This means the machine code implementing each opcode is
much "fatter" (compared to a stack VM) and some of the operand
decoding can be moved before the opcode dispatch. This offers
more opportunities for out-of-order scheduling and filling the
pipeline bubbles caused by the branch mispredictions (note that
the direct threaded code technique does not remove all branch
mispredictions either).

Shameless plug: if you want faster execution (at the expense of
portability) then try LuaJIT: http://luajit.luaforge.net/

Bye,
     Mike




Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

RE: Implementation of Lua and direct/context threaded code

Grellier, Thierry
In reply to this post by Grellier, Thierry
I've attached a patch, that can allow using some kind of direct threaded
technique to lua 5.1 (on lvm.c) for anything but i386 with gcc compiler
(selection is done in luaconf.h and can be improved I guess, notably I
made assumption for powerpc... but at least it shall preserve
portability). On i386 it keeps switch/case.

A quick test let me think it allows to gain up to 5%-10% on sparcs using
some benches of http://shootout.alioth.debian.org/, but performs worst
on i386 (because of replicated code in BREAK I guess, and less
registers).
So if anybody wants to play with it and see how it performs on their
system...

Regarding LuaJIT, and referring to article I mentioned (more details
here:
http://www.cs.toronto.edu/~bv/tcl2005/tcl2005-slides.pdf). Does LuaJIT
use similar techniques to reduce misprediction and/or inline code in
branches?

-----Original Message-----
From: [hidden email]
[mailto:[hidden email]] On Behalf Of Mike Pall
Sent: Monday, May 29, 2006 6:33 PM
To: Lua list
Subject: Re: Implementation of Lua and direct/context threaded code

Hi,

Grellier, Thierry wrote:

> I was reading the article: The Implementation of Lua 5.0 and went
> through the usage of switch/case instruction dispatch preferred to
> direct threaded code techniques (bound to gcc usage) for portability
> reasons. I thought that conditional compilation was also key to
> portability more than language... I also guess that a lot of us are
> building our lua interpreter with gcc.
>
> It is hard to fully understand how much it improves a real application
> in the end, so I was wondering if anyone has experimented with using
> these techniques instead of default lua implementation. I wished I
could
> have had time to do so, but...

http://lua-users.org/lists/lua-l/2004-09/msg00610.html

Summary: not worth it -- at least not on x86.

There's a reason: Lua uses a one-opcode + three-operand bytecode
and operates on a virtual (caller/callee-overlapping) register
file. This means the machine code implementing each opcode is
much "fatter" (compared to a stack VM) and some of the operand
decoding can be moved before the opcode dispatch. This offers
more opportunities for out-of-order scheduling and filling the
pipeline bubbles caused by the branch mispredictions (note that
the direct threaded code technique does not remove all branch
mispredictions either).

Shameless plug: if you want faster execution (at the expense of
portability) then try LuaJIT: http://luajit.luaforge.net/

Bye,
     Mike

directthreaded.patch (15K) Download Attachment
Loading...