Lua versus Ravi versus Luajit - a simple benchmark

classic Classic list List threaded Threaded
17 messages Options
Reply | Threaded
Open this post in threaded view
|

Lua versus Ravi versus Luajit - a simple benchmark

Dibyendu Majumdar
Hi,

I just implemented LLVM based JIT compilation of OP_FORPREP,
OP_FORLOOP and OP_MOVE in Ravi. So this allows me to run a simple
benchmark given below. Note that the benchmark is run twice and time
measured for the second run - this is to eliminate any cost of
compilation.

Results:
Lua 5.3:                                  9.36 seconds
Ravi using LLVM:                   2.823 seconds
Luajit 2.1.0:                             0.309 seconds

I do not know yet what changes I could make to improve Ravi's
performance in this test - I will wait until I have implemented most
of the other byte codes I need before delving deeper into improving
things.

-----------------------------------------------------------------------

local function x()
  local j = 0.0
  for i=2.0,1000000000.0 do
     j = i
  end
  return j
end

x()

local t1 = os.clock()
local y = x();
local t2 = os.clock()
print(y)

assert(y == 1000000000.0)
print("time taken ", t2-t1);

Reply | Threaded
Open this post in threaded view
|

Re: Lua versus Ravi versus Luajit - a simple benchmark

Dibyendu Majumdar
On 1 March 2015 at 22:22, Dibyendu Majumdar <[hidden email]> wrote:
> I do not know yet what changes I could make to improve Ravi's
> performance in this test

Just looking at the logic in FORLOOP makes it clear that there are
some obvious optimizations.
For instance it is known when FORPREP executes whether the loop has
integer or float index, and whether the step is negative or positive.
So then one could branch to more specialized implementation of FORLOOP
from FORPREP and avoid the if/else logic inside FORLOOP.

Right now Ravi's implementation is naive i.e. replicates the logic in lvm.c.

Regards
Dibyendu

Reply | Threaded
Open this post in threaded view
|

Re: Lua versus Ravi versus Luajit - a simple benchmark

Dibyendu Majumdar
On 2 March 2015 at 00:53, Dibyendu Majumdar <[hidden email]> wrote:
> Just looking at the logic in FORLOOP makes it clear that there are
> some obvious optimizations.
> For instance it is known when FORPREP executes whether the loop has
> integer or float index, and whether the step is negative or positive.
> So then one could branch to more specialized implementation of FORLOOP
> from FORPREP and avoid the if/else logic inside FORLOOP.
>
> Right now Ravi's implementation is naive i.e. replicates the logic in lvm.c.

Also I was stupid enough to use the Lua stack/register for the index,
limit and step ... obviously these are hidden so the JIT version can
simply use native C stack variables. Only the external variable needs
to be retained on the Lua stack.

So it seems:
a) Use native C stack variable for the loop index, limit and step
b) Create separate loop control blocks for int/float, and
increasing/decreasing, and depending on FORPREP analysis use the
correct loop control block.

Time to redo this :-( it was messy enough to implement the naive version!
But on the other hand if I could get Ravi close to LuaJIT performance ... !

BTW I hope people don't mind my sharing my experiences here ... if you
do, shout and I will stop.

Regards
Dibyendu

Reply | Threaded
Open this post in threaded view
|

Re: Lua versus Ravi versus Luajit - a simple benchmark

Dirk Laurie-2
2015-03-02 22:22 GMT+02:00 Dibyendu Majumdar <[hidden email]>:

> BTW I hope people don't mind my sharing my experiences here ... if you
> do, shout and I will stop.

If they mind, all that is needed is not to open the posts. At this stage,
I'm still opening them. Keep going.

Reply | Threaded
Open this post in threaded view
|

Re: Lua versus Ravi versus Luajit - a simple benchmark

Andrew Starks


On Monday, March 2, 2015, Dirk Laurie <[hidden email]> wrote:
2015-03-02 22:22 GMT+02:00 Dibyendu Majumdar <<a href="javascript:;" onclick="_e(event, &#39;cvml&#39;, &#39;mobile@majumdar.org.uk&#39;)">mobile@...>:

> BTW I hope people don't mind my sharing my experiences here ... if you
> do, shout and I will stop.

If they mind, all that is needed is not to open the posts. At this stage,
I'm still opening them. Keep going.


I'm lurking. It's interesting stuff.  
Reply | Threaded
Open this post in threaded view
|

Re: Lua versus Ravi versus Luajit - a simple benchmark

Mauricio Tavares
On Tue, Mar 3, 2015 at 7:36 AM, Andrew Starks <[hidden email]> wrote:

>
>
> On Monday, March 2, 2015, Dirk Laurie <[hidden email]> wrote:
>>
>> 2015-03-02 22:22 GMT+02:00 Dibyendu Majumdar <[hidden email]>:
>>
>> > BTW I hope people don't mind my sharing my experiences here ... if you
>> > do, shout and I will stop.
>>
>> If they mind, all that is needed is not to open the posts. At this stage,
>> I'm still opening them. Keep going.
>>
>
> I'm lurking. It's interesting stuff.

      It's one of those "All in the name of... Science!" kinda posts.
Always nice in my book

Reply | Threaded
Open this post in threaded view
|

Re: Lua versus Ravi versus Luajit - a simple benchmark

Valerio
In reply to this post by Dibyendu Majumdar
If there is a system that 1) is faster than vanilla lua, 2) maintains binary-compatibility with vanilla-lua, and 3) tries to catch-up with luajit, i'm also interested. 

a+
valerio

On Sun, Mar 1, 2015 at 11:22 PM, Dibyendu Majumdar <[hidden email]> wrote:
Hi,

I just implemented LLVM based JIT compilation of OP_FORPREP,
OP_FORLOOP and OP_MOVE in Ravi. So this allows me to run a simple
benchmark given below. Note that the benchmark is run twice and time
measured for the second run - this is to eliminate any cost of
compilation.

Results:
Lua 5.3:                                  9.36 seconds
Ravi using LLVM:                   2.823 seconds
Luajit 2.1.0:                             0.309 seconds

I do not know yet what changes I could make to improve Ravi's
performance in this test - I will wait until I have implemented most
of the other byte codes I need before delving deeper into improving
things.

-----------------------------------------------------------------------

local function x()
  local j = 0.0
  for i=2.0,1000000000.0 do
     j = i
  end
  return j
end

x()

local t1 = os.clock()
local y = x();
local t2 = os.clock()
print(y)

assert(y == 1000000000.0)
print("time taken ", t2-t1);


Reply | Threaded
Open this post in threaded view
|

Re: Lua versus Ravi versus Luajit - a simple benchmark

Dibyendu Majumdar
In reply to this post by Dibyendu Majumdar
Hi,

I made some changes as described previously.

Firstly using C stack variables for the index, step and limit, which
are internal variables and hence invisible to Lua code.
Secondly I now have 4 different forloop blocks - int with step > 0,
int with step <= 0, float with step > 0, float with step <= 0.
The problem I encountered was how to branch from the loop body to the
appropriate loop control block.
In the end I found that LLVM supports a computed goto operation - so
that during forprep I can save the label of the forloop block that
needs to be used  - the body can simply do a goto to the right forloop
block. The alternative is to make 4 copies of the loop body - but this
I hesitate to do.

So all these changes - did they help? Not a great deal I am afraid.
Results are below.

I haven't found a way to determine which bit is costing the most time.
I also haven't investigated the machine code being generated by LLVM.
My gut feel is that without additional type information, a limitation
of the traditional compilation approach is that we cannot avoid
expensive branches.
So if at compile time one can establish that the loop is integer / and
the step is > 0 then possibly the branching can be avoided. Perhaps
Luajit is doing this optimization.

There are some other areas to explore. For example the "external" loop
variable could be avoided if the compiler could enforce a readonly
property on the internal variable.

Anyway - I need to stay away from this problem until I have completed
implementing other bytecodes as this is an absorbing issue that can
easily consume a lot of time. Also, in a real world scenario this may
not be such an issue.

C:\github\ravi\ravi-tests>..\build\Debug\lua.exe fornum_test1.lua
1000000000.0
time taken      2.766

C:\github\ravi\ravi-tests>..\build\Debug\lua.exe fornum_test1.lua
1000000000.0
time taken      2.765

C:\github\ravi\ravi-tests>..\build\Debug\lua.exe fornum_test1.lua
1000000000.0
time taken      2.765

C:\github\ravi\ravi-tests>\luajit\luajit.exe fornum_test1.lua
1000000000
time taken      0.313

C:\github\ravi\ravi-tests>\luajit\luajit.exe fornum_test1.lua
1000000000
time taken      0.312

C:\github\ravi\ravi-tests>\luajit\luajit.exe fornum_test1.lua
1000000000
time taken      0.313

C:\github\ravi\ravi-tests>\lua-5.3.0\src\build\Release\lua.exe fornum_test1.lua
1000000000.0
time taken      9.219

C:\github\ravi\ravi-tests>\lua-5.3.0\src\build\Release\lua.exe fornum_test1.lua
1000000000.0
time taken      9.187

C:\github\ravi\ravi-tests>\lua-5.3.0\src\build\Release\lua.exe fornum_test1.lua
1000000000.0
time taken      9.188

Reply | Threaded
Open this post in threaded view
|

Re: Lua versus Ravi versus Luajit - a simple benchmark

KHMan
On 3/4/2015 10:33 AM, Dibyendu Majumdar wrote:
> [snip]
> Anyway - I need to stay away from this problem until I have completed
> implementing other bytecodes as this is an absorbing issue that can
> easily consume a lot of time. Also, in a real world scenario this may
> not be such an issue.

I think you're doing quite well, that is, in speeding up Lua in
this manner. Whether you can approach luajit's speed with this
method is another matter...

luajit is doing 3.2 billion iterations a second. Do you have a
comparison with a straight C sample? That would be informative,
indicating the amount of overhead we're looking at versus "C
compiled with the minimum overhead".

--
Cheers,
Kein-Hong Man (esq.)
Kuala Lumpur, Malaysia


Reply | Threaded
Open this post in threaded view
|

Re: Lua versus Ravi versus Luajit - a simple benchmark

Dibyendu Majumdar
On 4 March 2015 at 03:15, KHMan <[hidden email]> wrote:
> luajit is doing 3.2 billion iterations a second. Do you have a comparison
> with a straight C sample? That would be informative, indicating the amount
> of overhead we're looking at versus "C compiled with the minimum overhead".
>

Hi,
I'd like to say how much I appreciate your document on Lua VM
bytecodes - it has been of great help to me. I will update it for 5.3
when I get some time.

Re C sample - no I haven't done so. I did have a look at the Luajit
dump of machine code - it is really optimized to the bare minimum, so
a C version is unlikely to be much better than that. In any case for
very simple code like this test I suspect a C version when run through
an optimizer would recognize that actually the loop is not necessary
at all - and one could just return the final value! The code only
makes sense in Lua because of the semantics of the loop variable.

I will do comparisons  with C when I have more complex examples I can test.

There are obviously a number of optimizations that are possible in
this particular example.
a) Recognizing at compile time that the loop variable is an integer with step 1.
b) Eliminating the "external" variable that is only being read - my
current version updates it at every iteration , which means two
assignments at every iteration (the value and the type) that could be
avoided.

But I will resist temptation to keep tuning this - and focus on
getting more coverage for now.


Regards

Reply | Threaded
Open this post in threaded view
|

Re: Lua versus Ravi versus Luajit - a simple benchmark

KHMan
On 3/5/2015 4:05 AM, Dibyendu Majumdar wrote:

> On 4 March 2015 at 03:15, KHMan wrote:
>> luajit is doing 3.2 billion iterations a second. Do you have a comparison
>> with a straight C sample? That would be informative, indicating the amount
>> of overhead we're looking at versus "C compiled with the minimum overhead".
>>
> [snip]
> Re C sample - no I haven't done so. I did have a look at the Luajit
> dump of machine code - it is really optimized to the bare minimum, so
> a C version is unlikely to be much better than that. In any case for
> very simple code like this test I suspect a C version when run through
> an optimizer would recognize that actually the loop is not necessary
> at all - and one could just return the final value! The code only
> makes sense in Lua because of the semantics of the loop variable.

Just make the destination volatile. Also remember to use SSE2.

--
Cheers,
Kein-Hong Man (esq.)
Kuala Lumpur, Malaysia


Reply | Threaded
Open this post in threaded view
|

Re: Lua versus Ravi versus Luajit - a simple benchmark

Dibyendu Majumdar
In reply to this post by Dibyendu Majumdar
Hi,

I implemented following additional op codes:
LOADNIL
LOADFZ (Ravi extension - sets a register to floating point 0)
ADDFN (Ravi extension - adds a floating point value in register B to
operand C  and stores in register A)

With this I am able to run the following Ravi code:

local function x()
  local j:double
  for i=1,1000000000 do
  j = j+1
  end
  return j
end

Equivalent Lua code is:

local function x()
  local j=0.0
  for i=1,1000000000 do
  j = j+1.0
  end
  return j
end


I ran another comparison with Luajit, Lua 5.3 and Ravi:

C:\github\ravi\ravi-tests>\luajit\luajit.exe fornum_test2.lua
1000000000
time taken      0.93

C:\github\ravi\ravi-tests>\lua-5.3.0\src\build\Release\lua.exe fornum_test2.lua
1000000000.0
time taken      9.746

C:\github\ravi\ravi-tests>..\build\Debug\lua.exe fornum_test2.ravi
1000000000.0
time taken      2.782

C:\github\ravi\ravi-tests>\luajit\luajit.exe fornum_test2.lua
1000000000
time taken      0.946

C:\github\ravi\ravi-tests>\lua-5.3.0\src\build\Release\lua.exe fornum_test2.lua
1000000000.0
time taken      9.667

C:\github\ravi\ravi-tests>..\build\Debug\lua.exe fornum_test2.ravi
1000000000.0
time taken      2.786

C:\github\ravi\ravi-tests>\luajit\luajit.exe fornum_test2.lua
1000000000
time taken      0.947

C:\github\ravi\ravi-tests>\lua-5.3.0\src\build\Release\lua.exe fornum_test2.lua
1000000000.0
time taken      9.57

C:\github\ravi\ravi-tests>..\build\Debug\lua.exe fornum_test2.ravi
1000000000.0
time taken      2.829

Reply | Threaded
Open this post in threaded view
|

Re: Lua versus Ravi versus Luajit - a simple benchmark

Ahmed Charles
Why does the Ravi version exist in a Debug directory? Are optimizations on?



> On Mar 4, 2015, at 4:43 PM, Dibyendu Majumdar <[hidden email]> wrote:
>
> Hi,
>
> I implemented following additional op codes:
> LOADNIL
> LOADFZ (Ravi extension - sets a register to floating point 0)
> ADDFN (Ravi extension - adds a floating point value in register B to
> operand C  and stores in register A)
>
> With this I am able to run the following Ravi code:
>
> local function x()
>  local j:double
>  for i=1,1000000000 do
>  j = j+1
>  end
>  return j
> end
>
> Equivalent Lua code is:
>
> local function x()
>  local j=0.0
>  for i=1,1000000000 do
>  j = j+1.0
>  end
>  return j
> end
>
>
> I ran another comparison with Luajit, Lua 5.3 and Ravi:
>
> C:\github\ravi\ravi-tests>\luajit\luajit.exe fornum_test2.lua
> 1000000000
> time taken      0.93
>
> C:\github\ravi\ravi-tests>\lua-5.3.0\src\build\Release\lua.exe fornum_test2.lua
> 1000000000.0
> time taken      9.746
>
> C:\github\ravi\ravi-tests>..\build\Debug\lua.exe fornum_test2.ravi
> 1000000000.0
> time taken      2.782
>
> C:\github\ravi\ravi-tests>\luajit\luajit.exe fornum_test2.lua
> 1000000000
> time taken      0.946
>
> C:\github\ravi\ravi-tests>\lua-5.3.0\src\build\Release\lua.exe fornum_test2.lua
> 1000000000.0
> time taken      9.667
>
> C:\github\ravi\ravi-tests>..\build\Debug\lua.exe fornum_test2.ravi
> 1000000000.0
> time taken      2.786
>
> C:\github\ravi\ravi-tests>\luajit\luajit.exe fornum_test2.lua
> 1000000000
> time taken      0.947
>
> C:\github\ravi\ravi-tests>\lua-5.3.0\src\build\Release\lua.exe fornum_test2.lua
> 1000000000.0
> time taken      9.57
>
> C:\github\ravi\ravi-tests>..\build\Debug\lua.exe fornum_test2.ravi
> 1000000000.0
> time taken      2.829
>

Reply | Threaded
Open this post in threaded view
|

Re: Lua versus Ravi versus Luajit - a simple benchmark

Dibyendu Majumdar
On 5 March 2015 at 11:55, Ahmed Charles <[hidden email]> wrote:
> Why does the Ravi version exist in a Debug directory? Are optimizations on?

Hi,
The LLVM generated code is not affected by this - only affects the
interpreter code. But that is not being measured here so it isn't a
factor in the timings.

I admit though that I am new to LLVM and therefore may not be
generating the best code. I do run the LLVM optimizer passes but the
problem is that contrary to appearances LLVM is poorly documented, so
it is not at all clear how the optimizer passes should be hooked up,
and how well they are coping with the LLVM IR I am generating.
Hopefully I will work out better ways as I learn more.

Regards
Dibyendu

Reply | Threaded
Open this post in threaded view
|

Re: Lua versus Ravi versus Luajit - a simple benchmark

Ahmed Charles




> On Mar 5, 2015, at 12:17 PM, Dibyendu Majumdar <[hidden email]> wrote:
>
>> On 5 March 2015 at 11:55, Ahmed Charles <[hidden email]> wrote:
>> Why does the Ravi version exist in a Debug directory? Are optimizations on?
>
> Hi,
> The LLVM generated code is not affected by this - only affects the
> interpreter code. But that is not being measured here so it isn't a
> factor in the timings.
>
> I admit though that I am new to LLVM and therefore may not be
> generating the best code. I do run the LLVM optimizer passes but the
> problem is that contrary to appearances LLVM is poorly documented, so
> it is not at all clear how the optimizer passes should be hooked up,
> and how well they are coping with the LLVM IR I am generating.
> Hopefully I will work out better ways as I learn more.
>
> Regards
> Dibyendu

It's still a good idea to use release builds for perf testing. The jit is probably a large portion of the time and llvm is known for slow compile times in the jit since most people use it for offline compilation or scenarios where jit time is less of a factor.
Reply | Threaded
Open this post in threaded view
|

Re: Lua versus Ravi versus Luajit - a simple benchmark

Dibyendu Majumdar
On 5 March 2015 at 21:25, Ahmed Charles <[hidden email]> wrote:
> It's still a good idea to use release builds for perf testing. The jit is probably a large portion of the time and llvm is known for slow compile times in the jit since most people use it for offline compilation or scenarios where jit time is less of a factor.

Hi,
Yes you are right. But to work around the time it takes to JIT I am
calling the function twice and measuring the second call. So the time
it takes to compile is not considered, as the function gets compiled
on first call.

The problem I have is that I cannot build a debug version and link to
release version of LLVM - I get link errors. So I will to have two
parallel builds - one release and the other debug - both for Ravi and
LLVm and then link them respectively. Too much hassle ... will do it
when I get a bit more advanced on the JIT compilation.

Thanks for the feedback.

Regards
Dibyendu

Reply | Threaded
Open this post in threaded view
|

Re: Lua versus Ravi versus Luajit - a simple benchmark

Dibyendu Majumdar
In reply to this post by Dibyendu Majumdar
Hi,

This is another test. It is a modified version of the first fornum
test just to make things a bit more difficult.

local function x()
  local j = 0.0
  for i=2.0,1000000000.0 do
  j = i
  for k=3,5 do
  j = k
  end
  end
  return j
end

You can see the test programs, and generated LLVM IR at:
https://github.com/dibyendumajumdar/ravi/tree/master/ravi-tests


C:\github\ravi\ravi-tests>\luajit\luajit.exe fornum_test3.lua
5
time taken      7.806

C:\github\ravi\ravi-tests>\luajit\luajit.exe fornum_test3.lua
5
time taken      7.927

C:\github\ravi\ravi-tests>\luajit\luajit.exe fornum_test3.lua
5
time taken      7.92

C:\github\ravi\ravi-tests>\lua-5.3.0\src\build\Release\lua.exe fornum_test3.lua
5
time taken      53.932

C:\github\ravi\ravi-tests>\lua-5.3.0\src\build\Release\lua.exe fornum_test3.lua
5
time taken      54.64

C:\github\ravi\ravi-tests>\lua-5.3.0\src\build\Release\lua.exe fornum_test3.lua
5
time taken      54.584

C:\github\ravi\ravi-tests>..\build\Debug\lua.exe fornum_test3.lua
2>fornum_test3.ll
5
time taken      17.196

C:\github\ravi\ravi-tests>..\build\Debug\lua.exe fornum_test3.lua
2>fornum_test3.ll
5
time taken      15.92

C:\github\ravi\ravi-tests>..\build\Debug\lua.exe fornum_test3.lua
2>fornum_test3.ll
5
time taken      16.371