Follow-on from does Lua need a JIT - making Lua's interpreter faster


Dibyendu Majumdar
It seems from the majority of responses so far that JIT is not
essential for Lua, or putting it another way, a majority of use cases
can be satisfied without a JIT.

A natural follow on question is can the interpreter be made faster
without resorting to hand-written assembly code or other esoteric
optimisations?

Apart from using type annotations and introducing the numeric array
types in Ravi - which when used do improve interpreter speed - I have
also been experimenting with various other changes.

1. If the Lua stack was fixed in size can that lead to more optimised
code? We need to make sure that the compiler knows about this fact.
2. Can the table get/set be optimised for integer and string access -
preferably inlined? I have been playing with the idea of using a
precomputed 'hashmask' similar to LuaJIT to reduce the instructions
needed to locate a hashed value.
3. What if we give hints to the compiler using the gcc/clang branch
weights feature to help the optimizer generate better code for the
more common scenarios? This is a trade-off as making common cases
faster can penalize other cases.
4. Faster C function calls when the C function does not depend on Lua
and takes 1-2 primitive arguments and returns a primitive argument.
5. If we use a memory allocator such as jemalloc will that help?
6. I have not considered computed gotos because I am not sure of their
benefits especially for Ravi where the number of byte code
instructions is larger than Lua. It may also lead to worse
optimisation in other areas.
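
The 'hashmask' idea in item 2 can be sketched as follows. If the hash part always has a power-of-two number of slots, storing `size - 1` lets the lookup reduce a hash to a slot index with a single AND instead of a modulo. This is only a minimal sketch under stated assumptions: `IntTable`, `hash_int`, `int_table_set` and `int_table_get` are hypothetical names, keys are integers only, and collisions use simple linear probing, which is not how Lua's or LuaJIT's tables actually resolve them.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical integer-keyed hash table; not Lua's Table struct. */
typedef struct {
    int64_t key;
    int64_t value;
    int     used;
} Slot;

typedef struct {
    Slot    *slots;
    uint32_t hashmask;   /* slot count - 1; slot count is a power of two */
} IntTable;

/* Create a table with 2^log2size slots; returns 0 on success. */
static int int_table_init(IntTable *t, unsigned log2size) {
    assert(log2size < 31);
    uint32_t n = (uint32_t)1 << log2size;
    t->slots = calloc(n, sizeof(Slot));
    if (t->slots == NULL) return -1;
    t->hashmask = n - 1;
    return 0;
}

static void int_table_free(IntTable *t) {
    free(t->slots);
    t->slots = NULL;
}

/* Cheap multiplicative mix so that sequential keys spread over the slots. */
static inline uint32_t hash_int(int64_t k) {
    uint64_t x = (uint64_t)k * UINT64_C(0x9E3779B97F4A7C15);
    return (uint32_t)(x >> 32);
}

/* Insert or update; returns 0 on success, -1 if the table is full. */
static int int_table_set(IntTable *t, int64_t key, int64_t value) {
    uint32_t i = hash_int(key) & t->hashmask;      /* AND, no modulo */
    for (uint32_t probes = 0; probes <= t->hashmask; probes++) {
        Slot *s = &t->slots[i];
        if (!s->used || s->key == key) {
            s->key = key;
            s->value = value;
            s->used = 1;
            return 0;
        }
        i = (i + 1) & t->hashmask;                 /* wrap with the same mask */
    }
    return -1;
}

/* Lookup; returns 1 and stores the value in *out if found, else 0. */
static inline int int_table_get(const IntTable *t, int64_t key, int64_t *out) {
    uint32_t i = hash_int(key) & t->hashmask;
    for (uint32_t probes = 0; probes <= t->hashmask; probes++) {
        const Slot *s = &t->slots[i];
        if (!s->used) return 0;                    /* empty slot: key absent */
        if (s->key == key) { *out = s->value; return 1; }
        i = (i + 1) & t->hashmask;
    }
    return 0;
}
```

Because `int_table_get` is `static inline` and the mask is a plain field load, a compiler has a reasonable chance of inlining the whole lookup at the call site; whether that pays off in a real interpreter would still need measuring.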

I have nothing to report yet as the results from my experiments are
inconsistent across platforms, and I really haven't had the time to do
this seriously.

I would welcome any thoughts on this topic.

Regards
Dibyendu


Re: Follow-on from does Lua need a JIT - making Lua's interpreter faster

Sean Conner
It was thus said that the Great Dibyendu Majumdar once stated:

> It seems from the majority of responses so far that JIT is not
> essential for Lua, or putting it another way, a majority of use cases
> can be satisfied without a JIT.
>
> A natural follow on question is can the interpreter be made faster
> without resorting to hand-written assembly code or other esoteric
> optimisations?
>
> Apart from using type annotations and introducing the numeric array
> types in Ravi - which when used do improve interpreter speed - I have
> also been experimenting with various other changes.
>
> 1. If the Lua stack was fixed in size can that lead to more optimised
> code? We need to make sure that the compiler knows about this fact.
> 2. Can the table get/set be optimised for integer and string access -
> preferably inlined? I have been playing with the idea of using a
> precomputed 'hashmask' similar to LuaJIT to reduce the instructions
> needed to locate a hashed value.
> 3. What if we give hints to the compiler using the gcc/clang branch
> weights feature to help the optimizer generate better code for the
> more common scenarios? This is a trade-off as making common cases
> faster can penalize other cases.
> 4. Faster C function calls when the C function does not depend on Lua
> and takes 1-2 primitive arguments and returns a primitive argument.
> 5. If we use a memory allocator such as jemalloc will that help?
> 6. I have not considered computed gotos because I am not sure of their
> benefits especially for Ravi where the number of byte code
> instructions is larger than Lua. It may also lead to worse
> optimisation in other areas.
>
> I have nothing to report yet as the results from my experiments are
> inconsistent across platforms, and I really haven't had the time to do
> this seriously.
>
> I would welcome any thoughts on this topic.

  Profile, profile, profile.  

  As I've mentioned before, at work, I use Lua to process SIP messages.
Part of this is sending requests to a backend process using a custom
protocol that uses a CRC check.  I recompiled the project with profiling
enabled, and reran the regression test (something on the order of 10,000
tests---it takes a few minutes to run).  The regression test generates quite
a bit of network traffic so there's quite a bit of CRCing going on.

  Well, I have the results (compiled with "-DNDEBUG -O3 -pg"):

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total          
 time   seconds   seconds    calls   s/call   s/call  name    
 14.54      1.41     1.41  6618161     0.00     0.00  luaS_newlstr
 13.71      2.74     1.33   780921     0.00     0.00  luaV_execute
 11.96      3.90     1.16   281588     0.00     0.00  match
  9.95      4.87     0.97 18458447     0.00     0.00  luaH_get
  5.88      5.44     0.57  6295728     0.00     0.00  luaD_precall
  3.81      5.81     0.37  2242275     0.00     0.00  sweeplist
  2.78      6.08     0.27  4067464     0.00     0.00  newkey
  1.96      6.27     0.19 13042290     0.00     0.00  luaV_gettable
  1.86      6.45     0.18   699804     0.00     0.00  propagatemark
  1.55      6.60     0.15 10018396     0.00     0.00  luaM_realloc_
  1.55      6.75     0.15  1247028     0.00     0.00  resize
  1.49      6.89     0.15   855027     0.00     0.00  luaH_getn
  1.44      7.03     0.14   303380     0.00     0.00  pushcapture
  1.13      7.14     0.11   722487     0.00     0.00  str_format
  1.03      7.24     0.10  3040708     0.00     0.00  luaV_settable
  1.03      7.34     0.10  2401886     0.00     0.00  lua_rawgeti

  Lots more, but it's rapidly turning into noise.  The CRC function shows up
as the 345th most called function (out of 435)---in other words, a function
I would expect to maybe be higher up on the list is *way* down there.

  But here---there's nothing that really jumps out to me as saying "optimize
here!" Sure, there's luaS_newlstr and luaV_execute, but what could be done
to speed those up?  Is it even worth it?  And so far, the code's performance
(which is in production and handling phone calls) hasn't been an issue (it's
compiled with "-DNDEBUG -O2" BTW; I ran with a default compile (no specific
optimizations and NDEBUG not defined) and got similar results).

  -spc




Re: Follow-on from does Lua need a JIT - making Lua's interpreter faster

GrayFace-2
In reply to this post by Dibyendu Majumdar
On Wed 22.02.17 22:13, Dibyendu Majumdar wrote:
> It seems from the majority of responses so far that JIT is not
> essential for Lua, or putting it another way, a majority of use cases
> can be satisfied without a JIT.
>
> A natural follow on question is can the interpreter be made faster
> without resorting to hand-written assembly code or other esoteric
> optimisations?

Big Yes. When I switched from plain Lua to LuaJIT I was very happy with
the speedup. I didn't run the numbers, but it felt like a 10x increase in
speed. Then I found out that mobdebug.lua was disabling the JIT, so all of
this speedup came from the faster interpreter. The speedup from enabling
the JIT wasn't nearly as big.
This speedup probably comes from faster table access, but I don't know
how LuaJIT achieves it.

What I also love about LuaJIT is that it's the gold standard of Lua in
terms of language design. Lua 5.3 has suffered a huge degradation in
that area due to its combined int64/double type.


Best regards,
Sergey "GrayFace" Rozhenko



Re: Follow-on from does Lua need a JIT - making Lua's interpreter faster

Dibyendu Majumdar
Hi Sergey,

On 23 February 2017 at 12:59, Sergey Rozhenko <[hidden email]> wrote:

> What I also love about LuaJIT is that it's the golden standard of Lua in
> terms of language design. Lua 5.3 has suffered a huge degradation in that
> area due to its combined int64/double type.
>

I would have to disagree with you here. LuaJIT is a Lua 5.1/5.2 clone,
so perhaps you mean that you prefer Lua 5.1 or 5.2 to 5.3; Lua itself,
rather than LuaJIT, is the standard.

I think 5.3 is a better language (although I dislike incompatibilities
between the language versions as I think it is important to maintain
backward compatibility in language design). The integer subtype is
very useful; it enables 64-bit integers which otherwise would not be
possible in Lua, and also allows more efficient operations (in Ravi)
for various integer and bitwise operations. In JIT mode the generated
code can be close to C when type annotations are used.

I agree that for some operations there is a potential performance
impact - e.g. when adding numbers now extra type checks are needed.
This impacts the generated code in Ravi as well - although using type
annotations removes the type checks and the extra branching.

I also realise that the 64-bit integer support prevents use of the NaN
trick in representing values, and this has a performance impact due to
inability to use 8-byte values.
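
For readers unfamiliar with the NaN trick, here is a minimal sketch of the idea. Every value is one 64-bit word: ordinary doubles are stored as-is, and other types are hidden in the payload of a quiet NaN. The tag layout and all names (`Value`, `TAG_INT32`, `box_int32`, and so on) are assumptions made for illustration, not LuaJIT's actual encoding. With 16 bits used for the tag only 48 payload bits remain, which is why a full 64-bit integer cannot be boxed this way.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t Value;   /* every value is exactly 8 bytes */

/* Hypothetical layout: the top 16 bits select the type.
   0x7FFC = exponent all ones + quiet bit + one extra tag bit. */
#define TAG_MASK  UINT64_C(0xFFFF000000000000)
#define TAG_INT32 UINT64_C(0x7FFC000000000000)
#define CANON_NAN UINT64_C(0x7FF8000000000000)

/* Store a double; NaNs are canonicalised so none can look like a tag. */
static inline Value box_number(double d) {
    Value v;
    if (d != d) return CANON_NAN;
    memcpy(&v, &d, sizeof v);
    return v;
}

/* Store a 32-bit integer in the low payload bits of a tagged NaN. */
static inline Value box_int32(int32_t i) {
    uint32_t u;
    memcpy(&u, &i, sizeof u);
    return TAG_INT32 | u;
}

static inline int is_int32(Value v)  { return (v & TAG_MASK) == TAG_INT32; }
static inline int is_number(Value v) { return !is_int32(v); }

static inline double unbox_number(Value v) {
    double d;
    assert(is_number(v));
    memcpy(&d, &v, sizeof d);
    return d;
}

static inline int32_t unbox_int32(Value v) {
    uint32_t u = (uint32_t)(v & UINT64_C(0xFFFFFFFF));
    int32_t i;
    assert(is_int32(v));
    memcpy(&i, &u, sizeof i);
    return i;
}
```

A tagged-union representation that must hold either a 64-bit integer or a double needs a separate type field, which is how one ends up with 16 bytes per value once alignment is taken into account.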

But still on the whole it is better that there is an integer sub-type
in my view.

Regards
Dibyendu


Re: Follow-on from does Lua need a JIT - making Lua's interpreter faster

Duncan Cross
In reply to this post by GrayFace-2


On Thu, Feb 23, 2017 at 12:59 PM, Sergey Rozhenko <[hidden email]> wrote:
> On Wed 22.02.17 22:13, Dibyendu Majumdar wrote:
>> A natural follow on question is can the interpreter be made faster
>> without resorting to hand-written assembly code or other esoteric
>> optimisations?
>
> Big Yes. When I switched from plain Lua to LuaJIT I was very happy with
> the speedup. I didn't run the numbers, but it felt like 10x increase in
> speed. Then I found out that mobdebug.lua was disabling JIT, so all this
> speedup was from faster interpreter. Speedup from enabling JIT wasn't
> nearly as big.
> This speedup probably comes from faster table access, but I don't know
> how LuaJIT achieves it.

It comes from exactly what Dibyendu wanted to avoid: hand-written assembly code. The LuaJIT interpreter is completely rewritten this way.


Re: Follow-on from does Lua need a JIT - making Lua's interpreter faster

Dibyendu Majumdar
Hi Duncan,

On 23 February 2017 at 13:27, Duncan Cross <[hidden email]> wrote:

>
>
> On Thu, Feb 23, 2017 at 12:59 PM, Sergey Rozhenko <[hidden email]> wrote:
>>
>> On Wed 22.02.17 22:13, Dibyendu Majumdar wrote:
>>>
>>> A natural follow on question is can the interpreter be made faster
>>> without resorting to hand-written assembly code or other esoteric
>>> optimisations?
>>
>>
>> Big Yes. When I switched from plain Lua to LuaJIT I was very happy with
>> the speedup. I didn't run the numbers, but it felt like 10x increase in
>> speed. Then I found out that mobdebug.lua was disabling JIT, so all this
>> speedup was from faster interpreter. Speedup from enabling JIT wasn't nearly
>> as big.
>> This speedup probably comes from faster table access, but I don't know how
>> LuaJIT achieves it.
>
>
> It comes from exactly what Dibyendu wanted to avoid: hand-written assembly
> code. The LuaJIT interpreter is completely rewritten this way.
>

I think that, in addition to being written in assembly and inlining
various table operations, LuaJIT also employs other optimisations. In
particular, it optimises the way Lua values are represented (using 8
bytes rather than 16 bytes), and the stack frame information is merged
into the Lua stack of values. I am not sure if there is a single most
important optimisation - one that increases performance most - if
anyone has a view on this then please let me know. We know that the
combination of optimisations leads to almost 2x improvement in
interpreted mode which is very impressive.

I think I can inline table operations even in C code without having to
resort to assembly. Additionally by doing other changes I may be able
to convince the C optimizer to do better optimizations. But supporting
64-bit integers prevents the optimization of value types, and to
change the way the stack frames are coded would mean a large amount of
rework (it is doable as I don't think it requires rewriting Lua in
assembly code).
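
As an illustration of inlining a table operation in plain C and nudging the optimizer, here is a minimal sketch of an integer-key fast path with a branch hint. `__builtin_expect` is the real gcc/clang builtin; everything else (`ArrayTable`, `table_geti`, `table_geti_slow`, the `LIKELY` macro) is a hypothetical name, and the table has only an array part, unlike a real Lua table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) || defined(__clang__)
#  define LIKELY(x)   __builtin_expect(!!(x), 1)
#  define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define LIKELY(x)   (x)
#  define UNLIKELY(x) (x)
#endif

/* Hypothetical table with only an array part; not Lua's Table struct. */
typedef struct {
    int64_t *array;      /* values for keys 1..sizearray */
    size_t   sizearray;
} ArrayTable;

/* Out-of-line slow path: stands in for a hash lookup or metamethod call. */
static int table_geti_slow(const ArrayTable *t, int64_t key, int64_t *out) {
    (void)t; (void)key; (void)out;
    return 0;            /* "not found" in this sketch */
}

/* Inline fast path. Subtracting 1 as unsigned turns key <= 0 into a huge
   value, so one compare checks both 1 <= key and key <= sizearray. */
static inline int table_geti(const ArrayTable *t, int64_t key, int64_t *out) {
    assert(t != NULL && out != NULL);   /* compiled out with -DNDEBUG */
    uint64_t idx = (uint64_t)key - 1u;
    if (LIKELY(idx < t->sizearray)) {
        *out = t->array[idx];
        return 1;
    }
    return table_geti_slow(t, key, out);
}
```

The hint tells the compiler to lay out the array hit as the fall-through path. As noted earlier in the thread, this is a trade-off: workloads that mostly miss the array part would be penalised.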

The other area where performance is lost is in the function call
sequence, where I think LuaJIT is able to improve performance
significantly.

My knowledge of LuaJIT's implementation is patchy however as the code
is hard to read and understand. If anyone can provide more insight
then please do so.

Regards
Dibyendu


Re: Follow-on from does Lua need a JIT - making Lua's interpreter faster

Dibyendu Majumdar
In reply to this post by Sean Conner
Hi Sean,

On 22 February 2017 at 18:57, Sean Conner <[hidden email]> wrote:

> It was thus said that the Great Dibyendu Majumdar once stated:
>> It seems from the majority of responses so far that JIT is not
>> essential for Lua, or putting it another way, a majority of use cases
>> can be satisfied without a JIT.
>>
>> A natural follow on question is can the interpreter be made faster
>> without resorting to hand-written assembly code or other esoteric
>> optimisations?
>>
>   Profile, profile, profile.
>

I agree that profiling is important. But normal C profiling doesn't
tell you much as it doesn't reveal which Lua bytecodes are important
to optimise. So I have (not released yet) an option in Ravi to measure
the time taken by each bytecode and the number of bytecode executions - by
dumping this out one can understand better where to focus.
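
The Ravi option itself is not released, so the following is only a rough sketch of what per-bytecode measurement could look like, using a made-up three-instruction VM. All names (`OpProfile`, `run_profiled`, `dump_profile`, the opcodes) are hypothetical, and `clock()` is used purely for portability; calling it on every instruction is slow and coarse, so a real implementation would presumably use a cheaper timer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical three-instruction VM; not Lua's or Ravi's instruction set. */
enum { OP_PUSH, OP_ADD, OP_HALT, NUM_OPCODES };

typedef struct { uint8_t op; int32_t arg; } Instr;

typedef struct {
    uint64_t count[NUM_OPCODES];   /* executions per opcode */
    clock_t  ticks[NUM_OPCODES];   /* CPU time per opcode, in clock() ticks */
} OpProfile;

static const char *const op_names[NUM_OPCODES] = { "PUSH", "ADD", "HALT" };

/* Run the program, recording per-opcode statistics; returns top of stack. */
static int64_t run_profiled(const Instr *code, OpProfile *prof) {
    int64_t stack[64];
    int sp = 0;
    for (const Instr *pc = code; ; pc++) {
        assert(pc->op < NUM_OPCODES);
        clock_t start = clock();
        int halt = 0;
        switch (pc->op) {
        case OP_PUSH: assert(sp < 64); stack[sp++] = pc->arg; break;
        case OP_ADD:  assert(sp >= 2); sp--; stack[sp - 1] += stack[sp]; break;
        case OP_HALT: halt = 1; break;
        }
        prof->count[pc->op]++;
        prof->ticks[pc->op] += clock() - start;
        if (halt) return sp > 0 ? stack[sp - 1] : 0;
    }
}

/* Dump one line per opcode: name, execution count, accumulated seconds. */
static void dump_profile(const OpProfile *prof, FILE *out) {
    for (int i = 0; i < NUM_OPCODES; i++)
        fprintf(out, "%-5s %10llu executions %10.6f s\n", op_names[i],
                (unsigned long long)prof->count[i],
                (double)prof->ticks[i] / CLOCKS_PER_SEC);
}
```

Even the execution counts alone, without timings, show which instructions dominate a workload, which is information a C-level profile of luaV_execute cannot provide.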

The second issue with profiling is that you need test applications to
profile. This is perhaps the hardest problem - I do not know of
representative pure Lua apps that I could run just to obtain profiling
data.

Regards
Dibyendu


Re: Follow-on from does Lua need a JIT - making Lua's interpreter faster

Daurnimator
In reply to this post by Dibyendu Majumdar
On 24 February 2017 at 00:26, Dibyendu Majumdar <[hidden email]> wrote:
> I also realise that the 64-bit integer support prevents use of the NaN
> trick in representing values, and this has a performance impact due to
> inability to use 8-byte values.

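The NaN trick may not be able to live on for much longer, as it relies on
pointers fitting into the NaN payload. ARM64 is already at 52 bits of
virtual address space, and x86-64 is being extended to 57 bits with
five-level paging: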
https://lwn.net/Articles/714824/


Re: Follow-on from does Lua need a JIT - making Lua's interpreter faster

Jorge Visca
In reply to this post by Dibyendu Majumdar
On 22/02/17 12:13, Dibyendu Majumdar wrote:
> It seems from the majority of responses so far that JIT is not
> essential for Lua, or putting it another way, a majority of use cases
> can be satisfied without a JIT.
There is at least some selection bias here, though.

Jorge