Squeezing more performance from the Lua interpreter


Squeezing more performance from the Lua interpreter

Eduardo Barthel
Hello folks,

Suppose I have the Lua interpreter compiled to work with a single application on a target hardware system. This means the interpreter can be tuned for that application and that system, for example by adjusting the compiler's C flags or Lua's C defines. I've been trying this and managed to get good results, so I would like to share what I did and ask if anyone has more ideas.

1. Setting a baseline.
My baseline is the Lua 5.4 interpreter built with just "-O2" in CFLAGS,
compiled with GCC 10, targeting x86_64 on a Linux system, with no custom Lua defines
and the standard garbage collector configuration. I then calculate the average runtime of 10 runs of my application.
Results: 1.071 seconds.
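For reference, a minimal sketch of how such an average could be measured; `sleep 0.1` stands in for the actual interpreter invocation, and the `./lua app.lua` name in the comments is an assumption:

```shell
# Average wall time of 10 runs; replace `sleep 0.1` with your app,
# e.g. ./lua app.lua (hypothetical names).
total=0
for i in $(seq 1 10); do
  start=$(date +%s%N)               # nanoseconds since epoch (GNU date)
  sleep 0.1                         # stand-in for: ./lua app.lua
  end=$(date +%s%N)
  total=$(( total + (end - start) ))
done
echo "average: $(( total / 10 / 1000000 )) ms"
```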

2. Compile just "onelua.c".
Compiling Lua as a single big C source file allows the compiler to perform more optimizations.
Results: 1.028 seconds. (-4% from baseline)

3. Add `-fno-plt -fno-stack-protector -flto` CFLAGS.
These are optimization flags that I thought made sense to try on Lua.
Results: 1.004 seconds. (-6.3% from baseline)

4. Use a custom memory allocator.
There are many memory allocators out there, such as tcmalloc, jemalloc, and mimalloc;
you can usually just link one to replace the system's malloc. Of the ones I tried,
Lua worked best with mimalloc, so I added the flag '-lmimalloc'.
Results: 0.864 seconds. (-19.3% from baseline)

PS: Interesting that mimalloc improved things by such a good margin.

5. Make the GC less aggressive.
My application creates a large number of tables (it is a compiler and creates hundreds of AST nodes), so the GC was kicking in too early all the time. I use the following to make
the GC less aggressive:

```
collectgarbage("setpause", 800)
collectgarbage("setstepmul", 400)
```

Results: 0.777 seconds. (-27.7% from baseline)

PS: I know this consumes much more memory, but I don't care in my use case.

Other things I've tried and made almost no difference:
* Adding -march=native CFLAGS.
* Trying different values for LUAI_MAXSHORTLEN, STRCACHE_N, STRCACHE_M.

Note: the optimizations were stacked; that is, step 5 is really 1+2+3+4+5.

Does anyone have other ideas on how to tune the Lua interpreter to squeeze more performance?

Best regards,
Eduardo Bart

Re: Squeezing more performance from the Lua interpreter

Gisle Vanem
Eduardo Bart wrote:

> Does anyone have other ideas on how to tune the Lua interpreter to squeeze more performance?

On MSVC (Win32), one can use register calls
(option '-Gr'), but it needs some patching of the .h files;
e.g. all vararg functions must be declared
'__cdecl', etc.

I suppose some gcc versions have register calls too (?)

--
--gv

Re: Squeezing more performance from the Lua interpreter

v
In reply to this post by Eduardo Barthel
On Thu, 2020-08-20 at 08:50 -0300, Eduardo Bart wrote:
> 1. Setting a baseline.
> My baseline is the Lua 5.4 interpreter using just "-O2" optimization
> in CFLAGS,
> Other things I've tried and made almost no difference:
> * Adding -march=native CFLAGS.

I suspect that -march=native might have a bigger impact at -O3, where
automatic vectorization is allowed, so the machine-specific boost from
-march becomes more noticeable. I can't say this reliably, though.

Either way, -O3 alone may raise performance.
--
v <[hidden email]>

Re: Squeezing more performance from the Lua interpreter

Eduardo Barthel
I've tested -O3 -march=native and it makes no difference for my use case. I tend to avoid -O3 except on code where I know math operations can be vectorized, like a matrix library; I don't think -O3 can do much in the Lua interpreter. Also, -O3 can generate more code and bigger binaries, which under certain circumstances can be slower.


Re: Squeezing more performance from the Lua interpreter

Frank Kastenholz-2
In reply to this post by Eduardo Barthel

Hi
Eduardo Bart's numbers are certainly very good.

Several years ago (in the Lua 5.2 days) I ran into some really big performance problems in a Lua app we were supporting. We got huge improvements (10x or more, iirc) by being more careful in coding and tailoring the code to Lua's characteristics. Two big changes that I recall were 1) using locals wherever possible and 2) doing simple optimizations like pulling constant evaluations out of loops.
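Those two changes can be sketched in Lua; the function and variable names below are made up for illustration:

```lua
-- Cache a global in a local: a local is a VM register access, while a
-- global goes through a table lookup.
local sqrt = math.sqrt

-- Slower form: the loop-invariant expression is recomputed every pass,
-- and math.sqrt is looked up each time.
local function norms_slow(xs, scale)
  local out = {}
  for i = 1, #xs do
    out[i] = math.sqrt(xs[i]) * (scale * 2)
  end
  return out
end

-- Faster form: global cached as a local, invariant hoisted out of the loop.
local function norms_fast(xs, scale)
  local k = scale * 2
  local out = {}
  for i = 1, #xs do
    out[i] = sqrt(xs[i]) * k
  end
  return out
end
```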

Frank



Re: Squeezing more performance from the Lua interpreter

Eduardo Barthel
I've known about this; in my Lua code I tend to follow the guidelines Roberto wrote years ago.

But that raises the question: do they all still hold true in Lua 5.4? I see those tips were written 12 years ago.


Re: Squeezing more performance from the Lua interpreter

Dibyendu Majumdar
In reply to this post by Eduardo Barthel
With gcc, when using computed goto, the following options
seem to help: -fno-crossjumping -fno-gcse.

I found that the benefits of a custom allocator depend on the
platform. The default malloc/free on Linux seems very good, but on macOS
a custom allocator is measurably faster (you can try the binary trees
benchmark to see this).

Regards

Re: Squeezing more performance from the Lua interpreter

Dibyendu Majumdar

Forgot to add: only enable these flags for lvm.c.
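With the stock makefile-style build (rather than a single onelua.c), one way to scope the flags to that one translation unit is a GNU make target-specific variable. This fragment is a sketch, not a line from Lua's actual Makefile:

```
# Extra flags for the VM translation unit only
lvm.o: CFLAGS += -fno-crossjumping -fno-gcse
```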


Re: Squeezing more performance from the Lua interpreter

Eduardo Barthel
In reply to this post by Dibyendu Majumdar
Thanks for the tip!

Seems like I got another ~4% improvement with those flags! (although I enabled them globally)
What are they doing in the interpreter? Better jumping when executing the Lua instructions?

The custom allocator also benefits me on Linux; the results I showed were all on Linux.
I've managed to bundle my Lua interpreter with this allocator: https://github.com/mjansson/rpmalloc
It seems to work fine, it's much simpler to use (a single C file), it performs the same as mimalloc,
and it's much faster than my system's standard malloc.
I think it could be even faster if the thread safety were removed from that allocator;
all the good C allocators I find out there are thread safe, and I don't see much need for that in the Lua case.

I've also experimented with porting LuaJIT's allocator to Lua 5.4 (it was quite easy to do),
and it performed worse than mimalloc or rpmalloc for my use case.

Regards,
Eduardo Bart



Re: Squeezing more performance from the Lua interpreter

Dibyendu Majumdar
On Fri, 21 Aug 2020 at 17:11, Eduardo Bart <[hidden email]> wrote:
>
> Thanks for the tip!
>
> Seems like I got another ~4% improvement with those flags! (although I've enabled globally)
> What are they doing in the interpreter? Better jumping when executing the lua instructions?

I wouldn't enable them globally, as that disables some optimizations,
especially global common subexpression elimination, which is likely bad for the
rest of the code.
The only code that needs these options is the VM, which uses computed
goto; I think in 5.4 this is enabled for gcc.
If you look at the generated assembly output, without these flags the
code may not be using computed goto at all.

>
> Custom allocator also benefits me on Linux, the results I've shown were all on linux.
> I've managed to bundle my Lua interpreter with this allocator https://github.com/mjansson/rpmalloc
> Seems to work fine and it's much more simple to use (a single C file) and performs the same as mimalloc,
> and much faster than my system's standard malloc.
> I think it could be even faster if the thread safety was removed from that allocator.
> All good allocators in C that I find out there are thread safe and I don't see much need in the Lua case.
>
> I've also experimented porting the LuaJIT's allocator to Lua 5.4 (was quite easy to do),
> and it performed worse than mimalloc or rpmalloc for my use case.

You can use dlmalloc, which is what LuaJIT used (with some mods);
I use it in Ravi.
It has an arena mode (ONLY_MSPACES), and you can disable all locking
too, which makes it well suited to Lua as a single-threaded allocator.
I used it instead of the LuaJIT version because it is well documented and
a drop-in.

Regards

Re: Squeezing more performance from the Lua interpreter

Eduardo Barthel
Ok, I've enabled those flags just for luaV_execute (I looked into the Ravi sources):

```
#if defined(__GNUC__) && !defined(__clang__)
__attribute__((optimize("no-crossjumping,no-gcse")))
#endif
void luaV_execute (lua_State *L, CallInfo *ci) {
```

Numbers are still good!

About the allocators: I've tried dlmalloc 2.8.6 and compared it with the others.
I did not use the MSPACES interface though; I used the usual malloc/realloc/free interface.

The following numbers are under the same conditions as the original post
(optimizations 1+2+3+4+5), plus "-fno-crossjumping -fno-gcse" enabled in lvm.c:

dlmalloc 0.820s (-23.4% from baseline)
rpmalloc 0.760s (-29.0% from baseline)
mimalloc 0.747s (-30.3% from baseline)

Notice mimalloc went from 0.777 seconds (-27.7%) to 0.747 seconds (-30.3%) due to the "-fno-crossjumping -fno-gcse" flags in lvm.c.

I will stick with rpmalloc because it performs better in my use case, is smaller and faster, and can still be bundled in a single C file,
while mimalloc is a burden as a dependency.
Perhaps I could later optimize rpmalloc to remove the atomic operations and thread locals, and the numbers could be even better.

Regards,
Eduardo Bart





Re: Squeezing more performance from the Lua interpreter

Sven Olsen
In reply to this post by Eduardo Barthel
> Does anyone have other ideas on how to tune the Lua interpreter to squeeze more performance?

My experience with Visual Studio (on Windows) is that rebuilding the sources using profile-guided optimization can almost double my program's speed. I'm not sure whether that translates to Linux, but insofar as you have the capacity to do a build optimized from real-world usage data rather than a canned heuristic set, I think it's worth trying.

-Sven

Re: Squeezing more performance from the Lua interpreter

Eduardo Barthel
Thanks for mentioning this. I used GCC's PGO (profile-guided optimization) through the `-fprofile-generate` and `-fprofile-use` flags and got another ~10% relative improvement!
Now, combining all the tips from this thread, this is my result versus the baseline:

Results: 0.706 seconds. (-34.1% from baseline)

So I found that, in general, my Lua 5.4 runs a script about ~10% faster when the interpreter was compiled with profile-guided optimization using that same script.

