LuaJIT2 performance for number crunching

classic Classic list List threaded Threaded
149 messages Options
1234 ... 8
Reply | Threaded
Open this post in threaded view
|

LuaJIT2 performance for number crunching

Francesco Abbate
Hi all,

I have been quite excited in the last weeks about LuaJIT2 and its new
FFI library. As you know the performance of LuaJIT2 in term of
execution speed can, in some cases, approach the speed of C or Fortran
compiled code. The FFI library is also very interesting because it
does allow a very easy way to call directly C routine but also to
improve the efficiency by using native C structures and arrays to
greatly improves, in some cases, the memory usage and speed if
compared with a standard Lua implementation based on Lua tables.

As many have already pointed out there is a big problem when using
external C libraries that repeatedly calls Lua callback functions,
like Leo Rozumov was mentioning for numerical integration routines
that requires a pointer to a C function to evaluate the function. The
problem is that in these circumstances LuaJIT is not able to optimize
the Lua callback function because it is called from the C code.

I did think a lot of think about this problem because this is one of
the major problem of GSL Shell with routine like ODE integration,
non-linear fit and numerical integration. All these function requires
to call many time a Lua function from C Code and the performance are
very bad in term of execution speed.

Some suggested to rewrite the C routines in pure Lua to take advantage
of the exceptional performances of LuaJIT2 and avoid the C callback
problem altogether.

In order to test how effective was actually the LuaJIT2 for heavily
numeric code I've chosen a classical examples, the Runge-Kutta 4th
order ODE integration method, to test the LuaJIT2 performances versus
C optimized code. For the implementation I've taken the GSL
implementation of the method, the rk4.c is taken from the GSL library
and is included so that people can easily look at it.

I've implemented straightforwardly the same routines 'step' and
'apply' is pure Lua to compare the performance. The idea is that we
write exactly the same implementation of the algorithm in Lua and in C
and we test the results.

A this point a remark is needed about the implementation of the
algorithm. It does works for system of any dimension and it works by
using an array of doubles to store the computed values and
derivatives. As in this example we have only a 2-element
vector it would have been easy to unpack the components to have a more
efficient Lua code. The idea is that we don't want
to do that because we want to have an algorithm that can work for ODE
systems of any dimension, from 1 to N where N is possibly a big
integer.

In Lua I've got a few options to store the array: Lua native tables,
VLA (variable length array) with FFI and GSL matrices as implemented
in GSL shell. These latters have the handicap that every single
read/write access to the element of the matrix is made by a C function
that checks the index to detect out-of-range errors.

The Lua code is included and I believe it should be quite clear by itself.


Here the results of the benchmark:

LuaJIT2 with FFI, 0m47.805s
LuaJIT2 with GSL, 1m20.958s
LuaJIT2 with Lua tables, 0m9.319s
Lua 5.1.4, 0m26.644s
C code with GSL, (compiled with -O2): 0m0.607s

It is possible that I've made some errors in the benchmark, I will
wait for comments but it seems to me that te results make sense.

What Is remark is:
LuaJIT2 gives better results when using Lua tables because probably
can do the better optimizations. In this case it is x15 times slower
than plain C code.

Performance of Lua 5.1.4 is x2.85 times slower that LuaJIT2, not so
bad indeed. Is my benchmark flawed ?

For the other side I was surprised by the FFI performance that are
quite bad. I guess it is due to less effective optimizations made by
the JIT compiler but slower then plain Lua ??

Any comment is welcome on this results. I know that I may have made
many errors in the benchmark so don't be too much harsh with me :-)

--
Francesco

rk4.lua (4K) Download Attachment
rk4_benchmark.c (1K) Download Attachment
rk4.c (10K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: LuaJIT2 performance for number crunching

Leo Razoumov
On Sat, Feb 12, 2011 at 19:39, Francesco Abbate <[hidden email]> wrote:
> Hi all,
>
> As many have already pointed out there is a big problem when using
> external C libraries that repeatedly calls Lua callback functions,
> like Leo Razoumov was mentioning for numerical integration routines
> that requires a pointer to a C function to evaluate the function. The
> problem is that in these circumstances LuaJIT is not able to optimize
> the Lua callback function because it is called from the C code.
>

Mike Pall recently pointed out a clean approach to facilitate
Lua->C->Lua calls using C-side coroutines.
http://lua-users.org/lists/lua-l/2011-02/msg00455.html
Most importantly, this method does not preclude LuaJIT optimizations.

--Leo--

Reply | Threaded
Open this post in threaded view
|

Re: LuaJIT2 performance for number crunching

Mike Pall-24
In reply to this post by Francesco Abbate
Francesco Abbate wrote:
> A this point a remark is needed about the implementation of the
> algorithm. It does works for system of any dimension and it works by
> using an array of doubles to store the computed values and
> derivatives. As in this example we have only a 2-element
> vector it would have been easy to unpack the components to have a more
> efficient Lua code. The idea is that we don't want
> to do that because we want to have an algorithm that can work for ODE
> systems of any dimension, from 1 to N where N is possibly a big
> integer.

Well, that's the main problem. LuaJIT is not tuned to deal with
tons of loops that run only 2 iterations. It unrolls them, but
there's a limit to that and this hits here.

In C++ one would use templates for that purpose. This instantiates
a copy of the whole code for a specific number of dimensons.
That's not as wasteful as it sounds, since you probably only ever
use a finite set of dimensions, e.g. dim=2 and dim=3

In Lua we can do the same: specialize the Lua code at runtime for
the number of dimensions, memoize the code in a table and dispatch
based on the dimension. I.e. string.format + loadstring +
memoization table.

> Here the results of the benchmark:
>
> LuaJIT2 with FFI, 0m47.805s
> LuaJIT2 with GSL, 1m20.958s

The tiny Lua function is never compiled. It runs interpreted and
the overhead for the FFI is big when not compiled. But apparently
still slightly less than the overhead of C code with userdatas.

> LuaJIT2 with Lua tables, 0m9.319s
> Lua 5.1.4, 0m26.644s
> C code with GSL, (compiled with -O2): 0m0.607s

You're lucky it runs even that fast. None of the Lua code is
compiled, due to all of those tiny loops! Have a look with -jv.

Just for the sake of the experiment, I've expanded the code for
two dimensions by hand (attached below). This is still problematic
due to all of these loads and stores to y[1], y[2] etc., where
local variables would do. For the real thing you should definitely
expand this to y_1, y_2 etc..

Since this is really just one giant loop, all those stores and
guards cause too many snapshots, so you'll need to run it with
-Omaxsnap=200 for the plain Lua array. The FFI causes less guards,
so it runs with the default settings.

5.68s rk4.lua     Lua tables
5.68s rk4.lua     FFI
0.24s rk4_2d.lua  Lua tables with -Omaxsnap=200
0.23s rk4_2d.lua  FFI

Now this is ~25x faster than before, which also means it's faster
than the pure C code!

And if you'd use local variables instead of all those tiny arrays,
performance would be even better.

> For the other side I was surprised by the FFI performance that are
> quite bad. I guess it is due to less effective optimizations made by
> the JIT compiler but slower then plain Lua ??

You're comparing apples and oranges: a C program plus the C -> Lua
call overhead plus the (uncompiled) FFI overhead with the overhead
of a pure Lua program.

But this actually does prove my point: rewriting mixed C/Lua code
in pure Lua (+FFI) is the way to go.

--Mike

rk4_2d.lua (3K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: LuaJIT2 performance for number crunching

Leo Razoumov
On Sun, Feb 13, 2011 at 07:37, Mike Pall <[hidden email]> wrote:

> Well, that's the main problem. LuaJIT is not tuned to deal with
> tons of loops that run only 2 iterations. It unrolls them, but
> there's a limit to that and this hits here.
>
> In C++ one would use templates for that purpose. This instantiates
> a copy of the whole code for a specific number of dimensons.
> That's not as wasteful as it sounds, since you probably only ever
> use a finite set of dimensions, e.g. dim=2 and dim=3
>
> In Lua we can do the same: specialize the Lua code at runtime for
> the number of dimensions, memoize the code in a table and dispatch
> based on the dimension. I.e. string.format + loadstring +
> memoization table.
>

It would be great if this specialize/memoize/dispatch optimization for
small number of dimensions could make it into LuaJIT-2.0 release.

As my Physics professor used to say -- dimensionality can be
classified in "1,2,3,4 and many". Covering for 2,3,4 via
specializations would greatly help efforts of efficiently implementing
numerical algorithms in Lua.

--Leo--

Reply | Threaded
Open this post in threaded view
|

Re: LuaJIT2 performance for number crunching

Mike Pall-24
Leo Razoumov wrote:
> > In Lua we can do the same: specialize the Lua code at runtime for
> > the number of dimensions, memoize the code in a table and dispatch
> > based on the dimension. I.e. string.format + loadstring +
> > memoization table.
>
> It would be great if this specialize/memoize/dispatch optimization for
> small number of dimensions could make it into LuaJIT-2.0 release.

Umm, you misunderstood: this out of the scope of what a compiler
automatically can or should do for you. C++ compilers don't write
templates for you, either. :-) It's pretty easy to do it yourself
in Lua, far easier than writing C++ templates.

--Mike

Reply | Threaded
Open this post in threaded view
|

Re: LuaJIT2 performance for number crunching

Miles Bader-2
In reply to this post by Mike Pall-24
Mike Pall <[hidden email]> writes:
> Well, that's the main problem. LuaJIT is not tuned to deal with
> tons of loops that run only 2 iterations. It unrolls them, but
> there's a limit to that and this hits here.
>
> In C++ one would use templates for that purpose. This instantiates
> a copy of the whole code for a specific number of dimensons.
> That's not as wasteful as it sounds, since you probably only ever
> use a finite set of dimensions, e.g. dim=2 and dim=3

To be fair, gcc at least will happily unroll small fixed loops like
this, and usually does, to the extent that I've stopped trying to use
any kind of special technique, and just write the obvious loops when
programming in C++.

-miles

--
My books focus on timeless truths.  -- Donald Knuth

Reply | Threaded
Open this post in threaded view
|

Re: LuaJIT2 performance for number crunching

Mike Pall-24
Miles Bader wrote:

> Mike Pall <[hidden email]> writes:
> > In C++ one would use templates for that purpose. This instantiates
> > a copy of the whole code for a specific number of dimensons.
> > That's not as wasteful as it sounds, since you probably only ever
> > use a finite set of dimensions, e.g. dim=2 and dim=3
>
> To be fair, gcc at least will happily unroll small fixed loops like
> this, and usually does, to the extent that I've stopped trying to use
> any kind of special technique, and just write the obvious loops when
> programming in C++.

Only if the loop count is known to be constant. But it isn't here,
since it's user-specified.

--Mike

Reply | Threaded
Open this post in threaded view
|

Re: LuaJIT2 performance for number crunching

Leo Razoumov
In reply to this post by Mike Pall-24
On Sun, Feb 13, 2011 at 10:18, Mike Pall <[hidden email]> wrote:

> Leo Razoumov wrote:
>> > In Lua we can do the same: specialize the Lua code at runtime for
>> > the number of dimensions, memoize the code in a table and dispatch
>> > based on the dimension. I.e. string.format + loadstring +
>> > memoization table.
>>
>> It would be great if this specialize/memoize/dispatch optimization for
>> small number of dimensions could make it into LuaJIT-2.0 release.
>
> Umm, you misunderstood: this out of the scope of what a compiler
> automatically can or should do for you. C++ compilers don't write
> templates for you, either. :-) It's pretty easy to do it yourself
> in Lua, far easier than writing C++ templates.
>

Sorry for my misunderstanding but I thought that LuaJIT-2 can unroll
fixed length loops like for i=1,3 do ... end.
It is a pity that Lua does not have a concept of a constant variable
so that when I write  for i=1,n do ... end I really mean that value of
n cannot change. Is there any room to hint LuaJIT about certain
logical constrains of the executing code to assist it in
optimizations?

--Leo--

Reply | Threaded
Open this post in threaded view
|

Re: LuaJIT2 performance for number crunching

Mike Pall-24
Leo Razoumov wrote:
> On Sun, Feb 13, 2011 at 10:18, Mike Pall <[hidden email]> wrote:
> > Umm, you misunderstood: this out of the scope of what a compiler
> > automatically can or should do for you. C++ compilers don't write
> > templates for you, either. :-) It's pretty easy to do it yourself
> > in Lua, far easier than writing C++ templates.
>
> Sorry for my misunderstanding but I thought that LuaJIT-2 can unroll
> fixed length loops like for i=1,3 do ... end.

Yes, it even unrolls loops with unknown, but short iteration
counts. But not forever and not at a global scope. The heuristics
are just not set up for doing this dozens of times inside a single
root trace. This would really hurt in the general case.

> It is a pity that Lua does not have a concept of a constant variable
> so that when I write  for i=1,n do ... end I really mean that value of
> n cannot change. Is there any room to hint LuaJIT about certain
> logical constrains of the executing code to assist it in
> optimizations?

This is one meta-level higher than the compiler operates on. It's
not hard to do something like this for simple examples. But I see
some problems in real-world use cases: specialization and dispatch
for 'n' needs to be done as early as possible and not repeated.
But the original 'n' is probably loaded from somewhere else,
passed around and so on. That would lose its pseudo-const property
and kill later specializations. Not so easy ...

--Mike

Reply | Threaded
Open this post in threaded view
|

Re: LuaJIT2 performance for number crunching

Francesco Abbate
In reply to this post by Mike Pall-24
2011/2/13 Mike Pall <[hidden email]>:
> Well, that's the main problem. LuaJIT is not tuned to deal with
> tons of loops that run only 2 iterations. It unrolls them, but
> there's a limit to that and this hits here.

Hi Mike,

thank you very much for your answer. Now everything is much more clear
for me. I've got the feeling that something in the code was preventing
LuaJIT from performing the needed optimizations and the results with
FFI was especially surprising to me but now everything is clear.

I have already understood in past that in order to let LuaJIT optimize
effectively the code you have to avoid use of closures and adopt a
more procedural style. Now I understand better that LuaJIT only
optimize the innermost loops and this explain why it does largely fail
to optimize rk4 code.

> In C++ one would use templates for that purpose. This instantiates
> a copy of the whole code for a specific number of dimensons.
> That's not as wasteful as it sounds, since you probably only ever
> use a finite set of dimensions, e.g. dim=2 and dim=3

Well, for me there is some confusion here. It is true that you can use
templates with the dimension of the system as a parameter to let the
C++ compiler more aggressive optimization but if you use just plain C
loops with a non-constant bound the code will be still optimized and
the difference of performance will be a minor one. I mean, optimizing
C/C++ compiler always optimize *all* the code. The problem here is
that LuaJIT2 will only optimize the innermost loops and do almost
nothing to optimize the code in the outer loop.

> In Lua we can do the same: specialize the Lua code at runtime for
> the number of dimensions, memoize the code in a table and dispatch
> based on the dimension. I.e. string.format + loadstring +
> memoization table.

Yes, you can do that and this is an interesting possibility but it
does also increase significantly the level of complexity. You cannot
just implement your algorithm but you have to generate at run-time the
code for your routine in a way that LuaJIT can optimize. This can be
done in same cases but I think that it cannot be proposed as a generic
method to implement algorithms.

>> Here the results of the benchmark:
>>
>> LuaJIT2 with FFI, 0m47.805s
>> LuaJIT2 with GSL, 1m20.958s
>
> The tiny Lua function is never compiled. It runs interpreted and
> the overhead for the FFI is big when not compiled. But apparently
> still slightly less than the overhead of C code with userdatas.

Unfortunately the C code have to check if the index is within the
limits and it does have to check the type of the userdata. This latter
is a very expensive operations because of the way it is normally done
in Lua as described in PIL.

>> LuaJIT2 with Lua tables, 0m9.319s
>> Lua 5.1.4, 0m26.644s
>> C code with GSL, (compiled with -O2): 0m0.607s
>
> You're lucky it runs even that fast. None of the Lua code is
> compiled, due to all of those tiny loops! Have a look with -jv.
>
> Just for the sake of the experiment, I've expanded the code for
> two dimensions by hand (attached below). This is still problematic
> due to all of these loads and stores to y[1], y[2] etc., where
> local variables would do. For the real thing you should definitely
> expand this to y_1, y_2 etc..
>
> Since this is really just one giant loop, all those stores and
> guards cause too many snapshots, so you'll need to run it with
> -Omaxsnap=200 for the plain Lua array. The FFI causes less guards,
> so it runs with the default settings.
>
> 5.68s rk4.lua     Lua tables
> 5.68s rk4.lua     FFI
> 0.24s rk4_2d.lua  Lua tables with -Omaxsnap=200
> 0.23s rk4_2d.lua  FFI
>
> Now this is ~25x faster than before, which also means it's faster
> than the pure C code!
>
> And if you'd use local variables instead of all those tiny arrays,
> performance would be even better.

Ok, now it is clear. I see that when you unroll the innermost loops
the code is *really* optimize and very close (or may be better) than C
code. Unfortunately, as you know, the RK4 algorithm cannot be
implemented like you have done because it will work only of ODE system
of dimension 2. The alternative is, as you suggested, to generate
dynamically the code and let the JIT compiler optimize the code. This
is definitely interesting but my feeling is that we cannot propose
that as a generic method to implement algorithms.

> But this actually does prove my point: rewriting mixed C/Lua code
> in pure Lua (+FFI) is the way to go.

Well, this is what I've done but it was not working!!! :-)
...so you should say: rewriting mixed C/Lua code in pure Lua (+FFI)
and program everything in procedural way using only a single big loop
to be sure that LuaJIT optimize it :-)

Seriously, this discussion was very important for me to better
understand the strength and the limitations of LuaJIT2. For me the
problem with nested loops is a showstopper to propose Lua for
numerical algorithm implementation. As you explained, if you use some
special techniques it can work and you can obtain excellent
performance but the price is a higher level of complexity and much
less expressive code. I can also imagine a case where the dimension of
the problem is quite big, let say 100, what about generating ~ 600
local variables and unroll a huge loop to let LuaJIT optimize it ? As
everyone will understand this is really awkward and not robust for
generic algorithm implementations.

In my point of view the potential of the LuaJIT compiler will be much
greater for numerical algorithms and also non-numerical ones when it
will be able to optimize at least two ot three levels of nested loops.
To be honest I don't have any idea about how difficult it would be to
be done but from a user point of view this is what would be needed.

Just to be clear, I don't want to diminish the outstanding technical
value of LuaJIT2 and of its FFI implementation. We all know that what
Mike have done till know is of exceptional quality and if I have made
the benchmark was just to better understand its potential for
numerical algorithm implementation.

--
Francesco

Reply | Threaded
Open this post in threaded view
|

Re: LuaJIT2 performance for number crunching

CrazyButcher
2011/2/13 Francesco Abbate <[hidden email]>:
> Well, for me there is some confusion here. It is true that you can use
> templates with the dimension of the system as a parameter to let the
> C++ compiler more aggressive optimization but if you use just plain C
> loops with a non-constant bound the code will be still optimized and
> the difference of performance will be a minor one. I mean, optimizing
> C/C++ compiler always optimize *all* the code.

This sounds wrong to me. There is a huge difference between unrolling
small loops and using actual runtime checks in C/C++ as well.
Especially if you do this for many small ops. It's really embarrassing
what difference that will make.

> The problem here is
> that LuaJIT2 will only optimize the innermost loops and do almost
> nothing to optimize the code in the outer loop.

I am not sure what you mean here. The jit will have optimized paths
for the various codepaths it recognizes and thinks its worth
optimizing.

have you read through this?
http://lua-users.org/lists/lua-l/2008-02/msg00051.html

> I can also imagine a case where the dimension of
> the problem is quite big, let say 100, what about generating ~ 600
> local variables and unroll a huge loop to let LuaJIT optimize it ?

you do realize that at this magnitude of the loop would become "hot"
again and no unroll required. Just like your c compiler would
sometimes unroll code and sometimes leave the loop in, depending on
the constant.

Reply | Threaded
Open this post in threaded view
|

Re: LuaJIT2 performance for number crunching

Mike Pall-24
In reply to this post by Francesco Abbate
Francesco Abbate wrote:
> The problem here is
> that LuaJIT2 will only optimize the innermost loops and do almost
> nothing to optimize the code in the outer loop.

This is wrong. It anchors code at loops, but it compiles and
optimizes _all_ kinds of hot code paths.

The region selection heuristics fail for the specific example
because of the sheer mass of way-too-short loops.

> > In Lua we can do the same: specialize the Lua code at runtime for
> > the number of dimensions, memoize the code in a table and dispatch
> > based on the dimension. I.e. string.format + loadstring +
> > memoization table.
>
> Yes, you can do that and this is an interesting possibility but it
> does also increase significantly the level of complexity. You cannot
> just implement your algorithm but you have to generate at run-time the
> code for your routine in a way that LuaJIT can optimize. This can be
> done in same cases but I think that it cannot be proposed as a generic
> method to implement algorithms.

I beg to differ. It's a common idiom in Lua to generate code at
runtime. And it's quite easy, too.

If someone tells you to use templates for C++ because that will
make your code an order of magnitude faster -- do you tell him
that you're too lazy to implement that or would you rather take
that order of magnitude speedup?

> Well, this is what I've done but it was not working!!! :-)

Because you didn't try hard enough. Naive translations of programs
written in other languages almost never perform well. An idiomatic
rewrite in Lua would look quite different. It could also take
advantage of multiple-return values and other features the
language has to offer.

I showed you how to make it even faster than the C version. It's
your decision, whether you'll heed that advice or not.

> ...so you should say: rewriting mixed C/Lua code in pure Lua (+FFI)
> and program everything in procedural way using only a single big loop
> to be sure that LuaJIT optimize it :-)

This is wrong advice in general, as I've explained above.

It does help if your inner loops with non-constant bounds iterate
more than two times, but that's a general wisdom which applies to
any language.

> In my point of view the potential of the LuaJIT compiler will be much
> greater for numerical algorithms and also non-numerical ones when it
> will be able to optimize at least two ot three levels of nested loops.

Your assumptions are wrong, so your conclusion is wrong. LuaJIT is
quite competitive on SciMark, which basically consists solely of
nested loops.

--Mike

Reply | Threaded
Open this post in threaded view
|

Re: LuaJIT2 performance for number crunching

Francesco Abbate
2011/2/13 Mike Pall <[hidden email]>:
> This is wrong. It anchors code at loops, but it compiles and
> optimizes _all_ kinds of hot code paths.
>
> The region selection heuristics fail for the specific example
> because of the sheer mass of way-too-short loops.

Ok, I understand that I've done an oversimplification. Yet the
heuristic implemented in LuaJIT failed to understand the part of the
code that needs to be optimized. It is also very difficult for the
user guess what LuaJIT is able to optimize well and what is not.

> I beg to differ. It's a common idiom in Lua to generate code at
> runtime. And it's quite easy, too.
>
> If someone tells you to use templates for C++ because that will
> make your code an order of magnitude faster -- do you tell him
> that you're too lazy to implement that or would you rather take
> that order of magnitude speedup?

hmmm... I understand but I'm not convinced. An optimizing C of Fortran
compiler does not make this sort of jokes for correctly written
algorithms. For me it would be difficult to explain to other people
why they shouldn't write small nested loops like is made in the RK4
algorithm.

>> Well, this is what I've done but it was not working!!! :-)
>
> Because you didn't try hard enough. Naive translations of programs
> written in other languages almost never perform well. An idiomatic
> rewrite in Lua would look quite different. It could also take
> advantage of multiple-return values and other features the
> language has to offer.
>
> I showed you how to make it even faster than the C version. It's
> your decision, whether you'll heed that advice or not.

No, you didn't show me anything because your solution is not
acceptable. The algorithm needs to work for ODE system of arbtrary
dimension N and your solution was only for the case N=2.

> Your assumptions are wrong, so your conclusion is wrong. LuaJIT is
> quite competitive on SciMark, which basically consists solely of
> nested loops.

I assume that the Lua translations of the RK4 algorithm was perfectly
fine as it was the original C code and LuaJIT2 failed completely to
optimize the implemented algorithm. For me this means that LuaJIT2
have some major limitations for implementing numerical algorithms.

I know perfectly well the performance of LuaJIT2 in the SciMark2
benchmarks and also at the computer language game. For examples I know
about the spectral-norm test where LuaJIT2 rivals with Fortran
optimized code. But you have to be cautious with the all these
benchmarks because they test the compiler for relatively simple
algorithms. For example the FFT algorithm in SciMark2 takes just one
screen page of code. If I wrote the test with the RK4 is because I was
willing to test LuaJIT2 with a more complicated algorithms from a real
library of numerical routines.

For me it would be more interesting to know why the heuristic
algorithm failed with the RK4 algorithm and if this could be
eventually improved.

--
Francesco

Reply | Threaded
Open this post in threaded view
|

Re: LuaJIT2 performance for number crunching

Mike Pall-24
Francesco Abbate wrote:
> It is also very difficult for the
> user guess what LuaJIT is able to optimize well and what is not.

I told you to use -jv, so you don't need to guess.

> > I showed you how to make it even faster than the C version. It's
> > your decision, whether you'll heed that advice or not.
>
> No, you didn't show me anything because your solution is not
> acceptable. The algorithm needs to work for ODE system of arbtrary
> dimension N and your solution was only for the case N=2.

You don't do it by hand of course. I showed you how to generalize
it. I'm not going to do the work for you.

> I assume that the Lua translations of the RK4 algorithm was perfectly
> fine as it was the original C code and LuaJIT2 failed completely to
> optimize the implemented algorithm.

No it was not. Learn how to program in Lua, not blindly transcribe
(suboptimal) C code.

I'm sick of this whining. I will not reply to you anymore.

--Mike

Reply | Threaded
Open this post in threaded view
|

Re: LuaJIT2 performance for number crunching

Francesco Abbate
2011/2/13 Mike Pall <[hidden email]>:
> No it was not. Learn how to program in Lua, not blindly transcribe
> (suboptimal) C code.
>
> I'm sick of this whining. I will not reply to you anymore.

:-)

good way to terminate the discussion!

--
Francesco

Reply | Threaded
Open this post in threaded view
|

Re: LuaJIT2 performance for number crunching

Florian Weimer
In reply to this post by Francesco Abbate
* Francesco Abbate:

> hmmm... I understand but I'm not convinced. An optimizing C of Fortran
> compiler does not make this sort of jokes for correctly written
> algorithms. For me it would be difficult to explain to other people
> why they shouldn't write small nested loops like is made in the RK4
> algorithm.

I wonder why you don't use C or Fortran, then.  This is not intended
as a snide remark.  I doubt you can use Lua effectively as a tool if
you don't meet it halfway.

Reply | Threaded
Open this post in threaded view
|

Re: LuaJIT2 performance for number crunching

Francesco Abbate
2011/2/13 Florian Weimer <[hidden email]>:
> I wonder why you don't use C or Fortran, then.  This is not intended
> as a snide remark.  I doubt you can use Lua effectively as a tool if
> you don't meet it halfway.

I use C, C++, Lua and few other programming language but I doesn't use
Fortran :-)

I don't have any problem with Lua, I was just wondering if it was
possible to use it to implement efficiently numeric algorithms but
Mike told me that I'm a bad Lua programmer and may be also bad in
general... :-)

--
Francesco

Reply | Threaded
Open this post in threaded view
|

Re: LuaJIT2 performance for number crunching

Leo Razoumov
On Sun, Feb 13, 2011 at 17:34, Francesco Abbate <[hidden email]> wrote:

> 2011/2/13 Florian Weimer <[hidden email]>:
>> I wonder why you don't use C or Fortran, then.  This is not intended
>> as a snide remark.  I doubt you can use Lua effectively as a tool if
>> you don't meet it halfway.
>
> I use C, C++, Lua and few other programming language but I doesn't use
> Fortran :-)
>
> I don't have any problem with Lua, I was just wondering if it was
> possible to use it to implement efficiently numeric algorithms ...

If performance is your main concern then C/C++/Fortran and sometimes
Assembler are your only real choices especially is you a coding
against a fixed deadline. With LuaJIT2 or any other smart dynamic
optimizers fir that matter there are too many things beyond your
immediate control. I personally, would like to use LuaJIT2 for
numerics in a way I use MATLAB/Octave -- as a rapid fire mathematics
at my finger tips. Great runtime performance is a big bonus but I
would not bet my house on it.

That said, there is one field IMHO where C/Fortran cannot easily
replace LuaJIT -- certain classes of adaptive numerical algorithms.
The fact that functions are first class and can be created out of thin
air at runtime by other functions opens many possibilities for
adaptive numerics (e.g. genetic algorithms...) that are much harder to
do in C/Fortran world. LuaJIT2 appeal  is its ability to trace/compile
these newly generated functions to boost performance. Sometimes it
will work sometimes not. At this point in time it looks as though that
LuaJIT2 is the best tool for the job.

--Leo--

Reply | Threaded
Open this post in threaded view
|

Re: LuaJIT2 performance for number crunching

Francesco Abbate
2011/2/13 Leo Razoumov <[hidden email]>:
> If performance is your main concern then C/C++/Fortran and sometimes
> Assembler are your only real choices especially is you a coding
> against a fixed deadline. With LuaJIT2 or any other smart dynamic
> optimizers fir that matter there are too many things beyond your
> immediate control. I personally, would like to use LuaJIT2 for
> numerics in a way I use MATLAB/Octave -- as a rapid fire mathematics
> at my finger tips. Great runtime performance is a big bonus but I
> would not bet my house on it.

Hi,

just to be more clear. I've implemented an ODE integrator in GSL Shell
by using the GSL library but the perfomance is a disaster because of
we need to provide a pointer to a C function that evaluate the
derivates for the ODE system. The performance is very bad because we
do a transition

Lua -> C -> Lua

and the C function call many times a single Lua function. These is
actually a sort of worst case scenario that LuaJIT is not able to
optimize.

I've recently integrated LuaJIT2-beta6 in GSL shell (see luajit2
branch in GIT repository) and I was exploring the possibility of a
native Lua implementation of ODE integration algorithms like
Runge-Kutta or other higher order algorithm.

The idea is that we should avoid the Lua/C interface because it does
kill the perfomance and Lua/C/Lua is even worst. It is otherwise known
that in some case the perfomance on LuaJIT2 can be on the par with
C/Fortran optimized code for pure procedural numeric intensive
algorithm, see the SciMark2 benchmark and the spectral-norm test at
the computer benchmark game.

It turns out that I've made un attempt with the RK4 algorithm
translated straightforwardly from C to Lua. It was a partial failure
because LuaJIT2 failed to optimize correctly the main loop but Mike
shown me a way to make LuaJIT2 optimizer works. The only problem that
I need to generate the lua code dynamically to specialize the code for
the dimension of the ODE system. This is in same way similar to C++
template programming where you parametrize with an integer to obtain
more specialized code.

Since LuaJIT2 is already integrated in GSL Shell (this is a great
thing for me) and it works like a charm I'm going to try to
implemented a full featured ODE integrator based on RK4 with dynamic
code generation as suggested by Mike. The idea is that the performance
could be *really* on the par or better to what I obtain with C code
using the GSL library. If the RK4 experiment works than I can
implement later more advanced algorithms like is done in the GSL
library.

By the way if someone want to help with this task he is very welcome.
My plan is the following:
- take the rk4_2d.lua code (modified by Mike)
- optimize it further to take advantage of Lua possiblities to return
  multiple values for returning the derivates
- fine tune this code
- use the basic stepper to implement an adaptive stepper with a nice
  Lua interface
- when the code works for a system N=2 take the code and transform it in a
  template to generate dynamically the code for any N between 1 and lets say
  10. For N > 10 we should probably avoid to expand the local variable and
  store each component in a FFI double VLA

So, now you know all my secret plan to dominate the world (of ODE
integration) ... :-)

--
Francesco

Reply | Threaded
Open this post in threaded view
|

Re: LuaJIT2 performance for number crunching

Geoff Leyland
On 14/02/2011, at 9:08 PM, Francesco Abbate wrote:

> By the way if someone want to help with this task he is very welcome.
> My plan is the following:
> - take the rk4_2d.lua code (modified by Mike)
> - optimize it further to take advantage of Lua possiblities to return
>  multiple values for returning the derivates
> - fine tune this code
> - use the basic stepper to implement an adaptive stepper with a nice
>  Lua interface
> - when the code works for a system N=2 take the code and transform it in a
>  template to generate dynamically the code for any N between 1 and lets say
>  10. For N > 10 we should probably avoid to expand the local variable and
>  store each component in a FFI double VLA

Any particular reason to use RK4?  My trusty old copy of Numerical Recipes suggests higher order methods with implicit error estimation are better.

Cheers,
Geoff
1234 ... 8