LuaJIT FFI math 2x faster than built-in Lua math


Adam Strzelecki
I've made an interesting discovery: it seems that using math functions from ffi.C, e.g. ffi.C.sin, is 2x faster than using the built-in math library. I looked into this in the context of the optimized stateless FFI callbacks idea here: http://lua-users.org/lists/lua-l/2011-12/msg00436.html

So here are a few questions:
• Does it make sense at all to use math.fun in LuaJIT for math-intensive scripts?
• Can math.fun calls be made as fast as ffi.C.fun (direct calls)?
• If we load some other library via ffi.load(name, true), does setting global to true imply direct optimized calls via ffi.C.other_lib_fun as well (e.g. optimized LAPACK calls)?
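For reference, the mechanics behind the third question can be sketched as follows. This is only an illustration, assuming LuaJIT on a POSIX system; the library name "m" (libm) and the choice of cbrt() are assumptions made for the example, not something from the benchmark. It shows that a symbol from a library loaded with global set to true can be reached both through the returned namespace and through ffi.C; it says nothing about how either call is compiled.

```lua
-- Sketch only: needs LuaJIT's ffi module; falls through quietly elsewhere.
local ok, ffi = pcall(require, "ffi")

if ok then
  ffi.cdef[[
    double cbrt(double x);
  ]]
  -- Assumed library name "m" (libm). With global = true, the library's
  -- symbols are also loaded into the global namespace on POSIX systems,
  -- so they resolve through ffi.C as well.
  local loaded, libm = pcall(ffi.load, "m", true)
  if loaded then
    print(libm.cbrt(27), ffi.C.cbrt(27))
  else
    print("could not load libm: " .. tostring(libm))
  end
else
  print("ffi module not available (not running under LuaJIT?)")
end
```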

-- benchmark -------------------------------------
$ time luajit bench.lua
3.100828

real 0m4.300s
user 0m4.298s
sys 0m0.002s

$ time luajit bench.lua ffi
3.100828

real 0m2.615s
user 0m2.613s
sys 0m0.002s

-- code -------------------------------------
local S = 100000000
local s = 0

if arg[1] == 'ffi' then
        local ffi = require('ffi')
        ffi.cdef [[
                double sin(double x);
        ]]
        for n = 0, S-1 do
                s = s + ffi.C.sin(n/3)
        end
else
        for n = 0, S-1 do
                s = s + math.sin(n/3)
        end
end

print(string.format("%.6f", s))

Cheers,
--
Adam Strzelecki


Re: LuaJIT FFI math 2x faster than built-in Lua math

Mike Pall
Adam Strzelecki wrote:
> I've made an interesting discovery: it seems that using math
> functions from ffi.C, e.g. ffi.C.sin, is 2x faster than using
> the built-in math library.

This is a) not a new discovery, b) not generalizable and c) based
upon an unscientific test case.

First, you're using an expensive FP division and second you're
calling sin() with arguments outside its base range, which is
rather uncommon in practice.

math.sin() calls the x87 fsin instruction, which is known to be
slow on some CPUs, especially when reducing ranges. ffi.C.sin(),
as implemented in most x64 math libraries, uses SSE and a faster
range-reduction algorithm. The call overhead itself is negligible,
since sin() is an expensive operation.
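To make the two objections concrete, here is a minimal sketch of a loop that avoids both the FP division and the out-of-range arguments. The function name sum_sin_in_range, the step size and the iteration count are hypothetical choices for illustration; the arguments are kept within [-pi, pi) by wrapping.

```lua
-- Hypothetical variant of the benchmark loop: no division in the loop
-- body, and sin() only ever sees arguments in [-pi, pi).
local function sum_sin_in_range(iterations, step)
  local x, s = -math.pi, 0
  for _ = 1, iterations do
    s = s + math.sin(x)
    x = x + step
    -- wrap back into the base range instead of letting x grow
    if x >= math.pi then x = x - 2 * math.pi end
  end
  return s
end

print(string.format("%.6f", sum_sin_in_range(1000000, 0.001)))
```

The extra branch for wrapping adds a little work of its own, so timings from this loop are not directly comparable with the original benchmark.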

Also, I get very different timings:

       math.sin()   ffi.C.sin()
x86       4.13         4.75
x64       4.13         5.35

I.e. math.sin() is faster on my CPU.

Try the same thing without a division and with sqrt() and you'll
see that math.sqrt() is always faster.
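A sketch of that suggested test, for anyone who wants to try it, could look like the following. The iteration count N and the helper names are assumptions; the FFI part only runs when the ffi module is available (i.e. under LuaJIT). Each loop calls its function directly, as in the original benchmark, rather than through a function value passed in as an argument.

```lua
local N = 1000000  -- assumed iteration count; raise it for steadier timings

-- Built-in math library, no division in the loop body.
local function bench_math()
  local s = 0
  for n = 0, N - 1 do
    s = s + math.sqrt(n)
  end
  return s
end

-- Same loop through an FFI namespace C (e.g. ffi.C).
local function bench_ffi(C)
  local s = 0
  for n = 0, N - 1 do
    s = s + C.sqrt(n)
  end
  return s
end

-- Runs f(...) once and prints the result with the CPU time used.
local function timed(name, f, ...)
  local t0 = os.clock()
  local s = f(...)
  print(string.format("%-11s %.6f  %.3fs", name, s, os.clock() - t0))
  return s
end

local s_math = timed("math.sqrt", bench_math)

local ok, ffi = pcall(require, "ffi")
local s_ffi
if ok then
  ffi.cdef[[
    double sqrt(double x);
  ]]
  s_ffi = timed("ffi.C.sqrt", bench_ffi, ffi.C)
end
```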


IMHO using sin() in benchmarks is the floating-point equivalent of
Fibonacci benchmarks: totally worthless.

--Mike

Re: LuaJIT FFI math 2x faster than built-in Lua math

Adam Strzelecki
> math.sin() calls the x87 fsin instruction, which is known to be
> slow on some CPUs, especially when reducing ranges. ffi.C.sin(),
> as implemented in most x64 math libraries, uses SSE and a faster
> range-reduction algorithm. The call overhead itself is negligible,
> since sin() is an expensive operation.

Thank you for the precise explanation. It's nice to learn that a single FPU instruction can sometimes be slower than a "software" implementation using SSE, although the latter may have lower precision in some situations. I tried that "benchmark" on an SSE-less machine (well, actually in Parallels), and math.sin was a bit faster than ffi.C.sin (similar to your benchmark).

> Try the same thing without a division and with sqrt() and you'll
> see that math.sqrt() is always faster.

Both scored exactly the same here on my machine. It seems both are calling the FPU's fsqrt.

> IMHO using sin() in benchmarks is the floating-point equivalent of
> Fibonacci benchmarks: totally worthless.

Yeah, I know these are worthless, but I wouldn't have dared to ask about this if the difference had not been so noticeable. Now I know it is a tradeoff between the slower fsin and the lower-precision SSE x64 implementation. Anyway, this seems to apply only to trigonometric functions, as the rest of the GLIBC math functions seem to call the FPU.
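The precision side of that tradeoff is an assumption on my part and can be checked rather than taken on faith. A rough sketch, assuming LuaJIT and a libm that exports sin(): it walks the same n/3 arguments as the benchmark (with a smaller, arbitrarily chosen count) and records the largest absolute difference between the two implementations. It does not say which of the two is closer to the true value, and on builds where both paths end up in the same routine the difference will simply be zero.

```lua
-- Sketch only: the comparison runs when the ffi module is available.
local ok, ffi = pcall(require, "ffi")
local max_diff = 0

if ok then
  ffi.cdef[[
    double sin(double x);
  ]]
  local C = ffi.C
  for n = 0, 999999 do          -- assumed, smaller count than the benchmark
    local x = n / 3             -- same argument pattern as the benchmark
    local d = math.abs(math.sin(x) - C.sin(x))
    if d > max_diff then max_diff = d end
  end
end

print(string.format("max |math.sin - ffi.C.sin| = %.3g", max_diff))
```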

Thanks again for valuable answer,
--
Adam