NumericLua optimization trick: lazy copying

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view

NumericLua optimization trick: lazy copying

Paul Chiusano
Hi all,

for q=1,1000 do
        x = I + matrix.inv(x)

A few days ago there was some discussion about how the above loop will
end up making lots of copies of x. I think there's a way you can avoid
this without requiring hacks to the lua parser, or any sort of
sophisticated code analysis, or Yet Another Metamethod Request, while
still allowing one to keep the same natural syntax (no need for
x:div(...), etc) The basic idea is to perform all operations in place,
and only make copies if the original value is requested. The key is
that most operations have inverses which can be used to 'backtrack' to
the original state, if it ends up being needed. I'll explain:

Each matrix/vector has two parts: a 'core', and a 'pending' list - the
core is the actual data structure itself; the 'pending' list is just a
list of operations (lua functions) which must be applied to a copy of
the core before it can perform any operation. So, given

-- a = [...]
b = a + 1

initially, a's core is [...] and its pending list is empty. The
expression a + 1 modifies a's core [...] in place, but appends a
'subtract 1' operation to a's pending list. b gets the modified core
and an empty pending list. If later, we call one of a's methods, (like
doing a[4], say), then we copy the core which it currently shares with
b, then apply the pending subtraction operation to this new core, (
thus recovering the original value of a), and then finally go ahead
with returning a[4]. On the other hand, if a was just a temporary
variable, it'll get garbage collected and we'll have saved ourselves
from making an unneeded copy.

Being lazy about making copies like this can 'naturally' result in
some nice optimizations. a = a + b will do the operation in place, and
the old value will simply become garbage. Or in the expression 'a + 1
* 2 + b', we won't make copies for the temporary results of
subexpressions. Best of all, the programmer can still just write
things in the most natural style, and these sorts of optimizations
will just happen automagically behind the scenes.

Anyway, it's kind of a zany idea but it could work. A lot of the
common operations have inverses, and those that don't can still be
copied as before or, of course, manually mutated in place. With Lua's
first class functions / closures, it'd be possible to pull something
like this off!



Reply | Threaded
Open this post in threaded view

Re: NumericLua optimization trick: lazy copying

Mike Pall-5-2

Paul Chiusano wrote:
> >for q=1,1000 do
> >        x = I + matrix.inv(x)
> >end
> Each matrix/vector has two parts: a 'core', and a 'pending' list - the
> core is the actual data structure itself; the 'pending' list is just a
> list of operations (lua functions) which must be applied to a copy of
> the core before it can perform any operation.

I've used lazy evaluation schemes in the past to avoid the
temporary object creation problem. But I did it the other way
round: keep track of an expression tree (or list) and give
returned objects just a short reference to it.

Only evaluate the expression when:
1. the tree (or list) grows too large,
2. a result outside of the domain is needed (e.g. a dot product
   should return a number and not a one-element object) or
3. the last operation needs a copy anyway.

This works well for common linear expressions like Y = A*i + B*j
where the intermediate results are not needed. Only three very
short and constant size reference objects need to be created in
this case. These are easy to recycle by the memory allocator.

In the case of a vector/matrix library for Lua I suggest using
fixed size userdata objects. They can hold up to a fixed size
expression in a plain list (in postfix order). This avoids
problems with tree reference counting and should be good enough
for most practical purposes.

The only thing you need to figure out now, is how to evaluate an
expression with the minimum amount of copies and the best use of
the underlying domain operations. Extra bonus points if you JIT
compile the expression and the loops around it. :-)

In fact I did something like this (by hand) with the Lua entry
for the "pidigit" shootout benchmark:

(Ignore the highlighting -- it thinks #x is a comment.)

The particular PI calculation algorithm required by the specs
needs multi-precision numbers. Alas, Lua does not have such an
object type by default. The one external library used by the
original program was written in pure Lua and very slow. The other
bindings to C libraries had installation problems (the shootout
maintainers don't have endless time at their hands) or didn't
work well with Lua 5.1. Sigh.

So I wrote a self-standing implementation in pure Lua with a few
tricks. There are only a handful of linear expressions needed for
the calculation and they are dynamically pasted into the
surrounding loops and compiled on-the-fly.

Ok, in this case I hand-selected the expressions. But they could
have been derived automatically by a lazy evaluation approach.

Doing the full expressions inline is way better than a plain
mapping onto the usual +-*/ primitives. It avoids many object
copies and in turn reduces cache trashing which is a major
bottleneck in this benchmark.

The performance is quite competitive when compiled with LuaJIT.
Especially considering that practically all other entries use
dedicated MPN libraries written in C! And the winning entries all
link to GMP which uses hand-tuned assembler code:

So in most cases the performance of the language itself is not
measured, just the performance of the bignum binding.