# NumericLua optimization trick: lazy copying

2 messages
Hi all,

```lua
for q = 1, 1000 do
  x = I + matrix.inv(x)
end
```

A few days ago there was some discussion about how the above loop ends up making lots of copies of `x`. I think there's a way to avoid this without requiring hacks to the Lua parser, any sort of sophisticated code analysis, or Yet Another Metamethod Request, while still keeping the same natural syntax (no need for `x:div(...)`, etc.).

The basic idea is to perform all operations in place, and only make copies if the original value is requested. The key is that most operations have inverses, which can be used to 'backtrack' to the original state if it ends up being needed. I'll explain:

Each matrix/vector has two parts: a 'core' and a 'pending' list. The core is the actual data structure itself; the 'pending' list is just a list of operations (Lua functions) which must be applied to a copy of the core before the object can perform any operation. So, given

```lua
a = [...]
b = a + 1
```

initially, a's core is `[...]` and its pending list is empty. The expression `a + 1` modifies a's core `[...]` in place, but appends a 'subtract 1' operation to a's pending list. `b` gets the modified core and an empty pending list.

If later we call one of a's methods (like doing `a[4]`, say), then we copy the core which it currently shares with `b`, apply the pending subtraction operation to this new core (thus recovering the original value of `a`), and then finally go ahead with returning `a[4]`. On the other hand, if `a` was just a temporary variable, it'll get garbage collected and we'll have saved ourselves from making an unneeded copy.

Being lazy about making copies like this can 'naturally' result in some nice optimizations: `a = a + b` will do the operation in place, and the old value will simply become garbage. And in the expression `a + 1 * 2 + b`, we won't make copies for the temporary results of subexpressions.
Best of all, the programmer can still just write things in the most natural style, and these sorts of optimizations will happen automagically behind the scenes.

Anyway, it's kind of a zany idea, but it could work. A lot of the common operations have inverses, and those that don't can still be copied as before or, of course, manually mutated in place. With Lua's first-class functions and closures, it'd be possible to pull something like this off!

Thoughts?

-Paul
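The core/pending-list scheme could be sketched in plain Lua along these lines. This is a minimal sketch for a 1-D vector of numbers; `Vec`, `new`, `materialize`, and `get` are illustrative names, not NumericLua API:

```lua
-- Hypothetical sketch of the core/pending-list idea for a 1-D vector.
local Vec = {}
Vec.__index = Vec

local function new(core, pending)
  return setmetatable({ core = core, pending = pending or {} }, Vec)
end

-- Recover this object's true value: copy the shared core, then replay
-- the pending inverse operations on the copy. Inverses are prepended
-- when recorded, so ipairs replays them newest-first (undo in reverse).
local function materialize(v)
  if #v.pending == 0 then return v.core end
  local copy = {}
  for i = 1, #v.core do copy[i] = v.core[i] end
  for _, inv in ipairs(v.pending) do inv(copy) end
  v.core, v.pending = copy, {}
  return copy
end

-- a + k: mutate a's core in place and record "subtract k" as the
-- inverse, so a can be reconstructed later only if someone asks.
function Vec.__add(a, k)
  local core = a.core
  for i = 1, #core do core[i] = core[i] + k end
  table.insert(a.pending, 1, function(c)
    for i = 1, #c do c[i] = c[i] - k end
  end)
  return new(core)  -- result shares the mutated core, empty pending list
end

function Vec.get(v, i)
  return materialize(v)[i]
end

local a = new({ 1, 2, 3 })
local b = a + 10     -- in-place: no copy made yet
print(b:get(1))      --> 11  (b owns the mutated core)
print(a:get(1))      --> 1   (copy made only now, on demand)
```

If `a` had been a temporary, `materialize` would never run and the copy would never be made, which is the whole point of the scheme.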
---

Hi,

Paul Chiusano wrote:

> >for q=1,1000 do
> >  x = I + matrix.inv(x)
> >end
>
> Each matrix/vector has two parts: a 'core', and a 'pending' list - the
> core is the actual data structure itself; the 'pending' list is just a
> list of operations (lua functions) which must be applied to a copy of
> the core before it can perform any operation.

I've used lazy evaluation schemes in the past to avoid the temporary-object creation problem. But I did it the other way round: keep track of an expression tree (or list) and give returned objects just a short reference to it. Only evaluate the expression when:

1. the tree (or list) grows too large,
2. a result outside of the domain is needed (e.g. a dot product should return a number and not a one-element object), or
3. the last operation needs a copy anyway.

This works well for common linear expressions like `Y = A*i + B*j` where the intermediate results are not needed. Only three very short, constant-size reference objects need to be created in this case. These are easy for the memory allocator to recycle.

In the case of a vector/matrix library for Lua, I suggest using fixed-size userdata objects. They can hold up to a fixed-size expression in a plain list (in postfix order). This avoids problems with tree reference counting and should be good enough for most practical purposes.

The only thing you need to figure out is how to evaluate an expression with the minimum number of copies and the best use of the underlying domain operations. Extra bonus points if you JIT-compile the expression and the loops around it. :-)

[ In fact I did something like this (by hand) with the Lua entry for the "pidigits" shootout benchmark:

http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=pidigits&lang=luajit&id=2

(Ignore the highlighting -- it thinks `#x` is a comment.)

The particular PI calculation algorithm required by the specs needs multi-precision numbers. Alas, Lua does not have such an object type by default.
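The deferred expression-list idea above could be sketched in plain Lua (using tables rather than the fixed-size userdata the message suggests). `Expr`, `lift`, `combine`, `eval`, and the `MAX_OPS` threshold are all illustrative assumptions:

```lua
-- Hypothetical sketch: returned objects hold a small postfix program
-- instead of data; arithmetic is deferred until a value is demanded
-- or the program grows too large (rule 1 in the message).
local MAX_OPS = 8  -- flush threshold (assumption; tune per domain)

local Expr = {}
Expr.__index = Expr

local function lift(x)
  if getmetatable(x) == Expr then return x end
  return setmetatable({ ops = { x } }, Expr)  -- literal: one-entry program
end

-- Evaluate the postfix program with a plain stack (rules 2/3: a real
-- value is finally needed, so do the arithmetic now).
function Expr.eval(e)
  local stack = {}
  for _, v in ipairs(e.ops) do
    if v == "+" or v == "*" then
      local y, x = table.remove(stack), table.remove(stack)
      stack[#stack + 1] = (v == "+") and x + y or x * y
    else
      stack[#stack + 1] = v
    end
  end
  return stack[1]
end

local function combine(a, b, op)
  a, b = lift(a), lift(b)
  local ops = {}
  for _, v in ipairs(a.ops) do ops[#ops + 1] = v end
  for _, v in ipairs(b.ops) do ops[#ops + 1] = v end
  ops[#ops + 1] = op
  local e = setmetatable({ ops = ops }, Expr)
  if #ops > MAX_OPS then return lift(e:eval()) end  -- rule 1: too large
  return e
end

Expr.__add = function(a, b) return combine(a, b, "+") end
Expr.__mul = function(a, b) return combine(a, b, "*") end

-- Y = A*i + B*j builds only tiny reference objects; no arithmetic yet:
local Y = lift(2) * 3 + lift(4) * 5
print(Y:eval())  --> 26
```

For real vectors and matrices, the literals pushed by `lift` would be references to the data objects, and `eval` would dispatch to the domain operations with whatever copy-minimizing order it can find.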
The one external library used by the original program was written in pure Lua and was very slow. The other bindings to C libraries had installation problems (the shootout maintainers don't have endless time on their hands) or didn't work well with Lua 5.1. Sigh.

So I wrote a self-standing implementation in pure Lua with a few tricks. Only a handful of linear expressions are needed for the calculation, and they are dynamically pasted into the surrounding loops and compiled on the fly. OK, in this case I hand-selected the expressions, but they could have been derived automatically by a lazy-evaluation approach.

Doing the full expressions inline is way better than a plain mapping onto the usual `+ - * /` primitives. It avoids many object copies and in turn reduces cache thrashing, which is a major bottleneck in this benchmark.

The performance is quite competitive when compiled with LuaJIT, especially considering that practically all other entries use dedicated MPN libraries written in C! And the winning entries all link to GMP, which uses hand-tuned assembler code:

http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=pidigits&lang=all

So in most cases the performance of the language itself is not measured, just the performance of the bignum binding. ]

Bye,
Mike
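The "paste the expression into the surrounding loop and compile on the fly" trick can be sketched with `loadstring` (Lua 5.1; `load` on later versions). This is a hypothetical minimal version over plain number arrays, not the actual pidigits code; `compile_axpy` and its parameters are made-up names:

```lua
-- Hypothetical sketch: splice a linear expression into loop source code
-- and compile it, so the whole expression runs inline per element with
-- no temporary vectors at all.
local function compile_axpy(expr)
  -- expr is a string in terms of a[i] and b[i], e.g. "2*a[i] + 3*b[i]"
  local src = [[
    return function(y, a, b, n)
      for i = 1, n do
        y[i] = ]] .. expr .. [[
      end
    end
  ]]
  -- loadstring exists on Lua 5.1; fall back to load() on 5.2+
  local chunk = assert(loadstring and loadstring(src) or load(src))
  return chunk()
end

local axpy = compile_axpy("2*a[i] + 3*b[i]")
local y, a, b = {}, { 1, 2 }, { 10, 20 }
axpy(y, a, b, 2)
print(y[1], y[2])  --> 32  64
```

Under LuaJIT the generated function is JIT-compiled like any other Lua code, which is where the "bonus points" from the previous message come in.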