reusing objects for tight iteration loops

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

reusing objects for tight iteration loops

Josh Haberman
Let's suppose I'm iterating over a large data set, and for every "row"
of the set I want to execute some Lua code.

I could implement an iterator, then my users could use for/in:

  for row in fast_iterator(data_stream) do
    do_something(row.foo, row.bar)
  end

If you optimize this pattern enough, you start getting to the point
where the malloc/GC/cache effects of allocating a new "row" every time
start to become significant.

If this were C, the next step in optimization would be to re-use the
"row" struct every time, something like:

  row_t row;
  while (get_next_item(&row, iterator)) {
    do_something(&row);
  }

This avoids the per-row allocation cost. But this doesn't translate to
Lua very well:

  local items = {}
  -- Suppose fast_iterator() is optimized to return the same
  -- row over and over, but with the fields overwritten with new values.
  for row in fast_iterator(data_stream) do
    -- Oops! My table "items" ends up with n of the same value,
    -- with all fields set to the values of the last row.
    items[#items + 1] = row
  end

We can try to work around this, but things are still dicey if the row
has sub-objects:

  local row = FastIteratorObject(iterator)
  while row:next_object()
    -- Now it's clear that "next_object()" mutates the row to be the
    -- contents of the next row.
    -- But imagine that row:next_object() can change row.bar.baz:
    do_something(row.bar)
  end

So I was just wondering if anyone had any out-of-the-box ideas about
mitigating this. One idea I had was to make the inner function a
string:

  fast_iterate(iterator, "function (row) print(row.foo) end")

This allows me to precisely control the function's environment, so I
can prevent references to "row" from escaping through the global
environment or upvalues. The downside is that then none of the
program's functions or variables will be visible, which limits the
usefulness of this approach pretty severely.

Another approach would be to see whether "row" has any references at
the end of the function; if the object didn't actually escape the
inner function then I could reuse it without any surprise to the user.
But Lua has no API for doing anything like this.

Not sure there's a good answer here, I'm just thinking out loud and
would love to hear anyone's brilliant idea.

Thanks,
Josh

Reply | Threaded
Open this post in threaded view
|

Re: reusing objects for tight iteration loops

Javier Guerra Giraldez
On Thu, Jan 30, 2014 at 7:54 PM, Josh Haberman <[hidden email]> wrote:
> If you optimize this pattern enough, you start getting to the point
> where the malloc/GC/cache effects of allocating a new "row" every time
> start to become significant.


sometimes, when i have an "accesor" function that returns a table, i
add an optional 'base' table parameter to let the user choose if they
want to reuse a table allocate a new one:

function get_element(stream, e)
    e = e or {}
    -- fill e fields
end

so, if you omit the base table, you get a new one:

elm = get_element(stream)

in a loop, you can either get different elements each time:

local tbl={}
while not_empty(stream) do
    tbl[#tbl+1] = get_element(stream)
end

or reuse the same table:

local e=nil
while not_empty(stream) do
    e=get_element(stream,e)
    -- use e
end

--
Javier

Reply | Threaded
Open this post in threaded view
|

Re: reusing objects for tight iteration loops

David Favro
In reply to this post by Josh Haberman
Rather than trying so hard to make it seem like it's actually a new row, I
think you're better off to just document for your users that it is the same
table re-used: either they can copy it if necessary, or you can offer two
versions of the iterator, one which is optimized and one "traditional".

Other than that, you could adapt the idea presented here to get a speed-up
of the every-row-is-a-new-table version:
http://lua-users.org/lists/lua-l/2012-12/msg00613.html
But, beware of users who try to iterate over the items of the row using
pairs(): you can handle this in 5.2 with __pairs, but in 5.1 it won't be easy.

-- David


Reply | Threaded
Open this post in threaded view
|

Re: reusing objects for tight iteration loops

Daurnimator
In reply to this post by Josh Haberman
On 31 January 2014 11:54, Josh Haberman <[hidden email]> wrote:
This avoids the per-row allocation cost. But this doesn't translate to
Lua very well:

  local items = {}
  -- Suppose fast_iterator() is optimized to return the same
  -- row over and over, but with the fields overwritten with new values.
  for row in fast_iterator(data_stream) do
    -- Oops! My table "items" ends up with n of the same value,
    -- with all fields set to the values of the last row.
    items[#items + 1] = row
  end

Easiest (technical) solution:
Just document that row changes, if you want a copy add a row:clone() 
If references can escape then you inherently can't reuse row.

We can try to work around this, but things are still dicey if the row
has sub-objects:

  local row = FastIteratorObject(iterator)
  while row:next_object()
    -- Now it's clear that "next_object()" mutates the row to be the
    -- contents of the next row.
    -- But imagine that row:next_object() can change row.bar.baz:
    do_something(row.bar)
  end

So I was just wondering if anyone had any out-of-the-box ideas about
mitigating this. One idea I had was to make the inner function a
string:

  fast_iterate(iterator, "function (row) print(row.foo) end")

This allows me to precisely control the function's environment, so I
can prevent references to "row" from escaping through the global
environment or upvalues. The downside is that then none of the
program's functions or variables will be visible, which limits the
usefulness of this approach pretty severely.

You don't have to go as far as having users pass a function as a string.
You can check if a function has upvalues via lua_getupvalue
==> If it does, then use the non-mutating iterator
 
D
Reply | Threaded
Open this post in threaded view
|

Re: reusing objects for tight iteration loops

Josh Haberman
On Thu, Jan 30, 2014 at 6:09 PM, Daurnimator <[hidden email]> wrote:

> On 31 January 2014 11:54, Josh Haberman <[hidden email]> wrote:
>> So I was just wondering if anyone had any out-of-the-box ideas about
>> mitigating this. One idea I had was to make the inner function a
>> string:
>>
>>   fast_iterate(iterator, "function (row) print(row.foo) end")
>>
>> This allows me to precisely control the function's environment, so I
>> can prevent references to "row" from escaping through the global
>> environment or upvalues. The downside is that then none of the
>> program's functions or variables will be visible, which limits the
>> usefulness of this approach pretty severely.
>
>
> You don't have to go as far as having users pass a function as a string.
> You can check if a function has upvalues via lua_getupvalue
> ==> If it does, then use the non-mutating iterator

That is an interesting idea that I like in concept, but can't globals
also let a value escape?

local captured = nil
-- Global function
function capture(x)
  captured = x
end

fast_iterate(
  -- Function has no upvalues, but row escapes.
  function(row)
    capture(row)
  end)

Reply | Threaded
Open this post in threaded view
|

Re: reusing objects for tight iteration loops

William Ahern
In reply to this post by Josh Haberman
On Thu, Jan 30, 2014 at 04:54:25PM -0800, Josh Haberman wrote:

> Let's suppose I'm iterating over a large data set, and for every "row"
> of the set I want to execute some Lua code.
>
> I could implement an iterator, then my users could use for/in:
>
>   for row in fast_iterator(data_stream) do
>     do_something(row.foo, row.bar)
>   end
>
> If you optimize this pattern enough, you start getting to the point
> where the malloc/GC/cache effects of allocating a new "row" every time
> start to become significant.

The idiomatic way to do this, based on several different libraries I've
seen, is to add a mode where the iterator returns a list of values on the
stack, rather than stuffing everything into a table. Have the iterator
function take a possibly empty list of key arguments. If the list is empty,
just return a row. This is the "convenience" mode. If the list is not empty,
then it's a list of values to return.

This can even be faster than reusing a table.


Reply | Threaded
Open this post in threaded view
|

Re: reusing objects for tight iteration loops

Josh Haberman
On Thu, Jan 30, 2014 at 6:47 PM, William Ahern
<[hidden email]> wrote:

> On Thu, Jan 30, 2014 at 04:54:25PM -0800, Josh Haberman wrote:
>> If you optimize this pattern enough, you start getting to the point
>> where the malloc/GC/cache effects of allocating a new "row" every time
>> start to become significant.
>
> The idiomatic way to do this, based on several different libraries I've
> seen, is to add a mode where the iterator returns a list of values on the
> stack, rather than stuffing everything into a table. Have the iterator
> function take a possibly empty list of key arguments. If the list is empty,
> just return a row. This is the "convenience" mode. If the list is not empty,
> then it's a list of values to return.

Yes I like this pattern, though it doesn't generalize as well in the
sense that you can't call utility functions that take a "row" as an
argument. But I do like it.

I would love pointers to any of the libraries you've seen that do this.