Looking for ideas to lower binding overhead for lua-re2. in plain-text)

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Looking for ideas to lower binding overhead for lua-re2. in plain-text)

lua.greatwolf
>
> You should be using luaL_checklstring() here. You can pass a pointer _and_
> the length to the StringPiece constructor:
>
> size_t n;
> const char *p = luaL_checklstring(L, 2, &n);
> StringPiece subject(p, (int)n);
>
> Here's a snippet from the RE2 bindings I've written. These routines
> implement the gmatch iterator:
>  
> ...
Thank you for the feedback. Yes in my actual code I am going to be using
"luaL_checklstring". My "re_countmatch" is really just a throwaway
function that I'm using to help isolate possible performance
bottlenecks. I'm finding that trying to shave off those last couple
100ms is proving to be very difficult.

Did you happen to do any performance benchmarking and tuning on your
bindings to see how it compares with C/C++ usage or another language
binding to re2?

I came to the conclusion that 'luaL_checkstring' and its variants like
*_tostring, *_tolstring, etc. is the probable cause for that last 100ms
of overhead for long strings after experimenting, testing and observing
the following:

 - The pyre2 binding's done with Cython. The conversion from a pystring
into C-style string is done using 'PyObject_AsCharBuffer' from Python's
C API which is almost analogous to lua's 'luaL_tolstring' functions.
 - In both cases, when benchmarking pattern match count, both lua code
and python code has to make the same number of transitions into C and
back again whenever they call "countmatch" or "findall" that performs
the match operation.
 - I thought compiling the pattern might account for the difference. So
I precompiled all the patterns that was going to be used outside the
loop. Timing only the loop doing the match count with the precompiled
re2 pattern actually made no measurable difference.
 - I noticed pyre2 wasn't always returning a unique object address when
compiling a pattern. It looks like it's using some kind of memory pool
scheme where it reuses space from dead deallocated pattern objects. I
thought maybe creating a new re2 object for every pattern and hitting up
'lua_newuserdata' everytime might account for the difference. So I
changed my 'luaRE2_newobject' to return the same userdata instead.
Testing again, made no measurable difference.
 - I originally used the iterator approach like what you've presented in
your code and benchmarked that. Here's a sample output of the benchmark.
Each line has the re2 dna pattern to match followed by the number of
matches found in the provided dna seq string from stdin.

        agggtaaa|tttaccct             356
        [cgt]gggtaaa|tttaccc[acg] 1250
        a[act]ggtaaa|tttacc[agt]t  4252
        ag[act]gtaaa|tttac[agt]ct  2894
        agg[act]taaa|ttta[agt]cct  5435
        aggg[acg]aaa|ttt[cgt]ccct 1537
        agggt[cgt]aa|tt[acg]accct 1431
        agggta[cgt]a|t[acg]taccct 1608
        agggtaa[cgt]|[acg]ttaccct 2178

Those numbers also meant how many times the code had to go from lua -> C
and back again because 'luare2_nextmatch' had to be called that number
of times before finally returning nil. I found this to be almost 400ms
slower on my machine per iteration than if I simply constructed and
returned a table containing the captures.
 - And finally, when I amortized the luaL_checkstring call by making it
static like in my 're2_countmatch' or eliminated it completely then the
benchmark runtime drops to ~1.417ms per iteration! So it seems like the
call to 'luaL_checkstring', at least when given very long string,
accounts for those couple 100ms that I'm trying to shave off!

I haven't tested this on another environment out of windows, like linux,
so I cannot say whether these observations will hold.

On another note, have you considered releasing your luare2 bindings to
the public? I'm also curious how 'luare2_captures' is defined?
'RE2::FindAndConsumeN' expects an array of RE2::Arg pointers. Is
'luare2_captures' a class that has an implicit conversion operator that
allows you to pass it in as if it was "RE2::Arg *[]"?

Thanks


Reply | Threaded
Open this post in threaded view
|

Re: Looking for ideas to lower binding overhead for lua-re2. in plain-text)

William Ahern
On Fri, Aug 23, 2013 at 05:42:30PM -0700, [hidden email] wrote:

> >
> >You should be using luaL_checklstring() here. You can pass a pointer _and_
> >the length to the StringPiece constructor:
> >
> > size_t n;
> > const char *p = luaL_checklstring(L, 2, &n);
> > StringPiece subject(p, (int)n);
> >
> >Here's a snippet from the RE2 bindings I've written. These routines
> >implement the gmatch iterator:
> >  
> >...
> Thank you for the feedback. Yes in my actual code I am going to be using
> "luaL_checklstring". My "re_countmatch" is really just a throwaway
> function that I'm using to help isolate possible performance
> bottlenecks. I'm finding that trying to shave off those last couple
> 100ms is proving to be very difficult.
>
> Did you happen to do any performance benchmarking and tuning on your
> bindings to see how it compares with C/C++ usage or another language
> binding to re2?

No. It was enough for us that RE2 was several times faster than the PCRE
library. Plus, there were other issues we ran into, like memory usage of
RE2, that distracted us. Perhaps we'll end up using your module.

I'm actually currently working on a library that will utilize PCRE, RE2, and
Ragel. Ragel can be an order of magnitude faster than even RE2, but many
PCRE expressions cannot be converted, and certain subpatterns (e.g.
repetitions of large character classes) cannot be compiled in a reasonable
amount of time when in a large union. So we ended up writing a regular
expression parser that analyzes and rewrites our Perl expressions, and
assigns them to be run either as PCRE, RE2, or Ragel. We've improved
throughput over 10x, and still have a ways to go.

<snip>

> - I originally used the iterator approach like what you've presented in
> your code and benchmarked that. Here's a sample output of the benchmark.
> Each line has the re2 dna pattern to match followed by the number of
> matches found in the provided dna seq string from stdin.
>
>        agggtaaa|tttaccct             356
>        [cgt]gggtaaa|tttaccc[acg] 1250
>        a[act]ggtaaa|tttacc[agt]t  4252
>        ag[act]gtaaa|tttac[agt]ct  2894
>        agg[act]taaa|ttta[agt]cct  5435
>        aggg[acg]aaa|ttt[cgt]ccct 1537
>        agggt[cgt]aa|tt[acg]accct 1431
>        agggta[cgt]a|t[acg]taccct 1608
>        agggtaa[cgt]|[acg]ttaccct 2178
>
> Those numbers also meant how many times the code had to go from lua -> C
> and back again because 'luare2_nextmatch' had to be called that number
> of times before finally returning nil. I found this to be almost 400ms
> slower on my machine per iteration than if I simply constructed and
> returned a table containing the captures.

I'm not surprised. But I still prefer using iterators because it makes the
Lua code cleaner, and the :gmatch style interface is well docmented in the
manual. I habitually default to interators unless someone complains directly
to me, or it's clearly more appropriate to return a table.

Some of the engineers on my team dislike the way I make them use iterators,
but it makes their code more concise IMO. Plus, they tend to overemphasize
CPU performance over memory pressure. We do network scanning, and a smart
module will exit early after deciding what to do. It doesn't make sense to
load a table with thousands of elements if, after processing the hundreth,
the rest are dropped on the floor. And if filling an entire table is faster
overall, there's nothing stopping them from doing that--I don't care if they
hack on my Lua bindings, and I wish they'd do it more often.

<snip>
> On another note, have you considered releasing your luare2 bindings to
> the public?

I'm not very experienced with C++. I'd need someone to carefully re-examine
my C++ bindings before I'd ever publish the code. I just ran into a bug last
month in my bindings because a C construct I was using behaved completely
differently in C++, and it was never caught before because our C++ engineers
didn't understand what the C construct was doing--and they believed the
pernicious lie that C is just a subset of C++.

> I'm also curious how 'luare2_captures' is defined?

I'll send that to you offline.