precompiled regex's to lua

classic Classic list List threaded Threaded
14 messages Options
Reply | Threaded
Open this post in threaded view
|

precompiled regex's to lua

Asko Kauppi

I didn't find any reference to discussing precompiled regular expressions, and Lua.

Some background:

In huge log file handling, Lua loses to Perl (not by much!) seemingly because it has no concept of precompiling, and then applying the regular expression patterns.

In Perl, one can:

	my $re= qr/^\d+\s/;
	$var =~ $re;	# $re is a precompiled, optimized regex, applied to $var
    or:
	$var =~ /^\d+\s/o;	# 'o' for once, compile once, cache, reuse

Lua:
	string.match( var, "%d+%s" )	-- is "%d+%s" analyzed anew each time?


Is Lua losing in speed, since it has not this concept, or have the authors been so clever, it's already "there", just invisible? :) We're talking BIG files, say 1 million lines, or more.

-asko



Reply | Threaded
Open this post in threaded view
|

Re: precompiled regex's to lua

Karel Tuma
just benchmark it :)
lua of course wins, but not always.

the thing with compiling regexs, they're FSA (finite state automata)
basically tree of states, where nodes can point meshed to other nodes.
FSA compiler is considered non-trivial in terms of implementation size.

but lua patterns are sort of NFA (nonfinite), they're already compiled as they
are. due to that referring back to some state or sub-expressions
is very limited making lua patterns "less powerful" than pcre's, but
enough for the usual daywork and when you need something more,
you've to code the logic all by yourself.
to be exact FSA is slower for "simple" expressions where NFA on the
complex ones, some pcre implementations even use both and choose
between them depending on the expression (!).

hey, this is lua, we want the simple and fast, right?

add to this perl's suckyness on much everything else and lua is the winner
(using lua for parsing 1Gb+ logs using mmap(), look at
http://blua.leet.cz/sep/STRHOOK_PATCH.patch to get the idea)

On Thu, Nov 09, 2006 at 09:58:22PM +0100, Asko Kauppi wrote:
> 
> I didn't find any reference to discussing precompiled regular  
> expressions, and Lua.
> 
> Some background:
> 
> In huge log file handling, Lua loses to Perl (not by much!) seemingly  
> because it has no concept of precompiling, and then applying the  
> regular expression patterns.
> 
> In Perl, one can:
> 
> 	my $re= qr/^\d+\s/;
> 	$var =~ $re;	# $re is a precompiled, optimized regex, applied to 
> 	$var
>     or:
> 	$var =~ /^\d+\s/o;	# 'o' for once, compile once, cache, reuse
> 
> Lua:
> 	string.match( var, "%d+%s" )	-- is "%d+%s" analyzed anew each 
> 	time?
> 
> 
> Is Lua losing in speed, since it has not this concept, or have the  
> authors been so clever, it's already "there", just invisible? :)    
> We're talking BIG files, say 1 million lines, or more.
> 
> -asko
> 
> 

Reply | Threaded
Open this post in threaded view
|

Re: precompiled regex's to lua

Edgar Toernig
In reply to this post by Asko Kauppi
Asko Kauppi wrote:
> 
> I didn't find any reference to discussing precompiled regular  
> expressions, and Lua.

It may sound weird by I was thinking about that when I was
looking for a nice and powerful case statement: precompiled
lex-like finite automatons as a super case statement.
Something like

	case <expr>
		of <intconst> <statements>
		of <stringconst> statements>
		of <regexp> <statements>    # regexp i.e. /[0-9]*/
	end

The compiler could precompile and optimize _all_ regexps of a case
statement into a _single_ NFA.  I couldn't find a sane method
though how to pass the semantic data (captures, start/end pos) to
the bodies of the 'of' statements.  The lex/awk way of $1, $2 etc
doesn't look very appealing (and doesn't work well for nested
case statements either).

And then naturally comes the next step: one wants to regard the
case-expression as a stream of data.  So a start position is
required.  "case <expr> [<startpos>]"?  Not so nice.  Or a "continue"
or "again" statement to restart the case at the current end-pos?
Don't know.

But IMHO the basic idea is tempting - grouping multiple regexps
via a case statement to allow better optimization and getting the
simple regexp matching as a by-product:

	case foo of /^[0-9]/ dosomething end

Ciao, ET.

Reply | Threaded
Open this post in threaded view
|

Re: precompiled regex's to lua

Mike Pall-67
Hi,

Edgar Toernig wrote:
> It may sound weird by I was thinking about that when I was
> looking for a nice and powerful case statement: precompiled
> lex-like finite automatons as a super case statement.

You may want to have a look at how pattern matching in Haskell
works. Also check their lazy regex combinators.

One idea is to integrate the pattern matching language into the
core language. Then you can easily tag captures with variable
names and make them available to the bodies. This also simplifies
compiling and optimizing the pattern matches together with the
core language constructs.

But I guess it would look really ugly in any language with an
Algol-like syntax (such as Lua).

Bye,
     Mike

Reply | Threaded
Open this post in threaded view
|

Re: precompiled regex's to lua

Asko Kauppi
In reply to this post by Karel Tuma

Thanks for the background, so I'm kind of comparing apples and oranges?

Anyways, I had done the benchmarking, and up until the last yards Lua was indeed always faster. With use of qr// in Perl, it wasn't.

Which lead me to think, is there anything that could be done. In general terms, I too would prefer Lua over anything, but it also needs to be "faster or the same speed" as competitor. Otherwise, a transition would certainly -and justifyably- be questioned.

The actual issue is CSV parsing (comma separated values). Just simple, "%d+,%d+,[^,]," kind. Ideas for a better approach in Lua then string.match? (making a C module just for cutting out the , fields would be Fast but is not really an alternative, at least not in comparison to other languages)

-asko



On Thu, 9 Nov 2006 23:23:12 +0100
 Karel Tuma <[hidden email]> wrote:
just benchmark it :)
lua of course wins, but not always.

the thing with compiling regexs, they're FSA (finite state automata) basically tree of states, where nodes can point meshed to other nodes. FSA compiler is considered non-trivial in terms of implementation size.

but lua patterns are sort of NFA (nonfinite), they're already compiled as they are. due to that referring back to some state or sub-expressions is very limited making lua patterns "less powerful" than pcre's, but enough for the usual daywork and when you need something more,
you've to code the logic all by yourself.
to be exact FSA is slower for "simple" expressions where NFA on the complex ones, some pcre implementations even use both and choose
between them depending on the expression (!).

hey, this is lua, we want the simple and fast, right?

add to this perl's suckyness on much everything else and lua is the winner
(using lua for parsing 1Gb+ logs using mmap(), look at
http://blua.leet.cz/sep/STRHOOK_PATCH.patch to get the idea)

On Thu, Nov 09, 2006 at 09:58:22PM +0100, Asko Kauppi wrote:

I didn't find any reference to discussing precompiled regular expressions, and Lua.

Some background:

In huge log file handling, Lua loses to Perl (not by much!) seemingly because it has no concept of precompiling, and then applying the regular expression patterns.

In Perl, one can:

	my $re= qr/^\d+\s/;
$var =~ $re; # $re is a precompiled, optimized regex, applied to $var
    or:
$var =~ /^\d+\s/o; # 'o' for once, compile once, cache, reuse

Lua:
string.match( var, "%d+%s" ) -- is "%d+%s" analyzed anew each time?


Is Lua losing in speed, since it has not this concept, or have the authors been so clever, it's already "there", just invisible? :) We're talking BIG files, say 1 million lines, or more.

-asko




Reply | Threaded
Open this post in threaded view
|

Re: precompiled regex's to lua

D Burgess-4
How does gmatch compare?

db

Reply | Threaded
Open this post in threaded view
|

Re: precompiled regex's to lua

Andreas Stenius-2
In reply to this post by Asko Kauppi
I just wrote a simple app to read out data from a CSV-file (although ; separated).. haven't benchmarked it, but it parses a few thousand lines from a bunch of files in "no time" :)

The nitty gritty is in this function:

local function read_lines(f)
	local i = rows.__n
	for line in f:lines() do
		local value = line:gmatch( "([^;]*);?" )
		local empty_row = true
		for c = 1, #fc do
			local v = value()
			if v and v:len() > 0 then
				empty_row = false
				rows[fc[c]][i+1] = v
			end				
		end

		if not empty_row then
			i = i + 1
		end
	end

	rows.__n = i
end

Ps. the fc holds the columns for the current file (parsed out in a different function), and the rows table holds the values for all files _and_ columns, so it's a kind of combo-mombo krunchbin (yadi-yadi).. ;)

//A

[hidden email] skrev:

Thanks for the background, so I'm kind of comparing apples and oranges?

Anyways, I had done the benchmarking, and up until the last yards Lua was indeed always faster. With use of qr// in Perl, it wasn't.

Which lead me to think, is there anything that could be done. In general terms, I too would prefer Lua over anything, but it also needs to be "faster or the same speed" as competitor. Otherwise, a transition would certainly -and justifyably- be questioned.

The actual issue is CSV parsing (comma separated values). Just simple, "%d+,%d+,[^,]," kind. Ideas for a better approach in Lua then string.match? (making a C module just for cutting out the , fields would be Fast but is not really an alternative, at least not in comparison to other languages)

-asko



On Thu, 9 Nov 2006 23:23:12 +0100
 Karel Tuma <[hidden email]> wrote:
just benchmark it :)
lua of course wins, but not always.

the thing with compiling regexs, they're FSA (finite state automata)
basically tree of states, where nodes can point meshed to other nodes.
FSA compiler is considered non-trivial in terms of implementation size.

but lua patterns are sort of NFA (nonfinite), they're already compiled as they
are. due to that referring back to some state or sub-expressions
is very limited making lua patterns "less powerful" than pcre's, but
enough for the usual daywork and when you need something more,
you've to code the logic all by yourself.
to be exact FSA is slower for "simple" expressions where NFA on the
complex ones, some pcre implementations even use both and choose
between them depending on the expression (!).

hey, this is lua, we want the simple and fast, right?

add to this perl's suckyness on much everything else and lua is the winner
(using lua for parsing 1Gb+ logs using mmap(), look at
http://blua.leet.cz/sep/STRHOOK_PATCH.patch to get the idea)

On Thu, Nov 09, 2006 at 09:58:22PM +0100, Asko Kauppi wrote:

I didn't find any reference to discussing precompiled regular expressions, and Lua.

Some background:

In huge log file handling, Lua loses to Perl (not by much!) seemingly because it has no concept of precompiling, and then applying the regular expression patterns.

In Perl, one can:

    my $re= qr/^\d+\s/;
$var =~ $re; # $re is a precompiled, optimized regex, applied to $var
    or:
    $var =~ /^\d+\s/o;    # 'o' for once, compile once, cache, reuse

Lua:
string.match( var, "%d+%s" ) -- is "%d+%s" analyzed anew each time?


Is Lua losing in speed, since it has not this concept, or have the authors been so clever, it's already "there", just invisible? :) We're talking BIG files, say 1 million lines, or more.

-asko





Reply | Threaded
Open this post in threaded view
|

Re: precompiled regex's to lua

Roberto Ierusalimschy
In reply to this post by Asko Kauppi
> The actual issue is CSV parsing (comma separated values). 
> Just simple, "%d+,%d+,[^,]," kind.  Ideas for a better 
> approach in Lua then string.match?

Even with string.match there are several alternatives. Because Lua does
not precompile the patterns, small differences among them may change
the performance. You could try other patterns for your problem to
check whether you can gain some speed. (Probably results will not
extend to different platforms...) For instance, you can try ".-," or
"%d%d*,".

-- Roberto

Reply | Threaded
Open this post in threaded view
|

Re: precompiled regex's to lua

Asko Kauppi
In reply to this post by D Burgess-4

Here's the inofficial results of a small (20+ lines) test, run on three candidates.

Some observations:
     - Perl is about 2x faster than Lua
- C is about 8x faster than Perl (NOT using regex, but csv in C wouldn't need either..) - Perl performance is _not_ dependent on the way one describes the regex (= it seems to automatically cache them, anyways!)

System:   PowerBook G4 1,6GHz  OS X

The exact times, especially 'real' are not that important. The magnitudes are.


# Perl (qr//): 28s (31.37 real 26.58 user 1.19 sys) # Perl (//o): 28s (43.45 real 26.70 user 1.58 sys) # Perl (//): 28s (30.32 real 26.50 user 1.17 sys)

# Lua (match): 50s (70.44 real 48.70 user 1.71 sys) # Lua (gmatch): 1min 17s (88.00 real 75.11 user 1.62 sys)

# C (gets, strchr, -O3): 4s (4.48 real 3.33 user 0.74 sys)

For comparison, running "wc -l" on the file to get the disk overhead:

# wc (raw read):  2.7s  (30.96 real 1.49 user 1.22 sys)

-asko



On Fri, 10 Nov 2006 15:29:32 +1100
 "David Burgess" <[hidden email]> wrote:
How does gmatch compare?

db


Reply | Threaded
Open this post in threaded view
|

Re: precompiled regex's to lua

Roberto Ierusalimschy
> Here's the inofficial results of a small (20+ lines) test, 
> run on three candidates.

When sending results, it is quite useful to know results of what. Could
you send the source of these tests?

-- Roberto

Reply | Threaded
Open this post in threaded view
|

Re: precompiled regex's to lua

Asko Kauppi

The choice of not sharing the source was intentional, since it was made as part of my paid duties, to which I do not own rights to. I can ask for permission, and will most likely be granted it, but would it really matter? I don't see Lua faring multiple times better, getting the magnitudes was the point of the ad hoc test. On other areas than regex, I see Lua performing very favorably, compared to anything.

Next week, I will also test Perl CSV modules, which I've earlier seen very poor performers. Perhaps I've used the wrong flavor?

-asko


On Fri, 10 Nov 2006 10:08:24 -0200
 [hidden email] (Roberto Ierusalimschy) wrote:
Here's the inofficial results of a small (20+ lines) test, run on three candidates.

When sending results, it is quite useful to know results of what. Could
you send the source of these tests?

-- Roberto


Reply | Threaded
Open this post in threaded view
|

Re: precompiled regex's to lua

Shmuel Zeigerman-2
In reply to this post by Roberto Ierusalimschy
Roberto Ierusalimschy wrote:
When sending results, it is quite useful to know results of what. Could
you send the source of these tests?

For what it worth :)

-- A simple test with synthesized CSV:

local pcre = require "rex_pcre"
local rep, gmatch = string.rep, (string.gmatch or string.gfind)

local subj = rep (rep ("abcd", 100) .. ",", 100) .. "a"

local num = assert (tonumber (arg[1]))

--> 700 uSec/iteration (Lrexlib PCRE binding)
if arg[2] == "pcre" then
  local r = pcre.new ("[^,]+\\,?") -- precompiling regex
  for i = 1, num do
    r:gmatch (subj, function(m, t) end)
  end
end

--> 1500 uSec/iteration (Lua 5.0)
--> 2200 uSec/iteration (Lua 5.1)
if arg[2] == "lua" then
  for i = 1, num do
    for c in gmatch (subj, "[^,]+%,?") do end
  end
end


--
Shmuel

Reply | Threaded
Open this post in threaded view
|

Re: precompiled regex's to lua

Shmuel Zeigerman-2
The test was extended; see its results below.
(The test source is attached).

[ Pentium 1600 MHz, 256 MB RAM, Windows XP SP2 ]

  Execution times (microseconds per iteration)
-----------------------------------------------
    650  :  Lrexlib (PCRE-6.7/normal build)
   1000  :  Lrexlib (PCRE-6.7/no recursion build)
   1500  :  Lua 5.0
   2200  :  Lua 5.1
  15000  :  Lrexlib (POSIX/John Maddock/boost-1.31.0)
  35000  :  Lrexlib (POSIX/Henry Spencer)

--
Shmuel

Attachment: test1.lua
Description: application/binary

Reply | Threaded
Open this post in threaded view
|

Re: precompiled regex's to lua

Shmuel Zeigerman-2
Further testing has shown dependency on compiler/
runtime libraries the tested libraries were built with.

See the updated results:

[ Pentium 1600 MHz, 256 MB RAM, Windows XP SP2 ]

  Execution times (microseconds per iteration)
-----------------------------------------------
    650  :  Lrexlib (PCRE-6.7/normal build; gcc 3.4.2)
   1000  :  Lrexlib (PCRE-6.7/no recursion build; gcc 3.4.2)
   1500  :  Lua 5.0/5.1 (from latest LuaBinaries)
   2200  :  Lua 5.1 (gcc 3.4.2)
  15000  :  Lrexlib (POSIX/John Maddock/boost-1.31.0; bcc32 5.5.1)
  35000  :  Lrexlib (POSIX/Henry Spencer; compiler unknown)

--
Shmuel