finding keys in a table

classic Classic list List threaded Threaded
20 messages Options
Reply | Threaded
Open this post in threaded view
|

finding keys in a table

Jelle Huisman
Hello,

I need a frequency list of all characters in a text file and I try to use a table for this but I'm not totally sure if this is the best way to solve my problem. Anyway I setup a table "count" with the characters as key and a counter as value. When I iterate over my data I want to:
(1) lookup if the current character already exists as a key in the table.
(2a) If it does exist I increment the value for key currentchar with 1
(2b) else I add a new entry: currentchar=1
(3) Later I sort the table and print my freq.list.

The two questions I have are: is this the best/most efficient/fastest way to do this? Second: How do I check if a key exists? (I don't think I can use an inverted table because I'll end up with duplicate keys, or not?)

Thanks for some advice,

Jelle

Reply | Threaded
Open this post in threaded view
|

Re: finding keys in a table

Alex Davies
Exactly as you list:

From: "Jelle Huisman" <[hidden email]>
(1) lookup if the current character already exists as a key in the table.
(2a) If it does exist I increment the value for key currentchar with 1
(2b) else I add a new entry: currentchar=1
(3) Later I sort the table and print my freq.list.

if tab[char] ~= nil then  -- (1) is the char in the table?
 tab[char] = tab[char] + 1  -- (2a) then increment
else
 tab[char] = 1  -- (2b) else add new entry
end

(It could also be written like this, for a bit more speed: )
local count = tab[char]
tab[char] = (count or 0) + 1 -- initializes count if necessary (an explanation can be found at: http://www.lua.org/pil/3.3.html)

Because you can't sort key/value pairs, sorting is a little bit more complicated:

local reverse = {}
for k,v in pairs(tab) do
 reverse[#reverse+1] = { k, v }
end
table.sort(reverse, function(a, b) a[2] < b[2] end)
Reply | Threaded
Open this post in threaded view
|

Re: finding keys in a table

Evan DeMond
In reply to this post by Jelle Huisman
Second: How do I check if a key exists? (I don't think I can use an
inverted table because I'll end up with duplicate keys, or not?)

You should just be able to check your_table["key"] or alternately your_table.key. If it has no value (hasn't been set yet), it'll return nil.

Lua 5.1.2  Copyright (C) 1994-2007 Lua.org, PUC-Rio
> chars = {}
> chars.a = 5
> if chars.a then print("yes") end
yes
> if chars.b then print("yes") end
> = chars.b
nil
Reply | Threaded
Open this post in threaded view
|

Re: finding keys in a table

Hans Hagen
In reply to this post by Alex Davies
Alex Davies wrote:
Exactly as you list:

From: "Jelle Huisman" <[hidden email]>
(1) lookup if the current character already exists as a key in the table.
(2a) If it does exist I increment the value for key currentchar with 1
(2b) else I add a new entry: currentchar=1
(3) Later I sort the table and print my freq.list.

if tab[char] ~= nil then  -- (1) is the char in the table?
 tab[char] = tab[char] + 1  -- (2a) then increment
else
 tab[char] = 1  -- (2b) else add new entry
end

(It could also be written like this, for a bit more speed: )
local count = tab[char]
tab[char] = (count or 0) + 1 -- initializes count if necessary (an explanation can be found at: http://www.lua.org/pil/3.3.html)

actually,

    tab[char] = (count or 0) + 1

is slower than

    if count then
        tab[char] = count + 1
    else
        tab[char] = 1
    end

(the if then variant takes 60% of the time of the or case; the interpreter needs an extra LOADK)

in a similar fashion, parallel assignments can be faster

Hans

-----------------------------------------------------------------
                                          Hans Hagen | PRAGMA ADE
              Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
     tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com
                                             | www.pragma-pod.nl
-----------------------------------------------------------------

Reply | Threaded
Open this post in threaded view
|

Re: finding keys in a table

Mike Pall-82
local t = setmetatable({}, { __index = function() return 0 end })
for c in string.gmatch(io.read("*a"), ".") do t[c] = t[c] + 1 end
table.foreach(t, print)

--Mike

Reply | Threaded
Open this post in threaded view
|

Re: finding keys in a table

Javier Guerra Giraldez
On 2/21/08, Mike Pall <[hidden email]> wrote:
> local t = setmetatable({}, { __index = function() return 0 end })
>  for c in string.gmatch(io.read("*a"), ".") do t[c] = t[c] + 1 end
>  table.foreach(t, print)

very neat!

now, which way is the fastest?


-- 
Javier

Reply | Threaded
Open this post in threaded view
|

Re: finding keys in a table

Alex Davies
In reply to this post by Hans Hagen
Hmmm. I actually meant:

tab[char] = (tab[char] or 0) + 1

Although it shouldn't make any difference.

I'm a bit confused though, because I measured:
 for i = 1, 1024*1024*64 do
   test.a = (test.a or 0) + 1
 end
versus:
 for i = 1, 1024*1024*64 do
   local count = test.a
   if count then
     test.a = count + 1
   else
     test.a = 1
   end
 end

And on my compiler/computer at least, the top one is consistently faster. No idea why, given that both give the same opcodes for the main loop:
 Gettable, Test, Add, Settable

Of course, if you mean that the first time through the loop would be faster in the bottom case you would be spot on. Still, if the loop takes any amount of time, I'd figure the first hit for each letter insignificant. Just an imo. In any case the top way is a lot neater. =)

Javier: Mike's way is fastest and neatest =)

- Alex

----- Original Message ----- From: "Hans Hagen" <[hidden email]>
actually,

    tab[char] = (count or 0) + 1

is slower than

    if count then
        tab[char] = count + 1
    else
        tab[char] = 1
    end

(the if then variant takes 60% of the time of the or case; the interpreter needs an extra LOADK)

in a similar fashion, parallel assignments can be faster

Hans

Reply | Threaded
Open this post in threaded view
|

Re: finding keys in a table

Jelle Huisman
In reply to this post by Alex Davies
Alex Davies wrote:
Exactly as you list:
Hello Alex, and others,

Sorry to be slow with my response: I had a meeting going on at the same time... :-) Your solution is almost the same as what I had first but I somehow managed to break my code. This gives me a new start. It's nice to have different implementations, I'll test it with a larger chuck of my data.

Thanks again,

Jelle

Reply | Threaded
Open this post in threaded view
|

Re: finding keys in a table

Mark Hamburg-4
In reply to this post by Hans Hagen
on 2/21/08 7:15 AM, Hans Hagen at [hidden email] wrote:

> actually,
> 
>      tab[char] = (count or 0) + 1
> 
> is slower than
> 
>      if count then
>          tab[char] = count + 1
>      else
>          tab[char] = 1
>      end
> 
> (the if then variant takes 60% of the time of the or case; the
> interpreter needs an extra LOADK)
> 
> in a similar fashion, parallel assignments can be faster

That seems surprising. Admittedly for a small input set, we avoid doing the
addition but I would expect in a long running case, we're mostly on the
first branch of the if and since or is short-circuiting it should amount to
essentially the same computation.

My guess, however, would be that the __index approach is faster in the long
run because while it makes the initialization case more expensive (it
requires a Lua function call), it does the detection as part of the standard
table fetch logic.

Mark



Reply | Threaded
Open this post in threaded view
|

Re: finding keys in a table

Hans Hagen
In reply to this post by Javier Guerra Giraldez
Javier Guerra wrote:
On 2/21/08, Mike Pall <[hidden email]> wrote:
local t = setmetatable({}, { __index = function() return 0 end })
 for c in string.gmatch(io.read("*a"), ".") do t[c] = t[c] + 1 end
 table.foreach(t, print)

very neat!

now, which way is the fastest?

hard to measure, but the lpeg variant is slightly faster

data = string.rep("a",10*1000*1000)

collectgarbage("collect")
t = os.clock()
cc = setmetatable({}, { __index = function() return 0 end })
for c in data:gmatch(".") do cc[c] = cc[c] + 1 end
print(os.clock()-t)

collectgarbage("collect")
t = os.clock()
cc = setmetatable({}, { __index = function() return 0 end })
lpeg.match((lpeg.P(1) / function(c) cc[c] = cc[c] + 1 end)^0,data)
print(os.clock()-t)

collectgarbage("collect")
t = os.clock()
local function add(c) cc[c] = cc[c] + 1 end
cc = setmetatable({}, { __index = function() return 0 end })
lpeg.match((lpeg.P(1) / add)^0,data)
print(os.clock()-t)



-----------------------------------------------------------------
                                          Hans Hagen | PRAGMA ADE
              Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
     tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com
                                             | www.pragma-pod.nl
-----------------------------------------------------------------

Reply | Threaded
Open this post in threaded view
|

Re: finding keys in a table

Mike Pall-82
Hans Hagen wrote:
> cc = [...]

Do you know about the keyword "local"?

If not, then do NOT post Lua benchmarks. Ever.

--Mike

Reply | Threaded
Open this post in threaded view
|

Re: finding keys in a table

Doug Rogers-4
In reply to this post by Mike Pall-82
Mike Pall wrote:
> local t = setmetatable({}, { __index = function() return 0 end })
> for c in string.gmatch(io.read("*a"), ".") do t[c] = t[c] + 1 end
> table.foreach(t, print)

I like it! You could replace the middle line with:

io.read("*a"):gsub("(.)", function(c) t[c] = t[c] + 1 end)

Gotta love those string functions.

I tried to clean up the special character output with a new last line:

table.foreach(t, function (c, n) print(("%q %4d"):format(c, n)) end)

I was surprised to find that "%q" did not escape the newline (linefeed)
character with "\n" but with "\
". The reference manual describes this as the proper behavior, even
including an embedded newline in its example. I've read the manual front
to back a number of times, yet I was still surprised by this.

It seems like using "\n" would give an exact representation of the
string even if it is written to a file that is shared between machines
with different line termination characteristics. My experience with this
sort of thing is that there was a reason for it, but the only reason I
can come up with (allowing multi-line quoted strings) is easily
dismissed by using '[['.

Is it done this way explicitly so it can automatically pick up the
LF/CR+LF of the host system when the string is re-read?

It would be an interesting project (not volunteering - yet!) to create
an Annotated Lua Reference Manual that includes rationales for various
design decisions. Both the reference manual and PIL have a few examples
of such annotations. Ada had such an annotated reference manual and it
was invaluable when I was learning it. It made me think from the
compiler's point of view, which is always helpful.

As always, it's about the time...

Doug

-- 
Innovative Concepts, Inc. www.innocon.com 703-893-2007 x220

Reply | Threaded
Open this post in threaded view
|

Re: finding keys in a table

Hans Hagen
In reply to this post by Mike Pall-82
Mike Pall wrote:
Hans Hagen wrote:
cc = [...]

Do you know about the keyword "local"?

sure

If not, then do NOT post Lua benchmarks. Ever.

does not make much difference in the relative speed here; the lpeg one seems slightly faster (which does not surprise me that much)

(and i've done quite some benchmarking, on large data sets as well as on rather complex lua code; not that it matters; this is not production code but just a quick comparison between gmatch and lpeg)

-----------------------------------------------------------------
                                          Hans Hagen | PRAGMA ADE
              Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
     tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com
                                             | www.pragma-pod.nl
-----------------------------------------------------------------

Reply | Threaded
Open this post in threaded view
|

Re: finding keys in a table

Alex Davies
In reply to this post by Doug Rogers-4
It's funny you mentioned it though, because the first time it did that to me I was surprised too. I'm guessing it's because it's easier to print that way. Imagine reading the output of an escaped 100kB text file otherwise. It definitely doesn't modify the strings \n's to the host systems variants - that would be counterproductive. (I check lstrlib just to be sure)

- Alex

----- Original Message ----- From: "Doug Rogers" <[hidden email]>

It seems like using "\n" would give an exact representation of the
string even if it is written to a file that is shared between machines
with different line termination characteristics. My experience with this
sort of thing is that there was a reason for it, but the only reason I
can come up with (allowing multi-line quoted strings) is easily
dismissed by using '[['.

Is it done this way explicitly so it can automatically pick up the
LF/CR+LF of the host system when the string is re-read?

Reply | Threaded
Open this post in threaded view
|

Re: finding keys in a table

Doug Rogers-4
Alex Davies wrote:
> Imagine reading the output of an escaped 100kB text file
> otherwise.

I guess I don't consider that a use case for s:format("%q") but rather a
use case for io.write('[[' .. s .. ']]'). The former only seems useful
when I want to write a short string that is quoted for later reading.

Doug

> ----- Original Message ----- From: "Doug Rogers" <[hidden email]>
>> It seems like using "\n" would give an exact representation of the
>> string even if it is written to a file that is shared between machines
>> with different line termination characteristics. My experience with this
>> sort of thing is that there was a reason for it, but the only reason I
>> can come up with (allowing multi-line quoted strings) is easily
>> dismissed by using '[['.
>>
>> Is it done this way explicitly so it can automatically pick up the
>> LF/CR+LF of the host system when the string is re-read?


-- 
Innovative Concepts, Inc. www.innocon.com 703-893-2007 x220

Reply | Threaded
Open this post in threaded view
|

Re: finding keys in a table

Hans Hagen
In reply to this post by Mark Hamburg-4
Mark Hamburg wrote:

My guess, however, would be that the __index approach is faster in the long
run because while it makes the initialization case more expensive (it
requires a Lua function call), it does the detection as part of the standard
table fetch logic.

indeed, of course at the expense of an extra table, but that seldom is an issue; the __index method is of course also cleaner

Hans

-----------------------------------------------------------------
                                          Hans Hagen | PRAGMA ADE
              Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
     tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com
                                             | www.pragma-pod.nl
-----------------------------------------------------------------

Reply | Threaded
Open this post in threaded view
|

Re: finding keys in a table

Luiz Henrique de Figueiredo
In reply to this post by Doug Rogers-4
> I guess I don't consider that a use case for s:format("%q") but rather a
> use case for io.write('[[' .. s .. ']]').

If s comes from user input, then this is unsafe if later loaded back into Lua:
s might be "]]; MALICIOUS CODE HERE ; [[" or something like that.

Reply | Threaded
Open this post in threaded view
|

Patch for escaping '\n' with string.format("%Q") (was Re: finding keys in a table)

Doug Rogers-4
Luiz Henrique de Figueiredo wrote:
>> I guess I don't consider that a use case for s:format("%q") but rather a
>> use case for io.write('[[' .. s .. ']]').

> If s comes from user input, then this is unsafe if later loaded back into Lua:
> s might be "]]; MALICIOUS CODE HERE ; [[" or something like that.

It would be too complex to have a '%Q' format that automatically escaped
'[-*[', otherwise Q could be q's big sister, just like [[]] are to '".

For anyone interested I've attached a patch for lstrlib.c that will
allow use of '%Q' to mean "escape newlines, too".

Doug

-- 
Innovative Concepts, Inc. www.innocon.com 703-893-2007 x220
diff -r -u lua-5.1.3-orig/src/lstrlib.c lua-5.1.3/src/lstrlib.c
--- lua-5.1.3-orig/src/lstrlib.c	2007-12-28 10:32:23.000000000 -0500
+++ lua-5.1.3/src/lstrlib.c	2008-02-21 14:01:20.000000000 -0500
@@ -692,7 +692,7 @@
 #define MAX_FORMAT	(sizeof(FLAGS) + sizeof(LUA_INTFRMLEN) + 10)
 
 
-static void addquoted (lua_State *L, luaL_Buffer *b, int arg) {
+static void addquoted (lua_State *L, luaL_Buffer *b, int arg, int nlescape) {
   size_t l;
   const char *s = luaL_checklstring(L, arg, &l);
   luaL_addchar(b, '"');
@@ -700,7 +700,7 @@
     switch (*s) {
       case '"': case '\\': case '\n': {
         luaL_addchar(b, '\\');
-        luaL_addchar(b, *s);
+        luaL_addchar(b, nlescape ? 'n' : *s);
         break;
       }
       case '\r': {
@@ -789,8 +789,8 @@
           sprintf(buff, form, (double)luaL_checknumber(L, arg));
           break;
         }
-        case 'q': {
-          addquoted(L, &b, arg);
+        case 'q': case 'Q': {
+          addquoted(L, &b, arg, *(strfrmt-1) == 'Q');
           continue;  /* skip the 'addsize' at the end */
         }
         case 's': {
Reply | Threaded
Open this post in threaded view
|

Re: finding keys in a table

Kristofer Karlsson
In reply to this post by Luiz Henrique de Figueiredo
This variant takes care of those cases quite neatly, while preserving
ease of use for most cases.

function escape(s) return "[[" .. s:gsub("%]%]", "]]..']]'..[[") .. "]]" end

On Thu, Feb 21, 2008 at 6:49 PM, Luiz Henrique de Figueiredo
<[hidden email]> wrote:
> > I guess I don't consider that a use case for s:format("%q") but rather a
>  > use case for io.write('[[' .. s .. ']]').
>
>  If s comes from user input, then this is unsafe if later loaded back into Lua:
>  s might be "]]; MALICIOUS CODE HERE ; [[" or something like that.
>

Reply | Threaded
Open this post in threaded view
|

Re: finding keys in a table

Romulo Bahiense
In reply to this post by Jelle Huisman
For performance I'd just intialize all characters to zero:

local count = {}

for i = 0, 255 do
    count[ string.char( i ) ] = 0
end

for c in string.gmatch( io.read( "*a" ), "." ) do
    count[ c ] = count[ c ] + 1
end

-- And to iterate over them:

for i = 0, 255 do
    local c = string.char( i )
    if count[ c ] > 0 then
        print( c, count[ c ] )
    end
end

-- Or just remove the zeroes

for i = 255, 0, -1 do
    local c = string.char( i )
    if count[ c ] == 0 then
        count[ c ] = nil
    end
end
table.foreach( count, print )


*not tested* ;)

--rb

On Thu, Feb 21, 2008 at 11:24 AM, Jelle Huisman <[hidden email]> wrote:
> Hello,
>
>  I need a frequency list of all characters in a text file and I try to
>  use a table for this but I'm not totally sure if this is the best way to
>  solve my problem. Anyway I setup a table "count" with the characters as
>  key and a counter as value. When I iterate over my data I want to:
>  (1) lookup if the current character already exists as a key in the table.
>  (2a) If it does exist I increment the value for key currentchar with 1
>  (2b) else I add a new entry: currentchar=1
>  (3) Later I sort the table and print my freq.list.
>
>  The two questions I have are: is this the best/most efficient/fastest
>  way to do this?
>  Second: How do I check if a key exists? (I don't think I can use an
>  inverted table because I'll end up with duplicate keys, or not?)
>
>  Thanks for some advice,
>
>  Jelle
>
>