String tokenization function

classic Classic list List threaded Threaded
12 messages Options
Reply | Threaded
Open this post in threaded view
|

String tokenization function

Marco Atzori
I created various functions for string manipulation in Lua, where the
strings are composed of tokens separated by an ASCII character (which
may be the point, comma, semicolon etc.). Two of these are the functions
numtok() and gettok(). Here is the code:
http://nopaste.linux-dev.org/?824658

And this is the way it should work:

gettok(sInput, iPosition, Separator, iRange)

Where

     sInput = string to manipulate
     iPosition = position of the token inside the string. If lesser than
0, it will be considered the position from the last token to the first.
If equal to 0, returns the whole string.
     Separator = ASCII code of the token separator
     iRange = optional: if specified, returns the token from position to
range. If equal to 0, return all tokens from position to the end of the
string.

local text = "apple.banana.cherry.grape.orange"

     gettok(text,1,46)
         apple

     gettok(text,-2,46)
         grape

     gettok(text,2,46,4)
         banana.cherry.grape

     gettok(text,-1,46,-3)
         cherry.grape.orange


Could you give me some advice on improving the code? In particular, I
need your help to figure out how I can avoid writing each time the same
piece of code (line 23 to 28, line 39 to 44, line 48 to 53) without
using, as much as possible, the tables.

Can you help me?

Reply | Threaded
Open this post in threaded view
|

Re: String tokenization function

Matthew Wild
On 7 November 2015 at 14:42, Marco Atzori <[hidden email]> wrote:
> I created various functions for string manipulation in Lua, where the
> strings are composed of tokens separated by an ASCII character (which may be
> the point, comma, semicolon etc.). Two of these are the functions numtok()
> and gettok(). Here is the code: http://nopaste.linux-dev.org/?824658

This code is really complex and inefficient. Are there reasons you are
doing it like this? Why can't you use tables?

If you have actual constraints, revealing them might allow people to
help you better. If you don't have any constraints and just want it to
perform well, then there are some really much simpler and more
efficient solutions than your implementation.

Regards,
Matthew

Reply | Threaded
Open this post in threaded view
|

Re: String tokenization function

Marco Atzori
Il 07/11/2015 16:46, Matthew Wild ha scritto:
> If you have actual constraints, revealing them might allow people to
> help you better. If you don't have any constraints and just want it to
> perform well, then there are some really much simpler and more
> efficient solutions than your implementation.
This code is part of a videogames addon, and is executed every frame
refresh (about every tenth of a second) for which using tables created
in continuation is a considerable waste of memory. If there are simpler
solutions, show them to me as well. I asked your help for this reason ;-)

Reply | Threaded
Open this post in threaded view
|

Re: String tokenization function

Scott Akciom


On Sat, Nov 7, 2015 at 10:52 AM, Marco Atzori <[hidden email]> wrote:
Il 07/11/2015 16:46, Matthew Wild ha scritto:
If you have actual constraints, revealing them might allow people to
help you better. If you don't have any constraints and just want it to
perform well, then there are some really much simpler and more
efficient solutions than your implementation.
This code is part of a videogames addon, and is executed every frame refresh (about every tenth of a second) for which using tables created in continuation is a considerable waste of memory. If there are simpler solutions, show them to me as well. I asked your help for this reason ;-)

In trying to avoid table allocation multiple closures and multiple strings are allocated. I'm fairly certain rewriting the code using tables will not only be cleaner but faster with less memory allocation.
Reply | Threaded
Open this post in threaded view
|

Re: String tokenization function

Dirk Laurie-2
In reply to this post by Marco Atzori
2015-11-07 17:52 GMT+02:00 Marco Atzori <[hidden email]>:

> Il 07/11/2015 16:46, Matthew Wild ha scritto:
>>
>> If you have actual constraints, revealing them might allow people to
>> help you better. If you don't have any constraints and just want it to
>> perform well, then there are some really much simpler and more
>> efficient solutions than your implementation.
>
> This code is part of a videogames addon, and is executed every frame refresh
> (about every tenth of a second) for which using tables created in
> continuation is a considerable waste of memory. If there are simpler
> solutions, show them to me as well. I asked your help for this reason ;-)

[from original post]
   gettok(text,-1,46,-3)

I cannot envisage why that argument '46' should be needed at all.

I would use a table created once and for all for each string.

local tok_meta = { __index = { gettok =
   function(tbl,i,j)
     local n=#tbl
     j = j or i
     if i and i<0 then i=n+i+1 end
     if j and j<0 then j=n+j+1 end
   return table.concat(tbl,tbl.sep,i,j) end }}

local tokens = function(str,sep)
   local fields={sep=sep}
   str:gsub(("[^%s]+"):format(sep),function(field)
      fields[#fields+1]=field
      end)
   return setmetatable(fields,tok_meta)
end

local s = tokens( "apple.banana.cherry.grape.orange",".")

> s:gettok(1)
apple
> s:gettok(-2)
grape
> s:gettok(2,4)
banana.cherry.grape
> s:gettok(-3,-1)
cherry.grape.orange

Reply | Threaded
Open this post in threaded view
|

Re: String tokenization function

Marco Atzori
Il 07/11/2015 18:54, Dirk Laurie ha scritto:

> 2015-11-07 17:52 GMT+02:00 Marco Atzori <[hidden email]>:
>> Il 07/11/2015 16:46, Matthew Wild ha scritto:
>>> If you have actual constraints, revealing them might allow people to
>>> help you better. If you don't have any constraints and just want it to
>>> perform well, then there are some really much simpler and more
>>> efficient solutions than your implementation.
>> This code is part of a videogames addon, and is executed every frame refresh
>> (about every tenth of a second) for which using tables created in
>> continuation is a considerable waste of memory. If there are simpler
>> solutions, show them to me as well. I asked your help for this reason ;-)
> [from original post]
>     gettok(text,-1,46,-3)
>
> I cannot envisage why that argument '46' should be needed at all.
>
> I would use a table created once and for all for each string.
>
> local tok_meta = { __index = { gettok =
>     function(tbl,i,j)
>       local n=#tbl
>       j = j or i
>       if i and i<0 then i=n+i+1 end
>       if j and j<0 then j=n+j+1 end
>     return table.concat(tbl,tbl.sep,i,j) end }}
>
> local tokens = function(str,sep)
>     local fields={sep=sep}
>     str:gsub(("[^%s]+"):format(sep),function(field)
>        fields[#fields+1]=field
>        end)
>     return setmetatable(fields,tok_meta)
> end
>
> local s = tokens( "apple.banana.cherry.grape.orange",".")
>
>> s:gettok(1)
> apple
>> s:gettok(-2)
> grape
>> s:gettok(2,4)
> banana.cherry.grape
>> s:gettok(-3,-1)
> cherry.grape.orange
>
Because 46 is the ASCII code of the dot. If I want, for example,
tokenize the string with another character (which can be a letter,
number, comma, semicolon, hash or any ASCII character from 1 to 255)
with your code this is not possible because it is limited to only dot as
a separator between tokens.

local text = "Roses-are-red-violets-are-blue"

As you can see, your code is unable to get "violets", when you just need
enter 45 instead of 46.


Reply | Threaded
Open this post in threaded view
|

Re: String tokenization function

Dirk Laurie-2
2015-11-07 20:05 GMT+02:00 Marco Atzori <[hidden email]>:
> Il 07/11/2015 18:54, Dirk Laurie ha scritto:

>> local s = tokens( "apple.banana.cherry.grape.orange",".")

> Because 46 is the ASCII code of the dot. If I want, for example, tokenize
> the string with another character (which can be a letter, number, comma,
> semicolon, hash or any ASCII character from 1 to 255) with your code this is
> not possible because it is limited to only dot as a separator between
> tokens.
>
> local text = "Roses-are-red-violets-are-blue"
>
> As you can see, your code is unable to get "violets", when you just need
> enter 45 instead of 46.

I paid your question the compliment of actually implementing and
testing my answer. It is a disappointment that you have summarily
dismissed that answer without bothering to read it carefullly.

   s = tokens("Roses-are-red-violets-are-blue","-")

> s:gettok(4)
violets

Reply | Threaded
Open this post in threaded view
|

Re: String tokenization function

Philipp Janda
In reply to this post by Marco Atzori
Am 07.11.2015 um 16:52 schröbte Marco Atzori:

> Il 07/11/2015 16:46, Matthew Wild ha scritto:
>> If you have actual constraints, revealing them might allow people to
>> help you better. If you don't have any constraints and just want it to
>> perform well, then there are some really much simpler and more
>> efficient solutions than your implementation.
> This code is part of a videogames addon, and is executed every frame
> refresh (about every tenth of a second) for which using tables created
> in continuation is a considerable waste of memory. If there are simpler
> solutions, show them to me as well. I asked your help for this reason ;-)
>

This one doesn't allocate any memory (except for the result string --
and the separator if you insist on passing ASCII codes around):

     local s_find, s_sub, s_char = string.find, string.sub, string.char

     local function numtok( s, c )
       local pos, n = 0, 0
       while pos do
         n, pos = n+1, s_find( s, c, pos+1, true )
       end
       return n
     end

     local function gettok( s, b, c, e )
       if b == 0 then return s end
       c = s_char( c ) -- XXX why pass ASCII codes anyway???!
       e = e or b -- default value for e
       if b < 0 or e < -1 then
         local cnt = numtok( s, c )
         if b < 0 then b = b + cnt + 1 end -- make b positive
         if e < -1 then e = e + cnt + 1 end -- make e positive
       end
       if e > 0 and b > e then b, e = e, b end -- fix order
       local bpos, pos, n = 1, 0, 1
       while n < b and pos do -- find first requested token
         n, pos = n+1, s_find( s, c, pos+1, true )
       end
       if pos then
         bpos = pos+1
         if e < 1 then pos = nil end -- to the end of the input string
         while n < e+1 and pos do -- find end of last requested token
           n, pos = n+1, s_find( s, c, pos+1, true )
         end
       end
       return s_sub( s, bpos, (pos or #s+1)-1 )
     end

However, if you extract tokens from each input string at least `n`
times, Dirk's approach is probably faster for a very small `n`. Happy
benchmarking!

Philipp




Reply | Threaded
Open this post in threaded view
|

Re: String tokenization function

Marco Atzori
In reply to this post by Dirk Laurie-2
Il 07/11/2015 19:25, Dirk Laurie ha scritto:

> 2015-11-07 20:05 GMT+02:00 Marco Atzori <[hidden email]>:
>> Il 07/11/2015 18:54, Dirk Laurie ha scritto:
>>> local s = tokens( "apple.banana.cherry.grape.orange",".")
>> Because 46 is the ASCII code of the dot. If I want, for example, tokenize
>> the string with another character (which can be a letter, number, comma,
>> semicolon, hash or any ASCII character from 1 to 255) with your code this is
>> not possible because it is limited to only dot as a separator between
>> tokens.
>>
>> local text = "Roses-are-red-violets-are-blue"
>>
>> As you can see, your code is unable to get "violets", when you just need
>> enter 45 instead of 46.
> I paid your question the compliment of actually implementing and
> testing my answer. It is a disappointment that you have summarily
> dismissed that answer without bothering to read it carefullly.
>
>     s = tokens("Roses-are-red-violets-are-blue","-")
>
>> s:gettok(4)
> violets
>
Really sorry for my "bad" answer, I was in a hurry and did not read
carefully your answer. Thanks for the tips, and I still ask pardon for first

Marco

Reply | Threaded
Open this post in threaded view
|

Re: String tokenization function

Matthew Wild
In reply to this post by Marco Atzori
On 7 November 2015 at 15:52, Marco Atzori <[hidden email]> wrote:

> Il 07/11/2015 16:46, Matthew Wild ha scritto:
>>
>> If you have actual constraints, revealing them might allow people to
>> help you better. If you don't have any constraints and just want it to
>> perform well, then there are some really much simpler and more
>> efficient solutions than your implementation.
>
> This code is part of a videogames addon, and is executed every frame refresh
> (about every tenth of a second) for which using tables created in
> continuation is a considerable waste of memory. If there are simpler
> solutions, show them to me as well. I asked your help for this reason ;-)

I have another question: is your API so flexible because it needs to
be? The 'range' option, especially the possibility of it and/or the
start position being negative, is responsible for a lot of the
function's complexity and performance too.

If you are just using this function for one specific thing on each
frame refresh, you could probably get big performance improvements by
making it less generic.

Regards,
Matthew

Reply | Threaded
Open this post in threaded view
|

Re: String tokenization function

Vaughan McAlley-2
In reply to this post by Marco Atzori
On 8 November 2015 at 02:52, Marco Atzori <[hidden email]> wrote:

> Il 07/11/2015 16:46, Matthew Wild ha scritto:
>>
>> If you have actual constraints, revealing them might allow people to
>> help you better. If you don't have any constraints and just want it to
>> perform well, then there are some really much simpler and more
>> efficient solutions than your implementation.
>
> This code is part of a videogames addon, and is executed every frame refresh
> (about every tenth of a second) for which using tables created in
> continuation is a considerable waste of memory. If there are simpler
> solutions, show them to me as well. I asked your help for this reason ;-)
>

When I was writing addons for World of Warcraft, I was staggered at
the amount of Lua code that was being run every frame, combat event or
chat event by both the native UI and addons. I tried to make my code
as lean as possible, but a lot of wasn’t, particularly pure Lua
implementations of what would otherwise be written in C. My
not-very-modern computer ran it all flawlessly, so I don’t imagine
creating a table each frame will adversely affect performance. A lot
of garbage will be produced anyway, and the GC can deal with it.

Vaughan

Reply | Threaded
Open this post in threaded view
|

Re: String tokenization function

Peter Pimley
I only skim-read the code, but if you were worried about the tradeoff of the cost of repeated work versus memory usage then you could hold previous results in a weak table.  See Programming in Lua for details.

On 9 November 2015 at 05:14, Vaughan McAlley <[hidden email]> wrote:
On 8 November 2015 at 02:52, Marco Atzori <[hidden email]> wrote:
> Il 07/11/2015 16:46, Matthew Wild ha scritto:
>>
>> If you have actual constraints, revealing them might allow people to
>> help you better. If you don't have any constraints and just want it to
>> perform well, then there are some really much simpler and more
>> efficient solutions than your implementation.
>
> This code is part of a videogames addon, and is executed every frame refresh
> (about every tenth of a second) for which using tables created in
> continuation is a considerable waste of memory. If there are simpler
> solutions, show them to me as well. I asked your help for this reason ;-)
>

When I was writing addons for World of Warcraft, I was staggered at
the amount of Lua code that was being run every frame, combat event or
chat event by both the native UI and addons. I tried to make my code
as lean as possible, but a lot of wasn’t, particularly pure Lua
implementations of what would otherwise be written in C. My
not-very-modern computer ran it all flawlessly, so I don’t imagine
creating a table each frame will adversely affect performance. A lot
of garbage will be produced anyway, and the GC can deal with it.

Vaughan