Spelling corrector


Paco Zamora-Martínez
Hi all,

I'm implementing, just for fun, a Lua version of Peter Norvig's spelling corrector.


Original spelling corrector in Python: http://norvig.com/spell-correct.html

I'm using coroutines to avoid the overhead of several list traversals, but I'm facing efficiency problems. The Lua code is between 3 and 4 times slower than the original Python code. I'm wondering if it is possible to discuss where the code performs badly. The bottleneck is the function 'edits1', which is executed many times to distort the string with up to 2 edit operations.

I have tested the following changes:

1) Implemented using tables instead of coroutines. The performance is close in both implementations.

2) Used table.concat instead of string concatenation. It performs worse with table.concat.

3) Used string.format instead of string concatenation. It performs worse with string.format.

Many thanks, I hope someone enjoys this exercise :-)
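For context, my generator has roughly this shape (a simplified sketch with illustrative names, not the actual spell.lua code):

```lua
-- Simplified sketch of a coroutine-based edits1: yields every string
-- at edit distance 1 from `word` (deletion, transposition, replacement, insertion).
local alphabet = "abcdefghijklmnopqrstuvwxyz"

local function edits1(word)
  return coroutine.wrap(function()
    for i = 0, #word do
      local a, b = word:sub(1, i), word:sub(i + 1)
      if #b > 0 then
        coroutine.yield(a .. b:sub(2))                               -- deletion
      end
      if #b > 1 then
        coroutine.yield(a .. b:sub(2, 2) .. b:sub(1, 1) .. b:sub(3)) -- transposition
      end
      for j = 1, #alphabet do
        local c = alphabet:sub(j, j)
        if #b > 0 then
          coroutine.yield(a .. c .. b:sub(2))                        -- replacement
        end
        coroutine.yield(a .. c .. b)                                 -- insertion
      end
    end
  end)
end

-- Collect the distinct results for a short word:
local seen, count = {}, 0
for w in edits1("dog") do
  if not seen[w] then seen[w] = true; count = count + 1 end
end
```

Note how every `yield` performs string concatenation; that is where the time goes.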


Re: Spelling corrector

Enrico Colombini
On 07/07/2014 13.30, Paco Zamora Martínez wrote:
> The bottleneck of the code is in the function 'edits1'

 From a superficial glance at spell.lua:

 > yield( splits[i-1][1]..word[i+1]..word[i]..splits[i+2][2] )

Do you really need to concatenate here, or could you instead pass a
table with separate entries and concatenate only when needed?
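Something like this untested sketch (names are illustrative), shown here for transpositions only:

```lua
-- Untested sketch of the idea: yield the pieces of each transposition
-- as a table, and build the final string only when it is needed.
local function transpositions(word)
  return coroutine.wrap(function()
    for i = 1, #word - 1 do
      coroutine.yield({ word:sub(1, i - 1),      -- prefix
                        word:sub(i + 1, i + 1),  -- the swapped pair
                        word:sub(i, i),
                        word:sub(i + 2) })       -- suffix
    end
  end)
end

local results = {}
for parts in transpositions("abc") do
  results[#results + 1] = table.concat(parts)    -- concatenate only here
end
```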

Also, you could try using table.concat() instead of .. (it could be
faster... or not).

You could also keep an eye on memory usage to see if this is really an
allocation problem.

Lots of temporary strings seem to be created (e.g. by
word_str:sub(...)); perhaps the code could be restructured to reduce
their number as much as possible?

Last but not least:

 > for i=1,#word_str do for j=1,#alphabet do

I did not try to follow the code, so I may be dead wrong, but nested
loops like this usually raise a O(n^2) efficiency red flag, especially
because you perform string concatenation inside them.

--
   Enrico



Re: Spelling corrector

Paco Zamora-Martínez
Thanks Enrico, I tried some of your suggestions about string concatenation. I will reply inline:
 
> From a superficial glance at spell.lua:

> yield( splits[i-1][1]..word[i+1]..word[i]..splits[i+2][2] )

Do you really need to concatenate here, or could you instead pass a table with separate entries and concatenate only when needed?


It is really necessary to concatenate here. The distortion procedure may produce the same distorted word in several different ways. So, in this case the yield function is a wrapper over coroutine.yield which checks that a word has not been generated before, avoiding multiple yields of the same word. Not doing this check would be even worse :(
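In case it helps, the wrapper is roughly this (simplified sketch, not the real code):

```lua
-- Simplified sketch of the deduplicating yield wrapper described above:
-- each generated word is yielded at most once.
local function make_yield()
  local seen = {}
  return function(word)
    if not seen[word] then
      seen[word] = true
      coroutine.yield(word)
    end
  end
end

-- Usage inside a generator:
local gen = coroutine.wrap(function()
  local yield = make_yield()
  yield("cat"); yield("cot"); yield("cat")  -- the second "cat" is suppressed
end)
```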
 
Also, you could try using table.concat() instead of .. (it could be faster... or not).

I tried table.concat and string.format, but both perform worse. This was counter-intuitive to me, because Lua string concatenation generates copies of intermediate strings. However, it seems that for short strings and a small number of concatenated strings, the string __concat operator performs better than string.format or table.concat. Does anyone know if my observation is true?
 
You could also keep an eye on memory usage to see if this is really an allocation problem

collectgarbage("count") reports around 20MB-40MB. With table.concat these numbers are similar, but the performance decreases a lot, close to two times slower.
 
Lots of temporary strings seem to be created (e.g. by word_str:sub(...)); perhaps the code could be restructured to reduce their number as much as possible?

I'm wondering if that is possible. It would be possible to use indices instead of concatenated strings, but then the code would be forced to check repetitions over the indices, and in the end the code would be more complicated and slower :'(


Last but not least:

> for i=1,#word_str do for j=1,#alphabet do

I did not try to follow the code, so I may be dead wrong, but nested loops like this usually raise a O(n^2) efficiency red flag, especially because you perform string concatenation inside them.

Certainly, the bottleneck is the large number of words generated by both nested loops (replacements and insertions). The code needs to loop over N*M combinations, where N is the length of the user's word and M=26 is the size of the English alphabet. As Peter Norvig says about the number of words generated by the 'edits1' function:

This can be a big set. For a word of length n, there will be n deletions, n-1 transpositions, 26n alterations, and 26(n+1) insertions, for a total of 54n+25 (of which a few are typically duplicates). For example, len(edits1('something')) -- that is, the number of elements in the result of edits1('something') -- is 494.
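To double-check that arithmetic (a trivial sketch):

```lua
-- Norvig's count of raw edits1 candidates for a word of length n.
local function edits1_count(n)
  local deletions      = n
  local transpositions = n - 1
  local alterations    = 26 * n
  local insertions     = 26 * (n + 1)
  return deletions + transpositions + alterations + insertions  -- 54n + 25
end

print(edits1_count(#"something"))  -- 511 raw candidates (494 distinct, per Norvig)
```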







Re: Spelling corrector

Hisham
On 7 July 2014 at 10:12, Paco Zamora Martínez <[hidden email]> wrote:
>>
>> > yield( splits[i-1][1]..word[i+1]..word[i]..splits[i+2][2] )
>>
> I tried table.concat and string.format, but both perform worst. This was
> counter-intuitive to me, because Lua string concat generates copies of
> intermediate strings. However, seems that for short strings and small number
> of concatenated strings, string __concat performs better than string.format
> or table.concat. Does anyone know if my observation is true?

The "folk wisdom" about copies of intermediate strings in Lua is often
mis-stated, I think. Take this expression:

("aa"):upper() .. ("bb"):upper() .. ("cc"):upper() .. ("dd"):upper()

It translates to a single concatenation bytecode in both Lua and
LuaJIT, so it produces only the following strings:

"aa"
"bb"
"cc"
"AA"
"BB"
"CC"
"DD"
"AABBCCDD"

This, on the other hand, does generate intermediate strings:

local s = ""
for _, w in ipairs({"aa", "bb", "cc", "dd"}) do
   s = s .. w:upper()
end

It produces

""
"aa"
"bb"
"cc"
"AA"
"BB"
"CC"
"AABB"
"AABBCC"
"AABBCCDD"

And this second pattern is the one people tell you to avoid when they
talk about "intermediate strings". Instead, one should do:

local t = {}
for _, w in ipairs({"aa", "bb", "cc", "dd"}) do
   table.insert(t, w:upper())
end
local s = table.concat(t)

That will produce:

"aa"
"bb"
"cc"
"AA"
"BB"
"CC"
"AABBCCDD"

plus an extra table. Of course this is an oversimplified example for
illustration purposes, but often the loop is long and the naive
approach above can produce a huge pyramid of intermediate strings.

Over the years, the sensible advice was somehow distorted into some
"all string concatenation is evil" cargo-cult, but that is not true,
especially for short sequences of concatenations in an expression.
Using a..b..c will usually be cheaper and produce less garbage than
either string.format("%s%s%s", a, b, c) or table.concat({a, b, c}).

I guess I should turn this email into a blog post titled "Lua string
concatenation considered not harmful" :)
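If anyone wants to check this on their own machine, here is a rough benchmark sketch (N is arbitrary; timings vary by Lua version and machine):

```lua
-- Rough micro-benchmark: `..` vs table.concat vs string.format
-- for a short, fixed-arity concatenation.
local N = 1000000
local a, b, c = "foo", "bar", "baz"

local function bench(name, f)
  local t0 = os.clock()
  for _ = 1, N do f() end
  print(name, os.clock() - t0)
end

bench("..           ", function() return a .. b .. c end)
bench("table.concat ", function() return table.concat({a, b, c}) end)
bench("string.format", function() return string.format("%s%s%s", a, b, c) end)
```

Note that the table.concat variant also allocates a fresh table per iteration, which is part of why it tends to lose for short sequences.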

-- Hisham


Re: Spelling corrector

Dirk Laurie-2
2018-07-13 22:43 GMT+02:00 Hisham <[hidden email]>:

> On 7 July 2014 at 10:12, Paco Zamora Martínez <[hidden email]> wrote:
> [...]
>
> Over the years, the sensible advice was somehow distorted into some
> "all string concatenation is evil" cargo-cult, but that is not true,
> especially for short sequences of concatenations in an expression.
> Using a..b..c will usually be cheaper and produce less garbage than
> either string.format("%s%s%s", a, b, c) or table.concat({a, b, c}).
>
> I guess I should turn this email into a blog post titled "Lua string
> concatenation considered not harmful" :)

Now that you have resurrected an old issue, let me do likewise :-)

There is a difference between table.concat on the one hand, and
the concatenation operator and the API function lua_concat on
the other. table.concat fails if anything except strings and numbers
is among the things to be concatenated. The other two respect the
__concat metamethod.

Would it not be a good thing to expose lua_concat at the Lua level,
e.g. by a function concat(...)?
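To illustrate the difference (a minimal sketch):

```lua
-- `..` (and lua_concat) honor the __concat metamethod; table.concat does not.
local v = setmetatable({}, {
  __concat = function(a, b) return "<concat>" end
})

print("x" .. v)  --> <concat>   (the metamethod is called)

local ok = pcall(table.concat, {"x", v})
print(ok)        --> false      (table.concat rejects the non-string value)
```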


Re: Spelling corrector

Dirk Laurie-2
2018-07-13 22:43 GMT+02:00 Hisham <[hidden email]>:
> On 7 July 2014 at 10:12, Paco Zamora Martínez <[hidden email]> wrote:

> I guess I should turn this email into a blog post titled "Lua string
> concatenation considered not harmful" :)

"Considered not harmful" essays considered not harmful? [1]

-- Dirk

[1] https://meyerweb.com/eric/comment/chech.html