string immutability

classic Classic list List threaded Threaded
32 messages Options
12
Reply | Threaded
Open this post in threaded view
|

string immutability

Thijs Schreijer
List,

I'm writing some code that might potentially handle very large strings. To write this properly/efficient I have a question on string immutability.

What the wiki says [1];
> Immutability of strings
>
> Lua strings are immutable and interned. This has some implications.
> To create a string that differs in only one-character from an
> existing 100 MB string requires creating an entirely new 100 MB
> string since the original string cannot be modified.

Does this mean that in the following snippet;

local first  = [[ some 100mb stuff here ]]
local second = first
local third  = first .. "x"

There actually only 1 100mb memory block is allocated for both variables 'first' and 'second' as they share the same value (and point to the same memory block because it is only a simple assignment)?
And for 'third' a new memory block is allocated to store a full copy of the 100mb data? Because it assigns an expression?

If 'first' and 'second' share the memory, it means I can freely assign my large string or pass it around as a function argument (as long as I don't modify it). But if they each get their own copy in memory, then I should be very careful and probably limit the string to a single shared upvalue.

Any help is appreciated.
Thijs

[1] http://lua-users.org/wiki/ImmutableObjects 

Reply | Threaded
Open this post in threaded view
|

Re: string immutability

Enrico Colombini
On 02/01/2014 9.57, Thijs Schreijer wrote:
> local first  = [[ some 100mb stuff here ]]
 > local second = first
 > local third  = first .. "x"

You have two strings here: 'first' and 'second' are the same string,
'third' is a different string.

Assignment does not duplicate strings. Moreover, because strings are
interned, there is a single copy of strings with equal contents; if you
produce another identical string by different means (e.g. by
concatenation), after the interning operation a single copy of the
string will exist (please correct me if I am wrong here).

".." is a dangerous operator when doing many concatenations or when
operating on large strings.
table.concat and/or the C-level luaL_Buffer can be useful to reduce
memory usage in these cases.

--
   Enrico

Reply | Threaded
Open this post in threaded view
|

Re: string immutability

Scott Morgan
In reply to this post by Thijs Schreijer
On 02/01/14 08:57, Thijs Schreijer wrote:
> I'm writing some code that might potentially handle very large
> strings. To write this properly/efficient I have a question on string
> immutability.

You might want to look at the 'rope' patten.

http://en.wikipedia.org/wiki/Rope_%28computer_science%29
http://lua-users.org/wiki/LuaRopes

Scott

Reply | Threaded
Open this post in threaded view
|

Re: string immutability

Fabien-3
In reply to this post by Enrico Colombini
The C "buffer" interface in Lua is designed specifically for this: you can append cheaply in it, and the price to pay is, you lose equality-checks-as-cheap-as-a-pointer-comparison. In practice, if you're not performance-bound, using a list-of-strings as your buffer (instead of a single string) gives good enough results without having to use C. table.concat(), which converts the latter into the former, is very effective (although you often never have to convert you list of strings into a single one)


On Thu, Jan 2, 2014 at 10:28 AM, Enrico Colombini <[hidden email]> wrote:
On 02/01/2014 9.57, Thijs Schreijer wrote:
local first  = [[ some 100mb stuff here ]]
> local second = first
> local third  = first .. "x"

You have two strings here: 'first' and 'second' are the same string, 'third' is a different string.

Assignment does not duplicate strings. Moreover, because strings are interned, there is a single copy of strings with equal contents; if you produce another identical string by different means (e.g. by concatenation), after the interning operation a single copy of the string will exist (please correct me if I am wrong here).

".." is a dangerous operator when doing many concatenations or when operating on large strings.
table.concat and/or the C-level luaL_Buffer can be useful to reduce memory usage in these cases.

--
  Enrico




--
Fabien Fleutot
+---
| 33 chemin de Mange-Pommes
| 31520 Ramonville Saint-Agne -- France
| mobile: +33 6 28 06 09 97
| office: +33 5 61 00 06 49
| home: +33 5 61 75 05 67
Reply | Threaded
Open this post in threaded view
|

RE: string immutability

Thijs Schreijer
In reply to this post by Scott Morgan

> -----Original Message-----
> From: [hidden email] [mailto:[hidden email]] On
> Behalf Of Scott Morgan
> Sent: donderdag 2 januari 2014 11:30
> To: [hidden email]
> Subject: Re: string immutability
>
> On 02/01/14 08:57, Thijs Schreijer wrote:
> > I'm writing some code that might potentially handle very large
> > strings. To write this properly/efficient I have a question on string
> > immutability.
>
> You might want to look at the 'rope' patten.
>
> http://en.wikipedia.org/wiki/Rope_%28computer_science%29
> http://lua-users.org/wiki/LuaRopes
>
> Scott

Thanks for the interesting read. But in this case I only need to traverse the string sequentially, no modifications. I just wanted to know whether I could easily pass it around without using and extravagant amount of memory.

Thijs

Reply | Threaded
Open this post in threaded view
|

RE: string immutability

Thijs Schreijer
In reply to this post by Enrico Colombini
> -----Original Message-----
> From: [hidden email] [mailto:[hidden email]] On
> Behalf Of Enrico Colombini
> Sent: donderdag 2 januari 2014 10:29
> To: Lua mailing list
> Subject: Re: string immutability
>
> On 02/01/2014 9.57, Thijs Schreijer wrote:
> > local first  = [[ some 100mb stuff here ]]
>  > local second = first
>  > local third  = first .. "x"
>
> You have two strings here: 'first' and 'second' are the same string, 'third'
> is a different string.
>

That's the answer I was looking for. Thx.

> Assignment does not duplicate strings. Moreover, because strings are
> interned, there is a single copy of strings with equal contents; if you
> produce another identical string by different means (e.g. by concatenation),
> after the interning operation a single copy of the string will exist (please
> correct me if I am wrong here).

I don't know how it works in practice, but what you're suggesting means that after each string operation the resulting string would have to be compared to every other string in the current Lua state... I doubt whether that would be possible performance wise.

Considering the example again:
 local first  = [[ some 100mb stuff here ]]
 local second = first
 local third  = first .. ""   -- changed here, result is the same

My guess is it would still result in 2 strings, one being the value for `first` and `second`, the second one being the value for `third`, which would not be compared to the first because it is the result of an expression.

But maybe someone can enlighten us :)

Thijs

>
> ".." is a dangerous operator when doing many concatenations or when
> operating on large strings.
> table.concat and/or the C-level luaL_Buffer can be useful to reduce memory
> usage in these cases.
>
> --
>    Enrico

Reply | Threaded
Open this post in threaded view
|

Re: string immutability

Dirk Laurie-2
2014/1/2 Thijs Schreijer <[hidden email]>:

>
>> Assignment does not duplicate strings. Moreover, because strings are
>> interned, there is a single copy of strings with equal contents; if you
>> produce another identical string by different means (e.g. by concatenation),
>> after the interning operation a single copy of the string will exist (please
>> correct me if I am wrong here).
>
> I don't know how it works in practice, but what you're suggesting means that after each string operation the resulting string would have to be compared to every other string in the current Lua state... I doubt whether that would be possible performance wise.
>
> Considering the example again:
>  local first  = [[ some 100mb stuff here ]]
>  local second = first
>  local third  = first .. ""   -- changed here, result is the same
>
> My guess is it would still result in 2 strings, one being the value for `first` and `second`, the second one being the value for `third`, which would not be compared to the first because it is the result of an expression.
>
> But maybe someone can enlighten us :)

It works just like table keys. A hash code is computed from the
string. Only if that is the same as a string already in the table (in
which case the strings are likely to be equal) are the actual strings
compared. O(#str) average speed.

Reply | Threaded
Open this post in threaded view
|

Re: string immutability

Roberto Ierusalimschy
> It works just like table keys. A hash code is computed from the
> string. Only if that is the same as a string already in the table (in
> which case the strings are likely to be equal) are the actual strings
> compared. O(#str) average speed.

It is worth reminding that this behavior changed in Lua 5.2. Now, only
"short" strings are interned. (By default, a "short string" is up to
40 bytes.)

-- Roberto

Reply | Threaded
Open this post in threaded view
|

Re: string immutability

Javier Guerra Giraldez
In reply to this post by Scott Morgan
On Thu, Jan 2, 2014 at 5:29 AM, Scott Morgan <[hidden email]> wrote:
> You might want to look at the 'rope' patten.
>
> http://en.wikipedia.org/wiki/Rope_%28computer_science%29
> http://lua-users.org/wiki/LuaRopes


definitely ropes are the answer for efficiently working with big text
contents, and LuaRopes implement most of the pattern as described on
the paper (a tree with strings at the leaves)

but in many cases, i've found that a simple array of strings works
beautifully as a "poor man's rope".  just pass around the array, and
at the end either output in a loop or call table.append() to create
the string just once.

if doing middle insertions or complex replacements, then go for full
ropes; but if most of the processing is just concatenating at the end,
the clarity of a string array is a big plus.

--
Javier

Reply | Threaded
Open this post in threaded view
|

Re: string immutability

steve donovan
On Thu, Jan 2, 2014 at 3:38 PM, Javier Guerra Giraldez
<[hidden email]> wrote:
> if doing middle insertions or complex replacements, then go for full
> ropes; but if most of the processing is just concatenating at the end,
> the clarity of a string array is a big plus.

I suspect that even insertions into a string array are efficient
enough for most purposes.

Recently Stroustrup was pointing out that inserting into a resizeable
array (std::vector) is just as good as a linked-list (std::list) for
40,000 elements, due to locality penalties for the list.  I suspect
that my education is getting increasingly out of date ;)

Reply | Threaded
Open this post in threaded view
|

Re: string immutability

Enrico Colombini
In reply to this post by Roberto Ierusalimschy
On 02/01/2014 14.23, Roberto Ierusalimschy wrote:
> It is worth reminding that this behavior changed in Lua 5.2. Now, only
> "short" strings are interned. (By default, a "short string" is up to
> 40 bytes.)

Thanks for the reminder, I forgot that.

--
   Enrico

Reply | Threaded
Open this post in threaded view
|

Re: string immutability

Enrico Colombini
In reply to this post by Thijs Schreijer
Thijs, here is an example:

---------------------------------
function GCAndShow(prompt)
     -- (I don't remember how to force a full collection,
     -- this hack is good enough for this example)
     collectgarbage()
     collectgarbage()
     collectgarbage()
     local kb = collectgarbage('count')
     io.write(string.format('%s: %.1f KB\n', prompt, kb))
end

GCAndShow('initial state')

s1 = string.rep('x', 999999)
GCAndShow('allocated a large string')

s2 = s1
GCAndShow('assigned the same string')

s3 = string.rep('y', 999999)
GCAndShow('allocated another, different, large string')

s4 = string.rep('x', 999998) .. 'x'
assert(s4 == s1)
GCAndShow('constructed a large string identical to an existing string')

s1 = nil; s2 = nil; s3=nil; s4 = nil;
GCAndShow('deleted all strings')

---------------------------------

Lua 5.1 prints:

initial state: 20.6 KB
allocated a large string: 1012.4 KB
assigned the same string: 997.4 KB
allocated another, different, large string: 1989.0 KB
constructed a large string identical to an existing string: 1989.0 KB
deleted all strings: 20.8 KB

---------------------------------

Lua 5.2 prints:

initial state: 15.0 KB
allocated a large string: 991.6 KB
assigned the same string: 991.6 KB
allocated another, different, large string: 1968.2 KB
constructed a large string identical to an existing string: 1968.2 KB
deleted all strings: 15.0 KB

---------------------------------

As you can see, in both cases a string built by concatenation does not
take up extra memory if it is identical to an existing one. Change the
final 'x' of the s4 expression to an 'y' (also comment out the assert)
and you will see the difference.

--
   Enrico

Reply | Threaded
Open this post in threaded view
|

RE: string immutability

Thijs Schreijer


> -----Original Message-----
> From: [hidden email] [mailto:[hidden email]] On
> Behalf Of Enrico Colombini
> Sent: donderdag 2 januari 2014 15:41
> To: Lua mailing list
> Subject: Re: string immutability
>
> Thijs, here is an example:
>
> ---------------------------------
> function GCAndShow(prompt)
>      -- (I don't remember how to force a full collection,
>      -- this hack is good enough for this example)
>      collectgarbage()
>      collectgarbage()
>      collectgarbage()
>      local kb = collectgarbage('count')
>      io.write(string.format('%s: %.1f KB\n', prompt, kb)) end
>
> GCAndShow('initial state')
>
> s1 = string.rep('x', 999999)
> GCAndShow('allocated a large string')
>
> s2 = s1
> GCAndShow('assigned the same string')
>
> s3 = string.rep('y', 999999)
> GCAndShow('allocated another, different, large string')
>
> s4 = string.rep('x', 999998) .. 'x'
> assert(s4 == s1)
> GCAndShow('constructed a large string identical to an existing string')
>
> s1 = nil; s2 = nil; s3=nil; s4 = nil;
> GCAndShow('deleted all strings')
>
> ---------------------------------
>
> Lua 5.1 prints:
>
> initial state: 20.6 KB
> allocated a large string: 1012.4 KB
> assigned the same string: 997.4 KB
> allocated another, different, large string: 1989.0 KB constructed a large
> string identical to an existing string: 1989.0 KB deleted all strings: 20.8
> KB
>
> ---------------------------------
>
> Lua 5.2 prints:
>
> initial state: 15.0 KB
> allocated a large string: 991.6 KB
> assigned the same string: 991.6 KB
> allocated another, different, large string: 1968.2 KB constructed a large
> string identical to an existing string: 1968.2 KB deleted all strings: 15.0
> KB
>
> ---------------------------------
>
> As you can see, in both cases a string built by concatenation does not take
> up extra memory if it is identical to an existing one. Change the final 'x'
> of the s4 expression to an 'y' (also comment out the assert) and you will
> see the difference.
>
> --
>    Enrico

Thx. Very clear!

Thijs
Reply | Threaded
Open this post in threaded view
|

Re: string immutability

Scott Morgan
In reply to this post by Javier Guerra Giraldez
On 02/01/14 13:38, Javier Guerra Giraldez wrote:

> On Thu, Jan 2, 2014 at 5:29 AM, Scott Morgan <[hidden email]> wrote:
>> You might want to look at the 'rope' patten.
>>
>> http://en.wikipedia.org/wiki/Rope_%28computer_science%29
>> http://lua-users.org/wiki/LuaRopes
>
>
> definitely ropes are the answer for efficiently working with big text
> contents, and LuaRopes implement most of the pattern as described on
> the paper (a tree with strings at the leaves)
>
> but in many cases, i've found that a simple array of strings works
> beautifully as a "poor man's rope".  just pass around the array, and
> at the end either output in a loop or call table.append() to create
> the string just once.


Definitely, the binary tree stuff is overkill when Lua presents such a
simple solution with plain tables.concat

Scott


Reply | Threaded
Open this post in threaded view
|

Re: string immutability

Jerome Vuarand
In reply to this post by Enrico Colombini
2014/1/2 Enrico Colombini <[hidden email]>:

> ---------------------------------
>
> Lua 5.2 prints:
>
> initial state: 15.0 KB
> allocated a large string: 991.6 KB
> assigned the same string: 991.6 KB
> allocated another, different, large string: 1968.2 KB
> constructed a large string identical to an existing string: 1968.2 KB
> deleted all strings: 15.0 KB
>
> ---------------------------------
>
> As you can see, in both cases a string built by concatenation does not take
> up extra memory if it is identical to an existing one. Change the final 'x'
> of the s4 expression to an 'y' (also comment out the assert) and you will
> see the difference.

Are you sure you ran the test using Lua 5.2? I did and here are my results:

initial state: 24.2 KB
allocated a large string: 1000.7 KB
assigned the same string: 1000.7 KB
allocated another, different, large string: 1977.3 KB
constructed a large string identical to an existing string: 2953.9 KB
deleted all strings: 24.1 KB

As expected by the change in 5.2, the re-creation of a string
identical to an existing one increase memory usage by one megabyte,
contrary to your results.

FWIW I'm using "Lua 5.2.3  Copyright (C) 1994-2013 Lua.org, PUC-Rio",
slightly patched, on Windows x64.

Reply | Threaded
Open this post in threaded view
|

Re: string immutability

Enrico Colombini
On 02/01/2014 17.37, Jerome Vuarand wrote:

> Are you sure you ran the test using Lua 5.2? I did and here are my results:
>
> initial state: 24.2 KB
> allocated a large string: 1000.7 KB
> assigned the same string: 1000.7 KB
> allocated another, different, large string: 1977.3 KB
> constructed a large string identical to an existing string: 2953.9 KB
> deleted all strings: 24.1 KB
>
> As expected by the change in 5.2, the re-creation of a string
> identical to an existing one increase memory usage by one megabyte,
> contrary to your results.

Interesting. I wondered why it stayed the same, after Roberto's remark,
but I put it down to my ignorance. I tested with 5.2.0, though (invoked
through a .bat file under Windows XP):

---------------------------------------------------
C:\E\programmi\vari\lua string example>lua52 -v
Lua 5.2.0  Copyright (C) 1994-2011 Lua.org, PUC-Rio

C:\E\programmi\vari\lua string example>lua52 stringexp.lua
initial state: 15.0 KB
allocated a large string: 991.6 KB
assigned the same string: 991.6 KB
allocated another, different, large string: 1968.2 KB
constructed a large string identical to an existing string: 1968.2 KB
deleted all strings: 15.0 KB
---------------------------------------------------

I just re-downloaded 5.2.0 and 5.2.1 from LuaBinaries: the change seems
to have been introduced in 5.2.1.

If so (apologies for my short memory), maybe the page at
http://www.lua.org/versions.html is a bit too terse.

--
   Enrico

Reply | Threaded
Open this post in threaded view
|

Re: string immutability

Roberto Ierusalimschy
> I just re-downloaded 5.2.0 and 5.2.1 from LuaBinaries: the change
> seems to have been introduced in 5.2.1.

Yes, it was. Sorry for the confusion. (It was part of the "answer" to
the "hash DOS attack" hysteria...)

-- Roberto

Reply | Threaded
Open this post in threaded view
|

Re: string immutability

Steve Litt
In reply to this post by Thijs Schreijer
On Thu, 2 Jan 2014 12:54:02 +0000
Thijs Schreijer <[hidden email]> wrote:


> I don't know how it works in practice, but what you're suggesting
> means that after each string operation the resulting string would
> have to be compared to every other string in the current Lua state...
> I doubt whether that would be possible performance wise.
>
> Considering the example again:
>  local first  = [[ some 100mb stuff here ]]
>  local second = first
>  local third  = first .. ""   -- changed here, result is the same

Regardless of interning, just do this:

if string_to_append ~= ""
    newstring = oldstring .. string_to_append
else
    newstring = oldstring

The preceding does the right thing in every case. I'm almost positive
that you can reassign strings all over the place, and you have just one
copy of the data. I *know* I've used that to my advantage on tables,
and I think it works on strings.

But your best bet is simply make a little test program to find out for
sure.

SteveT

Steve Litt                *  http://www.troubleshooters.com/
Troubleshooting Training  *  Human Performance

Reply | Threaded
Open this post in threaded view
|

Re: string immutability

Tom N Harris
In reply to this post by Enrico Colombini
On Thursday, January 02, 2014 10:28:43 AM Enrico Colombini wrote:
> ".." is a dangerous operator when doing many concatenations or when
> operating on large strings.
> table.concat and/or the C-level luaL_Buffer can be useful to reduce
> memory usage in these cases.

The VM concat instruction is very efficient so a chained .. is no worse than
table.concat. Where you get in trouble is building a string piecemeal in a
loop.

Using a variation of your example...

    function GCAndShow(prompt)
      local before = collectgarbage "count"
      collectgarbage"collect"
      local after = collectgarbage "count"
      io.write(string.format('%s: %.1fKiB (%.1fKiB garbage)\n',
                             prompt, after, before-after))
    end
    GCAndShow "initial state"

    local s1 = string.rep('x', 99999)
    GCAndShow "allocated a large string"

    local s2 = s1..s1..s1..s1..s1..s1..s1..s1..s1..s1
    GCAndShow "concatenated string with `..'"

    local s3 = table.concat{s1,s1,s1,s1,s1,s1,s1,s1,s1,s1}
    assert(s2==s3)
    GCAndShow "concatenated string with table.concat"

    local s4 = ""
    for i=1,10 do s4 = s4..s1 end
    assert(s2==s4)
    GCAndShow "concatenated string with loop"

    local s5 = string.format("%s%s%s%s%s%s%s%s%s%s",
                             s1,s1,s1,s1,s1,s1,s1,s1,s1,s1)
    assert(s2==s5)
    GCAndShow "concatenated string with string.format"

    s1,s2,s3,s4,s5 = nil
    GCAndShow "deleted all strings"


With Lua 5.1.5 (and nearly the same with LuaJIT 2.0.2)
initial state: 29.1KiB (3.6KiB garbage)
allocated a large string: 151.2KiB (163.0KiB garbage)
concatenated string with `..': 1347.5KiB (732.5KiB garbage)
concatenated string with table.concat: 1347.5KiB (732.7KiB garbage)
concatenated string with loop: 1347.5KiB (5029.5KiB garbage)
concatenated string with string.format: 1348.3KiB (732.5KiB garbage)
deleted all strings: 90.9KiB (1257.5KiB garbage)

With Lua 5.2.2
initial state: 24.9KiB (3.3KiB garbage)
allocated a large string: 122.6KiB (97.8KiB garbage)
concatenated string with `..': 1099.2KiB (0.0KiB garbage)
concatenated string with table.concat: 2075.8KiB (2344.0KiB garbage)
concatenated string with loop: 3052.3KiB (2441.4KiB garbage)
concatenated string with string.format: 4029.1KiB (3028.1KiB garbage)
deleted all strings: 25.1KiB (4004.1KiB garbage)

--
tom <[hidden email]>

Reply | Threaded
Open this post in threaded view
|

Re: string immutability

Rena
In reply to this post by Roberto Ierusalimschy
On Thu, Jan 2, 2014 at 8:23 AM, Roberto Ierusalimschy <[hidden email]> wrote:
> It works just like table keys. A hash code is computed from the
> string. Only if that is the same as a string already in the table (in
> which case the strings are likely to be equal) are the actual strings
> compared. O(#str) average speed.

It is worth reminding that this behavior changed in Lua 5.2. Now, only
"short" strings are interned. (By default, a "short string" is up to
40 bytes.)

-- Roberto


I hate to bring this subject back up after it sparked so much debate in the past, but I wonder if this limit could be increased to, say, 64 or 128 bytes? In my C libraries' __index and __newindex methods I avoid doing a lot of strcmp()s to compare the key to each valid field name, by storing all valid field names in a table in the registry. String interning means that instead of using strcmp() to check the key I can just use a standard equality test. But that will fail if I have a field with a more than 40-character name (not unimaginable) and Lua doesn't intern the string, meaning the pointer I get as my key isn't the same as the string I stored in the registry.

--
Sent from my Game Boy.
12