Avoiding string duplication

Avoiding string duplication

Chris Smith
Hi all,

I’m trying to understand a bit better the handling of strings in Lua with respect to memory allocation and what is safe or unsafe when using the C API.  The context for this is that I’m dealing with some (constant) strings that are used by both C and Lua code, and I’d like to avoid unnecessary memory allocations.

If I push a string value onto the stack (e.g. lua_getglobal(L, "my_string") ) does Lua duplicate the string or just use a reference to the original string?

If I call lua_tostring (assuming the object is an actual string and not converted from a different type) does Lua return a pointer to the original string content or does it duplicate the string in a C-compatible manner and return a pointer to that?

Given the above and assuming that I have a Lua global string “my_string” that is constant for the lifetime of the lua_State, is the following code safe or not?

const char *str = luaL_checkstring(L, -1);
lua_pop(L, 1);
return str;

Regards,
Chris

Chris Smith <[hidden email]>




Re: Avoiding string duplication

Pedro Tammela-2
> I'm trying to understand a bit better the handling of strings in Lua with respect to
> memory allocation and what is safe or unsafe when using the C API.  The context for
> this is that I'm dealing with some (constant) strings that are used by both C and Lua
> code, and I'd like to avoid unnecessary memory allocations.

There's no need to worry about the Lua side: Lua interns short
strings, and short strings cover most everyday use of the C API.

See: https://github.com/lua/lua/blob/master/lstring.c#L198

> If I push a string value onto the stack (e.g. lua_getglobal(L, "my_string") ) does
> Lua duplicate the string or just use a reference to the original string?
>
> If I call lua_tostring (assuming the object is an actual string and not converted
> from a different type) does Lua return a pointer to the original string content or
> does it duplicate the string in a C-compatible manner and return a pointer to that?
>
> Given the above and assuming that I have a Lua global string "my_string" that is
> constant for the lifetime of the lua_State, is the following code safe or not?

See https://github.com/lua/lua/blob/6aeaeb5656a006ad95b35dd7482798fdc5f02f5e/lstring.c#L251

Pedro


Re: Avoiding string duplication

Sean Conner
In reply to this post by Chris Smith
It was thus said that the Great Chris Smith once stated:
> Hi all,
>
> I’m trying to understand a bit better the handling of strings in Lua with
> respect to memory allocation and what is safe or unsafe when using the C
> API.  The context for this is that I’m dealing with some (constant)
> strings that are used by both C and Lua code, and I’d like to avoid
> unnecessary memory allocations.

  Lua keeps a copy of every string used.  When a string is referenced, for
example:

        lua_getglobal(L,"my_string")

and it does not exist in the list of strings, Lua will create a copy of it.
All other references to it will refer to this copy.  

> If I push a string value onto the stack (e.g. lua_getglobal(L,
> “my_string”) ) does Lua duplicate the string or just use a reference to
> the original string?

  If the string doesn't exist in the state, it will be created; otherwise, a
reference to the string already held in the Lua state is used.

> If I call lua_tostring (assuming the object is an actual string and not
> converted from a different type) does Lua return a pointer to the original
> string content or does it duplicate the string in a C-compatible manner
> and return a pointer to that?

  It returns a pointer to a constant string.  This will point to the
internal copy of the string Lua has, but will be NUL terminated.

> Given the above and assuming that I have a Lua global string “my_string”
> that is constant for the lifetime of the lua_State, is the following code
> safe or not?
>
> const char *str = luaL_checkstring(L, -1);
> lua_pop(L, 1);
> return str;

  Technically no.  References are only valid as long as they're on the
stack, or there exists a reference elsewhere in the Lua state.  Since you
can't be sure if the string has another reference somewhere (in the general
case), popping the stack like this removes the reference to the string and
it could become subject to garbage collection.

  -spc



Re: Avoiding string duplication

Andrew Gierth
In reply to this post by Chris Smith
>>>>> "Chris" == Chris Smith <[hidden email]> writes:

 Chris> Hi all,

 Chris> I’m trying to understand a bit better the handling of strings in
 Chris> Lua with respect to memory allocation and what is safe or unsafe
 Chris> when using the C API. The context for this is that I’m dealing
 Chris> with some (constant) strings that are used by both C and Lua
 Chris> code, and I’d like to avoid unnecessary memory allocations.

 Chris> If I push a string value onto the stack (e.g. lua_getglobal(L,
 Chris> “my_string”) ) does Lua duplicate the string or just use a
 Chris> reference to the original string?

If you push a string value _that you obtained from Lua_ onto the stack,
there is no copying, the stack value is just a reference. Remember also
that strings in Lua are immutable.

Where the copying happens is if you push a string onto the stack via a
C pointer, i.e. lua_pushstring. (Lua has some shortcuts to try and
detect whether it already has a copy of the same string, but it does
take some work to check this.)

 Chris> If I call lua_tostring (assuming the object is an actual string
 Chris> and not converted from a different type) does Lua return a
 Chris> pointer to the original string content or does it duplicate the
 Chris> string in a C-compatible manner and return a pointer to that?

If you call lua_tostring on a stack slot that contains a string rather
than a number, then what you get is a pointer to Lua's own existing copy
of the string (which already has a null terminator to ensure it's valid
to C). No copying is done.

 Chris> Given the above and assuming that I have a Lua global string
 Chris> “my_string” that is constant for the lifetime of the lua_State,
 Chris> is the following code safe or not?

 Chris> const char *str = luaL_checkstring(L, -1);
 Chris> lua_pop(L, 1);
 Chris> return str;

This is safe as long as you know that there is some reference to the
string remaining in addition to the one you popped off the stack.

If you don't know that there's another reference, then you should assume
that the char* becomes invalid as soon as you pop the stack.

BTW the global table is probably not the best place for this sort of
thing (too easy to modify, or run into collisions with other code);
maybe the registry (or a table hung off the registry) would be a better
place.

--
Andrew.


Re: Avoiding string duplication

Chris Smith

> On 5 Nov 2019, at 16:52, Andrew Gierth <[hidden email]> wrote:
>
> BTW the global table is probably not the best place for this sort of
> thing (too easy to modify, or run into collisions with other code);
> maybe the registry (or a table hung off the registry) would be a better
> place.

Yes, my intention is to store the strings that the C code cares about in the registry, but I wanted to confirm that this was a legitimate approach first.

Chris

Chris Smith <[hidden email]>


Re: Avoiding string duplication

Chris Smith
In reply to this post by Sean Conner

> On 5 Nov 2019, at 16:52, Sean Conner <[hidden email]> wrote:
>
> Technically no.  References are only valid as long as they're on the
> stack, or there exists a reference elsewhere in the Lua state.  Since you
> can't be sure if the string has another reference somewhere (in the general
> case), popping the stack like this removes the reference to the string and
> it could become subject to garbage collection.

Thanks, that’s what I needed to know.  I can guarantee that references will be preserved, so this isn’t an issue.

Chris

Chris Smith <[hidden email]>



Re: Avoiding string duplication

Viacheslav Usov
On Tue, Nov 5, 2019 at 6:04 PM Chris Smith <[hidden email]> wrote:

> Thanks, that’s what I needed to know.  I can guarantee that references will be preserved, so this isn’t an issue.

If I understand the intent correctly, you would like to use a string returned by Lua after it was popped off the stack, on the account that it had previously been registered.

I would recommend against this. This technique is brittle.

Note that the documentation warns explicitly: "Because Lua has garbage collection, there is no guarantee that the pointer returned by lua_tolstring will be valid after the corresponding Lua value is removed from the stack".

I doubt you will gain much by popping prematurely, but I can guarantee that any eventual bugs will be a royal pain.

Cheers,
V.

Re: Avoiding string duplication

Chris Smith

> On 5 Nov 2019, at 18:20, Viacheslav Usov <[hidden email]> wrote:
>
> On Tue, Nov 5, 2019 at 6:04 PM Chris Smith <[hidden email]> wrote:
>
> > Thanks, that’s what I needed to know.  I can guarantee that references will be preserved, so this isn’t an issue.
>
> If I understand the intent correctly, you would like to use a string returned by Lua after it was popped off the stack, on the account that it had previously been registered.

No, on the account that the string is still registered.  A Lua string will be created at the start of the Lua state and that string will remain referenced within Lua and thus never garbage collected.

> I would recommend against this. This technique is brittle.

Other responses are suggesting that this is okay.

> Note that the documentation warns explicitly: "Because Lua has garbage collection, there is no guarantee that the pointer returned by lua_tolstring will be valid after the corresponding Lua value is removed from the stack".

Yes, but that is not a guarantee that it will be invalid either.  The point of my question was to determine what criteria would allow a guarantee to be made, either way.

> I doubt you will gain much by popping prematurely, but I can guarantee that any eventual bugs will be a royal pain.

What do you mean popping prematurely?  A C function has to ensure the stack is balanced before returning, right?  And that means popping before returning whatever was pushed during the function call.  There is no way to directly grab a string value from Lua memory; you have to first push the string onto the stack then call lua_tostring() and as a consequence you then have to pop it back off before returning.

Chris

Chris Smith <[hidden email]>


Re: Avoiding string duplication

Viacheslav Usov
On Wed, Nov 6, 2019 at 11:44 AM Chris Smith <[hidden email]> wrote:

> No, on the account that the string is still registered.  A Lua string will be created at the start of the Lua state and that string will remain referenced within Lua and thus never garbage collected.

This relies on implementation details that are not explicitly documented as design guarantees. To the best of my knowledge, nowhere does the documentation indicate that it is OK to use pointers to interned strings when a corresponding string value is not on the API stack. Secondly, the registry entry might be intentionally or mistakenly overwritten.

This is why I deem this brittle.

> What do you mean popping prematurely?

Popping a string value off the API stack before you are done using the corresponding string pointer.

> A C function has to ensure the stack is balanced before returning, right?

No. It is "considered good programming practice", quoting the manual, and said only in the context of a caller, not a callee.

> And that means popping before returning whatever was pushed during the function call.

No. "Whenever Lua calls C, the called function gets a new stack, which is independent of previous stacks and of stacks of C functions that are still active." So a callee can leave the stack in whatever state. All it needs to do is indicate, via its return value, how many values at the top of the stack are its return values. "Any other value in the stack below the results will be properly discarded by Lua." Quoted sentences courtesy of the fine manual.

> There is no way to directly grab a string value from Lua memory; you have to first push the string onto the stack then call lua_tostring() and as a consequence you then have to pop it back off before returning.

It would not hurt to pop it before returning, but, again, this is not required.

Cheers,
V.

Re: Avoiding string duplication

Chris Smith

> On 6 Nov 2019, at 11:13, Viacheslav Usov <[hidden email]> wrote:
>
> Secondly, the registry entry might be intentionally or mistakenly overwritten.
>
> This is why I deem this brittle.

In this case I’m in control of both C and Lua code, so I am happy to assert that will never happen.

> > A C function has to ensure the stack is balanced before returning, right?
>
> No. It is "considered good programming practice", quoting the manual, and said only in the context of a caller, not a callee.

But I am the caller.  The context here is that I’m using Lua for configuration.  I load the configuration as a Lua file into the state and pcall() it, then the C code retrieves the values it needs from the Lua state.  The configuration might have hundreds of string values and it would be a shame if I had to resort to strdup(lua_tostring()), duplicating and handling my own memory for such a trivial task.  Leaving the strings on the stack would easily fill the stack, and I’d need more boilerplate to check that I hadn’t already left that particular string on the stack.

> No. "Whenever Lua calls C, the called function gets a new stack, which is independent of previous stacks and of stacks of C functions that are still active." So a callee can leave the stack in whatever state. All it needs to do is indicate, via its return value, how many values at the top of the stack are its return values. "Any other value in the stack below the results will be properly discarded by Lua." Quoted sentences courtesy of the fine manual.

Yes, agreed, but again I’m talking in the context of C using Lua, not Lua using C.

Chris

Chris Smith <[hidden email]>



Re: Avoiding string duplication

Viacheslav Usov
On Wed, Nov 6, 2019 at 12:46 PM Chris Smith <[hidden email]> wrote:

> In this case I’m in control of both C and Lua code, so I am happy to assert that will never happen.

Sure, but you still have a dependency on undocumented implementation details.

> I load the configuration as a Lua file into the state and pcall() it, then the C code retrieves the values it needs from the Lua state.  The configuration might have hundreds of string values and it would be a shame if I had to resort to strdup(lua_tostring()), duplicating and handling my own memory for such a trivial task.  Leaving the strings on the stack would easily fill the stack, and I’d need more boiler plate to check that I hadn’t already left that particular string on the stack.

lua_checkstack: "typically at least several thousand elements"

I think you can just leave all those values on the API stack. That will work as long as you never have to reload your config. If you do, you will have to consider all the previously obtained string pointers invalid, nuke and reload the Lua state, and refetch the config strings. Note that keeping those string values referenced in the registry vs keeping those values on the API stack is probably going to need comparable amounts of memory. You can save on memory only if you avoid both the registry and the stack, which again would mean relying solely on implementation details. The question is, how much that would really save in the grand scheme of things, given that you are talking about hundreds of strings.

Cheers,
V.

Re: Avoiding string duplication

Sean Conner
In reply to this post by Viacheslav Usov
It was thus said that the Great Viacheslav Usov once stated:

> On Wed, Nov 6, 2019 at 11:44 AM Chris Smith <[hidden email]> wrote:
>
> > No, on the account that the string is still registered.  A Lua string
> > will be created at the start of the Lua state and that string will remain
> > referenced within Lua and thus never garbage collected.
>
> This relies on implementation details that are not explicitly documented as
> design guarantees. To the best of my knowledge, nowhere does the
> documentation indicate that it is OK to use pointers to interned strings
> when a corresponding string value is not on the API stack. Secondly, the
> registry entry might be intentionally or mistakenly overwritten.

  I use the registry to save strings, and so far, for Lua 5.1, 5.2 and 5.3,
it has worked fine for me, even in long-running processes (that is, months
of uptime).
If said registry entry is mistakenly overwritten, then there's a bug.  If
it's intentionally overwritten, there are larger problems.

  I have a string.  There's a reference in the registry.  I don't see the
issue.

  -spc


Re: Avoiding string duplication

Viacheslav Usov
On Wed, Nov 6, 2019 at 5:46 PM Sean Conner <[hidden email]> wrote:

> I use the registry to save strings, and so far, for Lua 5.1, 5.2 and 5.3, it has worked fine for me, even in long running (that is, months of time).

> I have a string.  There's a reference in the registry.  I don't see the issue.

I agree that registering string values works as advertised at least in 5.3.

However, my point was about what is _not_ advertised.

Cheers,
V.

Re: Avoiding string duplication

Roberto Ierusalimschy
According to the manual, Viacheslav is right. The "law" says that you
should not use a string pointer after you popped it from the stack.

The rationale for this is that a hypothetical garbage collector
could move strings around while collecting garbage [1]. Lua never
used a moving (or copying) collector, so everything always worked
"as expected". Theoretically, that could change in the future.

There are some problems for a copying collector in Lua:

- Strings in a stack cannot be moved. (More specifically, strings
in stack regions corresponding to C calls cannot be moved.)

- Userdata already assume fixed pointers to their blocks of memory.

- A copying collector works on top of raw memory, while Lua does
GC on top of an allocation function.

For all these reasons, it seems very unlikely that Lua will ever
change to a copying collector. Maybe it is time to change the manual
and make official this behavior?


[1] https://en.wikipedia.org/wiki/Tracing_garbage_collection#Moving_vs._non-moving

-- Roberto


Re: Avoiding string duplication

Rodrigo Azevedo
On Wed, 6 Nov 2019 at 16:56, Roberto Ierusalimschy <[hidden email]> wrote:
> According to the manual, Viacheslav is right. The "law" says that you
> should not use a string pointer after you popped it from the stack.
>
> The rationale for this is that a hypothetical garbage collector
> could move strings around while collecting garbage [1]. Lua never
> used a moving (or copying) collector, so everything always worked
> "as expected". Theoretically, that could change in the future.
>
> There are some problems for a copying collector in Lua:
>
> - Strings in a stack cannot be moved. (More specifically, strings
> in stack regions corresponding to C calls cannot be moved.)
>
> - Userdata already assumes fixed pointers to their blocks of memory.
>
> - A copying collector works on top of raw memory, while Lua does
> GC on top of an allocation function.
>
> For all these reasons, it seems very unlikely that Lua will ever
> change to a copying collector. Maybe it is time to change the manual
> and make official this behavior?

+1

In fact, all 'implementation-dependent' and 'undefined' behaviors could be rethought, aiming for a 'well-defined' language.


Re: Avoiding string duplication

Chris Smith
In reply to this post by Roberto Ierusalimschy

> On 6 Nov 2019, at 19:05, Roberto Ierusalimschy <[hidden email]> wrote:
>
> There are some problems for a copying collector in Lua:
>
> - Strings in a stack cannot be moved. (More specifically, strings
> in stack regions corresponding to C calls cannot be moved.)
>
> - Userdata already assume fixed pointers to their blocks of memory.
>
> - A copying collector works on top of raw memory, while Lua does
> GC on top of an allocation function.
>
> For all these reasons, it seems very unlikely that Lua will ever
> change to a copying collector. Maybe it is time to change the manual
> and make official this behavior?


I would certainly like to see this made official.  However, it does raise some other interesting points and questions:

- Strings are immutable in Lua, but not in C.  C code can directly modify a Lua string and those changes will be reflected in all identical Lua String objects.  This is now defined behaviour due to the above, and is consistent with table behaviour and Userdata.

- Is there any real difference between Userdata and String objects?  Are Strings simply Userdata whose content is cached and on which string functions can be called?

- Do we need both types?  Would it be useful to be able to convert between Userdata and String objects?  That is, to explicitly uncache/duplicate a String object or to mark Userdata as containing String data?

I’m not really sure that anything I’ve said amounts to much more than navel gazing, but I think there are some interesting possibilities.

Regards,
Chris

Chris Smith <[hidden email]>


Re: Avoiding string duplication

Sean Conner
It was thus said that the Great Chris Smith once stated:

>
> On 6 Nov 2019, at 19:05, Roberto Ierusalimschy <[hidden email]> wrote:
> >
> > For all these reasons, it seems very unlikely that Lua will ever
> > change to a copying collector. Maybe it is time to change the manual
> > and make official this behavior?
>
> I would certainly like to see this made official.  However, it does raise
> some other interesting points and questions:
>
> - Strings are immutable in Lua, but not in C.  C code can directly modify
> a Lua string

  Only by casting away the const modifier.  Also, you can't change the size
of the string (or bad things might happen).  But sure, yeah, C can do that.

> - Is there any real difference between Userdata and String objects?  Are
> Strings simply Userdata whose content is cached and on which string
> functions can be called?
>
> - Do we need both types?  Would it be useful to be able to convert between
> Userdata and String objects?  That is, to explicitly uncache/duplicate a
> String object or to mark Userdata as containing String data?

  There was a long thread about this back in 2014:

        http://lua-users.org/lists/lua-l/2014-08/msg00657.html

  -spc


Re: Avoiding string duplication

Viacheslav Usov
In reply to this post by Chris Smith
On Fri, Nov 8, 2019 at 11:02 AM Chris Smith <[hidden email]> wrote:

> - Strings are immutable in Lua, but not in C.  C code can directly modify a Lua string and those changes will be reflected in all identical Lua String objects.  This is now defined behaviour

Not according to the ISO C standard.

> - Is there any real difference between Userdata and String objects?  Are Strings simply Userdata whose content is cached and on which string functions can be called?

While one could say that ultimately both just hold a bunch of bytes, the Lua string type is specialized and optimized for the semantics prescribed by Lua. User data have much looser semantics and behaviors. A notable difference is that every single user datum can have a metatable.

Cheers,
V.

Re: Avoiding string duplication

Chris Smith
In reply to this post by Sean Conner

> On 8 Nov 2019, at 10:55, Sean Conner <[hidden email]> wrote:
>
> It was thus said that the Great Chris Smith once stated:
>>
>> On 6 Nov 2019, at 19:05, Roberto Ierusalimschy <[hidden email]> wrote:
>>>
>>> For all these reasons, it seems very unlikely that Lua will ever
>>> change to a copying collector. Maybe it is time to change the manual
>>> and make official this behavior?
>>
>> I would certainly like to see this made official.  However, it does raise
>> some other interesting points and questions:
>>
>> - Strings are immutable in Lua, but not in C.  C code can directly modify
>> a Lua string
>
>  Only by casting away the const modifier.  Also, you can't change the size
> of the string (or bad things might happen).  But sure, yeah, C can do that.

I was going to continue my argument here, but having played around a bit I’ve changed my opinion.  This has too many unintended consequences to be safe.

>> - Is there any real difference between Userdata and String objects?  Are
>> Strings simply Userdata whose content is cached and on which string
>> functions can be called?
>>
>> - Do we need both types?  Would it be useful to be able to convert between
>> Userdata and String objects?  That is, to explicitly uncache/duplicate a
>> String object or to mark Userdata as containing String data?
>
>  There was a long thread about this back in 2014:
>
> http://lua-users.org/lists/lua-l/2014-08/msg00657.html


Good grief.  I don’t think anyone wants a repeat of that, so I’ll leave this well alone save to echo the sentiment that it would be nice to be able to tell Lua that ‘this Userdata is a string buffer, so please let me use string functions on it’.

Regards,
Chris

Chris Smith <[hidden email]>