Problem with concurrency, threads, Lua states and maybe GC?

Problem with concurrency, threads, Lua states and maybe GC?

Matt 'Matic' (Lua)
I've been using Lua for quite some time - and it's a hugely impressive language especially in embedded firmware :-)

I have a problem that appears to be some kind of garbage collection (?) issue. The situation is as follows:
  • There is one "Lua Universe" from which multiple running Lua states are produced by calling lua_newthread.
  • I have applied LuaUsers extensions, according to http://lua-users.org/wiki/ThreadsTutorial
    • The LuaLock and LuaUnlock functions make use of a global mutex so that only one OS thread can be running Lua at any one time
  • There are about 12 different Lua states, all running in their own OS thread.
  • Each thread running C++ invokes Lua through lua_pcall
  • There is quite a large amount of C/C++ extensions added into Lua that are called from within each Lua context
    • This is probably just a side detail and not relevant.
  • Every now and then the asserts fail in lvm.c's luaV_execute
    • Typically, base no longer is equal to L->base (but L->ci->base is the same as L->base)
    • It does not appear to be down to buffer overruns (I wrapped base in a protected set of variables to look for changes). If anything, it is L->base that is changing (not base itself).
    • Sometimes, base points to stale memory (I have full MISRA C runtime checking enabled in my environment) - as if it was valid, but is not anymore.
    • Eventually Lua goes stale -
      • Sometimes an OS thread crashes (though I can't prove it's the same issue at the moment)
      • Sometimes a while..forever inside a Lua will go AWOL - and variables will be pointing to the wrong place (as if the stack has changed unexpectedly)
After a couple of days of furious debugging, I find the following:
  • I have put numerous checks within lvm.c's luaV_execute to narrow down the point where base is modified
  • It appears that the modification occurs either side of dojump's luai_threadyield
    • I modified the dojump macro to check the validity of base and throw an exception if wrong
    • What I can see is that before the threadyield base was fine, and after it has changed.
  • As far as luaV_execute is concerned, base must not change (or when L->base is expected to change, then the call is wrapped in Protect)
    • I have occasions where L->base has changed in an OP_TEST opcode (where L->base was definitely not expected to change!)
    • The value of L->base has changed in other opcodes too (not just OP_TEST)
I'm trying to pin down where this "corruption" is occurring, and wondered if it might be down to garbage collection or some other 'shuffling' process. It would appear that other parts of Lua may be considering the lua_State to be 'unlocked' and therefore safe to modify - when in fact it is not (because it is running within the context of luaV_execute).

I am hoping to craft some debug code that can pin down where the change is occurring, but I'd be grateful for any pointers or suggestions otherwise!!!

Thanks :-)

Matt "Matic"

Re: Problem with concurrency, threads, Lua states and maybe GC?

Javier Guerra Giraldez
On Tue, Dec 1, 2009 at 10:48 AM, Matt 'Matic' (Lua) <[hidden email]> wrote:
> There is quite a large amount of C/C++ extensions added into Lua that are
> called from within each Lua context
>
> This is probably just a side detail and not relevant.

Do any of these extensions touch a lua_State other than the one that
called it? Remember that the Lua core calls LuaUnlock just before
executing a C extension, so some other state will be running
concurrently with your C code; if you modify it, it will be corrupted.

--
Javier

Re: Problem with concurrency, threads, Lua states and maybe GC?

Jerome Vuarand
2009/12/1 Javier Guerra <[hidden email]>:

> On Tue, Dec 1, 2009 at 10:48 AM, Matt 'Matic' (Lua) <[hidden email]> wrote:
>> There is quite a large amount of C/C++ extensions added into Lua that are
>> called from within each Lua context
>>
>> This is probably just a side detail and not relevant.
>
> does any of these extensions touch a Lua_State other than the one that
> called it?  remember that the Lua core calls LuaUnlock just before
> executing a C extension, so some other state will be running
> concurrently with your C code; if you modify it, it will be corrupted.

I had similar issues. If several threads use a given coroutine, Lua C
API calls from different C extensions may be intermixed, and while the
interpreter state itself is safe, the content of the stack may become
wrong. Many luaL_ functions become unsafe in that regard since they
may unlock the state in the middle of processing.

To solve the problem I decided to expose lua_lock and lua_unlock from
the Lua API (I have a patch available somewhere), so that I can call
it around all C code supposed to access shared Lua states. This
implies that lua_lock implementation works recursively, but that's
quite easy to do with most threading APIs.

Another solution not involving patching Lua is to make sure that no
two native threads use the same Lua thread (coroutine). This is why
most threading libraries for Lua create at least one Lua coroutine per
native thread created (eg. LuaThread, LuaProc).

Re: Re: Problem with concurrency, threads, Lua states and maybe GC?

Matt 'Matic' (Lua)
Jerome Vuarand wrote:
> 2009/12/1 Javier Guerra [hidden email]:
>> On Tue, Dec 1, 2009 at 10:48 AM, Matt 'Matic' (Lua) [hidden email] wrote:
>>> There is quite a large amount of C/C++ extensions added into Lua that are
>>> called from within each Lua context
>>>
>>> This is probably just a side detail and not relevant.
>>
>> does any of these extensions touch a Lua_State other than the one that
>> called it?  remember that the Lua core calls LuaUnlock just before
>> executing a C extension, so some other state will be running
>> concurrently with your C code; if you modify it, it will be corrupted.
>
> I had similar issues. If several threads use a given coroutine, Lua C
> API calls from different C extensions may be intermixed, and while the
> interpreter state itself is safe, the content of the stack may become
> wrong. Many luaL_ functions become unsafe in that regard since they
> may unlock the state in the middle of processing.
>
> To solve the problem I decided to expose lua_lock and lua_unlock from
> the Lua API (I have a patch available somewhere), so that I can call
> it around all C code supposed to access shared Lua states. This
> implies that lua_lock implementation works recursively, but that's
> quite easy to do with most threading APIs.
>
> Another solution not involving patching Lua is to make sure that no
> two native threads use the same Lua thread (coroutine). This is why
> most threading libraries for Lua create at least one Lua coroutine per
> native thread created (eg. LuaThread, LuaProc).
  
As you both have said, it would on the surface appear as though I had two threads accessing one lua_State, or some other C/C++ corruption of Lua. However, I've put a huge amount of code to prove that wasn't the case (just doubting myself!!). Everything appears correct and coherent.

My view is that the Lua VM is making assumptions about the L->base value. Generally, L->base doesn't change, so the VM holds a local copy to avoid indirection and speed up the LVM. In some opcodes, the VM knows that L->base could or is definitely going to change, so the call is wrapped up in the "Protect" macro, which reassigns the local copy of base after completion.

However, it would appear that there are 6 (IIRC) op codes that call dojump outside of the context of "Protect" and assume that L->base is not going to change. If you have one Lua "Universe", one lua_State per OS thread then that will be ok.
Once you add OS-level threading and have a single Lua "Universe" - even if you use lua_newthread and strictly keep each lua_State in an OS thread - that assumption is no longer valid.

Consequently, I have changed the "dojump" macro in lvm.c to now be:

    #define dojump(L,pc,i) { (pc) += (i); luai_threadyield(L); base = L->base; }

Now, I know that some "dojump" calls are also wrapped inside "Protect" and therefore the "base = L->base" is going to be duplicated in those cases. Of course, my optimising compiler removes the redundancy.

Guess what - the problem has gone away and Lua is not failing its assertions anymore (and my Lua code isn't running off the rails)!!
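
To make the failure mode concrete, here is a toy model of it in plain C. It is a sketch under stated assumptions, not code from lvm.c: `ToyState`, `toy_init`, `toy_move_stack` and `toy_dojump` are hypothetical names, and the "yield" is simulated by moving the stack to a freshly allocated block, which is what a resize performed by another thread would look like to the interpreter loop.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for an interpreter state: a heap stack and a base into it. */
typedef struct ToyState {
    int *stack;        /* heap-allocated stack */
    size_t size;       /* number of slots */
    size_t base_off;   /* offset of the current frame within the stack */
    int *base;         /* analogue of L->base: stack + base_off */
} ToyState;

int toy_init(ToyState *L, size_t size, size_t base_off) {
    assert(size > 0 && base_off < size);
    L->stack = calloc(size, sizeof *L->stack);
    if (L->stack == NULL) return -1;
    L->size = size;
    L->base_off = base_off;
    L->base = L->stack + base_off;
    return 0;
}

/* Simulates what may happen during a yield: the stack moves to a new block.
   The new block is allocated before the old one is freed, so the address
   is guaranteed to change. */
int toy_move_stack(ToyState *L) {
    int *fresh = malloc(L->size * sizeof *fresh);
    if (fresh == NULL) return -1;
    memcpy(fresh, L->stack, L->size * sizeof *fresh);
    free(L->stack);
    L->stack = fresh;
    L->base = fresh + L->base_off;
    return 0;
}

/* Models one jump: cache base, yield, then optionally reload the cache.
   Returns 1 if the cached copy agrees with L->base afterwards, 0 if it is
   stale, -1 on allocation failure. Addresses are compared as integers so
   the stale pointer is never dereferenced. */
int toy_dojump(ToyState *L, int reload_after_yield) {
    uintptr_t base = (uintptr_t)L->base;   /* local copy, like `base` */
    if (toy_move_stack(L) != 0) return -1;
    if (reload_after_yield) base = (uintptr_t)L->base;
    return base == (uintptr_t)L->base;
}
```

Calling `toy_dojump(&L, 0)` corresponds to the original macro (the cached copy goes stale), while `toy_dojump(&L, 1)` corresponds to the patched macro with `base = L->base` after the yield.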


Javier - I reckon you adding the extra lock/unlock inside your C routines is probably greatly minimising the issue because you are reducing its probability, but you may well find it's the same one that I have and can be resolved completely with the dojump patch.


Any thoughts or comments??


Matt

Re: Re: Problem with concurrency, threads, Lua states and maybe GC?

Jerome Vuarand
2009/12/2 Matt 'Matic' (Lua) <[hidden email]>:

> Jerome Vuarand wrote:
>
> 2009/12/1 Javier Guerra <[hidden email]>:
>
>
> On Tue, Dec 1, 2009 at 10:48 AM, Matt 'Matic' (Lua) <[hidden email]>
> wrote:
>
>
> There is quite a large amount of C/C++ extensions added into Lua that are
> called from within each Lua context
>
> This is probably just a side detail and not relevant.
>
>
> does any of these extensions touch a Lua_State other than the one that
> called it?  remember that the Lua core calls LuaUnlock just before
> executing a C extension, so some other state will be running
> concurrently with your C code; if you modify it, it will be corrupted.
>
>
> I had similar issues. If several threads use a given coroutine, Lua C
> API calls from different C extensions may be intermixed, and while the
> interpreter state itself is safe, the content of the stack may become
> wrong. Many luaL_ functions become unsafe in that regard since they
> may unlock the state in the middle of processing.
>
> To solve the problem I decided to expose lua_lock and lua_unlock from
> the Lua API (I have a patch available somewhere), so that I can call
> it around all C code supposed to access shared Lua states. This
> implies that lua_lock implementation works recursively, but that's
> quite easy to do with most threading APIs.
>
> Another solution not involving patching Lua is to make sure that no
> two native threads use the same Lua thread (coroutine). This is why
> most threading libraries for Lua create at least one Lua coroutine per
> native thread created (eg. LuaThread, LuaProc).
>
>
>
> As you both have said, it would on the surface appear as though I had two
> threads accessing one lua_State, or some other C/C++ corruption of Lua.
> However, I've put a huge amount of code to prove that wasn't the case (just
> doubting myself!!). Everything appears correct and coherent.
>
> My view is that the Lua VM is making assumptions about the L->base value.
> Generally, L->base doesn't change, so it holds a local copy to avoid
> indirection and speed up the LVM. In some opcodes, the VM knows that L->base
> could or is definitely going to change, so the call is wrapped up in the
> "Protect" macro which reassigned the local copy of base after completion.
>
> However, it would appear that there are 6 (IIRC) op codes that call dojump
> outside of the context of "Protect" and assume that L->base is not going to
> change. If you have one Lua "Universe", one lua_State per OS thread then
> that will be ok.
> Once you add OS-level threading and have a single Lua "Universe" - even if
> you use lua_newthread and strictly keep each lua_State in an OS thread -
> that assumption is no longer valid.
>
> Consequently, I have changed the "dojump" macro in lvm.c to now be:
>
>     #define dojump(L,pc,i) { (pc) += (i); luai_threadyield(L); base = L->base; }
>
> Now, I know that some "dojump" calls are also wrapped inside "Protect" and
> therefore the "base = L->base" is going to be duplicated in those cases. Of
> course, my optimising compiler removes the redundancy.
>
> Guess what - the problem has gone away and Lua is not failing its assertions
> anymore (and my Lua code isn't running off the rails)!!
>
>
> Javier - I reckon you adding the extra lock/unlock inside your C routines is
> probably greatly minimising the issue because you are reducing its
> probability, but you may well find it's the same one that I have and can be
> resolved completely with the dojump patch.
>
>
> Any thoughts or comments??

Do you have a patch of the modifications available ?

Re: Re: Problem with concurrency, threads, Lua states and maybe GC?

Javier Guerra Giraldez
In reply to this post by Matt 'Matic' (Lua)
On Wed, Dec 2, 2009 at 11:08 AM, Matt 'Matic' (Lua) <[hidden email]> wrote:
> Javier - I reckon you adding the extra lock/unlock inside your C routines is
> probably greatly minimising the issue because you are reducing its
> probability, but you may well find it's the same one that I have and can be
> resolved completely with the dojump patch.

i think you mean Jerome, right?



--
Javier

Re: Problem with concurrency, threads, Lua states and maybe GC?

Matt 'Matic' (Lua)
In reply to this post by Jerome Vuarand


Jerome Vuarand wrote:
> 2009/12/2 Matt 'Matic' (Lua) [hidden email]:
>> ...
>> As you both have said, it would on the surface appear as though I had two
>> threads accessing one lua_State, or some other C/C++ corruption of Lua.
>> However, I've put a huge amount of code to prove that wasn't the case (just
>> doubting myself!!). Everything appears correct and coherent.
>>
>> My view is that the Lua VM is making assumptions about the L->base value.
>> Generally, L->base doesn't change, so it holds a local copy to avoid
>> indirection and speed up the LVM. In some opcodes, the VM knows that L->base
>> could or is definitely going to change, so the call is wrapped up in the
>> "Protect" macro, which reassigns the local copy of base after completion.
>>
>> However, it would appear that there are 6 (IIRC) op codes that call dojump
>> outside of the context of "Protect" and assume that L->base is not going to
>> change. If you have one Lua "Universe", one lua_State per OS thread then
>> that will be ok.
>> Once you add OS-level threading and have a single Lua "Universe" - even if
>> you use lua_newthread and strictly keep each lua_State in an OS thread -
>> that assumption is no longer valid.
>>
>> Consequently, I have changed the "dojump" macro in lvm.c to now be:
>>
>>     #define dojump(L,pc,i) { (pc) += (i); luai_threadyield(L); base = L->base; }
>>
>> Now, I know that some "dojump" calls are also wrapped inside "Protect" and
>> therefore the "base = L->base" is going to be duplicated in those cases. Of
>> course, my optimising compiler removes the redundancy.
>>
>> Guess what - the problem has gone away and Lua is not failing its assertions
>> anymore (and my Lua code isn't running off the rails)!!
>>
>> Javier - I reckon you adding the extra lock/unlock inside your C routines is
>> probably greatly minimising the issue because you are reducing its
>> probability, but you may well find it's the same one that I have and can be
>> resolved completely with the dojump patch.
>>
>> Any thoughts or comments??
>
> Do you have a patch of the modifications available ?

  
lvm.c (version 2.63.1.3 2007/12/28 15:32:23), line 360:

    #define dojump(L,pc,i) {(pc) += (i); luai_threadyield(L);}

Change to:

    #define dojump(L,pc,i) {(pc) += (i); luai_threadyield(L); base = L->base;}

That's all there is to it :-)


Re: Problem with concurrency, threads, Lua states and maybe GC?

Matt 'Matic' (Lua)
In reply to this post by Javier Guerra Giraldez


Javier Guerra wrote:
> On Wed, Dec 2, 2009 at 11:08 AM, Matt 'Matic' (Lua) [hidden email] wrote:
>> Javier - I reckon you adding the extra lock/unlock inside your C routines is
>> probably greatly minimising the issue because you are reducing its
>> probability, but you may well find it's the same one that I have and can be
>> resolved completely with the dojump patch.
>
> i think you mean Jerome, right?



  
Whoops!
It's been a long few days digging in the bowels of Lua :o)
LOL!


Re: Re: Problem with concurrency, threads, Lua states and maybe GC?

Roberto Ierusalimschy
In reply to this post by Matt 'Matic' (Lua)
> Consequently, I have changed the "dojump" macro in lvm.c to now be:
>
>    #define dojump(L,pc,i) { (pc) += (i); luai_threadyield(L); *base = L->base; *}
>
> /Now, I know that some "dojump" calls are also wrapped inside "Protect"  
> and therefore the "base = L->base" is going to be duplicated in those  
> cases. Of course, my optimising compiler removes the redundancy./
>
> Guess what - the problem has gone away and Lua is not failing its  
> assertions anymore (and my Lua code isn't running off the rails)!!
>
>
> [...]
>
> Any thoughts or comments??

One possible culprit is the resize of stacks in the garbage collector.
In one phase of the garbage collection, the collector may shrink
some stacks if they seem too big. So, I guess it may happen that the
collector running in one thread shrinks the stack of another thread.
(Of course that never crossed my mind until now..)
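
As an illustration of why a shrinking stack invalidates raw pointers, and of the usual defence of holding an offset across any point where the stack may be resized, here is a small self-contained sketch. It is a toy model and not Lua's collector: `MiniStack`, `mini_init`, `mini_shrink`, `mini_save` and `mini_restore` are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy resizable stack; stands in for a Lua thread's stack. */
typedef struct MiniStack {
    int *slots;    /* heap block holding the stack */
    size_t size;   /* allocated slots */
    size_t top;    /* slots currently in use */
} MiniStack;

int mini_init(MiniStack *s, size_t size) {
    assert(size > 0);
    s->slots = calloc(size, sizeof *s->slots);
    if (s->slots == NULL) return -1;
    s->size = size;
    s->top = 0;
    return 0;
}

/* Shrink the stack if the in-use part still fits. realloc may move the
   block, so every raw pointer into the old block must be treated as dead. */
int mini_shrink(MiniStack *s, size_t new_size) {
    int *p;
    assert(new_size > 0 && new_size <= s->size);
    if (s->top > new_size) return -1;   /* would cut off live slots */
    p = realloc(s->slots, new_size * sizeof *p);
    if (p == NULL) return -1;
    s->slots = p;
    s->size = new_size;
    return 0;
}

/* Convert a pointer into the stack to an offset that survives a resize. */
ptrdiff_t mini_save(const MiniStack *s, const int *p) {
    return p - s->slots;
}

/* Rebuild a valid pointer from a saved offset after a possible resize. */
int *mini_restore(const MiniStack *s, ptrdiff_t off) {
    return s->slots + off;
}
```

A local copy of a stack pointer, such as `base` in the interpreter loop, is safe only if it is rebuilt like this (or reloaded from the state) after every point where another thread's collector could have resized the stack.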

-- Roberto

Re: Problem with concurrency, threads, Lua states and maybe GC?

Matt 'Matic' (Lua)
Roberto Ierusalimschy wrote:
>> Consequently, I have changed the "dojump" macro in lvm.c to now be:
>>
>>    #define dojump(L,pc,i) { (pc) += (i); luai_threadyield(L); base = L->base; }
>>
>> /Now, I know that some "dojump" calls are also wrapped inside "Protect"
>> and therefore the "base = L->base" is going to be duplicated in those
>> cases. Of course, my optimising compiler removes the redundancy./
>>
>> Guess what - the problem has gone away and Lua is not failing its
>> assertions anymore (and my Lua code isn't running off the rails)!!
>>
>> [...]
>>
>> Any thoughts or comments??
>
> One possible culprit is the resize of stacks in the garbage collector.
> In one phase of the garbage collection, the collector may shrink
> some stacks if they seem too big. So, I guess it may happen that the
> collector running in one thread shrinks the stack of another thread.
> (Of course that never crossed my mind until now..)
>
> -- Roberto

  
Roberto - given the infrequent nature of the issue it would appear that it's a GC inside another thread.
Would you say the fix is sufficient? Or are there other things that cross your mind now?? :-)

(BTW, I removed the bold asterisks from the quoted text above to make the modification clearer!)

Re: Problem with concurrency, threads, Lua states and maybe GC?

Linker
Threads are EVIL!

On Thu, Dec 3, 2009 at 04:46, Matt "Matic" Page (Lua) <[hidden email]> wrote:
> Roberto Ierusalimschy wrote:
>>> Consequently, I have changed the "dojump" macro in lvm.c to now be:
>>>
>>>    #define dojump(L,pc,i) { (pc) += (i); luai_threadyield(L); base = L->base; }
>>>
>>> /Now, I know that some "dojump" calls are also wrapped inside "Protect"
>>> and therefore the "base = L->base" is going to be duplicated in those
>>> cases. Of course, my optimising compiler removes the redundancy./
>>>
>>> Guess what - the problem has gone away and Lua is not failing its
>>> assertions anymore (and my Lua code isn't running off the rails)!!
>>>
>>> [...]
>>>
>>> Any thoughts or comments??
>>
>> One possible culprit is the resize of stacks in the garbage collector.
>> In one phase of the garbage collection, the collector may shrink
>> some stacks if they seem too big. So, I guess it may happen that the
>> collector running in one thread shrinks the stack of another thread.
>> (Of course that never crossed my mind until now..)
>>
>> -- Roberto
>
> Roberto - given the infrequent nature of the issue it would appear that it's a GC inside another thread.
> Would you say the fix is sufficient? Or are there other things that cross your mind now?? :-)
>
> (BTW, I removed the bold asterisks from the quoted text above to make the modification clearer!)



--
Regards,
Linker Lin
[hidden email]

Re: Problem with concurrency, threads, Lua states and maybe GC?

steve donovan
On Thu, Dec 3, 2009 at 11:48 AM, Linker <[hidden email]> wrote:
> Threads are EVIL!

They certainly are painful when they don't work!  Of course, you are
meaning OS threads? Because coroutines are much nicer.  Now, if we
could imagine a universe in which all I/O was asynchronous then
everything could be done with coroutines.  Of course, they still have
to be scheduled, get to speak to each other politely, all the usual
concurrency challenges.

Re: Problem with concurrency, threads, Lua states and maybe GC?

Roberto Ierusalimschy
> On Thu, Dec 3, 2009 at 11:48 AM, Linker <[hidden email]> wrote:
> > Threads are EVIL!
>
> They certainly are painful when they don't work!  Of course, you are
> meaning OS threads? Because coroutines are much nicer.  Now, if we
> could imagine a universe in which all I/O was asynchronous then
> everything could be done with coroutines.  Of course, they still have
> to be scheduled, get to speak to each other politely, all the usual
> concurrency challenges.

Not exactly "all the usual". A main difference is that coroutines are
MUCH easier to debug.

-- Roberto


Re: Problem with concurrency, threads, Lua states and maybe GC?

Roberto Ierusalimschy
In reply to this post by Matt 'Matic' (Lua)
> Roberto - given the infrequent nature of the issue it would appear that  
> it's a GC inside another thread.
> Would you say the fix is sufficient?

It seems to be sufficient (but see next).


> Or are there other things that cross your mind now?? :-)

No, but that is not very reassuring. That problem did not cross my mind
until you pinpointed it...

PS: did someone say that threads are evil? :)

-- Roberto

Re: Problem with concurrency, threads, Lua states and maybe GC?

Matt 'Matic' (Lua)
Roberto Ierusalimschy wrote:
>> Roberto - given the infrequent nature of the issue it would appear that
>> it's a GC inside another thread.
>> Would you say the fix is sufficient?
>
> It seems to be sufficient (but see next).

Seems fine this side too. I'm currently thrashing the code to see if anything falls over. Looking good so far :-)

>> Or are there other things that cross your mind now?? :-)
>
> No, but that is not very reassuring. That problem did not cross my mind
> until you pinpointed it...

LOL! Hopefully I won't have to pinpoint any more!
BTW, just to say that Lua has been an absolute gem for our embedded ARM9 product and the integration with C & C++ has enabled some wonderful features. I really must get round to filling in a Lua project description :roll:

> PS: did someone say that threads are evil? :)

Not evil. Just much maligned! LOL!

Re: Problem with concurrency, threads, Lua states and maybe GC?

steve donovan
On Thu, Dec 3, 2009 at 1:41 PM, Matt 'Matic' (Lua) <[hidden email]> wrote:
> PS: did someone say that threads are evil? :)
> Not evil. Just much maligned! LOL!

So ultimately, it's the human's fault ;)  We need a 'Concurrency for
Dummies' paradigm, since we will soon run out of ueber-programmers
with all these new multicore architectures.

Re: Problem with concurrency, threads, Lua states and maybe GC?

Roberto Ierusalimschy
In reply to this post by Matt 'Matic' (Lua)
> BTW, just to say that Lua has been an absolute gem for our embedded ARM9  
> product and the integration with C & C++ has enabled some wonderful  
> features. [...]

Thanks :)

-- Roberto

Re: Problem with concurrency, threads, Lua states and maybe GC?

Matt 'Matic' (Lua)
In reply to this post by Roberto Ierusalimschy
Hi Roberto,
Could you please confirm that the Lua 5.2.0 work 3 includes the fix to
this GC-in-another-thread-while-jump issue?
I can see that the mechanics of the dojump have changed (i.e. using
savedpc, and not calling threadyield).
Thanks!
Matt

On 02/12/2009 19:08, Roberto Ierusalimschy wrote:

>> Consequently, I have changed the "dojump" macro in lvm.c to now be:
>>
>>     #define dojump(L,pc,i) { (pc) += (i); luai_threadyield(L); *base = L->base; *}
>>
>> /Now, I know that some "dojump" calls are also wrapped inside "Protect"
>> and therefore the "base = L->base" is going to be duplicated in those
>> cases. Of course, my optimising compiler removes the redundancy./
>>
>> Guess what - the problem has gone away and Lua is not failing its
>> assertions anymore (and my Lua code isn't running off the rails)!!
>>
>>
>> [...]
>>
>> Any thoughts or comments??
>>      
> One possible culprit is the resize of stacks in the garbage collector.
> In one phase of the garbage collection, the collector may shrink
> some stacks if they seem too big. So, I guess it may happen that the
> collector running in one thread shrinks the stack of another thread.
> (Of course that never crossed my mind until now..)
>
> -- Roberto
>
>    

Re: Problem with concurrency, threads, Lua states and maybe GC?

Roberto Ierusalimschy
> Could you please confirm that the Lua 5.2.0 work 3 includes the fix
> to this GC-in-another-thread-while-jump issue?

I moved the threadyield call to points where the GC may be called;
these points can trigger finalizers and therefore Lua must be ready to
all kinds of changes there. (This makes tight loops that do not create
objects "atomic", however.)
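
The "atomic" remark can be illustrated with a deliberately simplified model, which is not the Lua 5.2 implementation. It counts how often another OS thread would get a chance to run under two policies: yielding on every jump, and yielding only at operations that may call the GC (modelled here as allocations). The names `ToyOp`, `YieldPolicy` and `count_yield_points` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Kinds of operation in a toy instruction trace. */
typedef enum { TOY_OP_JUMP, TOY_OP_ALLOC, TOY_OP_OTHER } ToyOp;

/* Where the interpreter gives other OS threads a chance to run. */
typedef enum { YIELD_ON_JUMP, YIELD_ON_GC_POINT } YieldPolicy;

/* Counts the yield opportunities in a trace of n executed operations. */
size_t count_yield_points(const ToyOp *ops, size_t n, YieldPolicy policy) {
    size_t yields = 0;
    size_t i;
    assert(ops != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (policy == YIELD_ON_JUMP && ops[i] == TOY_OP_JUMP) yields++;
        if (policy == YIELD_ON_GC_POINT && ops[i] == TOY_OP_ALLOC) yields++;
    }
    return yields;
}
```

For a trace of a tight loop that only computes and jumps, the first policy yields once per iteration and the second never yields, so in this model such a loop holds the global lock until it finishes or reaches an allocation.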

-- Roberto

Re: Re: Problem with concurrency, threads, Lua states and maybe GC?

Matt 'Matic' (Lua)
Thanks for the quick response :-)
> (This makes tight loops that do not create
> objects "atomic", however.)
Can you elaborate? Could these changes affect the way my Lua & OS
threads work together? (i.e. is there anything I need to be aware of, or
change, when migrating up to 5.2)
Thanks!
Matt

On 22/07/28164 20:59, Roberto Ierusalimschy wrote:

>> Could you please confirm that the Lua 5.2.0 work 3 includes the fix
>> to this GC-in-another-thread-while-jump issue?
>>      
> I moved the threadyield call to points where the GC may be called;
> these points can trigger finalizers and therefore Lua must be ready to
> all kinds of changes there. (This makes tight loops that do not create
> objects "atomic", however.)
>
> -- Roberto
>
>    