Looking for some help debugging a weird crash in Lua

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Looking for some help debugging a weird crash in Lua

Chris Jones
Hey folks

I'm going over some crash reports for Hammerspoon (so we're talking about macOS here) and I've got several where the crash is happening inside Lua (5.4.2) in a way I can't explain. I would love it if someone has some ideas as to why. Unfortunately, because it's just a crash report uploaded from a user's machine with no real context, I have no idea how to reproduce it.

Firstly, here is the raw crash dump:

Exception Type: EXC_BAD_ACCESS (SIGBUS)
Exception Codes: BUS_NOOP at 0x0000000000000002
Crashed Thread: 0

Application Specific Information:
Exception 1, Code 2, Subcode 8 >
Attempted to dereference garbage pointer 0x2.

Thread 0 Crashed:
0   libsystem_kernel.dylib          0xfffe4076670e      read$NOCANCEL
1   libsystem_c.dylib               0xfffe405b4589      __srefill1
2   libsystem_c.dylib               0xfffe405af0a9      __fread
3   libsystem_c.dylib               0xfffe405aef16      fread
4   LuaSkin                         0x108dfb84f         [inlined] read_all (liolib.c:540)
5   LuaSkin                         0x108dfb84f         g_read (liolib.c:591)
6   LuaSkin                         0x108dfdd1b         luaD_precall (ldo.c:519)
7   LuaSkin                         0x108df6200         luaV_execute (lvm.c:1615)
8   LuaSkin                         0x108dfddb9         ccall (ldo.c:564)
9   LuaSkin                         0x108debc18         luai_objcttry (lobjectivec_exceptions.m:84)
10  LuaSkin                         0x108dfe26b         [inlined] luaD_rawrunprotected (ldo.c:160)
11  LuaSkin                         0x108dfe26b         luaD_pcall (ldo.c:800)
12  LuaSkin                         0x108e0164a         lua_pcallk (lapi.c:1031)
13  LuaSkin                         0x108de0d1b         -[LuaSkin protectedCallAndTraceback:nresults:] (Skin.m:368)
14  internal.so                     0x1176427f2         eventtap_callback (internal.m:48)

Starting from the bottom, eventtap_callback is called by the OS when a relevant HID event happens, and it asks our Lua wrapper (LuaSkin) to call a Lua function the user has supplied and that we stored a lua reference to. That happens in protectedCallAndTraceback (essentially just a lua_pcall() wrapper with debug.traceback in the stack as the message handler).
Then we see the pcall bits, then luai_objcttry which is from the only patch we apply to Lua - taken from Eric Wing's LuaCocoa, to allow Lua errors to fall nicely back into Objective C Land.
Then we see ccall, and tracing through to luaD_precall in ldo.c:519 we get to:

n = (*f)(L);

and f seemingly points to g_read since that is next in the backtrace, except this is where I get confused - the declaration of g_read() is:

static int g_read (lua_State *L, FILE *f, int first);

so the second two parameters aren't being passed any values. I believe this is undefined behaviour in C, but the "garbage pointer 0x2" suggests to me that clang is passing them as NULL or zero or something along those lines.

From there it's not surprising something is going to crash, but my real question here is - how come g_read() could ever be the function pointer that Lua is calling? Does that suggest some kind of other memory corruption in my code somewhere, and the fact that it happens to end up pointing at g_read() is a coincidence? I'm finding that a little hard to believe because I've had 3 reports of this crash in the last week. Surely this means I'm misreading the backtrace somehow?

So there we are, I am lost and confused, and a little disappointed that this particular crash report didn't come through with a register dump for some reason.

Cheers,
--
Cheers,

Chris
Reply | Threaded
Open this post in threaded view
|

Re: Looking for some help debugging a weird crash in Lua

Robert Burke
On Wed, Mar 10, 2021 at 9:54 AM Chris Jones <[hidden email]> wrote:
> n = (*f)(L);
>
> and f seemingly points to g_read since that is next in the backtrace, except this is where I get confused - the declaration of g_read() is:
>
> static int g_read (lua_State *L, FILE *f, int first);
>
> so the second two parameters aren't being passed any values. I believe this is undefined behaviour in C, but the "garbage pointer 0x2" suggests to me that clang is passing them as NULL or zero or something along those lines.
I suspect that f actually points to io_read or f_read:

static int io_read (lua_State *L) {
  return g_read(L, getiofile(L, IO_INPUT), 1);
}

static int f_read (lua_State *L) {
  return g_read(L, tofile(L), 2);
}

and that the compiler has replaced the tail call with a jump, so that
the function pointed to by f is absent from your stack trace.
Reply | Threaded
Open this post in threaded view
|

Re: Looking for some help debugging a weird crash in Lua

Chris Jones
In reply to this post by Chris Jones
Thanks Robert, that makes sense, so the user seems to be calling either io.read(), or :read() on an io object.

So taking that as legitimate, I'm now trying to explain the "garbage pointer 0x2" in fread(). Its signature is:

size_t fread(void *restrict ptr, size_t size, size_t nitems, FILE *restrict stream);

My first thought was maybe it's the FILE that's NULL and something inside it at offset 0x2 is what is being access, but the first field is a pointer, ie 8 bytes, making that unlikely.

So, then maybe it's the destination buffer that's the problem?

read_all() calls luaL_prepbuffer() right before fread(), to prepare the destination buffer, and that calls through to prepbuffsize() which appears to be doing pointer math on a luaL_Buffer struct, specifically foo->b + foo->n, so if b is NULL and n is 2 then we have a plausible path to the 0x2 error.

The question then becomes, how can foo->b be NULL?

Looking at prepbuffsize() I see three paths through the function, and I'm not sure I can rule any of them out as being incapable of producing a NULL.

Either way, I'm still not familiar enough with the Lua code to know how the buffer address could end up being NULL. Any ideas?

Cheers,
Chris

On Wed, 10 Mar 2021 at 00:53, Chris Jones <[hidden email]> wrote:
Hey folks

I'm going over some crash reports for Hammerspoon (so we're talking about macOS here) and I've got several where the crash is happening inside Lua (5.4.2) in a way I can't explain. I would love it if someone has some ideas as to why. Unfortunately, because it's just a crash report uploaded from a user's machine with no real context, I have no idea how to reproduce it.

Firstly, here is the raw crash dump:

Exception Type: EXC_BAD_ACCESS (SIGBUS)
Exception Codes: BUS_NOOP at 0x0000000000000002
Crashed Thread: 0

Application Specific Information:
Exception 1, Code 2, Subcode 8 >
Attempted to dereference garbage pointer 0x2.

Thread 0 Crashed:
0   libsystem_kernel.dylib          0xfffe4076670e      read$NOCANCEL
1   libsystem_c.dylib               0xfffe405b4589      __srefill1
2   libsystem_c.dylib               0xfffe405af0a9      __fread
3   libsystem_c.dylib               0xfffe405aef16      fread
4   LuaSkin                         0x108dfb84f         [inlined] read_all (liolib.c:540)
5   LuaSkin                         0x108dfb84f         g_read (liolib.c:591)
6   LuaSkin                         0x108dfdd1b         luaD_precall (ldo.c:519)
7   LuaSkin                         0x108df6200         luaV_execute (lvm.c:1615)
8   LuaSkin                         0x108dfddb9         ccall (ldo.c:564)
9   LuaSkin                         0x108debc18         luai_objcttry (lobjectivec_exceptions.m:84)
10  LuaSkin                         0x108dfe26b         [inlined] luaD_rawrunprotected (ldo.c:160)
11  LuaSkin                         0x108dfe26b         luaD_pcall (ldo.c:800)
12  LuaSkin                         0x108e0164a         lua_pcallk (lapi.c:1031)
13  LuaSkin                         0x108de0d1b         -[LuaSkin protectedCallAndTraceback:nresults:] (Skin.m:368)
14  internal.so                     0x1176427f2         eventtap_callback (internal.m:48)

Starting from the bottom, eventtap_callback is called by the OS when a relevant HID event happens, and it asks our Lua wrapper (LuaSkin) to call a Lua function the user has supplied and that we stored a lua reference to. That happens in protectedCallAndTraceback (essentially just a lua_pcall() wrapper with debug.traceback in the stack as the message handler).
Then we see the pcall bits, then luai_objcttry which is from the only patch we apply to Lua - taken from Eric Wing's LuaCocoa, to allow Lua errors to fall nicely back into Objective C Land.
Then we see ccall, and tracing through to luaD_precall in ldo.c:519 we get to:

n = (*f)(L);

and f seemingly points to g_read since that is next in the backtrace, except this is where I get confused - the declaration of g_read() is:

static int g_read (lua_State *L, FILE *f, int first);

so the second two parameters aren't being passed any values. I believe this is undefined behaviour in C, but the "garbage pointer 0x2" suggests to me that clang is passing them as NULL or zero or something along those lines.

From there it's not surprising something is going to crash, but my real question here is - how come g_read() could ever be the function pointer that Lua is calling? Does that suggest some kind of other memory corruption in my code somewhere, and the fact that it happens to end up pointing at g_read() is a coincidence? I'm finding that a little hard to believe because I've had 3 reports of this crash in the last week. Surely this means I'm misreading the backtrace somehow?

So there we are, I am lost and confused, and a little disappointed that this particular crash report didn't come through with a register dump for some reason.

Cheers,
--
Cheers,

Chris


--
Cheers,

Chris
Reply | Threaded
Open this post in threaded view
|

Re: Looking for some help debugging a weird crash in Lua

Ranier Vilela-2
Em qua., 10 de mar. de 2021 às 06:44, Chris Jones <[hidden email]> escreveu:
Thanks Robert, that makes sense, so the user seems to be calling either io.read(), or :read() on an io object.

So taking that as legitimate, I'm now trying to explain the "garbage pointer 0x2" in fread(). Its signature is:

size_t fread(void *restrict ptr, size_t size, size_t nitems, FILE *restrict stream);

My first thought was maybe it's the FILE that's NULL and something inside it at offset 0x2 is what is being access, but the first field is a pointer, ie 8 bytes, making that unlikely.

So, then maybe it's the destination buffer that's the problem?

read_all() calls luaL_prepbuffer() right before fread(), to prepare the destination buffer, and that calls through to prepbuffsize() which appears to be doing pointer math on a luaL_Buffer struct, specifically foo->b + foo->n, so if b is NULL and n is 2 then we have a plausible path to the 0x2 error.

The question then becomes, how can foo->b be NULL?

Looking at prepbuffsize() I see three paths through the function, and I'm not sure I can rule any of them out as being incapable of producing a NULL.

Either way, I'm still not familiar enough with the Lua code to know how the buffer address could end up being NULL. Any ideas?
I'm not sure that points you in the right direction.
But I had many segmentation faults, with C API lua stack manipulation, particularly with lua_pop.

Another trap is local var become garbage, like this:
function foo()
local a
local b
return b  -- a is collected by GC here
end

This function has helped me:

int print_cstack(lua_State *L, const char * const title)
{
const char * name;
    int max = lua_gettop(L);
    int i;

    assert(L != NULL && title != NULL);

    fprintf(stderr, "Stack dump: %s\n", title);
    for(i = 1 ; i <= max; ++i) {
       if (luaL_getmetafield(L, i, "__name") == LUA_TSTRING) {
            name = lua_tostring(L, -1);
            lua_pop(L, 1);
  } else {
   name = "";
  }
       fprintf(
            stderr,
            "%02d %d = %s %s %s\n",
            i,
            i - max - 1,
            luaL_typename(L, i),
            luaL_tolstring(L, i, NULL),
name);
       lua_pop(L, 1);
    }
    fprintf(stderr, "\n");
    return max;
}

regards,
Ranier Vilela



Cheers,
Chris

On Wed, 10 Mar 2021 at 00:53, Chris Jones <[hidden email]> wrote:
Hey folks

I'm going over some crash reports for Hammerspoon (so we're talking about macOS here) and I've got several where the crash is happening inside Lua (5.4.2) in a way I can't explain. I would love it if someone has some ideas as to why. Unfortunately, because it's just a crash report uploaded from a user's machine with no real context, I have no idea how to reproduce it.

Firstly, here is the raw crash dump:

Exception Type: EXC_BAD_ACCESS (SIGBUS)
Exception Codes: BUS_NOOP at 0x0000000000000002
Crashed Thread: 0

Application Specific Information:
Exception 1, Code 2, Subcode 8 >
Attempted to dereference garbage pointer 0x2.

Thread 0 Crashed:
0   libsystem_kernel.dylib          0xfffe4076670e      read$NOCANCEL
1   libsystem_c.dylib               0xfffe405b4589      __srefill1
2   libsystem_c.dylib               0xfffe405af0a9      __fread
3   libsystem_c.dylib               0xfffe405aef16      fread
4   LuaSkin                         0x108dfb84f         [inlined] read_all (liolib.c:540)
5   LuaSkin                         0x108dfb84f         g_read (liolib.c:591)
6   LuaSkin                         0x108dfdd1b         luaD_precall (ldo.c:519)
7   LuaSkin                         0x108df6200         luaV_execute (lvm.c:1615)
8   LuaSkin                         0x108dfddb9         ccall (ldo.c:564)
9   LuaSkin                         0x108debc18         luai_objcttry (lobjectivec_exceptions.m:84)
10  LuaSkin                         0x108dfe26b         [inlined] luaD_rawrunprotected (ldo.c:160)
11  LuaSkin                         0x108dfe26b         luaD_pcall (ldo.c:800)
12  LuaSkin                         0x108e0164a         lua_pcallk (lapi.c:1031)
13  LuaSkin                         0x108de0d1b         -[LuaSkin protectedCallAndTraceback:nresults:] (Skin.m:368)
14  internal.so                     0x1176427f2         eventtap_callback (internal.m:48)

Starting from the bottom, eventtap_callback is called by the OS when a relevant HID event happens, and it asks our Lua wrapper (LuaSkin) to call a Lua function the user has supplied and that we stored a lua reference to. That happens in protectedCallAndTraceback (essentially just a lua_pcall() wrapper with debug.traceback in the stack as the message handler).
Then we see the pcall bits, then luai_objcttry which is from the only patch we apply to Lua - taken from Eric Wing's LuaCocoa, to allow Lua errors to fall nicely back into Objective C Land.
Then we see ccall, and tracing through to luaD_precall in ldo.c:519 we get to:

n = (*f)(L);

and f seemingly points to g_read since that is next in the backtrace, except this is where I get confused - the declaration of g_read() is:

static int g_read (lua_State *L, FILE *f, int first);

so the second two parameters aren't being passed any values. I believe this is undefined behaviour in C, but the "garbage pointer 0x2" suggests to me that clang is passing them as NULL or zero or something along those lines.

From there it's not surprising something is going to crash, but my real question here is - how come g_read() could ever be the function pointer that Lua is calling? Does that suggest some kind of other memory corruption in my code somewhere, and the fact that it happens to end up pointing at g_read() is a coincidence? I'm finding that a little hard to believe because I've had 3 reports of this crash in the last week. Surely this means I'm misreading the backtrace somehow?

So there we are, I am lost and confused, and a little disappointed that this particular crash report didn't come through with a register dump for some reason.

Cheers,
--
Cheers,

Chris


--
Cheers,

Chris
Reply | Threaded
Open this post in threaded view
|

Re: Looking for some help debugging a weird crash in Lua

Ranier Vilela-2
In reply to this post by Chris Jones
Em qua., 10 de mar. de 2021 às 06:44, Chris Jones <[hidden email]> escreveu:
Thanks Robert, that makes sense, so the user seems to be calling either io.read(), or :read() on an io object.

So taking that as legitimate, I'm now trying to explain the "garbage pointer 0x2" in fread(). Its signature is:

size_t fread(void *restrict ptr, size_t size, size_t nitems, FILE *restrict stream);

My first thought was maybe it's the FILE that's NULL and something inside it at offset 0x2 is what is being access, but the first field is a pointer, ie 8 bytes, making that unlikely.

So, then maybe it's the destination buffer that's the problem?

read_all() calls luaL_prepbuffer() right before fread(), to prepare the destination buffer, and that calls through to prepbuffsize() which appears to be doing pointer math on a luaL_Buffer struct, specifically foo->b + foo->n, so if b is NULL and n is 2 then we have a plausible path to the 0x2 error.

The question then becomes, how can foo->b be NULL?

Looking at prepbuffsize() I see three paths through the function, and I'm not sure I can rule any of them out as being incapable of producing a NULL.

Either way, I'm still not familiar enough with the Lua code to know how the buffer address could end up being NULL. Any ideas?
As I suspected.

Whenever you encounter EXC_BAD_ACCESS, it means that you are sending a message to an object that has already been released.

This is the most common scenario, but there are exceptions as we'll discuss in a moment.


fread is calling with probably FILE var already released.

regards,
Ranier Vilela
Reply | Threaded
Open this post in threaded view
|

Re: Looking for some help debugging a weird crash in Lua

Chris Jones
I'm fairly sure that it's not FILE that's NULL, because the 0x2 offset wouldn't align with anything in the FILE struct - I think it needs to be the destination pointer that's NULL.

It's interesting that of the handful of "weird" crash logs I get, all of them are where one of our C callbacks is called by the OS and we pcall() into Lua. To me that definitely adds to the "this is a lifecycle thing". I've put in place quite a few protections already to try and make sure those things only happen when lua_State is consistent, but perhaps I am still missing something.

On Thu, 11 Mar 2021 at 02:03, Ranier Vilela <[hidden email]> wrote:
Em qua., 10 de mar. de 2021 às 06:44, Chris Jones <[hidden email]> escreveu:
Thanks Robert, that makes sense, so the user seems to be calling either io.read(), or :read() on an io object.

So taking that as legitimate, I'm now trying to explain the "garbage pointer 0x2" in fread(). Its signature is:

size_t fread(void *restrict ptr, size_t size, size_t nitems, FILE *restrict stream);

My first thought was maybe it's the FILE that's NULL and something inside it at offset 0x2 is what is being access, but the first field is a pointer, ie 8 bytes, making that unlikely.

So, then maybe it's the destination buffer that's the problem?

read_all() calls luaL_prepbuffer() right before fread(), to prepare the destination buffer, and that calls through to prepbuffsize() which appears to be doing pointer math on a luaL_Buffer struct, specifically foo->b + foo->n, so if b is NULL and n is 2 then we have a plausible path to the 0x2 error.

The question then becomes, how can foo->b be NULL?

Looking at prepbuffsize() I see three paths through the function, and I'm not sure I can rule any of them out as being incapable of producing a NULL.

Either way, I'm still not familiar enough with the Lua code to know how the buffer address could end up being NULL. Any ideas?
As I suspected.

Keeping It Simple

Whenever you encounter EXC_BAD_ACCESS, it means that you are sending a message to an object that has already been released.

This is the most common scenario, but there are exceptions as we'll discuss in a moment.


fread is calling with probably FILE var already released.

regards,
Ranier Vilela


--
Cheers,

Chris