Lua+Coco on Tiny Devices

Lua+Coco on Tiny Devices

jbsnyder
Administrator
Hi -

This is mainly directed at Mike Pall, but I figured it would be good  
to have the results of this discussion out in the open.

I've recently been working on making Coco work with eLua (http://eluaproject.net 
), and I've managed to get the patching approach to work for setjmp as  
provided by newlib.  I'll be happy to contribute back the changes for  
this when they're a little more polished and I've had a chance to test  
them more.

The first thing I noticed is that the default stack size pretty much
exceeded the available memory on the device I was using (64k SRAM
total). I was able to pare it back to about 1k and get all of the test
scripts running with reduced numbers of coroutines.  When I try going
lower than that (to, say, 512 or 256 bytes), though, it blows up.  I'm
not familiar enough with Lua itself or with Coco to know what to
expect in terms of necessary stack space, but I was wondering if
anyone had any thoughts on this.  When the coroutines are created in
the example scripts, what exactly needs to be preserved on the stack?
Can this be pared back at the cost of performance?  Also, could one
safely allocate stack space on demand, based on actual usage, rather
than pre-allocating it?
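
To illustrate which knob I've been turning: Coco exposes the C stack
size on the Lua side -- coroutine.cstacksize() for the default and, if
I'm reading the Coco docs right, an optional second argument to
coroutine.create() for individual coroutines -- so the experiments look
roughly like this (the sizes are just the ones mentioned above, not
recommendations):

-- shrink the default C stack for coroutines created from here on;
-- ~1k is the smallest value that has worked for me so far
coroutine.cstacksize(1024)

-- or request a per-coroutine C stack size (optional argument per the
-- Coco docs)
local co = coroutine.create(function(n)
  for i = 1, n do coroutine.yield(i) end
end, 1024)

print(coroutine.resume(co, 3))   --> true   1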

All in all, I was amazed at how easy it was to get up and running on a  
device that has no operating system, but I'd like to see what I can  
squeeze out of it to make the tradeoffs work better for embedded  
devices.

Thanks!

--
James Snyder
Biomedical Engineering
Northwestern University
[hidden email]
http://fanplastic.org/key.txt
ph: (847) 448-0386



Re: Lua+Coco on Tiny Devices

Mike Pall-2
James Snyder wrote:
> I've recently been working on making Coco work with eLua
> (http://eluaproject.net), and I've managed to get the patching approach
> to work for setjmp as provided by newlib.  I'll be happy to contribute
> back the changes for this when they're a little more polished and I've
> had a chance to test them more.

Please do. Patches for better target support are always welcome.

> The first thing I noticed is that the default stack size pretty much  
> exceeded the available memory on the device I was using (64k SRAM  
> total).

Well, 64K is just barely enough to run some mini apps with plain
Lua. Coco requires additional memory for the C stacks of every
coroutine. There are other workarounds for the issues Coco solves
(e.g. yielding across pcall). Of course these workarounds may not
be as convenient or as complete as Coco, but may be more suitable
given your constraints.

> I was able to pare it back to about 1k and get all of the test  
> scripts running with reduced numbers of coroutines.  When I try going  
> lower than this further (to, say, 512 or 256 bytes), though, it blows  
> up.  I'm not familiar enough with Lua itself or with coco to know what  
> to expect in terms of necessary stack space, but I was wondering if  
> anyone had any thoughts on this.

Umm, you're playing a bit of Russian Roulette here. Since your
device doesn't have an MMU, you'll only notice a stack overflow
because some unrelated part of memory gets corrupted. Of course
this can happen with the main C stack, too.

It's hard to determine the absolute minimum amount of C stack
space needed by your app. You need to find the worst offenders and
round up generously:
- Most of the string.* functions need at least BUFSIZ bytes
  (usually 4K or 8K).
- The Lua parser may need quite a bit of stack space depending on
  the maximum syntactic nesting level.
- Some functions from the C library can be surprisingly expensive
  in terms of C stack space usage (e.g. gethostbyname).

I don't think you can go significantly below 16K without having
complete control over _all_ of the code ever run by your app.
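
E.g. the parser point is easy to probe even with stock Lua 5.1 and
nothing target-specific (a sketch only; note that stock Lua also stops
at its own nesting limit, LUAI_MAXCCALLS in luaconf.h, long before the
C stack becomes the real constraint on a desktop):

-- feed the parser increasingly nested expressions; each nesting level
-- costs recursive-descent C stack
local function parses_at_depth(depth)
  local src = "return " .. string.rep("(", depth) .. "1" .. string.rep(")", depth)
  local chunk, err = loadstring(src)
  return chunk ~= nil, err
end

for _, depth in ipairs{ 16, 64, 128, 256 } do
  print(depth, parses_at_depth(depth))
end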

Dedicating a quarter of your RAM for a single coroutine doesn't
sound like a good deal. That's why I said above that Coco is
probably not the best solution for a low memory environment.

> When the coroutines are created in the
> example scripts, what exactly is needing to be preserved on the stack?  
> Can this be pared back at the cost of performance?

Everything that's live on a C stack needs to be preserved.

> Also, could one
> safely allocate stack space only as needed by actual stack size as
> opposed to pre-allocation?

Nope, C stacks cannot be relocated easily. The C compiler is free
to store pointers (i.e. absolute addresses) to other parts of the
same stack in there. Alternative layouts (e.g. linked stacks) are
rarely supported by C compilers.

--Mike

Re: Lua+Coco on Tiny Devices

Jerome Vuarand
In reply to this post by jbsnyder
2009/4/14 James Snyder <[hidden email]>:

> The first thing I noticed is that the default stack size pretty much
> exceeded the available memory on the device I was using (64k SRAM total). I
> was able to pare it back to about 1k and get all of the test scripts running
> with reduced numbers of coroutines.  When I try going lower than this
> further (to, say, 512 or 256 bytes), though, it blows up.  I'm not familiar
> enough with Lua itself or with coco to know what to expect in terms of
> necessary stack space, but I was wondering if anyone had any thoughts on
> this.  When the coroutines are created in the example scripts, what exactly
> is needing to be preserved on the stack?  Can this be pared back at the cost
> of performance?  Also, could one safely allocate stack space only as needed
> by actual stack size as opposed to pre-allocation?

AFAIK a C stack has to be a single block of memory. So to have several
independently growing C stacks you need an MMU (to fragment the stack
physically while keeping it contiguous in terms of address space), and
you will only be able to do it on a page-by-page basis. But I guess
most of your embedded targets don't provide an MMU.


Re: Lua+Coco on Tiny Devices

jbsnyder
Administrator
In reply to this post by Mike Pall-2

On Apr 14, 2009, at 12:06 PM, Mike Pall wrote:

> James Snyder wrote:
>> I've recently been working on making Coco work with eLua
>> (http://eluaproject.net), and I've managed to get the patching  
>> approach
>> to work for setjmp as provided by newlib.  I'll be happy to  
>> contribute
>> back the changes for this when they're a little more polished and  
>> I've
>> had a chance to test them more.
>
> Please do. Patches for better target support are always welcome.
I'll be happy to send them back regardless of whether it turns out to  
be suitable for these devices.  There are certainly micros with more  
RAM.

>
>> The first thing I noticed is that the default stack size pretty much
>> exceeded the available memory on the device I was using (64k SRAM
>> total).
>
> Well, 64K is just barely enough to run some mini apps with plain
> Lua. Coco requires additional memory for the C stacks of every
> coroutine. There are other workarounds for the issues Coco solves
> (e.g. yielding across pcall). Of course these workarounds may not
> be as convenient or as complete as Coco, but may be more suitable
> given your constraints.
We're currently able to start up the interpreter in about 5-6k of  
SRAM, in large part through some patches that Bogdan Marinescu  
developed to keep some things in flash that would otherwise get loaded  
into RAM.   That said, we're certainly not expecting miracles in terms  
of RAM usage; most applications that one would run on eLua are fairly  
compact in terms of memory usage.

I did also try applying the RVM patch, but it didn't apply cleanly and  
the fixes seemed non-obvious at first glance.  I'm pretty sure I could  
figure them all out, but it was going to take more time, and so I  
shelved it for the moment.

>
>> I was able to pare it back to about 1k and get all of the test
>> scripts running with reduced numbers of coroutines.  When I try going
>> lower than this further (to, say, 512 or 256 bytes), though, it blows
>> up.  I'm not familiar enough with Lua itself or with coco to know  
>> what
>> to expect in terms of necessary stack space, but I was wondering if
>> anyone had any thoughts on this.
>
> Umm, you're playing a bit of Russian Roulette here. Since your
> device doesn't have an MMU, you'll only notice a stack overflow
> because some unrelated part of memory gets corrupted. Of course
> this can happen with the main C stack, too.
Hmm.. fair enough. I'm pretty sure that's what I've done in more than  
a few cases when experimenting with going below 1k, and that I was  
just getting lucky with it :-)

>
> It's hard to determine the absolute minimum amount of C stack
> space needed by your app. You need to find the worst offenders and
> round up generously:
> - Most of the string.* functions need at least BUFSIZ bytes
>  (usually 4K or 8K).
> - The Lua parser may need quite a bit of stack space depending on
>  the maximum syntactic nesting level.
> - Some functions from the C library can be surprisingly expensive
>  in terms of C stack space usage (e.g. gethostbyname).
Those are helpful pointers; I suppose I'll have to think about this a  
little and maybe do some checking on the newlib functions we're  
using.  For one thing, I do know that we are using a rather small  
BUFSIZ of 256 for newlib.

>
> I don't think you can go significantly below 16K without having
> complete control over _all_ of the code ever run by your app.
>
> Dedicating a quarter of your RAM for a single coroutine doesn't
> sound like a good deal. That's why I said above that Coco is
> probably not the best solution for a low memory environment.

Good point.  What about situations where you have control over what  
the coroutine consists of?  One model we're thinking of is using this  
to deal with hardware events (button presses, changes in ADC signal  
level, etc.).  We could, I suppose, make the Coco coroutines a special  
class that could be used but not modified, and could provide certain  
facilities like the event handling.  In those cases I presume we could  
know by static analysis how much stack those coroutines would use?

In terms of figuring out how much stack space should be allocated: if  
the coroutine is written in Lua, is stack usage comparable to the  
growth one might see for the main Lua stack if the same code were just  
implemented as a function?  I'm guessing that if I were using print  
functions in Lua and I created a coroutine that also printed, the  
coroutine's stack would have to contain a duplicate buffer.

>
>> When the coroutines are created in the
>> example scripts, what exactly is needing to be preserved on the  
>> stack?
>> Can this be pared back at the cost of performance?
>
> Everything that's live on a C stack needs to be preserved.
>
>> Also, could one
>> safely allocate stack space only as needed by actual stack size as
>> opposed to pre-allocation?
>
> Nope, C stacks cannot be relocated easily. The C compiler is free
> to store pointers (i.e. absolute addresses) to other parts of the
> same stack in there. Alternative layouts (e.g. linked stacks) are
> rarely supported by C compilers.
OK, that makes sense.

Thanks for the information.  Your responses have been quite helpful in  
thinking about what to do/use in this context.

Best.

--
James Snyder
Biomedical Engineering
Northwestern University
[hidden email]
http://fanplastic.org/key.txt
ph: (847) 448-0386



Re: Lua+Coco on Tiny Devices

jbsnyder
Administrator
In reply to this post by Jerome Vuarand

On Apr 14, 2009, at 12:07 PM, Jerome Vuarand wrote:

> AFAIK the stack has to be a single block of memory. So to have several
> independently growing C stacks you need an MMU (to fragment the stack
> while keeping it solid in terms of address space), and you will only
> be able to do it on a page-by-page basis. But I guess most of your
> embedded targets don't provide an MMU.


Some of them do, but it's not guaranteed.  Also, others will sometimes  
provide some of the functionality of an MMU, such as memory protection,  
but I think the idea is generally to avoid adding anything to the core  
that would require platform-specific functionality.

--
James Snyder
Biomedical Engineering
Northwestern University
[hidden email]
http://fanplastic.org/key.txt
ph: (847) 448-0386



Re: Lua+Coco on Tiny Devices

BogdanM
In reply to this post by Mike Pall-2

> I'll be happy to send them back regardless of whether it turns out to be suitable for these devices.  There are certainly micros with more RAM.

Not that many. Then again, there are a lot of devboards with external RAM and they are much more suitable for this kind of treatment :)


>> Umm, you're playing a bit of Russian Roulette here. Since your
>> device doesn't have an MMU, you'll only notice a stack overflow
>> because some unrelated part of memory gets corrupted. Of course
>> this can happen with the main C stack, too.
>
> Hmm.. fair enough. I'm pretty sure that's what I've done in more than a few cases when experimenting with going below 1k, and that I was getting lucky with that :-)

Yes, Mike is right about this. Basically the only protection we have is that we declare a stack size and we forbid our break to get past that, but if the stack itself grows below its intended limit we have a problem. And it generally manifests itself as a complete freeze. I think there are ways to detect this problem (at least partially) with dlmalloc by activating some internal consistency checks, but I haven't tested this yet. I also think you can ask the compiler to do some stack checking work for you (at least on ARM, where you have a register that can be used as a "stack limit"), but again, I haven't tested this. I probably should :)



>> It's hard to determine the absolute minimum amount of C stack
>> space needed by your app. You need to find the worst offenders and
>> round up generously:
>> - Most of the string.* functions need at least BUFSIZ bytes
>>   (usually 4K or 8K).
>> - The Lua parser may need quite a bit of stack space depending on
>>   the maximum syntactic nesting level.
>> - Some functions from the C library can be surprisingly expensive
>>   in terms of C stack space usage (e.g. gethostbyname).
>
> Those are helpful pointers, I suppose I'll have to think about this a little and maybe do some checking on the newlib functions we're using.  For one part, I do know that we are using a rather small BUFSIZ of 256 for Newlib.

These are helpful, thank you. This is how we deal with them for now:

- BUFSIZ is currently 128 in eLua (seems awfully small, but it does the job for now).
- some of the larger Lua parser structures that are traditionally allocated on the stack are allocated on the heap in eLua. This fragments the heap a lot, obviously, but thanks to dlmalloc it isn't such a big problem in practice. Still, this doesn't help that much when too much nesting is involved.

Right now we have 2K of stack for architectures with 64K of RAM or less and 8K for the lucky boards with external RAM. Seems to work well in practice, at least for what we've tried on eLua so far (which, fortunately, is small enough).

Best,
Bogdan



Re: Lua+Coco on Tiny Devices

Mike Pall-2
In reply to this post by jbsnyder
James Snyder wrote:
> Good point.  What about situations where you have control over what the
> coroutine consists of.  One model we're thinking of is using this to deal
> with hardware events (button presses, changes in ADC signal level,
> etc..).

One of the lessons learned from GUI frameworks is that controllers
should be reactive and not active.

It turns out that an active controller (using a coroutine or
thread) cannot adequately represent the inherent complexity in UI
handling. It's tempting to use a coroutine/thread for this, but in
general you have many inputs that interfere with linear control
flow in complex ways. You'll end up with a mess of event-driven
(i.e. yield to scheduler) and polled APIs.

Even a simple use case like "record a signal in regular intervals
until a button is pressed" turns out to be tricky. Implemented
with a coroutine this would probably look like this:

local function recorder_coro(adc, button, rate)
  local t = {}
  for value in adc:sample(rate) do     -- Iterator yields to scheduler.
    t[#t+1] = value
    if button:pressed() then break end -- Polled? Queued event?
  end
end

As you can see, the check for the pressed button doesn't integrate
nicely. A button press is really an event -- you might miss some
events if you're just polling the button state (if the sample rate
is very low). Ok, so let's queue button press events from an
interrupt. But then you get in trouble if multiple consumers want
to receive the same event. And that still doesn't handle the case
of an early abort (the loop is never entered until the end of a
sampling interval).

Ok, so let's teach the scheduler to wait on multiple events, like
"next sample available OR button pressed". But then you need to
find out which event triggered -- this thoroughly messes up the
nice and easy API. And simple boolean scheduling decisions just
don't suffice for the more complex cases. And so on ...

But there's a better way:

A reactive controller is purely event-driven. It can be
implemented Erlang-style as a coroutine with a central event
dispatch loop. But since Lua has decent support for closure-based
or OO-style programming, it's more comfortable to decompose the
controller into individual event handlers. Something like:

local function recorder_start(adc, button, rate)
  local t = {}
  adc:onsample(function(value) t[#t+1] = value end)
  button:onpress(function() adc:stop() ... process data ... end)
  adc:start(rate)
end

For a more flexible design, one could borrow the Qt-Style
signals/slots model, e.g.:

local recorder = {}
function recorder:sample(value) self[#self+1] = value end
function recorder:stop() ... process data ... end
function recorder:start(adc, button, rate)
  attach(adc.sample, recorder, "sample")
  attach(button.press, adc, "stop")   -- adc:stop() detaches receivers, too.
  attach(button.press, recorder, "stop")
  adc:start(rate)
end
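
For completeness, attach() and the signal emission behind it could be
as small as this (again just a sketch; none of these names come from
an existing library):

local function attach(signal, receiver, method)
  signal[#signal+1] = { receiver, method }
end

local function emit(signal, ...)
  for _, slot in ipairs(signal) do
    local receiver, method = slot[1], slot[2]
    receiver[method](receiver, ...)   -- i.e. receiver:method(...)
  end
end

-- a driver would create e.g. adc.sample = {} and call
-- emit(adc.sample, value) from its sampling code; detaching is omitted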

All the control logic is now inside the individual controllers,
where it belongs. Writing a scheduler for an event-driven model is
much simpler. A first cut of a trivial scheduler could just poll
all event sources and call the associated handlers. Later it can
be converted into an interrupt-driven design (to save power). No
change to the controllers should be necessary.
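
A sketch of such a first cut (poll() and the handler table are
illustrative assumptions, not an existing API):

local sources = {}   -- event source -> handler

local function on(source, handler)
  sources[source] = handler
end

local function run()
  while true do
    for source, handler in pairs(sources) do
      local ev = source:poll()   -- non-blocking; nil if nothing pending
      if ev ~= nil then handler(ev) end
    end
    -- later: block on a wait-for-interrupt here instead of spinning
  end
end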

I strongly suggest to _first_ collect several typical use cases of
varying complexity. Then write pseudo-code for the controllers
using the different paradigms. Find out which framework best
handles your needs. Only then design the framework and decide on
the implementation details (such as the use of coroutines).

--Mike

Re: Lua+Coco on Tiny Devices

jbsnyder
Administrator

On Apr 15, 2009, at 8:36 AM, Mike Pall wrote:

> One of the lessons learned from GUI frameworks is that controllers
> should be reactive and not active.
>
> It turns out that an active controller (using a coroutine or
> thread) cannot adequately represent the inherent complexity in UI
> handling. It's tempting to use a coroutine/thread for this, but in
> general you have many inputs that interfere with linear control
> flow in complex ways. You'll end up with a mess of event-driven
> (i.e. yield to scheduler) and polled APIs.
>
> I strongly suggest to _first_ collect several typical use cases of
> varying complexity. Then write pseudo-code for the controllers
> using the different paradigms. Find out which framework best
> handles your needs. Only then design the framework and decide on
> the implementation details (such as the use of coroutines).

Thanks for these details as well.  I certainly didn't give a lot of  
detail on what I was thinking of in that original email.  I wasn't  
thinking of using a large number of coroutines to internally handle  
multiple types of behavior (collect ADC samples until a button is  
pressed).  I was thinking more along the lines of what you suggest in  
the latter portion of your message: one attaches or registers  
functions, closures or coroutines to deal with different events, and  
some sort of centralized dispatcher periodically rolls through the  
event sources and calls/notifies the registered handlers at the start  
of each pass, then yields to allow other code to run in the flow of a  
main program.
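
Roughly this kind of thing, in other words (all names made up for
illustration; nothing here is an existing eLua API):

local handlers = {}   -- event source -> handler, registered elsewhere

local dispatcher = coroutine.wrap(function()
  while true do
    for source, handler in pairs(handlers) do
      local ev = source:poll()
      if ev ~= nil then handler(ev) end
    end
    coroutine.yield()   -- hand control back to the main program
  end
end)

-- main program loop:
-- while true do
--   dispatcher()      -- one pass over the event sources
--   -- ... the rest of the application's work ...
-- end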

Point taken on the design approach.  The design should be hashed out  
and mocked up before getting into these lower-level details.  At first  
glance I like the Qt signals/slots approach, though I'll have to see  
whether it would really work here.

--
James Snyder
Biomedical Engineering
Northwestern University
[hidden email]
http://fanplastic.org/key.txt
ph: (847) 448-0386

