Lua as a data description language

Lua as a data description language

Dimitris Papavasileiou
Most games or game-like applications (i.e. applications that try to
model a world) that I've seen (well, all of them actually) seem to use
Lua only as a configuration/scripting language. It's great for this purpose,
no doubt about it, but after using Lua for a while in developing a game
engine, I thought: "why do I keep writing code to parse files, be they
binary or ASCII, my own format or standard formats? Lua can do it all,
and much better". A simple example:
Say you want to load a 3D model. Loading 3ds files or the like, as
many people do, is in my opinion completely lame, as those formats were not
designed for the purpose and introduce a lot of bloat, instability and
bugs. Writing your own simple format with just what you need is much
better, but still, consider this: write a function called, say, model that
accepts a table with model data and returns some sort of model
object (like a userdata). Then in a file put this:
return model {
   vertices = {1.0, 2.0, 3.0, 4.0, 5.0, ...};
   indices = {1, 2, 3, 4, 5, ...};
};

then from another script:
m = dofile "scripts/models/foo.model"
or
surface.model = dofile "scripts/models/foomodel.lua"
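
The idea above can be sketched in plain Lua. This is only an illustration under stated assumptions: the validation rules, the three-floats-per-vertex layout and the returned table (standing in for the engine's userdata) are hypothetical, and the data file is inlined as a string instead of being read with dofile so that the sketch is self-contained. It assumes Lua 5.1 or later.

```lua
-- Hypothetical 'model' constructor: checks the table coming from a data
-- file and returns a plain table standing in for the engine's userdata.
function model(t)
  assert(type(t) == "table", "model expects a table")
  assert(type(t.vertices) == "table", "missing 'vertices'")
  assert(type(t.indices) == "table", "missing 'indices'")
  -- assumption: three floats per vertex
  local nverts = #t.vertices / 3
  for i, idx in ipairs(t.indices) do
    assert(idx >= 1 and idx <= nverts, "index " .. i .. " out of range")
  end
  return { vertices = t.vertices, indices = t.indices, nverts = nverts }
end

-- Contents of a data file, inlined here instead of living on disk.
local source = [[
return model {
   vertices = {0.0, 0.0, 0.0,  1.0, 0.0, 0.0,  0.0, 1.0, 0.0};
   indices = {1, 2, 3};
};
]]

local loadchunk = loadstring or load   -- Lua 5.1 vs. 5.2 and later
m = assert(loadchunk(source))()
print(m.nverts, #m.indices)
```

Because the data file is ordinary Lua, a malformed file fails either at load time (syntax error) or inside model (failed check), with no hand-written parser involved.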

Again I thought to myself: "yay! Never will I need to write another line
of parsing code again. Long live Lua!". And I was right. This approach
gives you:
1) One consistent, clean, and aesthetic (IMHO) language for all your
data files.
2) "All your data files" can include images/maps, models, sounds, whatever.
3) No more parser writing (just converter writing :).
4) ASCII and/or binary files. Just run
$ luac scripts/models/foomodel.lua
to get a binary file (luac writes it to luac.out unless you pass -o) that
will work just as well as the ASCII script.
5) Probably others which I forget to mention since I now take them for
granted.
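
Point 4 can also be tried from inside Lua. A minimal sketch, assuming Lua 5.1 or later: string.dump produces the same kind of precompiled chunk that luac writes, and the loader accepts either form. The data here is made up for illustration.

```lua
local loadchunk = loadstring or load   -- Lua 5.1 vs. 5.2 and later

-- A made-up data file, as source text.
source = "return { vertices = {1.0, 2.0, 3.0}, indices = {1, 2, 3} }"

-- Precompile it; this is what luac would write to disk.
binary = string.dump(assert(loadchunk(source)))

-- Both forms load through the same call and yield the same data.
from_text = assert(loadchunk(source))()
from_binary = assert(loadchunk(binary))()
print(#source, #binary, from_binary.indices[3])
```

Note that precompiled chunks are tied to the Lua version and build that produced them, so they are not a portable interchange format.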

Well, that's enough thought-sharing for one night, let's get to the
bitching part. I have come across one problem with this approach (as
opposed to binary file formats). The fact that Lua knows about just one
kind of number (float, double or whatever you set lua_Number to) can and
will lead to space overhead. Consider this: say you compile with
lua_Number set to float and you have the above model definition. The
vertices, which include spatial and texture coordinates and normal vectors
that all need to be floats (in most cases), are no problem, but for indices
you can usually get away with just shorts. This introduces a 2 byte/index
overhead. Again, this is not that big a deal since indices are relatively
few, but if you store the 4 RGBA values of each pixel of an image map as
floats, when each value needs to be just a char (a byte-sized integer, that
is), then you get image files 4 times larger. This is more serious
although it should be noted that the only problem is with file size and
hence resource loading time (once you get the data from lua you can cast
it to whatever type you want so it won't matter how lua stored it).
One way to "solve" the problem is to use strings instead of tables.
These strings will contain binary C arrays of floats, shorts or bytes
instead of Lua tables. Then the above example would look like this:

return model {
   vertices = "\123\034\167\123 ...";
   indices = "\132\056\032 ...";
}

This actually simplifies the needed Lua code, but I don't consider this a
solution, as the whole idea is that you can use a high-level,
machine-independent data description language, whereas this is just like
reading a binary file but using Lua to do the dirty parsing work. It
is still much better than parsing a binary file yourself though, and just as
efficient.
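
A converter for the string representation might look like the following sketch. The function names and the choice of two bytes per index with the least significant byte first are assumptions made for illustration, not something specified in the post. It uses only string.char and string.byte, so it should run on Lua 5.1 or later.

```lua
-- Pack a list of integers in 0..65535 into a string, two bytes each,
-- least significant byte first (the byte order is an assumption).
function pack_u16(list)
  local parts = {}
  for i, v in ipairs(list) do
    parts[i] = string.char(v % 256, math.floor(v / 256))
  end
  return table.concat(parts)
end

-- Inverse of pack_u16: read the string back two bytes at a time.
function unpack_u16(s)
  local list = {}
  for i = 1, #s, 2 do
    local lo, hi = string.byte(s, i, i + 1)
    list[#list + 1] = lo + hi * 256
  end
  return list
end

-- Render a binary string as a Lua literal body made of decimal escapes,
-- in the "\123\034..." style used in the data file above.
function to_literal(s)
  return (s:gsub(".", function(c)
    return string.format("\\%03d", c:byte())
  end))
end

local packed = pack_u16({1, 2, 300, 65535})
print(#packed, to_literal(packed))
```

The machine dependence the post complains about shows up here as the hard-coded byte order: both the converter and the reader have to agree on it.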

Now I was wondering: one of Lua's purposes is data description, and it
is logical to assume that some datasets can be large, needing double
precision here but only 8-bit integer precision there. So it would be
useful for Lua (and for me) to somehow be able to avoid wasting
space by storing everything as doubles. I haven't found a solution, but I
do know that if a good one exists it:
1) will have to be implemented inside the Lua core, as the problem is the
bytecode that the compiler creates: e.g. t = {1,2,3,4} will compile to
4*sizeof(double) bytes for the data and there's nothing you can do about
it using Lua code.
2) should not mess with Lua's current syntax. That is, solutions like
float = 4.0f or int = 2i or intv = {1, 2, 3, 4}i might work, but they
would introduce complexity into Lua that would in my opinion definitely
not be worth it.
I doubt that anything can be done about this because of the way Lua
handles numbers and number vectors, but maybe I'm wrong. Any ideas on any
of the above subjects are welcome.

Dimitris P.



Re: Lua as a data description language

Mike Pall-43
Hi,

Dimitris Papavasiliou wrote:
> although it should be noted that the only problem is with file size and
> hence resource loading time (once you get the data from lua you can cast
> it to whatever type you want so it won't matter how lua stored it).

If you don't care much about loading time then this might work:
- Compress your (compiled) data files with gzip or bzip2.
- Write a chunkreader that autodetects a compressed file and uncompresses
  on-the-fly. The zlib/libbz2 manuals should have an example that does
  exactly that.
- Hook the chunkreader into your own dofile().

I think this will get you excellent compression ratios for the kind of data
you have. And for *all* of it, not just the data you cared enough to
hand-code a compact representation for.
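
The autodetection step can be sketched as follows. The magic numbers are the standard ones (gzip streams start with the bytes 0x1f 0x8b, bzip2 streams with "BZh", precompiled Lua chunks with ESC followed by "Lua"). Everything else is an assumption: the function names are hypothetical, the decompressors are supplied by the caller (for example thin wrappers around zlib or libbz2 bindings), and the whole input is handled as one string rather than streamed through a chunkreader.

```lua
-- Guess the format of a data file from its first bytes.
function detect_format(header)
  if header:sub(1, 2) == "\031\139" then return "gzip" end
  if header:sub(1, 3) == "BZh" then return "bzip2" end
  if header:sub(1, 4) == "\027Lua" then return "luac" end
  return "lua"
end

-- Load a data file given its raw bytes. 'decompressors' is a hypothetical
-- table mapping a format name to a function(bytes) -> decompressed bytes.
function load_data(bytes, decompressors)
  local decode = decompressors and decompressors[detect_format(bytes)]
  if decode then bytes = decode(bytes) end
  local loadchunk = loadstring or load   -- Lua 5.1 vs. 5.2 and later
  return assert(loadchunk(bytes))()
end

print(detect_format("\031\139\008"), detect_format("return {}"))
```

A streaming version would perform the same check on the first block it reads and then feed decompressed blocks to lua_load from C.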

If this does not get you the desired effect, then create a converter script
that reads your 'plain' Lua data files and generates a Lua data file with
a compact representation (using strings and maybe delta-encoding). Then
compile and compress this file. Allow for loading both unconverted and
converted scripts so you don't need to run the converter script all the time
during development.
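
As an illustration of the delta-encoding part of such a converter, here is a minimal sketch. The function names are hypothetical. It stores each value as its difference from the previous one, which tends to produce many small, repeated numbers for slowly varying data such as sorted or nearly sequential indices.

```lua
-- Replace each value by its difference from the previous value
-- (the first value is kept as a difference from 0).
function delta_encode(list)
  local out, prev = {}, 0
  for i, v in ipairs(list) do
    out[i] = v - prev
    prev = v
  end
  return out
end

-- Inverse: rebuild the original values as a running sum.
function delta_decode(deltas)
  local out, acc = {}, 0
  for i, d in ipairs(deltas) do
    acc = acc + d
    out[i] = acc
  end
  return out
end

print(table.concat(delta_encode({10, 11, 12, 15, 14}), " "))
```

Whether this actually helps depends on the data; for integer data the round trip is exact, while for floating-point data the running sum may not reproduce the input bit for bit.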

Ok, so I do see the advantages of this approach for 3d models and other
heavily structured data. But I think you can save yourself some trouble
by using dedicated libraries for e.g. images and sounds (libpng and
Ogg Vorbis come to mind). I guess you won't use anything other than a flat
RGBA or two-channel-16-bit representation anyway. And these libraries give
you exactly that, so why roll your own?

Bye,
     Mike


Re: Lua as a data description language

Asko Kauppi-3
In reply to this post by Dimitris Papavasileiou

Couldn't a new bytecode be introduced (by the Lua authors) that would take less space than the current ones? This solution would be totally transparent to Lua end users, imho.

-ak



Re: Lua as a data description language

Juergen Fuhrmann
In reply to this post by Dimitris Papavasileiou
Hi,

the bitching part of Lua is even worse...

Imagine you want to read in one huge array of doubles which you will
need in a C array. The best you can do is the following.

Define
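function load_array(lua_array)
  -- table.getn is the Lua 5.0 length function; Lua 5.1 and later use #lua_array
  local n=table.getn(lua_array)
  local c_array=c_array_create(n)
  -- copy element by element into the C array (the loop declares its own i)
  for i=1,n do c_array_set(c_array,i,lua_array[i]) end
  return c_array
end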

This uses C code bound to Lua (e.g. using tolua).

c_array * c_array_create(int n);
void c_array_set(c_array *a, int i, double val);

Now you can write in the data file

a=load_array{1,2,3,5,12.33}
... lua code using a

But what happens when this file is read by Lua ?
1) The whole file is loaded into memory (?)
2) The file is translated to byte code doubling all the data
3) The bytecode is executed. Only then data gets where it is needed.

So all the data enters memory three (?, at least two) times while it is
needed only once. We are talking about 10^6 ... 10^7 values here.


What about binary data in strings? IMHO the proposed ASCII representation
is much too long for this case. One could go with base64, though. But this
has IMHO considerable decoding overhead.

My workaround so far is a mechanism which subdivides input files
into chunks separated by $ characters.
So the example above would be

a=c_array_create(5)
Data{a}
$
1
2
3
5
12.33
$
... lua code using a
[EOF]

When executed, the first chunk is loaded, byte compiled, executed.
The Data statement internally tells how to parse the next chunk, and
where to put the data. Then, the middle chunk is parsed by another
parser (written by hand...) directly transferring the data into the C
array. The last chunk is again handled by Lua. Lua5 perfectly
supports this chunk handling. For Lua4 I published the lua_dolines
patch on the wiki.
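
The control flow of this mechanism can be imitated in pure Lua. The sketch below is not the implementation described in the post: it reads the whole text into memory (which the real mechanism avoids), fills a Lua table where the real code fills a C array, and all function names apart from Data are hypothetical. It assumes Lua 5.1 or later.

```lua
-- Split text into chunks at lines that contain only "$".
function split_chunks(text)
  local chunks, current = {}, {}
  for line in (text .. "\n"):gmatch("(.-)\n") do
    if line == "$" then
      chunks[#chunks + 1] = table.concat(current, "\n")
      current = {}
    else
      current[#current + 1] = line
    end
  end
  chunks[#chunks + 1] = table.concat(current, "\n")
  return chunks
end

local pending   -- target set by Data{...}, consumed by the next chunk

-- Called from a Lua chunk: the following chunk is raw data for spec[1].
function Data(spec) pending = spec[1] end

function run_chunked(text)
  local loadchunk = loadstring or load   -- Lua 5.1 vs. 5.2 and later
  for _, chunk in ipairs(split_chunks(text)) do
    if pending then
      -- data chunk: one number per line, no Lua compilation involved
      local target = pending
      pending = nil
      for line in chunk:gmatch("[^\n]+") do
        target[#target + 1] = assert(tonumber(line), "bad number: " .. line)
      end
    else
      assert(loadchunk(chunk))()   -- ordinary Lua chunk
    end
  end
end

run_chunked([[
a = {}
Data{a}
$
1
2
3
5
12.33
$
total = 0
for _, v in ipairs(a) do total = total + v end
]])
print(#a, total)
```

Since each Lua chunk is compiled separately, variables shared between chunks (a and total here) have to be globals.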


To handle binary data, I do the following

a=c_array_create(5)
Data{a, encoding="native"}
$
/=)EPEJDPDJP°D!"
$
... lua code using a
[EOF]

where /=)EPEJDPDJP°D!" is _pure_ (not base64, but xdr) encoded binary
data. It is read in directly by fread() without any overhead.


If you want portable binary files,  you can use xdr encoding instead of
native.  One could imagine base64 as well.

You also can write

Data{a, encoding="native", file="f", pos=12334}

Then data is taken from another file by the very same mechanism.

In reality, Data sets a linehandler or a binhandler used to handle
the next chunk, which can be written in C or Lua.

[[ ]] strings instead of these chunks would be stored in memory, thus
doubling the needed data space.

Please note that I researched XML for these topics, it gives no better
solution because you are left  alone with pure ascii data chunks. Pure
binary (not base64) is even impossible.

Matlab and co. IMHO have slow parsers. Some communities speak CDF and
HDF, which IMHO are incredibly bloated and opaque. I don't know
about perl, python and ruby as I _love_ Lua.

While I see my approach more as a workaround than as a solution, I
really think that Lua could gain from being able to handle huge data
without bloat. My code is part of a larger system. Time permitting, I
could try to cut out the basics and make them available.
 
Juergen



Juergen Fuhrmann
 __  __  __  __                  Numerical Mathematics & Scientific Computing
|W  |I  |A  |S     Weierstrass Institute for Applied Analysis and Stochastics
Mohrenstr. 39 10117 Berlin      fon:+49 30 20372560        fax:+49 30 2044975
http://www.wias-berlin.de/~fuhrmann            [hidden email]


Re: Lua as a data description language

Roberto Ierusalimschy
In reply to this post by Asko Kauppi-3
> Couldn't a new bytecode be introduced (by Lua authors) that would take 
> less space than current ones.  This solution would be totally 
> transparent to Lua end users, imho.

I'm not sure that would make much difference. Numbers are reused by the
compiler, and no program can contain many distinct byte values. So, if a
data file uses only small (byte) values, its source will have at most 256
distinct constants.
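
The counting argument can be checked with a small sketch. The pixel data is synthetic and the helper name is hypothetical. The sketch only counts distinct values, which is the quantity the argument is about, and does not measure the size of any compiled chunk.

```lua
-- Count how many different values occur in a list.
function count_distinct(list)
  local seen, n = {}, 0
  for _, v in ipairs(list) do
    if not seen[v] then
      seen[v] = true
      n = n + 1
    end
  end
  return n
end

-- Synthetic "image": 100000 byte-sized values.
pixels = {}
for i = 1, 100000 do pixels[i] = (i * 37) % 256 end

print(#pixels, count_distinct(pixels))
```

However many values the data file lists, the number of different ones is bounded by the range of a byte.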

-- Roberto


RE: Re: Lua as a data description language

André de Leiradella
In reply to this post by Dimitris Papavasileiou
>> although it should be noted that the only problem is with file size and
>> hence resource loading time (once you get the data from lua you can cast
>> it to whatever type you want so it won't matter how lua stored it).
>
>If you don't care much about loading time then this might work:
>- Compress your (compiled) data files with gzip or bzip2.
>- Write a chunkreader that autodetects a compressed file and uncompresses
>  on-the-fly. The zlib/libbz2 manuals should have an example that does
>  exactly that.
>- Hook the chunkreader into your own dofile().

I wrote a library that can read Lua code from different sources (file,
memory), passing it through uncompressing filters (bunzip2, gunzip).

The library is poorly written, but maybe you can find something useful in it,
like the code necessary to uncompress files compressed with bzip2 and gzip.
It's available at http://www.geocities.com/andre_leiradella/#luareader

Regards,

Andre de Leiradella
http://www.geocities.com/andre_leiradella/



LAR, LuaZip, LuaTar (was: Lua as a data description language)

Danilo Tuler-2
Hi,

> >- Compress your (compiled) data files with gzip or bzip2.
> >- Write a chunkreader that autodetects a compressed file and uncompresses
> >  on-the-fly. The zlib/libbz2 manuals should have an example that does
> >  exactly that.
> >- Hook the chunkreader into your own dofile().
> 
> I wrote a library that can read Lua code from different 
> sources (file, memory), passing it through uncompressing 
> filters (bunzip2, gunzip).


I don't like to announce something without releasing a public version, but
as the subject is being discussed, this might be interesting.

We, the Kepler Team, are developing three libraries, LAR, LuaZip and LuaTar.
A preview version will be available next week.

-----------------------------------------------------------------------
LAR (Lua ARchives) is a virtual file system using ZIP or tar.gz compression.
The idea is similar to a jar (Java Archive), war (Web archive), etc.
The library basically improves the require function to virtualize files in
an archive.
You can zip (or tar.gz) a bunch of lua files in a .lar file and require them
as usual, transparently.
This library uses LuaZip and LuaTar to read the contents from the archives.

-----------------------------------------------------------------------
LuaZip is a library to read files inside zip files.
The usage is very intuitive. Example below:

local zf, err = zip.open('test.zip')
for f in zf:files() do
	print(f.filename)
end
local f = zf:open('foo.txt')
print(f:read('*a'))
f:close()
zf:close()

-----------------------------------------------------------------------
LuaTar is a library to read files inside tar files.
The usage is as intuitive as LuaZip. Example below.
Notice that LuaTar uses lzlib by Thiago Dionisio.

local gf = gzip.open("lzlib.tar.gz", "rb")
local tf = tar.open(gf)
local f = tf:open("lzlib/README")
print(f:read("*a"))
f:close()
tf:close()
gf:close()

-----------------------------------------------------------------------
Comments are welcome.

-- Danilo



Re: Lua as a data description language

Dimitris Papavasileiou
In reply to this post by Mike Pall-43
Mike Pall wrote:

> If you don't care much about loading time then this might work:
> - Compress your (compiled) data files with gzip or bzip2.
> - Write a chunkreader that autodetects a compressed file and uncompresses
>   on-the-fly. The zlib/libbz2 manuals should have an example that does
>   exactly that.
> - Hook the chunkreader into your own dofile().

I don't consider file size itself a big problem; it's the fact that bigger files take longer to load from disk that bugs me. So this solution could actually help reduce loading time, since, say, halving the resource file size through compression should save more loading time than the decompression overhead it would introduce. It all depends on how well the data compresses, of course. I think that since most of, say, an image file would be IEEE floats in the range 0-255 (which means that part of the mantissa and the whole exponent of all floats should be the same) the file should have low enough 'entropy' to compress well.

I tried compressing two model Lua files, one using strings (essentially zero overhead, like a normal binary file) and one using tables of Lua numbers (doubles). Uncompressed, the files are 16.6k and 58.12k respectively. gzip compression reduces the sizes to 9.4k and 19.2k, while bzip2 compression results in 10.2k and 18.8k respectively. I consider this rather encouraging and I'm definitely going to do it, since it can be integrated so cleanly into dofile. Still, saving space in the Lua bytecode itself could lead to even smaller files...
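
The figures quoted above can be turned into compression ratios with a few lines of Lua. The sizes are the ones from the post (in k); the table layout and function name are just for illustration.

```lua
-- File sizes from the post, in k.
sizes = {
  strings = { raw = 16.6,  gzip = 9.4,  bzip2 = 10.2 },
  tables  = { raw = 58.12, gzip = 19.2, bzip2 = 18.8 },
}

-- Compressed size as a fraction of the uncompressed size.
function ratio(kind, method)
  return sizes[kind][method] / sizes[kind].raw
end

for _, kind in ipairs({ "strings", "tables" }) do
  for _, method in ipairs({ "gzip", "bzip2" }) do
    print(kind, method, string.format("%.1f%%", 100 * ratio(kind, method)))
  end
end
```

By these numbers the table-based file shrinks to roughly a third of its size, yet after gzip it is still about twice as large as the compressed string-based file.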

Dimitris


Re: Lua as a data description language

Dimitris Papavasileiou
In reply to this post by Asko Kauppi-3
Asko Kauppi wrote:


> Couldn't a new bytecode be introduced (by Lua authors) that would take less space than current ones. This solution would be totally transparent to Lua end users, imho.
>
> -ak

I'm not sure I understand what Roberto says in his post, but the bytecode itself (that is, the opcodes) is not the problem. The problem lies in the fact that Lua understands only the word 'number', not 'float', 'integer', 'short' etc. I generally consider this a wise decision, but it does have drawbacks, as in this case. To save space, Lua would have to decide per number whether it can be encoded in fewer bits (say, check whether it is an integer or a real number and how much precision is required) and store it accordingly, together with some sort of tag saying how the number should be interpreted. Not very pretty, I know, although it could be implemented completely transparently to the user.

If the only objective is to save disk space, then this 'number compression' could be used only in bytecode stored in files. Once the file is read into memory the numbers can be decoded and cast to lua_Number and the user will never know the difference. The memory image of the script will still be big but the disk file will be smaller. Storing the numbers encoded in memory as well would reduce the memory overhead but would introduce a lot of decoding and casting overhead on anything number-related. Therefore this solution, besides introducing quite a bit of complexity, would probably be completely unacceptable from a performance point of view. Even worse hacks could be done by trying to optimize only 'big' arrays (tables) of numbers instead of single numbers, but they would probably lead nowhere as well.

Still, Lua would benefit from a solution to this problem. It would be really cool if most apps stopped using special-purpose binary or ASCII data files, each with its own syntax and therefore needing its own parser, and used standard scripting language files (preferably Lua :)) to do the job.
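
The per-number decision described above could look like the following sketch. It is purely illustrative: the tag names, the set of encodings, the one-byte tag and the function names are all assumptions, and only integer-valued numbers are considered for the compact forms.

```lua
-- Pick the smallest encoding that represents x exactly.
function smallest_encoding(x)
  if x ~= math.floor(x) then return "double" end       -- not an integer
  if x >= 0 and x <= 255 then return "u8" end
  if x >= -32768 and x <= 32767 then return "i16" end
  if x >= -2147483648 and x <= 2147483647 then return "i32" end
  return "double"
end

-- Payload size in bytes for each (hypothetical) tag.
local payload = { u8 = 1, i16 = 2, i32 = 4, double = 8 }

-- Size of a list when every number is stored as a 1-byte tag plus payload.
function encoded_size(list)
  local total = 0
  for _, x in ipairs(list) do
    total = total + 1 + payload[smallest_encoding(x)]
  end
  return total
end

local t = { 1, 2, 3, 4 }
print(encoded_size(t), #t * 8)   -- tagged size vs. plain doubles
```

The tag byte is the price of the scheme: a list made up entirely of non-integer doubles would grow from 8 to 9 bytes per number, so such an encoding only pays off for data dominated by small integers.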

Dimitris