Testing LUA: verify the correctness of a UTF16 LUA port

19 messages

Testing LUA: verify the correctness of a UTF16 LUA port

uri cohen
I was able to create a UTF-16 port of Lua for Windows by following the guidelines I found at http://lua-users.org/lists/lua-l/2002-12/msg00021.html and the considerations from http://lua-users.org/lists/lua-l/2002-12/msg00025.html. This is a version of Lua in which all internal strings are UTF-16 (wide characters) rather than chars, not a version with two string types like LuaState.

My question is: how can I verify that my port works? Beyond the toy scripts I created, I'm looking for a comprehensive set of tests I can run to verify that no important language features were broken...

Does such a set exist? How is the correctness of changes in the Lua 5.1.x releases verified?

I saw there is an official test suite at http://www.inf.puc-rio.br/~roberto/lua/lua5.1-tests.tar.gz, but I'm not sure whether (a) it works on Windows and (b) it is supposed to work without considerable changes for my port. I'd appreciate feedback from someone who is familiar with it...

Thanks!
           Uri Cohen

Re: Testing LUA: verify the correctness of a UTF16 LUA port

David Dunham
On 14 Oct 2009, at 15:57, uri cohen wrote:

> This is a version of
> LUA where all internal strings are UTF16 (wide characters) rather than
> chars, not a version with two types of strings like LuaState.


I assume you're aware that not all Unicode characters fit in 16 bits  
-- I never know if Windows wide character support realizes this.

David Dunham       Development Manager
+1 206 926 5722      GameHouse Studios


Re: Testing LUA: verify the correctness of a UTF16 LUA port

David Given
In reply to this post by uri cohen
uri cohen wrote:
[...]
> My question is on how can I verify my port works? Other than toy scripts
> I created, I'm looking for a comprehensive set of tests I can run in
> order to verify all important language feature were not broken...

Unicode is harder than it looks... one reason Lua doesn't really use it
is that once you start dealing with Unicode you start finding places
where you get conflicting requirements.

For example, é can be represented either as U+0065 U+0301 or as U+00E9. Do
these compare equal? They are technically the same thing.

What about sorting order? Does Ё (U+0401) sort before, after, or equal
to Ë (U+00CB)? For that matter, what about ഐ (U+0D10) and ᚔ (U+1694)?
What about E (U+0045), Ε (U+0395), Е (U+0415), ⋿ (U+22FF), ⴹ (U+2D39),
E (U+FF25), 𝐄 (U+1D404), 𝐸 (U+1D438), 𝑬 (U+1D46C), 𝔼 (U+1D53C), 𝖤
(U+1D5A4), 𝗘 (U+1D5D8), 𝘌 (U+1D60C), 𝙀 (U+1D640), 𝙴 (U+1D674), 𝚬
(U+1D6AC), 𝛦 (U+1D6E6), 𝜠 (U+1D720), or 𝝚 (U+1D75A)?

Do you mean UTF-16 or UCS-2? UCS-2 can't handle some of the really
freaky Unicode characters like 𝌆 (U+1D306) or 🀎 (U+1F00E) --- I don't
even have the font to display that last one!

And, most importantly of all, can you still use Lua strings to represent
arbitrary binary data, or is the data forced into well-formed UTF-16?

One reason people tend to use UTF-8 in Lua is not that it solves all
these problems, but that it cleanly divides the problems into soluble
ones and non-soluble ones! And it turns out that most people don't care
about the non-soluble ones. Unfortunately, once you start trying to
*natively* support Unicode, you suddenly find yourself having to care
about these things...

(adjusts signature)

--
┌─── dg@cowlark.com ───── http://www.cowlark.com ─────

│ ⍎'⎕',∊N⍴⊂S←'←⎕←(3=T)⋎M⋏2=T←⊃+/(V⌽"⊂M),(V⊝"M),(V,⌽V)⌽"(V,V←1⎺1)⊝"⊂M)'
│ --- Conway's Game Of Life, in one line of APL

Re: Testing LUA: verify the correctness of a UTF16 LUA port

Joshua Jensen
In reply to this post by David Dunham
----- Original Message -----
From: David Dunham
Date: 10/14/2009 6:58 PM
> On 14 Oct 2009, at 15:57, uri cohen wrote:
>> This is a version of
>> LUA where all internal strings are UTF16 (wide characters) rather than
>> chars, not a version with two types of strings like LuaState.
> I assume you're aware that not all Unicode characters fit in 16 bits
> -- I never know if Windows wide character support realizes this.
I used the LuaPlus wide character string support to ship games to the
Japanese market.  For a computer/console game, there is almost always a
known quantity of glyphs for a given language when localizing its string
tables.  Wide character string support covers this nicely.

LuaPlus achieves this via a C-like string representation:

HelloWorld = L"Hello world!"

Wide string hex representations of unprintable characters are as follows:

RandomCharacters = L"\x80fe\xabcd\x1234"

Of course, core Lua can support the string above, but what it can't do
is proper endian conversion of the individual characters. Since the
string is known to be wide, LuaPlus endian-converts each character as
necessary.

LuaPlus supports one other useful wide string feature.  It knows how to
read a UCS-2 encoded Lua source file.  Instead of using the escape
sequences in RandomCharacters above, the actual glyphs can be inserted
into the file.

I apologize for hijacking the thread, but I really do feel it is useful
to have wide character string support for my work within the games
industry.  For applications, wide characters may be worthless.

Josh


Re: Testing LUA: verify the correctness of a UTF16 LUA port

David Given
Joshua Jensen wrote:
[...]
> LuaPlus achieves this via a C-like string representation:
>
> HelloWorld = L"Hello world!"

Yes, that sounds like a good solution --- that lets you distinguish
between bag-of-bytes strings and Unicodish strings, which means you
don't have to worry about conflicting requirements as much.

What does LuaPlus do for things like string comparison and surrogates?

[...]
> I apologize for hijacking the thread, but I really do feel it is useful
> to have wide character string support for my work within the games
> industry.  For applications, wide characters may be worthless.

Do any of them use UTF-8?

I work in mobile games; our company makes a portable native gaming
solution that allows you to install C-based games on any device,
regardless of architecture. The API's based on OpenKODE, which uses
UTF-8 in the few places where it uses strings. As I tend to do the
bottom-end porting to weird and freaky embedded operating systems, I've
got tiresomely familiar with having to translate UTF-8 to whatever
encoding the host OS uses. There are a surprising number that use some
form of half-assed UCS-2, and I've never figured out why --- it just
makes life complex. I suspect that it's simple tradition. Most of them
come from Asia, and Asia seems to have a culture of using UCS-2 or UTF-16...

--
┌─── dg@cowlark.com ───── http://www.cowlark.com ─────

│ ⍎'⎕',∊N⍴⊂S←'←⎕←(3=T)⋎M⋏2=T←⊃+/(V⌽"⊂M),(V⊝"M),(V,⌽V)⌽"(V,V←1⎺1)⊝"⊂M)'
│ --- Conway's Game Of Life, in one line of APL

Re: Testing LUA: verify the correctness of a UTF16 LUA port

Joshua Jensen
----- Original Message -----
From: David Given
Date: 10/14/2009 8:31 PM
> Joshua Jensen wrote:
> [...]
>> LuaPlus achieves this via a C-like string representation:
>>
>> HelloWorld = L"Hello world!"
> What does LuaPlus do for things like string comparison and surrogates?
It does the equivalent of wcscmp(), only it doesn't rely on the C
runtime to achieve this.  That's because on some non-Visual C++
compilers, sizeof(wchar_t) != 2.  sizeof(lua_WChar) is always 2.

My understanding is that UCS-2 doesn't support surrogates.  I don't
think Microsoft's C runtime wide character library supports them
either.  I could be wrong.

> Do any of them use UTF-8?
>
> I work in mobile games; our company makes a portable native gaming
> solution that allows you to install C-based games on any device,
> regardless of architecture. The API's based on OpenKODE, which uses
> UTF-8 in the few places where it uses strings. As I tend to do the
> bottom-end porting to weird and freaky embedded operating systems,
> I've got tiresomely familiar with having to translate UTF-8 to
> whatever encoding the host OS uses. There are a surprising number that
> use some form of half-assed UCS-2, and I've never figured out why ---
> it just makes life complex. I suspect that it's simple tradition. Most
> of them come from Asia, and Asia seems to have a culture of using
> UCS-2 or UTF-16...
I would consider ditching the LuaPlus wide character support if there
was a small library that supported UTF-8 and allowed easy embedding of
UTF-8 string types in Lua source files.

Have you looked at slnunicode?  That seems to be the smallest one I can
find, but documentation is scarce, so I don't know if it achieves all of
the goals.

Josh


Re: Testing LUA: verify the correctness of a UTF16 LUA port

Jerome Vuarand
In reply to this post by uri cohen
2009/10/15 uri cohen <[hidden email]>:

> I was able to create a UTF16 port of LUA for windows, by following the
> guidelines I found at http://lua-users.org/lists/lua-l/2002-12/msg00021.html
> and the considerations from
> http://lua-users.org/lists/lua-l/2002-12/msg00025.html. This is a version of
> LUA where all internal strings are UTF16 (wide characters) rather than
> chars, not a version with two types of strings like LuaState.
>
> My question is on how can I verify my port works? Other than toy scripts I
> created, I'm looking for a comprehensive set of tests I can run in order to
> verify all important language feature were not broken...

I think you should make sure your "port" doesn't break libraries that
use strings for non-text data. For example, you can try to use the io
library on binary files (using one of the (de)serialization libraries
available), or LuaSocket on binary protocols.

By the way, what's the goal of such a modification of Lua? If all you
need is to be able to use Unicode filenames on Windows, a simpler
approach is to assume filenames passed to Lua are in UTF-8 and to
convert them only when necessary, when accessing the filesystem (in
just a very few places in the Lua source code). I have a patch doing
that if anyone is interested.

Re: Testing LUA: verify the correctness of a UTF16 LUA port

Andre Leiradella
I think ICU4Lua adds some support for utf-8 strings to Lua?

Jerome Vuarand wrote:

> 2009/10/15 uri cohen <[hidden email]>:
>> I was able to create a UTF16 port of LUA for windows, by following the
>> guidelines I found at http://lua-users.org/lists/lua-l/2002-12/msg00021.html
>> and the considerations from
>> http://lua-users.org/lists/lua-l/2002-12/msg00025.html. This is a version of
>> LUA where all internal strings are UTF16 (wide characters) rather than
>> chars, not a version with two types of strings like LuaState.
>>
>> My question is on how can I verify my port works? Other than toy scripts I
>> created, I'm looking for a comprehensive set of tests I can run in order to
>> verify all important language feature were not broken...
>
> I think you should make sure your "port" doesn't break libraries that
> use strings for non-text data. For examples, you can try to use the io
> library on binary files (use one of the (de)serialization libraries
> available around), or LuaSocket on binary protocols.
>
> By the way, what's the goal of such a modification of Lua ? If all you
> need is being able to use Unicode filenames on windows, a simpler
> approach is to assume filenames passed to Lua are in utf-8 format, and
> convert it just when necessary when accessing the filesystem (in just
> a very few places in Lua source code). I have a patch doing that if
> anyone is interested.
>
>

Re: Testing LUA: verify the correctness of a UTF16 LUA port

David Given
In reply to this post by Joshua Jensen
Joshua Jensen wrote:
[...]
> It does the equivalent of wcscmp(), only it doesn't rely on the C
> runtime to achieve this.  That's because on some non-Visual C++
> compilers, sizeof(wchar_t) != 2.  sizeof(lua_WChar) is always 2.

Yes, in the Unix world it's always 4 (wchar_t is an int).

It's easier in the console world --- you've got complete control over
all the text on your system, so you can ensure you're not using any
weird stuff like RTL, surrogates, unsupported combining characters, etc.

[...]
> I would consider ditching the LuaPlus wide character support if there
> was a small library that supported UTF-8 and allowed easy embedding of
> UTF-8 string types in Lua source files.

Well, UTF-8 in Lua source files already Just Works. (They're treated by
Lua as Bags of Bytes.) As far as libraries go, I wrote some very simple
UTF-8 parsing code for WordGrinder:

http://wordgrinder.svn.sourceforge.net/viewvc/wordgrinder/wordgrinder/src/c/utils.c?view=markup

This will let you read and write raw code points from/to a string in a
relatively simple manner.

Thinking about this, a while back I did actually find that Unicode has
real rules for splitting up a UTF-8 string into 'characters', each of
which is an arbitrary-sized string representing a single drawable thing
(I forget the exact term --- grapheme clusters?). So theoretically it
ought to be possible to *truly* do random-access on a string. Maybe I
should revisit this at some point.

--
┌─── dg@cowlark.com ───── http://www.cowlark.com ─────

│ ⍎'⎕',∊N⍴⊂S←'←⎕←(3=T)⋎M⋏2=T←⊃+/(V⌽"⊂M),(V⊝"M),(V,⌽V)⌽"(V,V←1⎺1)⊝"⊂M)'
│ --- Conway's Game Of Life, in one line of APL

Re: Testing LUA: verify the correctness of a UTF16 LUA port

eugeny gladkih
In reply to this post by Joshua Jensen
Joshua Jensen wrote:

> ----- Original Message -----
> From: David Given
> Date: 10/14/2009 8:31 PM
>> Joshua Jensen wrote:
>> [...]
>>> LuaPlus achieves this via a C-like string representation:
>>>
>>> HelloWorld = L"Hello world!"
>> What does LuaPlus do for things like string comparison and surrogates?
> It does the equivalent of wcscmp(), only it doesn't rely on the C
> runtime to achieve this.  That's because on some non-Visual C++
> compilers, sizeof(wchar_t) != 2.  sizeof(lua_WChar) is always 2.
>
> My understanding is that UCS-2 doesn't support surrogates.  I don't
> think Microsoft's C runtime wide character library supports them
> either.  I could be wrong.

The WinNT kernel doesn't know anything about UTF-16; it's pure UCS-2
;) So it's a myth that Windows is Unicode-aware


Re[2]: Testing LUA: verify the correctness of a UTF16 LUA port

Bulat Ziganshin
Hello john,

Thursday, October 15, 2009, 11:25:02 AM, you wrote:

> WinNT kernel doesn't know anything about UTF-16 it's pure UCS-2 coded
> ;) so, it's a myth that Windows is Unicode aware

afaik, it was fixed in Vista or so


--
Best regards,
 Bulat                            mailto:[hidden email]


Re: Testing LUA: verify the correctness of a UTF16 LUA port

Robert Smith-2-3
In reply to this post by uri cohen
David Given wrote:

>uri cohen wrote:
>> > My question is on how can I verify my port works? Other than toy scripts
>> > I created, I'm looking for a comprehensive set of tests I can run in
>> > order to verify all important language feature were not broken...
>
> Unicode is harder than it looks... one reason Lua doesn't really use it
> is that once you start dealing with Unicode you start finding places
> where you get conflicting requirements.
>
> For example, ? can be represented as both U+301 U+0065, or as U+00E9. Do
> these compare equal? They are technically the same thing.
 >...

The other question mark above is an é (it got ASCII'd :-). I agree with
you about the conflicting requirements, indeed more so because whether
Latin small letter e with acute is the same as the unadorned e plus
combining acute accent depends on context. In plain French text, I'd say
yes, but what about in a mathematical text with variables e, é, e̋,
i.e. e, e-acute, e-double-acute? In that case I'd say no, as
maintaining the separation of the base letter and the diacritic would
probably be important.

Re: Testing LUA: verify the correctness of a UTF16 LUA port

David Dunham
In reply to this post by Joshua Jensen
On 14 Oct 2009, at 19:22, Joshua Jensen wrote:

> I apologize for hijacking the thread, but I really do feel it is  
> useful to have wide character string support for my work within the  
> games industry.  For applications, wide characters may be worthless.


Our games use UTF-8, but user-visible strings come from what are
essentially external resources -- Lua code has to look them up anyway
so they can be localized.

> My understanding is that UCS-2 doesn't support surrogates.  I don't  
> think Microsoft's C runtime wide character library supports them  
> either.  I could be wrong.

Again, this is why wide character support seems like the wrong idea --  
it is not 21st century Unicode.

David Dunham       Development Manager
+1 206 926 5722      GameHouse Studios


Re: Testing LUA: verify the correctness of a UTF16 LUA port

Dave Dodge
In reply to this post by David Given
On Thu, Oct 15, 2009 at 05:16:40AM +0100, David Given wrote:
> Yes, in the Unix world it's always 4 (wchar_t is an int).

True in general practice, but technically the Unix standard does not
require it to be the same size as int.

  http://www.opengroup.org/onlinepubs/009695399/basedefs/stddef.h.html

For example, I think there were UNICOS implementations where everything
except char was 64 bits, but I don't know if Cray still does it that
way.

                                                  -Dave Dodge

Re: Testing LUA: verify the correctness of a UTF16 LUA port

uri cohen
In reply to this post by Joshua Jensen
Thank you all for your feedback on porting Lua to UTF-16.

To my knowledge, MS wide character support DOES support wide chars of more than 2 bytes (the so-called surrogate characters). Indeed, my approach makes it hard to use Lua strings to represent arbitrary binary data; I did it in order to embed Lua in a native UTF-16 application.

The LuaPlus approach is interesting, but having two string types in Lua makes it hard to maintain: it will break at runtime whenever a string is passed to code that does not expect that string type, while the UTF-16 approach identifies those problems at compile time. Nevertheless, I will give LuaPlus further thought.

Thanks,
           Uri Cohen


Re: Testing LUA: verify the correctness of a UTF16 LUA port

uri cohen
In reply to this post by Andre Leiradella
Following the feedback that using UTF-8 is better than porting Lua to UTF-16, is there a Lua port or library which allows the user to:
- define UTF-8 literals;
- open files with UTF-8 names;
- read and write Windows Unicode text using only Lua strings;
- bind with external C code which works in UTF-16?
Thank you all!
            Uri Cohen


Re: Testing LUA: verify the correctness of a UTF16 LUA port

Jerome Vuarand
2009/10/16 uri cohen <[hidden email]>:
> Following the feedback that using UTF8 is better than porting LUA to UTF16,
> is there a LUA port or library which allows the user to:
> - define UTF8 literals;
> - open files with UTF8 names;
> - read and write Windows unicode text using only LUA strings
> - bind with external C code which work in UTF16

I wrote such a patch; it is attached to this mail. It adds the
following to the Lua API, declared in the standard Lua headers but
implemented in lwstring.c:

#define LUA_HAS_WSTRING 1

LUA_API const wchar_t *(lua_towstring) (lua_State *L, int index);
LUA_API const wchar_t *(lua_tolwstring) (lua_State *L, int index, size_t *length);
LUA_API void (lua_pushwstring) (lua_State *L, const wchar_t *value);
LUA_API void (lua_pushlwstring) (lua_State *L, const wchar_t *value, size_t size);

LUALIB_API const wchar_t *(luaL_optwstring) (lua_State *L, int index, const wchar_t *def);
LUALIB_API const wchar_t *(luaL_checkwstring) (lua_State *L, int index);
LUALIB_API const wchar_t *(luaL_checklwstring) (lua_State *L, int index, size_t *length);
LUALIB_API int (luaL_loadwfile) (lua_State *L, const wchar_t *filename);

#define luaL_dowfile(L, fn) \
        (luaL_loadwfile(L, fn) || lua_pcall(L, 0, LUA_MULTRET, 0))

A "wstring" is a userdata which contains a NUL-terminated string of
wchar_t. lua_towstring converts a string on the stack to a wide-char
string. It replaces the string on the stack, similarly to lua_tostring
and lua_tonumber when they are passed a number or a string
respectively, so the same restrictions apply (don't use it on keys in
a lua_next loop).

On Windows these functions use MultiByteToWideChar and
WideCharToMultiByte to convert between UTF-8 and UTF-16. On other
platforms they use mbstowcs and wcstombs, which convert between the
current multi-byte locale and wide strings, usually UCS-4.

On Windows only, the patch modifies all calls to the C library that
take filenames, converting those filenames using the wstring
conversion functions.

It's only been tested on windows so far (both the
MultiByteToWideChar/WideCharToMultiByte and mbstowcs/wcstombs versions
compile there). It's not perfect (I think it may leak if a
lua_pushstring errors), and I didn't patch the Makefile since I'm not
using it. Any feedback is welcome.

[Attachment: lua-wstring.patch (15K)]

Re: Testing LUA: verify the correctness of a UTF16 LUA port

uri cohen
Thanks for your patch, Jerome.
I've used it and found it a minimal and elegant form of support for UTF-16 Unicode strings. Although it provides the requirements I wrote in my post, I found it does not work for me because it does not support the ability to run Lua code from UTF-16 files or strings. So while it supports Unicode strings, you can't really define Unicode literals in an easy way.
Thanks!
           Uri Cohen



Re: Testing LUA: verify the correctness of a UTF16 LUA port

Jerome Vuarand
2009/10/27 uri cohen <[hidden email]>:
> Thanks for your patch, Jerome.
> I've used it and found it is a minimal and elegant support for UTF16 unicode
> strings. Although it provide the requirements I wrote in my post, I found it
> does not work for me because it does not support ability to run LUA code
> from UTF16 files or lines. So while you support unicode strings you can't
> really define unicode literals in an easy way.

If you encode your scripts as UTF-8, you can have any Unicode
character in literal strings. If you really need your scripts to be
UTF-16 rather than UTF-8, it's very easy to write a script loader that
does the conversion on the fly.