lua for unicode


lua for unicode

???
is there any plan to support unicode in lua?


Re: lua for unicode

Björn De Meyer
??? wrote:
> 
> is there any plan to support unicode in lua?

In principle, you can already use UTF-8 in Lua strings, as
Lua is 8-bit clean. The standard string library of Lua,
however, does not support UTF-8 yet. I was interested in
doing UTF-8 support for Lua strings but got a bit
distracted. Maybe you and I can work on it together?

-- 
"No one knows true heroes, for they speak not of their greatness." -- 
Daniel Remar.
Björn De Meyer 
[hidden email]


Re: lua for unicode

mnicolet
In reply to this post by ???
It should be never ...
Unicode is a bad dream
----- Original Message ----- 
From: "???" <[hidden email]>
To: "Multiple recipients of list" <[hidden email]>
Sent: Friday, November 29, 2002 2:31 AM
Subject: lua for unicode


> is there any plan to support unicode in lua?



RE: lua for unicode

jame
Hmm?

Unicode is a great standard. It supports all of the world's languages and
then some.
Thankfully, strings in Lua can be Unicode, UTF16, ASCII, or 8-bit code-page
mapped.
Unfortunately, I don't believe the library sources support wide-character
calls. I'm sure ports of these will surface in time, or, preferably, they
will be converted into libraries which automatically detect the string width
of the platform they run on.

Regards,
Jim



> -----Original Message-----
> From: [hidden email] 
> [[hidden email]] On Behalf Of mnicolet
> Sent: Friday, November 29, 2002 8:02 PM
> To: Multiple recipients of list
> Subject: Re: lua for unicode
> 
> 
> It should be never ...
>
> Unicode is a bad dream
>
> ----- Original Message ----- 
> From: "???" <[hidden email]>
> To: "Multiple recipients of list" <[hidden email]>
> Sent: Friday, November 29, 2002 2:31 AM
> Subject: lua for unicode
> 
> 
> > is there any plan to support unicode in lua?
> 
> 
> 



Re: lua for unicode

Steve Dekorte-4

On Friday, November 29, 2002, at 06:46 PM, [hidden email] wrote:
> Unicode is a great standard. It supports all of the world's languages and
> then some.

Jim, remember when we talked with the CTO from etranslate.com in the Mission? Didn't he say that they found that Unicode did not support all languages?

Cheers,
Steve
OSX freeware and shareware: http://www.dekorte.com/downloads.html



Re: lua for unicode

Björn De Meyer
Steve Dekorte wrote:
> 
> On Friday, November 29, 2002, at 06:46 PM, [hidden email] wrote:
> > Unicode is a great standard. It supports all of the world's languages and
> > then some.
> 
> Jim, remember when we talked with the CTO from etranslate.com in the
> Mission? Didn't he say that they found that Unicode did not support all
> languages?
> 
> Cheers,
> Steve
> OSX freeware and shareware: http://www.dekorte.com/downloads.html

Exactly. Unicode doesn't even support the full range of (Chinese) 
Kanji characters used in Japanese. Unicode was in the beginning an 
evil Microsoft invention, in which they tried to implement all languages 
in just 65000 characters. The problem being that Japanese alone knows 
80000 Kanji characters. Even now, the so-called "CJK unification", 
that is, in essence, the dumbing down of Chinese, Japanese and 
Korean ideographs, is causing much discomfort in Asia... Unfortunately,
in the free software world, the illusion lives on that Unicode is the 
definitive solution to the internationalisation problem. 

In reality, Unicode does help, but it is not the end-all solution.
I've seen alternatives to Unicode, like the TRON character encoding,
but those are not that much better, because they are stateful. 
We'd need some encoding that allows each language, or at least 
each script, to define its own character set, whilst maintaining 
a stateless encoding. 


-- 
"No one knows true heroes, for they speak not of their greatness." -- 
Daniel Remar.
Björn De Meyer 
[hidden email]


Re: lua for unicode

Peter Loveday-2
> Exactly. Unicode doesn't even support the full range of (Chinese)
> Kanji characters used in Japanese. Unicode was in the beginning an
> evil Microsoft invention, in which they tried to implement all languages
> in just 65000 characters.

It may be true that Unicode doesn't currently support all characters; however,
it certainly supports a lot more than 65000.  You may be thinking of the
UCS-2 variant, as used by Windows, which is actually even more limited than
this; but Unicode supports a lot more than that.

According to unicode.org (the official Unicode site), the current Unicode
spec (3.2) defines codes for 95,221 characters.

Unicode exists in many formats; UCS-2 was the old way some systems/programs
supported it, but it does not allow all Unicode codes to be used.  The
modern formats are usually UTF-8, UTF-16, or UTF-32, which use multi-byte
sequences or surrogate pairs to represent the full code range.

It may well be that not all characters have been assigned codes in unicode,
but it still has a huge number of available codes; I believe the current
spec allows for codes in the range U+0 through U+10FFFF, or 1,114,112
possible characters.

Using UTF-8 in Lua is attractive because Lua is already 8-bit clean.
Certainly the string lib needs to be updated to deal with a character being
between 1 and 4 bytes, as the UTF-8 spec defines, and there are some
collation issues that are very complex.  IBM have an excellent, freely usable
Unicode library available at http://oss.software.ibm.com/icu/ that deals
with all these issues, but it is much too large to utilise in a standard
Lua distribution.  Still, adding basic abilities to handle UTF-8 without
proper collation would not be too hard, perhaps with an add-on module that
supports full Unicode handling.
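
As a rough illustration of what such a basic ability might look like, here is
a character count that simply skips UTF-8 continuation bytes (a sketch only,
not taken from any existing library, and assuming well-formed UTF-8 input):

    -- Count characters (not bytes) in a UTF-8 string.
    -- Bytes in the range 128-191 (0x80-0xBF) are continuation bytes,
    -- so only lead bytes are counted.  Assumes well-formed UTF-8.
    function utf8_len(s)
      local count = 0
      for i = 1, string.len(s) do
        local b = string.byte(s, i)
        if b < 128 or b > 191 then
          count = count + 1
        end
      end
      return count
    end

    print(utf8_len("h\195\169llo"))  --> 5  ("héllo": 6 bytes, 5 characters)

Proper collation, validation and case handling are of course a different
matter, which is where a full library like ICU comes in.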


Love, Light and Peace,
- Peter Loveday
Director of Development, eyeon Software



Re: lua for unicode

John Belmonte-2
In reply to this post by Björn De Meyer
Björn De Meyer wrote:
> Exactly. Unicode doesn't even support the full range of (Chinese) Kanji
> characters used in Japanese. Unicode was in the beginning an evil Microsoft
> invention, in which they tried to implement all languages in just 65000
> characters. The problem being that Japanese alone knows 80000 Kanji
> characters. Even now, the so-called "CJK unification", that is, in essence,
> the dumbing down of Chinese, Japanese and Korean ideographs, is causing
> much discomfort in Asia... Unfortunately, in the free software world, the
> illusion lives on that Unicode is the definitive solution to the
> internationalisation problem.

This seems like misinformation. From what I understand, unicode has a 31 bit space, and only about 21 bits of that is required to cover all characters in use today. As for using unicode for Japanese, certainly it is possible and works well, as I've personally deployed unicode with UTF-8 encoding at Japanese websites.

Perhaps your experience is with some naive *encoding* of unicode that tries to stuff 21 bits into 16? ;-)

I use the Unix Unicode FAQ for reference (http://www.cl.cam.ac.uk/~mgk25/unicode.html).


-John




--
http:// if     .   /



Re: lua for unicode

Björn De Meyer
John Belmonte wrote:
> 
> This seems like misinformation.  From what I understand, unicode has a
> 31 bit space, and only about 21 bits of that is required to cover all
> characters in use today.  As for using unicode for Japanese, certainly
> it is possible and works well, as I've personally deployed unicode with
> UTF-8 encoding at Japanese websites.
> 
> Perhaps your experience is with some naive *encoding* of unicode that
> tries to stuff 21 bits into 16?  ;-)
> 

Yes, I do see that point. However, Unicode started out as 
being just that 16-bit wide encoding that Microsoft still uses
these days. So, historically speaking, Unicode is laden with the 
stigma of being too restrictive.

The other, more important problem which I mention here is CJK 
unification. If I am not mistaken, even in the Unicode of these days, 
many Chinese, Japanese and Korean ideographs have not been 
included on grounds of being historical forms, being 
"too similar" to other ideographs, on grounds of being 
uncommon in usage, or on grounds of being writable with
other characters.

Think about it. That's like saying that the q is not needed
in the roman alphabet because it's so similar to an o (just
one extra line), not used very commonly, and besides we could 
replace all q's by, for instance "kw". The kwestion is of 
course whether we would agree to such an intrusion into our 
western languages. And people named Quinten might object 
to having to write their name as Kwinten.

I have heard from some Japanese people that they find Unicode 
culturally unacceptable exactly for these reasons. Maybe something
has changed at the Unicode consortium, but I can't forget it's 
still Microsoft's brainchild, so I am wary of it.

95000 characters are now in Unicode, but by my estimate, 
there are probably a million different characters 
in all human languages of the past and the present. Lacking
are historical forms, rare scripts, regional variants, etc.
Maybe I have been misinformed. Maybe someone will be able to
reassure me with regard to my doubts.

More importantly, I am still interested in doing a 
small utf8-lib for Lua.


-- 
"No one knows true heroes, for they speak not of their greatness." -- 
Daniel Remar.
Björn De Meyer 
[hidden email]


Re: lua for unicode

John Belmonte-2
Björn De Meyer wrote:
> The other, more important problem which I mention here is CJK unification.
> If I am not mistaken, even in the Unicode of these days, many Chinese,
> Japanese and Korean ideographs have not been included on grounds of being
> historical forms, being "too similar" to other ideographs, on grounds of
> being uncommon in usage, or on grounds of being writable with other
> characters.

For Japanese, I think in the 50's there was a movement to reduce the number of "essential characters", likely with the goal of improving the literacy rate. A set of 1,850 characters was adopted by law, and publications now limit themselves to that set except for proper names, as you say.

If there is a group of Japanese not happy with the unicode situation, aren't they free to press for additions to the code set, and to lobby their government to produce a complete and freely licensed font to preserve their heritage?

-John



--
http:// if     .   /



RE: lua for unicode

jame
In reply to this post by Peter Loveday-2
> 
> Using UTF-8 in Lua is attractive because Lua is already 8-bit clean.
> Certainly the string lib needs to be updated to deal with a 
> character being
> between 1 and 4 bytes, as the UTF-8 spec defines, and there are some
> collation issues that are very complex.  (snip)
> 
> Love, Light and Peace,
> - Peter Loveday
> Director of Development, eyeon Software


I've been working on this in a library I've written. I've 
been trying to come up with a good way to support 8-bit, 
Unicode or UTF8, UTF16, or even UTF32. The solution I'm
trying is the same as Java's, where internally all strings 
are stored in UTF16. With this, it's easy to hook up the 
right system calls on a per-platform basis, which is needed
since wide-character versions of calls like fopen are not
standardized.

The tough part is figuring out what the user enters, since
on most platforms a library has to deal with a mix of encoding 
types. For example, across the various Windows platforms, I 
can have 8-bit codepage-mapped, ASCII, or Unicode input 
character data. I still haven't solved this problem. There's
no solid way to detect the incoming string type on the fly, 
so it looks like the library will have to handle it in one of
these ways:

1) Force the user to adhere to UTF16 on all input for 
the library.
2) Try to detect the string width on init and assume the user
is doing the same.
3) Hard-code the string width via a compile option.

I'm probably going to use #1. In most cases, at least on
Windows, it's fairly easy to convert from char* to wchar_t*
and use macros like L"string" when defining string constants. 
I can't support non-Unicode systems though, like Win98. 
(oh well :) It's probably time we all started moving toward
16-bit code points anyway. And since Unicode maps into 
UTF16, and ASCII maps directly into UTF16, it all works pretty
well.
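
To sketch roughly what that mapping looks like (purely illustrative Lua, not
code from the library described above, and with no validation): code points
up to U+FFFF become a single UTF16 code unit, and anything above is split
into a surrogate pair.

    -- Split a Unicode code point into UTF-16 code units.
    function utf16_units(cp)
      if cp < 0x10000 then
        return cp                                    -- one code unit
      end
      cp = cp - 0x10000
      local high = 0xD800 + math.floor(cp / 0x400)   -- high surrogate
      local low  = 0xDC00 + cp % 0x400               -- low surrogate
      return high, low
    end

    print(string.format("%04X", utf16_units(0x20AC)))        --> 20AC (euro sign)
    print(string.format("%04X %04X", utf16_units(0x10400)))  --> D801 DC00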

On the string and io Lua libs - these have the same problem 
since they make calls to things like fopen, fgets, strlen, etc.. 
and accept char* pointers. They will require a little work to 
get them to run on systems which are Unicode. Overall though, 
probably not a big deal since people can port them to their 
particular platform fairly easily. 

As for the core Lua lib, it's mostly string width independent
I believe, although a quick search across the source produces a 
number of calls to things like strlen in routines like 
lua_pushstring, lua_dostring, luaO_chunkid, luaS_new, and 
luaV_strcomp. (Maybe these should be memcmp's instead??) 
Hopefully, if you stay away from some of the ASCII API calls like
lua_pushstring and lua_dostring, everything works on a system that 
uses Unicode or UTF16? 

Regards,
Jim



Re: lua for unicode

Enrico Colombini
In reply to this post by Björn De Meyer
>Think about it. That's like saying that the q is not needed
>in the roman alphabet because it's so similar to an o (just
>one extra line), not used very commonly, and besides we could 
>replace all q's by, for instance "kw".

That should improve readability:
http://dept.physics.upenn.edu/~heiney/jokes/twain.html

  Enrico


RE: lua for unicode

jame
In reply to this post by jame
> 
> As for the core Lua lib, it's mostly string width independent
> I believe, although a quick search across the source produces a 
> number of calls to things like strlen in routines like 
> lua_pushstring, lua_dostring, luaO_chunkid, luaS_new, and 
> luaV_strcomp. (Maybe these should be memcmp's instead??) 
> Hopefully, if you stay away from some of the ASCII API calls like
> lua_pushstring and lua_dostring, everything works on a system that 
> uses Unicode or UTF16? 

Actually, memcmp isn't going to fix this. :) What is this call
luaV_strcomp used for? Looks like it is called by luaV_lessthan, 
which is called from a number of places. Are these routines acting 
on user defined string data? If so, this would break on a Unicode
system.

I also see the use of strcpy in a number of places. I was 
under the impression Lua was string width independent but maybe
I was wrong.

Regards,
Jim




unicode in lua

Tiago Tresoldi
One of the first things I did in Lua was writing a small and not very well coded UTF-8 library for an almost-dead machine 
translation program (that was written in Python and maybe some day will become a C/Lua combination).
I have been thinking about whether to release it, because not only am I sure it is far from perfect, I also know that I could 
do something better myself... Anyway, I finally decided. You can download it at http://traduki.sf.net/unicode.lua and use it for 
anything you want, but be advised that it is far from perfect.
Private comments on this (but please don't be too harsh :) ) are welcome.

Tiago



Re: lua for unicode

Björn De Meyer
In reply to this post by John Belmonte-2
John Belmonte wrote:
> 
> For Japanese, I think in the 50's there was a movement to reduce the
> number "essential characters", likely with the goal of improving the
> literacy rate.  A set of 1,850 characters was adopted by law and
> publications now limit themselves to that set except for proper names,
> as you say.

Yes, it's true that there are 2000 Jouyou Kanji,
which means "everyday use Chinese characters". 
If you know these characters, you are not illiterate.
However, that doesn't mean you have a high degree of
literacy! Those 2000 characters are a legally enforced 
minimum knowledge. Official publications and newspapers 
will limit themselves to these everyday-use Kanji. 
However, Japanese literature does not limit itself 
in this way.  

 
> If there is a group of Japanese not happy with the unicode situation,
> aren't they free to press for additions to the code set, and to lobby
> their government to produce a complete and freely licensed font to
> preserve their heritage?

The problem is that due to chauvinism, and the way Microsoft has handled
things, those groups are wary of Unicode. They see it as a western 
attempt to butcher their language, and deeply mistrust the Unicode 
consortium. Like I said before, they even developed their own 
Japanese-centric OS, TRON, that can display the whole range of 
80000 Kanji and then some. Some of the free Japanese fonts you
mention are already available. The problem is that with Unicode, 
you'll only be able to use part of them in a standardized way.
You'll need to use the nonstandard "extension area" of Unicode to use 
them fully, which is an impractical situation.

-- 
"No one knows true heroes, for they speak not of their greatness." -- 
Daniel Remar.
Björn De Meyer 
[hidden email]


Re: lua for unicode

???
Oh, sorry, I was confused about Microsoft's wide characters and the UTF-8 format; 
I just thought it looked impossible for Lua to read a file stored as 2-byte wide characters, 
and I had the misunderstanding that all Unicode is a 2-byte array.

I found it's possible to save a Lua file in UTF-8 format, because all Lua keywords and tokens are ASCII characters,
and just reading strings from a Lua table without any other string operation was OK. 
I used the MultiByteToWideChar/WideCharToMultiByte Win32 functions to read strings from the Lua table, and it worked fine. 
(Am I using these functions correctly? First I converted the UTF-8 multibyte string to wide characters, and then converted the wide-character string to ANSI; I know this solution can't display characters from all the different languages at the same time.)

But I'm still confused: the UTF-8 format is not the wide-character version, yet it's Unicode, 
and Microsoft's wide characters aren't Unicode? 
UTF-8 is multi-byte and Microsoft's wide characters are a 2-byte array...

I want to work together to support UTF-8 in Lua, but I'm still a newbie in this field. 
And checking whether a char is greater than 127 in every string loop doesn't look good for performance...
(the wide-character version looks easier for string operations)

And finally, I'm Korean, currently working to support Korean and Japanese in my game, and sorry for my poor English :)
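
For what it's worth, the multibyte-to-code-point step that MultiByteToWideChar
performs for UTF-8 input can also be sketched in plain Lua. The snippet below
is only an illustration (not from any existing library); it handles 1- to
3-byte sequences, i.e. the BMP, and assumes well-formed input.

    -- Decode the UTF-8 sequence starting at position i of s.
    -- Returns the code point and the position of the next sequence.
    function utf8_decode(s, i)
      local b = string.byte(s, i)
      if b < 0x80 then                 -- 1 byte: 0xxxxxxx
        return b, i + 1
      elseif b < 0xE0 then             -- 2 bytes: 110xxxxx 10xxxxxx
        local b2 = string.byte(s, i + 1)
        return (b - 0xC0) * 0x40 + (b2 - 0x80), i + 2
      else                             -- 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        local b2, b3 = string.byte(s, i + 1), string.byte(s, i + 2)
        return ((b - 0xE0) * 0x40 + (b2 - 0x80)) * 0x40 + (b3 - 0x80), i + 3
      end
    end

    -- "\226\130\172" is the euro sign U+20AC encoded in UTF-8 (bytes E2 82 AC).
    print(string.format("U+%04X", utf8_decode("\226\130\172", 1)))  --> U+20AC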




Re: lua for unicode

Roberto Ierusalimschy
In reply to this post by jame
> As for the core Lua lib, it's mostly string width independent
> I believe, although a quick search across the source produces a 
> number of calls to things like strlen in routines like 
> lua_pushstring, lua_dostring, luaO_chunkid, luaS_new, and 
> luaV_strcomp.

lua_pushstring is auxiliary (as you pointed out); the same for
lua_dostring (the "official" luaL_loadbuffer gets a size). luaS_new
and luaO_chunkid may need some rework.


> What is this call luaV_strcomp used for?

It does string-order comparison:  "hi" <= "hello". Yes, this one breaks
an external Unicode system. Suggestions?


> I also see the use of strcpy in a number of places. I was 
> under the impression Lua was string width independent but maybe
> I was wrong.

We try to make it string width independent, but it is difficult to
ensure that. We will try to improve that (maybe after 5.0 beta).

-- Roberto


Re: lua for unicode

lua+Steven.Murdoc
In reply to this post by ???
> But I'm still confused: the UTF-8 format is not the wide-character version, yet it's Unicode, 
> and Microsoft's wide characters aren't Unicode? 
> UTF-8 is multi-byte and Microsoft's wide characters are a 2-byte array...

Unicode is a character set which maps integer numbers to characters. In theory 
there is no limit to the size of Unicode, since it does not define any 
representation. However, it is extremely unlikely any characters will be mapped 
to numbers greater than 2^31 (2,147,483,648), and there are proposals to limit 
this to 2^21 (2,097,152).

Initially Unicode was limited to 2^16 positions (65,536), but this was found 
to be inadequate. The first 2^16 characters of Unicode are known as the Basic 
Multilingual Plane (BMP), which is intended to be enough to represent all living 
languages; however, as other messages have suggested, it does not contain 
historical characters. This space is not yet full, so further 
characters may be added in the future.

Unicode doesn't specify a way for the integers representing characters to be 
encoded, so there are a number of options. Windows was designed when Unicode 
characters were only 16 bits long, so they are encoded as two bytes and can 
therefore only represent the BMP. This is what Microsoft's wide-character 
representation is (sometimes called UCS-2).

UTF-8 is a variable-length encoding which can represent the whole Unicode 
codespace (1-3 bytes for the BMP, 1-4 bytes for 21-bit Unicode, 1-6 bytes for 
31-bit Unicode). It has several features that are good for backwards 
compatibility. For details see: http://www.cl.cam.ac.uk/~mgk25/unicode.html

UTF-16 is another encoding which can represent the whole Unicode codespace as 
2 or 4 bytes per character.

There are many other encodings but most applications use these two.
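
To make the UTF-8 scheme concrete, here is a small illustrative Lua function
(just a sketch, with no error checking, and not from any of the libraries
mentioned in this thread) that encodes a single code point, covering the
1- to 4-byte forms:

    -- Encode a Unicode code point (up to U+10FFFF) as a UTF-8 string.
    function to_utf8(cp)
      if cp < 0x80 then          -- 1 byte, plain ASCII
        return string.char(cp)
      elseif cp < 0x800 then     -- 2 bytes
        return string.char(0xC0 + math.floor(cp / 0x40),
                           0x80 + cp % 0x40)
      elseif cp < 0x10000 then   -- 3 bytes
        return string.char(0xE0 + math.floor(cp / 0x1000),
                           0x80 + math.floor(cp / 0x40) % 0x40,
                           0x80 + cp % 0x40)
      else                       -- 4 bytes
        return string.char(0xF0 + math.floor(cp / 0x40000),
                           0x80 + math.floor(cp / 0x1000) % 0x40,
                           0x80 + math.floor(cp / 0x40) % 0x40,
                           0x80 + cp % 0x40)
      end
    end

    local e = to_utf8(0x20AC)  -- the euro sign, U+20AC
    print(string.format("%02X %02X %02X",
          string.byte(e, 1), string.byte(e, 2), string.byte(e, 3)))  --> E2 82 AC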

For more information read:
http://www.cl.cam.ac.uk/~mgk25/unicode.html
and
http://www.unicode.org/

Hope this helps,
Steven Murdoch.







Re: lua for unicode

Roberto Ierusalimschy
In reply to this post by Roberto Ierusalimschy
> As for the core Lua lib, it's mostly string width independent
> I believe, although a quick search across the source produces a 
> number of calls to things like strlen in routines like 
> lua_pushstring, lua_dostring, luaO_chunkid, luaS_new, and 
> luaV_strcomp.

Just a remark: for UTF-8, the way Lua uses strlen, strcat, strcpy, etc.
is OK, as UTF-8 strings cannot contain zeros. The only dependency in the
core is `strcoll' (used in luaV_strcmp to order strings).

-- Roberto


Re: lua for unicode

lua+Steven.Murdoc
> Just a remark: for UTF-8, the way Lua uses strlen, strcat, strcpy, etc.
> is OK, as UTF-8 strings cannot contain zeros.

The null character ('\0' in C) is represented in UTF-8 as a single zero 
byte. Will Lua handle this case correctly, since I think it permits strings 
with '\0' in them? However, as you state, UTF-8 does not permit a zero byte in 
any other circumstance.

Steven Murdoch.

