lua for unicode


Re: lua for unicode

Roberto Ierusalimschy
> The null character ('\0' in C) is represented in Unicode as a single, zero 
> byte. Will Lua handle this case correctly, since I think it permits strings 
> with '\0' in them?

Lua only uses those functions (from string.h) to manipulate some
restricted sets of strings (such as reserved words or identifiers),
which for sure do not contain zeros, or strings that can be truncated
to their first zero (such as when formatting error messages). (In one
single place, it also uses `strlen' to find a possible '\0' inside the
string.)
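
For illustration, a minimal Lua sketch: strings carry an explicit length, so an embedded zero is just another byte.

local s = "a\0b"         -- three bytes, with an embedded zero
print(string.len(s))     --> 3
print(s == "a\0b")       --> true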

-- Roberto


RE: lua for unicode

jame
In reply to this post by lua+Steven.Murdoc
> The null character ('\0' in C) is represented in Unicode as a 
> single, zero 
> byte. 

I believe it's a null word, not a byte, since in 16-bit Unicode the
characters are each 16 bits, including the termination character.
Plus, in Unicode, you have characters which have 0x00 as one of the
bytes in the word: for example, ASCII's mappings into Unicode:
0x002E (Unicode) -> 0x2E (ASCII) == '.'
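
For illustration, a quick Lua sketch of that mapping: the 16-bit encoding of '.' contains a zero octet, which is exactly what a strlen-style scan would trip over.

local dot_utf16le = string.char(0x2E, 0x00)  -- '.' (U+002E) as 16-bit Unicode, little-endian
print(string.len(dot_utf16le))               --> 2: Lua keeps both octets, strlen would stop at 1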

> It does string-order comparison:  "hi" <= "hello". Yes, this one breaks
> an external Unicode system. Suggestions?

What would be really cool is if it didn't ever depend on 8-bit
character widths, by using an explicit string length everywhere. It
would seem you guys are pretty close to that now. Any place you
use string functions would have to be replaced with the mem functions
that take a length. In luaV_strcomp, strcoll is used; ugh, that one is
hard to replace. I'm familiar with normalization in Unicode: strcoll
is essentially the same thing for 8 bit, apply normalization and then
use strcmp to compare the strings. To use this you really have to know
the width of the character, so you can call the wide / multibyte /
ASCII versions. Which is not very elegant.

One solution would be to 

a) make the source as independent of character width as possible, so
that you end up with just a few places where a call like "strcoll"
is used.

b) allow the user to define the function(s) used in these places, 
so for example I can set via a config file "luastrcoll" to "wstrcoll",
"mbstrcoll", "strcoll", "utf16strcoll", or some other routine. 

Then it's up to me to make sure I'm passing compatible strings into
the library. I could use any string format I wanted, as long as I 
provided a version of "luastrcoll" that worked with my string encoding
format.

As for the aux libs, I'd say leave it to the users who work with unicode
or utf16, or whatever to port them for you. :) 

Regards,
Jim



Re: lua for unicode

lua+Steven.Murdoc
> > The null character ('\0' in C) is represented in Unicode as a 
> > single, zero 
> > byte. 
> 
> I believe it's a null word, not byte.

In the UTF-8 encoding it is a null byte/octet. All the ASCII characters 
(0-127) are represented as one byte, hence maintaining backwards compatibility 
with ASCII.

The null byte occurs in no other situation, even with the multi-byte 
characters.
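
A quick Lua illustration of that property: every octet of a multi-byte UTF-8 sequence has its high bit set, so a zero octet can only ever encode U+0000 itself.

local kanji = string.char(0xE5, 0xBB, 0xBA)  -- U+5EFA ("建") as a three-octet UTF-8 sequence
print(string.byte(kanji, 1), string.byte(kanji, 2), string.byte(kanji, 3))
--> 229   187   186   (all octets >= 0x80, none zero)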

Steven Murdoch.




RE: lua for unicode

jame
Sorry, didn't realize you were discussing the UTF-8 encoding.
My bad.

Regards,
Jim



Re: lua for unicode

Björn De Meyer
In reply to this post by lua+Steven.Murdoc
> Initially Unicode was limited to 2^16 positions (65,536), but this was found
> to be inadequate. The first 2^16 characters of Unicode are known as the Basic
> Multilingual Plane (BMP), which is intended to be enough to represent all living
> languages; however, as other messages have suggested, it does not contain
> historical characters. This space is not yet full, so there may be further
> characters added in the future.

Here I go nitpicking again, but it also lacks characters used 
in names, contemporary but uncommonly used characters, and 
variants of the original. I find it amazing that this space is not yet
full, as it should already have been filled with those 80,000 characters
I mentioned before alone. I still get the feeling that Microsoft 
wants to keep using its obsolete 16-bit encoding (which is AFAIK not 
UTF-16), and is therefore holding back many characters. Then again, 
I might be a raving madman. ^_^

Anyway, as far as lua is concerned, I am convinced that UTF-8 is the way 
to go. That way, backwards compatibility and internationalisation 
of strings can go hand in hand. 


-- 
"No one knows true heroes, for they speak not of their greatness." -- 
Daniel Remar.
Björn De Meyer 
[hidden email]


RE: lua for unicode

jame
FYI - The Unicode Standard is not controlled by Microsoft:
http://www.unicode.org/unicode/consortium/memblogo.html

On UTF8, I'd say making Lua independent of any encoding standard 
is a better approach if possible. Leave it to the user to decide 
the standard.  Any place a string comparison is done, pipe that 
out to the application hosting Lua so the treatment comes out 
right based on the text encoding.

Regards,
Jim





Re: lua for unicode

Scott Morgan-2
In reply to this post by lua+Steven.Murdoc
[hidden email] wrote:
> Initially Unicode was limited to 2^16 positions (65,536), but this was found to be inadequate. The first 2^16 characters of Unicode are known as the Basic Multilingual Plane (BMP), which is intended to be enough to represent all living languages; however, as other messages have suggested, it does not contain historical characters. This space is not yet full, so there may be further characters added in the future.

What issues would be involved in getting lua to natively use the MS Windows 16-bit unicode effort? I know it's a little (well, quite a lot :) ) against the spirit of lua, but it seems like a good quick fix just to go through the lua source and replace all the str* function calls with win32 wstr* calls. Of course, all scripts would have to be saved in the same 16-bit text format in this situation.

Just to make it clear (and avoid flames): I wouldn't consider this a proper fix, just a quick way to get win32 unicode support into lua.

Scott



RE: lua for unicode

jame

It would take a bit more work than just replacing the ASCII
calls with wide character calls. There are some places where
ASCII strings are iterated across.

If you search for strcoll, strncpy, strcpy, strcmp, etc., you'll
find most of these. Some string constants might need the addition
of the L"" prefix as well.

Another issue is the parser - it's not designed to handle
Unicode files, so your source files would have to be in ASCII.
This could present some issues on systems that don't support
ASCII files, or operating systems which save multi-lingual files
in Unicode. It also doesn't seem to support UTF8 files. I tried
a simple test (this won't show up right for most unless you have an
HTML mail reader with a Japanese font installed):

-- 建築
print( "建築" )
建築 = {}
print( 建築 )

Which is essentially the same as:

-- test
print( "11" )
aa = {  }
print( aa )

but with the Japanese text stored as UTF-8 in the file. The Lua parser
dies on the assignment statement.

Also with UTF-8, it makes it tough on the developer. Say, for
example, a Japanese user wants to enter Japanese characters in their
print statements, or function names, etc. When these show up
or are referenced by name, both have to be in the same encoding,
or things won't work right.
All in all, not an easy problem to solve.
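
A tiny Lua sketch of that kind of mismatch: the same visible text, stored under two different encodings, compares unequal as plain byte strings.

local e_latin1 = string.char(0xE9)        -- "é" encoded as Latin-1: one octet
local e_utf8   = string.char(0xC3, 0xA9)  -- "é" encoded as UTF-8: two octets
print(e_latin1 == e_utf8)                 --> false: same text, different byte strings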

Regards,
Jim



Re: lua for unicode

Roberto Ierusalimschy
In reply to this post by Scott Morgan-2
> What issues would be involved getting lua to natively use the MS Windows 
> 16-bit unicode effort?

Mainly you have to change all calls (from str* to wstr*, but also from
fopen to _wfopen etc. etc.), redefine `char' to wchar_t, add "L" to all
literal strings in the code, add "L" to all literal characters in the
code, and then redefine some other stuff (such as EOF). We tried to
automate that in Lua, using lots of macros (I think some of the 4.1 work
versions have it). It did work (I successfully compiled and ran Lua on
Windows using that stuff), but it was a lot of pain to maintain (mainly
the macros to put in the "L"s), so we removed those macros.

Currently we try to facilitate the task, without explicit support. Among
other things, we avoid using "..." or '.' inside comments, we try to use
"char" only for characters (otherwise we use lu_byte), and we try to
keep the code independent of the fact that sizeof(char)==1. (I think
Luiz has some code to add "L" to your literals.)

Attached follows a list of what is (or was, for Lua 4.1w) involved in
the translation.

-- Roberto

#ifndef wwwc_h
#define wwwc_h


#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <wchar.h>
#include <locale.h>

#define L_CHAR	wchar_t
#define l_charint	wint_t

#define uchar(c)	(c)


#define l_c(x)	L##x
#define l_saux(x)	L##x
#define l_s(x)		l_saux(x)

#undef EOF
#define EOF		WEOF

#define	strcspn		wcscspn
#define fgetc		fgetwc
#define fgets		fgetws
#define fprintf		fwprintf
#define fputs		fputws
#define fscanf		fwscanf
#define printf		wprintf
#define sprintf		swprintf
#define strchr		wcschr
#define strcmp		wcscmp
#define strcoll		wcscoll
#define strcpy		wcscpy
#define strftime	wcsftime
#define strlen		wcslen
#define strncpy		wcsncpy
#define strpbrk		wcspbrk
#define strtod		wcstod
#define strtol		wcstol
#define strtoul		wcstoul
#define vsprintf	vswprintf


#define fopen		_wfopen
#define strerror(x)	L"system error"
#define system		_wsystem
#define remove		_wremove
#define rename		_wrename
#define tmpnam		_wtmpnam
#define getenv		_wgetenv
#define setlocale	_wsetlocale
#define perror		_wperror

#undef isalnum
#define	isalnum		iswalnum
#undef isalpha
#define	isalpha		iswalpha
#undef iscntrl
#define	iscntrl		iswcntrl
#undef isdigit
#define	isdigit		iswdigit
#undef islower
#define	islower		iswlower
#undef ispunct
#define	ispunct		iswpunct
#undef isspace
#define	isspace		iswspace
#undef isupper
#define	isupper		iswupper
#undef isxdigit
#define	isxdigit	iswxdigit

#define main	wmain

#endif

Re: lua for unicode

Roberto Ierusalimschy
In reply to this post by jame
>  Another issue is the parser - it's not designed to handle
>  Unicode files, so your source files would have to be in ASCII.
>  [...]
>  But with Japanese text stored as UTF8 in the file. The Lua parser
>  dies on the assignment statement.

"Dies"? Or do you mean syntax error?

Lua will not recognize "$B7zC[(J" as a valid identifier, but the `print(
"$B7zC[(J" )' should work. The main point is to handle Unicode data, not
Unicode identifiers.

-- Roberto


Re: lua for unicode

lua+Steven.Murdoc
In reply to this post by Björn De Meyer
> > Initially Unicode was limited to 2^16 positions (65,536), but this was found
> > to be inadequate. The first 2^16 characters of Unicode are known as the Basic
> > Multilingual Plane (BMP), which is intended to be enough to represent all living
> > languages; however, as other messages have suggested, it does not contain
> > historical characters. This space is not yet full, so there may be further
> > characters added in the future.
> 
> I find it amazing that this space it not yet
> full, as it should already have been filled with those 80000 characters
> I mentioned before alone.

The unassigned codespace of the BMP is a very scarce resource and, as I 
understand it, the proposals for its use far exceed its capacity. Given that 
Unicode guarantees not to delete any characters once they are added, any 
mistakes made could have very bad consequences and be impossible to rectify. 
Also, standards organizations move very, very slowly (sometimes this is a good 
thing, other times it is not).

I don't think there is any conspiracy at work here; while you may think that 
those characters are very important, there are other organizations which 
believe others are more important, and everyone's views have to be considered.

The area outside of the BMP is quite sparsely populated, so there will be less 
work in trying to get characters added to this area; however, characters here 
require more space to store (4 octets in UTF-16/UTF-8, as opposed to 1-3 octets 
in UTF-8 and 2 octets in UTF-16), so there is a desire to move the "more 
important" characters into the BMP.

> I still get the feeling that Microsoft 
> wants to keep using it's obsolete 16-bit encoding (wich is AFAIK not 
> UTF-16), and therefore is holding back many characters.

Microsoft has very little influence in this matter (it is a big group); 
moreover, there would be no advantage to them in preventing useful characters 
being added to the BMP, since this is the only subset they support.

Steven Murdoch.



Re: lua for unicode

lua+Steven.Murdoc
In reply to this post by ???
I think there are several issues being discussed here simultaneously so it 
might be helpful to clarify what is being proposed.

Firstly, there is the internal format Lua should use to encode Unicode strings. 
In most programming languages this could be any format; however, since Lua 
exposes its internals, there would be advantages if it were a standard format. 
There are plenty of standards in active use, each with their own tradeoffs, and 
these should all be considered. Since Lua allows embedded nulls in strings, it 
can support UTF-16/UTF-32 if needed, as well as UTF-8. The changes required to 
support Unicode would not be huge, mainly ensuring that string length is 
properly calculated and that characters are properly iterated over (now that 
there is no longer a direct octet-character correlation). An important 
consideration to be made is whether all strings are Unicode or whether a new 
Unicode type is to be added (as is done in Python).
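
As a rough sketch of what that calculation involves (illustrative code only, not an existing Lua function): counting characters in a UTF-8 byte string amounts to skipping continuation octets.

local function utf8_len(s)
  local n = 0
  for i = 1, string.len(s) do
    local b = string.byte(s, i)
    -- continuation octets are 0x80-0xBF; anything else starts a new character
    if b < 0x80 or b >= 0xC0 then n = n + 1 end
  end
  return n
end

print(utf8_len("abc"))                          --> 3
print(utf8_len(string.char(0xE5, 0xBB, 0xBA)))  --> 1 (one three-octet character)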

Then there is the input and output of Unicode characters. This would require 
changes to the I/O libraries to support the various encodings and convert them 
to/from the Lua internal format. An important consideration is that, in every 
encoding, not every byte pattern in a file is a valid Unicode character. It is 
strongly recommended to treat any such byte pattern as an error and not try 
to work around it. It is essential that such byte patterns do not exist in the 
internal encoding, since this opens several security issues. For this reason it 
is sensible to use an existing parser which is well tested against such 
issues. This would also allow all the commonly used Unicode encodings to be 
supported.

Also there is the representation of Unicode character literals in Lua 
programs. Most, if not all, languages have done this with an escape-code system 
(for example \uxxxx and \Uxxxxxxxx in Python[1]) rather than having UTF-8 or 
other Unicode input files. This technique has the advantage of keeping the 
parser small and allows any text editor to be used to edit Lua programs. Again, 
here checks must be made to ensure that no invalid Unicode characters can be 
stored.
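
For illustration, such an escape could be expanded into UTF-8 octets with a few lines of code; the helper below is hypothetical, just to show the arithmetic for code points below 0x10000.

-- hypothetical helper: turn a code point below 0x10000 into its UTF-8 octets
local function utf8_char(code)
  if code < 0x80 then
    return string.char(code)
  elseif code < 0x800 then
    return string.char(0xC0 + math.floor(code / 64), 0x80 + code % 64)
  else
    return string.char(0xE0 + math.floor(code / 4096),
                       0x80 + math.floor(code / 64) % 64,
                       0x80 + code % 64)
  end
end

print(utf8_char(0x2E))    --> "."
print(utf8_char(0x5EFA))  --> the three octets 0xE5 0xBB 0xBA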

Finally there is the issue of whether to allow Unicode identifiers. This would 
require many changes to the parser and would require that Lua programs were 
edited in a Unicode-aware text editor. I would consider the disadvantages of 
doing this to far outweigh any advantages, and most, if not all, other 
programming languages do not permit multibyte characters in programs, either 
as literals or identifiers. I do not know the current encoding used for Lua 
programs: is it ASCII, Latin-1, or is it defined by the platform?

[1] http://www.python.org/doc/current/ref/strings.html

Hope this helps,
Steven Murdoch. 



Re: lua for unicode

Roberto Ierusalimschy
> An important consideration to be made is whether all strings are Unicode
> or whether a new Unicode type is to be added (as is done in Python).

I think we can live outside these two options. Strings may contain
Unicode data or not (e.g. they may contain raw binary data, as now).
If you call a function from the new "utf8" library, it will assume
the string is a Unicode-utf8 string.


> It is essential that such byte patterns [non-valid Unicode character]
> do not exist in the internal encoding since this opens several
> security issues.

I think it would be easier to allow such patterns (among other things
because strings may contain other stuff besides Unicode data), and to
check for consistency when needed (that is, inside the functions of the
"utf8" library).

This is more or less what happens now. Strings may contain embedded
zeros, but some functions in the `string' library do not operate on
them, because proper "ISO" strings cannot contain zeros. The important
thing is to ensure that all functions have an "acceptable" behavior
(such as a polite error message) for any input.
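
A sketch of how that might look from the Lua side; the utf8.len call is hypothetical here, assumed to return the character count for well-formed data and nil for anything malformed, so validity is only checked when the library is actually used.

local good = "h" .. string.char(0xC3, 0xA9) .. "llo"  -- "héllo" as six UTF-8 octets
local bad  = string.char(0xC3)                        -- a truncated sequence
print(utf8.len(good))   --> 5   (characters, not octets)
print(utf8.len(bad))    --> nil (rejected only when the utf8 function is asked)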

-- Roberto



Re: lua for unicode

Björn De Meyer
In reply to this post by Roberto Ierusalimschy
Roberto Ierusalimschy wrote:
> 
> "Dies"? Or do you mean syntax error?
> 
> Lua will not recognize "$B7zC[(J" as a valid identifier, but the `print(
> "$B7zC[(J" )' should work. The main point is to handle Unicode data, not
> Unicode identifiers.
> 
> -- Roberto

Furthermore, the string $B7zC[(J is NOT a valid utf-8
string. It looks more like SHIFT-JIS encoding to me.
And I agree, it's the strings that matter.

-- 
"No one knows true heroes, for they speak not of their greatness." -- 
Daniel Remar.
Björn De Meyer 
[hidden email]


Re: lua for unicode

Björn De Meyer
In reply to this post by lua+Steven.Murdoc
[hidden email] wrote:
> I do not know the current encoding used for Lua
> programs, is it ASCII, Latin-1 or is it defined by the platform?

AFAICS, it's C locale dependent. The Lua lexer allows all 
characters in identifiers that the standard C isalpha() 
returns 1 for.

-- 
"No one knows true heroes, for they speak not of their greatness." -- 
Daniel Remar.
Björn De Meyer 
[hidden email]


RE: lua for unicode

jame
In reply to this post by Björn De Meyer
I was just testing a Lua source file which was encoded in UTF-8.
The cut and paste into my post to the list didn't show up right.

This relates to the last item in [hidden email]'s
nice summary post.

I'm impressed you guys did a wide character version... maybe you
should just dump that out into the community and let the people
who use it keep it up to date? I'd like to play with that if 
I could... I've been working on some apps for a PocketPC device 
and see some uses for Lua. But I'd like it to use a version that
supports Unicode strings, since PocketPC apps really have to be 
localized.

Regards,
Jim






RE: lua for unicode

Joshua Jensen
In reply to this post by Scott Morgan-2
> What issues would be involved getting lua to natively use the 
> MS Windows 
> 16-bit unicode effort? I know its a little (well quite a lot :) ) 
> against the spirit of lua, but it seems like a good quick fix 
> just to go 
> through the lua source and replace all the str* function calls with 
> win32 wstr* calls. Of course all scripts would have to be 
> saved into the 
> same 16-bit text files in this situation.
> 
> Just to make clear (and avoid flames) I wouldn't consider 
> this a proper 
> fix but just a quick way to get win32 unicode support into lua.

If you need functionality very similar to what you describe now, hop on
over to http://workspacewhiz.com/ and click on the Misc. Code sidebar
link.  You'll see two Lua distributions... LuaState 4.1 Alpha was used
to ship the Xbox title Amped: Freestyle Snowboarding.  I built a wide
character "Unicode" string type into it.  LuaPlus 5.0 Alpha is based on
the Lua 5.0 alpha codebase and has the same Unicode type.  Note that the
string type is a true Lua string type.  That is, I can write:

myString = L"This is a Unicode string"
myString = myString .. ", and it can use regular string operations."
print(myString:len())

myAnsiString = "This is an ANSI string"
print(myAnsiString:len())

-Josh



Re: lua for unicode

lua+Steven.Murdoc
In reply to this post by Roberto Ierusalimschy
> > An important consideration to be made is whether all strings are Unicode
> > or whether a new Unicode type is to be added (as is done in Python).
> 
> I think we can live outside these two options. Strings may contain
> Unicode data or not (e.g. they may contain raw binary data, as now).
> If you call a function from the new "utf8" library, it will assume
> the string is a Unicode-utf8 string.

This approach sounds reasonable. If the utf8 library is the only Unicode 
string manipulation library, then this will effectively be using UTF-8 for the 
internal encoding of Unicode strings. This has the advantage of bringing some 
backward compatibility characteristics, but probably decreases efficiency. 
Most languages I know of use UTF-16 for encoding Unicode strings, but this 
choice depends on a number of factors, so it is not necessarily right for Lua.

> > It is essential that such byte patterns [non-valid Unicode character]
> > do not exist in the internal encoding since this opens several
> > security issues.
> 
> I think it would be easier to allow such patterns (among other things
> because strings may contain other stuff besides Unicode data), and to
> check for consistency when needed (that is, inside the functions of the
> "utf8" library).

Yes, that is equally good. The essential feature is not to allow invalid bit 
patterns to be interpreted as valid UTF-8/UTF-16/etc. data. This is normally done 
by ensuring that any Unicode strings created are guaranteed to be valid, but 
this would not permit binary data to be stored in this datatype. Checking 
consistency on read may bring a small performance penalty, but this probably 
will not be significant.

However, it would be desirable to check the consistency of Unicode data as it is 
read from files, since then errors would be caught immediately rather than 
later during processing. A Unicode I/O library would be necessary anyway, since 
data may have to be read in, or output, in a format other than UTF-8. 
Consistency of UTF-8 strings must also be checked before they are written out 
to Unicode files.

Steven Murdoch.

