Will Lua kernel use Unicode in the future?

Re: Will Lua kernel use Unicode in the future?

Klaus Ripke
On Fri, Dec 30, 2005 at 10:22:16AM -0500, Dave Dodge wrote:
> On Fri, Dec 30, 2005 at 11:36:13AM -0200, Roberto Ierusalimschy wrote:
> > > I think the more important addition would be an easy Lua way to set the 
> > > locale to use the UTF8 encoding.
> > 
> > os.setlocale("UTF-8") ?
> 
> Locale names other than "C" and "" are implementation-defined.  For
> example on Solaris 8 I believe the recommended locale name for this
> would be "en_US.UTF-8".
For sorting you would probably want to use something like
os.setlocale("UTF-8", "ctype")
os.setlocale("de_PHONEBOOK", "collate")

While in theory the collations are well defined
by unicode.org, in practice you face two problems:
a) the definitions are updated every now and then
b) the correctness and completeness of implementations varies
In other words, it's about as portable as XML.

So for persistent data like a database index which relies upon
being sorted, using libc's strcoll means asking for trouble.
You are better off linking against a specific version of ICU,
which is fairly complete and portable,
and making sure to ship that exact version with your app.
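
To illustrate in plain Lua: string comparison goes through libc's
strcoll, so LC_COLLATE affects both < and table.sort. A minimal
sketch (the locale name is only an example and may not exist on
your system):

  if os.setlocale("de_DE.UTF-8", "collate") then
    local names = { "Müller", "Muffe", "Mueller" }
    table.sort(names)  -- order now depends on libc's collation tables
    print(table.concat(names, ", "))
  end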


cheers


Re: Will Lua kernel use Unicode in the future?

Mike Pall
In reply to this post by Chris Marrin
Hi,

Chris Marrin wrote:
> I see that you can say "en-us.utf-8", but does it 
> REQUIRE a language code? And is this cross-platform?

Yes, you need a language code. But it's ignored except for things
like the monetary symbol or collation order. IMHO it's best to
only set "ctype" (LC_CTYPE environment variable) to avoid some
other NLS pitfalls (e.g. the dot vs. comma problem with numbers).
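
For example (the locale name is just an illustration; spellings vary
across platforms):

  -- setting only LC_CTYPE leaves LC_NUMERIC at "C", so tonumber()
  -- and number formatting keep using the decimal point
  os.setlocale("de_DE.UTF-8", "ctype")
  print(tonumber("3.14"))              --> 3.14
  print(os.setlocale(nil, "numeric"))  --> C (query only)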

The Unicode FAQ for Unix/Linux explains this and many more things
that have been discussed in this thread:

  http://www.cl.cam.ac.uk/~mgk25/unicode.html

And here are some opinions on the UTF-8 vs. UTF-16/UTF-32/wchar_t
debate:

  http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF
  http://www.tbray.org/ongoing/When/200x/2003/04/30/JavaStrings

My personal opinion: there is no point in using anything other than
UTF-8 (in memory and on disk). All other variants create more
problems than they solve. 'Characters' is a concept of the past.

Most apps can just treat strings as opaque byte streams. Lua is
very well suited for this. It can be augmented with Klaus' library
for the more demanding things: http://luaforge.net/projects/sln/

Only very few apps (e.g. word processors, GUI libraries) need to
take care of combining characters, glyphs, writing directions and
so on. And not surprisingly none of these apps rely on the
awful libc NLS support. So let's better forget about wchar_t.

Bye,
     Mike


Re: Will Lua kernel use Unicode in the future?

Chris Marrin
Mike Pall wrote:
> Hi,
>
> Chris Marrin wrote:
> > I see that you can say "en-us.utf-8", but does it REQUIRE a
> > language code? And is this cross-platform?
>
> Yes, you need a language code. But it's ignored except for things
> like the monetary symbol or collation order. IMHO it's best to
> only set "ctype" (LC_CTYPE environment variable) to avoid some
> other NLS pitfalls (e.g. the dot vs. comma problem with numbers).
>
> The Unicode FAQ for Unix/Linux explains this and many more things
> that have been discussed in this thread:
>
>   http://www.cl.cam.ac.uk/~mgk25/unicode.html

Yes, my recollection is that this is a total morass. So it would be nice if Lua could "simplify" it by doing the right thing in its code. Perhaps os.setlocale("UTF8") could do some special processing to set up all the right values to get the same expected (or at least well-defined) results on all platforms.

Or maybe a new os.setencoding("UTF8") call could be created?
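
In the meantime a small wrapper can paper over the platform-specific
names. A sketch (the candidate spellings are guesses and vary per
system):

  local function set_utf8_locale(category)
    -- try common spellings until the C library accepts one;
    -- os.setlocale returns nil when a name is rejected
    for _, name in ipairs{ "en_US.UTF-8", "en_US.utf8", "C.UTF-8" } do
      local ok = os.setlocale(name, category or "ctype")
      if ok then return ok end
    end
  end

  print(set_utf8_locale("ctype"))  -- e.g. "en_US.UTF-8", or nil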

--
chris marrin
[hidden email]
"As a general rule, don't solve puzzles that open portals to Hell"


Re: Will Lua kernel use Unicode in the future?

Andras Balogh
These kinds of problems should be solved at a different level, not hacked into Lua. The beautiful thing about Lua is that it's really clean ANSI C code and is not bloated with #ifdef'ed platform-dependent workaround code.


Andras

Chris Marrin wrote:
> Yes, my recollection is that this is a total morass. So it would be
> nice if Lua could "simplify" it by doing the right thing in its code.
> Perhaps os.setlocale("UTF8") could do some special processing to set
> up all the right values to get the same expected (or at least
> well-defined) results on all platforms.
>
> Or maybe a new os.setencoding("UTF8") call could be created?




Re: Will Lua kernel use Unicode in the future?

Jens Alfke
In reply to this post by Chris Marrin
(I've replied to a bunch of messages here rather than sending out six separate replies...)

On 29 Dec '05, at 9:17 AM, Chris Marrin wrote:

> It allows you to add "incidental" characters without the need for a
> fully functional editor for that language. For instance, when I
> worked for Sony we had the need to add a few characters of Kanji on
> occasion. It's not easy to get a Kanji editor set up for a western
> keyboard, so adding direct unicode was more convenient. There are
> also some oddball symbols in the upper registers for math and
> chemistry and such that are easier to add using escapes.

Also, in some projects there are guidelines that discourage the use of non-ascii characters in source files (due to problems with editors, source control systems, or other tools). In these situations it's convenient to be able to use inline escapes to specify non-ascii characters that commonly occur in human-readable text ... examples would include ellipses, curly quotes, em dashes, bullets, currency symbols, as well as accented letters of course.


[hidden email] wrote:
> IMO, with globalization, languages that don't support Unicode won't
> make the cut in the long run.

I find it ironic that the three non-Unicode-savvy languages I use (PHP, Ruby, Lua) all come from countries whose native languages use non-ascii characters :)

I'm keeping an eye on this thread because, if I end up using Lua for any actual work projects, I18N is mandatory and I can't afford to run into any walls when it comes time to make things work in Japanese or Thai or Arabic. (Not like last time...See below for the problems I've had with JavaScript regexps.)


[hidden email] wrote:
> I've done an awful lot of work with international character sets, and
> I personally consider any encoding of Unicode other than UTF-8 to be
> obsolete. Looking at most other modern (i.e. not held back by
> backwards compatibility, e.g. Windows) users of Unicode, their
> authors appear to feel similarly.

Yes, at least as an external representation. Internally, it can be convenient to use a wide representation for speed of indexing into strings, but a good string library should hide that from the client as an implementation detail. (Disclaimer: I'm mostly knowledgeable only about Java's and Mac OS X's string libraries.)


[hidden email] wrote:
> Most apps can just treat strings as opaque byte streams.

I agree, mostly; the one area I've run into problems has been with regexp/pattern libraries. Any pattern that relies on "alphanumeric characters" or "word boundaries" assumes a fair bit of Unicode knowledge behind the scenes. I ran into this problem when implementing a live search field in Safari RSS: while JS strings are Unicode-savvy, many regexp implementations aren't, so character classes like "\w" or "\b" only work on ascii alphanumerics.
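
The same pitfall is easy to demonstrate with stock Lua patterns,
which are byte-classed (the ä below is two bytes in UTF-8):

  print(string.find("xäy", "%w+"))  --> 1  1  (in the C locale the
                                    --  match stops after the "x" byte)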


[hidden email] wrote:
> These kinds of problems should be solved at a different level, not
> hacked into Lua. The beautiful thing about Lua is that it's really
> clean ANSI C code

...except for the parts that aren't, like library loading, and the extension libraries for sockets, databases, etc. I agree that having a portable core runtime is important, but there should be some kind of standard extension for Unicode strings, hopefully one that cleanly extends the built-in string objects using something like ICU.
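
A sketch of how such an extension could hook in, assuming Klaus'
selene library with its unicode.utf8 table mirroring the string API:

  local unicode = require "unicode"
  -- string methods resolve through the shared string metatable,
  -- whose __index is the string table; overlaying it makes
  -- s:len(), s:find() etc. UTF-8 aware
  for name, fn in pairs(unicode.utf8) do string[name] = fn end

  print(("xäy"):len())  --> 3 characters, though the string is 4 bytes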

--Jens


Re: Will Lua kernel use Unicode in the future?

Mike Pall
Hi,

Jens Alfke wrote:
> I agree, mostly; the one area I've run into problems has been with  
> regexp/pattern libraries. Any pattern that relies on "alphanumeric  
> characters" or "word boundaries" assumes a fair bit of Unicode  
> knowledge behind the scenes.
> [...]
> but there should be some kind  
> of standard extension for Unicode strings, hopefully one that cleanly  
> extends the built-in string objects using something like ICU.

Well, did you even take a look at Klaus' library?

$ lua -l unicode -e 'print(unicode.utf8.find("xäy", ".(%w)."))'
1       4       ä

You can see that I typed the umlaut in a UTF-8 locale because the
second number is 4 (and not 3 like in iso-8859-1 or -15).

BTW to Klaus: 5.1-beta renamed MAX_CAPTURES to LUA_MAXCAPTURES.

Bye,
     Mike


Re: Will Lua kernel use Unicode in the future?

Klaus Ripke
In reply to this post by Jens Alfke
hi

So far everybody seems to be OK with UTF-8 as the single encoding,
which notably does not require any changes to the core.

Please let me stress the importance of distinguishing between
several string-related functions:
(not a top quote, but a lengthy intro)
a)  detecting code points (i.e. decoding the bytes to abstract
    unicode numbers) is straightforward and locale independent
b)  character class in unicode is completely independent of the locale
c)  grapheme boundaries are independent of the locale
    and with a few exceptions are easily detected
d)  casing is mostly independent of the locale,
    with the main exception being the Turkish I/i issue
    (which is a design flaw in unicode, adopted for practical reasons)
e)  normalization is a little less straightforward but,
    IIRC, locale independent
f)  the single big bugger is collation.

Well, that's not quite true: things only start to get interesting
with issues related to the graphical representation, like
BiDi printing, Indic super/subscripts and more.
But that's not covered by libc anyway. See the sila project.

a)-d) are already implemented in the selene unicode package
(currently being upgraded to 5.1 string functions) for the
standard cases, and the exceptions can be added if necessary.
Adding e) is tedious but basically also straightforward
(AFAIK libc does not have normalization functions).
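
Point a) really is just a few lines of plain Lua. A sketch (Lua 5.1),
with no validation of malformed input:

  -- walk a UTF-8 string and yield one code point per character
  local function codepoints(s)
    local i, n = 1, #s
    return function()
      if i > n then return nil end
      local b = s:byte(i)
      local cp, len
      if b < 0x80 then cp, len = b, 1
      elseif b < 0xE0 then cp, len = b % 0x20, 2
      elseif b < 0xF0 then cp, len = b % 0x10, 3
      else cp, len = b % 0x08, 4 end
      for j = i + 1, i + len - 1 do
        cp = cp * 64 + s:byte(j) % 64
      end
      i = i + len
      return cp
    end
  end

  for cp in codepoints("xäy") do io.write(cp, " ") end  --> 120 228 121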

On Fri, Dec 30, 2005 at 10:32:05AM -0800, Jens Alfke wrote:
> (I've replied to a bunch of messages here rather than sending out six  
> separate replies...)
> 
> On 29 Dec '05, at 9:17 AM, Chris Marrin wrote:
> 
> >It allows you to add "incidental" characters without the need for a  
> >fully functional editor for that language. For instance, when I  
> >worked for Sony we had the need to add a few characters of Kanji on  
> >occasion. It's not easy to get a Kanji editor setup for a western  
> >keyboard, so adding direct unicode was more convenient. There are  
> >also some oddball symbols in the upper registers for math and  
> >chemistry and such that are easier to add using escapes.
> 
> Also, in some projects there are guidelines that discourage the use  
> of non-ascii characters in source files (due to problems with  
> editors, source control systems, or other tools. In these situations  
> it's convenient to be able to use inline escapes to specify non-ascii  
> characters that commonly occur in human-readable text ... examples  
> would include ellipses, curly-quotes, emdashes, bullets, currency  
> symbols, as well as accented letters of course.
OK, several means have been devised for that,
including decimal byte escape sequences and a custom U function
(which you can even beef up to turn ... into ellipses).
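
One possible shape for such a U function (plain Lua, sketch only):

  -- encode a unicode code point as a UTF-8 byte string
  local function U(cp)
    if cp < 0x80 then
      return string.char(cp)
    elseif cp < 0x800 then
      return string.char(0xC0 + math.floor(cp/64), 0x80 + cp%64)
    elseif cp < 0x10000 then
      return string.char(0xE0 + math.floor(cp/4096),
                         0x80 + math.floor(cp/64)%64, 0x80 + cp%64)
    else
      return string.char(0xF0 + math.floor(cp/262144),
                         0x80 + math.floor(cp/4096)%64,
                         0x80 + math.floor(cp/64)%64, 0x80 + cp%64)
    end
  end

  print(U(0x2211))  -- the summation sign, bytes \226\136\145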

> [hidden email] wrote:
> >IMO, with globalization, languages that don't support Unicode won't  
> >make the cut in the long run.
> 
> I find it ironic that the three non-Unicode-savvy languages I use  
> (PHP, Ruby, Lua) all come from countries whose native languages use  
> non-ascii characters :)
> 
> I'm keeping an eye on this thread because, if I end up using Lua for  
> any actual work projects, I18N is mandatory and I can't afford to run  
> into any walls when it comes time to make things work in Japanese or  
> Thai or Arabic. (Not like last time...See below for the problems I've  
> had with JavaScript regexps.)
Even in PHP it's not a matter of the language, only of the libs.

> [hidden email] wrote:
> >I've done an awful lot of work with international character sets,  
> >and I personally consider any encoding of Unicode other than UTF-8  
> >to be obsolete. Looking at most other modern (i.e. not held back by  
> >backwards compatibility, e.g. Windows) users of Unicode, their  
> >authors appear to feel similarly.
> 
> Yes, at least as an external representation. Internally, it can be  
> convenient to use a wide representation for speed of indexing into  
> strings, but a good string library should hide that from the client  
> as an implementation detail. (Disclaimer: I'm mostly knowledgeable  
> only about Java's and Mac OS X's string libraries.)
IMHO this is a misconception.
The 16-bit wide char is faster only if you really confine it
to UCS-2, i.e. only encoding the basic multilingual plane
of the first 64K characters. Beyond that there is not only
Etruscan, but also yet more Chinese symbols.
To do the real thing, you'd have to use UTF-16, checking
for surrogate pairs. Most implementations don't do that
and thus are nothing but broken hacks.
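
For the record, the surrogate arithmetic is simple enough; a sketch
using U+1D11E (the musical G clef) as the example:

  local cp = 0x1D11E - 0x10000               -- offset above the BMP
  local hi = 0xD800 + math.floor(cp / 0x400) -- high (lead) surrogate
  local lo = 0xDC00 + cp % 0x400             -- low (trail) surrogate
  print(string.format("%04X %04X", hi, lo))  --> D834 DD1E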

> [hidden email] wrote:
> >Most apps can just treat strings as opaque byte streams.
> 
> I agree, mostly; the one area I've run into problems has been with  
> regexp/pattern libraries. Any pattern that relies on "alphanumeric  
> characters" or "word boundaries" assumes a fair bit of Unicode  
> knowledge behind the scenes. I ran into this problem when  
> implementing a live search field in Safari RSS, since while JS  
> strings are Unicode-savvy, many implementations' regexp  
> implementations aren't, so character classes like "\w" or "\b" only  
> work on ascii alphanumerics.
See b) above; we have proper character classes on UTF-8.


That leaves collation as the big issue.
Again, I'd strongly suggest not relying on libc/setlocale for anything
but the most feeble uses.
In "de_DE" we have two collations, and I'd bet for each of them
there exists some libc which will pick it as the default for
setlocale("de_DE.UTF-8").
One pro of using libc is that you will be sorting bug-compatible
with your local "sort" utility, but just drop the plan to exchange
"sorted" data remotely.
"There is no collation but the one you are using."

If you really need the full Unicode Collation Algorithm,
use ICU, and especially ICU's equivalent of strxfrm.
Otherwise you might want to consider resorting to simpler and faster
means like single-level sorting (ignoring case and accents)
and/or means supporting other features, like decoding the
transformed representation back to the original or an equivalent.
A sample implementation is given in our malete database
and will be released as a Lua utility with selene.
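
A toy version of such a single-level sort key, assuming UTF-8 input
and covering only a handful of accented letters:

  -- fold case and a few accents into a crude one-level key
  local fold = { ["ä"]="a", ["ö"]="o", ["ü"]="u", ["ß"]="ss",
                 ["Ä"]="a", ["Ö"]="o", ["Ü"]="u", ["é"]="e" }
  local function sortkey(s)
    -- match one UTF-8 lead byte plus its continuation bytes
    s = s:gsub("[\194-\244][\128-\191]*",
               function(c) return fold[c] or c end)
    return s:lower()
  end

  local t = { "Öl", "ohr", "Ober" }
  table.sort(t, function(a, b) return sortkey(a) < sortkey(b) end)
  print(table.concat(t, ", "))  --> Ober, ohr, Öl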


cheers


Re: Will Lua kernel use Unicode in the future?

Chris Marrin
In reply to this post by Jens Alfke
Jens Alfke wrote:
> (I've replied to a bunch of messages here rather than sending out six
> separate replies...)
>
> On 29 Dec '05, at 9:17 AM, Chris Marrin wrote:
>
> > It allows you to add "incidental" characters without the need for a
> > fully functional editor for that language. For instance, when I
> > worked for Sony we had the need to add a few characters of Kanji on
> > occasion. It's not easy to get a Kanji editor set up for a western
> > keyboard, so adding direct unicode was more convenient. There are
> > also some oddball symbols in the upper registers for math and
> > chemistry and such that are easier to add using escapes.
>
> Also, in some projects there are guidelines that discourage the use
> of non-ascii characters in source files (due to problems with
> editors, source control systems, or other tools). In these situations
> it's convenient to be able to use inline escapes to specify non-ascii
> characters that commonly occur in human-readable text ... examples
> would include ellipses, curly quotes, em dashes, bullets, currency
> symbols, as well as accented letters of course.

But we're moving into an era where support of English text (the only language in existence that fits into ASCII) is not good enough. I used to work at Sony, where this issue is magnified about a million times compared to "western" languages. The notion of "optional" support for non-ascii characters was never acceptable.



> [hidden email] wrote:
> > IMO, with globalization, languages that don't support Unicode won't
> > make the cut in the long run.
>
> I find it ironic that the three non-Unicode-savvy languages I use
> (PHP, Ruby, Lua) all come from countries whose native languages use
> non-ascii characters :)

But I think Lua mostly does a great job of supporting Unicode with its agnostic approach, because it allows UTF8 sequences to pass through unchallenged... mostly. The few rough edges being discussed here are really minor. Escaping unicode would simply make it more practical to handle the full unicode range. Making it easier to tell the underlying clib to use UTF8 for collating avoids putting platform-specific code outside Lua. And allowing non-ascii character identifiers takes this restriction away from non-English speakers. Small changes, just to tighten up the i18n support.

--
chris marrin
[hidden email]
"As a general rule, don't solve puzzles that open portals to Hell"


Re: Will Lua kernel use Unicode in the future?

Philippe Lhoste
In reply to this post by Chris Marrin
Chris Marrin wrote:
> you could enter this:
>
>     "This is a summation character: \226\136\145"
>
> and a console that understood UTF8 would show a summation character.
> But to do this I had to:
>
> - Use the Character Map utility to find the character
> - Copy that character into OpenOffice, which understands Unicode
> - Save the file as UTF8
> - Open the file as binary in MsDev
> - Hand convert the character sequence to UTF8
> - Use Calculator to convert the result into decimal
>
> It would be nice if Lua could do this for me :-)

Part of this process may be simplified using the right tools...
Perhaps something like BabelMap <http://www.babelstone.co.uk/Software/BabelMap.html> may help here. It is supposed to be able to save a Unicode sequence in UTF-8, although I am not sure this works correctly (I get a \U0001D6AA sequence, for example). But if you look at the properties of the highlighted character, it gives the UTF-8 sequence in hex. A simple Lua program can convert it to decimal...
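
The conversion itself is tiny; a sketch:

  -- turn a hex byte sequence such as "E2 88 91" (the UTF-8 bytes of
  -- the summation sign) into decimal Lua escapes
  local function escapes(hex)
    local s = hex:gsub("%x%x", function(b) return "\\" .. tonumber(b, 16) end)
    return (s:gsub("%s+", ""))  -- drop the separating spaces
  end

  print(escapes("E2 88 91"))  --> \226\136\145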

If you do this frequently, an automation tool like AutoIt or AutoHotkey can help too.

Note: the above software recommendations work only for Windows, but I suppose there are equivalent tools for other OSs.

--
Philippe Lhoste
--  (near) Paris -- France
--  http://Phi.Lho.free.fr
--  --  --  --  --  --  --  --  --  --  --  --  --  --
