Will Lua kernel use Unicode in the future?

CHU Run-min
1) Like the Windows NT kernel.
2) Lua source files can use various encodings.
3) Before parsing, the source file is converted to Unicode.

Right now a Lua string is something like binary data. Is there a better
solution to this problem, the string encoding problem?



Re: Will Lua kernel use Unicode in the future?

Klaus Ripke
hi

On Thu, Dec 29, 2005 at 06:26:26PM +0800, CHU Run-min wrote:
> 1) Like the Windows NT kernel.
> 2) Lua source files can use various encodings.
> 3) Before parsing, the source file is converted to Unicode.
A general recoding facility is something way too fat
to be integrated at core level.

Preferred way to handle a collection of scripts
based on different encodings is to convert
them to Unicode (UTF-8) while installing
(using external tools like iconv or recode).

> Right now a Lua string is something like binary data. Is there a better
> solution to this problem, the string encoding problem?
The Lua "kernel" is agnostic to encodings, so there is no problem.
For string operations on UTF-8 see
http://lua-users.org/wiki/LuaUnicode
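
As a small illustration of what "agnostic to encodings" means in practice, the sketch below treats a UTF-8 encoded character as plain bytes. The variable names are made up for the example, and decimal escapes are used so the result does not depend on how this file itself is encoded.

```lua
-- U+03C0 (Greek small letter pi) is the two bytes 0xCF 0x80 in UTF-8.
local pi = "\207\128"

-- The length operator and string.byte see bytes, not characters.
local byte_count = #pi
local b1, b2 = string.byte(pi, 1, 2)

-- Concatenation and equality are byte-exact, so UTF-8 data passes
-- through the core untouched.
local label = "pi: " .. pi
local round_trip = (label:sub(5) == pi)

print(byte_count, b1, b2, round_trip)
```

The core never inspects the bytes, which is why UTF-8 text survives storage, concatenation and comparison for equality without any special support.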


Re: Will Lua kernel use Unicode in the future?

Lisa Parratt

On 29 Dec 2005, at 11:06, Klaus Ripke wrote:
> http://lua-users.org/wiki/LuaUnicode

A few observations, reading this page:

Lack of "\U+1234" style unicode character escapes - it strikes me that the code to isolate such an escape, and then convert it to an 8 bit string would only take a few lines of code. Is there a good theological reason why this isn't supported?

Inability to use UTF-8 identifiers due to use of isalpha and isalnum - surely it would be better to use hardcoded functions for determining if the characters in an identifier are valid? Otherwise there will be potential locale issues anyway. Locales should apply to human languages, not computer languages!

Unicode string comparison and normalisation issues - I might be being forgetful, but I was under the impression C99 added Unicode compliant wide character comparison functions - perhaps these should be used if present?

--
Lisa



Re: Will Lua kernel use Unicode in the future?

Alex Queiroz
Hallo,

On 29/12/05, Lisa Parratt <[hidden email]> wrote:
>
> On 29 Dec 2005, at 11:06, Klaus Ripke wrote:
> > http://lua-users.org/wiki/LuaUnicode
>
> A few observations, reading this page:
>
> Lack of "\U+1234" style unicode character escapes - it strikes me
> that the code to isolate such an escape, and then convert it to an 8
> bit string would only take a few lines of code. Is there a good
> theological reason why this isn't supported?
>

     Maybe because it'd give the wrong impression that Lua supports Unicode.

--
-alex
http://www.ventonegro.org/



Re: Will Lua kernel use Unicode in the future?

Klaus Ripke
In reply to this post by Lisa Parratt
On Thu, Dec 29, 2005 at 11:38:57AM +0000, Lisa Parratt wrote:
> 
> On 29 Dec 2005, at 11:06, Klaus Ripke wrote:
> >http://lua-users.org/wiki/LuaUnicode
> 
> A few observations, reading this page:
> 
> Lack of "\U+1234" style unicode character escapes - it strikes me  
> that the code to isolate such an escape, and then convert it to an 8  
> bit string would only take a few lines of code. Is there a good  
> theological reason why this isn't supported?
It would require the parser to settle for a given encoding
like UTF-8 or UCS2 or UTF16 or ...
OTOH a preprocessing step either at build time or
as a load hook could do this and much more.
Personally I prefer to have my editor produce UTF-8.
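
A load hook of the kind mentioned here could look roughly like the sketch below. Everything in it is an assumption made for illustration: the names utf8_encode, expand_escapes and load_with_escapes are hypothetical, the escape syntax is the "\U+1234" form from this thread, and the expansion is applied to the whole source text rather than only inside string literals. Surrogate code points are not rejected.

```lua
local floor = math.floor

-- Encode one code point (0..0x10FFFF) as a UTF-8 byte string.
local function utf8_encode(cp)
  if cp < 0x80 then
    return string.char(cp)
  elseif cp < 0x800 then
    return string.char(0xC0 + floor(cp / 0x40), 0x80 + cp % 0x40)
  elseif cp < 0x10000 then
    return string.char(0xE0 + floor(cp / 0x1000),
                       0x80 + floor(cp / 0x40) % 0x40,
                       0x80 + cp % 0x40)
  else
    return string.char(0xF0 + floor(cp / 0x40000),
                       0x80 + floor(cp / 0x1000) % 0x40,
                       0x80 + floor(cp / 0x40) % 0x40,
                       0x80 + cp % 0x40)
  end
end

-- Rewrite every \U+XXXX in the source into decimal escapes (\ddd) that
-- the unmodified Lua lexer already understands.
local function expand_escapes(source)
  return (source:gsub("\\U%+(%x+)", function(hex)
    local cp = tonumber(hex, 16)
    if cp > 0x10FFFF then return nil end  -- leave out-of-range text alone
    return (utf8_encode(cp):gsub(".", function(c)
      return string.format("\\%03d", string.byte(c))
    end))
  end))
end

-- loadstring exists in Lua 5.1; later versions accept strings in load.
local compile = loadstring or load

local function load_with_escapes(source, chunkname)
  return compile(expand_escapes(source), chunkname)
end

local chunk = assert(load_with_escapes('return "sum: \\U+2211"', "=demo"))
local result = chunk()
print(result == "sum: \226\136\145")
```

Because the rewrite happens before compilation, the parser itself stays unaware of any encoding; the choice of UTF-8 lives entirely in the hook.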

> Inability to use UTF-8 identifiers due to use of isalpha and isalnum  
> - surely it would be better to use  hardcoded functions for  
> determining if the characters in an identifier are valid? Otherwise  
> there will be potential locale issues anyway. Locales should apply to  
> human languages, not computer languages!
Agreed (although this is not Unicode related).

> Unicode string comparison and normalisation issues - I might be being  
> forgetful, but I was under the impression C99 added Unicode compliant  
> wide character comparison functions - perhaps these should be used if  
> present?
You might not want to use wide chars at all
(there are pros and cons compared to using UTF-8 internally).
For UTF-8 good old strcoll/strxfrm (hence Lua) does the job,
with appropriate locale settings.
Anyway many consider the "locale" mechanism broken,
and a full implementation of the unicode collation algorithm
has to be quite expensive.
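
To make the role of the collation locale concrete, here is a minimal sketch that pins the "collate" category to "C", the one locale name every implementation accepts, and sorts a few strings. The word list is invented for the example. With a UTF-8 locale selected instead, the order could differ, but which locale names exist depends on the system.

```lua
-- Query the current collation locale without changing it.
local previous = os.setlocale(nil, "collate")

-- Under "C", strcoll orders strings by byte value.
assert(os.setlocale("C", "collate"))

-- The last entry is "ete" with two acute accents, written as UTF-8 bytes.
local words = { "zebra", "Zulu", "apple", "\195\169t\195\169" }
table.sort(words)  -- uses the < operator, which goes through strcoll

print(table.concat(words, " "))

-- Restore whatever was in effect before.
if previous then os.setlocale(previous, "collate") end
```

Byte order puts "Zulu" before "apple" and the accented word last, which is rarely what a human reader expects; that gap is what a locale-aware collation is meant to close.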


regards


Re: Will Lua kernel use Unicode in the future?

Chris Marrin
Klaus Ripke wrote:
> On Thu, Dec 29, 2005 at 11:38:57AM +0000, Lisa Parratt wrote:
> > On 29 Dec 2005, at 11:06, Klaus Ripke wrote:
> > > http://lua-users.org/wiki/LuaUnicode
> >
> > A few observations, reading this page:
> >
> > Lack of "\U+1234" style unicode character escapes - it strikes me that the code to isolate such an escape, and then convert it to an 8 bit string would only take a few lines of code. Is there a good theological reason why this isn't supported?
>
> It would require the parser to settle for a given encoding
> like UTF-8 or UCS2 or UTF16 or ...
> OTOH a preprocessing step either at build time or
> as a load hook could do this and much more.
> Personally I prefer to have my editor produce UTF-8.

But it would not be hard for Lua to support this form, in addition to the other currently supported escape sequences. Translating "U+1234" into a UTF8 sequence is a very small snippet of code. This does not technically lock Lua into the UTF8 encoding. It merely translates this escape sequence to a series of characters that happen to be UTF8. Splitting hairs???

...
> > Unicode string comparison and normalisation issues - I might be being forgetful, but I was under the impression C99 added Unicode compliant wide character comparison functions - perhaps these should be used if present?
>
> You might not want to use wide chars at all
> (there are pros and cons compared to using UTF-8 internally).
> For UTF-8 good old strcoll/strxfrm (hence Lua) does the job,
> with appropriate locale settings.
> Anyway many consider the "locale" mechanism broken,
> and a full implementation of the unicode collation algorithm
> has to be quite expensive.

But this is an area of the spec that needs to be tightened up. Currently it says "strings are compared in the usual way". What does that mean? As it turns out, Lua uses strcoll(), which is good. That means it will do a string comparison that is accurate according to the currently set encoding. If you set the encoding to UTF8, then you will have full UTF8 support. Of course you would need the \U escape as well!

I think the biggest problem here is that setting the locale can't be done from Lua, AFAIK. It would be nice to have a platform independent way to do that.

--
chris marrin                ,""$,
[hidden email]          b`    $                             ,,.
                        mP     b'                            , 1$'
        ,.`           ,b`    ,`                              :$$'
     ,|`             mP    ,`                                       ,mm
   ,b"              b"   ,`            ,mm      m$$    ,m         ,`P$$
  m$`             ,b`  .` ,mm        ,'|$P   ,|"1$`  ,b$P       ,`  :$1
 b$`             ,$: :,`` |$$      ,`   $$` ,|` ,$$,,`"$$     .`    :$|
b$|            _m$`,:`    :$1   ,`     ,$Pm|`    `    :$$,..;"'     |$:
P$b,      _;b$$b$1"       |$$ ,`      ,$$"             ``'          $$
 ```"```'"    `"`         `""`        ""`                          ,P`
"As a general rule,don't solve puzzles that open portals to Hell"'


Re: Will Lua kernel use Unicode in the future?

Roberto Ierusalimschy
> Translating "U+1234" into a UTF8 sequence is a very small snippet of code.

Sorry for my ignorance, but why are such escape sequences so necessary? Are
there so many control chars in Unicode? (For non-control chars, a
UTF8-aware editor should be able to put the proper encoding inside the
string, is that right?)

-- Roberto


Re: Will Lua kernel use Unicode in the future?

Chris Marrin
Roberto Ierusalimschy wrote:
> > Translating "U+1234" into a UTF8 sequence is a very small snippet of code.
>
> Sorry for my ignorance, but why are such escape sequences so necessary? Are
> there so many control chars in Unicode? (For non-control chars, a
> UTF8-aware editor should be able to put the proper encoding inside the
> string, is that right?)

It allows you to add "incidental" characters without the need for a fully functional editor for that language. For instance, when I worked for Sony we had the need to add a few characters of Kanji on occasion. It's not easy to get a Kanji editor setup for a western keyboard, so adding direct unicode was more convenient. There are also some oddball symbols in the upper registers for math and chemistry and such that are easier to add using escapes.



Re: Will Lua kernel use Unicode in the future?

Asko Kauppi

..keeping my fingers out of the hot oven, but..

local str= "This might be in Kanji: "..U(1234)..U(5678).." is that enough? :)"

I don't oppose the \U, no I don't. If they do it, also \x (hex input) should be in..

-ak


Chris Marrin wrote on 29.12.2005 at 19.17:

> Roberto Ierusalimschy wrote:
> > > Translating "U+1234" into a UTF8 sequence is a very small snippet of code.
> > Sorry for my ignorance, but why are such escape sequences so necessary? Are
> > there so many control chars in Unicode? (For non-control chars, a
> > UTF8-aware editor should be able to put the proper encoding inside the
> > string, is that right?)
>
> It allows you to add "incidental" characters without the need for a fully functional editor for that language. For instance, when I worked for Sony we had the need to add a few characters of Kanji on occasion. It's not easy to get a Kanji editor setup for a western keyboard, so adding direct unicode was more convenient. There are also some oddball symbols in the upper registers for math and chemistry and such that are easier to add using escapes.




Re: Will Lua kernel use Unicode in the future?

Roberto Ierusalimschy
In reply to this post by Chris Marrin
> It allows you to add "incidental" characters without the need for a 
> fully functional editor for that language.

Thanks for the explanation. Anyway, if the point is "incidental" use,
those who need it can define an auxiliary function and write code like
this:

  x = U"this is pi: U+003C0"

with U defined appropriately:

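function U (s)
  -- "%+" matches a literal plus sign; a bare "+" is a pattern quantifier
  return (string.gsub(s, "U%+(%x+)", function (n)
    n = tonumber(n, 16)
    if n <= 127 then return string.char(n)
    elseif n <= 2047 then
      return string.char(192 + math.floor(n/64), 128 + n%64)
    -- elseif ... (longer sequences elided in the original post)
    end
  end))
end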
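
The post leaves the branches for longer sequences elided. One way to fill them in, assuming the standard UTF-8 bit layout, is sketched below. The helper name encode and the upper bound check are additions for illustration and are not part of the original post; surrogate code points are not rejected.

```lua
local floor = math.floor

-- Encode a single code point as 1 to 4 UTF-8 bytes.
local function encode(n)
  if n <= 0x7F then
    return string.char(n)
  elseif n <= 0x7FF then
    return string.char(0xC0 + floor(n / 0x40), 0x80 + n % 0x40)
  elseif n <= 0xFFFF then
    return string.char(0xE0 + floor(n / 0x1000),
                       0x80 + floor(n / 0x40) % 0x40,
                       0x80 + n % 0x40)
  else
    return string.char(0xF0 + floor(n / 0x40000),
                       0x80 + floor(n / 0x1000) % 0x40,
                       0x80 + floor(n / 0x40) % 0x40,
                       0x80 + n % 0x40)
  end
end

-- Replace every "U+<hex>" in s by the UTF-8 bytes of that code point.
local function U(s)
  return (string.gsub(s, "U%+(%x+)", function(hex)
    local n = tonumber(hex, 16)
    -- Returning nothing keeps the original text for out-of-range values.
    if n <= 0x10FFFF then return encode(n) end
  end))
end

local x = U"this is pi: U+003C0"
print(x == "this is pi: \207\128")
```

Note that "%x+" is greedy, so hex digits that directly follow the code point in the text would be swallowed into it; a fixed-width form would avoid that at the cost of more typing.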

-- Roberto



Re: Will Lua kernel use Unicode in the future?

whisper-4
IMO, with globalization, languages that don't support Unicode won't make the cut in the long run.

Too bad - I like Lua.

Dave LeBlanc
Seattle, WA USA



Re: Will Lua kernel use Unicode in the future?

Dave Dodge
In reply to this post by Lisa Parratt
On Thu, Dec 29, 2005 at 11:38:57AM +0000, Lisa Parratt wrote:
> but I was under the impression C99 added Unicode compliant  
> wide character comparison functions

The Unicode compliance is optional.

If __STDC_ISO_10646__ is defined, it indicates that the wchar_t type
contains character values that match up with ISO/IEC 10646.  The value
of __STDC_ISO_10646__ tells you the year and month of compliance (any
newer amendments to ISO/IEC 10646 might not be handled).  For example
my Linux system with glibc/gcc defines this as 200009 (September
2000).

C99 adds the \unnnn and \Unnnnnnnn escapes as well, though not all
compilers have completely implemented this.  For example gcc can
handle them in wide character constants, but not in identifiers.

                                                  -Dave Dodge


Re: Will Lua kernel use Unicode in the future?

Chris Marrin
In reply to this post by Asko Kauppi
Asko Kauppi wrote:

> ..keeping my fingers out of the hot oven, but..
>
> local str= "This might be in Kanji: "..U(1234)..U(5678).." is that enough? :)"

Yes, that would of course work, too. But that's lots of noise compared to: "This might be in Kanji:\u+1234\u+5678", IMHO.

Note that, according to the unicode spec 'u' and 'U' are not equivalent. 'u' expects 4 hex digits, allowing for a UCS-2 character, while 'U' expects 8 hex digits, for a UCS-4 character. But that's just for parsing the string. In UTF8 \U+00001234 and \u+1234 produce the same sequence.



> I don't oppose the \U, no I don't. If they do it, also \x (hex input) should be in..

Agreed, although Lua already supports '\123' for decimal literals. Unlike C, the number here is interpreted as decimal rather than octal. Not a bad diff, considering how obsolete octal is! Hex notation would merely be a convenience for representing single characters. The nice thing about \u is that it actually does some grungy work for you. Today you could enter this:

    "This is a summation character: \226\136\145"

and a console that understood UTF8 would show a summation character. But to do this I had to:

- Use the Character Map utility to find the character
- Copy that character into OpenOffice, which understands Unicode
- Save the file as UTF8
- Open the file as binary in MsDev
- Hand convert the character sequence to UTF8
- Use Calculator to convert the result into decimal

It would be nice if Lua could do this for me :-)
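
The last two steps of the list above, turning UTF-8 bytes into decimal escapes, can be handed to Lua itself. The sketch below assumes the character has already been obtained as UTF-8 bytes, for example pasted from a UTF-8 aware tool; the function name decimal_escapes is made up for the example.

```lua
-- Turn every byte of s into a three-digit decimal escape, ready to be
-- pasted into a Lua string literal.
local function decimal_escapes(s)
  return (s:gsub(".", function(c)
    return string.format("\\%03d", string.byte(c))
  end))
end

-- The summation character U+2211, given here by its UTF-8 bytes.
local summation = "\226\136\145"
local escaped = decimal_escapes(summation)
print(escaped)
```

Padding to three digits keeps the escapes unambiguous when the next character in the literal happens to be a digit.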



Re: Will Lua kernel use Unicode in the future?

Asko Kauppi
In reply to this post by whisper-4

Lua definitely is a 21st century language, and all for global access if anything. :)

But it's also conservative in its approach to "extensions", as we all know. I had this UTF-8 issue a while back with LuaX, and not much needed to be changed in the core (if anything?) to keep me happy there. Follow the wiki information provided during this thread, and look for the Selene package (that's what I used, but the details are in the dustbin).

I think many people are using Lua with UTF-8 daily without having major problems, am I right?

-asko


[hidden email] wrote on 30.12.2005 at 1.30:

> IMO, with globalization, languages that don't support Unicode won't make the cut in the long run.
>
> Too bad - I like Lua.
>
> Dave LeBlanc
> Seattle, WA USA




Re: Will Lua kernel use Unicode in the future?

Klaus Ripke
In reply to this post by Chris Marrin
On Thu, Dec 29, 2005 at 04:10:26PM -0800, Chris Marrin wrote:
>     "This is a summation character: \226\136\145"
> 
> and a console that understood UTF8 would show a summation character. But 
> to do this I had to:
> 
> - Use the Character Map utility to find the character
> - Copy that character into OpenOffice, which understands Unicode
> - Save the file as UTF8
> - Open the file as binary in MsDev
> - Hand convert the character sequence to UTF8
> - Use Calculator to convert the result into decimal
> 
> It would be nice if Lua could do this for me :-)
What should Lua do for you?
Use the Character Map utility?
Copy that character into OpenOffice?
Or write the one line Lua script to print the decimal
numbers corresponding to the UTF-8 representation
of a given Unicode point?
No problem, Lua does it for you.
The attached utftab prints 64 chars a line with their
base Lua string encoding; just add 0-63 to the last \128.
Run it in a UTF-8 xterm.
Use ncurses (or some extremely antisocial hacks from slnspider)
to add mouse support to print the code for any character.

However, your list pretty much makes the point that the problem
is not whether to write some u+xxx or two or three decimal codes.
You want to use either a Unicode-capable editor
or some utility to look up codes
-- as long as everybody agrees to use UTF-8 anyway, that is.
It does make a big difference though where friends of the wide char
want their wchar_t based version of Lua. Anybody?
#!/opt/selene/bin/lua
local sprintf = string.format

-- bytes a,b
-- skip invalid a=192,193
a=194
io.write(sprintf("    \\%3d\\128 ",a))
for b=128,159 do io.write(" ") end -- skip c1 controls
for b=160,191 do io.write(sprintf("%c%c",a,b)) end
io.write("\n")

for a=195,223 do -- regular
	io.write(sprintf("    \\%3d\\128 ",a))
	for b=128,191 do io.write(sprintf("%c%c",a,b)) end
	io.write("\n")
end

-- bytes a,b,c
a=224 -- skip invalid b=128,159 for a=224
for b=160,191 do
	io.write(sprintf("\\%3d\\%3d\\128 ",a,b))
	for c=128,191 do io.write(sprintf("%c%c%c",a,b,c)) end
	io.write("\n")
end

for a=225,239 do -- regular a,b,c
	for b=128,191 do
		io.write(sprintf("\\%3d\\%3d\\128 ",a,b))
		for c=128,191 do io.write(sprintf("%c%c%c",a,b,c)) end
		io.write("\n")
	end
end

Re: Will Lua kernel use Unicode in the future?

Chris Marrin
Klaus Ripke wrote:
> On Thu, Dec 29, 2005 at 04:10:26PM -0800, Chris Marrin wrote:
> >     "This is a summation character: \226\136\145"
> >
> > and a console that understood UTF8 would show a summation character. But to do this I had to:
> >
> > - Use the Character Map utility to find the character
> > - Copy that character into OpenOffice, which understands Unicode
> > - Save the file as UTF8
> > - Open the file as binary in MsDev
> > - Hand convert the character sequence to UTF8
> > - Use Calculator to convert the result into decimal
> >
> > It would be nice if Lua could do this for me :-)
>
> What should Lua do for you?
> Use the Character Map utility?
> Copy that character into OpenOffice?
> Or write the one line Lua script to print the decimal
> numbers corresponding to the UTF-8 representation
> of a given Unicode point?
> No problem, Lua does it for you.
> The attached utftab prints 64 chars a line with their
> base Lua string encoding; just add 0-63 to the last \128.
> Run it in a UTF-8 xterm.
> Use ncurses (or some extremely antisocial hacks from slnspider)
> to add mouse support to print the code for any character.
>
> However, your list pretty much makes the point that the problem
> is not whether to write some u+xxx or two or three decimal codes.
> You want to use either a Unicode-capable editor
> or some utility to look up codes
> -- as long as everybody agrees to use UTF-8 anyway, that is.
> It does make a big difference though where friends of the wide char
> want their wchar_t based version of Lua. Anybody?

I think UTF8 is a good representation (I used to be in the wchar_t camp). The utility of \U is simply convenience, doing the work of converting a hex unicode value into UTF8. There are other ways to do it, including some Lua code to do the translation.

I think the more important addition would be an easy Lua way to set the locale to use the UTF8 encoding.



Re: Will Lua kernel use Unicode in the future?

Lisa Parratt
In reply to this post by Klaus Ripke
On 30 Dec 2005, at 01:37, Klaus Ripke wrote:
> -- as long as everybody agrees to use UTF-8 anyway, that is.
> It does make a big difference though where friends of the wide char
> want their wchar_t based version of Lua. Anybody?

I've done an awful lot of work with international character sets, and I personally consider any encoding of Unicode other than UTF-8 to be obsolete. Looking at most other modern (i.e. not held back by backwards compatibility, e.g. Windows) users of Unicode, their authors appear to feel similarly.

--
Lisa



Re: Will Lua kernel use Unicode in the future?

Roberto Ierusalimschy
In reply to this post by Chris Marrin
> I think the more important addition would be an easy Lua way to set the 
> locale to use the UTF8 encoding.

os.setlocale("UTF-8") ?

-- Roberto


Re: Will Lua kernel use Unicode in the future?

Dave Dodge
On Fri, Dec 30, 2005 at 11:36:13AM -0200, Roberto Ierusalimschy wrote:
> > I think the more important addition would be an easy Lua way to set the 
> > locale to use the UTF8 encoding.
> 
> os.setlocale("UTF-8") ?

Locale names other than "C" and "" are implementation-defined.  For
example on Solaris 8 I believe the recommended locale name for this
would be "en_US.UTF-8".
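
Since the names are implementation-defined, a script can only try a few candidates and see which one the C library accepts. The sketch below does that for the "ctype" category. The candidate list is an assumption: "en_US.UTF-8" and "UTF-8" come from this thread, "C.UTF-8" is a name some systems provide, and none of them is guaranteed to exist. The function name set_first_available is made up for the example.

```lua
-- Try each locale name in turn; return the name the C library reports
-- for the first one it accepts, or nil if none is accepted.
local function set_first_available(names, category)
  for _, name in ipairs(names) do
    local accepted = os.setlocale(name, category)
    if accepted then return accepted end
  end
  return nil
end

-- Remember the current setting so it can be restored afterwards.
local previous = os.setlocale(nil, "ctype")

local chosen = set_first_available(
  { "en_US.UTF-8", "C.UTF-8", "UTF-8" }, "ctype")
print(chosen or "no UTF-8 locale found; still using " .. tostring(previous))

if previous then os.setlocale(previous, "ctype") end
```

os.setlocale returns nil when the request cannot be honoured, so a failed attempt leaves the previous locale in place and the loop simply moves on.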

                                                  -Dave Dodge


Re: Will Lua kernel use Unicode in the future?

Chris Marrin
In reply to this post by Roberto Ierusalimschy
Roberto Ierusalimschy wrote:
> > I think the more important addition would be an easy Lua way to set the locale to use the UTF8 encoding.
>
> os.setlocale("UTF-8") ?

Yes, I suppose I should try that. Looking at the implementation it looks like it should work. But the documentation for setlocale on Win32 says nothing about the string "UTF-8" being supported. And even in Unix docs, it's pretty sketchy. I see that you can say "en-us.utf-8", but does it REQUIRE a language code? And is this cross-platform?

Anyway, if this can work across platforms, that's great!
