The World According to Lua: How To?


The World According to Lua: How To?

Petite Abeille
Hello,

I'm trying to write a full-fledged application in Lua... questions related to I18N (aka internationalization) come knocking on my door more and more insistently... and I have to pitifully admit that I'm at a total loss as to how to even start to handle any of them in Lua's World...

First and foremost, character set encoding... I have been through the entire mailing list back and forth, but still no luck... Lua is rumored to be "8bit clean"... fine... but what does that mean as far as character sets go? Bits and bytes (clean or not) are utterly useless if I don't know what character set they actually represent in practice. What character set does Lua use effectively?!?!? How do I find out about it? How do I convert things back and forth to it? How do I set it in the first place?

When I do aFile:read( "*all" ), what do I get back as far as character sets go? My OS does have a default character set encoding, but how do I know about it? When I do aFile:write( aContent ), what am I writing?!? What does string.byte() return? In what encoding?

Then there is the fabled "setlocale" and its related idiosyncrasies... The Man hints that this may very well be the answer to many unanswered questions... but... how precisely does setlocale relate to character set encoding, if at all? How do I tell Lua that everything I want to deal with is UTF-8 encoded and that's it?!?!

Then there is the issue of setlocale scope... does it impact the entire VM? How do I handle several locales concurrently? For instance, let's assume that my application displays its data according to HTTP's Accept-Language header. One request is in de_DE, the next one in fr_FR and so on, while the application's default language is en_US. How does all this fit together?

In short, I'm at a loss. Could any kind soul point me to any coherent resources which may shed some sense on any of this?

Thanks in advance.

Cheers

--
PA, Onnay Equitursay
http://alt.textdrive.com/



Re: The World According to Lua: How To?

Klaus Ripke
Hi

On Friday 18 February 2005 09:06, PA wrote:
> related to I18N (aka internationalization) come knocking on my door
... with an axe, I assume

> a total loss on how to even start to handle any of them in Lua's
> World...
here we go

> First and foremost, character set encoding... I have been through the
> entire mailing list back and forth, but still no luck... Lua is rumored
> to be "8bit clean"... fine... but what does that mean as far as
> character set goes?
It means that it can pass around data in any charset you like.
No less and no more (than "passing around").

> Bits and bytes (clean or not) are utterly useless
> if I don't know what character set they do represent in practice.
unclean they're even less useful

> character set does Lua use effectively?!?!? How do I find out about it?
Lua uses the locale sensitive single-byte C API.
So you find out in the C89/C90 standard documents.
(e.g. http://danpop.home.cern.ch/danpop/ansi.c)

There is no builtin way to usefully deal with multibyte charsets;
however, extensions like my UTF-8 stuff can use the builtin string type
without any trouble.

There is also absolutely no recoding support built in;
you will get just whatever character coding came in from your files
or across the wire. Yet, a recoding extension is scheduled,
and wrapping it around standard files and sockets to make them
appear magically recoded is no big deal.
For a truly i18n app it's probably easiest to always use UTF-8
internally.

The behaviour of your single-byte charset, however,
is affected in a few places by your locale setting,
most importantly the meaning of character classes
and the sorting order in string comparison:

$ echo 'os.setlocale("C") print(string.find(string.char(193),"%a"))'|lua
nil
$ echo 'os.setlocale("en_US.ISO-8859-1") print(string.find(string.char(193),"%a"))'|lua
1       1
$ echo 'os.setlocale("en_US.UTF-8") print(string.find(string.char(193),"%a"))'|lua
nil

Given the single-byte API, you cannot match anything
interesting in UTF-8 (or Big5 ...). Use the UTF-8 extension.
string.len and string.sub will work OK for any single-byte charset,
but the latter will happily kill multi-byte encodings.
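
For illustration, a quick sketch of that byte-vs-character gap; the utf8_len
helper below is ad hoc, not something from the stock libraries, and it assumes
the input is valid UTF-8:

 local s = "na\195\175ve"             -- "naïve" spelled out as UTF-8 bytes
 print(string.len(s))                 --> 6 bytes, although there are only 5 characters

 -- count characters by skipping UTF-8 continuation bytes (128-191)
 local function utf8_len(str)
   local count = 0
   for i = 1, string.len(str) do
     local b = string.byte(str, i)
     if b < 128 or b >= 192 then      -- plain ASCII or a lead byte
       count = count + 1
     end
   end
   return count
 end
 print(utf8_len(s))                   --> 5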

> How do I convert things back and forth to it?
wait for the encoding extension

> How do I set it in the first place?
see above

> When I do aFile:read( "*all" ), what do I get back as far as character
> set goes?
whatever is in there

> My OS do have a default character set encoding, but how do I
> know about it?
It is ISO-8859-1 (Latin-1).

> When I do aFile:write( aContent ), what am I writing?!?
aContent
> What does string.byte() returns?
the first byte's value
> In what encoding?
Bytes don't have an encoding.

> Then there is the fabled "setlocale" and its related idiosyncrasies...
ouch
> The Man hints that this may very well be the answer to many unanswered
> questions...
not really

> but... how precisely does setlocale relates to character
> set encoding, if at all?
ctypes and collation

> How do I tell Lua that everything I want to
> deal with is UTF-8 encoded and that is it?!?!
you don't - Lua doesn't care. Use the extension.

> Then there is the issue of setlocale scope... does it impact the entire
> VM?
yep

> How do I handle several locales concurrently?
you don't
well, you may switch back and forth,
but don't try to do this in a multithreaded app.

C API locale support is just braindead.
It was meant to enable NLS in existing applications which had been
written without being aware of these issues.
Since NLS is not that simple, it didn't work out.

> For instance, lets
> assume that my application display its data according to HTTP's
> Accept-Language header. One request is in de_DE, the next one in fr_FR
> and so on, while the application default language is en_US. How does
> all this fit together?
It doesn't - you just don't care.
First, all of these are using Latin-1 anyways.
Pick one of "ja" and "oui" and "yerpo".
Second, set the document's content type to "text/html; charset=ISO-8859-1"
in your webserver config and better also in the document's header.
Use other charsets, including UTF-8, accordingly.
If you happen to have russian text in KOI on your server,
you ought to know that.

> Could any kind soul point me to any coherent 
> resources which may shed some sense on any of this?
http://www.i18nguy.com/

cheers



Re: The World According to Lua: How To?

Petite Abeille

On Feb 18, 2005, at 11:30, Klaus Ripke wrote:

It means that it can pass around data in any charset you like.
No less and no more (than "passing around").

Hmmm... right. Not that helpful, is it?

Lua uses the locale sensitive single-byte C API.
So you find out in the C89/C90 standard documents.
(e.g. http://danpop.home.cern.ch/danpop/ansi.c)

Ok.

There is no builtin way to usefully deal with multibyte charsets,
however, extensions like my UTF-8 stuff can use the builtin string type
without any trouble.

In other words, Lua itself is not Unicode safe one way or another?

There also is absolutely no recoding support builtin,
you will get just whatever character coding came in from your files
or accross the wire. Yet, a recoding extension is scheduled,
and wrapping it around standard files and sockets to make them
appear magically recoded is no big deal.

This is futureware, right? What about today?

For a truly i18n app it's probably easiest to always use UTF-8
internally.

This is what I would like to do, yes. How do I achieve that today with the stock Lua distribution?

How do I convert things back and forth to it?
wait for the encoding extension

Hmmm... where is that fabled "extension"? Got a link?

When I do aFile:read( "*all" ), what do I get back as far as character
set goes?
whatever is in there

Not very helpful, is it?

My OS do have a default character set encoding, but how do I
know about it?
It is ISO-8859-1 (Latin-1).

Always? ISO-8859-1? This is useless for three-fourths of the world.

Bytes don't have no encoding.

What about sequences of bytes?

but... how precisely does setlocale relates to character
set encoding, if at all?
ctypes and collation

Ok. How? I would like my application to always deal with UTF-8 internally. And convert everything and anything coming its way to it. How?

How do I tell Lua that everything I want to
deal with is UTF-8 encoded and that is it?!?!
you don't - Lua doesn't care. Use the extension.

You mean your recently mentioned UTF-8 library? This is the extension you are talking about? Is Lua itself going to ever support Unicode directly one way or another?

Then there is the issue of setlocale scope... does it impact the entire
VM?
yep

Bummer :/


How do I handle several locales concurrently?
you don't
well, you may switch back and forth,
but don't try to do this in a multithreaded app.

C API locale support is just braindead.
It was meant to enable NLS in existing applications which had been
written without being aware of these issues.
Since NLS is not that simple, it didn't work out.

Ok. Any alternatives? How do people deal with locales then? Just pretend they are not there?


For instance, lets
assume that my application display its data according to HTTP's
Accept-Language header. One request is in de_DE, the next one in fr_FR
and so on, while the application default language is en_US. How does
all this fit together?
It doesn't - you just don't care.

Hmmm... what if I do care?

First, all of these are using Latin-1 anyways.

Latin-1 doesn't work for my potential Japanese users.

Pick one of "ja" and "oui" and "yerpo".
Second, set the document's content type to "text/html; charset=ISO-8859-1"
in your webserver config and better also in the documents header.

My webserver is my application. There is no Deus ex Machina.

Use other charsets, including UTF-8, accordingly.
If you happen to have russian text in KOI on your server,
you ought to know that.

Sorry, you totally lost me here.

http://www.i18nguy.com/

I'm not interested in i18n in general. Only specifics on how to deal with it in Lua.

In any case, what seems to emerge from all this mess is that Lua is simply not ready for prime time as far as i18n goes. Is that a fair assessment or did I miss something obvious as usual?

Thanks.

Cheers

--
PA, Onnay Equitursay
http://alt.textdrive.com/



Re: The World According to Lua: How To?

Glenn Maynard
A tip: using a real name on technical lists will tend to get you
a better response.

On Fri, Feb 18, 2005 at 11:53:03AM +0100, PA wrote:
> >For a truly i18n app it's probably easiest to always use UTF-8
> >internally.
> 
> This is what I would like to do, yes. How do I achieve that today with 
> the stock Lua distribution?

Probably by having the application that's using Lua do the appropriate
conversion to UTF-8 at the entry/exit points, and using setlocale()
to change the locale to UTF-8.
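
For example, a recoding wrapper at the entry points could look roughly like
this; the iconv binding assumed here (iconv.new/cd:iconv, in the style of a
lua-iconv module) is not part of the stock distribution, so treat it as a
sketch:

 local iconv = require("iconv")

 -- recode text from a known charset into UTF-8 before the rest of the
 -- application ever sees it; exit points would be wrapped the same way
 local function to_utf8(text, from_charset)
   local cd = iconv.new("UTF-8", from_charset)
   local converted, err = cd:iconv(text)
   if err then return nil, err end
   return converted
 end

 local latin1_body = "caf\233"                 -- "café" as ISO-8859-1 bytes
 print(to_utf8(latin1_body, "ISO-8859-1"))     --> "café" as UTF-8 bytes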

> >>My OS do have a default character set encoding, but how do I
> >>know about it?
> >It is ISO-8859-1 (Latin-1).
> 
> Always? ISO-8859-1? This is useless for three fourth of world.

No, it's not always ISO-8859-1.  (Don't know where he got that answer from.)
In Unix, it depends on the locale; in Windows, it's the ANSI codepage.  My
encoding is UTF-8, not ISO-8859-1.  The encoding on my Windows machine is
currently CP932 (Shift-JIS).
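
A rough way to ask the environment what it claims, from Lua itself (only a
sketch; the codeset, when the locale names one, is whatever follows the dot):

 -- query the current locale without changing it: a nil locale argument
 -- is a pure query in the underlying C setlocale() call
 local current = os.setlocale(nil, "ctype")
 print(current)                                -- e.g. "en_US.UTF-8" or "C"
 local _, _, codeset = string.find(current or "", "%.(.+)$")
 print(codeset)                                -- e.g. "UTF-8", or nil if unnamed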

> >>How do I tell Lua that everything I want to
> >>deal with is UTF-8 encoded and that is it?!?!
> >you don't - Lua doesn't care. Use the extension.
> 
> You mean your recently mentioned UTF-8 library? This is the extension 
> you are talking about? Is Lua itself going to ever support Unicode 
> directly one way or another?

"Support" in what way?  What operations do you want to do?  You can do things
like substring searches on UTF-8 without any extra code.  Regexes/gsub is
harder (eg. "[äï]" should be a set of two letters, regardless of encoding).
I don't know if that's supported--it would come at a performance and
portability cost.
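
A small sketch of the difference; the byte escapes spell out UTF-8 literals so
the example does not depend on the encoding this mail happens to be in:

 local haystack = "der gr\195\188ne Baum"              -- "der grüne Baum" in UTF-8
 -- plain substring search: the needle's byte sequence matches exactly,
 -- so multibyte text needs no special handling
 print(string.find(haystack, "gr\195\188ne", 1, true)) --> 5  10
 -- a byte-oriented character class, however, refuses the "ü" bytes
 print(string.find(haystack, "gr%a*"))                 --> 5  6 under the C locale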

> >>For instance, lets
> >>assume that my application display its data according to HTTP's
> >>Accept-Language header. One request is in de_DE, the next one in fr_FR
> >>and so on, while the application default language is en_US. How does
> >>all this fit together?
> >It doesn't - you just don't care.
> 
> Hmmm... what if I do care?

In this case, you'd probably want an interface to iconv.  Accepting multiple
tagged character sets in a single application is fairly atypical for i18n
support: most applications only need to be able to deal with the language of
the user.  Web browsers, mailers, etc. need more than that.

(If I was using Lua to extend a web browser, though, I'd probably convert the
text to UTF-8 before giving it to Lua, and not make Lua do the conversion.)

> >Pick one of "ja" and "oui" and "yerpo".
> >Second, set the document's content type to "text/html; 
> >charset=ISO-8859-1"
> >in your webserver config and better also in the documents header.
> 
> My webserver is my application. There is no Deus ex Machina.

One generally extends a webserver with Lua--one doesn't write one *in* Lua.  :)

> >http://www.i18nguy.com/
> 
> I'm not interested in i18n in general. Only specifics on how to deal 
> with it in Lua.

If you don't understand i18n in general, you can't really understand
how to deal with it in Lua.

-- 
Glenn Maynard


Re: The World According to Lua: How To?

Klaus Ripke
In reply to this post by Petite Abeille
moin moin

On Friday 18 February 2005 11:53, PA wrote:
> > No less and no more (than "passing around").
> Hmmm... right. Not that helpful, isn't it?
It is quite helpful!
In many applications, and especially a webserver, everything
works perfectly well if you just shuffle around the data without
touching it!
You can dish out documents written using a dozen different
charsets without any concern! As long as every document
bears the proper content-type (at least in the meta header),
the users' browsers will get it right.
In the same way you use programmatic messages from some
message catalogue (my recommendation is a cdb) and store
and echo users' form input without regard for the encoding.

So where do you have to pay attention to the encoding?
- be careful with string.len and string.sub
 They are ok only for single-byte charsets.
 Quite straightforward for double-byte charsets
  - simply use twice the numbers.
 Difficult for multi-byte (variable length) and next to unusable
 for escape-driven encodings. That's why you want to stick
 to UTF-8 and use mine or some other lib.
- be careful with upper and lower
 as they require exact knowledge of the charset
 (but are meaningless in most scripts anyway)
- don't rely too much on character classes
 but usually it's enough to treat everything outside
 ASCII as some kind of "letter"; see the sketch after this list.
 E.g. properly recognizing
 non-Latin digits as such won't buy you much as long
 as you don't know how to get a numerical value
 from them (and, yes, that includes roman "digits"!).
- avoid relying on collation
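
A sketch of that heuristic with a byte-oriented pattern (string.gfind is the
Lua 5.0 name; later versions call it string.gmatch):

 -- a "word" pattern of ASCII letters plus any byte >= 128, which keeps
 -- UTF-8 (or Latin-1, or Big5 ...) sequences inside the word
 local word = "[%a\128-\255]+"
 for w in string.gfind("gr\195\188ner Tee, 3 Tassen", word) do
   print(w)          --> grüner, Tee, Tassen (one per line); "3" is not a letter
 end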

> In other words, Lua itself is not Unicode safe one way or another?
It is absolutely "Unicode safe"!
Does not break anything (avoiding sub, upper and lower).

> > or accross the wire. Yet, a recoding extension is scheduled,
> This is futureware, right?
Right, but not far away.

> What about today?
Use Tcl or Java, as both have fairly complete support for that
- if you really need it.
With Lua, you can deploy external filters like iconv or recode.

> This is what I would like to do, yes. How do I achieve that today with
> the stock Lua distribution?
depends on what you want to do:
- Passing around is fine with the stock.
- sub, len, upper, lower and char are done in
  http://malete.org/tar/slnutf8.0.8.tar.gz

Maybe you could spell out what you really need?
After all, there are a lot of issues, like character canonicalization
and bidirectional text, that you maybe haven't ever considered.
If you want to "just have it all in the most proper way",
read all of the Unicode tech reports,
then have a look at IBM's ICU,
and check out why it has to be a multimegabyte library
with a very vast and complicated interface -- pretty unlua.
There is no such thing as a free I18N.

But again, especially for a webserver, you can go to
great lengths in total oblivion.

> > wait for the encoding extension
> Hmmm... where is that fabled "extension"? Got a link?
please allow me one more weekend :)

> >> When I do aFile:read( "*all" ), what do I get back as far as character
> > whatever is in there
> Not very helpful, isn't it?
Yes it is.

> >> My OS do have a default character set encoding, but how do I
> >> know about it?
> > It is ISO-8859-1 (Latin-1).
> Always? ISO-8859-1?
On *your* OS I guess so.

> This is useless for three fourth of world.
Right, on *their* OS it's different.
Being *nixish, check out the LC_* and LANG* environment variables,
/etc/locale and similarly named files, man locale, setlocale et al.

> > Bytes don't have no encoding.
> What about sequences of bytes?
Ask string.byte to spell out the values of multiple bytes
(if you really want to know them).

> Ok. How? I would like my application to always deal with UTF-8
> internally. And convert everything and anything coming its way to it.
> How?
see above

But then, as you are talking about webservers, there's also a simpler
answer that works in practice (sorry for becoming very OT here,
we should go private on the next occasion):
convert your static documents once to UTF-8 from whatever charset they
were written in using iconv or recode, and change the meta headers
to spell out "text/html; charset=UTF-8".
Most browsers will then send any form data in UTF-8 and voilà.
(This is violating the standards; strictly speaking you would have to use
multipart/form-data, which should send a charset with every little piece.)
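
In a hand-rolled server this boils down to something like the sketch below;
the send function is just a stand-in for however the server writes its
response, and the status line and other headers are omitted:

 -- declare UTF-8 both in the HTTP header and in the document itself,
 -- so browsers render it correctly and echo form input back as UTF-8
 local function send_utf8_page(send, body)
   send("Content-Type: text/html; charset=UTF-8\r\n\r\n")
   send('<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n')
   send(body)
 end

 send_utf8_page(io.write, "<p>gr\195\188\195\159e!</p>\n")   -- "grüße!"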

> You mean your recently mentioned UTF-8 library?
right

> This is the extension you are talking about?
no, recoding is a different issue

> Is Lua itself going to ever support Unicode directly one way or another?
"itself"? utf8 is just a lib like string is, which is perfectly fine,
and all-in-one packages a la LuaCheia may one day include it.

> Ok. Any alternatives? How do people deal with locales then?
> Just pretend they are not there?
mostly

> >> Accept-Language header. One request is in de_DE, the next one in fr_FR
> >> and so on, while the application default language is en_US. How does
>
> Hmmm... what if I do care?
All a webserver has to do is pick the right version of the document.
E.g. when asked for /foo/bar.html, see if there is a /fr_FR/foo/bar.html
(or /foo/bar.fr_FR.html if you prefer) and if so, use it, else some default.
That's all.
Really.
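
A sketch of that lookup; the docroot and directory layout below are made up
for illustration:

 -- try a language-specific copy of the document first, then fall back
 local function localized_path(docroot, lang, path)
   local candidate = docroot .. "/" .. lang .. path   -- e.g. /srv/www/fr_FR/foo/bar.html
   local f = io.open(candidate, "r")
   if f then
     f:close()
     return candidate
   end
   return docroot .. path                             -- the default document
 end

 print(localized_path("/srv/www", "fr_FR", "/foo/bar.html"))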

> > First, all of these are using Latin-1 anyways.
> Latin-1 doesn't work for my potential Japanese users.
If they are asking for fr_FR, they will get by with Latin-1.

> > If you happen to have russian text in KOI on your server,
> > you ought to know that.
>
> Sorry, you totally lost me here.
If there is somebody doing the russian translations for you,
they will know which charset they are using.

> In any case, what seems to emerge from all this mess is that Lua is
> simply not ready for prime time as far as i18n goes.
It is perfectly well ready, as I18N is a mere library issue.

> Is that a fair assessment or did I miss something obvious as usual?
Sounds like you've been expecting the kind of magic the
C locale API promises but fails to deliver.
It just doesn't work that way.


regards



Re: The World According to Lua: How To?

Klaus Ripke
In reply to this post by Glenn Maynard
On Friday 18 February 2005 12:36, Glenn Maynard wrote:
> Probably by having the application that's using Lua do the appropriate
> conversion to UTF-8 at the entry/exit points,
yes

> and using setlocale() to change the locale to UTF-8.
don't! It's not working by any means.

> "Support" in what way?  What operations do you want to do?  You can do
> things like substring searches on UTF-8 without any extra code. 
right (while this does not hold in general for multi-byte charsets)

> Regexes/gsub is harder (eg. "[äï]" should be a set of two letters,
> regardless of encoding). I don't know if that's supported
it is for UTF-8

> --it would come at a performance
sure, but not as bad as one might expect

> and portability cost.
no, to the very opposite: while the C API locale support will vary largely,
standard Unicode support is 100% portable in Java and Tcl
and hence in the provided little lib (which is a port from Tcl).

> If you don't understand i18n in general, you can't really understand
> how to deal with it in Lua.
sad but true



regards



Re: The World According to Lua: How To?

Petite Abeille
In reply to this post by Glenn Maynard

On Feb 18, 2005, at 12:36, Glenn Maynard wrote:

A tip: using a real name on technical lists will tend to get you
a better response.

Right... this is email... there is no such thing as a "real name" :)

Probably by having the application that's using Lua do the appropriate
conversion to UTF-8 at the entry/exit points, and using setlocale()
to change the locale to UTF-8.

The application is in Lua.

No, it's not always ISO-8859-1. (Don't know where he got that answer from.) In Unix, it depends on the locale; in Windows, it's the ANSI codepage. My encoding is UTF-8, not ISO-8859-1. The encoding on my Windows machine is
currently CP932 (Shift-JIS).

Ok. So Lua's encoding reflects the OS encoding?

"Support" in what way?  What operations do you want to do?

I simply want to create, manipulate and vend text resources. And would like to make sure that whatever text I create is in a known encoding format.

In this case, you'd probably want an interface to iconv. Accepting multiple tagged character sets in a single application is fairly atypical for i18n support: most applications only need to be able to deal with the language of
the user.  Web browsers, mailers, etc. need more than that.

I'm working on a web server. In Lua.

One generally extends a webserver with Lua--one doesn't write one *in* Lua. :)

There is a first for everything.

http://luaforge.net/projects/luahttpd/
http://luaforge.net/projects/xavante/

If you don't understand i18n in general, you can't really understand
how to deal with it in Lua.

Let's suspend disbelief for a second and pretend that I have heard of i18n before. Just not in Lua and/or ANSI C :)

Thanks in any case.

Cheers

--
PA, Onnay Equitursay
http://alt.textdrive.com/



Re: The World According to Lua: How To?

David Given
On Friday 18 February 2005 14:35, PA wrote:
> On Feb 18, 2005, at 12:36, Glenn Maynard wrote:
> > A tip: using a real name on technical lists will tend to get you
> > a better response.
>
> Right... this is email... there is no such a thing as a "read name" :)

I beg to differ; everyone has a real name, and it's usually considered good 
etiquette to use them on a technical list (as opposed to a social list)... 
not an important point, however.

[...]
> Ok. So Lua's encoding reflects the OS encoding?

Basically, Lua doesn't know about encodings. Lua strings are streams of bytes, 
and it assumes that one character is one byte. Collation is done using the 
byte value.

This means that you can put any kind of data in a string --- but it's your 
responsibility to manipulate it correctly and do any conversion.

For example, if you're storing UTF8 in a Lua string (which is the recommended 
way of doing Unicode in Lua), then you can't assume that you can read 
character n by looking at byte n. *However*, string substitutions and pattern 
matching will still work in a limited way. The regular expression ".*fnord.*" 
will still match any string containing 'fnord', regardless of whether there 
are multibyte characters in the string; likewise, the pattern ".*©.*" will 
work; but "©*" won't work, because the * will bind to the last byte of the 
multibyte character. The collation functions will still work on single-byte 
characters but will sort multibyte characters oddly. And so on.
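
A small illustration of those points; byte escapes are used so the example does
not depend on how this mail itself is encoded:

 local s = "caf\195\169 bar"                      -- "café bar" in UTF-8
 print(string.find(s, "caf\195\169", 1, true))    --> 1  5: literal matching is fine
 print(string.len("caf\195\169"))                 --> 5 bytes, not 4 characters
 print(string.gsub("\195\169\195\169", "\195\169*", "x"))
 --> xx  2: the * bound only to the trailing byte, so "éé" was treated as two
 -- separate matches rather than as one run of the character "é"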

If you use fixed-length encodings such as UCS2 or UCS4 then of course the 
pattern matching functions become useless to you.

Anything from the ISO8859 family is trivial, of course.

If you're writing a web server, then your best bet is to emit UTF8, and avoid 
doing any string slicing; if you write your Lua scripts in UTF8, then you can 
trivially include UTF8 sequences in constant strings:

 local s = "fóö"

Since HTTP can be driven entirely with US-ASCII, then this probably won't 
cause you any problems.

-- 
+- David Given --McQ-+ "There is // One art // No more // No less // To
|  [hidden email]    | do // All things // With art // Lessness." --- Piet
| ([hidden email]) | Hein
+- www.cowlark.com --+ 



Re: The World According to Lua: How To?

Petite Abeille
In reply to this post by Klaus Ripke

On Feb 18, 2005, at 14:06, Klaus Ripke wrote:

moin moin

Like in:

http://www.answers.com/moin+moin&r=67

No less and no more (than "passing around").
Hmmm... right. Not that helpful, isn't it?
It is quite helpful!
In many applications and especially a webserver everything
works perfectly well if you just shuffle around the data without
touching it!

Oh, well... my server is of the type which creates the documents in the first place as well.

So where do you have to pay attention to the encoding?
- be careful with string len and sub
 They are ok only for single byte charsets.
 Quite straightforward for double-byte charsets
  - simple use twice the numbers.
 Difficult for multi-byte (variable length) and next to unusable
 for escape-driven encodings. That's why you want to stick
 to UTF-8 and use mine or some other lib.

I'm fine with UTF-8. I would love to simply use UTF-8 everywhere. I'm just confused about how to reach that goal in a more or less straightforward fashion.

- be careful with upper and lower
 as they require exact knowledge fo the charset
 (but are meaningless in most scripts anyway)

Hmmm... lost me here. Why would case conversion be meaningless? Scripts or no scripts?

- don't rely too much on character classes
 but usually it's enough to treat everything outside
 ASCII as some kind of "letter". E.g. properly recognizing
 non-Latin digits as such wont buy you much as long
 as you don't know how to get a numerical value
 from them (and, yes, that includes roman "digits"!).
- avoid relying on collation

In other words, Lua itself is not Unicode safe one way or another?
It is absolutely "Unicode safe"!
Does not break anything (avoiding sub, upper and lower).

Ok. Do you mean that Lua is 'absolutely "Unicode safe"' as long as you don't touch any of those strings? This would rather limit the usefulness of Lua, no?

What about today?
Use Tcl or Java, as both have fairly complete support for that
- if you really need it.
With lua, you can deploy external filters like iconv or recode.

Ok.

You mean those:

http://www.gnu.org/software/libiconv/
http://www.gnu.org/software/recode/recode.html

What would be the proper way to subvert all of Lua's internal libraries (string, io, etc.) to work transparently in terms of the above (or at least systematically in terms of UTF-8)?

depends on what you want to do:
- Passing around is fine with the stock.
- sub, len, upper, lower and char are done in
  http://malete.org/tar/slnutf8.0.8.tar.gz

Ok. But that seems to imply that I have to forgo Lua's string library altogether and use the above library instead. Is that a drop-in replacement? Just asking :))


Maybe you could spell out what you really need?

I have a little app which creates, manipulates and serves textual information:

http://alt.textdrive.com/lua/23/lua-lupad

And I simply would like to make sure, before it's too late, that I handle character set encoding properly.

After all, there are a lot of issues like character canonicalization
and bidirectional text you maybe haven't ever considered?
If you want to "just have it all in the most proper way",
read all of the Unicode tech reports,
then have a look at IBM's ICU,
and check out why it has to be a multimegabyte library
with a very vast and complicated interface -- pretty unlua.
There is no such thing as a free I18N.

I don't expect it to be free (of pain). I just would like to know where Lua stands in this mess. And act accordingly.

But again, especially for a webserver, you can go to
great lengths in total oblivion.

Not in this case. It's an embedded server which acts as the primary interface to the application.

Very much like this application:

http://zoe.nu/
http://zoe.nu/itstories/story.php?data=stories&num=23&sec=2

wait for the encoding extension
Hmmm... where is that fabled "extension"? Got a link?
please allow me one more weekend :)

Sorry. I thought this was something coming in Lua's master plan or something :)

My OS do have a default character set encoding, but how do I
know about it?
It is ISO-8859-1 (Latin-1).
Always? ISO-8859-1?
On *your* OS I guess so.

Ok. What if I don't want to use my OS encoding... perhaps for portability reasons... but would like to systematically use UTF-8 instead... what should I do, if anything?

This is useless for three fourth of world.
Right, on *their* OS it's different.

"Theirs"?

Being *nixish, check out LC* and LANG* environment variables,
/etc/locale and similar named, man locale, setlocale et al.

Right... but... I simply would like to have my application work in a more or less portable manner. Any way to do that in Lua? If the answer is no, that's fine. I just would like to know so I can decide to ditch one tool in favor of another one if necessary. That's all.

Ask string.byte to spell out the values of multiple bytes
(if you really want to know them).

I'm not that interested in the various ways to encode the same string. I simply would like my app to work in Unicode using UTF-8 as its sole encoding. That's all. If this is too cumbersome to achieve, so be it. I just would like to know :)

But then, as you are talking about webservers, there's also a simpler
answer working in practice (sorry for becoming very OT here,
we should go private on next occasion):
Convert your static documents once to UTF-8 from whatever chaset they
were written in using iconv or recode and change the meta headers
to spell out "text/html; charset=UTF-8".
Most browsers will send any form data in UTF-8 and voila.
(This is violating the standards; strictly speaking you had to use
multipart/formdata which should send a charset with every little piece).

Yes. This is usually what I do in the first place, by specifying the form's accept-charset in addition to the document and HTTP header character set encoding.


All a webserver has to do is pick the right version of the document.
E.g. when asked for /foo/bar.html, see if there is a /fr_FR/foo/bar.html (or /foo/bar.fr_FR.html if you prefer) and if so, use it, else some default.
That's all.
Really.

Perhaps there is no document in the first place? Perhaps I'm generating such a document on the fly? Perhaps I would like to generate it according to whatever accept-language header I have received?


First, all of these are using Latin-1 anyways.
Latin-1 doesn't work for my potential Japanese users.
If they are asking for fr_FR, they will get by with Latin-1.

They are asking for ja_JP. French will simply not do. Nor any other Latin-1 language. They wrote it in Japanese. They want to have it back in Japanese. Or Hindi. Or whatever they use. Or perhaps one document is in Korean. And the next in Sanskrit. And they both need to be displayed on the same "page". Such is life :P

Sound's like you've been expecting the kind of magic the
C locale API promises, but fails to deliver.
It just doesn't work that way.

Ok. I'm fine with that. I just would like to know how it does work in practice, if it works at all. If Lua is not meant to be used that way, so be it. I just would like to know.

Thanks :)

Cheers

--
PA, Onnay Equitursay
http://alt.textdrive.com/



Re: The World According to Lua: How To?

Klaus Ripke
On Friday 18 February 2005 16:18, PA wrote:
> http://www.answers.com/moin+moin&r=67
right, Low Saxon from East Frisia :)

> Oh, well... my server is of the type which creates the documents in the
> first place as well.
but gathering text from some source ...

> I'm fine with UTF-8. I would love to simply use UTF-8 everywhere.
Absolutely no problem, and Lua is perfect for that.
We have three projects right now where we need exactly that,
so you can kind of count on me getting it done.

> > - be careful with upper and lower
> >  (but are meaningless in most scripts anyway)
>
> Hmmm... lost me here. Why would case conversion be meaningless. Scripts
> or no scripts?
oh, script in the sense of Phoenician vs. Sinitic or Brahmi.
Only the tiny Greek fork uses case.
http://web.archive.org/web/20030806053052/czyborra.com/unicode/characters.html#scripts

> Ok. Do you mean that Lua is 'absolutely "Unicode safe"' as long as you
> don't touch any of those strings?
no, as long as you use the proper functions to touch them.

> What would be the proper way to subvert all Lua's internal libraries
> (string, io, etc) to work transparently in terms of the above (or at
> least systematically in terms of UTF-8).
For string, use my little utf8 stuff instead.

For io and sockets it's probably best to use an ltn12-style filter.
The question of what the most generic read/write interface should be,
regardless of io (file), socket or a filter around any of those,
is more general and afaik not yet settled (?);
once it's there, a recoding filter is just one special case.

Anyway, there's no need to subvert anything here.
If you really want all of io to transparently recode,
that is, without explicitly stating which charset is
on the other end (involving some guesswork about the "system encoding"),
then it can still be done using the usual wrapping techniques.
But I'd always go for an explicit interface, as guesswork is bound to fail.

Nothing else would need to be touched.
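
Just to sketch the wrapping idea: recode() below is a do-nothing placeholder
standing in for whatever iconv/recode binding eventually gets used; it is not
a real API.

 -- placeholder converter: passes bytes through unchanged so the sketch runs;
 -- a real one would call into iconv/recode with the given charsets
 local function recode(chunk, from_charset, to_charset)
   return chunk
 end

 -- wrap a file handle so every line read comes back recoded to UTF-8;
 -- reading by line avoids splitting a multibyte sequence across chunks
 local function utf8_lines(file, from_charset)
   return function()
     local line = file:read("*l")
     if line == nil then return nil end
     return recode(line, from_charset, "UTF-8")
   end
 end

 for line in utf8_lines(io.stdin, "ISO-8859-1") do
   io.write(line, "\n")
 end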


> Ok. But that seems to imply that I have to forgo Lua's string library
> altogether and use the above library instead. Is that a drop in
> replacement? Just asking :))
In principle yes -- to the extent that any
data used with string is really UTF-8 character data
(and once the mentioned todos are done).
In practice this is looking for trouble, as it would break other
libs that use string to cut arbitrary bytes out of something.

> http://alt.textdrive.com/lua/23/lua-lupad
fine, fine TextDrive staff :)

> And I simply would like to make sure, before it's too late, that I
> handle character set encoding properly.
doesn't look like there's too much need to manipulate the texts? 

> Just would like to know where Lua stands in this mess.
I'd say in the middle of nowhere, which is a fine place to be

> Ok. What if I don't want to use my OS encoding...
right. forget about the OS.

> but would like to systematically use UTF-8 instead...
> what should I do, if anything?
gather data in UTF-8 from the first minute
and save it as is without applying any recoding.
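
For a note-taking app that is as little as the following sketch (paths made up):

 -- store and serve the note exactly as the bytes arrived: no recoding anywhere
 local function save_note(path, bytes)
   local f = assert(io.open(path, "wb"))   -- binary mode, no translation
   f:write(bytes)
   f:close()
 end

 local function load_note(path)
   local f = assert(io.open(path, "rb"))
   local bytes = f:read("*all")
   f:close()
   return bytes
 end

 save_note("/tmp/note.txt", "gr\195\188n")   -- UTF-8 in...
 print(load_note("/tmp/note.txt"))           -- ...UTF-8 out, untouched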

> Right... but... I simply would like to have my application work in a
> more or less portable manner. Anyway to do that in Lua?
the key to portability is to just not ask the OS for it.
Then all is fine with Lua, as it won't break anything either.


> such a document on the fly? Perhaps I would like to generate it
> according to whatever accept-language header I have received?
you're going for automatic translation?

> They want to have it back in Japanese. Or Hindu. Or whatever they use.
> Or perhaps one document is in Korean. And the next in Sanskrit.
> And they both need to be displayed on the same "page". So is life :P
no problem







Re: The World According to Lua: How To?

Bernardo Signori
In reply to this post by Petite Abeille
On Fri, 18 Feb 2005 16:18:40 +0100, PA <[hidden email]> wrote:
> Japanese. Or Hindu. Or whatever they use. Or perhaps one document is in
> Korean. And the next in Sanskrit. And they both need to be displayed on
> the same "page". So is life :P
> 
What's the use of having one page with two documents in different
languages? The Korean reader won't understand the Sanskrit, nor the
other way around!

Bernardo


Re: The World According to Lua: How To?

Petite Abeille

On Feb 18, 2005, at 18:00, Bernardo Signori wrote:

What's the use of having one page with two documents in different
languages? The Korean won't understand the sanskrit nor the other way
around!

Perhaps we are talking about a Korean scholar studying the Mahabharata?

In any case, things can be more complicated than they first appear:

http://alt.textdrive.com/assets/public/i18n.editor.png
http://alt.textdrive.com/assets/public/i18n.message.png
http://alt.textdrive.com/assets/public/i18n.png

Cheers

--
PA, Onnay Equitursay
http://alt.textdrive.com/



Re: The World According to Lua: How To?

Petite Abeille
In reply to this post by Klaus Ripke

On Feb 18, 2005, at 17:40, Klaus Ripke wrote:

Oh, well... my server is of the type which creates the documents in the
first place as well.
but gathering text from some source ...

Yes. Usually a HTTP POST.


I'm fine with UTF-8. I would love to simply use UTF-8 everywhere.
Absolutely no problem and lua is perfect for that.
We're just right now having three projects where we need that,
so you can kind of count on me getting it done.

Good :)


- be careful with upper and lower
 (but are meaningless in most scripts anyway)

Hmmm... lost me here. Why would case conversion be meaningless. Scripts
or no scripts?
oh, script in the sense of phoenician vs. sinitic or brahmi.
Only the tiny small greek fork uses case.
http://web.archive.org/web/20030806053052/czyborra.com/unicode/characters.html#scripts

I see. I don't mind too much if Coptic subtleties are not fully accounted for :P

For string, use my little utf8 stuff instead.

Ok. Will take a closer look.

For io and sockets it's probably best to use a ltn12-style filter.
The question of what should be the most generic read/write
interface regardless of io(file), socket or filter around any of
those is more general and afaik no yet settled (?);
once it's there, a recoding filter is just one special case.

Anyway, there's no need to subvert anything here.
If you really want all of io to transparently recode,
that would mean without explicitly stating which charset is
on the other end (involving some guesswork about "system encoding"),
then it still can be done using the usual wrapping techniques.
But I'd always go for an explicit interface, as guesswork is bound to fail.

Ok. I would rather use an explicit encoding everywhere. UTF-8 is just fine with me.

Nothing else would need to be touched.

Ok.


Ok. But that seems to imply that I have to forgo Lua's string library
altogether and use the above library instead. Is that a drop in
replacement? Just asking :))
In principle yes -- to the extend where any
data used with string is really UTF-8 character data
(and once the mentioned todos are done).
In practice this is looking for trouble, as it would break other
libs using string to cut arbitrary bytes out of something.

So... what does that mean? Most (all?) Lua extensions are built in terms of the default string API, no? Does the entire string library need to be hijacked?

http://alt.textdrive.com/lua/23/lua-lupad
fine, fine TextDrive staff :)

Staff or stuff? :P

And I simply would like to make sure, before it's too late, that I
handle character set encoding properly.
doesn't look like there's too much need to manipulate the texts?

Indexing? Search? Reading and writing to disk? Reading and writing to sockets?

Just would like to know where Lua stands in this mess.
I'd say in the midth of nowhere, which is a fine place to be

Ok...


Ok. What if I don't want to use my OS encoding...
right. forget about the OS.

but would like to systematically use UTF-8 instead...
what should I do, if anything?
gather data in UTF-8 from the first minute
and save it as is without applying any recoding.

Ok. I can do that. I think :P


Right... but... I simply would like to have my application work in a
more or less portable manner. Anyway to do that in Lua?
key to portability is to just not ask the OS to get it.
Then there's all fine with lua, as it won't break anything either.

Well... I need to read this data back from the disk at least.

such a document on the fly? Perhaps I would like to generate it
according to whatever accept-language header I have received?
you're going for automatic translation?

No, no. This is a simple application to write textual notes. The notes can be in any language. The client accessing the notes can be in any language. The notes are always displayed in their original format (e.g. UTF-8). The UI itself can change depending on the client language header.

They want to have it back in Japanese. Or Hindu. Or whatever they use.
Or perhaps one document is in Korean. And the next in Sanskrit.
And they both need to be displayed on the same "page". So is life :P
no problem

Ok. This is what I mean:

http://alt.textdrive.com/assets/public/i18n.png

In the application above, everything is indeed seamlessly handled as far as encoding goes.

Cheers

--
PA, Onnay Equitursay
http://alt.textdrive.com/



Re: The World According to Lua: How To?

Klaus Ripke
In reply to this post by Petite Abeille
On Friday 18 February 2005 18:44, PA wrote:
> In any case, things can be more complicated than the first appear:
absolutely!
http://openisis.org/openisis/unicode?search=&prefix=1&action=search

Did we agree that there is no urgent need for recoding since all
is round trip UTF-8?
If so, I'd rather continue work on find/gsub before proceeding
to encoding issues


regards



Re: The World According to Lua: How To?

Petite Abeille

On Feb 18, 2005, at 19:03, Klaus Ripke wrote:

Did we agree that there is no urgent need for recoding since all
is round trip UTF-8?

Yes.

If so, I'd rather continue work on find/gsub before proceeding
to encoding issues

Sounds like a plan :)

Thanks!

Cheers

--
PA, Onnay Equitursay
http://alt.textdrive.com/



Re: The World According to Lua: How To?

Petite Abeille
In reply to this post by David Given

On Feb 18, 2005, at 16:16, David Given wrote:

Basically, Lua doesn't know about encodings. Lua strings are streams of bytes, and it assumes that one character is one byte. Collation is done using the
byte value.

Ok.


This means that you can put any kind of data in a string --- but it's your
responsibility to manipulate it correctly and do any conversion.

Ah... this is the major catch.

For example, if you're storing UTF8 in a Lua string (which is the recommended
way of doing Unicode in Lua), then you can't assume that you can read
character n by looking at byte n. *However*, string substitutions and pattern matching will still work in a limited way. The regular expression ".*fnord.*" will still match any string containing 'fnord', regardless of whether there are multibyte characters in the string; likewise, the pattern ".*©.*" will work; but "©*" won't work, because the * will bind to the last byte of the multibyte character. The collation functions will still work on single-byte
characters but will sort multibyte characters oddly. And so on.

So basically, UTF-8 renders most Lua core functionality useless as soon as one ventures beyond US-ASCII, broadly speaking?

If you're writing a web server, then your best bet is to emit UTF8,

I can do that.

 and avoid
doing any string slicing;

I need to do that. A major feature of the app is search.

if you write your Lua scripts in UTF8, then you can
trivially include UTF8 sequences in constant strings:

 local s = "fóö"

Since HTTP can be driven entirely with US-ASCII, then this probably won't
cause you any problems.

Thanks for the explanations :)

Cheers

--
PA, Onnay Equitursay
http://alt.textdrive.com/




Re: The World According to Lua: How To?

Klaus Ripke
In reply to this post by Petite Abeille
On Friday 18 February 2005 19:03, PA wrote:
> >> Ok. But that seems to imply that I have to forgo Lua's string library
> >> altogether and use the above library instead. Is that a drop in
> >> replacement? Just asking :))
> >
> > In principle yes -- to the extend where any
> > data used with string is really UTF-8 character data
> > (and once the mentioned todos are done).
> > In practice this is looking for trouble, as it would break other
> > libs using string to cut arbitrary bytes out of something.
>
> So... what does that mean? Most (all?) Lua extensions are build in
> terms of the default string API, no?
probably

> Does the entire string library needs to be hijacked?
no, don't do it.
Most uses will really mean to work on bytes,
regardless of their content,
and these could get very confused by such hijacking.

In principle there could be some libs actually wanting
to strip, say, the first ten *characters* from a string
(regardless of how many bytes that may be)
or, maybe less exotic, to lowercase everything.

Maybe it would be nice to have the encoding-independent
functions rep, dump, and byte-wise sub, len and reverse
in a package "bytes", so as to replace "string" with whatever
seems fit. For now, however, assume "string" to mean "bytes",
and to maximize confusion do "ctype = utf8" and patch
around libs to use "ctype" instead of "string". Anyway,
such libs had better accept a ctype module as a parameter.
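
A sketch of that last point; the utf8 module here is assumed to expose the same
function names as string (per the slnutf8 lib mentioned earlier), which is why
the require stays commented out:

 -- a lib that accepts a ctype module: byte-wise by default, character-wise
 -- if the caller hands in a UTF-8 aware module instead
 local function make_truncate(ctype)
   ctype = ctype or string
   return function(s, n)
     return ctype.sub(s, 1, n)       -- first n units: bytes or characters
   end
 end

 local truncate_bytes = make_truncate(string)
 print(truncate_bytes("gr\195\188n", 3))    -- cuts the "ü" in half: "gr\195"
 -- local truncate_chars = make_truncate(require("utf8"))  -- with the UTF-8 lib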


> Staff or stuff? :P
just quoting your server's X-Powered-by header

> > doesn't look like there's too much need to manipulate the texts?
> Indexing? Search?
can be done regardless of content.
Only normalization and collation would require some more.

> Reading and writing to disk? Reading and writing to sockets?
no manipulation there. just pass bytes on.


saludos



Re: The World According to Lua: How To?

Klaus Ripke
In reply to this post by Petite Abeille
On Friday 18 February 2005 19:14, PA wrote:
> So basically, UTF-8 renders most Lua core functionality useless as soon
> as one venture beyond US-ASCII, broadly speaking?
far too broad. only string.

> >  and avoid doing any string slicing;
> I need to do that. A major feature of the app is search.
That's a tough one!
You may want to check http://malete.org/Doc/CharSet on that,
especially the notes on East Asian word indexing.
As promised, this will (independently of the full DB)
be made available from Lua.


cheers





Re: The World According to Lua: How To?

David Given
In reply to this post by Petite Abeille
On Friday 18 February 2005 18:14, PA wrote:
[...]
> So basically, UTF-8 renders most Lua core functionality useless as soon
> as one venture beyond US-ASCII, broadly speaking?

No. Most of the core functionality will continue to work fine. Anything that 
makes assumptions as to how many bytes a character takes won't work, but 
there's surprisingly little that does that.

For example:

 instring = "Hello, world!"
 instring = string.gsub(instring, "e", "é")
 instring = string.gsub(instring, "l", "ľ")
 instring = string.gsub(instring, "o", "ö")
 instring = string.gsub(instring, "!", "¡")
 print(instring)

This works fine. string.gsub() doesn't care that the replacement strings are 
multibyte characters --- it's dealing purely with bytes. In fact, the above 
code will work with *all* character encodings that degrade into ASCII: 
ISO8859-n, UTF8, some of the Asian encodings, etc. I happened to use UTF8. 
The only caveat is that you have to save the script in the same encoding as 
the data it manipulates, because otherwise the constant strings in the script 
won't match.

Likewise:

 instring = "abcdéfghi"
 instring = string.gsub(instring, "é", "e")
 print(instring)

...will also work. But:

 instring = "abcdéfghi"
 s = string.find(instring, "é")
 print(string.sub(instring, s, s))

Oop! This'll emit a broken character. Use this instead:

 instring = "abcdéfghi"
 s, e = string.find(instring, "é")
 print(string.sub(instring, s, e))

See? I'm not making any assumptions about how long any strings are.

The only major issue is that [], * and + in regular expressions won't work on 
multibyte characters (but they'll keep working on the single-byte characters 
surrounding the multibyte characters)... it's a shame that Lua's regular 
expressions don't support groups. Most regexp engines allow you to group 
stuff up like this:

 (abcd)*   matches   abcdabcdabcd

This would provide a very easy way to get regexps working with multibyte 
characters:

 (é)*

The à would expand into multiple bytes, and the regexp would Just Work. 
Unfortunately, that's not supported.

The other thing to be aware of is that string.upper() and string.lower() won't 
work, but nobody expects them to work on anything other than Latin scripts 
anyway.

-- 
+- David Given --McQ-+ "USER'S MANUAL VERSION 1.0:  The information
|  [hidden email]    | presented in this publication has been carefully
| ([hidden email]) | for reliability." --- anonymous computer hardware
+- www.cowlark.com --+ manual



Re: The World According to Lua: How To?

Petite Abeille

On Feb 18, 2005, at 20:01, David Given wrote:

No. Most of the core functionality will continue to work fine. Anything that makes assumptions as to how many bytes a character takes won't work, but
there's surprisingly little that does that.

Ok. This all sounds good. I will keep in mind the caveats about source file encoding and string length.

Now I need to get down and dirty and give it a spin.

Thanks a lot for the explanations :)

Cheers

--
PA, Onnay Equitursay
http://alt.textdrive.com/

