Could Lua itself become UTF8-aware?


Re: Could Lua itself become UTF8-aware?

Sean Conner
It was thus said that the Great Dirk Laurie once stated:

> 2017-04-30 23:11 GMT+02:00 Sean Conner <[hidden email]>:
>
> >   There was a long discussion about that a few years ago:
> >
> >         http://lua-users.org/lists/lua-l/2014-04/msg01226.html
> >
> >   It appears the consensus then was "maybe not a good idea."
> >
> >   -spc
>
> I did not read that before posting, and Sean is careful not to imply that
> old issues are dead and buried,

  Well, the discussion was about using keywords as variable names, like:

        if if == then then then = else else else = end end

> But UTF-8 has come closer to universal acceptance in the last three
> years.  What was "maybe not a good idea" back then might have
> become "maybe not a bad idea" now.

  But there are still issues, like not all Unicode code points being valid
alphabetic characters---there are blocks of mathematical operator
characters, digit characters for different scripts, and combining
characters, which now even extend to color.  The rules get rather complex
rather quickly ...

  -spc (Don't even want to think about left-to-right vs. right-to-left
        issues ... )



Re: Could Lua itself become UTF8-aware?

Enrico Colombini
In reply to this post by Daurnimator
On 01-May-17 07:28, Daurnimator wrote:

> However, to reply to the issue at hand: are unicode classes wanted?
> i.e. should a unicode space such as U+2001 count as whitespace for
> token separation?
> Furthermore, what should be considered valid characters for identifiers?
> I guess we still want the rule "alpha followed by any number of alphanumeric"?
> Which Unicode standard do we want to pick? (You did realise unicode
> gets updated.... right?)
> We'd need a strategy to deal with updates (which rarely go well: see
> how people are still dealing with fallout from IDNA2003 => IDNA2008)
>
> Which brings us to the next problem: normalisation of identifiers. It
> would seem perplexing to many that the identifiers U+00C5 and U+0041
> U+030A would refer to different variables.
> Even if you don't think normalisation should occur (like myself), then
> you'll at least have an easy mechanism for obfuscated code
> contests....
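The two spellings really are distinct byte strings, as a quick sketch
(using `\u` escapes, which require Lua 5.3+) shows:

```lua
-- U+00C5 (precomposed Å) vs U+0041 U+030A (A plus combining ring above):
-- visually identical, but different byte sequences, so Lua sees two
-- different strings (and, under the proposal, two different identifiers).
local nfc = "\u{00C5}"          -- one codepoint
local nfd = "\u{0041}\u{030A}"  -- two codepoints
print(nfc == nfd)                    --> false
print(utf8.len(nfc), utf8.len(nfd))  --> 1	2
```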

FWIW, in the '80s there were Italian versions of BASIC, but they luckily
died out. I say "luckily" because language localization means community
fragmentation: you cannot search for ideas and solutions and you cannot
cooperate with people around the world. Not to speak of library usability.

So I do not think localized identifiers are a great idea. I realize that
the issue may be more acutely felt by Asian users, but it is always a
balance between symbol comprehension and using universal symbols (even
in case one does not understand their literal meaning: think of
identifiers as pictograms).

You do not have to think in English to write "if...then...else", even if
it could be very slightly helpful in the first learning steps. In fact,
I knew almost nothing of English when I started programming and that did
not hamper me in the least. But the benefits of using world-standardized
identifiers were immense.

--
   Enrico


Re: Could Lua itself become UTF8-aware?

Rain Gloom
> you cannot search for ideas and solutions and you cannot cooperate with people around the world. Not to speak of library usability.
This alone should be a big enough problem. If someone wants to teach kids in their native language, they should be using Scratch or something similar. If the kids are already learning English (I started when I was 7) then they know enough to write a few identifiers in it. If not, they can ask and learn both subjects at once.

Re: Could Lua itself become UTF8-aware?

Martin
In reply to this post by Roberto Ierusalimschy
On 04/29/2017 06:21 AM, Roberto Ierusalimschy wrote:
> What does it mean for Lua to be "UTF-8 aware"?

From my point of view there are two topics concerning UTF-8 in Lua:

  * allow UTF-8 in identifiers

    I don't see any long-term profit in it. It does not add any
    power, but encourages writing unmaintainable code.

  * introduce UTF-8 strings

    A UTF-8 codepoint is variable-length, from 1 to 4 bytes, so we
    lose the ability to address a character directly in O(1) time.
    We also could no longer change a character without shifting the
    rest of the string. So it would be a performance degradation.

    But further extending the "utf8" standard library with advanced
    UTF-8 encoding/decoding routines is not bad.
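A minimal sketch with the stock utf8 library (Lua 5.3+) illustrates the
addressing cost: byte indexing is O(1), but reaching the n-th codepoint
requires a linear scan, which utf8.offset performs:

```lua
local s = "héllo"           -- "é" is two bytes in UTF-8
print(#s)                   --> 6   (bytes)
print(utf8.len(s))          --> 5   (codepoints)

-- Finding the 3rd codepoint needs a scan from the start of the string;
-- utf8.offset does that scan and returns a byte position:
local i = utf8.offset(s, 3)
print(s:sub(i, i))          --> l
```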

On the other hand, by using UTF-8 we could design a language with
extended glyphs in its syntax. That would eliminate clashes between
"[" (indexing) and "[[" (long string quote), or "-" (unary minus) and
"--" (comment), and could shrink "<=", ">=", "==" to one glyph each.

The problem is how to type UTF-8 glyphs from a keyboard. And it would
be another language anyway.

-- Martin


Re: Could Lua itself become UTF8-aware?

Dirk Laurie-2
2017-05-01 23:25 GMT+02:00 Martin <[hidden email]>:

> Problem is how to type UTF glyphs from keyboard? And this will be
> another language anyway.

The system I use has a fairly easy way to configure a second
language on the third and fourth level. I use it for APL. Apart
from that, there are plenty of common characters on the compose
key, e.g. all accented letters, ©, ß, §, →, ⇒, —, ….


Re: Could Lua itself become UTF8-aware?

Roberto Ierusalimschy
In reply to this post by Sean Conner
> > What is wrong with
> >
> > function   ()
> > end
>
>   Again, missing character in font.
>
> > or
> >
> > function functiοn ()
> > end
> > ?
>
>   There was a long discussion about that a few years ago:
>
> http://lua-users.org/lists/lua-l/2014-04/msg01226.html

There is no missing character in the font, nor keywords being used as
identifiers. Both | | ("en quad", a kind of white space) and
|functiοn| (with an omicron for "ο") would be valid identifiers under
the proposal (accept everything above 127 as valid in identifiers).

I want to be very clear that I am not against Unicode, quite the
opposite. However, "accept everything above 127 as valid in identifiers"
is quite different from supporting Unicode (again, it is quite the
opposite). For data, the Lua approach is to give very basic support and
leave the rest to specialized libraries. Several programs have been using
Lua with Unicode quite successfully (e.g., Lightroom and LuaTeX).

However, the Lua lexer should not depend on libraries. It would be great
if the lexer could handle Unicode (correctly!), but I don't know how to
do it without at least doubling the size of Lua. To do it (very) broken,
I prefer to keep it as it is.

-- Roberto
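The omicron case can be seen at the string level (a sketch with
hypothetical names; stock Lua still rejects bytes above 127 in
identifiers, so the comparison is done on strings):

```lua
-- "functiοn" with U+03BF (Greek small omicron) looks like "function"
-- but is a different byte sequence; under the proposal these would be
-- two distinct, visually indistinguishable identifiers.
local ascii = "function"
local greek = "functi\u{03BF}n"
print(ascii == greek)   --> false
print(#ascii, #greek)   --> 8	9   (the omicron takes two bytes)
```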


Re: Could Lua itself become UTF8-aware?

Roberto Ierusalimschy
In reply to this post by Jay Carlson
> I'd like to have Lua better support checking whether something is RFC-legal UTF-8.

What is wrong with 'utf8.len'?

-- Roberto
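For reference, utf8.len reports the position of the first invalid byte,
so it doubles as a validity check (though, depending on the Lua version,
it may be laxer than RFC 3629 about surrogates and overlong forms):

```lua
print(utf8.len("héllo"))   --> 5
print(utf8.len("\xC3"))    --> nil	1   (truncated sequence at byte 1)

-- a sketch of a validity predicate built on utf8.len:
local function is_valid_utf8(s)
  return utf8.len(s) ~= nil
end
print(is_valid_utf8("héllo"), is_valid_utf8("\xC3"))  --> true	false
```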


Re: Could Lua itself become UTF8-aware?

Soni "They/Them" L.
In reply to this post by Roberto Ierusalimschy


On 2017-05-01 10:00 AM, Roberto Ierusalimschy wrote:

> [...]
>
> I want to be very clear that I am not against Unicode, quite the
> opposite. However, "accept everything above 127 as valid in identifiers"
> is quite different from supporting Unicode (again, it is quite the
> opposite). [...]
>
> -- Roberto

But my latin-1! (why limit it to UTF-8 if you can just add support for
pretty much every single extension of ASCII instead?)

--
Disclaimer: these emails may be made public at any given time, with or without reason. If you don't agree with this, DO NOT REPLY.



Re: Could Lua itself become UTF8-aware?

szbnwer@gmail.com
In reply to this post by Roberto Ierusalimschy
i deeply agree with these arguments, and proper solutions taste like
over-complicating things, but i think trolls and their apps/libs would
just bleed out by natural selection, even if every possibility were
left open :D in the worst case the trolls' stuff just gets forked and
cleaned up... the harder part is input from different keyboards, if one
has to rely on special chars, but that is for a target group, or also
for natural selection...

the most important question is security. i've got no hints, i just know
it can be tricky to find good ideas there, but open source (yup) is
about trust, and it wouldn't be good to let people trick others' eyes
into accepting something evil. it's already an art to use this for
anything harmful, but using such tricks for phishing is an old one. i
dunno how this would follow from lua utf stuff, but the security topic
is always worth the time and effort :)

2017-05-01 15:05 GMT+02:00 Roberto Ierusalimschy <[hidden email]>:
>> I'd like to have Lua better support checking whether something is RFC-legal UTF-8.
>
> What is wrong with 'utf8.len'?
>
> -- Roberto
>


Re: Could Lua itself become UTF8-aware?

Dirk Laurie-2
In reply to this post by Soni "They/Them" L.
2017-05-01 15:15 GMT+02:00 Soni L. <[hidden email]>:

> But my latin-1! (why limit it to UTF-8 if you can just add support for
> pretty much every single extension of ASCII instead?)

Much as I loved ISO-8859-1, I have to admit that it has very quickly
become obsolete, except as input to iconv.


Re: Could Lua itself become UTF8-aware?

Jay Carlson
In reply to this post by Soni "They/Them" L.

> On Apr 30, 2017, at 10:44 AM, Soni L. <[hidden email]> wrote:
>
> On 2017-04-30 07:19 AM, Shmuel Zeigerman wrote:
>> On 29/04/2017 16:41, Dirk Laurie wrote:
>>>
>>> The next step would be a compiler option under which the lexer
>>> accepts a UTF-8 first character followed by the correct number
>>> of UTF-8 continuation characters as being alphabetic for the
>>> purpose of being an identifier or part of one.
>>>
>>
>> BTW, LuaJIT 2 has it for years already (it allows UTF-8 in identifiers). But it seems nobody needs it.
>>
>
> Well, I'd argue mostly nobody needs it because it doesn't allow emoji in identifiers.

Does it prohibit all astral characters, or are emoji singled out?

Either way, that's too bad. You can write this Ruby:

def 🚫(📖)
        🐦 = 💻(📖)
        📝🚫🐊 = Array.new

        begin
                🚫📥 = 🐦.blocked_ids
                File.open(File.join(LOGDIR, 'block_ids.current'), 'w+') do |💾|
                        🚫📥.sort.each do |😭🐊|
                                💾.puts(😭🐊)
                                📝🚫🐊 << 😭🐊.to_s

https://github.com/oapi/emojiautoblocker/blob/master/emojiautoblocker




Re: Could Lua itself become UTF8-aware?

Gé Weijers
In reply to this post by Roberto Ierusalimschy
Roberto,

You're not very far off the mark.

On Mon, May 1, 2017 at 6:00 AM, Roberto Ierusalimschy <[hidden email]> wrote:

> However, the Lua lexer should not depend on libraries. It would be great
> if the lexer could handle Unicode (correctly!), but I don't know how to
> do it without at least doubling the size of Lua. To do it (very) broken,
> I prefer to keep it as it is.
Russ Cox's RE2 regular expression library, which supports Unicode, is 50% larger than the Lua interpreter after comments are stripped; tables account for about 6200 lines of that. I'm sure those tables could be represented more compactly, but they would still have to be interpreted, so the Lua interpreter would grow very substantially.

Anyone who wants to do it the right way should probably read this first:







Re: Could Lua itself become UTF8-aware?

Jay Carlson
In reply to this post by Sean Conner

> On Apr 30, 2017, at 5:11 PM, Sean Conner <[hidden email]> wrote:
>
> It was thus said that the Great Roberto Ierusalimschy once stated:
>>> [...] What's wrong with:
>>>
>>> function 解決問題 ()
>>> end
>
>  I don't have the right font installed to see the function name.

Which century are you in? OK, OK, I'll spot you another ten years. Which decade? :-)

More to the point, which environment is so minimal that it doesn't have coverage for unified Han today? I was playing with a Raspberry Pi Zero, and the default Raspbian packaging is happy to display Chinese on Chinese web pages (or on English pages, for that matter). I bet your *phone* can display world languages better than that environment.

raspberrypi.org ships a meta-distro called NOOBS; it includes a Raspbian install image. That particular Raspbian build is missing the main Korean glyphs, called hangeul. Well, I mean it was missing them by default; I installed them.

If you are using a "pick everything yourself" distro you only have yourself to blame for those missing character boxes. Maybe you can blame the distro too, because the "install a reasonable GUI" target should include support for all text you will encounter. Minimal support could be display-only.

Not everybody is stuck on a subsidized $5+peripherals computer with an 8GB sdcard. The next step up from that for me is Lubuntu. Lubuntu LiveCDs cover display of all the big languages and some of the small.

IMO the software development world will have a lot to figure out about the "right" way to edit UTF-8. Remember, after n decades of C, we still have tab-vs-space warfare.

Jay

Re: Could Lua itself become UTF8-aware?

Ahmed Charles
In reply to this post by Roberto Ierusalimschy
On 5/1/2017 6:00 AM, Roberto Ierusalimschy wrote:

> I want to be very clear that I am not against Unicode, quite the
> opposite. However, "accept everything above 127 as valid in identifiers"
> is quite different from supporting Unicode (again, it is quite the
> opposite). For data, the Lua approach is that it gives a very basic
> support and leave the rest for specialized libraries. Several programs
> have been using Lua with Unicode quite successful (e.g., Lightroom and
> LuaTeX).
>
> However, the Lua lexer should not depend on libraries. It would be great
> if the lexer could handle Unicode (correctly!), but I don't know how to
> do it without at least doubling the size of Lua. To do it (very) broken,
> I prefer to keep it as it is.
>
> -- Roberto

Lua 5.3.4 seems to accept the following code:

-- print á
print("á")

When encoded as UTF-8.

Looking at the manual, I can't tell whether this is intentional or not.
The program, when executed, does seem to function as expected. Can you
clarify if this program is intended to be valid Lua?
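For what it's worth, the snippet passes through because Lua strings are
plain byte sequences, so a UTF-8 literal is just its bytes (a sketch,
assuming a UTF-8-encoded source file, Lua 5.3+):

```lua
local s = "á"            -- 0xC3 0xA1 in a UTF-8 source file
print(#s)                --> 2      (byte length, not character count)
print(utf8.len(s))       --> 1      (one codepoint)
print(s == "\xC3\xA1")   --> true   (the literal is just its bytes)
```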

If this code is intended to be valid Lua, then I'd think it has enough
Unicode support already. The only remaining support I could think of
would be for identifiers, but that would require implementing UAX31 [1]
or similar, which would require at least some Unicode mapping tables.
Lua currently doesn't include these tables and lets this functionality
be up to external libraries. So, I fail to see how including these for
the significantly less useful feature of Unicode identifiers would be a
worthwhile trade-off.

Anecdotally:

Go [2] seems to do something simpler than what UAX31 suggests, which
seems to be allowing any Unicode letter followed by any Unicode letter
or Unicode digit, while considering underscore to be a Unicode letter.

Rust [3] seems to still be on the fence, though since minimalism isn't a
primary goal of the language and providing a larger set of Unicode
functionality would be beneficial, I expect a solution allowing
non-ascii identifiers to be arrived at eventually.

[1]: http://unicode.org/reports/tr31/
[2]: https://golang.org/ref/spec
[3]: https://github.com/rust-lang/rust/issues/28979

Re: Could Lua itself become UTF8-aware?

Soni "They/Them" L.
In reply to this post by Jay Carlson


On 2017-05-01 03:44 PM, Jay Carlson wrote:

>> On Apr 30, 2017, at 10:44 AM, Soni L. <[hidden email]> wrote:
>>
>> On 2017-04-30 07:19 AM, Shmuel Zeigerman wrote:
>>> On 29/04/2017 16:41, Dirk Laurie wrote:
>>>> The next step would be a compiler option under which the lexer
>>>> accepts a UTF-8 first character followed by the correct number
>>>> of UTF-8 continuation characters as being alphabetic for the
>>>> purpose of being an identifier or part of one.
>>>>
>>> BTW, LuaJIT 2 has it for years already (it allows UTF-8 in identifiers). But it seems nobody needs it.
>>>
>> Well, I'd argue mostly nobody needs it because it doesn't allow emoji in identifiers.
> Does it prohibit all astral characters, or are emoji singled out?
>
> Either way, that's too bad. You can write this Ruby:
>
> def 🚫(📖)
> 🐦 = 💻(📖)
> 📝🚫🐊 = Array.new
>
> begin
> 🚫📥 = 🐦.blocked_ids
> File.open(File.join(LOGDIR, 'block_ids.current'), 'w+') do |💾|
> 🚫📥.sort.each do |😭🐊|
> 💾.puts(😭🐊)
> 📝🚫🐊 << 😭🐊.to_s
>
> https://github.com/oapi/emojiautoblocker/blob/master/emojiautoblocker
>
>
>

Beautiful! See, this is why we need 0x80-0xFF to be valid in
identifiers. (Or emoji, but we're trying to keep things simple and
adding full UTF-8 handling to Lua isn't simple.)

--
Disclaimer: these emails may be made public at any given time, with or without reason. If you don't agree with this, DO NOT REPLY.



Re: Could Lua itself become UTF8-aware?

Sean Conner
In reply to this post by Roberto Ierusalimschy
It was thus said that the Great Roberto Ierusalimschy once stated:

> > > What is wrong with
> > >
> > > function   ()
> > > end
> >
> >   Again, missing character in font.
> >
> > > or
> > >
> > > function functiοn ()
> > > end
> > > ?
> >
> >   There was a long discussion about that a few years ago:
> >
> > http://lua-users.org/lists/lua-l/2014-04/msg01226.html
>
> There is no missing character in font nor keywords being used as
> identifiers. Both | | ("en quad", a kind of white space) and
> |functiοn| (with a omicron for "ο") would be valid identifiers with
> the proposal (accept everything above 127 as valid in identifiers).

  The "en quad" showed up as an outlined box for me (which is how my system
displays a missing or unsupported Unicode character).  And the |function|
with the omicron for the "o" only shows up as "function" (no marks) for me.

  Perhaps I should have been more clear.

  -spc



Re: Could Lua itself become UTF8-aware?

Marc Balmer
In reply to this post by Roberto Ierusalimschy
> I prefer to keep it as it is.

Thanks.



Re: Could Lua itself become UTF8-aware?

Sean Conner
In reply to this post by Jay Carlson
It was thus said that the Great Jay Carlson once stated:

>
> > On Apr 30, 2017, at 5:11 PM, Sean Conner <[hidden email]> wrote:
> >
> > It was thus said that the Great Roberto Ierusalimschy once stated:
> >>> [...] What's wrong with:
> >>>
> >>> function 解決問題 ()
> >>> end
> >
> >  I don't have the right font installed to see the function name.
>
> Which century are you in? OK, OK, I'll spot you another ten years. Which
> decade? :-)

  Last decade.  I'm not one to needlessly update every 20 minutes, because
I have better things to do than babysit a computer doing nothing but
updating itself (and breaking something, whether an existing program, a
workflow, or something else).

  Besides, Unicode is constantly being updated.  How long is an acceptable
amount of time to update *every* computer you use to get the latest and
greatest Unicode fonts installed because you decided to use the latest emoji
characters for an identifier?

  -spc (Not a fan at all of the "constantly upgrade" train ... )


Re: Could Lua itself become UTF8-aware?

Sean Conner
In reply to this post by Dirk Laurie-2
It was thus said that the Great Dirk Laurie once stated:
> 2017-05-01 23:25 GMT+02:00 Martin <[hidden email]>:
>
> > Problem is how to type UTF glyphs from keyboard? And this will be
> > another language anyway.
>
> The system I use has a fairly easy way to configure a second
> language on the third and fourth level. I use it for APL. Apart
> from that, there are plenty of common characters on the compose
> key, e.g. all accented letters, ©, ß, §, →, ⇒, —, ….

  Compose key?  (Looks down at his US keyboard)

  -spc (And you can pry my 25 year old IBM Model M keyboard from my cold
        dead hands ... )


Re: Could Lua itself become UTF8-aware?

Andrew Starks-2
In reply to this post by Dirk Laurie-2

On Mon, May 1, 2017 at 08:46 Dirk Laurie <[hidden email]> wrote:
> 2017-05-01 15:15 GMT+02:00 Soni L. <[hidden email]>:
>
> > But my latin-1! (why limit it to UTF-8 if you can just add support for
> > pretty much every single extension of ASCII instead?)
>
> Much as I loved ISO-8859-1, I have to admit that it has become
> obsolete except as input to iconv very quickly.


There is one single language understood by every culture that is doing scientific work: math. As far as I'm aware, there is no alternate character set that one needs to translate to or from if you want to express change, vectors or some other mathematical concept with someone from another culture. 

In the same way that the language of math is universally understood in every corner of the world, is standard ASCII, with primitive English used for keywords, the universally accepted character set for source code? If that is more true than less true, then it might be reasonable to argue that it is better to limit source code to ASCII, outside of string literals, rather than innovating to allow a broader range of characters, unless there is a clear and convincing reason to break this norm (if it is a norm).

What problem does it solve? Is support for UTF-8 useful for automated script processing or some sort of DSL application?

-Andrew

