
Could Lua itself become UTF8-aware?


Could Lua itself become UTF8-aware?

Dirk Laurie-2
At present all the entries from 0x80 to 0xFF in the constant array
luai_ctype in lctype.c are zero: no bit set.

There are three unused bits. Couldn't two of them be used to mean
UTF8_FIRST and UTF8_CONT?

This is only the first step, but if the idea is shot down here already,
the others need not be mentioned.
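
For illustration, a minimal Lua sketch of the classification those two proposed bits would encode (the real change would be to the C table luai_ctype in lctype.c; how bytes that can never appear in UTF-8, such as 0xC0-0xC1 and 0xF5-0xFF, would be flagged is an assumption the proposal leaves open):

  -- Hypothetical classification of the byte values 0x80-0xFF.
  local function utf8_byte_class(b)
    if b < 0x80 then return "ascii"               -- handled by the existing bits
    elseif b < 0xC0 then return "UTF8_CONT"       -- 10xxxxxx: continuation byte
    elseif b < 0xC2 or b > 0xF4 then return nil   -- never appears in valid UTF-8
    else return "UTF8_FIRST"                      -- lead byte of a 2-4 byte sequence
    end
  end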


Re: Could Lua itself become UTF8-aware?

Roberto Ierusalimschy
> At present all the entries from 0x80 to 0xFF in the constant array
> luai_ctype in lctype.c are zero: no bit set.
>
> There are three unused bits. Couldn't two of them be used to mean
> UTF8_FIRST and UTF8_CONT?
>
> This is only the first step, but if the idea is shot down here already,
> the others need not be mentioned.

This particular idea has very low cost, so I don't see why to shoot it
down before knowing the rest of the story. What does it mean for Lua
to be "UTF-8 aware"?

-- Roberto


Re: Could Lua itself become UTF8-aware?

Dirk Laurie-2
2017-04-29 15:21 GMT+02:00 Roberto Ierusalimschy <[hidden email]>:

>> At present all the entries from 0x80 to 0xFF in the constant array
>> luai_ctype in lctype.c are zero: no bit set.
>>
>> There are three unused bits. Couldn't two of them be used to mean
>> UTF8_FIRST and UTF8_CONT?
>>
>> This is only the first step, but if the idea is shot down here already,
>> the others need not be mentioned.
>
> This particular idea has very low cost, so I don't see why to shot it
> down before knowing the rest of the story. What does it mean for Lua
> to be "UTF-8 aware"?
>
> -- Roberto

The next step would be a compiler option under which the lexer
accepts a UTF-8 first character followed by the correct number
of UTF-8 continuation characters as being alphabetic for the
purpose of being an identifier or part of one.
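
For illustration, a minimal Lua sketch of the "correct number of continuation characters" rule described above (names are illustrative; overlong and surrogate checks are omitted):

  -- How many bytes a sequence starting with this lead byte must have.
  local function utf8_seq_len(lead)
    if lead < 0x80 then return 1                        -- plain ASCII
    elseif lead >= 0xC2 and lead <= 0xDF then return 2
    elseif lead >= 0xE0 and lead <= 0xEF then return 3
    elseif lead >= 0xF0 and lead <= 0xF4 then return 4
    else return nil                                     -- not a valid lead byte
    end
  end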


Re: Could Lua itself become UTF8-aware?

Patrick Donnelly
On Sat, Apr 29, 2017 at 9:41 AM, Dirk Laurie <[hidden email]> wrote:

> 2017-04-29 15:21 GMT+02:00 Roberto Ierusalimschy <[hidden email]>:
>>> At present all the entries from 0x80 to 0xFF in the constant array
>>> luai_ctype in lctype.c are zero: no bit set.
>>>
>>> There are three unused bits. Couldn't two of them be used to mean
>>> UTF8_FIRST and UTF8_CONT?
>>>
>>> This is only the first step, but if the idea is shot down here already,
>>> the others need not be mentioned.
>>
>> This particular idea has very low cost, so I don't see why to shot it
>> down before knowing the rest of the story. What does it mean for Lua
>> to be "UTF-8 aware"?
>>
>> -- Roberto
>
> The next step would be a compiler option under which the lexer
> accepts a UTF-8 first character followed by the correct number
> of UTF-8 continuation characters as being alphabetic for the
> purpose of being an identifier or part of one.

I'm very against even inching towards this destination. Lua is a
*language*. As soon as we start allowing identifiers outside of ASCII,
we begin to cultivate "dialects". Only with full support would
anyone's Lua be able to load scripts written with identifiers from
another language. And, of course, programmers not fluent in that
language would be at great disadvantage.

Air traffic control standardized on English so that any pilot
can communicate with any flight controller. In the same way, I think
it makes a lot of sense for programmers to accept that English is the
lingua franca for programming, including comments, documentation, and
identifiers. There really is no upside to allowing non-English (non-ASCII)
identifiers.

Maybe that's self-serving of me as an American, for which I apologize.

--
Patrick Donnelly


Re: Could Lua itself become UTF8-aware?

Soni L.


On 2017-04-30 12:05 AM, Patrick Donnelly wrote:

> On Sat, Apr 29, 2017 at 9:41 AM, Dirk Laurie <[hidden email]> wrote:
>> 2017-04-29 15:21 GMT+02:00 Roberto Ierusalimschy <[hidden email]>:
>>>> At present all the entries from 0x80 to 0xFF in the constant array
>>>> luai_ctype in lctype.c are zero: no bit set.
>>>>
>>>> There are three unused bits. Couldn't two of them be used to mean
>>>> UTF8_FIRST and UTF8_CONT?
>>>>
>>>> This is only the first step, but if the idea is shot down here already,
>>>> the others need not be mentioned.
>>> This particular idea has very low cost, so I don't see why to shot it
>>> down before knowing the rest of the story. What does it mean for Lua
>>> to be "UTF-8 aware"?
>>>
>>> -- Roberto
>> The next step would be a compiler option under which the lexer
>> accepts a UTF-8 first character followed by the correct number
>> of UTF-8 continuation characters as being alphabetic for the
>> purpose of being an identifier or part of one.
> I'm very against even inching towards this destination. Lua is a
> *language*. As soon as we start allowing identifiers outside of ASCII,
> we begin to cultivate "dialects". Only with full support would
> anyone's Lua be able to load scripts written with identifiers from
> another language. And, of course, programmers not fluent in that
> language would be at great disadvantage.
>
> Air traffic control for flight standardized on English so any pilot
> can communicate with any flight controller. In the same way, I think
> it makes a lot of sense for programmers to accept that English is the
> lingua franca for programming, including comments, documentation, and
> identifiers. There really is no upside to allowing non-English (ASCII)
> identifiers.
>
> Maybe that's self-serving as an American for which I apologize.
>

Just allow all 0x80-0xFF to be used in identifiers. Solves most of our
issues.

Also, y'know...

emoji.

(I'd argue there's a HUGE upside to allowing emoji. I think everything
should just allow emoji. Actually, why don't we get rid of ASCII while
we're at it and make things emoji-only?)

--
Disclaimer: these emails may be made public at any given time, with or without reason. If you don't agree with this, DO NOT REPLY.



Re: Could Lua itself become UTF8-aware?

Paul E. Merrell, J.D.
In reply to this post by Patrick Donnelly
On Sat, Apr 29, 2017 at 8:05 PM, Patrick Donnelly <[hidden email]> wrote:

> I'm very against even inching towards this destination. Lua is a
> *language*. As soon as we start allowing identifiers outside of ASCII,
> we begin to cultivate "dialects". Only with full support would
> anyone's Lua be able to load scripts written with identifiers from
> another language. And, of course, programmers not fluent in that
> language would be at great disadvantage.
>
> Air traffic control for flight standardized on English so any pilot
> can communicate with any flight controller. In the same way, I think
> it makes a lot of sense for programmers to accept that English is the
> lingua franca for programming, including comments, documentation, and
> identifiers. There really is no upside to allowing non-English (ASCII)
> identifiers.
>
> Maybe that's self-serving as an American for which I apologize.

[long post warning]

Some telegraphy history might stretch your horizon a bit.

One of the early experimental systems in France at the sending end
involved selecting a wire corresponding to a character and briefly
connecting it to an electrical power source, which caused a charge to
be sent down the wire. At the receiving end was a group of people,
each holding a wire in their hand corresponding to a character. When
one of them felt a shock, they were to shout the name of their
character. A transcriber seated among the group was to write down each
character as it was received. Crude and slow, but it worked.

Skip ahead and we arrive at Baudot, who devised what amounted to a
5-bit code that could handle an English or French alphabet. Never
commercialized, but his system led to similar systems that were
successful. Five-bit telegraphy code peaked with invention and
commercialization of the Teletype, a wondrous device that allowed
using a typewriter keyboard and eventually punched 5-bit TTY paper
tape for storage and advance recording of telegraph messages.

But not expressive enough for some folk, such as newspaper types who
wanted both lower case and upper case characters plus special
characters and a few codes for the forerunners of today's markup
languages. Enter the Teletypesetter in 1928, which used a 6-bit TTS
code that, with two code pages (the typewriter shift key was the
switch between them), fit the bill. Suddenly, lines of type could be
set on one end of the continent but cast from molten lead at the
other end. All you had to do was to learn to cope with a
typewriter-like keyboard that had both a Shift key and an Unshift key.

Then in the 1950s, when IBM et ilk were looking for ways to
commercialize computers, they realized that the newspaper industry was
working with punched paper tape that was inherently binary and that
newspapers had the money to buy computers that could reduce very
expensive labor costs. So we got very crude hyphenation programs that
could process "idiot tape," TTS tape punched without line breaks, each
paragraph a continuous stream. The computers could process the tape
and add the line endings and do the hyphenation.  But as word
processing technology developed in the newspaper industry, [1] 6 bits
with two code pages just wasn't expressive enough, so a third code
page was engrafted onto TTS, using the dollar-sign character as its
toggle (we had to put two consecutive dollar signs to get a dollar
sign after that). But that gave us a code page for cryptic commands
like "$d52" that were the forerunners of today's word processing
"styles."

But even with a third code page, TTS just wasn't expressive enough. So
along comes ASCII, a 7-bit code carried in 8-bit bytes. Room for the entire English alphabet, both
upper and lower case, plus a lot more special characters and computer
codes. And what had been the printer's handwritten markup language
that had evolved over some 500 years was abruptly translated into
computer code that could be processed far more quickly. ASCII was even
expressive enough for modern "markup" languages and programming
languages.

But ASCII was not expressive enough for all human languages, notably
the CJK languages that depend on iconic symbols rather than alphabetic
characters. And thus Unicode was born.

UTF-8 is now recognizable as ASCII's successor. It's become the most
common character set specified for web pages. And some of the
most-used programming libraries speak UTF-8, e.g., the multi-platform
GTK and Qt families of libraries.

I work on a GTK-based program [2] that runs on a wide variety of
operating systems and is localized for about 17 human languages,
including CJK languages and right-to-left languages. We couldn't do
that if our data files were written to ASCII. We have to use UTF-8.
And we embed Lua, which means that Lua has to handle UTF-8 strings. We
get there by also embedding Xavier Wang's lua-utf8 code, [3] which
lets our scripters get and set character offsets that allow for UTF-8
multi-byte characters.
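
For illustration, character-counting versus byte-counting with such a library (the calls shown exist in Lua 5.3's built-in utf8 library; assuming luautf8 provides the same calls is an assumption here):

  local s = "Grüße"
  print(#s)                -- 7: byte length
  print(utf8.len(s))       -- 5: character count
  print(utf8.offset(s, 3)) -- 3: byte index where the 3rd character starts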

But my message here is that Lua either makes the transition to more
elegant handling of UTF-8 data or Lua will go the way of those people
who held those wires, waiting for a shock so they could shout their
character. Lua's code may be written in ASCII, but Lua has to make the
transition to UTF-8 strings or Lua will be obsoleted much sooner than
it has to be.

Or in other words, it's the character set of the data content rather
than the character set of the programming code that points the way
here. The data is going to be UTF-8 and Lua must evolve to handle it.

My 2 cents.

Paul

[1]  Virtually all of the basic modern word processing techniques were
originally developed for the newspaper industry while it was still
using TTS code. There is probably more here about TTS and related
technology than you will want to view/read.
<http://www.gochipmunk.com/html/contents.html>. I worked as a
typographer in daily newspapers in the years spanning the introduction
of computers and the transition from hot metal linecasting to
computer-aided phototypesetting.

[2] <https://notecasepro.com>

[3] <https://github.com/starwing/luautf8>.




--
[Notice not included in the above original message:  The U.S. National
Security Agency neither confirms nor denies that it intercepted this
message.]


Re: Could Lua itself become UTF8-aware?

Russell Haley
I could never understand the need for the extra characters in ASCII until reading your post and then seeing a picture of a business listing in a 1960 phone book produced with an ACE machine (it was formatted in a box... that had to come from specific character slugs to create the mold for printing! I saw it in the second link in Paul's gochipmunk.com reference).

Russ

Sent from my BlackBerry 10 smartphone on the Virgin Mobile network.


Re: Could Lua itself become UTF8-aware?

Paul E. Merrell, J.D.
On Sat, Apr 29, 2017 at 11:43 PM, Russell Haley <[hidden email]> wrote:
> I could never understand the need for extra characters in ascii until reading your post and then seeing a picture of a business listing in a 1960 phone book produced with and ACE machine (it was formatted in a box... that had to come from specific character slugs to create the mold for printing! I saw it in the second link in Paul's gochipmunk.com reference).

The ACE Elektron was the peak of hot metal linecasting equipment. Lots
of bells and whistles. But even the average Linotype machine had
something on the order of 20,000 moving parts. They were lots of fun
to use and maintain. You'll find a better view of the basic
linecasting function here.
<http://www.gochipmunk.com/html/teletypeperf3.html>.

I often miss the hot metal typography days.

Paul


--
[Notice not included in the above original message:  The U.S. National
Security Agency neither confirms nor denies that it intercepted this
message.]


Re: Could Lua itself become UTF8-aware?

Rain Gloom
> Solves most of our issues.
I've never seen any useful program that used UTF-8 in identifiers. Using it for string processing would make sense, but things like typing π instead of pi will just complicate writing and parsing. Because of that, I must ask: what _are_ those issues?

Re: Could Lua itself become UTF8-aware?

Marc Balmer
In reply to this post by Patrick Donnelly


Am 30.04.17 um 05:05 schrieb Patrick Donnelly:

> On Sat, Apr 29, 2017 at 9:41 AM, Dirk Laurie <[hidden email]> wrote:
>> 2017-04-29 15:21 GMT+02:00 Roberto Ierusalimschy <[hidden email]>:
>>>> At present all the entries from 0x80 to 0xFF in the constant array
>>>> luai_ctype in lctype.c are zero: no bit set.
>>>>
>>>> There are three unused bits. Couldn't two of them be used to mean
>>>> UTF8_FIRST and UTF8_CONT?
>>>>
>>>> This is only the first step, but if the idea is shot down here already,
>>>> the others need not be mentioned.
>>>
>>> This particular idea has very low cost, so I don't see why to shot it
>>> down before knowing the rest of the story. What does it mean for Lua
>>> to be "UTF-8 aware"?
>>>
>>> -- Roberto
>>
>> The next step would be a compiler option under which the lexer
>> accepts a UTF-8 first character followed by the correct number
>> of UTF-8 continuation characters as being alphabetic for the
>> purpose of being an identifier or part of one.
>
> I'm very against even inching towards this destination. Lua is a
> *language*. As soon as we start allowing identifiers outside of ASCII,
> we begin to cultivate "dialects". Only with full support would
> anyone's Lua be able to load scripts written with identifiers from
> another language. And, of course, programmers not fluent in that
> language would be at great disadvantage.
>
> Air traffic control for flight standardized on English so any pilot
> can communicate with any flight controller. In the same way, I think
> it makes a lot of sense for programmers to accept that English is the
> lingua franca for programming, including comments, documentation, and
> identifiers. There really is no upside to allowing non-English (ASCII)
> identifiers.
>
> Maybe that's self-serving as an American for which I apologize.

This comparison is flaky.  Although ATCs _use_ English for their
communication with pilots, this does not mean that they cannot speak other
languages as well.  Using English is a convention; it is not enforced
by making sure a newborn learns only English, learns no other languages,
and 25 years later becomes an ATC.

Using English is not technically enforced, it is a convention.

So even if Lua were to allow emojis as identifiers, you would not be forced
to use them.  You can, by convention, restrict yourself to ASCII only.





Re: Could Lua itself become UTF8-aware?

Marc Balmer
In reply to this post by Paul E. Merrell, J.D.
> [...]
>
> UTF-8 is now recognizable as ASCII's successor. It's become the most
> common character set specified for web pages. And some of the
> most-used programming libraries speak UTF-8, e.g, the multi-platform
> GTK and Qt families of libraries.
>
> [...]


UTF-8 is _not_ a character set.  It stands for UCS Transformation
Format, or Universal Character Set Transformation Format.  The Universal
Character Set, and the almost identical Unicode, are the character sets;
UTF-8 is just one way of encoding them.



Re: Could Lua itself become UTF8-aware?

Shmuel Zeigerman
In reply to this post by Dirk Laurie-2
On 29/04/2017 16:41, Dirk Laurie wrote:
>
> The next step would be a compiler option under which the lexer
> accepts a UTF-8 first character followed by the correct number
> of UTF-8 continuation characters as being alphabetic for the
> purpose of being an identifier or part of one.
>

BTW, LuaJIT 2 has had it for years already (it allows UTF-8 in identifiers).
But it seems nobody needs it.

--
Shmuel



Re: Could Lua itself become UTF8-aware?

Soni L.


On 2017-04-30 07:19 AM, Shmuel Zeigerman wrote:

> On 29/04/2017 16:41, Dirk Laurie wrote:
>>
>> The next step would be a compiler option under which the lexer
>> accepts a UTF-8 first character followed by the correct number
>> of UTF-8 continuation characters as being alphabetic for the
>> purpose of being an identifier or part of one.
>>
>
> BTW, LuaJIT 2 has it for years already (it allows UTF-8 in
> identifiers). But it seems nobody needs it.
>

Well, I'd argue mostly nobody needs it because it doesn't allow emoji in
identifiers.

--
Disclaimer: these emails may be made public at any given time, with or without reason. If you don't agree with this, DO NOT REPLY.



Could Lua itself become UTF8-aware?

Gé Weijers


On Sunday, April 30, 2017, Rain Gloom <raingloom42@...> wrote:
> Solves most of our issues.
> I've never seen any useful program that used UTF-8 in identifiers, using it for string processing would make sense, but things like typing π instead of pi will just complicate writing and parsing. Because of that, I must ask: what _are_ those issues?

The biggest issue is that people want to use other languages, especially now that coding classes for teenagers are in fashion. My first programming classes (college level) expected Dutch for names of identifiers, which was easy enough: you just leave off a few diacritical marks here and there. But that won't work for Greek, Russian, Chinese, Korean, Arabic, Ugaritic (Lua in cuneiform on clay tablets, anyone?), etc. What's wrong with:

function 解決問題 ()
end

This is not a trivial exercise to get right; I'm not sure that just pretending any Unicode glyph from 0x80 on up is a 'letter' is sufficient.
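
As one concrete illustration of that worry: NO-BREAK SPACE (U+00A0) encodes as two bytes at or above 0x80, so a naive "0x80 and up is a letter" rule would happily lex source that renders as two words into a single identifier:

  print(("\u{A0}"):byte(1, -1))  -- 194  160: both bytes >= 0x80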








Re: Could Lua itself become UTF8-aware?

Roberto Ierusalimschy
> [...] What's wrong with:
>
> function 解決問題 ()
> end
>
> This is not a trivial exercise to get right, I'm not sure that just
> pretending any Unicode glyph from 0x80 on up is a 'letter' is sufficient.

What is wrong with

function   ()
end

or

function functiοn ()
end
?

-- Roberto
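
The trick in the second example is easy to miss: it relies on U+03BF GREEK SMALL LETTER OMICRON, which renders like "o" but is a different byte sequence. A quick Lua check, for illustration only:

  print("function" == "functi\u{3BF}n")  -- false: the omicron is not "o"
  print(#"functi\u{3BF}n")               -- 9: the omicron occupies two bytes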


Re: Could Lua itself become UTF8-aware?

Gregg Reynolds-2


On Apr 30, 2017 3:04 PM, "Roberto Ierusalimschy" <[hidden email]> wrote:
> [...] What's wrong with:
>
> function 解決問題 ()
> end
>
> This is not a trivial exercise to get right, I'm not sure that just
> pretending any Unicode glyph from 0x80 on up is a 'letter' is sufficient.

What is wrong with

function  ()
end

or

function functiοn ()
end

for that matter, why not
وظيفة ()
نهاية

(Arabic)

gregg



Re: Could Lua itself become UTF8-aware?

Jay Carlson
In reply to this post by Roberto Ierusalimschy
[Why do I write the same messages every year or two? I forget when and whether I've sent them, and every time I write about the subjects, I learn more about them. Hopefully this is more coherent a summary than before, although it needs a lot more polishing. Maybe there's a workshop paper in here if I ever finish the code.]

On Apr 29, 2017, at 9:21 AM, Roberto Ierusalimschy <[hidden email]> wrote:
>
>> At present all the entries from 0x80 to 0xFF in the constant array
>> luai_ctype in lctype.c are zero: no bit set.
>>
>> There are three unused bits. Couldn't two of them be used to mean
>> UTF8_FIRST and UTF8_CONT?

As long as people still check for things like UTF-16 surrogates in UTF-8. IMO C code is almost necessary for very unsexy things like this, because it can be done once, correctly, and with good performance.

Are these ctype values going to be turned into pattern classes?

>>
>> This is only the first step, but if the idea is shot down here already,
>> the others need not be mentioned.
>
> This particular idea has very low cost, so I don't see why to shot it
> down before knowing the rest of the story. What does it mean for Lua
> to be "UTF-8 aware"?


One thing I should make clear at the start of this: I agree that there is not more than one kind of string in Lua, at least not in the obvious future. There can't be a type-distinction between UTF-8, ASCII, or byte buffer--I think?

So what would I like?

I'd like to have Lua better support checking whether something is RFC-legal UTF-8. My goal is to have prompt, syntactically local diagnosis of string manipulation errors. I don't want either me or the bozos down the hall passing my function a bogus byte stream when it needs to be in UTF-8 to have a correct program.

I already heavily assert(is_utf8) on application core utilities; that way I won't let bogus byte buffers propagate into my internal text models. For a big class of string-manipulating programs, not letting bad UTF-8 in magically solves it on output, as many strings are just passed through. (IIRC, the biggest place you can shoot yourself in the foot is by using "." in Lua patterns. But you're safe if all your "."s are surrounded by literal codepoints.)
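
For illustration, the "." pitfall in two lines: a plain "." in a Lua pattern matches a single byte and can split a multi-byte character, while utf8.charpattern (Lua 5.3) matches a whole character:

  local s = "é"                        -- two bytes, 0xC3 0xA9
  print(#(s:match(".")))               -- 1: just the lead byte, no longer valid UTF-8
  print(#(s:match(utf8.charpattern)))  -- 2: the whole character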

Somewhere I have a patch for 5.3 which uses a stolen alignment byte in a TString to store character subset info. Lua strings are immutable, but the implementation can keep notes about them.

I use the three-valued logic of "string known to be in a charset", "known not in a charset", and "unknown". is_utf8(s) just looks at the UTF-8 field and, if its membership is known, returns it. Otherwise it scans the string as usual and stores what it has found in the TString flags before returning.
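
A pure-Lua analogue of that logic, for illustration only (the names are made up, the patch stores the verdict in spare TString bits rather than a table, and utf8.len is laxer than RFC 3629, so a strict validator would add surrogate and range checks):

  local verdict = {}                -- s -> true/false; absent means "unknown"

  local function is_utf8(s)
    local v = verdict[s]
    if v ~= nil then return v end   -- membership already known
    v = utf8.len(s) ~= nil          -- utf8.len returns nil on a bad sequence
    verdict[s] = v
    return v
  end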

This is a lot cheaper than it looks, and I think it can be done for minimal costs for most existing programs. The library would be opt-in. The great majority of costs are placed on people who want utf8 validation. But there are some things that only the language can do cheaply.

The first thing we can check for is "does this string belong to ASCII?". For strings we're fully hashing, this test is very cheap: just OR together all the bytes as we process them. Check the high bit.
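
As a standalone sketch of that test (in the patch it would ride along inside the C string-hashing loop for free):

  local function is_ascii(s)
    local acc = 0
    for i = 1, #s do
      acc = acc | s:byte(i)  -- Lua 5.3 bitwise OR
    end
    return acc < 0x80        -- high bit never set => pure ASCII
  end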

If we're not fully hashing a string, we don't know anything about it, so set all flags to "unknown".

However, if we know a string is definitely in ASCII, it is also in UTF-8, so we mark the string as known to be both UTF-8 and ASCII whenever we notice. There are a lot of strings in this world written in ASCII, so marking them as UTF-8 too is cheap.

If a string flunks the ASCII test (or its ASCII status is unknown because it's long), the string might be legal UTF-8 data. The first time somebody asks us whether a string is in UTF-8 with is_utf8, we have to walk the entire string and determine if it's legal. Similarly, we do remember the result in the TString before we return.

The good news about ASCII and UTF-8 is that they're closed under concatenation. When two strings are concatenated ("a".."‡"), we can automatically set the result's "known" flags by combining the flags on the two strings. We may not know anything about one of the strings, in which case the result's status is unknown. If anybody had ever called is_utf8() on "a" and "‡", we'd immediately know the result is in UTF-8 and know it's not ASCII.
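
A sketch of that flag propagation, with true = known member, false = known non-member, nil = unknown (the UTF-8 rule here is deliberately conservative; the "where it broke" flag discussed below would sharpen the false cases):

  local function combine_ascii(a, b)
    if a == false or b == false then return false end  -- one high byte taints the result
    if a and b then return true end                     -- ASCII is closed under concatenation
    return nil
  end

  local function combine_utf8(a, b)
    if a and b then return true end  -- valid UTF-8 is closed under concatenation
    return nil
  end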

To be honest, the place where I stopped working on this was before adding a library to liblua.  I had proved to myself that this would work on a small scale, and examining TStrings in the debugger showed them working right. Next goal was to make an analog of string.* in utf8.*; all calls to the utf8 library will blow up on receiving or producing invalid UTF-8. Then I got distracted.

Conveniently, any strings that come out of utf8.* functions have to be validated by is_utf8(); this means the strings produced by the library already have "is UTF-8" and "is ASCII" set accordingly.

I've got two more status flags left in TString. To use the first, I've considered "has no NULs" or "is in UTF-8 and has no NULs", since I've needed both from time to time.

If you've stuck around this long, I have a premature optimization: the last free flag space in TString could be used for diagnosis of why a string failed to validate: "UTF-8 only broken at start of string", "UTF-8 only broken at end", "UTF-8 broken internally", "unknown". Clearly if you concatenate any two strings with any validation failures in the middle, you get a non-UTF-8 string. But there is a chance that concatenating two non-UTF-8 strings will produce a valid string; perhaps a buffer split up a sequence.

If anybody has a good benchmark for me to try, I can figure out how much slowdown this causes for existing code. My hope is a performance hit of less than ~2% on existing string.*-only Lua code, and something significantly faster than sprinkling assertions around. As with LINQ-for-Lua, no promises I'll ship it.

Fascists like me would type:

  local string = utf8

in most code.

Jay

[0]: How did I make it through that message without writing any footnotes?

Re: Could Lua itself become UTF8-aware?

Sean Conner
In reply to this post by Roberto Ierusalimschy
It was thus said that the Great Roberto Ierusalimschy once stated:
> > [...] What's wrong with:
> >
> > function 解決問題 ()
> > end

  I don't have the right font installed to see the function name.

> > This is not a trivial exercise to get right, I'm not sure that just
> > pretending any Unicode glyph from 0x80 on up is a 'letter' is sufficient.
>
> What is wrong with
>
> function   ()
> end

  Again, missing character in font.

> or
>
> function functiοn ()
> end
> ?

  There was a long discussion about that a few years ago:

        http://lua-users.org/lists/lua-l/2014-04/msg01226.html

  It appears the consensus then was "maybe not a good idea."

  -spc



Re: Could Lua itself become UTF8-aware?

Dirk Laurie-2
2017-04-30 23:11 GMT+02:00 Sean Conner <[hidden email]>:

>   There was a long discussion about that a few years ago:
>
>         http://lua-users.org/lists/lua-l/2014-04/msg01226.html
>
>   It appears the consensus then was "maybe not a good idea."
>
>   -spc

I did not read that before posting, and Sean is careful not to imply that
old issues are dead and buried.

But UTF-8 has come closer to universal acceptance in the last three
years.  What was "maybe not a good idea" back then might have
become "maybe not a bad idea" now.

-- Dirk


Re: Could Lua itself become UTF8-aware?

Daurnimator
On 1 May 2017 at 15:05, Dirk Laurie <[hidden email]> wrote:

> 2017-04-30 23:11 GMT+02:00 Sean Conner <[hidden email]>:
>
>>   There was a long discussion about that a few years ago:
>>
>>         http://lua-users.org/lists/lua-l/2014-04/msg01226.html
>>
>>   It appears the consensus then was "maybe not a good idea."
>>
>>   -spc
>
> I did not read that before posting, and Sean is careful not to imply that
> old issues are dead and buried,
>
> But UTF-8 has come closer to universal acceptance in the last three
> years.  What was "maybe not a good idea" back then, might have
> become" maybe not a bad idea" now.
>
> -- Dirk
>

I don't think UTF-8's acceptance factor has changed at all in the last
~5-10 years: it has always been highly accepted outside of
interoperation with Windows native APIs. Around 2005-2006 was when I
started hearing C# and Java devs wish they had UTF-8 instead.

However, to reply to the issue at hand: are Unicode classes wanted?
I.e., should a Unicode space such as U+2001 count as whitespace for
token separation?
Furthermore, what should be considered valid characters for identifiers?
I guess we still want the rule "alpha followed by any number of alphanumerics"?
Which Unicode standard do we want to pick? (You did realise Unicode
gets updated... right?)
We'd need a strategy to deal with updates (which rarely go well: see
how people are still dealing with fallout from IDNA2003 => IDNA2008).

Which brings us to the next problem: normalisation of identifiers. It
would seem perplexing to many that the identifiers U+00C5 and U+0041
U+030A would refer to different variables.
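
For illustration, the two spellings in Lua source: both render as "Å", but they are different strings, so as identifiers they would name different variables:

  local nfc = "\u{C5}"           -- U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
  local nfd = "A\u{30A}"         -- U+0041 followed by U+030A COMBINING RING ABOVE
  print(nfc == nfd, #nfc, #nfd)  -- false  2  3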
Even if, like me, you don't think normalisation should occur, you'll
at least have an easy mechanism for obfuscated code contests...
