Allow UTF8 value names in lua?

Allow UTF8 value names in lua?

bil til
Would you consider allowing value/variable names encoded in UTF-8 in Lua?

(Of course such a name would still NOT be allowed to start with a digit, and
operator signs such as +#-*/... would still be forbidden in it ... but any
"wider" UTF-8 characters would be allowed?)

This would certainly be VERY nice for a great many users in Asia... .
(Europeans with their crazy languages usually have workarounds for writing
their crazy letters in English, but I think even those users would like
this.)





--
Sent from: http://lua.2524044.n2.nabble.com/Lua-l-f2524044.html


Re: Allow UTF8 value names in lua?

Gabriel Bertilson
You might be interested in trying out my "UTF-8 identifiers patch",
which I posted about in the list:
http://lua.2524044.n2.nabble.com/UTF-8-identifiers-patch-td7684626.html

It's not quite satisfactory (it probably should normalize, and allow
ZWNJ and ZWJ in certain positions; see
https://www.unicode.org/reports/tr31/tr31-31.html#Layout_and_Format_Control_Characters),
but it's an improvement over simply allowing all non-ASCII bytes in
identifiers, and it didn't greatly inflate the binary.

— Gabriel



Re: Allow UTF8 value names in lua?

bil til
Thank you, this sounds interesting.

Do you have any idea why this bloats the code by 11/56k (dll / luac)?

I would normally assume that this should not bloat the code much at
all ... .

As I would like to avoid going through your code in detail: are there
MANY functions that need to be changed? (Do you have an idea which
functions would "bloat up" most if this is allowed?)

And thank you for the nice example with φ - this already looks really
wonderful :) .




Re: Allow UTF8 value names in lua?

Gabriel Bertilson
On Fri, Oct 25, 2019 at 12:46 AM bil til <[hidden email]> wrote:
> Do you have an idea why this bloats the code by 11/56k (dll / luac)?

I never did look into this. I would have to learn how to compare the
contents of binary files....

> As I would like to avoid to go through your code in any detail: Are there
> MANY functions, which need to be changed? (do you have an idea, which
> functions would "bloat up" most, if this is allowed?)

The changes only involved the lexer. Loading a Lua script might be
slower (particularly if it contains bytes greater than 0x7F), probably
not by much (no additional allocations are required at least), but I
haven't tried to measure it.

Normalization, so that á (precomposed U+00E1) and á (a followed by the
combining acute accent U+0301) aren't separate identifiers, would
require even more Unicode data, and probably memory allocation, which
would slow things down more; but I never added it because I didn't
figure out how to store the normalized identifier in the lexer buffer.
Another difficulty is that I'd need to use a Unicode library, like
utf8proc, to do the normalization, and Lua is supposed to be
self-contained! :-(
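
To make the normalization problem concrete, here is a minimal sketch in C. It assumes a lexer that compares identifiers byte for byte; same_identifier is a hypothetical helper for illustration, not a function from Lua or from the patch. The two spellings of á render identically but are different byte strings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Precomposed form: U+00E1 LATIN SMALL LETTER A WITH ACUTE,
   UTF-8 bytes C3 A1. */
static const char a_acute_composed[] = "\xC3\xA1";

/* Decomposed form: U+0061 'a' followed by U+0301 COMBINING ACUTE ACCENT,
   UTF-8 bytes 61 CC 81. */
static const char a_acute_decomposed[] = "a\xCC\x81";

/* Hypothetical helper: without normalization, identifier equality is
   nothing more than equality of the raw bytes. */
static int same_identifier(const char *x, const char *y) {
    assert(x != NULL && y != NULL);
    return strcmp(x, y) == 0;
}
```

With these definitions, same_identifier(a_acute_composed, a_acute_decomposed) is false, so a lexer of this kind would treat the two spellings as two different variables. Mapping both to one form is the job a normalization library would take on.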

— Gabriel


Re: Allow UTF8 value names in lua?

bil til
Thank you, this really sounds interesting.

Normalisation I would NOT require ... that should then be the user's concern.

But in the other post you cited, you said that you did this modification
yourself. Don't you have the C code available then?




Re: Allow UTF8 value names in lua?

Gabriel Bertilson
On Fri, Oct 25, 2019 at 1:55 AM bil til <[hidden email]> wrote:
> But in this other blog you cited, you said that you did this modification
> yourself. Don't you have the C code available then?

Oh yes, it was linked from my post that I linked to:
https://github.com/Erutuon/lua-utf8-identifiers

— Gabriel


Re: Allow UTF8 value names in lua?

bil til
Thank you, I looked at the source code.

I am just puzzled, because only a very small part is this lctype.c /
lctype.h change named "throw error early for bytes not used at all in UTF-8
(0xC0...)".

But the main code at this GitHub link is "....lua" source code? I would
expect the necessary changes to apply ONLY to C source code, NOT to Lua
code?

PS: I now seem to understand how ingeniously this table in lctype.c is
made ... so in principle, to allow the bytes 0x80...0xFF to be alpha
characters, I just place 0x01 there, so that the "ALPHABIT" is set, and
then those "UTF8 characters" should be allowed... .
(Of course the user will then have the problem with these Unicode
ambiguities... this is clear anyway...). Thanks, this really looks great.

(Just this would then NOT change the binary code size at all, as I
see it .... the table simply has 128 times "0x01" instead of "0x00"...).
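
The table idea in the PS can be sketched roughly as follows. This is an illustration only, not the code in lctype.c: the table is filled at run time instead of being a literal, the names ctype_table, init_ctype and is_valid_name are hypothetical, and only the alpha and digit bits are modelled.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define ALPHABIT 0
#define DIGITBIT 1
#define MASK(b)  (1 << (b))

/* One flag byte per possible input byte (hypothetical stand-in for the
   classification table discussed above). */
static unsigned char ctype_table[UCHAR_MAX + 1];

/* Fill the table; if allow_high_bytes is nonzero, every byte 0x80..0xFF
   gets the alpha bit, i.e. "128 times 0x01 instead of 0x00". */
static void init_ctype(int allow_high_bytes) {
    int c;
    for (c = 0; c <= UCHAR_MAX; c++) {
        unsigned char flags = 0;
        if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_')
            flags |= MASK(ALPHABIT);
        if (c >= '0' && c <= '9')
            flags |= MASK(DIGITBIT);
        if (allow_high_bytes && c >= 0x80)
            flags |= MASK(ALPHABIT);
        ctype_table[c] = flags;
    }
}

static int is_name_start(unsigned char c) {
    return ctype_table[c] & MASK(ALPHABIT);
}

static int is_name_part(unsigned char c) {
    return ctype_table[c] & (MASK(ALPHABIT) | MASK(DIGITBIT));
}

/* A name must start with an "alpha" byte and continue with alpha or
   digit bytes; operators and an empty string are rejected. */
static int is_valid_name(const char *s) {
    const unsigned char *p = (const unsigned char *)s;
    assert(s != NULL);
    if (!is_name_start(*p)) return 0;
    for (p++; *p != '\0'; p++)
        if (!is_name_part(*p)) return 0;
    return 1;
}
```

After init_ctype(1), the two bytes of φ (CF 86) pass is_valid_name, but so do the two bytes of a non-breaking space (C2 A0). A byte-level table cannot tell letters from other non-ASCII characters, which is the gap a code-point-level check is meant to close.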

PPS: But instead of expecting Lua's internal compiler to handle such
"Unicode ambiguities", you could also ask the user to use an editor
that checks for such "Unicode ambiguities" when editing or saving
the UTF-8 file ... I hope something like this is either already
available or will become available as soon as some people need
it... .

Re: Allow UTF8 value names in lua?

Gabriel Bertilson
On Fri, Oct 25, 2019 at 2:30 AM bil til <[hidden email]> wrote:

>
> Thank you, I looked at the source code.
>
> I am just bewildered, because only a very small part is this lctype.c,
> lctype.h named "throw error early for bytes not used at all in UTF-8
> (0xC0...)".
>
> But the main code of this gitub link is "....lua" source code? I would
> expect that the necessary changes ONLY apply to c sourcecode, but NOT to lua
> code?

The important stuff is in the src directory. The repository was pretty
messy. I've cleaned it up a bit and added a patch file, which shows
the changes. To use the patch, I unpack a fresh copy of Lua 5.3.5 and
invoke these commands in the top directory:

patch -p1 < utf8_identifiers.patch
make linux 'MYCFLAGS=-D ALLOW_UTF8_IDENTIFIERS'

There are two Lua files, which you don't need: the tests, and
make_data.lua, which generates the Unicode data from
DerivedCoreProperties.txt in the Unicode Character Database and
modifies src/unicodeid_template.h to create src/unicodeid.h.
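
For readers wondering what generated Unicode data of this kind might look like at run time, here is a sketch in C. It assumes a sorted table of code point ranges searched by binary search; whether the patch stores or searches its data this way is not stated in this thread. The names CodeRange, id_start_excerpt and is_id_start are hypothetical, and the handful of ranges is a hand-picked excerpt that should be checked against the real DerivedCoreProperties.txt.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    unsigned long first;  /* first code point in the range */
    unsigned long last;   /* last code point in the range (inclusive) */
} CodeRange;

/* Illustrative excerpt only, sorted by code point. A generated table
   would contain hundreds of ranges. */
static const CodeRange id_start_excerpt[] = {
    {0x0041, 0x005A},  /* A-Z */
    {0x0061, 0x007A},  /* a-z */
    {0x00C0, 0x00D6},  /* Latin-1 letters before the multiplication sign */
    {0x00D8, 0x00F6},  /* Latin-1 letters after the multiplication sign */
    {0x03A3, 0x03F5},  /* part of Greek, includes U+03C6 (phi) */
    {0x4E00, 0x9FA5}   /* part of the CJK unified ideographs */
};

/* Binary search over sorted, non-overlapping ranges. */
static int in_ranges(const CodeRange *ranges, size_t n, unsigned long cp) {
    size_t lo = 0, hi = n;
    assert(ranges != NULL);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < ranges[mid].first)
            hi = mid;
        else if (cp > ranges[mid].last)
            lo = mid + 1;
        else
            return 1;
    }
    return 0;
}

static int is_id_start(unsigned long cp) {
    return in_ranges(id_start_excerpt,
                     sizeof id_start_excerpt / sizeof id_start_excerpt[0],
                     cp);
}
```

A table like this costs a few bytes per range, which is one plausible source of extra binary size, though nobody in the thread has measured where the growth actually comes from.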


Re: Allow UTF8 value names in lua?

bil til
I see, so it is quite a bit more now ... but nice to know, thank you... .
(This GitHub page changed quite a bit in the last 2 days; 2 days ago it
looked VERY different...).




Re: Allow UTF8 value names in lua?

Egor Skriptunoff-2
In reply to this post by bil til
On Fri, Oct 25, 2019 at 7:55 AM bil til wrote:
> Do you possibly consider to allow value/variable names encoded UTF8 in lua?
>
> This for sure would be VERY nice for very many users in Asia...


You can try to use LuaJIT (formally speaking, it's a fork of Lua 5.1).
It does allow using arbitrary Unicode symbols in a variable name.
For example, you can make a valid (but invisible!) variable name consisting of a non-breaking space (U+00A0).
Or you can use emojis (inspiring picture: https://pbs.twimg.com/media/DiTBXXmU0AAzaMP.jpg).
But developers from other countries may have difficulty reading and modifying your code if it contains non-English letters.

Re: Allow UTF8 value names in lua?

bil til
Thank you, this example really looks CRAZILY nice :) ... probably only for
emoji insiders... but why not support emoji insiders too :).

... it's just that your demo code looks very "C-like" ... are you sure that
this is LuaJIT ... does LuaJIT not follow the syntax rules of Lua? (Does it
really have a real preprocessor with #define... ???).... (but this is only a
side question...).




Re: Allow UTF8 value names in lua?

Egor Skriptunoff-2
On Mon, Oct 28, 2019 at 10:45 AM bil til wrote:
> ... just your demo code looks very "c-like" ... are you sure that this is
> luaJIT ...

Of course, this is NOT Lua code.
UTF-8 identifiers are not a Lua-specific feature; almost any programming language could be "emojificated".
The picture is just an illustration (taken from another language) of what we would have to be ready for :-)

Re: Allow UTF8 value names in lua?

Philippe Verdy
In reply to this post by bil til
It's not just emojis. Consider ANY language that does not use Latin script at all, or uses it rarely except for brands (there are many such languages): then you need UTF-8.
Russian, Bulgarian, Ukrainian, Arabic, Persian, Urdu, Hindi, Bengali, Tamil, Chinese, Japanese, Korean, Amharic...
Much more than half the world population rarely uses Latin or has strong difficulties with it (so much so that even when brands are displayed in Latin, advertisers need to transcribe the brand names). Many people don't even have a Latin name registered for them until an authority decides to give them a romanized name (they can't always choose it themselves, even if they can read Latin) to print on their passport if they travel abroad.
Of course programmers (in Lua or other languages) should know Latin, but for their daily work they'd like to use their own language for their own projects. It's quite common in China to have project names, variables, and APIs written in ideographs (and if they ever try to name them in English, the English produced is very poor and full of typos and misnomers; except for Chinese living in Hong Kong, just ask those in Beijing and Northern China, or even those living in Taiwan, who don't all speak Mandarin easily but other Austronesian languages; go to Africa: even in those countries where English is an official or old colonial language, it is not always chosen; go to Egypt: they predominantly speak and write Arabic).
Arabic (or Hebrew) in programming languages can however be challenging due to Bidi reordering, which could mess up a program (e.g. with Lua's curried syntax, or because it would reorder a simple division or subtraction...). There are specific issues described in Unicode for Bidi scripts in programming languages. But besides this there should be no issue at all for Chinese, Russian, Hindi, Japanese, or Korean...



Re: Allow UTF8 value names in lua?

Gabriel Bertilson
In reply to this post by Egor Skriptunoff-2
Just to clarify, my UTF-8 identifiers patch does not allow emojis as
identifiers, only word-like sequences. For instance, in the REPL:

> 💩 = 10
stdin:1: disallowed code point in identifier near '💩'
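
As an illustration of the step that has to happen before any "is this code point allowed?" check, here is a minimal UTF-8 decoder sketch in C. The function name utf8_decode and its interface are assumptions made for this example, not the patch's actual code. It also rejects the lead bytes that can never occur in valid UTF-8 (0xC0, 0xC1, 0xF5 to 0xFF).

```c
#include <assert.h>
#include <stddef.h>

/* Decode one UTF-8 sequence starting at s (NUL-terminated input).
   On success store the code point in *cp and return the sequence length
   in bytes (1..4); return 0 if the sequence is malformed. */
static int utf8_decode(const unsigned char *s, unsigned long *cp) {
    int len, i;
    unsigned long c, min;
    assert(s != NULL && cp != NULL);
    if (s[0] < 0x80) {                       /* plain ASCII */
        *cp = s[0];
        return 1;
    } else if (s[0] >= 0xC2 && s[0] <= 0xDF) {
        len = 2; c = s[0] & 0x1F; min = 0x80;
    } else if (s[0] >= 0xE0 && s[0] <= 0xEF) {
        len = 3; c = s[0] & 0x0F; min = 0x800;
    } else if (s[0] >= 0xF0 && s[0] <= 0xF4) {
        len = 4; c = s[0] & 0x07; min = 0x10000;
    } else {
        /* 0x80..0xBF is a stray continuation byte; 0xC0, 0xC1 and
           0xF5..0xFF are never used in UTF-8. */
        return 0;
    }
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)           /* also stops at a NUL */
            return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates and values above U+10FFFF. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return len;
}
```

Fed the four bytes F0 9F 92 A9, this yields U+1F4A9, which an identifier check based on word-like code points can then refuse, as in the REPL session above.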

— Gabriel




Re: Allow UTF8 value names in lua?

bil til
Thank you for the clarification, for me this is fine :).

I think the main target really is the Chinese. "Standard Chinese" users
really need their writing very much, as the languages in China are extremely
different when spoken, but the writing unites them - it has been like this
in China for thousands of years... . Furthermore, from my travels in China I
have the impression that basic Chinese writing "for daily use" is extremely
efficient and typically takes much less space than western phonetic writing.

Arabic and Hebrew are really very crazy ... their users would surely also
like this very much, but it is not only that the writing direction is from
right to left: numbers are usually written "inverse" (in western style,
from left to right). This really makes it nearly impossible for those
languages to find a "perfect solution" ... but I can imagine that they would
anyway like very much to see variable names in their own writing... .





[OT] Re: Allow UTF8 value names in lua?

Chris Smith

> On 29 Oct 2019, at 05:45, bil til <[hidden email]> wrote:
> I think really main target are the Chinese. The "standard Chinese" really
> need their writing very much, as the languages in China are extremely
> different in speaking, but the language unites them - this has been in China
> like this for many 1000 years... . Further from my China travels I have the
> impression that the basic Chinese writing "for daily use" is extremely
> efficient, takes typically much less space than western phonetic writing.

It is very spatially efficient for relatively long words because it is very graphically complex, but quite inefficient for short words due to the ambiguity of each character: a character normally represents a concept and you typically need at least two characters to form a defined word. In terms of human input it is less efficient than romanised languages, since the only practical way of entering characters is to type the romanised form and manually select the correct characters from a list.

Its graphical complexity alone makes it inappropriate for this use, in my view. I’ve spent hours debugging COBOL that wasn’t working because a full stop was in the wrong place; I can’t imagine the pain of debugging Lua code that isn’t working because the character for a variable in one place has the wrong radical, making it an entirely different variable!

Chris


Re: [OT] Re: Allow UTF8 value names in lua?

bil til
I just know that when I watch English videos on an airplane with Chinese
subtitles, the subtitles are usually extremely short ... (compared
to the spoken English text...).

Also for numeric table column headers, Chinese writing is terrific; it all
looks very compact and nice. In western phonetic writing, you typically have
the problem that table column headers tend to be much too wide... .

But Chinese people have told me that translating western literature into
Chinese often gets very complicated, needing many more words to explain
special western words to the Chinese reader. So e.g. Tolkien's Lord of the
Rings books in Chinese take MUCH more paper... . In phonetic writing it is
very easy to "invent" a new word; practically anybody can do this.
"Inventing" a new word in Chinese is practically possible only in spoken
text; in written text this becomes very complicated, as you then somehow
need to combine many Chinese signs... .

... but looking for errors in a text that uses unknown letters must really
be a nightmare ... this is then more like "encrypted coding"... . Nice
training for spotting small side effects... .





Re: [OT] Re: Allow UTF8 value names in lua?

Viacheslav Usov
On Tue, Oct 29, 2019 at 9:10 AM bil til <[hidden email]> wrote:

> Also for numeric table column headers, Chinese writing is terrific, it all looks very compact and nice.

We have re-invented this by using icons in GUI.

> In western phonetic writing

Western writing is generally not phonetic, it is alphabetic. English is not phonetic at all.

Cheers,
V.

Re: Allow UTF8 value names in lua?

Viacheslav Usov
In reply to this post by bil til
On Fri, Oct 25, 2019 at 6:55 AM bil til <[hidden email]> wrote:

> Do you possibly consider to allow value/variable names encoded UTF8 in lua?

Please do not.

Some UTF-8 symbols look very much like the others (and I am not even talking about the fact that most Chinese (for example) characters would look all mostly the same to a person who does not know Chinese). Some characters could be invisible, or visible by only modifying adjacent characters, resulting in more craziness.  Different UTF-8 strings may look exactly the same (or not) in some editors.

Especially in Lua, with its global by default rule, such things could be very confusing, painful and difficult to debug.
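
A small sketch of the look-alike problem, written in C so it can be compiled on its own. The globals table here is a hypothetical stand-in, not Lua's implementation: it only models the fact that lookup is by exact bytes and that a missing global yields nothing rather than an error.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "total" spelled with Latin letters only. */
static const char name_latin[] = "total";

/* Looks the same in most fonts, but the 'a' is U+0430 CYRILLIC SMALL
   LETTER A, whose UTF-8 bytes are D0 B0. */
static const char name_mixed[] = "tot\xD0\xB0l";

typedef struct {
    const char *name;
    int value;
} Global;

/* Hypothetical globals table holding a single variable. */
static const Global demo_globals[] = { { name_latin, 42 } };

/* Look a name up by exact byte comparison; NULL plays the role of the
   nil that Lua returns when an undefined global is read. */
static const Global *find_global(const Global *globals, size_t n,
                                 const char *name) {
    size_t i;
    assert(globals != NULL && name != NULL);
    for (i = 0; i < n; i++)
        if (strcmp(globals[i].name, name) == 0)
            return &globals[i];
    return NULL;
}
```

Looking up name_latin finds the value 42, while looking up name_mixed finds nothing, even though both names look the same on screen. With globals by default, such a mistake would surface only later, as an unexpected nil.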
 
I have struggled with this enough in other languages, and I can confidently say it would be a disaster in Lua, too.

Cheers,
V.

Re: [OT] Re: Allow UTF8 value names in lua?

bil til
In reply to this post by Viacheslav Usov
Sorry, I am not a linguist :) ...

Probably I do not even know what phonetic means, I have to admit...

But isn't the basis of any writing system called an alphabet? (Even the
Chinese one is called an alphabet in English, although it contains neither
alpha nor beta nor anything remotely similar...).

... and calling an alphabet "alphabetic" is an applied tautology ... so I
would prefer to call the western alphabet phonetic and the Chinese one more
iconized ... or something similar...


