UTF-8 identifiers patch


Gabriel Bertilson
This is inspired by the recent thread [1] (as well as several other
threads), but I am a new poster to the Lua list and wasn't sure how to
reply to it.

For fun, I created a modified version of Lua 5.3.5, available on
GitHub [2], that allows UTF-8 identifiers if you define
ALLOW_UTF8_IDENTIFIERS. It modifies lctype.c, lctype.h, and llex.c and
adds an extra header. It starts from the modification of treating
bytes 0x80-0xFF as alphabetic, but then validates the encoding and
checks that the code points have the property XID_Start or
XID_Continue (see DerivedCoreProperties.txt in the Unicode Database
[3]). This seems to be the identifier syntax that Rust uses. [4] Thus
invalid UTF-8, whitespace, most punctuation, surrogates, and
noncharacters aren't allowed.
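
To illustrate the idea (this is not the code from the patch, just a
rough sketch of the same two steps), the following C fragment decodes
one UTF-8 sequence at a time, rejecting overlong forms, surrogates and
out-of-range values, and then looks the code point up in sorted range
tables. The names utf8_decode, in_ranges and is_identifier are made up
for this example, and the tables are tiny hand-picked excerpts rather
than the real XID_Start and XID_Continue data, which would have to be
generated from DerivedCoreProperties.txt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t lo, hi; } cp_range;

/* ILLUSTRATIVE EXCERPTS ONLY: a few ranges known to be XID_Start
   (ASCII letters, some Latin-1 letters, part of the Greek block).
   The full table is much longer. Ranges must stay sorted. */
static const cp_range xid_start_excerpt[] = {
    {0x0041, 0x005A}, {0x0061, 0x007A}, {0x00C0, 0x00D6},
    {0x00D8, 0x00F6}, {0x03A3, 0x03F5}
};

/* Excerpt of code points that are XID_Continue but not XID_Start:
   ASCII digits and the combining diacritical marks block. */
static const cp_range xid_continue_extra_excerpt[] = {
    {0x0030, 0x0039}, {0x0300, 0x036F}
};

#define COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* Decode one UTF-8 sequence from s (at most n bytes available).
   Returns the number of bytes consumed and stores the code point in
   *cp, or returns 0 if the sequence is invalid. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp) {
    /* smallest code point that legitimately needs `len` bytes */
    static const uint32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
    size_t len, i;
    uint32_t c;
    assert(s != NULL && cp != NULL);
    if (n == 0) return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
    else return 0;                   /* stray continuation or 0xF8-0xFF */
    if (n < len) return 0;           /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;  /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min_cp[len]) return 0;             /* overlong encoding */
    if (c >= 0xD800 && c <= 0xDFFF) return 0;  /* surrogate */
    if (c > 0x10FFFF) return 0;                /* beyond Unicode */
    *cp = c;
    return len;
}

/* Binary search for cp in a sorted table of inclusive ranges. */
static int in_ranges(uint32_t cp, const cp_range *t, size_t count) {
    size_t lo = 0, hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < t[mid].lo) hi = mid;
        else if (cp > t[mid].hi) lo = mid + 1;
        else return 1;
    }
    return 0;
}

/* Returns 1 if the n bytes at str form an identifier under the
   excerpt tables above, 0 otherwise. The underscore is accepted
   explicitly in any position, as in stock Lua. */
static int is_identifier(const char *str, size_t n) {
    const unsigned char *s = (const unsigned char *)str;
    size_t pos = 0;
    assert(str != NULL);
    if (n == 0) return 0;
    while (pos < n) {
        uint32_t cp;
        size_t len = utf8_decode(s + pos, n - pos, &cp);
        int start_ok, ok;
        if (len == 0) return 0;      /* invalid UTF-8 */
        start_ok = cp == '_' ||
                   in_ranges(cp, xid_start_excerpt, COUNT(xid_start_excerpt));
        ok = start_ok ||
             (pos > 0 && in_ranges(cp, xid_continue_extra_excerpt,
                                   COUNT(xid_continue_extra_excerpt)));
        if (!ok) return 0;
        pos += len;
    }
    return 1;
}
```

With these excerpt tables, is_identifier("\xCF\x80", 2) (π, U+03C0)
is accepted, while is_identifier("\xC2\xA0", 2) (no-break space,
U+00A0) and the overlong sequence "\xC0\x80" are rejected.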

So stuff like the following will compile:

π = math.pi
φ = (1 + math.sqrt(5)) / 2

but this will generate an error message:

load "function \u{a0} () end" -- no-break space as identifier

For those who like obfuscation, you can still do fun stuff like
spelling "function" with an omicron (ο) [5] or using identifiers that
are identical except that one has a combining diacritic and the other
a precomposed character [6]. Preventing those sorts of things would be
possible, but only with data on scripts [7] and with Unicode
normalization, which would add more weight.
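
As a small illustration of why these cases slip through: without
normalization, names are only ever compared as byte strings, and the
look-alike spellings are different byte sequences. The helper
same_name below is hypothetical and just wraps strcmp; it is not part
of the patch.

```c
#include <assert.h>
#include <string.h>

/* Byte-wise comparison, with no Unicode normalization and no
   confusable detection. Returns 1 if the two names are the same
   byte string, 0 otherwise. */
static int same_name(const char *a, const char *b) {
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}

/* "é" as one precomposed code point, U+00E9 */
static const char e_acute_composed[] = "\xC3\xA9";
/* "e" followed by U+0301 COMBINING ACUTE ACCENT */
static const char e_acute_decomposed[] = "e\xCC\x81";
/* "function" with the second-to-last letter replaced by Greek
   omicron, U+03BF */
static const char function_with_omicron[] = "functi\xCE\xBFn";
```

Here same_name(e_acute_composed, e_acute_decomposed) and
same_name("function", function_with_omicron) both return 0, even
though each pair can look identical on screen.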

This change doesn't seem to bloat the binaries too much: it adds about
11K to lua53.dll and 54K to luac.exe (not sure what's going on with
the latter). I've been compiling with the MinGW64 version of GCC in
MSYS2. [8]

-rwxr-xr-x 1  48128 Oct 24 21:21 ./lua-5.3.5-utf8-identifiers/bin/lua.exe
-rwxr-xr-x 1 256222 Oct 24 21:21 ./lua-5.3.5-utf8-identifiers/bin/lua53.dll
-rwxr-xr-x 1 858765 Oct 23 02:48 ./lua-5.3.5-utf8-identifiers/bin/luac.exe
-rwxr-xr-x 1  48128 Oct 25 15:17 ./lua-5.3.5/bin/lua.exe
-rwxr-xr-x 1 244958 Oct 25 15:17 ./lua-5.3.5/bin/lua53.dll
-rwxr-xr-x 1 802926 Oct 25 15:17 ./lua-5.3.5/bin/luac.exe

There's room for improvement. For instance, maybe the error messages
(currently, <msg> near '<potential identifier>') should somehow
include the byte or code point position within the identifier. Also,
I'm pretty new to C, so my code might not be great.

I don't know if this should be added to standard Lua, but maybe
someone will find it useful or interesting.

[1] http://lua-users.org/lists/lua-l/2018-10/msg00065.html
[2] https://github.com/Erutuon/lua-utf8-identifiers
[3] https://www.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt
[4] https://doc.rust-lang.org/1.7.0/reference.html#identifiers
[5] http://lua-users.org/lists/lua-l/2017-04/msg00417.html
[6] http://lua-users.org/lists/lua-l/2017-05/msg00002.html
[7] https://en.wikipedia.org/wiki/Module:Unicode_data/scripts
[8] https://www.msys2.org/