Could Lua itself become UTF8-aware?

62 messages

Re: Could Lua itself become UTF8-aware?

Jay Carlson

> On May 1, 2017, at 9:05 AM, Roberto Ierusalimschy <[hidden email]> wrote:
>
>> I'd like to have Lua better support checking whether something is RFC-legal UTF-8.
>
> What is wrong with 'utf8.len'?

Nothing I see now. Considering that you fixed it three years ago[1], I am embarrassed; I had written my own and kept it.

The assert-heavy style for is_utf8 is O(n*m). If you know the rules for UTF-8 manipulation in Lua, the number of callpoints, m, can stay small.

I have been saved several times by failed is_utf8 assertions, usually in strings not from my code. OK, my definition of "saved" includes "not producing results outside the domain, possibly causing the next program to crash. maybe." This level of obsession is probably not for everyone.

Jay

[1]: https://github.com/lua/lua/commit/3a044de5a1df82ed5d76f2c5afdf79677c92800f
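The assert-heavy check discussed here can be sketched in a few lines. This is an editorial illustration, not code from the thread: `is_utf8` is a hypothetical name, and the sketch relies on Lua 5.3's `utf8.len` returning nil plus an error position for malformed input.

```lua
-- Hypothetical validity check built on Lua 5.3's utf8.len, which returns
-- nil plus the position of the first invalid byte for malformed UTF-8.
local function is_utf8(s)
  return utf8.len(s) ~= nil
end

assert(is_utf8("héllo"))      -- well-formed UTF-8 passes
assert(not is_utf8("\xC3"))   -- truncated lead byte fails
```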

Re: Could Lua itself become UTF8-aware?

Martin
In reply to this post by Sean Conner
On 05/01/2017 12:40 PM, Sean Conner wrote:

> It was thus said that the Great Dirk Laurie once stated:
>> 2017-05-01 23:25 GMT+02:00 Martin <[hidden email]>:
>>
>>> Problem is how to type UTF glyphs from keyboard? And this will be
>>> another language anyway.
>>
>> The system I use has a fairly easy way to configure a second
>> language on the third and fourth level. I use it for APL. Apart
>> from that, there are plenty of common characters on the compose
>> key, e.g. all accented letters, ©, ß, §, →, ⇒, —, ….
>
>   Compose key?  (Looks down at his US keyboard)
>
>   -spc (And you can pry my 25 year old IBM Model M keyboard from my cold
> dead hands ... )

Well, I'm not against more glyphs, or a programming language whose syntax
is bound to them (like APL). They definitely improve the clarity of code.
I'm just waiting for consumer-grade (around $10) programmable keyboards
with a little display on each key showing its glyph.

But today, as 70 years ago, we are still typing text that is later fed
to some program (a compiler or interpreter).

-- Martin


Re: Could Lua itself become UTF8-aware?

Rain Gloom
Since Lua can be embedded in very customised applications, special keyboard functionality should not be expected. I want to be able to write a 10 line primitive REPL loop and use it with a game's soft-keyboard if I have to. If Lua sticks to portable C, it should also stick to portable environments for writing code.



Re: Could Lua itself become UTF8-aware?

Soni "They/Them" L.
In reply to this post by Jay Carlson


On 2017-05-01 05:34 PM, Jay Carlson wrote:
>> On May 1, 2017, at 9:05 AM, Roberto Ierusalimschy <[hidden email]> wrote:
>>
>>> I'd like to have Lua better support checking whether something is RFC-legal UTF-8.
>> What is wrong with 'utf8.len'?
> Nothing I see now. Considering that you fixed it three years ago[1], I am embarrassed; I had written my own and kept it.
>
> The assert-heavy style for is_utf8 is O(n*m). If you know the rules for UTF-8 manipulation in Lua, the number of callpoints, m, can stay small.

Perhaps you might wanna memoize (with "__mode") such calls? (If the
string is unchanged between calls, it's an O(1) instead of an O(n) call,
so you get O(n+m) instead. If it does get changed, interning still
ends up being more expensive.)
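A minimal sketch of this memoization (the names are invented for illustration), with one caveat worth noting: the Lua manual says strings behave like scalar values in weak tables, so string-keyed entries are not actually collected early.

```lua
-- Cache UTF-8 validity per string, as suggested above. __mode = "k" makes
-- keys weak; since Lua treats strings as non-collectable in weak tables,
-- this mostly bounds recomputation rather than memory.
local cache = setmetatable({}, { __mode = "k" })

local function is_utf8(s)
  local v = cache[s]
  if v == nil then
    v = utf8.len(s) ~= nil   -- O(n) only on the first call per string
    cache[s] = v
  end
  return v
end
```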

>
> I have been saved several times by failed is_utf8 assertions, usually in strings not from my code. OK, my definition of "saved" includes "not producing results outside the domain, possibly causing the next program to crash. maybe." This level of obsession is probably not for everyone.
>
> Jay
>
> [1]: https://github.com/lua/lua/commit/3a044de5a1df82ed5d76f2c5afdf79677c92800f

--
Disclaimer: these emails may be made public at any given time, with or without reason. If you don't agree with this, DO NOT REPLY.



Re: Could Lua itself become UTF8-aware?

Dirk Laurie-2
In reply to this post by Sean Conner
2017-05-01 21:40 GMT+02:00 Sean Conner <[hidden email]>:

> It was thus said that the Great Dirk Laurie once stated:
>> 2017-05-01 23:25 GMT+02:00 Martin <[hidden email]>:
>>
>> > Problem is how to type UTF glyphs from keyboard? And this will be
>> > another language anyway.
>>
>> The system I use has a fairly easy way to configure a second
>> language on the third and fourth level. I use it for APL. Apart
>> from that, there are plenty of common characters on the compose
>> key, e.g. all accented letters, ©, ß, §, →, ⇒, —, ….
>
>   Compose key?  (Looks down at his US keyboard)

I can choose which it is, and chose AltGr. Third-fourth level
is on (shifted) Alt.


Re: Could Lua itself become UTF8-aware?

Jonathan Goble
On Mon, May 1, 2017 at 5:56 PM Dirk Laurie <[hidden email]> wrote:
2017-05-01 21:40 GMT+02:00 Sean Conner <[hidden email]>:
> It was thus said that the Great Dirk Laurie once stated:
>> 2017-05-01 23:25 GMT+02:00 Martin <[hidden email]>:
>>
>> > Problem is how to type UTF glyphs from keyboard? And this will be
>> > another language anyway.
>>
>> The system I use has a fairly easy way to configure a second
>> language on the third and fourth level. I use it for APL. Apart
>> from that, there are plenty of common characters on the compose
>> key, e.g. all accented letters, ©, ß, §, →, ⇒, —, ….
>
>   Compose key?  (Looks down at his US keyboard)

I can choose which it is, and chose AltGr. Third-fourth level
is on (shifted) Alt.

AltGr? (Looks down at his US keyboard)

This is my keyboard: http://imgur.com/a/gmCAg I have two Alt keys and no AltGr key.

Re: Could Lua itself become UTF8-aware?

Luiz Henrique de Figueiredo
In reply to this post by Ahmed Charles
> print("á")
>
> When encoded as UTF-8.
>
> Looking at the manual, I can't tell whether this is intentional or not.
> The program, when executed, does seem to function as expected. Can you
> clarify if this program is intended to be valid Lua?

Calling print with any string is valid Lua. What you see depends on what
the string contains and, more to the point, on how the terminal you use
interprets the contents of the string. Lua is actually completely agnostic
about the contents of a string: strings are just sequences of bytes.

So, yes, a Lua program that uses UTF-8 text in Lua strings will work just
fine, provided you only do I/O with them. The standard string library
does not work on UTF-8 text. The standard utf8 library helps but does not
replace the standard string library.
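That division of labour can be shown in a few lines (an editorial illustration, assuming the source file itself is UTF-8-encoded):

```lua
local s = "á"            -- two bytes in UTF-8: 0xC3 0xA1
print(#s)                -- the byte-oriented string library sees 2
print(utf8.len(s))       -- the utf8 library counts 1 code point
print(string.upper(s))   -- typically unchanged: string.upper works
                         -- per byte and knows nothing of UTF-8
```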


Re: Could Lua itself become UTF8-aware?

Erik Hougaard
In reply to this post by Jonathan Goble
On 01-May-17 3:18 PM, Jonathan Goble wrote:
>
> AltGr? (Looks down at his US keyboard)
>
> This is my keyboard: http://imgur.com/a/gmCAg I have two Alt keys and
> no AltGr key.

Alt+Ctrl should give same result.

/Erik


Re: Could Lua itself become UTF8-aware?

Jonathan Goble
On Mon, May 1, 2017 at 6:30 PM Erik Hougaard <[hidden email]> wrote:
On 01-May-17 3:18 PM, Jonathan Goble wrote:
>
> AltGr? (Looks down at his US keyboard)
>
> This is my keyboard: http://imgur.com/a/gmCAg I have two Alt keys and
> no AltGr key.

Alt+Ctrl should give same result.

/Erik

Exactly how? It does nothing for me except trigger menu shortcuts in some programs. I'm confused. Is some sort of configuration required? I'm using Ubuntu 17.04.

Re: Could Lua itself become UTF8-aware?

Gé Weijers
In reply to this post by Ahmed Charles
One of the problems with Go's approach (an upper-case first letter at the top level makes an identifier visible outside the module) is that the upper/lower-case distinction is not a universal feature of scripts: you can write Go in Coptic, but not in Chinese.

This is one problem Unicode-aware Lua would not have...




On Mon, May 1, 2017 at 12:08 PM, Ahmed Charles <[hidden email]> wrote:
On 5/1/2017 6:00 AM, Roberto Ierusalimschy wrote:

> I want to be very clear that I am not against Unicode, quite the
> opposite. However, "accept everything above 127 as valid in identifiers"
> is quite different from supporting Unicode (again, it is quite the
> opposite). For data, the Lua approach is that it gives a very basic
> support and leave the rest for specialized libraries. Several programs
> have been using Lua with Unicode quite successful (e.g., Lightroom and
> LuaTeX).
>
> However, the Lua lexer should not depend on libraries. It would be great
> if the lexer could handle Unicode (correctly!), but I don't know how to
> do it without at least doubling the size of Lua. To do it (very) broken,
> I prefer to keep it as it is.
>
> -- Roberto

Lua 5.3.4 seems to accept the following code:

-- print á
print("á")

When encoded as UTF-8.

Looking at the manual, I can't tell whether this is intentional or not.
The program, when executed, does seem to function as expected. Can you
clarify if this program is intended to be valid Lua?

If this code is intended to be valid Lua, then I'd think it has enough
Unicode support already. The only remaining support I could think of
would be for identifiers, but that would require implementing UAX31 [1]
or similar, which would require at least some Unicode mapping tables.
Lua currently doesn't include these tables and lets this functionality
be up to external libraries. So, I fail to see how including these for
the significantly less useful feature of Unicode identifiers would be a
worthwhile trade-off.

Anecdotally:

Go [2] seems to do something simpler than what UAX31 suggests, which
seems to be allowing any Unicode letter followed by any Unicode letter
or Unicode digit, while considering underscore to be a Unicode letter.

Rust [3] seems to still be on the fence, though since minimalism isn't a
primary goal of the language and providing a larger set of Unicode
functionality would be beneficial, I expect a solution allowing
non-ascii identifiers to be arrived at eventually.

[1]: http://unicode.org/reports/tr31/
[2]: https://golang.org/ref/spec
[3]: https://github.com/rust-lang/rust/issues/28979





Re: Could Lua itself become UTF8-aware?

Gé Weijers
In reply to this post by Soni "They/Them" L.
On Mon, May 1, 2017 at 12:12 PM, Soni L. <[hidden email]> wrote:

Beautiful! See, this is why we need 0x80-0xFF to be valid in identifiers. (Or emoji, but we're trying to keep things simple and adding full UTF-8 handling to Lua isn't simple.)


There are plenty of space characters above 0x7F, including the common em-space. Just making everything above 0x7f a letter would be utterly confusing.

function <invisible space> ()
 print("hello, world")
end

<invisible space> ()

We could start an "obfuscated Lua" contest...

 



Re: Could Lua itself become UTF8-aware?

Ahmed Charles
In reply to this post by Luiz Henrique de Figueiredo
On 5/1/2017 3:27 PM, Luiz Henrique de Figueiredo wrote:

>> print("á")
>>
>> When encoded as UTF-8.
>>
>> Looking at the manual, I can't tell whether this is intentional or not.
>> The program, when executed, does seem to function as expected. Can you
>> clarify if this program is intended to be valid Lua?
>
> Calling print with any string is valid Lua. What you see depends on what
> the string contains and, more to the point, on how the terminal you use
> interprets the contents of the string. Lua is actually completely agnostic
> about the contents of a string: strings are just sequence of bytes.
>
> So, yes, A Lua program that uses UTF-8 text in Lua strings will work just
> fine, provided you only do I/O with them. The standard string library
> does not work on UTF-8 text. The standard utf8 library helps but does not
> replace the standard string library.

I guess my question wasn't clear enough. I didn't mean to focus on the
print call. I meant to ask whether UTF-8 in comments is intended and
whether UTF-8 in string literals is intended?


Re: Could Lua itself become UTF8-aware?

Dirk Laurie-2
In reply to this post by Jonathan Goble
2017-05-02 0:38 GMT+02:00 Jonathan Goble <[hidden email]>:

> On Mon, May 1, 2017 at 6:30 PM Erik Hougaard <[hidden email]> wrote:
>>
>> On 01-May-17 3:18 PM, Jonathan Goble wrote:
>> >
>> > AltGr? (Looks down at his US keyboard)
>> >
>> > This is my keyboard: http://imgur.com/a/gmCAg I have two Alt keys and
>> > no AltGr key.
>>
>> Alt+Ctrl should give same result.
>>
>> /Erik
>
>
> Exactly how? It does nothing for me except trigger menu shortcuts in some
> programs. I'm confused. Is some sort of configuration required? I'm using
> Ubuntu 17.04.

Excellent. We're on the same wavelength.

<apology> We're off-topic, but it's sort of essential if you actually need
the power of UTF-8 regularly [1], and 99% of non-foreign computer users
don't know this stuff. I only know it because my home language needs
all of áäêëéèïîòóöôüûú. </apology>

Every physical key has a keycap which you can see but your computer
can't. All that the computer knows is what key has been pressed or
released. Your two Alt keys may have identical keycaps but your
computer knows they are different because they send different keycodes.

Each keycode (physical key) is associated with a keysym (what the system
thinks is on the keycap) and a function name (what you would like the key
to do). [2] It's a little like the White Knight's song in "Alice in Wonderland",
making nice distinctions in several steps from "the name of the song is
called" right down to "the song is".

When you installed the system, you were asked what keyboard you
have. You may even have been asked to press the key marked 'Y'.
That determined the keycode-keysym association. But you can easily
change that even now [3].

There probably [4] is a tiny white box near the date on your screen
that says "En" for "English". Click on that and experiment. Look for
"Input". You should be able to add several other keyboards. That
box will then allow you to switch between them right in the middle
of typing something.

Similar things happen on other systems too.

[1] I.e. you don't want to open a character map pop-up, remember
which symbol is Alt-137, or enter the symbol in hex, all the time.

[2] To know what is currently in effect, from a terminal start a nifty
little utility called "xev". A white box will appear. Now move your
mouse in there. The terminal will go mad with so many mouse
movements to report, so let go of the mouse while you press a few
keys. There will be KeyPress and KeyRelease events, and somewhere
in the middle of each, it reports how the computer currently treats the
key. On mine, for Alt and AltGr it says respectively:
 keycode 64 (keysym 0xfe03, ISO_Level3_Shift)
 keycode 108 (keysym 0xff20, Multi_key)

[3] You can also do it the hard way with a tricky utility called 'xmodmap'.

[4] I'm on 16.04 (I hop from LTS to LTS) so I can only presume
things have not changed too much.


Re: Could Lua itself become UTF8-aware?

Paige DePol
In reply to this post by Jay Carlson
Jay Carlson <[hidden email]> wrote:

>> On Apr 30, 2017, at 10:44 AM, Soni L. <[hidden email]> wrote:
>>
>> On 2017-04-30 07:19 AM, Shmuel Zeigerman wrote:
>>> On 29/04/2017 16:41, Dirk Laurie wrote:
>>>> The next step would be a compiler option under which the lexer
>>>> accepts a UTF-8 first character followed by the correct number
>>>> of UTF-8 continuation characters as being alphabetic for the
>>>> purpose of being an identifier or part of one.
>>>
>>> BTW, LuaJIT 2 has it for years already (it allows UTF-8 in identifiers). But it seems nobody needs it.
>>
>> Well, I'd argue mostly nobody needs it because it doesn't allow emoji in identifiers.
>
> Does it prohibit all astral characters, or are emoji singled out?
>
> Either way, that's too bad. You can write this Ruby:
>
> def 🚫(📖)
> 🐦 = 💻(📖)
> 📝🚫🐊 = Array.new
>
> begin
> 🚫📥 = 🐦.blocked_ids
> File.open(File.join(LOGDIR, 'block_ids.current'), 'w+') do |💾|
> 🚫📥.sort.each do |😭🐊|
> 💾.puts(😭🐊)
> 📝🚫🐊 << 😭🐊.to_s
>
> https://github.com/oapi/emojiautoblocker/blob/master/emojiautoblocker

I have been following the conversation about making Lua be UTF-8 aware for
source code identifiers, it has been quite the lively discussion!

Roberto's comment about it making Lua twice as large seemed crazy...
that is until I spent some time looking into UTF-8 libraries and all the
intricacies of implementing a UTF-8 parser. Wow!

I thought it would be interesting to see what could be done with UTF-8 for
source code... but if the result is like the above (source code with emoji
identifiers for those who can't see it) I think I'd rather code in Italian
BASIC, or even that one version of Pascal.

In all my years of coding I would say 99% of the time everything is in
English... I really hope emoji identifiers do not become standard, that
would be terrible to code in, Soni's obsession with them notwithstanding! ;)

~Paige




Re: Could Lua itself become UTF8-aware?

Dirk Laurie-2
In reply to this post by Ahmed Charles
2017-05-02 7:32 GMT+02:00 Ahmed Charles <[hidden email]>:

> On 5/1/2017 3:27 PM, Luiz Henrique de Figueiredo wrote:
>>> print("á")
>>>
>>> When encoded as UTF-8.
>>>
>>> Looking at the manual, I can't tell whether this is intentional or not.
>>> The program, when executed, does seem to function as expected. Can you
>>> clarify if this program is intended to be valid Lua?
>>
>> Calling print with any string is valid Lua. What you see depends on what
>> the string contains and, more to the point, on how the terminal you use
>> interprets the contents of the string. Lua is actually completely agnostic
>> about the contents of a string: strings are just sequence of bytes.
>>
>> So, yes, A Lua program that uses UTF-8 text in Lua strings will work just
>> fine, provided you only do I/O with them. The standard string library
>> does not work on UTF-8 text. The standard utf8 library helps but does not
>> replace the standard string library.
>
> I guess my question wasn't clear enough. I didn't mean to focus on the
> print call. I meant to ask whether UTF-8 in comments is intended and
> whether UTF-8 in string literals is intended?

The current situation is that comments and strings are byte sequences
and anything is legal inside them.

If it looks like UTF-8 on your screen, that is because the current setting
of your locale or your clever text editor or mail client makes it look so.
At least one member of this list, and I suspect several others, still uses
ISO-8859-1 in e-mails, and probably their comments and strings would be
ISO-8859-1 too. Note that any program that expects ISO-8859 is not fazed
by UTF-8, it just displays it wrong, whereas a program that expects UTF-8
and gets ISO-8859 is confused.
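This asymmetry is easy to demonstrate from Lua (an editorial sketch; byte escapes are used so the example is independent of the file's own encoding):

```lua
local latin1 = "\xE1"       -- 'á' in ISO-8859-1: one byte
local utf8_a = "\xC3\xA1"   -- 'á' in UTF-8: two bytes
print(utf8.len(utf8_a))     --> 1
print(utf8.len(latin1))     --> nil  1  (invalid UTF-8 at byte 1)
```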


Re: Could Lua itself become UTF8-aware?

Martin
On 05/02/2017 12:23 AM, Dirk Laurie wrote:
> Note that any program that expects ISO-8859 is not fazed
> by UTF-8, it just displays it wrong, whereas a program that expects UTF-8
> and gets ISO-8859 is confused.

(Thinking about returning to hexadecimal editor.)


Re: Could Lua itself become UTF8-aware?

Paul E. Merrell, J.D.
In reply to this post by Dirk Laurie-2
On Tue, May 2, 2017 at 12:23 AM, Dirk Laurie <[hidden email]> wrote:

> The current situation is that comments and strings are byte sequences
> and anything is legal inside them.
>
> If it looks like UTF-8 on your screen, that is because the current setting
> of your locale or your clever text editor or mail client makes it look so.
> At least one  member of this list and I suspect several others still uses
> ISO-8859-1 in e-mails and probably their comments and strings would be
> ISO-8859-1 too. Note that any program that expects ISO-8859 is not fazed
> by UTF-8, it just displays it wrong, whereas a program that expects UTF-8
> and gets ISO-8859 is confused.
>

But if what the user is seeing on the screen is little
character-size hollow rectangles, it's more likely that he's using a
font that doesn't support the particular human language's character
set. The lion's share of the TTF fonts out there have extremely
limited Unicode coverage.

Here are a few free fonts that look good and have pretty broad Unicode
coverage (both are available in most (all?) Linux package management
systems):

* Arial (one of the Microsoft core fonts); [1]

* Deja Vu font family (open source) [2]

Best regards,

Paul


[1] http://corefonts.sourceforge.net/

[2] <https://sourceforge.net/projects/dejavu/>. See unicover.txt in
the source files for the extent of human language coverage.


--
[Notice not included in the above original message:  The U.S. National
Security Agency neither confirms nor denies that it intercepted this
message.]


Re: Could Lua itself become UTF8-aware?

Roberto Ierusalimschy
In reply to this post by Sean Conner
> [...]  And the |function|
> with the omicron for the "o" only shows up as "function" (no marks) for me.

That was my point: "omicron is indistinguishable from the Latin
letter O". [1]

[1] https://en.wikipedia.org/wiki/Omicron

-- Roberto


Re: Could Lua itself become UTF8-aware?

Axel Kittenberger
In reply to this post by Soni "They/Them" L.
Also, y'know...

emoji.

(I'd argue there's a HUGE upside to allowing emoji. I think everything should just allow emoji. Actually why don't we get rid of ASCII while at it, make things emoji-only?)

It would also be a huge update for Smileua: http://lua-users.org/lists/lua-l/2011-08/msg00781.html

Re: Could Lua itself become UTF8-aware?

nobody
In reply to this post by Andrew Starks-2
On 2017-05-01 21:41, Andrew Starks wrote:
> There is one single language understood by every culture that is
> doing scientific work: math.

Using math symbols in code requires Unicode.  (Sure, you can write their
names, but then you're back at the problem of different languages.)

> What problem does it solve? Is support for UTF-8 useful for
> automated script processing or some sort of DSL application?

Unicode is extremely useful in combination with custom mixfix[1]
operators/notations.  Without that, not so much.  Lua does not even
permit defining custom operators.  (Which is fine, it just makes
Unicode support much less useful.)

For DSL-purposes (and I'd count math as a DSL), Lua's syntax is already
extremely flexible.  Liberally sprinkling everything with `__call`, you
can write  `x 'op'` or `x 'op' (y)`, where 'op' can be any string
(including Unicode).  (The mandatory parentheses are pretty annoying but
not absolutely terrible.)  And if you want warts-free custom syntax,
there's also LPEG and/or ltokenp.
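That `x 'op' (y)` pattern can be sketched like this (all names here are invented for illustration):

```lua
-- Wrap values so that calling them with an operator-name string returns
-- a function awaiting the right-hand operand, giving  x "⊕" (y).
local ops = { ["⊕"] = function(a, b) return a + b end }

local mt
local function wrap(v)
  return setmetatable({ value = v }, mt)
end
mt = {
  __call = function(self, op)
    local f = assert(ops[op], "unknown operator: " .. op)
    return function(rhs) return wrap(f(self.value, rhs.value)) end
  end,
}

local x, y = wrap(2), wrap(3)
print((x "⊕" (y)).value)   --> 5
```

The mandatory parentheses around the right-hand operand are exactly the wart the paragraph above mentions.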

So while I know from experience that Unicode support plus mixfix
definitions can be absolutely awesome, Lua has neither custom operators
nor custom mixfix notations, and they're not compatible with what Lua is
/ how Lua works.[3][4]  So adding Unicode support would add _some_
flexibility/convenience, but not very much.  Given the complexity, it's
probably not worth it.

-- nobody


[1]: In Agda you can say

     if_then_else_ true  e₁ _ = e₁
     if_then_else_ false _ e₂ = e₂

to define the usual `if <cond> then <expr1> else <expr2>` notation
(which isn't built-in, because booleans aren't built-in[2]).  Just with
plain ASCII, this already allows defining more mathy notation; add
Unicode and you can literally type most of the usual notation (at least
of some sub-fields – logic, PLT, …) into your editor and have runnable code.

[2]: Well, technically they _are_ built-in, just not in the usual way.
You usually go

     data Bool : Set where
       false true : Bool
     {-# BUILTIN BOOL  Bool  #-}
     {-# BUILTIN TRUE  true  #-}
     {-# BUILTIN FALSE false #-}

i.e. you define them yourself, and for some basic types you (optionally)
also tell the compiler that what you have here is equivalent to some
built-in concept so it can treat them specially.  (This is particularly
important for natural numbers, where storing one bignum is vastly more
efficient than storing (succ (… (succ zero)…)) as a linked list of
(essentially dummy) nodes.)

[3]: How would you load custom operators defined in a different file?
(That's incompatible with the current "return chunk as a function" model
– the current file is parsed before `require` runs and pulls in other
code & definitions.)  And permitting any operators while parsing, only
complaining once they're found missing at run-time would be a terrible
choice.  (Many many typos would no longer be load-time detectable but
would have to be triggered at run-time.)

[4]: A weak form of "custom operators" might be feasible:  Haskell
permits putting arbitrary functions in back-ticks to use them as a
binary operator, e.g.

     mod x y == x `mod` y

and having syntactic sugar x `f` y --> f( x, y ) would not be much work.
Because of the clear boundary, you can weaken the identifier rules and
so you could have x `⊗` (y `⊗` z) or something like that.
But that's not much better than x '⊗' (y '⊗' (z)) at the syntax level
(although it would get rid of a lot of `__call`s.)
