What do you miss most in Lua (was: Why isn't Lua more widely used?)


What do you miss most in Lua (was: Why isn't Lua more widely used?)

sergei karhof
I have just corrected the tense in the subject line: what 'do' you
miss..., instead of what 'did' you miss... because grammatically it
makes more sense.


Re: What do you miss most in Lua (was: Why isn't Lua more widely used?)

sergei karhof
On Thu, Feb 2, 2012 at 4:33 PM, Petite Abeille <[hidden email]> wrote:

>> constructive suggestions on what you miss
>> most in Lua and would like to see implemented

> From a language/syntax perspective? And/or from an ecosystem perspective (e.g. libraries/bindings/distributions/etc)?

From ANY perspective.

Personally, what I miss most is a standardized set of libraries. I know
that I can still find most of what I need in terms of libraries, but I
have to search for it, because lots and lots of libraries are
scattered all over the Internet.
At the top of my Lua wishlist is a centralized, all-inclusive repository
of Lua libraries (at the very least, all the code could be found by
searching one website).
The second step would be choosing the most important projects,
standardizing the best libraries, and making them (semi-)official. By
this I mean distributing them together with the official Lua
distribution (downloadable as a separate archive, of course, but still
part of the distribution).
This would help in many ways.


Re: What do you miss most in Lua (was: Why isn't Lua more widely used?)

Jay Carlson

What do I miss in Lua? [1]

1. Unicode. Deprecate _G.string, make _G.bytes and _G.text. Maybe make them work the same in the bare ANSI build, maybe don't ship a non-char implementation at all, but let Lua code express intent. Lua in ASCII will remain identical.

2. Unicode. At the very least, filters on io streams to blow up on invalid UTF-8 going in or out, and a memoized assert. The world has moved past DECwriters; one of the highest-rated comments on the Reddit lua-wikipedia story questioned the sanity of using the non-i18n Lua in a project of global cultural scale. Not even having a fig leaf is a problem for adoption.

I don't care if there are N good add-ins. (I wrapped librecode.) I need to share a language with other programmers and we have to coordinate with each other on some aspects of how text is handled. (Imagine if everyone used their own flavor of tagged tables/ropes/etc for strings.) We have rough consensus for XML--if everything is ASCII. The language as spoken will balkanize.

I imagine it already has in non-byte locales. A Korean Lua program will not use libraries from a Greek one. Unless everybody is already using UTF-8--with no sanity checking.

ANSI C has wchar_t, so other people have noticed. A lot of madness can lie in this direction; Tcl rewrote stdio in the version ~8 period, and it bloated enormously.

OK. How about an easy one?

3. perl -a -p -i

See man perlrun. I can build that, but I never get around to it. I retain knowledge of perl 20 years later because I use it in pipes; I forget details of string.* because I don't use it as often.
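
For what it's worth, a crude Lua imitation of the -p -i part (no autosplit) might look like the sketch below; the script name, the global 'line', and the lack of a backup file are just illustrative assumptions.

-- plua.lua: a rough "perl -p -i" imitation (illustrative only).
-- Usage: lua plua.lua 'line = line:gsub("foo", "bar")' file1 file2 ...
local expr = assert(arg[1], "usage: lua plua.lua <statement> files...")
local body = assert(loadstring(expr))    -- load(expr) in Lua 5.2+
for i = 2, #arg do
  local name, out = arg[i], {}
  for l in io.lines(name) do
    line = l                             -- a global, like perl's $_
    body()                               -- user code may rewrite 'line'
    out[#out + 1] = line
  end
  local f = assert(io.open(name, "w"))   -- rewrite in place, no backup
  f:write(table.concat(out, "\n"), "\n")
  f:close()
end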

Back to coordination.

4. The smallest of libraries. I'm tired of shallow-copying tables. It was fun the first few times I wrote it. Finding where other people hid their Set{'a'} -> {a=true} in an app sucks, but the alternatives are cut&paste or shipping noplib.lua whenever I touch code. And what do other people do about "for k,v in pairs(t) do print(k,v) end"? It's difficult to do *any* interactive work without it.
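
For concreteness, the sort of ten-line noplib.lua in question might look roughly like this (nothing here is standard; the names are just the usual suspects):

-- noplib.lua: the tiny helpers everyone ends up rewriting.
local function shallow_copy(t)
  local c = {}
  for k, v in pairs(t) do c[k] = v end
  return c
end

local function Set(list)                 -- Set{'a','b'} -> {a=true, b=true}
  local s = {}
  for _, v in ipairs(list) do s[v] = true end
  return s
end

local function dump(t)                   -- the interactive "show me this table"
  for k, v in pairs(t) do print(k, v) end
end

return { shallow_copy = shallow_copy, Set = Set, dump = dump }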

Another reason I'd like this is so bozos stop overriding tostring etc for the same effect.

I suppose I could go survey the Warhammer or WoW codebases to see what everybody agrees on.

Comparing everybody's init.lua might be interesting too. Not a good use of a mailing list though.

Jay

[1]: I have TF2 on the brain. I hear "meet the demoman" this time.


Re: What do you miss most in Lua

Miles Bader-2
Jay Carlson <[hidden email]> writes:
> I imagine it already has in non-byte locales. A Korean Lua program will not
> use libraries from a Greek one. Unless everybody is already using
> UTF-8--with no sanity checking.

Of course everybody's just using UTF-8 with no sanity checking...

[_real_ "unicode support" is hideously expensive (the size of those
tables...), and it's really kind of hard to say how you could get
anything like it into Lua without completely ruining its "Luaness"...]

-miles

--
Infancy, n. The period of our lives when, according to Wordsworth, 'Heaven
lies about us.' The world begins lying about us pretty soon afterward.


Re: What do you miss most in Lua

Jay Carlson
On Feb 6, 2012 3:14 AM, "Miles Bader" <[hidden email]> wrote:
>
> Jay Carlson <[hidden email]> writes:
> > I imagine it already has in non-byte locales. A Korean Lua program will not
> > use libraries from a Greek one. Unless everybody is already using
> > UTF-8--with no sanity checking.
>
> Of course everybody's just using UTF-8 with no sanity checking...

Got some Koreans to back you up on that? EUC-KR (realistically, CP949)
still lives; see the <head> of http://chosun.com and donga.com.
naver.com is UTF-8 (not a total surprise). Perhaps all the Korean text
processing *in Lua* is being done in unchecked UTF-8, but I kinda
doubt it. I have no idea what the state of ISO 8859-7 vs UTF-8 is in
Greece (but as you're aware, UTF-8 doubles the size of Greek text.
Good thing there's so much ASCII in HTML.)

Note that Ruby 1.9 decided it wanted to keep SJIS more than it wanted
simplicity--unlike everybody else, internationalized strings are kept
around as a bag of octets tagged with their encoding. I get tired just
thinking about writing and maintaining that.

> [_real_ "unicode support" is hideously expensive (the size of those
> tables...),

Right. Many systems have already paid the price though. For example,
Linux machines with a UI have glib in core already.

No, it's not free to link it. Using bad methodology: on a Celeron
E1200 running x86_64 Ubuntu 10.04, strace shows that referencing
g_unichar_combining_class() adds 8ms of *startup* latency over
bare Lua's 4ms. But counting this way, Python 2.6 takes 68ms from
start of dynamic linking to first line of execution, so people who
need it don't have to think twice on that account.

> and it's really kind of hard to say how you could get
> anything like it into Lua without completely ruining its "Luaness"...]

One big lesson from Lua is that you don't need to standardize on
mechanism, you just need some rough consensus on minimum syntactic
interface. Everybody has their own object system but people seem
pretty happy if all they know is o:m() and perhaps Constructor{}
syntax.

I don't think the standard build would necessarily have to do anything
besides ISO-8859-x, but a text.isvalid(s) would still be useful if it
failed on the >= 127 !isprint() characters, and on the (iscntrl &&
!isspace) characters.

A replacement could look for well-formed UTF-8. That's easy to write
in standalone C, and at least keeps some kinds of bugs out. Notably,
it will blow up if you feed it binary data instead of text.
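
The check itself is small; purely as an illustration, here is roughly what it involves, written as a pure-Lua sketch rather than the standalone C suggested above (the function name is made up):

-- Returns true for well-formed UTF-8, or false plus the offending byte
-- position. Rejects bare continuation bytes, truncated and overlong
-- sequences, surrogates, and code points above U+10FFFF.
local function is_wellformed_utf8(s)
  local byte, i, n = string.byte, 1, #s
  while i <= n do
    local c = byte(s, i)
    if c < 0x80 then
      i = i + 1
    else
      local len, cp, min
      if c >= 0xC2 and c <= 0xDF then len, cp, min = 2, c - 0xC0, 0x80
      elseif c >= 0xE0 and c <= 0xEF then len, cp, min = 3, c - 0xE0, 0x800
      elseif c >= 0xF0 and c <= 0xF4 then len, cp, min = 4, c - 0xF0, 0x10000
      else return false, i end                        -- bad lead byte
      if i + len - 1 > n then return false, i end     -- truncated sequence
      for j = i + 1, i + len - 1 do
        local b = byte(s, j)
        if b < 0x80 or b > 0xBF then return false, j end
        cp = cp * 64 + (b - 0x80)
      end
      if cp < min then return false, i end            -- overlong encoding
      if cp >= 0xD800 and cp <= 0xDFFF then return false, i end  -- surrogate
      if cp > 0x10FFFF then return false, i end
      i = i + len
    end
  end
  return true
end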

If another implementation did pay for those Unicode tables, it could
also check for valid Unicode, or Your Favorite Normal Form, or
whatever. lua.org doesn't have to ship anything complicated and
shouldn't.

If you're wondering why I'm obsessed with the issue of coordination
through core or -llualib, it's because I can go build whatever
bindings I want on my own. I can metaprogram all kinds of elaborate
stuff. I can build private syntax. I don't need to discuss that with
lua-l. I've been in the bubble long enough that I *like* just about
everything....

Jay


Re: What do you miss most in Lua

Miles Bader-2
Jay Carlson <[hidden email]> writes:

>> > I imagine it already has in non-byte locales. A Korean Lua program will not
>> > use libraries from a Greek one. Unless everybody is already using
>> > UTF-8--with no sanity checking.
>>
>> Of course everybody's just using UTF-8 with no sanity checking...
>
> Got some Koreans to back you up on that? EUC-KR (realistically, CP949)
> still lives; see the <head> of http://chosun.com and donga.com.
> naver.com is UTF-8 (not a total surprise). Perhaps all the Korean text
> processing *in Lua* is being done in unchecked UTF-8, but I kinda
> doubt it. ...

Of course I'm not claiming that all text-processing (even Lua
text-processing) is now done in UTF-8 -- there was a touch of
tongue-in-cheek to my message (but an element of truth as well).... :]

I live in Japan and write software for a Japanese company, so I have a
little experience in the matter.  Shift-JIS (for "local" use) and
EUC-JP (for email) are still _hugely_ used.  [At my work, we tried to
standardize on UTF-8 for a project, but ended up using Shift-JIS
simply because it's the only encoding that MS dev tools support
sanely, and has a lot of legacy support in other tools.  The encoding
support in our own code is a mess that tries to generally support
various multibyte encodings, but in practice probably only has to work
properly for Shift-JIS and UTF-8.]

Nonetheless, my intuition is that:

  (a) The Lua universe is not the more general universe.  Projects
  using Lua tend to be smaller, and less dependent on giant
  frameworks.  The tradeoffs for Lua projects are often somewhat
  different as a result.  Simple is good.

  (b) UTF-8 is generally considered the future here (something that
  must be supported now, and will increasingly replace other
  encodings) even in areas where there's a lot of legacy need/support
  for other encodings.  People don't use older encodings because they
  _can_, but because they _must_.  There's definitely an awareness
  that moving to UTF-8 is something that should be done if it's
  possible though.

  (c) In a lot of applications, not all that much "detail handling" is
  needed for what text-processing they do -- and if UTF-8 can be
  assumed, things get _much simpler_.  [E.g., if you have to manipulate
  multi-byte-encoded pathnames (or anything with meaningful ASCII
  syntax), it's really simple in UTF-8 -- the same code one uses for
  ASCII will work fine -- but miserable in Shift-JIS, because random
  ASCII characters can occur in the middle of multi-byte characters.]

So I'd say there's several levels of text-processing support:

  (1) If you can, don't even bother: treat strings as blobs, and
  don't care what's in them (the default Lua state).

  (2) If you need to do a little manipulation, try to use UTF-8 for
  the encoding, but don't make any particular attempt to hide that
  fact that they are encoded (i.e., don't pretend that strings are
  "sequences of characters").  Only use what small functions you can
  get away with (assuming UTF-8 makes this _much_ easier), e.g.,
  counting characters, converting between byte- and character- offsets
  etc.  Such functions for UTF-8 are generally so easy that for many
  projects it's fine to just write them yourself, but a "tiny-utf8"
  library might not be a bad idea (which basically supports only stuff
  that's either trivial as a result of the encoding properties, or can
  be very compactly encoded).

  (3) If you're doing full-fat text-processing (text-editor, etc),
  maybe you do need real unicode support, giant tables and all.  It
  would be good to have a standard Lua library for this (there is one,
  I think, but I don't remember the name).

For cases where legacy non-ASCII encodings _need_ to be supported,
especially if you need "full-fat" features, I dunno what good choices
there are, especially if you want to be portable (and so can't rely on
e.g. iconv)...  Handling legacy multi-byte encodings is generally a
lot messier and more intrusive, and so should be avoided if possible;
sometimes platform libraries can make things easier, but that sort of
moves out of the realm of general Lua discussion.

-miles

--
Any man who is a triangle, has thee right, when in Cartesian Space,
to have angles, which when summed, come to know more, nor no less,
than nine score degrees, should he so wish.  [TEMPLE OV THEE LEMUR]


Re: What do you miss most in Lua

Rena
On Mon, Feb 6, 2012 at 19:12, Miles Bader <[hidden email]> wrote:

> [...]

I do think a simple UTF-8 library would be quite a good thing to have
- basically just have all of Lua's string methods, but operating on
characters instead of bytes. (So e.g. ustring.sub(str, 3, 6) would
extract the 3rd to 6th characters of str, not necessarily bytes.) My
worry though would be ending up like PHP, where you have to remember
to use the mb_* functions instead of the normal ones.

I suspect this could be accomplished by means of a function that
"converts" a string to a UTF-8 string, which would be represented as a
table or userdata with metamethods to make it behave like a string.
Then you could just write:
str = U'this is a UTF-8 string'
print(#str) --gives number of characters, not number of bytes
the main problem I can see then would be that type(str) ~= "string"...
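
Purely as an illustration of that idea, a sketch with a plain table plus metamethods, assuming well-formed UTF-8 input and positive indices only (note that __len on tables is only honored from Lua 5.2 on):

local umeta = {}
umeta.__index = umeta

local function U(s) return setmetatable({ s = s }, umeta) end

function umeta:__len()                       -- #ustr counts code points (5.2+)
  local _, count = self.s:gsub("[\1-\127\194-\244][\128-\191]*", "")
  return count
end

function umeta:__tostring() return self.s end

function umeta:sub(i, j)                     -- ustr:sub(3, 6) in code points
  local starts = {}
  for pos in self.s:gmatch("()[\1-\127\194-\244][\128-\191]*") do
    starts[#starts + 1] = pos
  end
  j = j or #starts
  if not starts[i] then return U("") end
  local last = (starts[j + 1] or #self.s + 1) - 1
  return U(self.s:sub(starts[i], last))
end

-- U("héllo"):sub(2, 3).s  --> "él";   #U("héllo") --> 5 (Lua 5.2)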

--
Sent from my toaster.


Re: What do you miss most in Lua

Miles Bader-2
HyperHacker <[hidden email]> writes:

> I do think a simple UTF-8 library would be quite a good thing to have
> - basically just have all of Lua's string methods, but operating on
> characters instead of bytes. (So e.g. ustring.sub(str, 3, 6) would
> extract the 3rd to 6th characters of str, not necessarily bytes.) My
> worry though would be ending up like PHP, where you have to remember
> to use the mb_* functions instead of the normal ones.
>
> I suspect this could be accomplished by means of a function that
> "converts" a string to a UTF-8 string, which would be represented as a
> table or userdata with metamethods to make it behave like a string.
> Then you could just write:
> str = U'this is a UTF-8 string'
> print(#str) --gives number of characters, not number of bytes
> the main problem I can see then would be that type(str) ~= "string"...

Even what you're suggesting sounds pretty heavy-weight.

I think many people looking at the issue try too hard to come up with
some pretty abstraction, but that the actual benefit to users of these
abstractions isn't so great... especially for environments (like Lua)
where one is trying to minimize support libraries.

For instance, I don't think the illusion of unit characters is
particularly valuable for most apps, and trying to maintain that
illusion is expensive.  Nor does it seem necessary to
hide the encoding unless you're in the position of needing to support
legacy multibyte encodings (and I'm ignoring that case because it adds
a huge amount of hair which I think isn't worth messing up the common
case for).

My intuition is that almost all string processing tends to treat
strings not as sequences of "characters" so much as sequences of other
strings, many of which are fixed, and so have known properties.

It seems much more realistic to me -- and perfectly usable -- to
simply say that strings contain UTF-8, and offer a few functions like:

  utf8.unicode_char (STRING[, BYTE_INDEX = 0]) => UNICHAR
  utf8.char_offset (STRING, BYTE_INDEX, NUM_CHARS) => NEW_BYTE_INDEX

["char_offset" maybe defined to work properly if the input byte index
isn't on a proper character boundary, and with a special case for
NUM_CHARS == 0 to align to the beginning of the character containing
the input byte index.]

verrry simple and light-weight.
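
One possible reading of those two in plain Lua, strictly as a sketch: 1-based byte indices as with string.byte, well-formed UTF-8 assumed, and offsets clamped at the string ends.

local function unicode_char(s, i)            -- code point starting at byte i
  i = i or 1
  local c = s:byte(i)
  if c < 0x80 then return c end
  local len = c < 0xE0 and 2 or c < 0xF0 and 3 or 4
  local cp = c % 2 ^ (7 - len)               -- strip the length-marker bits
  for j = i + 1, i + len - 1 do
    cp = cp * 64 + s:byte(j) % 64
  end
  return cp
end

local function char_offset(s, i, n)          -- move n code points from byte i
  -- first align to the lead byte of the character containing byte i
  -- (this is also the NUM_CHARS == 0 case)
  while i > 1 and s:byte(i) >= 0x80 and s:byte(i) < 0xC0 do
    i = i - 1
  end
  local step = n < 0 and -1 or 1
  for _ = 1, math.abs(n) do
    repeat
      i = i + step
    until i <= 1 or i > #s or s:byte(i) < 0x80 or s:byte(i) >= 0xC0
  end
  return math.max(1, math.min(i, #s + 1))    -- clamp to [1, #s + 1]
end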

Most existing string functions are also perfectly usable on UTF-8, and
do something reasonable with it:

   sub

        Works fine if the indices are calculated reasonably -- and I
        think this is almost always the case.  People don't generally
        do [[ string.sub (UNKNOWN_STRING, 3, 6) ]], they calculate a
        string position, e.g. by searching, or string beginning/end,
        and maybe calculate offsets based on _known_ contents, e.g.
        [[ string.sub (s, 1, string.find (s, "/") - 1) ]]

        [One exception might be chopping a string to fit some length
        limit using [[ string.sub (s, 1, LIMIT) ]].  Where it's
        actually a byte limit (fixed buffers etc), something like [[
        string.sub (s, 1, utf8.char_offset (s, LIMIT)) ]] suffices,
        but for things like _display_ limits, calculating display
        widths of unicode characters isn't so easy...even with full
        tables.]

   upper
   lower

        Works fine, but of course only upcases ASCII characters.
        However doing this "properly" requires unicode tables, so
        isn't appropriate for a minimal library I guess.

   len

        Works fine for calculating the string byte length -- which is
        often what is actually wanted -- or calculating the string
        index of the end of the string (for further searching or
        whatever).

   rep
   format

        Work fine (only use concatenation)

   byte
   char

        Work fine

   find
   match
   gmatch
   gsub

        Work fine for the most part.  The main exception, of course,
        is single-character wildcards, ".", "[^abc]", etc, when used
        without a repeat suffix -- but I think in practice, these are
        very rarely used without a repeat suffix.

        Some of the patterns are limited to ASCII in their
        interpretation of course (e.g. "%a"), but this isn't really
        fixable without full unicode tables, and the ASCII-only
        interpretation is not dangerous.

   dump

        N/A

   reverse

        Now _this_ will probably simply fail for strings containing
        non-ASCII UTF-8.  But it's also probably not very widely
        used...


IOW, before trying to come up with some pretty (and expensive)
abstraction, it seems worthwhile to think: in what _real_ situations
(i.e., actually occur in practice) does simply "doing nothing" not
work?  In some cases, code might have to be tweaked a little, but I
suspect it's often enough to just say "so don't do that" (because most
code doesn't do that anyway).

The main question I suppose is:  is the resulting user code, using
mostly ordinary string functions plus a little minimal utf8 tweaking,
going to be significantly uglier/harder-to-maintain/confusing, to the
point where using a heavier-weight abstraction might be worthwhile?

My suspicion is that for most apps, the answer is no...

-miles

--
Yossarian was moved very deeply by the absolute simplicity of
this clause of Catch-22 and let out a respectful whistle.
"That's some catch, that Catch-22," he observed.
"It's the best there is," Doc Daneeka agreed.


Re: What do you miss most in Lua

Tim Hill
I think the middle ground here is to distinguish between Lua strings as byte arrays and UTF-8 as sequences of code points. Doing full-blown Unicode support is horrendous, but providing library functions that could (perhaps not very efficiently) discern code points within the UTF-8 byte stream would be useful, without providing interpretation of those code points (which is where all the complexity of Unicode lies). One approach to this (not one I am championing, though) would be routines to convert a UTF-8 string to/from a Lua array of code points (each, presumably, a number). A better approach would be to walk the string directly, one code point at a time. None of these are good solutions, and imho Unicode switched from being a solution to being a problem when it overflowed the BMP.

--Tim


On Feb 6, 2012, at 9:29 PM, Miles Bader wrote:

> [...]



Re: What do you miss most in Lua

Roberto Ierusalimschy
In reply to this post by Miles Bader-2
> The main question I suppose is:  is the resulting user code, using
> mostly ordinary string functions plus a little minimal utf8 tweaking,
> going to be significantly uglier/harder-to-maintain/confusing, to the
> point where using a heavier-weight abstraction might be worthwhile?
>
> My suspicion is that for most apps, the answer is no...

You are my idol :)

-- Roberto


Re: What do you miss most in Lua

David Given
In reply to this post by Rena
HyperHacker wrote:
[...]
> I do think a simple UTF-8 library would be quite a good thing to have
> - basically just have all of Lua's string methods, but operating on
> characters instead of bytes.

What do you mean by a 'character'? A Unicode code point? A grapheme
cluster?

If you split the string on code points you'll end up breaking grapheme
clusters in the middle, which will break any combining characters. If
you split the string on grapheme clusters you'll preserve the ability to
do random access into the string, but your string manipulation library
now becomes hideously heavyweight: grapheme clusters can be *any length*
(although there seems to be a promise that normalised Unicode won't have
any grapheme clusters longer than 32 code points).

The standard intuition that strings are made up of an array of
characters is, unfortunately, not really true in Unicode. It's basically
not possible to do random access into a Unicode string without jumping
through painful hoops.

--
┌─── dg@cowlark.com ───── http://www.cowlark.com ─────
│ "I have always wished for my computer to be as easy to use as my
│ telephone; my wish has come true because I can no longer figure out
│ how to use my telephone." --- Bjarne Stroustrup



Re: What do you miss most in Lua

Tim Mensch
On 2/7/2012 4:13 AM, David Given wrote:
> What do you mean by a 'character'? A Unicode code point?

I can't speak to what HyperHacker means, but when I manipulated UTF-8
for handling internationalization on over 50 games, I never had to deal
with a SINGLE instance of grapheme clusters breaking the code. I ignored
them, and those 50 games were translated into at least 3 languages each,
and some as many as 9 languages (including Chinese, Japanese, and Arabic).

I would suggest that a LOT of UTF-8 usage in the real world follows that
pattern; not everyone is writing text entry fields.

If you're having to deal with completely arbitrary Unicode, then yes,
you need to deal with grapheme clusters. An optimization we added for
UTF-8 code points would actually be useful there, though: In each string
we cached the last code point offset requested. If you asked for s[i],
it would find the i'th code point, and remember the binary offset at
that code point, so loops like:

for (int i=0; i<s.length(); ++i)
{
     // do something with s[i]
}

...would only need to "walk" the string from the last code point, which
is an O(1) operation up or down. You could use the same cache to "walk"
grapheme clusters, and then it's mostly O(1), unless you have crazy
large clusters making up the text, at which point it's O(M) where M is
cluster length.
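
In Lua terms, that cache might look roughly like this (names invented, no bounds checking, well-formed UTF-8 assumed):

local function wrap(s)                      -- code point 1 starts at byte 1
  return { s = s, last_i = 1, last_off = 1 }
end

local function lead(b) return b < 0x80 or b >= 0xC0 end

local function cp_offset(w, i)              -- byte offset of code point i
  local s, j, off = w.s, w.last_i, w.last_off
  while j < i do                            -- walk forward from the cache
    repeat off = off + 1 until off > #s or lead(s:byte(off))
    j = j + 1
  end
  while j > i do                            -- or backward
    repeat off = off - 1 until off == 1 or lead(s:byte(off))
    j = j - 1
  end
  w.last_i, w.last_off = i, off
  return off
end

local function at(w, i)                     -- w[i] as a one-code-point string
  local a = cp_offset(w, i)
  return w.s:sub(a, cp_offset(w, i + 1) - 1)
end

Sequential access then walks one code point per call, which is the cheap-increment behavior described above.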

In neither case does it necessarily make the library hideously
heavyweight, though, unless adding an index/offset pair to each UTF-8
string is hideously overweight. The only obvious change is that, instead
of returning a code point, which can fit in an int, the s[i] above would
return a binary blob (effectively, an opaque string).

Tim


Re: What do you miss most in Lua

Jay Carlson
In reply to this post by Miles Bader-2
On Tue, Feb 7, 2012 at 5:29 AM, Miles Bader <[hidden email]> wrote:
> HyperHacker <[hidden email]> writes:
>> I do think a simple UTF-8 library would be quite a good thing to have
>> - basically just have all of Lua's string methods, but operating on
>> characters instead of bytes. (So e.g. ustring.sub(str, 3, 6) would
>> extract the 3rd to 6th characters of str, not necessarily bytes.) My
>> worry though would be ending up like PHP, where you have to remember
>> to use the mb_* functions instead of the normal ones.

So I started down this path, and realized the same thing Miles did: I
very rarely did this. General multilingual text is far more
complicated than ASCII, and there's not much one really can do in,
say, a loop iteration with COMBINING DOUBLE INVERTED BREVE. Contrary
to monolingual practice, iteration or addressing by code point is just
not that common.

> I think many people looking at the issue try too hard to come up with
> some pretty abstraction, but that the actual benefit to users of these
> abstractions isn't so great... especially for environments (like Lua)
> where one is trying to minimize support libraries.

Yeah. "Just throw it in" seems like a typical disaster like the HTML
DOM. http://c2.com/xp/YouArentGonnaNeedIt.html says:

|    "Always implement things when you actually need them, never when
you just foresee that you need them."
|    Even if you're totally, totally, totally sure that you'll need a
feature later on, don't implement it now. Usually, it'll turn out
either a) you don't need it after all, or b) what you actually need is
quite different from what you foresaw needing earlier.

I like to say that when you build generality for a future you don't
understand, when the future arrives you find out you were right: you
didn't understand it.

> My intuition is that almost all string processing tends to treat
> strings not as sequences of "characters" so much as sequences of other
> strings, many of which are fixed, and so have known properties.

Yeah. string.gmatch is the real string iteration operation, for
multiple reasons. It expresses intent compactly, it's implemented in C
so it's faster than an interpreter, and it's common enough that it
should be implemented once by smart, focused people who are probably
going to do a better job at it than you are in each individual
instance.

As I type that, I notice those look remarkably like the arguments for
any inclusion in string.* anyway.

> It seems much more realistic to me -- and perfectly usable -- to
> simply say that strings contain UTF-8,

...well-formed UTF-8...

> and offer a few functions like:
>
>  utf8.unicode_char (STRING[, BYTE_INDEX = 0]) => UNICHAR
>  utf8.char_offset (STRING, BYTE_INDEX, NUM_CHARS) => NEW_BYTE_INDEX

I agree, although I would prefer we talk about code points, since
people coming from "one glyph, one character" environments (the
precomposed world) are just going to lose when their mental model
encounters U+202B RIGHT-TO-LEFT EMBEDDING or combining characters or
all of the oddities of scripts I never have seen.

> Most existing string functions are also perfectly usable on UTF-8, and
> do something reasonable with it:

...when the functions' domain is well-formed UTF-8...

>   sub
>
>        Works fine if the indices are calculated reasonably -- and I
>        think this is almost always the case.  People don't generally
>        do [[ string.sub (UNKNOWN_STRING, 3, 6) ]], they calculate a
>        string position, e.g. by searching, or string beginning/end,
>        and maybe calculate offsets based on _known_ contents, e.g.
>        [[ string.sub (s, 1, string.find (s, "/") - 1) ]]

In an explicitly strongly-typed language, these numbers, call them
UINDEXs, would belong to a separate type, because you can't do
arithmetic on them in any obviously useful way.

I appreciate that concise examples are hard, but [[
string.sub(s,1,string.find(s,"/")-1) ]] sounds more like a weakness in
the string library. I'm lazy, so I would write [[ string.match(s,
"(.-)/") ]]. This violates the "write code in Lua, not in regexps"
principle, especially if you write it as string.match(s,
"([^/]*)/") or other, worse things that happen before I've finished my
coffee. Sprinkle with assert() to taste.

I have a mental blind spot on non-greedy matching. Iterating on
"(.-)whatever" is one of those things I wish was in that hypothetical
annotated manual.

In regexp languages without non-greedy capture I have to write a
function returning [[ string_before_match, match = sfind(s, "/") ]].
This is the single-step split function, and is a primitive in how I
think about processing strings.[1]

>        [One exception might be chopping a string to fit some length
>        limit using [[ string.sub (s, 1, LIMIT) ]].  Where it's
>        actually a byte limit (fixed buffers etc), something like [[
>        string.sub (s, 1, utf8.char_offset (s, LIMIT)) ]] suffices,

Agree. A clamp function that returns at most n bytes of valid UTF-8.
This may separate composed characters, but the usage model is you'll
either be concatenating with the following characters later, or you
really don't care about textual fidelity because truncation is the
most important goal.
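
A sketch of that clamp, assuming the input is valid UTF-8 to begin with (the name is made up):

local function clamp_utf8(s, n)             -- at most n bytes, still valid UTF-8
  if n >= #s then return s end
  local i = n + 1
  while i > 1 and s:byte(i) >= 0x80 and s:byte(i) < 0xC0 do
    i = i - 1                               -- byte n+1 splits a sequence: back up
  end
  return s:sub(1, i - 1)
end
-- clamp_utf8("héllo", 2) --> "h"  (the split "é" is dropped entirely)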

Because of how UTF-8 works, it is easy to produce valid UTF-8 output
when given valid UTF-8 input. (Get used to that sentence.)

>        but for things like _display_ limits, calculating display
>        widths of unicode characters isn't so easy...even with full
>        tables.]

This looks like it should be done in a library, but there is a useful
thing like it. I could see clamping to n code points instead of bytes.
For the precomposed world, this can approximate character cells in an
xterm, especially if you count CJK fullwidth as two. So as an
exercise, here's what you'd need to do.

Theoretically you need the whole table, but given the sloppy goal of
"don't run off the end of the 80 'column' CJK line if you can help it"
it's 0x11xx, 0x2Fxx-0x9Fxx, 0xAC-0xD7, 0xF9-0xFA. Outside the BMP,
0x200-0x2FF. Yes, I shortchanged the I Ching.[3]

The arrangement of Unicode does suggest an iterator/mapper primitive:
given a code point c, look up t[c >> 16]. If it's a
number/string/boolean, return it; if it's a table, return t[c>>16][c
&& 0xff]. Presumably these would be gets rather than rawgets, so
subtables would have an __index which could look up and memoize on the
fly. This would handle the jagged U+1160-U+11FF more correctly. I
dunno. I said nobody wants to iterate over strings, and now I've
contradicted myself.

>   upper
>   lower

Because of how UTF-8 works, it is easy to produce valid UTF-8 output
when given valid UTF-8 input.

>        Works fine, but of course only upcases ASCII characters.

...if you're in the C locale. Amusingly--well, no, frustratingly--the
MacPorts version of lua run interactively gives different results
because readline sets the locale:

$ export LC_ALL=en_US.ISO8859-15
$ lua -e 's=string f="%02x" print(f:format(s.byte(s.upper(s.char(0xE9)))))'
e9
$ lua
Lua 5.1.4  Copyright (C) 1994-2008 Lua.org, PUC-Rio
> s=string f="%02x" print(f:format(s.byte(s.upper(s.char(0xE9)))))
c9

>   len
>        [...works] for calculating the string
>        index of the end of the string (for further searching or
>        whatever).

Yeah. It returns that UINDEX opaque number type when used that way.

>   rep
>   format

Because of how UTF-8 works, it is *guaranteed* to produce valid UTF-8
output when given valid UTF-8 input. This guarantee not valid if you
end up in a non-UTF-8 locale with number formatting outside ASCII....

>   byte
>   char
>
>        Work fine

Whaaa? #char(0xE9) needs to be 2 if we're working in the UTF-8 text
domain. Similarly, string.byte("Я") needs to return 1071. Which says
"byte" is a bad name, but those two need to be inverses.

byte(s, i, j) is also only defined when beginning and ending at UINDEX
positions; that is to say, you can't start in the middle of a UTF-8
sequence and you can't stop in the middle of one either. But given the
#char(0xE9)==2 definition, char is not a loophole for bogus UTF-8 to
sneak in.
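
A sketch of the encoding direction under that (made-up) name; the decode direction is the same arithmetic run in reverse:

local function text_char(cp)                -- code point -> UTF-8 bytes
  local floor, char = math.floor, string.char
  if cp < 0x80 then
    return char(cp)
  elseif cp < 0x800 then
    return char(0xC0 + floor(cp / 64), 0x80 + cp % 64)
  elseif cp < 0x10000 then
    return char(0xE0 + floor(cp / 4096),
                0x80 + floor(cp / 64) % 64, 0x80 + cp % 64)
  else
    return char(0xF0 + floor(cp / 262144), 0x80 + floor(cp / 4096) % 64,
                0x80 + floor(cp / 64) % 64, 0x80 + cp % 64)
  end
end
-- #text_char(0xE9) --> 2;   text_char(0x42F) --> "Я"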

Obviously we still need to keep around bytewise operations on some
stringlike thing. (Wait, obviously? If you're not in a single-byte
locale the candidates for bytewise ops look like things you don't
necessarily want to intern at first sight. The counterexample is
something like binary SHA-1 for table lookup.)

>   find
>   match
>   gmatch
>   gsub
>
>        Work fine for the most part.  The main exception, of course,
>        is single-character wildcards, ".", "[^abc]", etc, when used
>        without a repeat suffix -- but I think in practice, these are
>        very rarely used without a repeat suffix.

Agree. I think "." and friends need to consume a whole UTF-8 code
point, since otherwise they could return confusing values which aren't
UINDEXs, and produce captures with invalid content. I imagine this is
not that hard as long as non-ASCII subpatterns are not allowed. ("Я" is
not a single byte, and "Я+" looks painful.) But you could easily
search for a literal "Я" the same way you search for "abc" now.

With the consumption caveat, it is (relatively) easy to produce valid
UTF-8 output when given valid UTF-8 input.

With that definition, I just noticed that string.gmatch(".") *is* the
code point iterator. Hmm.
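
In stock Lua, the same iteration can already be spelled with an explicit pattern, assuming the input is well-formed UTF-8:

for c in ("Яблоко"):gmatch("[\1-\127\194-\244][\128-\191]*") do
  io.write(c, " ")                          -- each c is one whole code point
end
print()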

>   reverse
>        Now _this_ will probably simply fail for strings containing
>        non-ASCII UTF-8.

And you don't want to reverse your combining code points either.
Nobody will use it.

> IOW, before trying to come up with some pretty (and expensive)
> abstraction, it seems worthwhile to think: in what _real_ situations
> (i.e., actually occur in practice) does simply "doing nothing" not
> work?  In some cases, code might have to be tweaked a little, but I
> suspect it's often enough to just say "so don't do that" (because most
> code doesn't do that anyway).

I'm beginning to think we could get away with doing most of this in
UTF-8 with existing operations, *and* have some hope of retaining
well-formedness, without too much additional code or runtime overhead.
But the catch is that you must guarantee well-formed UTF-8 on the way
in or you'll get garbage out. So that's why I want a memoized
assert_utf8(): defend the border, and a lot of other things take care
of themselves. Otherwise, you'll undoubtedly get invalid singleton
high bytes wandering into a string, causing random other bugs with
little hope of tracing back to the place where you added one to a
UOFFSET.

If we are interning strings with a hash, we have to walk their whole
length anyway; might be worth checking then. For those operations the
C code is really certain are valid UTF-8, they could tell luaS_newlstr
about it. lua_concat of UTF-8 strings is guaranteed to be UTF-8....

> The main question I suppose is:  is the resulting user code, using
> mostly ordinary string functions plus a little minimal utf8 tweaking,
> going to be significantly uglier/harder-to-maintain/confusing, to the
> point where using a heavier-weight abstraction might be worthwhile?
>
> My suspicion is that for most apps, the answer is no...

Well, that certainly makes Roberto happy. I think after going through
this exercise, the unresolved question is whether there should be a
byte vs text distinction in operations.

I think I've made a good case that text.* would reduce the cost of bugs
by localizing the source of error, and the complexity of implementation
doesn't look that bad.  With such a distinction between text and
bytes, it's not as high a compatibility cost to switch or add UTF-16
or UTF-32 internal representations; in the latter, it just turns out
every UINDEX is valid.

The elephant in the room is normalization forms; once you've got all
these parts, you're going to want NFC. But that's big-table, and a
loadable library can provide a string-to-string transformation.

Jay

[1]: (string_before, match) is a perl4 habit I think, from the $`
special variable. It has an easy implementation for literal matches,
and *that* habit probably goes back to Applesoft, Commodore, or Harris
VULCAN BASIC. Along with jwz's "now they have two problems"[2] I
believe I've personally hit the Dijkstra trifecta (
http://www.cs.utexas.edu/users/EWD/transcriptions/EWD04xx/EWD498.html
):

"PL/I—'the fatal disease'—belongs more to the problem set than to the
solution set.

"It is practically impossible to teach good programming to students
that have had a prior exposure to BASIC: as potential programmers they
are mentally mutilated beyond hope of regeneration.

"The use of COBOL cripples the mind; its teaching should, therefore,
be regarded as a criminal offence."

[2]: See http://regex.info/blog/2006-09-15/247 for a lot of
original-source background on "now they have two problems". I spent a
bunch of time tracking down a *bogus* jwz attribution which I would in
turn cite, but handhelds.org mail archives have been down for like a
year, and http://article.gmane.org/gmane.comp.handhelds.ipaq.general/12198
and jwz's followup at
http://article.gmane.org/gmane.comp.handhelds.ipaq.general/12226 are
not responding this morning either....

[3]: No blame.


Re: What do you miss most in Lua

Egil Hjelmeland
What about sorting/collating? That would be useful. But is that a
big-table-thing in Unicode?
Best regards
ignorant Egil


On 2012-02-07 22:40, Jay Carlson wrote:

> On Tue, Feb 7, 2012 at 5:29 AM, Miles Bader<[hidden email]>  wrote:
>> HyperHacker<[hidden email]>  writes:
>>> I do think a simple UTF-8 library would be quite a good thing to have
>>> - basically just have all of Lua's string methods, but operating on
>>> characters instead of bytes. (So e.g. ustring.sub(str, 3, 6) would
>>> extract the 3rd to 6th characters of str, not necessarily bytes.) My
>>> worry though would be ending up like PHP, where you have to remember
>>> to use the mb_* functions instead of the normal ones.
> So I started down this path, and realized the same thing Miles did: I
> very rarely did this. General multilingual text is far more
> complicated than ASCII, and there's not much one really can do in,
> say, a loop iteration with COMBINING DOUBLE INVERTED BREVE. Contrary
> to monolingual practice, iteration or addressing by code point is just
> not that common.
>
>> I think many people looking at the issue try too hard to come up with
>> some pretty abstraction, but that the actual benefit to users of these
>> abstractions isn't so great... especially for environments (like Lua)
>> where one is trying to minimize support libraries.
> Yeah. "Just throw it in" seems like a typical disaster like the HTML
> DOM. http://c2.com/xp/YouArentGonnaNeedIt.html says:
>
> |    "Always implement things when you actually need them, never when
> you just foresee that you need them."
> |    Even if you're totally, totally, totally sure that you'll need a
> feature later on, don't implement it now. Usually, it'll turn out
> either a) you don't need it after all, or b) what you actually need is
> quite different from what you foresaw needing earlier.
>
> I like to say that when you build generality for a future you don't
> understand, when the future arrives you find out you were right: you
> didn't understand it.
>
>> My intuition is that almost all string processing tends to treat
>> strings not as sequences of "characters" so much as sequences of other
>> strings, many of which are fixed, and so have known properties.
> Yeah. string.gmatch is the real string iteration operation, for
> multiple reasons. It expresses intent compactly, it's implemented in C
> so it's faster than an interpreter, and it's common enough that it
> should implemented once by smart, focused people who are probably
> going to do a better job at it than you are in each individual
> instance.
>
> As I type that, I notice those look remarkably like the arguments for
> any inclusion in string.* anyway.
>
>> It seems much more realistic to me -- and perfectly usable -- to
>> simply say that strings contain UTF-8,
> ...well-formed UTF-8...
>
>> and offer a few functions like:
>>
>>   utf8.unicode_char (STRING[, BYTE_INDEX = 0]) =>  UNICHAR
>>   utf8.char_offset (STRING, BYTE_INDEX, NUM_CHARS) =>  NEW_BYTE_INDEX
> I agree, although I would prefer we talk about code points, since
> people coming from "one glyph, one character" environments (the
> precomposed world) are just going to lose when their mental model
> encounters U+202B RIGHT-TO-LEFT EMBEDDING or combining characters or
> all of the oddities of scripts I never have seen.
>
>> Most existing string functions are also perfectly usable on UTF-8, and
>> do something reasonable with it:
> ...when the functions' domain is well-formed UTF-8...
>
>>    sub
>>
>>         Works fine if the indices are calculated reasonably -- and I
>>         think this is almost always the case.  People don't generally
>>         do [[ string.sub (UNKNOWN_STRING, 3, 6) ]], they calculate a
>>         string position, e.g. by searching, or string beginning/end,
>>         and maybe calculate offsets based on _known_ contents, e.g.
>>         [[ string.sub (s, 1, string.find (s, "/") - 1) ]]
> In an explicitly strongly-typed language, these numbers, call them
> UINDEXs, would belong to a separate type, because you can't do
> arithmetic on them in any obviously useful way.
>
> I appreciate that concise examples are hard, but [[
> string.sub(s,1,string.find(s,"/")-1) ]] sounds more like a weakness in
> the string library. I'm lazy so I would write [[ string.match(s,
> "(.-)/") ]]. This violates "write code in Lua, not in regexps"
> principle, especially if you write it as the string.match(s,
> "([^/]*)/") or other worse things that happen before I've finished my
> coffee. Sprinkle with assert() to taste.
>
> I have a mental blind spot on non-greedy matching. Iterating on
> "(.-)whatever" is one of those things I wish was in that hjypothetical
> annotated manual.
>
> In regexp languages without non-greedy capture I have to write a
> function returning [[ string_before_match, match = sfind(s, "/") ]].
> This is the single-step split function, and is a primitive in how I
> think about processing strings.[1]
>
>>         [One exception might be chopping a string to fit some length
>>         limit using [[ string.sub (s, 1, LIMIT) ]].  Where it's
>>         actually a byte limit (fixed buffers etc), something like [[
>>         string.sub (s, 1, utf8.char_offset (s, LIMIT)) ]] suffices,
> Agree. A clamp function that returns at most n bytes of valid UTF-8.
> This may separate composed characters, but the usage model is you'll
> either be concatenating with the following characters later, or you
> really don't care about textual fidelity because truncation is the
> most important goal.
>
> Because of how UTF-8 works, it is easy to produce valid UTF-8 output
> when given valid UTF-8 input. (Get used to that sentence.)
>
>>         but for things like _display_ limits, calculating display
>>         widths of unicode characters isn't so easy...even with full
>>         tables.]
> This looks like it should be done in a library, but there is a useful
> thing like it. I could see clamping to n code points instead of bytes.
> For the precomposed world, this can approximate character cells in an
> xterm, especially if you count for CJK fullwidth as two. So as an
> exercise, here's what you'd need to do.
>
> Theoretically you need the whole table, but given the sloppy goal of
> "don't run off the end of the 80 'column' CJK line if you can help it"
> it's 0x11xx, 0x2Fxx-0x9Fxx, 0xAC-0xD7, 0xF9-0xFA. Outside the BMP,
> 0x200-0x2FF. Yes, I shortchanged the I Ching.[3]
>
> The arrangement of Unicode does suggest a iterator/mapper primitive:
> given a code point c, look up t[c>>  16]. If it's a
> number/string/boolean, return it; if it's a table, return t[c>>16][c
> &&  0xff]. Presumably these would be gets rather than rawgets, so
> subtables would have an __index which could look up and memoize on the
> fly. This would handle the jagged U+1160-U+11FF more correctly. I
> dunno. I said nobody wants to iterate over strings, and now I've
> contradicted myself.
>
>>    upper
>>    lower
> Because of how UTF-8 works, it is easy to produce valid UTF-8 output
> when given valid UTF-8 input.
>
>>         Works fine, but of course only upcases ASCII characters.
> ...if you're in the C locale. Amusingly--well, no, frustratingly--the
> MacPorts version of lua run interactively gives different results
> because readline sets the locale:
>
> $ export LC_ALL=en_US.ISO8859-15
> $ lua -e 's=string f="%02x" print(f:format(s.byte(s.upper(s.char(0xE9)))))'
> e9
> $ lua
> Lua 5.1.4  Copyright (C) 1994-2008 Lua.org, PUC-Rio
>> s=string f="%02x" print(f:format(s.byte(s.upper(s.char(0xE9)))))
> c9
>
>>    len
>>         [...works] for calculating the string
>>         index of the end of the string (for further searching or
>>         whatever).
> Yeah. It returns that UINDEX opaque number type when used that way.
>
>>    rep
>>    format
> Because of how UTF-8 works, it is *guaranteed* to produce valid UTF-8
> output when given valid UTF-8 input. This guarantee not valid if you
> end up in a non-UTF-8 locale with number formatting outside ASCII....
>
>>    byte
>>    char
>>
>>         Work fine
> Whaaa? #char(0xE9) needs to be 2 if we're working in the UTF-8 text
> domain. Similarly, string.byte("Я")) needs to return 1071. Which says
> "byte" is a bad name but those two need to be inverses.
>
> byte(s, i, j) is also only defined when beginning and ending at UINDEX
> positions; that is to say, you can't start in the middle of a UTF-8
> sequence and you can't stop in the middle of one either. But given the
> #char(0xE9)==2 definition, char is not a loophole for bogus UTF-8 to
> sneak in.
>
> Obviously we still need to keep around bytewise operations on some
> stringlike thing. (Wait, obviously? If you're not in a single-byte
> locale the candidates for bytewise ops look like things you don't
> necessarily want to intern at first sight. The counterexample is
> something like binary SHA-1 for table lookup.)
>
>>    find
>>    match
>>    gmatch
>>    gsub
>>
>>         Work fine for the most part.  The main exception, of course,
>>         is single-character wildcards, ".", "[^abc]", etc, when used
>>         without a repeat suffix -- but I think in practice, these are
>>         very rarely used without a repeat suffix.
> Agree. I think "." and friends need to consume a whole UTF-8 code
> point, since otherwise they could return confusing values which aren't
> UINDEXs, and produce captures with invalid content. I imagine this is
> not that hard as long as non-ASCII subpatterns are not allowed. "Я" is
> not a single byte, and "Я+" looks painful.) But you could easily
> search for a literal "Я" the same way you search for "abc" now.
>
> With the consumption caveat, it is (relatively) easy to produce valid
> UTF-8 output when given valid UTF-8 input.
>
> With that definition, I just noticed that string.gmatch(".") *is* the
> code point iterator. Hmm.
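>
> In the meantime you can fake that iterator on known-valid UTF-8 with a
> byte-range class (a rough sketch; it assumes the input is already well
> formed and has no embedded zero bytes):
>
>    -- A lead byte (ASCII or 0xC2-0xF4) plus any continuation bytes.
>    local charpattern = "[\1-\127\194-\244][\128-\191]*"
>
>    for ch in ("Яблоко"):gmatch(charpattern) do
>       io.write(ch, " ")
>    end
>    print()   --> Я б л о к о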
>
>>    reverse
>>         Now _this_ will probably simply fail for strings containing
>>         non-ASCII UTF-8.
> And you don't want to reverse your combining code points either.
> Nobody will use it.
>
>> IOW, before trying to come up with some pretty (and expensive)
>> abstraction, it seems worthwhile to think: in what _real_ situations
>> (i.e., actually occur in practice) does simply "doing nothing" not
>> work?  In some cases, code might have to be tweaked a little, but I
>> suspect it's often enough to just say "so don't do that" (because most
>> code doesn't do that anyway).
> I'm beginning to think we could get away with doing most of this in
> UTF-8 with existing operations, *and* have some hope of retaining
> well-formedness, without too much additional code or runtime overhead.
> But the catch is that you must guarantee well-formed UTF-8 on the way
> in or you'll get garbage out. So that's why I want a memoized
> assert_utf8(): defend the border, and a lot of other things take care
> of themselves. Otherwise, you'll undoubtedly get invalid singleton
> high bytes wandering into a string, causing random other bugs with
> little hope of tracing back to the place where you added one to a
> UOFFSET.
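>
> Something like this is all I mean by a memoized assert_utf8() -- a
> sketch only; the names are made up, and whether a weak-keyed cache
> actually evicts string keys depends on the Lua version:
>
>    -- Per-string verdict cache. __mode = "k" is only a hint here.
>    local checked = setmetatable({}, { __mode = "k" })
>
>    -- Structural check: lead bytes 0xC2-0xF4 plus continuation bytes.
>    -- It does not reject every overlong or surrogate form.
>    local function valid_utf8(s)
>       local i, n = 1, #s
>       while i <= n do
>          local b = s:byte(i)
>          local len = b < 0x80 and 1
>                   or b >= 0xC2 and b <= 0xDF and 2
>                   or b >= 0xE0 and b <= 0xEF and 3
>                   or b >= 0xF0 and b <= 0xF4 and 4
>          if not len then return false end
>          for k = 1, len - 1 do
>             local c = s:byte(i + k)
>             if not c or c < 0x80 or c > 0xBF then return false end
>          end
>          i = i + len
>       end
>       return true
>    end
>
>    local function assert_utf8(s)
>       local ok = checked[s]
>       if ok == nil then
>          ok = valid_utf8(s)
>          checked[s] = ok
>       end
>       return assert(ok and s, "invalid UTF-8")
>    end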
>
> If we are interning strings with a hash, we have to walk their whole
> length anyway; it might be worth checking then. For operations whose
> results the C code is certain are valid UTF-8, they could tell
> luaS_newlstr about it. lua_concat of UTF-8 strings is guaranteed to be UTF-8....
>
>> The main question I suppose is:  is the resulting user code, using
>> mostly ordinary string functions plus a little minimal utf8 tweaking,
>> going to be significantly uglier/harder-to-maintain/confusing, to the
>> point where using a heavier-weight abstraction might be worthwhile?
>>
>> My suspicion is that for most apps, the answer is no...
> Well, that certainly makes Roberto happy. I think after going through
> this exercise, the unresolved question is whether there should be a
> byte vs text distinction in operations.
>
> I think I've made a good case that text.* would reduce the cost of
> bugs by localizing the source of error, and the complexity of
> implementation doesn't look that bad. With such a distinction between text and
> bytes, it's not as high a compatibility cost to switch or add UTF-16
> or UTF-32 internal representations; in the latter, it just turns out
> every UINDEX is valid.
>
> The elephant in the room is normalization forms; once you've got all
> these parts, you're going to want NFC. But that's big-table, and a
> loadable library can provide a string-to-string transformation.
>
> Jay
>
> [1]: (string_before, match) is a perl4 habit I think, from the $`
> special variable. It has an easy implementation for literal matches,
> and *that* habit probably goes back to Applesoft, Commodore, or Harris
> VULCAN BASIC. Along with jwz's "now they have two problems"[2] I
> believe I've personally hit the Dijkstra trifecta (
> http://www.cs.utexas.edu/users/EWD/transcriptions/EWD04xx/EWD498.html
> ):
>
> "PL/I—'the fatal disease'—belongs more to the problem set than to the
> solution set.
>
> "It is practically impossible to teach good programming to students
> that have had a prior exposure to BASIC: as potential programmers they
> are mentally mutilated beyond hope of regeneration.
>
> "The use of COBOL cripples the mind; its teaching should, therefore,
> be regarded as a criminal offence."
>
> [2]: See http://regex.info/blog/2006-09-15/247 for a lot of
> original-source background on "now they have two problems". I spent a
> bunch of time tracking down a *bogus* jwz attribution which I would in
> turn cite, but handhelds.org mail archives have been down for like a
> year, and http://article.gmane.org/gmane.comp.handhelds.ipaq.general/12198
> and jwz's followup at
> http://article.gmane.org/gmane.comp.handhelds.ipaq.general/12226 are
> not responding this morning either....
>
> [3]: No blame.
>
>



Re: What do you miss most in Lua

Tim Mensch
On 2/7/2012 3:11 PM, Egil Hjelmeland wrote:
> What about sorting/collating? That would be useful. But is that a
> big-table-thing in Unicode?

What language are you sorting for?

In traditional Spanish collation, "LL" comes after "LZ" in sort order,
and "CH" comes after "CZ", but not in other languages. ("LL" and "CH"
are treated as single letters, and therefore sort differently.)

There are many other rules that apply to specific languages and
contexts; a dictionary sort in English is different from an alphanumeric
sort, for instance (a1, a10, a9 versus a1, a9, a10, respectively).
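
To make the Spanish rule concrete, here's a toy comparator for the
traditional ordering -- ASCII-only, lowercase-only, and ignoring
accents, ñ, and everything else a real collator has to handle:

   -- Traditional Spanish alphabet: "ch" and "ll" are single letters.
   local letters = {
      "a","b","c","ch","d","e","f","g","h","i","j","k","l","ll","m",
      "n","o","p","q","r","s","t","u","v","w","x","y","z",
   }
   local order = {}
   for i, l in ipairs(letters) do order[l] = i end

   -- Split a lowercase ASCII word into traditional "letters".
   local function tokens(word)
      local t, i = {}, 1
      while i <= #word do
         local pair = word:sub(i, i + 1)
         if pair == "ch" or pair == "ll" then
            t[#t + 1] = pair; i = i + 2
         else
            t[#t + 1] = word:sub(i, i); i = i + 1
         end
      end
      return t
   end

   local function es_less(a, b)
      local ta, tb = tokens(a), tokens(b)
      for i = 1, math.min(#ta, #tb) do
         local ra, rb = order[ta[i]] or 0, order[tb[i]] or 0
         if ra ~= rb then return ra < rb end
      end
      return #ta < #tb
   end

   local words = { "chapa", "cosa", "llama", "cuna", "luz" }
   table.sort(words, es_less)
   print(table.concat(words, ", "))
   --> cosa, cuna, chapa, luz, llama

Even that toy version is language-specific and table-driven; scale it to
every language and every script and you get the numbers below.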

But the short answer is: Yes, it's one of the things you need a huge
table and supporting source code to completely handle in Unicode. Or an
even bigger table if you REALLY need to do it right in multiple locales.

If you're curious, there's an online table generator [1] where you can
build those Big Tables from just the pieces you need (mappings from
other charsets, a break iterator, collators, rule-based number
formatting, and more). That's fortunate, because ALL of it together
comes to about 18 MB. Keep in mind that's JUST the data, not the code
needed to parse it, which in one case built to a 900 KB DLL on Windows
(with 300 KB of embedded tables) just for generic collation handling
(trying to make collation sane, though not actually correct, for all
languages).

More examples of strange collation exceptions and general detail about
collation can be found on unicode.org [2].

Tim

[1] http://apps.icu-project.org/datacustom/
[2] http://www.unicode.org/reports/tr10/


Re: What do you miss most in Lua

Rena
In reply to this post by Jay Carlson
> Obviously we still need to keep around bytewise operations on some
> stringlike thing. (Wait, obviously? If you're not in a single-byte
> locale the candidates for bytewise ops look like things you don't
> necessarily want to intern at first sight. The counterexample is
> something like binary SHA-1 for table lookup.)
>

Mind, Lua strings are often not strings of text, but strings of binary data...

--
Sent from my toaster.


Re: What do you miss most in Lua

Vaughan McAlley-2
In reply to this post by Jay Carlson
On 8 February 2012 08:40, Jay Carlson <[hidden email]> wrote:
>> The main question I suppose is:  is the resulting user code, using
>> mostly ordinary string functions plus a little minimal utf8 tweaking,
>> going to be significantly uglier/harder-to-maintain/confusing, to the
>> point where using a heavier-weight abstraction might be worthwhile?
>>
>> My suspicion is that for most apps, the answer is no...
>
> Well, that certainly makes Roberto happy.

The Lua team must have handled a lot of Portuguese strings by now. If
it gave them trouble, LHF surely would have written a library already...

Vaughan


Re: What do you miss most in Lua

Jay Carlson
On Wed, Feb 8, 2012 at 2:04 AM, Vaughan McAlley <[hidden email]> wrote:

> On 8 February 2012 08:40, Jay Carlson <[hidden email]> wrote:
>>> The main question I suppose is:  is the resulting user code, using
>>> mostly ordinary string functions plus a little minimal utf8 tweaking,
>>> going to be significantly uglier/harder-to-maintain/confusing, to the
>>> point where using a heavier-weight abstraction might be worthwhile?
>>>
>>> My suspicion is that for most apps, the answer is no...
>>
>> Well, that certainly makes Roberto happy.
>
> The Lua team would have handled a lot of Portuguese strings by now. If
> it gave them issues LHF surely would have written a library by now...

The unfortunate Tower of Babel incident left us with a lot of
different writing systems, many of which do not fit into 8-bit bytes,
and certainly not at once. Much of the EU may have (had?) a common
currency, but it does not have a common single-byte character set.

Many of the non-gray parts of
http://en.wikipedia.org/wiki/List_of_writing_systems are
well-populated, and many are computerizing and joining the Internet,
although sometimes haltingly.
http://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal) may be
interesting; the PPP tables finally show Brazil within inches of
passing the UK! But computers are already pervasive in Japan and the
Republic of Korea too, and the People's Republic of China is on its
way--and the CJK written languages are difficult to handle in
unextended Lua. Some people would argue many aspects of CJK are fairly
easy by now compared to some other scripts....

I'm going to put your skepticism a different way: is processing text
in languages covering n% of, say, the Internet population a goal for
Lua? When Lua was young, it was easier to choose an n such that text
and sequences of bytes shared structure. It's getting harder.

There is a separate question of "OK, so then what?" and perhaps there
is not a good answer. Unicode is not the ideal basis for any
particular script, for starters.

Jay


Re: What do you miss most in Lua

Miles Bader-2
Jay Carlson <[hidden email]> writes:
>  But computers are already pervasive in Japan and the Republic of
> Korea too, and the People's Republic of China is on its way--and the
> CJK written languages are difficult to handle in unextended Lua.

Wait... how are CJK "hard to handle in unextended Lua" ...?

   $ cat > '短歌.lua' << EOF
   > local lfs = require 'lfs'
   >
   > local tanka_dir = "短歌"
   >
   > function tanka ()
   >    for file in lfs.dir (os.getenv ("HOME").."/"..tanka_dir) do
   >       author = string.match (file, "^(.*)[.]txt$")
   >       if author then
   >          print ("歌人: "..author)
   >          for line in io.lines (tanka_dir.."/"..file) do
   >             print ("   "..line)
   >          end
   >       end
   >    end
   > end
   > EOF

   $ lua5.1
   Lua 5.1.4  Copyright (C) 1994-2008 Lua.org, PUC-Rio
   > loadfile "短歌.lua" ()
   > tanka ()
   歌人: 俵万智
      「寒いね」と話しかければ「寒いね」と答える人のいるあったかさ

hmm, seems OK to me...

-miles

--
Brain, n. An apparatus with which we think we think.


Re: What do you miss most in Lua

Jay Carlson
In reply to this post by Rena
On Wed, Feb 8, 2012 at 12:17 AM, HyperHacker <[hidden email]> wrote:
>> Obviously we still need to keep around bytewise operations on some
>> stringlike thing. (Wait, obviously? If you're not in a single-byte
>> locale the candidates for bytewise ops look like things you don't
>> necessarily want to intern at first sight. The counterexample is
>> something like binary SHA-1 for table lookup.)
>>
>
> Mind, Lua strings are often not strings of text, but strings of binary data...

I think I was getting ahead of myself there. Let me explain the whole
mess for people not following the situation.

The nice no-nonsense value-semantics bags of arbitrary bytes are
missing from so many other languages. It would be a shame to give that
up.[1] During the discussion of the hash collision problem, one
implementation strategy that came up was simply not hashing long
strings until needed. Your code wouldn't notice, except that the
performance characteristics would change. This is similar to the array
part of tables. You can use any key you want in a table, but if you
use ascending integers, performance will be better.

One use of strings is to slurp whole files in and hand them to
external code, never to be seen again. Another is to serve as the
stereotypical 4kbyte input buffer. In neither case is the identity of
the string important. You never say "is this 4k I read from the
network the same as another transient buffer?" Nor do you use them as
keys in tables. If the only operations on long strings are to search
and extract substrings, then adding them to the global string table
has little benefit--the only use is if their lifetime overlaps with a
completely identical string, in which case they can share storage.
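
A contrived sketch of those two patterns (the digest function is a
stand-in, not a real binary SHA-1):

   -- Transient buffer: slurped, handed off, never compared or used as
   -- a key. Hashing/interning it buys nothing.
   local blob = string.rep("\0\1\2\3", 1024)   -- stands in for a 4k read

   -- Identity matters: key the table by a short digest of the blob
   -- rather than by the blob itself, so only the digest gets hashed.
   local function digest(s)                    -- stand-in for binary SHA-1
      return #s .. ":" .. s:sub(1, 8)
   end

   local seen = {}
   local key = digest(blob)
   if not seen[key] then
      seen[key] = true                         -- first time for this content
   end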

Then there's implementation simplicity: behind the scenes there would
have to be two different kinds of strings, but they'd have to behave
identically.

What I was thinking is that the bag-of-bytes usage often matches up
with the uninterned string usage pattern and wondering how much of the
current string.* is really useful in both cases. In particular, I was
wondering out loud if there was reason to have two distinct
user-visible types to reduce the complexity of making the two work
identically, or if they should be one type with two distinct
implementations.

I'm thinking out loud in a lot of this, and if it seems a little jumpy,
it's because my thoughts are not settled.

How often do you use the blobs as table keys or compare them for equality?

Jay
[1]: Well, give that up again perhaps. Userdata used to have value
semantics. People didn't like it. But interned bags of bytes look a
lot like that, just with accessors.
