Plea for the support of unicode escape sequences

Plea for the support of unicode escape sequences

Edgar Toernig
As it's the last chance for probably 5 or 6 years to ask for it:

Could the next version have support for Unicode escape sequences?
(like "A smiley: \u263a, an en-dash: \u2013, an ellipsis: \u2026")

Unicode is in wide use now but encoding characters using the \x hex
escapes is annoying.  Even now most extension libraries are at least
UTF-8 transparent but there's no sane way to enter non-trivial
unicode characters.  For example, the above string encoded by hand would
become "A smiley: \xe2\x98\xba, an en-dash: \xe2\x80\x93, an
ellipsis: \xe2\x80\xa6" and as unicode-tables usually don't contain
the UTF-8 encoded form you have to do the conversion manually.
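
As an illustration, the by-hand conversion described above can be sketched in C. The function name utf8_encode is invented for this example and is not part of Lua or of any library mentioned in this thread. It assumes the conventional four-byte limit (code points up to 0x10FFFF) and does not reject surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point (0..0x10FFFF) as UTF-8.
 * Writes up to 4 bytes to out and returns the number of bytes written,
 * or 0 if cp is above 0x10FFFF. Surrogates are not filtered out. */
size_t utf8_encode(uint32_t cp, unsigned char out[4]) {
  if (cp <= 0x7F) {                       /* 0xxxxxxx */
    out[0] = (unsigned char)cp;
    return 1;
  }
  if (cp <= 0x7FF) {                      /* 110xxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xC0 | (cp >> 6));
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp <= 0xFFFF) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
  }
  if (cp <= 0x10FFFF) {                   /* 11110xxx followed by three 10xxxxxx */
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
  }
  return 0;
}
```

With this sketch, utf8_encode(0x263a, buf) returns 3 and fills buf with e2 98 ba, which matches the hand-encoded \xe2\x98\xba in the string above.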

For stock Lua, these Unicode escape sequences should generate UTF-8.
(Modified versions using wchars may use an appropriate encoding, UTF-16
or UTF-32.)  I don't care whether the common \u+4hexdigits and
\U+8hexdigits or a variable \u+1to6or8hexdigits sequence is implemented
(but don't make it decimal; Unicode tables usually use hex numbering).

I know that Lua's authors try to avoid bloat, but these additional
176 bytes (that's what an implementation of the \u4x/\U8x variant on
x86-32 costs) are IMHO very well spent.

Ciao, ET.


Re: Plea for the support of unicode escape sequences

Rebel Neurofog
> I know that Lua's authors try to avoid bloat, but these additional
> 176 bytes (that's what an implementation of the \u4x/\U8x variant on
> x86-32 costs) are IMHO very well spent.

Good call. I'm not sure I personally want it in the official release,
but it could be handy for my project.

Could you post a patch for Lua 5.1.4, please?


Re: Plea for the support of unicode escape sequences

Dirk Laurie
In reply to this post by Edgar Toernig
>
> Could the next version have support for Unicode escape sequences?
> (like "A smiley: \u263a, an en-dash: \u2013, an ellipsis: \u2026")
>

Why not simply enter them as UTF-8 using your particular system's
support for such characters?  

"A smiley: ☺, an en-dash: –, an ellipsis: …"

To enter these, I did not need to know the four-hexdigit codes:
my system has an easier method.

Dirk


Re: Plea for the support of unicode escape sequences

Edgar Toernig
In reply to this post by Rebel Neurofog
Rebel Neurofog wrote:
> > I know that Lua's authors try to avoid bloat, but these additional
> > 176 bytes (that's what an implementation of the \u4x/\U8x variant on
> > x86-32 costs) are IMHO very well spent.
>
> Good call. I'm not sure I personally want it in the official release,
> but it could be handy for my project.
>
> Could you post a patch for Lua 5.1.4, please?

I originally implemented it for the 5.2-rc and had to modify it
slightly for 5.1.  It got a little larger as 5.1 has no readhexaesc.
As a bonus you get hex-escapes, too.

Here you go:

diff --git a/src/llex.c b/src/llex.c
index 6dc3193..bc22303 100644
--- a/src/llex.c
+++ b/src/llex.c
@@ -273,6 +273,52 @@ static void read_long_string (LexState *ls, SemInfo *seminfo, int sep) {
 }
 
 
+static lu_int32 readhexaesc (LexState *ls, int ndigits) {
+  int buf[16], c, i;
+  lu_int32 x = 0;
+  buf[0] = '\\';
+  buf[1] = ls->current;
+  for (i = 0; i < ndigits; ++i) {
+    buf[2 + i] = c = next(ls);
+    if (!isxdigit(c))
+    {
+      /* prepare error message - show the valid part of the sequence */
+      int j;
+      luaZ_resetbuffer(ls->buff);
+      for (j = 0; j < i + 2; ++j)
+        save(ls, buf[j]);
+      luaX_lexerror(ls, "hexadecimal digit expected", TK_STRING);
+    }
+    if (isdigit(c))
+      c -= '0';
+    else
+      c = tolower(c) - 'a' + 10;
+    x = (x << 4) + c;
+  }
+  return x;
+}
+
+
+static void read_and_save_uniesc (LexState *ls, int ndigits) {
+  lu_int32 x = readhexaesc(ls, ndigits);
+  next(ls);
+  if (x > 0x7f) {
+    int buf[8], n = 0;
+    lu_int32 m = 0x3f;
+    do {
+      buf[n++] = 0x80 + (x & 0x3f);
+      x >>= 6;
+      m >>= 1;
+    } while (x > m);
+    save(ls, (~m << 1) + x);
+    while (n)
+      save(ls, buf[--n]);
+  }
+  else
+    save(ls, x);
+}
+
+
 static void read_string (LexState *ls, int del, SemInfo *seminfo) {
   save_and_next(ls);
   while (ls->current != del) {
@@ -295,6 +341,9 @@ static void read_string (LexState *ls, int del, SemInfo *seminfo) {
           case 'r': c = '\r'; break;
           case 't': c = '\t'; break;
           case 'v': c = '\v'; break;
+          case 'x': c = readhexaesc(ls, 2); break;
+          case 'u': read_and_save_uniesc(ls, 4); continue;
+          case 'U': read_and_save_uniesc(ls, 8); continue;
           case '\n':  /* go through */
           case '\r': save(ls, '\n'); inclinenumber(ls); continue;
           case EOZ: continue;  /* will raise an error next loop */
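
For readers who want to try the encoding loop outside the lexer, here is a standalone restatement of read_and_save_uniesc. It is a sketch only: the name uniesc_to_bytes and the caller-supplied output buffer are assumptions made for this example, and uint32_t stands in for lu_int32.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical standalone version of the patch's encoding loop.
 * Writes the byte sequence for x to out (at least 7 bytes) and
 * returns its length. No range check is done, as in the patch. */
size_t uniesc_to_bytes(uint32_t x, unsigned char *out) {
  unsigned char tail[8];   /* continuation bytes, collected in reverse */
  size_t n = 0, len = 0;
  uint32_t m = 0x3f;       /* largest value that still fits in the lead byte */

  if (x <= 0x7f) {         /* plain ASCII: a single byte */
    out[0] = (unsigned char)x;
    return 1;
  }
  do {
    tail[n++] = (unsigned char)(0x80 + (x & 0x3f));  /* 10xxxxxx */
    x >>= 6;
    m >>= 1;               /* every extra byte costs the lead byte one bit */
  } while (x > m);
  out[len++] = (unsigned char)((~m << 1) + x);       /* length marker + top bits */
  while (n)
    out[len++] = tail[--n];
  return len;
}
```

Called with 0x263a this yields e2 98 ba. Called with 0x7fffffff it yields a six-byte sequence whose first byte is 0xfd, which shows that the loop does not stop at the range Unicode currently defines.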



Re: Plea for the support of unicode escape sequences

Ico Doornekamp
In reply to this post by Edgar Toernig
* On Tue Jun 28 21:08:10 +0200 2011, Edgar Toernig wrote:
 
> As it's the last chance for probably 5 or 6 years to ask for it:
>
> Could the next version have support for Unicode escape sequences?
> (like "A smiley: \u263a, an en-dash: \u2013, an ellipsis: \u2026")

Could you not just enter the literal UTF-8 symbol into your strings?
Any decent editor supports UTF-8 these days, so there should be no need
for special encoding of these characters. ☺ is easier to read than
\u263a, imho.

Ico


--
:wq
^X^Cy^K^X^C^C^C^C


Re: Plea for the support of unicode escape sequences

Florian Weimer
In reply to this post by Edgar Toernig
* Edgar Toernig:

> +          case 'U': read_and_save_uniesc(ls, 8); continue;

Shouldn't \U take only 6 digits, and only characters up to \U1fffff?
Code points beyond that are no longer representable in the current
version of UTF-8, after all.


Re: Plea for the support of unicode escape sequences

Rebel Neurofog
In reply to this post by Edgar Toernig
Edgar Toernig, thanks a lot!


Re: Plea for the support of unicode escape sequences

Edgar Toernig
In reply to this post by Dirk Laurie
Dirk Laurie wrote:

> >
> > Could the next version have support for Unicode escape sequences?
> > (like "A smiley: \u263a, an en-dash: \u2013, an ellipsis: \u2026")
>
> Why not simply enter them as UTF-8 using your particular system's
> support for such characters?  
>
> "A smiley: ☺, an en-dash: –, an ellipsis: …"
>
> To enter these, I did not need to know the four-hexdigit codes:
> my system has an easier method.

Not everyone has an editor which a) supports Unicode and b) allows
one to insert every possible code point, for example non-printable
codes like \u200b (zero-width space).  And even if you can insert
them, you won't see them.  I would even say that on most systems,
the majority of Unicode's printable characters just produce a square.

Ciao, ET.


Re: Plea for the support of unicode escape sequences

Lorenzo Donati-2
In reply to this post by Ico Doornekamp
On 28/06/2011 21.49, Ico wrote:

> * On Tue Jun 28 21:08:10 +0200 2011, Edgar Toernig wrote:
>
>> As it's the last chance for probably 5 or 6 years to ask for it:
>>
>> Could the next version have support for Unicode escape sequences?
>> (like "A smiley: \u263a, an en-dash: \u2013, an ellipsis: \u2026")
>
> Could you not just enter the literal UTF-8 symbol into your strings?
> Any decent editor supports UTF-8 these days, so there should be no need
> for special encoding of these characters. ☺ is easier to read than
> \u263a, imho.

I see two problems:

1. sometimes two different glyphs may look the same (think of it as a
3rd-millennium version of "don't put uppercase Os where they can be mistaken
for 0s"). Or they may look different, but in different ways on different
systems.

2. even if you have an editor capable of displaying and entering glyphs,
you may lack the appropriate fonts. Don't assume the files containing
Unicode chars are always opened for viewing/editing on the same system
where they were created.

Unicode escape sequences are platform independent. They are useful for
the same reasons why ASCII codes are useful, at least for people working
with Unicode.

>
> Ico
>
>

-- Lorenzo



Re: Plea for the support of unicode escape sequences

Lorenzo Donati-2
In reply to this post by Edgar Toernig
On 28/06/2011 21.08, Edgar Toernig wrote:

> As it's the last chance for probably 5 or 6 years to ask for it:
>
> Could the next version have support for Unicode escape sequences?
> (like "A smiley: \u263a, an en-dash: \u2013, an ellipsis: \u2026")
>
> Unicode is in wide use now but encoding characters using the \x hex
> escapes is annoying.  Even now most extension libraries are at least
> UTF-8 transparent but there's no sane way to enter non-trivial
> unicode characters.  I.e. the above string encoded by hand would
> become "A smiley: \xe2\x98\xba, an en-dash: \xe2\x80\x93, an
> ellipsis: \xe2\x80\xa6" and as unicode-tables usually don't contain
> the UTF-8 encoded form you have to do the conversion manually.
>
> For stock Lua, these Unicode escape sequences should generate UTF-8.

Yes. I'd find those useful sometimes, but wouldn't a simple library in
pure Lua be equally suitable (besides maybe performance)?

I'm not really an expert on UTF-8, but if you say the needed tables are
small, wouldn't a simple library suffice, with, for example, a function like:

utf8lib.encode [[\u263a\u2026]] --> [[\xe2\x98\xba\xe2\x80\xa6]]


> (Modified versions using wchars may use an appropriate encoding, UTF-16
> or UTF-32.)  I don't care whether the common \u+4hexdigits and
> \U+8hexdigits or a variable \u+1to6or8hexdigits sequence is implemented
> (but don't make it decimal; Unicode tables usually use hex numbering).
>
> I know that Lua's authors try to avoid bloat, but these additional
> 176 bytes (that's what an implementation of the \u4x/\U8x variant on
> x86-32 costs) are IMHO very well spent.
>




> Ciao, ET.
>
>
>
Cheers,
-- Lorenzo


Re: Plea for the support of unicode escape sequences

Edgar Toernig
In reply to this post by Florian Weimer
Florian Weimer wrote:
> * Edgar Toernig:
>
> > +          case 'U': read_and_save_uniesc(ls, 8); continue;
>
> Shouldn't \U take only 6 digits, and only characters up to \U1fffff?
> Code points beyond that are no longer representable in the current
> version of UTF-8, after all.

The (computer-)languages I know all use \u plus 4 hex digits and,
if they support non-bmp escapes, \U plus 8 hex digits.
And they don't deal with the complex Unicode standard per se but
only with the encoding of code points used by the standard.

What is representable and what is covered by some standard are
two different things.  I.e. the _encoding method_ given in UTF-8
supports encoding of all 32 bits of the \U-escape (as does UTF-32)
but the Unicode standard covers only the range from 0 to 0x10ffff.
Which code-points are actually valid depends on context and the
version number of the standard and are thus hard to check ;-)

Point is: Any valid Unicode code-point, when given as \u/\U-escape,
produces a valid UTF-8 byte stream for now and the foreseeable future.
Sure, you _may_ give invalid code-points and produce non-valid
bytes but that's easy anyway ("\xff broken unicode").
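
If a coarse range check were wanted anyway, it might look like the following sketch. The function name is made up for this example, and the check covers only the two properties that do not change between versions of the standard: the upper bound 0x10ffff mentioned above and the surrogate block 0xd800 to 0xdfff. It says nothing about whether a code point is actually assigned.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check: true if cp lies in 0..0x10FFFF and is not a
 * UTF-16 surrogate. Unassigned code points are not detected. */
bool is_unicode_scalar(uint32_t cp) {
  if (cp > 0x10FFFF)
    return false;                      /* beyond the range Unicode covers */
  if (cp >= 0xD800 && cp <= 0xDFFF)
    return false;                      /* reserved for UTF-16 surrogate pairs */
  return true;
}
```

A lexer could call such a function on the value of a \U escape before encoding it, at the cost of a few more bytes of code.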

Ciao, ET.


Re: Plea for the support of unicode escape sequences

Paul E. Merrell, J.D.
In reply to this post by Edgar Toernig
On Tue, Jun 28, 2011 at 1:04 PM, Edgar Toernig <[hidden email]> wrote:

> I would even say that on most systems,
> the majority of Unicode's printable characters just produce a square.

Almost invariably that's because the font being used does not have
full unicode support.

Best regards,

Paul


Re: Plea for the support of unicode escape sequences

Tom N Harris
In reply to this post by Lorenzo Donati-2
On 06/28/2011 04:24 PM, Lorenzo Donati wrote:
> Unicode escape sequences are platform independent. They are useful for
> the same reasons why ASCII codes are useful, at least for people working
> with Unicode.
>

Technically, Lua doesn't even require ASCII, as the recent adventures
with lctype.c have shown. Unicode is platform specific because not all
platforms use the same encoding (UTF-8 vs UTF-16). And when Unicode
isn't being used at all this will just be dead-weight in the parser.

How about supporting escape sequences greater than 255 when sizeof(char)>1 ?

--
- tom
[hidden email]


Re: Plea for the support of unicode escape sequences

Xavier Wang
In reply to this post by Edgar Toernig


On 2011-6-29 at 11:28 AM, "Tom N Harris" <[hidden email]> wrote:
>
> On 06/28/2011 04:24 PM, Lorenzo Donati wrote:
>>
>> Unicode escape sequences are platform independent. They are useful for
>> the same reasons why ASCII codes are useful, at least for people working
>> with Unicode.
>>
>
> Technically, Lua doesn't even require ASCII, as the recent adventures with lctype.c have shown. Unicode is platform specific because not all platforms use the same encoding (UTF-8 vs UTF-16). And when Unicode isn't being used at all this will just be dead-weight in the parser.
>
> How about supporting escape sequences greater than 255 when sizeof(char)>1 ?
>
> --
> - tom
> [hidden email]
>
As the ISO C Standard specifies, the size of char is always 1.


Re: Plea for the support of unicode escape sequences

Lorenzo Donati-2
In reply to this post by Tom N Harris
On 29/06/2011 5.29, Tom N Harris wrote:
> On 06/28/2011 04:24 PM, Lorenzo Donati wrote:
>> Unicode escape sequences are platform independent. They are useful for
>> the same reasons why ASCII codes are useful, at least for people working
>> with Unicode.
>>
>
> Technically, Lua doesn't even require ASCII,

I admit I cut the sentence short, but I didn't mean that Lua supports
ASCII (the manual expressly states that string.byte returns non-portable
codes), but that, in general, if a language supports a specific
character set (ASCII was an example), it is useful to specify character
codes in a program instead of characters. And if it is useful for a
given pre-unicode charset, it is useful for Unicode too (for the same
reasons).

> as the recent adventures
> with lctype.c have shown. Unicode is platform specific because not all
> platforms use the same encoding (UTF-8 vs UTF-16). And when Unicode
> isn't being used at all this will just be dead-weight in the parser.
>

Well, I'm not an expert, but aside from the different encodings (UTF-8,
16, 32 and endianness variants), Unicode is standardized. So if you are
going to write a file in UTF-8, then the byte sequence for, say, a
smiley, will be the same on any computer on Earth that claims support
for UTF-8. There is no risk of "codepage hell". Of course there are lots
of non- or partially conforming applications/systems, but that's another
point.


> How about supporting escape sequences greater than 255 when
> sizeof(char)>1 ?
>

I don't understand exactly what you mean. Do you mean writing, for
example (assuming a new \GXXXX... multibyte esc sequence), \G10fa1b
instead of \x10\xfa\x1b (here I assume translation to Lua 5.2's new esc
sequences)?

The power of specific unicode esc sequences is that Lua will make the
table lookup for you, so it will translate a code point to the specific
byte sequence for, say, UTF-8 encoding.

-- Lorenzo


Re: Plea for the support of unicode escape sequences

David Given
In reply to this post by Xavier Wang
Xavier Wang wrote:
[...]
> As the ISO C Standard specifies, the size of char is always 1.

Just to expand (I got very confused by this when working on Clue): the
standard decrees that sizeof() units are all relative to the size of a
char. This means that on some esoteric platforms where chars are bigger
than 8 bits, sizeof(char) will *still* be 1. To figure out how big it
actually is you need CHAR_BIT in <limits.h>.

Plus, sizeof(char) has nothing to do with the machine's address
granularity (which may not be bytes; PICs use 14-bit words each
individually addressable to store their program, for example).
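
A small C sketch of the point about sizeof and CHAR_BIT follows. The helper name bits_in is invented for this example; CHAR_BIT and <limits.h> are standard C.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: convert a sizeof() result, which is measured
 * in chars, into a number of bits. */
size_t bits_in(size_t size_in_chars) {
  return size_in_chars * CHAR_BIT;
}
```

On every conforming implementation sizeof(char) is 1 and CHAR_BIT is at least 8, so bits_in(sizeof(char)) gives the real width of a char even on platforms where it is wider than 8 bits.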

--
┌─── dg@cowlark.com ───── http://www.cowlark.com ─────
│ "I have always wished for my computer to be as easy to use as my
│ telephone; my wish has come true because I can no longer figure out
│ how to use my telephone." --- Bjarne Stroustrup



Re: Plea for the support of unicode escape sequences

Sean Conner
In reply to this post by Tom N Harris
It was thus said that the Great Tom N Harris once stated:

> On 06/28/2011 04:24 PM, Lorenzo Donati wrote:
> >Unicode escape sequences are platform independent. They are useful for
> >the same reasons why ASCII codes are useful, at least for people working
> >with Unicode.
> >
>
> Technically, Lua doesn't even require ASCII, as the recent adventures
> with lctype.c have shown. Unicode is platform specific because not all
> platforms use the same encoding (UTF-8 vs UTF-16). And when Unicode
> isn't being used at all this will just be dead-weight in the parser.
>
> How about supporting escape sequences greater than 255 when sizeof(char)>1 ?

  The C standard dictates that sizeof(char) == 1.  You instead want to check
CHAR_BIT from limits.h.

  -spc (Or CHAR_MAX/UCHAR_MAX ... )





Re: Plea for the support of unicode escape sequences

Dirk Laurie
In reply to this post by Edgar Toernig
> > >
> > > Could the next version have support for Unicode escape sequences?
> > > (like "A smiley: \u263a, an en-dash: \u2013, an ellipsis: \u2026")
> >
> > Why not simply enter them as UTF-8 using your particular system's
> > support for such characters?  
> >
> > "A smiley: ☺, an en-dash: –, an ellipsis: …"
> >
> > To enter these, I did not need to know the four-hexdigit codes:
> > my system has an easier method.
>
> Not everyone has an editor which a) supports Unicode and b) allows
> one to insert every possible code point, for example non-printable
> codes like \u200b (zero-width space).  And even if you can insert
> them, you won't see them.  I would even say that on most systems,
> the majority of Unicode's printable characters just produce a square.
>
"Not everyone has …" is not a reason for everyone's Lua to be bloated
with a feature few people need.  The suggestion made in another post,
to provide a module that has a function (actually, why not name the
function `_u`?) such that

_u[["A smiley: \u263a, an en-dash: \u2013, an ellipsis: \u2026"]]

returns

"A smiley: ☺, an en-dash: –, an ellipsis: …"

seems to make a lot more sense.
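
To make the library idea concrete, here is a rough sketch of what such a function could do, written in C to match the patch earlier in the thread. The name expand_u_escapes is an assumption for this example, only the four-digit \u form is handled, and surrogate values are passed through unchecked.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy src to dst, replacing every backslash-u
 * followed by exactly four hex digits with the UTF-8 bytes of that
 * code point. dst needs strlen(src) + 1 bytes, since a six-character
 * escape never expands to more than three bytes.
 * Returns the length of the result, not counting the final NUL. */
size_t expand_u_escapes(const char *src, char *dst) {
  size_t len = 0;
  while (*src != '\0') {
    if (src[0] == '\\' && src[1] == 'u' &&
        isxdigit((unsigned char)src[2]) && isxdigit((unsigned char)src[3]) &&
        isxdigit((unsigned char)src[4]) && isxdigit((unsigned char)src[5])) {
      char hex[5];
      unsigned long cp;
      memcpy(hex, src + 2, 4);
      hex[4] = '\0';
      cp = strtoul(hex, NULL, 16);
      if (cp <= 0x7F) {                              /* one byte */
        dst[len++] = (char)cp;
      } else if (cp <= 0x7FF) {                      /* two bytes */
        dst[len++] = (char)(0xC0 | (cp >> 6));
        dst[len++] = (char)(0x80 | (cp & 0x3F));
      } else {                                       /* three bytes */
        dst[len++] = (char)(0xE0 | (cp >> 12));
        dst[len++] = (char)(0x80 | ((cp >> 6) & 0x3F));
        dst[len++] = (char)(0x80 | (cp & 0x3F));
      }
      src += 6;                                      /* skip the whole escape */
    } else {
      dst[len++] = *src++;                           /* ordinary byte: copy as is */
    }
  }
  dst[len] = '\0';
  return len;
}
```

The trade-off compared with the lexer patch is that the conversion happens at run time on every call rather than once when the chunk is compiled.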

Dirk


Re: Plea for the support of unicode escape sequences

steve donovan
On Wed, Jun 29, 2011 at 9:49 AM, Dirk Laurie <[hidden email]> wrote:
> _u[["A smiley: \u263a, an en-dash: \u2013, an ellipsis: \u2026"]]
>
> returns
>
> "A smiley: ☺, an en-dash: –, an ellipsis: …"
>
> seems to make a lot more sense.

Yes, and it's possible to do it efficiently in a library. So adding it
to the language is unnecessary.

steve d.


Re: Plea for the support of unicode escape sequences

David Heiko Kolf
In reply to this post by Edgar Toernig
Edgar Toernig wrote:
> I know that Lua's authors try to avoid bloat, but these additional
> 176 bytes (that's what an implementation of the \u4x/\U8x variant on
> x86-32 costs) are IMHO very well spent.

It's not just those 176 bytes. To support UTF-8 properly the string
pattern functions (find, match, ...) would also need to recognize UTF-8
characters as single characters in character sets. And then you would
need to extend all the predefined classes (%a, %c, %g, ...). This
updated pattern matching would break the classic C character handling.

To avoid incompatibilities a second version of the pattern matching
functions would be needed. (Though I guess they could share a lot of
code). Maybe this could be a compile time option.

A half-baked solution (just escapes, not patterns) should be avoided in
my opinion, as I guess many users will try something like
string.match(s, "[\u2013\u2026]").
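
The pitfall can be shown with a byte-wise set match, which is how a character set behaves when the pattern code treats every byte as one character. The sketch below uses the standard C function strpbrk as a stand-in; the wrapper name bytewise_set_match is invented for this example.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for a byte-oriented character set:
 * true if any single byte of s equals any single byte of set. */
bool bytewise_set_match(const char *s, const char *set) {
  return strpbrk(s, set) != NULL;
}
```

With set holding the UTF-8 bytes of \u2013 and \u2026 (e2 80 93 e2 80 a6) and s holding only a smiley (e2 98 ba), the function returns true, because both share the lead byte 0xe2. The set matched a byte, not one of the two intended characters.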

-- David
