Plea for the support of unicode escape sequences


Re: Plea for the support of unicode escape sequences

Marc Balmer
On 29.06.2011 18:55, David Kolf wrote:

> Edgar Toernig wrote:
>> I know that Lua's authors try to avoid bloat, but these additional
>> 176 bytes (that's what an implementation of the \u4x/\U8x variant on
>> x86-32 costs) are IMHO very well spent.
>
> It's not just those 176 bytes. To support UTF-8 properly the string
> pattern functions (find, match, ...) would also need to recognize UTF-8
> characters as single characters in character sets. And then you would
> need to extend all the predefined classes (%a, %c, %g, ...). This
> updated pattern matching would break the classic C character handling.
>
> To avoid incompatibilities a second version of the pattern matching
> functions would be needed. (Though I guess they could share a lot of
> code). Maybe this could be a compile time option.
>
> A half baked solution (just escapes, not patterns) should be avoided in
> my opinion, as I guess many users will try something like
> string.match(s, "[\u2013\u2026]").

I strongly second this.  Unicode is far more than a few escape sequences...

If you need unicode support, provide the needed functions to Lua through
the C API, I'd say.
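
The character-set concern quoted above can be sketched in plain Lua. This is only an illustration under the assumption that the escapes would expand to UTF-8 bytes; the decimal escapes below stand in for \u2013 and \u2026, and the variable names are made up for the example.

```lua
-- U+2013 EN DASH is E2 80 93 and U+2026 HORIZONTAL ELLIPSIS is E2 80 A6
-- in UTF-8. Decimal escapes are used so this also runs on Lua 5.1.
local set = "[\226\128\147\226\128\166]"

-- U+2014 EM DASH (E2 80 94) is not one of the two intended characters...
local em_dash = "\226\128\148"

-- ...but the byte-oriented matcher sees a set of four single bytes
-- (E2, 80, 93, A6) and happily matches the first byte of the em dash.
local hit = string.match(em_dash, set)
print(#hit, string.byte(hit))  -- one byte, value 226
```

So the set matches a lone byte of an unrelated character, which is the kind of surprise the quoted paragraph warns about.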



Re: Plea for the support of unicode escape sequences

Mike Pall-28
In reply to this post by David Heiko Kolf
David Kolf wrote:
> A half baked solution (just escapes, not patterns) should be avoided in
> my opinion

Adding an external UTF-8 pattern matching library is easy enough,
but it won't be able to extend the syntax for strings. Asking
users to insert only literal glyphs or to pre-process all UTF-8
strings is a half baked solution.

So one either has to bite the bullet and add those 176 bytes or
one has to make the lexer extensible. The latter is not easy to
get right and would certainly need a lot more code.

Ignoring the problem and doing nothing is not an option. Unicode
is there to stay. Deciding not to support UTF-8 escapes in 2011 is
rather anachronistic.

--Mike


Re: Plea for the support of unicode escape sequences

Louis Mamakos
In reply to this post by David Heiko Kolf
When reading this thread on my smartphone over lunch earlier today, the subject line as displayed in the message list was:

        Re: Plea for the support of unico....

I initially thought the word was "Unicorns".  That might be easier to arrange and get consensus for.  Perhaps rainbows, too.

louie



Re: Plea for the support of unicode escape sequences

Lorenzo Donati-2
On 29/06/2011 19.32, Louis Mamakos wrote:
> When reading this thread on my smartphone over lunch earlier today, the subject line as displayed in the message list was:
>
> Re: Plea for the support of unico....
>
> I initially thought the word was "Unicorns".  That might be easier to arrange and get consensus for.  Perhaps rainbows, too.

What's the UTF-8 sequence for the unicorn symbol? ;-) :-D

>
> louie
>
>
>
>



Re: Plea for the support of unicode escape sequences

Petite Abeille

On Jun 29, 2011, at 7:36 PM, Lorenzo Donati wrote:

> What's the UTF-8 sequence for the unicorn symbol? ;-) :-D

0xE9 0xBA 0x9F obviously :P



Re: Plea for the support of unicode escape sequences

Petite Abeille
In reply to this post by Mike Pall-28

On Jun 29, 2011, at 7:16 PM, Mike Pall wrote:

> Deciding not to support UTF-8 escapes in 2011 is
> rather anachronistic.

Meh.

What's wrong with hex sequences?

print( '\xE2\x86\x92' )

→

Or literals?

print( '←' )

←

Plus, the OP requested unicode sequences, e.g. \u2192 instead of \xE2\x86\x92 for →. Doesn't seem worth the trouble.




Re: Plea for the support of unicode escape sequences

Dirk Laurie
On Wed, Jun 29, 2011 at 08:33:28PM +0200, Petite Abeille wrote:

>
> On Jun 29, 2011, at 7:16 PM, Mike Pall wrote:
>
> > Deciding not to support UTF-8 escapes in 2011 is
> > rather anachronistic.
>
> Meh.
>
> What's wrong with hex sequences?
>
> print( '\xE2\x86\x92' )
>
> →
>
> Or literals?
>
> print( '←' )
>
> ←
>
> Plus, the OP requested unicode sequences, e.g. \u2192 instead of \xE2\x86\x92 for →. Doesn't seem worth the trouble.
>

Yes, I don't see the point of supporting UTF-8 sequences while e.g.
the pattern "[a-z]*" doesn't do what you expect.

Just for clarification: is it true that UTF-8 is locale-independent
if you take care not to use any one-byte characters above '\x7f'?

Dirk
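
A small sketch of the "[a-z]*" remark, assuming the input is UTF-8. It does not settle the locale question; it only shows how the byte-oriented matcher treats a non-ASCII letter. The sample word and variable names are chosen for the example.

```lua
-- "naïve" in UTF-8: the ï (U+00EF) is the two bytes C3 AF.
local word = "na\195\175ve"

-- [a-z] is a range over byte values, so the match stops at byte C3.
local prefix = string.match(word, "[a-z]*")
print(prefix)  -- na

-- A commonly used idiom for counting code points: count every byte
-- that is not a UTF-8 continuation byte (80..BF).
local _, count = string.gsub(word, "[^\128-\191]", "")
print(#word, count)  -- 6 bytes, 5 code points
```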


Re: Plea for the support of unicode escape sequences

David Heiko Kolf
In reply to this post by Petite Abeille
Petite Abeille wrote:
> What's wrong with hex sequences?
>
> print( '\xE2\x86\x92' )

I guess the problem is that most tables only tell you the codepoint but
not the UTF-8 encoding. The UTF-8 encoding currently has the advantage
that it is clear to the user why the usual pattern matching fails.

However, Mike does have a point. If someone builds UTF-8 libraries, it
would be quite convenient if codepoint escape sequences were available.
Still, I guess the sequences shouldn't be advertised in the Lua manual,
and the limitations have to be clearly stated. Without the correct
library functions to support them, the codepoint escapes are likely to
cause confusion.

I wonder how compactly you can store the character classes for the 65k
codepoints in the BMP and the lowercase/uppercase pairs (for
string.lower, string.upper). Maybe that can be compressed far enough to
be included in official Lua (5.3?). That would be great.

-- David


PS: An illustration for the usefulness of UTF-8 libraries:

In my JSON implementation I wanted to include the JavaScript-regexp

> /[\\\"\x00-\x1f\x7f-\x9f\u00ad\u0600-\u0604\u070f\u17b4\u17b5\u200c-\u200f\u2028-\u202f\u2060-\u206f\ufeff\ufff0-\uffff]/g


In Lua that turned to:

> local function quotestring (value)
>   -- based on the regexp "escapable" in https://github.com/douglascrockford/JSON-js
>   value = fsub (value, "[%z\1-\31\"\\\127]", escapeutf8)
>   if strfind (value, "[\194\216\220\225\226\239]") then
>     value = fsub (value, "\194[\128-\159\173]", escapeutf8)
>     value = fsub (value, "\216[\128-\132]", escapeutf8)
>     value = fsub (value, "\220\143", escapeutf8)
>     value = fsub (value, "\225\158[\180\181]", escapeutf8)
>     value = fsub (value, "\226\128[\140-\143\168-\175]", escapeutf8)
>     value = fsub (value, "\226\129[\160-\175]", escapeutf8)
>     value = fsub (value, "\239\187\191", escapeutf8)
>     value = fsub (value, "\239\191[\176-\191]", escapeutf8)
>   end
>   return "\"" .. value .. "\""
> end

(fsub is just an optimization for gsub).

Or LPeg:

>   local SpecialChars = (R"\0\31" + S"\"\\\127" +
>     P"\194" * (R"\128\159" + P"\173") +
>     P"\216" * R"\128\132" +
>     P"\220\143" +
>     P"\225\158" * S"\180\181" +
>     P"\226\128" * (R"\140\143" + R"\168\175") +
>     P"\226\129" * R"\160\175" +
>     P"\239\187\191" +
>     P"\239\191" * R"\176\191") / escapeutf8
>
>   local QuoteStr = g.Cs (g.Cc "\"" * (SpecialChars + 1)^0 * g.Cc "\"")

(I guess there are already libraries and Lua bindings to make this
easier, but the point of my JSON library was to stay independent and
easy to use in environments like MUD clients where you might not have
much more than pure Lua).
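
The escapeutf8 callback is not shown in the post. Purely as an assumption about what such a callback might do, here is a hypothetical helper (the name escape_utf8_sketch is made up) that turns one matched UTF-8 sequence into a JSON-style \uXXXX escape. It is a sketch, not the code from the JSON library.

```lua
-- Hypothetical helper, not the escapeutf8 from the JSON library above.
-- Decodes one well-formed UTF-8 sequence of 1 to 3 bytes (enough for
-- the BMP) and returns it as a JSON-style \uXXXX escape.
local function escape_utf8_sketch(seq)
  local a, b, c = string.byte(seq, 1, 3)
  local cp
  if a < 128 then
    cp = a                                   -- 0xxxxxxx
  elseif a < 224 then
    cp = (a - 192) * 64 + (b - 128)          -- 110xxxxx 10xxxxxx
  else
    -- 1110xxxx 10xxxxxx 10xxxxxx
    cp = (a - 224) * 4096 + (b - 128) * 64 + (c - 128)
  end
  return string.format("\\u%04x", cp)
end

print(escape_utf8_sketch("\226\128\168"))  -- U+2028 LINE SEPARATOR
```

It does no validation, so it should only be handed sequences that a pattern like the ones above has already matched.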


Re: Plea for the support of unicode escape sequences

Petite Abeille

On Jun 29, 2011, at 9:23 PM, David Kolf wrote:

> I guess the problem is that most tables only tell you the codepoint but
> not the UTF-8 encoding.

Get a better table ☺

http://www.fileformat.info/info/unicode/char/2192/index.htm
http://www.isthisthingon.org/unicode/index.phtml
http://www.utf8-chartable.de/
etc, etc, etc…

Some operating systems offer built-in lookups as well, e.g. Mac OS X, under Edit → Special Characters… provides the name, the Unicode code point and the UTF-8 encoding at your fingertips.

No point in inventing a new wheel for something so trivial.

On the other hand, if Lua provided full Unicode support across the board for all string operations, it would be as bloated as ICU… better to leave that choice to the language user in the form of a third-party library instead of pushing it on everyone through the language itself…

Just my 2¢ ☺

Re: Plea for the support of unicode escape sequences

Patrick Rapin
In reply to this post by Petite Abeille
>> What's the UTF-8 sequence for the unicorn symbol? ;-) :-D
>
> 0xE9 0xBA 0x9F obviously :P

Great, this is in fact true [1].
Well quite, since it is only for female unicorns...

[1] http://www.fileformat.info/info/unicode/char/9e9f/index.htm


Re: Plea for the support of unicode escape sequences

Lorenzo Donati-2
On 29/06/2011 22.17, Patrick Rapin wrote:
>>> What's the UTF-8 sequence for the unicorn symbol? ;-) :-D
>>
>> 0xE9 0xBA 0x9F obviously :P
>
> Great, this is in fact true [1].
> Well quite, since it is only for female unicorns...
>
> [1] http://www.fileformat.info/info/unicode/char/9e9f/index.htm

BTW, great source of info!

I haven't installed a full unicode font, so that sequence, put into a
text file, only showed a little box with 9e9f inside. I discovered by
googling that it was a CJK code point, but wasn't able to display it,
since every unicode table I found assumed I had a full unicode font.

So I inferred that it should have been a unicorn :-)

The link above shows you the glyph using an image, not relying on the
font installed on your machine.

Cool! Thanks! (Marked for future reference! :-)



-- Lorenzo




Re: Plea for the support of unicode escape sequences

Edgar Toernig
In reply to this post by David Heiko Kolf
David Kolf wrote:
>
> I wonder how compact you can store the character classes for the 65k
> codepoints in the BMP and the lowercase/uppercase pairs (for
> string.lower, string.upper).

There's already a UTF-8 version of the string library called slnunicode.
Iirc, it uses the tables from Tcl which are about 13k.  The whole library
is about 32k.

With it and the unicode escape sequences you could write e.g.

    unicode.utf8.gsub(s, "[\uf000-\uffff]", "?")

With Lua 5.1 that would be

    unicode.utf8.gsub(s, "[\239\128\128-\239\191\191]", "?")

Now, which one is cleaner?


> Maybe that can be compressed far enough to be included in official Lua
> (5.3?). That would be great.

I think that's not really necessary.  You need both versions anyway, the
simple byte-oriented variant to parse and match arbitrary bytes sequences
(incl. binary data) and the UTF-8 version for unicode character strings.
An external library would be good enough.  But you want the escape sequences
to make the external library a pleasure to use (see above).

Ciao, ET.
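
For comparison, a sketch of the same replacement without any external library, using only the standard byte-oriented string.gsub. It works only because U+F000..U+FFFF happens to line up with a single lead byte, and it assumes well-formed UTF-8 input; the function name is invented for the example.

```lua
-- U+F000..U+FFFF are exactly the three-byte sequences EF 80 80 .. EF BF BF,
-- so on well-formed UTF-8 a byte-oriented pattern can cover the range.
local pattern = "\239[\128-\191][\128-\191]"

local function mask_f000_ffff(s)
  -- The parentheses drop gsub's second result (the substitution count).
  return (string.gsub(s, pattern, "?"))
end

print(mask_f000_ffff("a\239\187\191b"))                  -- a?b (U+FEFF replaced)
print(mask_f000_ffff("\226\134\146") == "\226\134\146")  -- true (U+2192 untouched)
```

Most code point ranges do not align this neatly, which is where a UTF-8 aware library earns its keep.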


Re: Plea for the support of unicode escape sequences

Petite Abeille
In reply to this post by Lorenzo Donati-2

On Jun 29, 2011, at 10:52 PM, Lorenzo Donati wrote:

>> [1] http://www.fileformat.info/info/unicode/char/9e9f/index.htm
>
> BTW, great source of info!

You might like decodeunicode as well:

http://www.decodeunicode.org/en/u+f9f3/properties

If everything else fails, just print it out:

Basic Multilingual Plane (BMP) Map
http://shop.designinmainz.de/en/Poster/decodeunicode-Basic-Multilingual-Plane-BMP-Map

Re: Plea for the support of unicode escape sequences

Petite Abeille
In reply to this post by Edgar Toernig

On Jun 29, 2011, at 11:06 PM, Edgar Toernig wrote:

> But you want the escape sequences
> to make the external library a pleasure to use (see above).

Bah.

print( u'2192' )



'u' being a magic function to translate your unicode to any encoding you wish for.
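
One way such a 'u' could look, as a minimal sketch: it assumes the argument is a hexadecimal code point given as a string and that UTF-8 is the wanted encoding. Nothing here is an existing library function, and there is no check for surrogates or out-of-range values.

```lua
-- Hypothetical 'u': takes a hexadecimal code point as a string and
-- returns its UTF-8 encoding. Intended for U+0000..U+10FFFF.
local floor, char = math.floor, string.char

local function u(hex)
  local cp = assert(tonumber(hex, 16), "expected a hexadecimal code point")
  if cp < 0x80 then
    return char(cp)
  elseif cp < 0x800 then
    return char(192 + floor(cp / 64), 128 + cp % 64)
  elseif cp < 0x10000 then
    return char(224 + floor(cp / 4096),
                128 + floor(cp / 64) % 64,
                128 + cp % 64)
  else
    return char(240 + floor(cp / 262144),
                128 + floor(cp / 4096) % 64,
                128 + floor(cp / 64) % 64,
                128 + cp % 64)
  end
end

print(u'2192' == "\226\134\146")  -- true: E2 86 92
```

The cost compared with an escape sequence is that the conversion happens at run time on every call rather than once in the lexer.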

Re: Plea for the support of unicode escape sequences

David Given
In reply to this post by Lorenzo Donati-2
Lorenzo Donati wrote:
> On 29/06/2011 19.32, Louis Mamakos wrote:
[...]
>> I initially thought the word was "Unicorns".  That might be easier to
>> arrange and get consensus for.  Perhaps rainbows, too.
>
> What's the UTF-8 sequence for the unicorn symbol? ;-) :-D

I can't find one, but you might be able to combine 🐎 (U+1F40E HORSE)
with the right kind of accent. Also there's 🌈 (U+1F308 RAINBOW).

(A lot of people think the new Unicode 6 extensions are a load of 💩
(U+1F4A9 PILE OF POO).)

http://www.unicode.org/charts/PDF/U1F300.pdf

--
┌─── dg@cowlark.com ───── http://www.cowlark.com ─────
│ "I have always wished for my computer to be as easy to use as my
│ telephone; my wish has come true because I can no longer figure out
│ how to use my telephone." --- Bjarne Stroustrup



Re: Plea for the support of unicode escape sequences

KHMan
On 6/30/2011 11:34 AM, David Given wrote:

> Lorenzo Donati wrote:
>> On 29/06/2011 19.32, Louis Mamakos wrote:
> [...]
>>> I initially thought the word was "Unicorns".  That might be easier to
>>> arrange and get consensus for.  Perhaps rainbows, too.
>>
>> What's the UTF-8 sequence for the unicorn symbol? ;-) :-D
>
> I can't find one, but you might be able to combine 🐎 (U+1F40E HORSE)
> with the right kind of accent. Also there's 🌈 (U+1F308 RAINBOW).
>
> (A lot of people think the new Unicode 6 extensions are a load of 💩
> (U+1F4A9 PILE OF POO).)
>
> http://www.unicode.org/charts/PDF/U1F300.pdf

Oh wow, I didn't think to check when Perl 5.14 mentioned Unicode 6
support.

I think I just lost most of my respect for the Unicode
organization... :-) ;-)

--
Cheers,
Kein-Hong Man (esq.)
Kuala Lumpur, Malaysia


Re: Plea for the support of unicode escape sequences

David Given
KHMan wrote:
[...]
> I think I just lost most of my respect for the Unicode organization...
> :-) ;-)

ITYM "😃 😉" (U+1F603 SMILING FACE WITH OPEN MOUTH, U+1F609 WINKING
FACE)... unless you prefer the cat versions.

http://www.unicode.org/charts/PDF/U1F600.pdf

--
┌─── dg@cowlark.com ───── http://www.cowlark.com ─────
│ "I have always wished for my computer to be as easy to use as my
│ telephone; my wish has come true because I can no longer figure out
│ how to use my telephone." --- Bjarne Stroustrup



Re: Plea for the support of unicode escape sequences

Philippe Lhoste
In reply to this post by Patrick Rapin
On 29/06/2011 22:17, Patrick Rapin wrote:
>>> What's the UTF-8 sequence for the unicorn symbol? ;-) :-D
>>
>> 0xE9 0xBA 0x9F obviously :P
>
> Great, this is in fact true [1].
> Well quite, since it is only for female unicorns...

0xE5 0xBB 0x8C

http://www.fileformat.info/info/unicode/char/5ecc/index.htm

Good tool, indeed.
I prefer to use BabelMap [1] on my desktop, but it doesn't seem to have the definitions
themselves (only shows the Chinese names).

[1] http://www.babelstone.co.uk/Software/BabelMap.html

--
Philippe Lhoste
--  (near) Paris -- France
--  http://Phi.Lho.free.fr
--  --  --  --  --  --  --  --  --  --  --  --  --  --



Re: Plea for the support of unicode escape sequences

Matthew Frazier
In reply to this post by David Given
On 06/29/2011 11:52 PM, David Given wrote:

> KHMan wrote:
> [...]
>> I think I just lost most of my respect for the Unicode organization...
>> :-) ;-)
>
> ITYM "😃 😉" (U+1F603 SMILING FACE WITH OPEN MOUTH, U+1F609 WINKING
> FACE)... unless you prefer the cat versions.
>
> http://www.unicode.org/charts/PDF/U1F600.pdf
>

Why do cat smiley faces get in and not Tengwar or D'ni?

--
Regards, Matthew "LeafStorm" Frazier
http://leafstorm.us/


Re: Plea for the support of unicode escape sequences

Jim Whitehead II
On Thu, Jun 30, 2011 at 12:49 PM, Matthew Frazier
<[hidden email]> wrote:

> On 06/29/2011 11:52 PM, David Given wrote:
>>
>> KHMan wrote:
>> [...]
>>>
>>> I think I just lost most of my respect for the Unicode organization...
>>> :-) ;-)
>>
>> ITYM "😃 😉" (U+1F603 SMILING FACE WITH OPEN MOUTH, U+1F609 WINKING
>> FACE)... unless you prefer the cat versions.
>>
>> http://www.unicode.org/charts/PDF/U1F600.pdf
>>
>
> Why do cat smiley faces get in and not Tengwar or D'ni?

The official Unicode roadmap includes a code map for Tengwar:
http://en.wikipedia.org/wiki/Tengwar#Unicode

- Jim
