Plea for the support of unicode escape sequences


Re: Plea for the support of unicode escape sequences

David Given
Jim Whitehead II wrote:
[...]
> The official unicode roadmap includes a code map for Tengwar:
> http://en.wikipedia.org/wiki/Tengwar#Unicode

However, as I discovered the other day when I needed them for a
particularly esoteric program I was writing, there are no Malachim,
Celestial, Theban, or Transitus Fluvii scripts. There isn't even
Enochian. Unicode 6 *has* added a bunch of alchemical symbols, but
there's only so much you can do with those...

Wrenching the discussion at least back in the direction of being on
topic: I think that at some point they're going to have to lift the
0x110000 limit on the size of the Unicode code space. I'm pretty sure
that limit was only imposed to keep Java and Windows happy; they
standardised on UCS-2 way too soon, back when they thought 0x10000 code
points would be more than anyone would need, and as a result shot
themselves in the foot really badly. If you don't believe me, just go
look at surrogates, then check out the Java String API and the hideous
mess that is charAt() vs codePointAt()... and then try using
astral-plane code points in online services and see how many of them
actually work.

Which is why, of course, people should not be using UCS-2 or UTF-16 for
anything. In fact, I'd suggest not using UCS-4 either --- it encourages
shortcuts in handling Unicode that aren't actually valid, like assuming
you can split strings anywhere. UTF-8 FTW.
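
To make that concrete in Lua terms: cut a UTF-8 byte string at an
arbitrary byte index and you can land in the middle of a sequence, so a
correct split has to back up to a character boundary first. A rough
sketch, not production code:

    -- "éclair" as UTF-8 bytes: the é takes two bytes (0xC3 0xA9)
    local s = "\195\169clair"

    -- naive split: leaves a lone lead byte, which is not valid UTF-8
    print(s:sub(1, 1))

    -- back up over continuation bytes (0x80-0xBF) to the previous boundary
    local function safe_prefix(str, i)
      while i > 0 do
        local b = str:byte(i + 1)
        if not b or b < 128 or b > 191 then break end
        i = i - 1
      end
      return str:sub(1, i)
    end

    print(safe_prefix(s, 1))   --> "" (the cut fell inside the é)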

--
┌─── dg@cowlark.com ───── http://www.cowlark.com ─────
│ "I have always wished for my computer to be as easy to use as my
│ telephone; my wish has come true because I can no longer figure out
│ how to use my telephone." --- Bjarne Stroustrup



Re: Plea for the support of unicode escape sequences

Javier Guerra Giraldez
On Thu, Jun 30, 2011 at 7:20 AM, David Given <[hidden email]> wrote:
> Which is why, of course, people should not be using UCS-2 or UTF-16 for
> anything. In fact, I'd suggest not using UCS-4 either --- it encourages
> shortcuts in handling Unicode that aren't actually valid, like assuming
> you can split strings anywhere. UTF-8 FTW.

+1

I have some faint memory that Windows got, some time ago, a new set of
UTF-8 API functions that any application can use instead of the old
'widechar' ones. Is that true? (Of course, that wouldn't mean UCS-2 is
anywhere closer to the painful death it deserves.)

--
Javier


Re: Plea for the support of unicode escape sequences

steve donovan
On Thu, Jun 30, 2011 at 4:27 PM, Javier Guerra Giraldez
<[hidden email]> wrote:
> I have some faint memory that Windows got, some time ago, a new set of
> UTF-8 API functions that any application can use instead of the old
> 'widechar' ones.

There are things like '_mbslen' (mbs for 'multi-byte string'); UTF-8 is
definitely encouraged if you don't go wide.

> (Of course, that wouldn't mean UCS-2 is anywhere closer to the painful
> death it deserves.)

For desktop Windows at least, it's been UTF-16 since Windows 2K.

steve d.


Re: Plea for the support of unicode escape sequences

Tom N Harris
On 06/30/2011 10:37 AM, steve donovan wrote:
>
> There's things like '_mbslen' (mbs for 'multi-byte string'); UTF-8 is
> definitely encouraged if you don't go wide.
>

Unfortunately, Microsoft's _setmbcp does not support UTF-8 (nor UTF-7).
It's like they want to make things more difficult for us.

You could redefine Lua to use wchar_t in strings. And that's why I
mentioned escapes greater than 255. As more than one person pointed out,
I confused sizeof() with CHAR_BIT. Well, you could
"#define char int" but that would be cheating. Or not? I see a few
"sizeof(char)" in the Lua source. But it does assume that "char" has not
been redefined in other places. It does test against UCHAR_MAX but with
the assumption that it is less than 1000. (Only three decimal digits or
two hexadecimal.)
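
For what it's worth, the three-digit limit is easy to see from the Lua
side (a trivial sketch):

    print("\101")     --> e    (a \ddd escape reads at most three digits)
    print("\1000")    --> d0   (\100 is 'd', then a literal '0')
    -- print("\256")  -- rejected when the chunk is loaded: 256 > UCHAR_MAX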

--
- tom
[hidden email]


Re: Plea for the support of unicode escape sequences

Miles Bader-2
In reply to this post by David Given
David Given <[hidden email]> writes:
> Which is why, of course, people should not be using UCS-2 or UTF-16 for
> anything. In fact, I'd suggest not using UCS-4 either --- it encourages
> shortcuts in handling Unicode that aren't actually valid, like assuming
> you can split strings anywhere. UTF-8 FTW.

Totally agree, which is why the unicode string constant notation in
C++0x is a bit of a shame:  the most "obvious" notation, u"xxx", yields
a _UTF-16_ wide-character string constant; one must write u8"yyy" to get
a proper UTF-8 constant string.  This is particularly awkward given the
long historical emphasis on, and generally much better handling of,
char* strings in C/C++.

I imagine MS lobbied for this though, and nobody else cared enough to
resist...

Oh well.

-miles

--
"She looks like the wax version of herself."
              [Comment under a Paris Hilton fashion pic]


Re: Plea for the support of unicode escape sequences

"J.Jørgen von Bargen"
In reply to this post by Tom N Harris
On 01.07.2011 at 04:43, Tom N Harris wrote:
> "#define char int" but that would be cheating. Or not? I see a few
> "sizeof(char)" in the Lua source. But it does assume that "char" has
> not been redefined in other places. It does test against UCHAR_MAX but
> with the assumption that it is less than 1000. (Only three decimal
> digits or two hexadecimal.)
/Please/ don't. You'll end up where Perl is: reading a 1MB file into a
string consumes 4MB of memory :-/ . That was one of the major reasons I
switched to Lua. I'm quite happy with Lua the way it is: a small utf8
module and you can do whatever you like. Lua's "one char is one byte,
and Lua doesn't care whether the bytes are ASCII or UTF-8 or UTF-16BE
or UTF-16LE" is exactly what I want. In Perl, with every release you
have to struggle all over again to get Unicode data handled properly.
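
Even without a module, for instance, plain string patterns can walk the
characters of a UTF-8 byte string (a rough sketch; the pattern matches
one lead byte plus any continuation bytes):

    local s = "na\195\175ve"    -- "naïve" as UTF-8 bytes
    local chars = 0
    for ch in s:gmatch("[%z\1-\127\194-\244][\128-\191]*") do
      chars = chars + 1
    end
    print(chars, #s)            --> 5    6    (five characters, six bytes)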





Re: Plea for the support of unicode escape sequences

Tom N Harris
On 07/01/2011 01:04 AM, "J.Jørgen von Bargen" wrote:
> /Please/ don't. You'll end up where Perl is: reading a 1MB file into a
> string consumes 4MB of memory :-/ . That was one of the major reasons I
> switched to Lua. I'm quite happy with Lua the way it is: a small utf8
> module and you can do whatever you like. Lua's "one char is one byte,
> and Lua doesn't care whether the bytes are ASCII or UTF-8 or UTF-16BE
> or UTF-16LE" is exactly what I want. In Perl, with every release you
> have to struggle all over again to get Unicode data handled properly.
>

Well I was going to try it just for fun. But now I see that Linux
doesn't have wfopen, so that's the end of that.


--
- tom
[hidden email]


Re: Plea for the support of unicode escape sequences

Sean Conner
It was thus said that the Great Tom N Harris once stated:

> On 07/01/2011 01:04 AM, "J.Jørgen von Bargen" wrote:
> >[...]
> >
>
> Well I was going to try it just for fun. But now I see that Linux
> doesn't have wfopen, so that's the end of that.

  But Linux *does* have iconv (or at least, all the Linux distributions I've
seen), which handles not only UTF-8, but a whole slew of character sets (on
my home system, 961 different character sets).  Iconv isn't that hard to
use---it's only three functions.  
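
From Lua it's even less work; with a binding along the lines of the
lua-iconv module the whole conversion looks roughly like this (a
sketch --- it assumes the binding exposes iconv.new(to, from) and
cd:iconv(s); other bindings will differ):

    local iconv = require("iconv")

    -- converter from ISO-8859-1 to UTF-8
    local cd = assert(iconv.new("utf-8", "iso-8859-1"))

    -- convert a Latin-1 string; err is nil on success
    local out, err = cd:iconv("na\239ve")   -- "naïve" in Latin-1
    print(out)                              --> naïve, now UTF-8 encoded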

  In looking up wfopen() (never heard of it before), it appears that all it
does is open the file up in binary mode.  Can you use the regular
fgetc()/fputc() functions?  Or does it have corresponding "w" functions for
those?  

  -spc (Implemented a Lua iconv module, which wasn't that hard)





Re: Plea for the support of unicode escape sequences

David Given
Sean Conner wrote:
[...]
>   In looking up wfopen() (never heard of it before), it appears that all it
> does is open the file up in binary mode.

I think wfopen() is just fopen() with a 'wide character' (UTF-16) filename.

I believe the Unix way to do this is to set the appropriate locale and
then pass in a UTF-8 filename to fopen().
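
In Lua terms that means io.open() already does the right thing on a
Unix system: the filename is just a byte string that gets handed to
fopen() unchanged. A minimal sketch, assuming a UTF-8 locale and
filesystem:

    -- "résumé.txt", UTF-8 encoded; io.open passes the bytes to fopen()
    local name = "r\195\169sum\195\169.txt"
    local f = assert(io.open(name, "w"))
    f:write("hello\n")
    f:close()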

--
┌─── dg@cowlark.com ───── http://www.cowlark.com ─────
│ "I have always wished for my computer to be as easy to use as my
│ telephone; my wish has come true because I can no longer figure out
│ how to use my telephone." --- Bjarne Stroustrup



Re: Plea for the support of unicode escape sequences

Roberto Ierusalimschy
In reply to this post by Tom N Harris
> You could redefine Lua to use wchar_t in strings. And that's why I
> mentioned escapes greater than 255. As more than one person pointed
> out, I confused sizeof() with CHAR_BIT. Well, you could
> "#define char int" but that would be cheating. Or not? I see a few
> "sizeof(char)" in the Lua source. But it does assume that "char" has
> not been redefined in other places. It does test against UCHAR_MAX
> but with the assumption that it is less than 1000. (Only three
> decimal digits or two hexadecimal.)

The main goal of the "sizeof(char)" in the source is to serve as a form
of documentation. It is intended to ease the task of anyone who wants
to change 'char' to something else, though not with a #define char....
Several details still must be handled carefully.

-- Roberto


Re: Plea for the support of unicode escape sequences

Florian Weimer
In reply to this post by Miles Bader-2
* Miles Bader:

> Totally agree, which is why the unicode string constant notation in
> C++0x is a bit of a shame:  the most "obvious" notation, u"xxx", yields
> a _UTF-16_ wide-character string constant; one must write u8"yyy" to get
> a proper UTF-8 constant string.

Is it really UTF-16?  I would expect it to result in a wide string
constant whose characters are of type const wchar_t.

Pity that the committee pulled the C++ standard from public view.


Re: Plea for the support of unicode escape sequences

Miles Bader-2
Florian Weimer <[hidden email]> writes:
>> Totally agree, which is why the unicode string constant notation in
>> C++0x is a bit of a shame:  the most "obvious" notation, u"xxx", yields
>> a _UTF-16_ wide-character string constant; one must write u8"yyy" to get
>> a proper UTF-8 constant string.
>
> Is it really UTF-16?  I would expect it to result in a wide string
> constant whose characters are of type const wchar_t.

I don't have the standard to check, but my impression was that it yields
an object of type "char16_t*" or some such (and there's also U"..." to
get strings of type "char32_t*").  Yielding UTF-16 would be consistent
with what u8"..." does.

[and of course as the real _intent_ of this feature is probably to match
whatever Microsoft does, and wide-strings on their systems apparently
use UTF-16 these days, that suggests this feature would as well.... a
bit shoddy, but there you have it.]

-Miles

--
Egotist, n. A person of low taste, more interested in himself than in me.


Re: Plea for the support of unicode escape sequences

Klaus Ripke
In reply to this post by Edgar Toernig
hi


On 29.06.2011 at 23:06, Edgar Toernig <[hidden email]> wrote:
> There's already a UTF-8 version of the string library called slnunicode.
> Iirc, it uses the tables from Tcl which are about 13k.  The whole library
> is about 32k.
BTW, that covers only the BMP, but I guess there is not much need for a
fairy-tale character class or uppercased unicorns. A
gsub('<pile of poo>+', '<rainbow>') should still work, though.

>    unicode.utf8.gsub(s, "[\uf000-\uffff]", "?")

The magic u function is basically itself a utf8.gsub('%\(%x+)', utf8.char).
Note that it replaces \xxxx, not \uxxxx. You have to use \\ (or a [[ ]]
long string) to get a literal \, so it would be u"\\f000-\\ffff".
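
Spelled out in plain Lua it is little more than this (a 5.1-ish sketch,
assuming slnunicode's unicode.utf8 table is in scope as utf8):

    -- expand \xxxx hex escapes into UTF-8 encoded characters
    local function u(s)
      return (s:gsub("\\(%x%x%x%x)", function(hex)
        return utf8.char(tonumber(hex, 16))
      end))
    end

    print(u("[\\f000-\\ffff]"))   -- the class from above, as UTF-8 bytes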

Anyway, as I am struggling to lift slnunicode to the latest 5.2, I
think I may add such a function in C and, in particular, apply it to
all patterns automatically, since you should escape a literal \ as %\
in patterns anyway. Thoughts? I guess that, besides patterns, there is
not much need to specify characters by code values.


And BTW, Unicode is not 100% locale-independent even in the basic
features provided by string: upper('i') has to yield a dotted capital I
in the Turkish locale.

Collation is highly locale-dependent, but that's another story. I'll
probably just expose libc's strcoll/strxfrm.


best
Klaus

