Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8


Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Lorenzo Donati-3
On 07/07/2018 18:18, Axel Kittenberger wrote:

[snip]

> My guess is you are severely underestimating the complexity of unicode and
> think of it as ASCII with extras. It's not. There are so many (strange)
> features in unicode that, when used in general programming, would go haywire.
> Unicode is designed as a typesetting tool, not as a programming tool.
>
[snip]

This is spot-on: "Unicode is designed as a typesetting tool, not as a
programming tool."

I was going to post a message saying the exact same thing.

I remember when I was learning programming for the first time in my life
(self-taught, with the Italian reference manual of a second-hand TI99/4a
home computer): the manual stressed the importance of avoiding
visually ambiguous identifiers, because some characters could be
mistaken for one another (e.g. '1' for 'l', 'o' for 'O' or '0').

Unicode in programs outside strings would make bugs skyrocket: how many
code points are visually similar to, let's say, a zero '0' or a
lowercase hex 'x'? Not to mention that changing the font of your
editor will change the appearance of *all* those "zero-like" "characters".
Nightmarish! For no real benefit.


Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Albert Chan

> I remember when I was learning programming for the first time in my life (self-taught, with the Italian reference manual of a second-hand TI99/4a home computer): the manual stressed the importance of avoiding visually ambiguous identifiers, because some characters could be mistaken for one another (e.g. '1' for 'l', 'o' for 'O' or '0').
>
> -- Lorenzo

My bank once sent me a password that looked like a number.
After many unsuccessful logins (which then blocked any further attempts),
I had to call them to fix it.

Even they could not log in!

Finally I suggested just cutting and pasting it into the password field, and it worked!
The reason? The "number" was mixed *only* with the letters I and O!!!

Now I write the digit 0 with a slash through it, and 1 with an exaggerated slant.

And I never use O or I as a variable name (lowercase is OK).

BTW, non-breaking "fake" space in filename is a bad idea.

http://boston.conman.org/2018/02/28.2

Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Lorenzo Donati-3
On 09/07/2018 19:09, Albert Chan wrote:

>
>> I remember when I was learning programming for the first time in my life (self-taught, with the Italian reference manual of a second-hand TI99/4a home computer): the manual stressed the importance of avoiding visually ambiguous identifiers, because some characters could be mistaken for one another (e.g. '1' for 'l', 'o' for 'O' or '0').
>>
>> -- Lorenzo
>
> My bank once sent me a password that looked like a number.
> After many unsuccessful logins (which then blocked any further attempts),
> I had to call them to fix it.
>
> Even they could not log in!
>
> Finally I suggested just cutting and pasting it into the password field, and it worked!
> The reason? The "number" was mixed *only* with the letters I and O!!!
>
> Now I write the digit 0 with a slash through it, and 1 with an exaggerated slant.
>
> And I never use O or I as a variable name (lowercase is OK).
>

Yep!!

Young (and not so young) players in ICT departments still make the same
mistakes over and over again. :-(



> BTW, non-breaking "fake" space in filename is a bad idea.
>
> http://boston.conman.org/2018/02/28.2
>

I dare say even the "normal" space (0x20) is somewhat evil (I use underscores
almost religiously, even in names of files unrelated to software development).

Modern OSes are supposed to handle spaces well, but I always find some
obscure tool or program (more often a script) whose programmer forgot to
quote an argument somewhere in the code. And, almost magically, an
innocent file cannot be opened in some subtle corner case just because its
name contains spaces!
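
A minimal sketch of the quoting such scripts tend to forget, assuming Lua and a POSIX shell (Windows cmd.exe has different rules). `shell_quote` is a hypothetical helper name, not a standard function:

```lua
-- Hypothetical helper: quote a single argument for a POSIX shell.
local function shell_quote(arg)
    -- Wrap the argument in single quotes; an embedded ' becomes '\''
    -- (close the quote, emit an escaped quote, reopen the quote).
    local escaped = arg:gsub("'", "'\\''")
    return "'" .. escaped .. "'"
end

local name = "my invoice.pdf"
-- Unquoted, the shell would split this into "my" and "invoice.pdf".
print("cat " .. shell_quote(name))   -- cat 'my invoice.pdf'
```

The command string is only printed here; handing it to os.execute or io.popen is where the missing quotes would normally bite.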









Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Sean Conner
In reply to this post by Albert Chan
It was thus said that the Great Albert Chan once stated:
>
> BTW, non-breaking "fake" space in filename is a bad idea.
>
> http://boston.conman.org/2018/02/28.2

  What's bad about it?

  -spc (Or are you in the "no spaces in filename" camp?)



Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Gregg Reynolds-2


On Mon, Jul 9, 2018, 3:07 PM Sean Conner <[hidden email]> wrote:
> It was thus said that the Great Albert Chan once stated:
> >
> > BTW, non-breaking "fake" space in filename is a bad idea.
> >
> > http://boston.conman.org/2018/02/28.2
>
>   What's bad about it?

If you have to ask you'll never know.

>   -spc (Or are you in the "no spaces in filename" camp?)



Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Albert Chan
In reply to this post by Sean Conner

> It was thus said that the Great Albert Chan once stated:
>
>> BTW, non-breaking "fake" space in filename is a bad idea.
>>
>> http://boston.conman.org/2018/02/28.2
>
>  What's bad about it?
>
>  -spc (Or are you in the "no spaces in filename" camp?)

>> No, the output is not faked. Yes, the filename is hello world.c

The need for the warning is reason enough not to do this.

Since the program was fooled by the fake space trick, it
will not put quotes around the ambiguous filename.

Without the heads-up, who would know that hello world.c is one file?

For proof-of-concept, however, your idea is neat.














Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Sean Conner
It was thus said that the Great Albert Chan once stated:
>
> Since the program was fooled by the fake space trick, it
> will not put quotes around the ambiguous filename.
>
> Without the heads-up, who would know that hello world.c is one file?

  Well, perhaps a shading on the non-breaking space would work.  But that's
getting far outside Lua.

> For proof-of-concept, however, your idea is neat.

  Thank you, but not many who read it liked the idea.

  -spc



Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Lorenzo Donati-3
In reply to this post by Sean Conner
On 09/07/2018 22:06, Sean Conner wrote:

> It was thus said that the Great Albert Chan once stated:
>>
>> BTW, non-breaking "fake" space in filename is a bad idea.
>>
>> http://boston.conman.org/2018/02/28.2
>
>   What's bad about it?
>
>   -spc (Or are you in the "no spaces in filename" camp?)
>
>
>
Well, it's a nice and smart trick, but I'm in the camp of "If I see a
space, I want to know it's a space (0x20)".

Moreover it is not ASCII, and I tend to avoid non-ASCII names. I work
mainly on Windows, which doesn't support UTF-8 natively the way Linux
does. So handling characters outside the ASCII set may be a nightmare.

BTW, try handing such a file to someone (especially a non-programmer)
who is unaware of the trick, and see the puzzlement in his eyes when he
cannot understand why he cannot delete "my invoice.pdf" from the command
line, or why he sees "my invoice.pdf" and "my invoice.pdf" in the same
directory!!!

If you want to be very evil, put /two/ spaces between words, where the
first is ASCII 0x20 and the second is char 160!
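
One way to tell such look-alike names apart, sketched in Lua 5.3+ (the `\u{...}` escape needs 5.3 or later). `visible` is a hypothetical helper, and the sketch assumes the name is UTF-8 encoded:

```lua
-- Hypothetical helper: show every byte outside printable ASCII as \xHH,
-- so that names which render identically become distinguishable.
local function visible(name)
    return (name:gsub("[^\32-\126]", function(c)
        return string.format("\\x%02X", c:byte())
    end))
end

local plain = "my invoice.pdf"         -- ordinary space, 0x20
local nbsp  = "my\u{A0}invoice.pdf"    -- U+00A0, two bytes (C2 A0) in UTF-8

print(plain == nbsp)     -- false
print(visible(plain))    -- my invoice.pdf
print(visible(nbsp))     -- my\xC2\xA0invoice.pdf
```

Under a single-byte code page such as Latin-1, the same character would be the lone byte 0xA0 instead, which is the kind of mismatch that can trip up code-page-dependent tools.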

Moreover, I've the gut feeling that there are plenty of badly-written
Windows programs/scripts that will choke on char 160 when some code-page
is set differently than the way the programmer has assumed.

Yes, an underscore is not nice to look at, but at least I see exactly what
character it is (well, in ASCII at least. I'm sure there is some
obscure Unicode code point that is almost identical to an underscore and
that will appear identical in some font!)

Unicode is great for typesetting (I regularly use LaTeX and it's fun to
find almost every symbol you may imagine, even ancient German runic
scripts!), but it sucks (IMHO) for general programming or
computer-related stuff. Too much mental overhead to use correctly for
little gain.

<disclaimer>
I know my view is a bit "western-centric" (or "Latin-centric") and
people speaking languages that need thousands of symbols to be written
(especially Asian languages) might think differently.

Anyway, I'm curious to know how, say, Chinese programmers view the matter.
Would they find coding easier if they could write programs using
ideograms, or do they prefer to use transliterations of their words in the
Latin alphabet?

BTW, I code religiously "in English" (even comments), and I teach my
students to try to do so, but I understand that sometimes this requires
higher English skills than many programmers have.

</disclaimer>

On a related note: sometimes I've dreamt of having a universally
*standardized* "extended ASCII" charset for programming, without all the
human-language-related stuff of Unicode. A 16-bit "universal" charset
should be big enough to accommodate any symbol useful in programming
(e.g. common math operators and symbols, Greek letters, currency
symbols), but I'm digressing. :-)


  Cheers!

-- Lorenzo











Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Dirk Laurie-2
2018-07-10 15:30 GMT+02:00 Lorenzo Donati <[hidden email]>:

> Unicode is great for typesetting (I use regularly LaTeX and it's fun to find
> almost every symbol you may imagine, even ancient German runic scripts!),
> but it sucks (IMHO) for general programming or computer-related stuff. Too
> much mind overhead to use correctly for little gain.

Yes, yes, but - if you will allow me to return to Lua and UTF-8 - there would
be more gain for a programmer if we had (if it is not too late already for
Lua 5.4) utf8 versions of find, sub, match, gsub, gmatch, reverse. Just those,
not asking for upper/lower, operating only on simple codepoints, no combining
characters, no need for a C library.

utf8.find ("Hélène",'n')  --> 5 5
utf8.sub ("Hélène",5)   --> 'ne'
utf8.gsub ("Hélène","[éè]","e")  --> 'Helene' 2
utf8.reverse ("Hélène")   --> 'enèléH'
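
For what it is worth, the gsub case above can be approximated today with plain string.gsub and utf8.charpattern, assuming Lua 5.3+ and valid UTF-8 input. `utf8_map` is a hypothetical name, and it takes a lookup table instead of a "[éè]" class, because Lua patterns work on bytes and would treat a multi-byte character inside [...] as separate bytes:

```lua
-- Hypothetical helper, not part of the standard utf8 library.
-- Replaces whole UTF-8 characters (code points) via a lookup table.
local function utf8_map(s, map)
    local count = 0
    local result = s:gsub(utf8.charpattern, function(ch)
        local r = map[ch]
        if r then count = count + 1 end
        return r            -- nil keeps the original character
    end)
    return result, count
end

-- Escapes are used so the precomposed forms are unambiguous.
local map = { ["\u{E9}"] = "e", ["\u{E8}"] = "e" }
local out, n = utf8_map("H\u{E9}l\u{E8}ne", map)
print(out, n)   -- Helene and 2, separated by a tab
```

In line with the "no combining characters" restriction, a decomposed "e" followed by U+0301 is left untouched by this sketch.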


Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Sam Putman


On Tue, Jul 10, 2018 at 4:00 PM, Dirk Laurie <[hidden email]> wrote:
> 2018-07-10 15:30 GMT+02:00 Lorenzo Donati <[hidden email]>:
>
> utf8.gsub ("Hélène","[éè]","e")  --> 'Helene' 2


Would you expect this to work on both "Hélène" and "Hélène"?

I've separated out the acute combining mark for clarity.

Such a function might easily be larger than the rest of the string library combined.

Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Dirk Laurie-2
2018-07-10 16:51 GMT+02:00 Sam Putman <[hidden email]>:

>
>
> On Tue, Jul 10, 2018 at 4:00 PM, Dirk Laurie <[hidden email]> wrote:
>>
>> 2018-07-10 15:30 GMT+02:00 Lorenzo Donati <[hidden email]>:
>>
>> utf8.gsub ("Hélène","[éè]","e")  --> 'Helene' 2
>>
>
> Would you expect this to work on both "Hélène" and "Hélène"?
>
> I've separated out the acute combining mark for clarity.
>
> Such a function might easily be larger than the rest of the string library
> combined.

On Tue, Jul 10, 2018 at 4:00 PM, Dirk Laurie <[hidden email]> wrote:
| not asking for upper/lower,
| operating only on simple codepoints,
| no combining characters,
| no need for a C library.



Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Sam Putman


On Tue, Jul 10, 2018 at 5:00 PM, Dirk Laurie <[hidden email]> wrote:

>
> Would you expect this to work on both "Hélène" and "Hélène"?
>
> I've separated out the acute combining mark for clarity.
>
> Such a function might easily be larger than the rest of the string library
> combined.

> On Tue, Jul 10, 2018 at 4:00 PM, Dirk Laurie <[hidden email]> wrote:
> | not asking for upper/lower,
> | operating only on simple codepoints,
> | no combining characters,
> | no need for a C library.


Alright, that's clear enough.  Honestly, I had thought the 5.3 upgrade provided this
behavior.  I've been using [this library](https://github.com/starwing/luautf8); since LuaJIT is 
my daily driver, I had never had occasion to test this premise. 

Granted, it might be a good addition to the Lua standard library.  If you happen to need it,
well, you can have it right now. 

Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Javier Guerra Giraldez
In reply to this post by Dirk Laurie-2
On 10 July 2018 at 15:00, Dirk Laurie <[hidden email]> wrote:
> utf8.find ("Hélène",'n')  --> 5 5
> utf8.sub ("Hélène",5)   --> 'ne'
> utf8.gsub ("Hélène","[éè]","e")  --> 'Helene' 2
> utf8.reverse ("Hélène")   --> 'enèléH'


fully untested:

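-- Maps the byte positions returned by string.find to character positions.
-- Caveats: `init` is still a byte offset, and captures are not returned.
function utf8.find(s, ...)
    local a, b = s:find(...)
    if not a then return nil end    -- no match
    return utf8.len(s, 1, a), utf8.len(s, 1, b)
end

-- Assumes a and b name characters that exist in s.
function utf8.sub(s, a, b)
    b = b or -1
    local first = utf8.offset(s, a)
    -- end on the last byte of character b, not on its first byte
    local last = (b == -1) and #s or utf8.offset(s, b + 1) - 1
    return s:sub(first, last)
end

-- Reverses code points; a combining mark would end up before its base.
function utf8.reverse(s)
    local l, t = utf8.len(s), {}
    for _, c in utf8.codes(s) do
        t[l] = c
        l = l - 1
    end
    return utf8.char(table.unpack(t))
end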


--
Javier


Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Gregg Reynolds-2
In reply to this post by Dirk Laurie-2


On Tue, Jul 10, 2018, 9:00 AM Dirk Laurie <[hidden email]> wrote:
> 2018-07-10 15:30 GMT+02:00 Lorenzo Donati <[hidden email]>:
>
> > Unicode is great for typesetting (I use regularly LaTeX and it's fun to find
> > almost every symbol you may imagine, even ancient German runic scripts!),
> > but it sucks (IMHO) for general programming or computer-related stuff. Too
> > much mind overhead to use correctly for little gain.
>
> Yes, yes, but - if you will allow me to return to Lua and UTF-8 - there would
> be more gain for a programmer if we had (if it is not too late already
> for Lua 5.4)
> utf8 versions of find, sub, match, gsub, gmatch, reverse. Just those, not asking
> for upper/lower, operating only on simple codepoints, no combining characters,
> no need for a C library.

Utf8 != Unicode. It's an encoding; you don't get to pick a subset and still claim Unicode support.

"Simple codepoints"? Does Unicode define that? If not, who decides what that means? Zero-width space is pretty simple.

No combining chars? Ok, but that would not be Unicode. Practical result: massive confusion and complaining. You cannot accept Unicode and reject combining chars.



> utf8.find ("Hélène",'n')  --> 5 5
> utf8.sub ("Hélène",5)   --> 'ne'
> utf8.gsub ("Hélène","[éè]","e")  --> 'Helene' 2
> utf8.reverse ("Hélène")   --> 'enèléH'


Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Soni "They/Them" L.


On 2018-07-10 05:31 PM, Gregg Reynolds wrote:

[snip]

https://gist.github.com/SoniEx2/ecd119507f160d9c26e3eabd9e012dc0


Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Gregg Reynolds-2
Your point being?



Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Soni "They/Them" L.


On 2018-07-10 06:20 PM, Gregg Reynolds wrote:
> Your point being?

I mean, it's a joke, really, but if I were to actually redesign unicode,
I'd throw away all those annoying character tables and encode them as
part of the bits.

It would solve all practical problems with unicode. But we aren't gonna
have that, so we should instead stick with no unicode support for the
time being. At least until they finally decide that unicode was a huge
mistake and restart the whole thing.




Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Dirk Laurie-2
In reply to this post by Gregg Reynolds-2
2018-07-10 22:31 GMT+02:00 Gregg Reynolds <[hidden email]>:

>
>
>
> On Tue, Jul 10, 2018, 9:00 AM Dirk Laurie <[hidden email]> wrote:
>>
>> 2018-07-10 15:30 GMT+02:00 Lorenzo Donati <[hidden email]>:
>>
>> > Unicode is great for typesetting (I use regularly LaTeX and it's fun to find
>> > almost every symbol you may imagine, even ancient German runic scripts!),
>> > but it sucks (IMHO) for general programming or computer-related stuff. Too
>> > much mind overhead to use correctly for little gain.
>>
>> Yes, yes, but — if you will allow me to return to Lua and UTF-8 — there would
>> be more gain for a programmer if we had (if it is not too late already
>> for Lua 5.4)
>> utf8 versions of find, sub, match, gsub, gmatch, reverse. Just those, not asking
>> for upper/lower, operating only on simple codepoints, no combining characters,
>> no need for a C library.
>
>
> Utf8 != Unicode. It's an encoding; you don't get to pick a subset and still claim Unicode support.
>
> "Simple codepoints"? Does Unicode define that? If not, who decides what that means? Zero-width space is pretty simple.
>
> No combining chars? Ok, but that would not be Unicode. Practical result: massive confusion and complaining. You cannot accept Unicode and reject combining chars.

I. Am. Not. Asking. For. Unicode.

I am merely asking for extra functions along the lines of what the
utf8 library already does.
E.g. Sam's examples:

> s1 = "Hélène"
> s2 = "Hélène"
> utf8.len(s1)
6
> utf8.len(s2)
7
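
To make the two strings unambiguous regardless of how a mail client renders them, here is a sketch with explicit escapes (Lua 5.3+). It assumes s1 uses the precomposed U+00E9 and s2 uses "e" followed by the combining acute accent U+0301, which is consistent with the lengths shown above:

```lua
local s1 = "H\u{E9}l\u{E8}ne"      -- precomposed e-acute (U+00E9)
local s2 = "He\u{301}l\u{E8}ne"    -- "e" + combining acute accent (U+0301)

-- utf8.len counts code points, # counts bytes.
print(utf8.len(s1), #s1)   -- 6 and 8
print(utf8.len(s2), #s2)   -- 7 and 9

-- List the code points of s2; the combining mark is its own entry.
for _, cp in utf8.codes(s2) do
    io.write(string.format("U+%04X ", cp))
end
print()
-- U+0048 U+0065 U+0301 U+006C U+00E8 U+006E U+0065
```

Both strings display as the same six glyphs, but a codepoint-level function sees seven items in s2.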

If you really do not understand what I mean, I can elaborate.


Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Gregg Reynolds-2
In reply to this post by Soni "They/Them" L.


On Tue, Jul 10, 2018, 4:28 PM Soni "They/Them" L. <[hidden email]> wrote:


> On 2018-07-10 06:20 PM, Gregg Reynolds wrote:
> > Your point being?
>
> I mean, it's a joke, really, but if I were to actually redesign unicode,
> I'd throw away all those annoying character tables and encode them as
> part of the bits.
>
> It would solve all practical problems with unicode. But we aren't gonna
> have that, so we should instead stick with no unicode support for the
> time being. At least until they finally decide that unicode was a huge
> mistake and restart the whole thing.

Well, I'll give the Unicode folks credit for playing a bad hand about as well as could be expected. The RTL stuff is an utter monstrosity, but they did not really have the option of fixing it; they had to be compatible with stuff that was already broken (e.g. numbers in LTR scripts).




Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Gregg Reynolds-2
In reply to this post by Dirk Laurie-2


On Tue, Jul 10, 2018, 4:44 PM Dirk Laurie <[hidden email]> wrote:
...

> I. Am. Not. Asking. For. Unicode.
>
> I am merely asking for extra functions along the lines of what the
> utf8 library already does.
> E.g. Sam's examples:
>
> > s1 = "Hélène"
> > s2 = "Hélène"

FYI these look identical on Android.

> > utf8.len(s1)
> 6
> > utf8.len(s2)
> 7
>
> If you really do not understand what I mean, I can elaborate.

Please do.

What does "len" mean? Number of Unicode chars or number of bytes?
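
For reference, a small sketch of what the Lua 5.3 manual documents: `#s` counts bytes, while `utf8.len` counts UTF-8 encoded characters (code points) and, for input that is not valid UTF-8, returns a false value plus the position of the first invalid byte. The helper name `lengths` is made up for illustration:

```lua
-- Hypothetical helper: byte length, code point count, and (if the
-- string is not valid UTF-8) the position of the first invalid byte.
local function lengths(s)
    local chars, bad_pos = utf8.len(s)
    return #s, chars, bad_pos
end

print(lengths("H\u{E9}l\u{E8}ne"))   -- 8 bytes, 6 code points, nil
print(lengths("\xE9"))               -- 1 byte, nil, invalid byte at 1
```

The second call passes a lone 0xE9 byte, which is "é" in Latin-1 but not a complete UTF-8 sequence. Neither number counts user-perceived characters: a base letter plus a combining mark is two code points.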