Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8


Alysson Cunha
Issue #1) ---- Character 160
Lua 5.3 does not recognize character 160 / 0xA0 (https://en.wikipedia.org/wiki/Non-breaking_space) as whitespace inside source code.

When pasting text from some browsers or text editors, the following text ends up in my code:
"function(stream, contentType)"

The space that separates "stream," and "contentType" is character 160, not character 32. Lua does not accept this character as valid whitespace when the content is loaded through the "lua_load" function.

lua_load fails with the following message: :4: <name> or '...' expected near '<\160>'

Issue #2) ----- UTF-8
The same character #160 when encoded as UTF-8 becomes the 2 bytes 0xC2 0xA0.
The 0xC2 byte, read as ISO 8859-1 (Latin-1), is the character "Â". What?

In my app, I strongly advise users to encode their .lua files as UTF-8, because all of my system functions expect UTF-8-encoded strings as parameters.

UTF-8 is a growing trend for internationalization, and lua_load should have a parameter that forces the engine to handle the Lua script content as UTF-8 encoded. Another solution is to create a lua_loadutf8 function, or... lua_loadutf16 (since UTF-16 is used as a standard in many programming languages).

--
Alysson Cunha / AlyssonRPG
http://www.rrpg.com.br - Jogue o tradicional RPG de mesa online

Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Roberto Ierusalimschy
> Issue #1) ---- Character 160
> Lua 5.3 is not recognizing the character 160 / 0xA0 (
> https://en.wikipedia.org/wiki/Non-breaking_space) as space inside code as
> space.

Like several other languages such as C and Java, Lua recognizes as space
inside source code only the standard ASCII spaces.
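
A practical consequence is that a pasted non-breaking space has to be found and fixed before the chunk reaches lua_load. The following is a minimal sketch in Python, not part of Lua or its API; the helper name find_nbsp and the sample snippet are made up for illustration.

```python
def find_nbsp(source):
    """Return 1-based (line, column) pairs for every U+00A0 in source."""
    hits = []
    for line_no, line in enumerate(source.split("\n"), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ch == "\u00a0":
                hits.append((line_no, col_no))
    return hits


# Hypothetical pasted snippet with a non-breaking space after "stream,".
pasted = "return function(stream,\u00a0contentType)\nend\n"
for line_no, col_no in find_nbsp(pasted):
    print(f"line {line_no}, column {col_no}: U+00A0 (non-breaking space)")
```

Running a check like this on user-supplied scripts would let an application report the offending position itself instead of surfacing the lexer's "near '<\160>'" message.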


> Issue #2) ----- UTF-8
> The same character #160 when encoded as UTF-8 becomes the 2 bytes 0xC2 0xA0.
> The 0xC2 character in ISO 8859-1 (Latin-1) codification is the character "Â".
> What?

These are facts of life, derived directly from the meanings of UTF-8 and
ISO 8859-1.


> In my app, I strongly advise the users to encode their .lua file as UTF-8
> because all of my system function expects utf-8 coding as string parameter

There is no problem at all in encoding .lua files as UTF-8. However,
identifiers, numerals, and whitespace are restricted to the "standard"
stuff (Latin letters, Indo-Arabic numerals, and ASCII whitespace).

-- Roberto


Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Hugo Musso Gualandi
In reply to this post by Alysson Cunha
On Sat, 2018-07-07 at 09:44 -0300, Alysson Cunha wrote:
> Issue #1) ---- Character 160
> Lua 5.3 is not recognizing the character 160 / 0xA0
> (https://en.wikipedia.org/wiki/Non-breaking_space) as space inside
> code as space.

The slippery slope of Unicode-supporting language syntax is that once
you allow some non-ASCII characters there is a temptation to allow all
of them (for instance there are many other whitespace characters[1] you
did not mention). This can be confusing (there are different characters
that look similar to each other) and also problematic to implement (the
large character-class tables would bloat the interpreter).

[1] https://en.wikipedia.org/wiki/Whitespace_character

> When pasting a text from some browsers/text editors, the following
> text come to my code:
> "function(stream, contentType)"
>
> The space that separates "stream," and "contentType" is a Character
> 160, not Character 32.

This sounds like an issue with the browser / text editor.

> Issue #2) ----- UTF-8
> The same character #160 when encoded as UTF-8 becomes the 2 bytes
> 0xC2 0xA0.
> The 0xC2 character in ISO 8859-1 (Latin-1) codification is the
> character "Â". What?

That is just how UTF-8 and Latin-1 work. UTF-8 text can look all messed
up and full of "é" and so on if your misconfigured software mistakenly
tries to interpret it as Latin-1. (Like the inf.puc-rio.br web servers
do all the time. Aaargh!)

Make sure that you configure things to display text as UTF-8. For
example, if you are making a web page, make sure you add a <meta
charset="utf-8"> near the top of the HTML.
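
The effect described here can be reproduced in a few lines. This is only an illustrative Python sketch; the helper name misread_as_latin1 is made up, and the non-ASCII characters are written as escapes so the snippet does not depend on the editor's encoding.

```python
def misread_as_latin1(text):
    """Encode text as UTF-8, then decode the bytes as if they were Latin-1."""
    return text.encode("utf-8").decode("latin-1")


# U+00A0 (non-breaking space) becomes the two bytes C2 A0 in UTF-8.
nbsp_bytes = "\u00a0".encode("utf-8")
print(nbsp_bytes.hex())

# Read as Latin-1, those bytes are U+00C2 (A with circumflex) plus U+00A0.
print([f"U+{ord(ch):04X}" for ch in misread_as_latin1("\u00a0")])

# U+00E9 (e with acute) turns into U+00C3 U+00A9, the familiar mojibake pair.
print([f"U+{ord(ch):04X}" for ch in misread_as_latin1("\u00e9")])
```

Nothing is lost in the bytes themselves; only the interpretation is wrong, which is why fixing the declared encoding is enough.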

> In my app, I strongly advise the users to encode their .lua file as
> UTF-8 because all of my system function expects utf-8 coding as
> string parameter

Current versions of Lua are perfectly content with non-ASCII
UTF-8-encoded characters, as long as they appear only inside strings or
comments. Non-ASCII characters in other parts of the program result in
syntax errors, as you found out.

> UTF-8 is a growing trend for internationalization, and lua_load
> should have a parameter that force the engines handle the lua script
> content as utf-8 encoded. Another sollution is to create lua_loadutf8
> function.

This would effectively fork Lua into two versions of the language -- an
ASCII-only one and a fully Unicode-aware one. I'm not sure the
compatibility headache from that would be worth the hassle. IMO, if Lua
is ever to allow Unicode syntax it should be part of the default
language and not require a separate "load" function.

> or... lua_loadutf16 (since UTF-16 is used as standard in many
> programming languages)

UTF-16 is an awful standard, and is inferior to UTF-8 in pretty much
every way. Unfortunately we will need to live with it for a long time
due to how it is entrenched in Windows, Java and JavaScript.



Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Alysson Cunha
I am raising the question: should the future Lua 5.5 have Unicode support? A lot (a lot, a lot) of encoding issues have been solved by Unicode, and we are observing an international trend toward UTF-8.

In my opinion, Unicode is the future (actually, Unicode has already been the present for years), and ASCII was developed in the 1960s. Today, it is an old and very limited character encoding.

I would love to see LUA keep up to date.


Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Alysson Cunha
In reply to this post by Hugo Musso Gualandi
>UTF-16 is an awful standard, and is inferior to UTF-8 in pretty much
>every way. Unfortunately we will need to live with it for a long time
>due to how it is entrenched in Windows, Java and Javascript.

I agree... Surrogate pairs... Ewwww


Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Matthew Wild
In reply to this post by Alysson Cunha
On 7 July 2018 at 15:54, Alysson Cunha <[hidden email]> wrote:
> I am raising the question: Should the future Lua 5.5 have unicode support?
> Since a lot (a lot a lot) encoding issues were solved with unicode and we
> are observing an international trend for the utf-8 use....

I think you're missing the points that have been raised.

I work with Lua and Unicode every day. My files are UTF-8 encoded.

However the correct space character is 0x20 (32). Other space
characters are not valid according to the language syntax, and
allowing others would be a horrible mess. For example would you also
include zero-width spaces?

Just because there are other space characters available does not mean
Lua needs to accept them. It's similar to making an argument that Lua
should accept 'lõçál' as equivalent to 'local'. The characters are not
the same, even if they are related.

Regards,
Matthew


Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Alysson Cunha
>>> However the correct space character is 0x20 (32).

This is what I am saying. Who said that 0x20 is the correct space character? Answer: ASCII.

But in Unicode we have more than one "correct space character", because it is Unicode, not ASCII. So the current LUA version does not support Unicode characters.

By miracle, if you do not use the "wrong" Unicode characters, LUA accepts it, because UNICODE was made to be backward compatible with ASCII up to a point.

Note: using the public Unicode Character Database, it's easy to handle all of Unicode's whitespace characters.


Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Ką Mykolas
What's LUA anyways? o_0


Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Axel Kittenberger
In reply to this post by Alysson Cunha
Alysson, are you talking about UTF-8 in string literals? Well, fine; I believe Lua already supports that.

Any UTF-8 in variable names, any UTF-8 encoding valid throughout the whole program? No, please not.

My guess is that you are severely underestimating the complexity of Unicode and think of it as ASCII with extras. It's not. Unicode has many (strange) features that would go haywire when used in general programming. Unicode is designed as a typesetting tool, not as a programming tool.

For example, there are hair spaces: spaces that are barely visible. They would make understanding or debugging a program horrible. In typesetting they are there to allow line breaks while keeping letters compact when no break occurs.

Then there are right-to-left control sequences, for right-to-left languages.

Then there are homoglyphs: This would all be different variables: A Α А Ꭺ ᗅ ᴀ ꓮ A 𐊠 𝐀 𝐴

Then there are combining modifier characters.

And I guess there are more features I haven't yet had the joy of encountering.
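
The homoglyph and combining-character points can be checked with Python's standard unicodedata module. This is just an illustrative sketch: the three letters below are the Latin, Greek and Cyrillic capitals that look like "A", written as escapes, and the variable names are arbitrary.

```python
import unicodedata

# Three characters that usually render identically but are distinct code points.
lookalikes = ["\u0041", "\u0391", "\u0410"]
for ch in lookalikes:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
print(len(set(lookalikes)))  # number of distinct characters among them

# Combining characters: "e" + COMBINING ACUTE ACCENT versus precomposed U+00E9.
decomposed = "e\u0301"
precomposed = "\u00e9"
print(decomposed == precomposed)
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

A lexer that accepted such characters in identifiers would have to decide whether to compare names byte for byte or after normalization, which is part of the complexity being described.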


Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Hugo Musso Gualandi
In reply to this post by Alysson Cunha
> By miracle, if you do not use the "wrong" unicode characters, LUA
> accept it, because UNICODE was made to be backward compatible with
> ASCII till some point

To be pedantic, the backwards compatibility is because of the utf-8
encoding, not because of Unicode. And that was on purpose, not by
miracle :)

> Note: Using the public unicode character database it's easy to handle
> all white space characters of unicode.

A full unicode character database takes multiple megabytes[1]. That is
dozens of times larger than the whole Lua interpreter is right now.

You would need to trim down the database, which would mean either a
restrictive "whitelist" of allowed characters (for example, different
whitespace is allowed but not Chinese characters) or an overly
permissive system (for example, all characters are allowed in
identifiers, including non-alphabetical ones). I'm not sure either of
these is better than the ASCII status quo.

[1] http://apps.icu-project.org/datacustom/
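
For a sense of scale, a whitespace-only whitelist is small. The Python sketch below hardcodes the code points that carry the White_Space property in recent Unicode versions; the table is an assumption that should be checked against PropList.txt for the version in use, and the names WHITE_SPACE_RANGES and is_white_space are made up for illustration.

```python
# Inclusive code point ranges with the White_Space property
# (assumed from recent Unicode versions; verify against PropList.txt).
WHITE_SPACE_RANGES = [
    (0x0009, 0x000D),  # tab, line feed, vertical tab, form feed, carriage return
    (0x0020, 0x0020),  # space
    (0x0085, 0x0085),  # next line
    (0x00A0, 0x00A0),  # no-break space
    (0x1680, 0x1680),  # ogham space mark
    (0x2000, 0x200A),  # en quad through hair space
    (0x2028, 0x2029),  # line separator, paragraph separator
    (0x202F, 0x202F),  # narrow no-break space
    (0x205F, 0x205F),  # medium mathematical space
    (0x3000, 0x3000),  # ideographic space
]


def is_white_space(code_point):
    """True if code_point falls in one of the ranges above."""
    return any(lo <= code_point <= hi for lo, hi in WHITE_SPACE_RANGES)


total = sum(hi - lo + 1 for lo, hi in WHITE_SPACE_RANGES)
print(total)
print(is_white_space(0x00A0), is_white_space(0x200B))
```

Note that the zero-width space U+200B is not in this set, so even a "whitespace only" policy still leaves invisible characters that the lexer would reject.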


Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Alysson Cunha
> To be pedantic, the backwards compatibility is because of the utf-8
> encoding, not because of Unicode. And that was on purpose, not by
> miracle :)

The first 128 Unicode code points (the unsigned numbers that map to known characters), i.e. the 7-bit range, were also made to be ASCII compatible, not just the UTF-8 encoding pattern.

> A full unicode character database takes multiple megabytes[1]. That is
> dozens of times larger than the whole Lua interpreter is right now.

That's right, and I agree with you. We would not need the full Unicode data. Initially, just the Unicode whitespace characters should be supported, because whitespace is part of the Lua language but is represented slightly differently in Unicode.




Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Francisco Olarte
In reply to this post by Alysson Cunha
On Sat, Jul 7, 2018 at 5:11 PM, Alysson Cunha <[hidden email]> wrote:
>>>> However the correct space character is 0x20 (32).

> This is what I am telling.. What? Who said that 0x20 is the correct space
> character? Answer: ASCII

The Lua designers decide which characters are correct spaces. They
(correctly, IMO) decided that non-breaking space is not one of them.

> But in Unicode, we have more than 1 "correct space character", because it is
> Unicode, not ASCII... So, current LUA version does not support unicode
> characters.

If your definition of "correct space char" is as narrow as "being
usable as a space separator in Lua", you have more than one. I think at
least tabs work too.

And, also, 0xA0 is not new Unicode stuff. It's present in Latin-1
and many other ISO 8859 (1 byte per char) encodings.

It works in Unicode because the first 0x100 code points are the same
as Latin-1.

> By miracle, if you do not use the "wrong" unicode characters, LUA accept it,
> because UNICODE was made to be backward compatible with ASCII till some
> point

With Latin-1. Also, UTF-8 (which is a byte encoding) encodes
the first 0x80 chars the same as ASCII (but it does not encode the
second half of Latin-1 the same as the usual 1-byte Latin-1 encoding).
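
The relationship between the three encodings can be seen by encoding a few code points both ways. This is a small Python sketch for illustration only; the helper name encodings is made up.

```python
def encodings(code_point):
    """Return (latin-1 hex, utf-8 hex) for a code point below 0x100."""
    ch = chr(code_point)
    return ch.encode("latin-1").hex(), ch.encode("utf-8").hex()


# Pure ASCII text has the same bytes in ASCII, Latin-1 and UTF-8.
sample = "local x = 1"
print(sample.encode("ascii") == sample.encode("latin-1") == sample.encode("utf-8"))

# Below 0x80 the encodings agree; from 0x80 to 0xFF UTF-8 needs two bytes.
for cp in (0x41, 0x7F, 0xA0, 0xE9):
    latin1_hex, utf8_hex = encodings(cp)
    print(f"U+{cp:04X} latin-1={latin1_hex} utf-8={utf8_hex}")
```

So the code point numbers match Latin-1 up to 0xFF, while the byte sequences match only up to 0x7F.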

> Note: Using the public unicode character database it's easy to handle all
> white space characters of unicode.

Yeah, and using said database is quite difficult. Have you ever tried
to implement something with it (not using a lib which does it, but
implementing the lib)?

And, anyway, it seems nobody has problems with 0xA0 not being defined
as a space in lua ( either when using utf-8, latin-* or win1252 or
others ).

Francisco Olarte.


Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Alysson Cunha
> Yeah, and using said database is quite difficult. Have you eer tried
> to implement something with it ( not using a lib which does it,
> implement the lib ).

Yes. I use Delphi, and I wrote my own Unicode library for Delphi that uses the Unicode data.


Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Luiz Henrique de Figueiredo
In reply to this post by Francisco Olarte
> And, anyway, it seems nobody has problems with 0xA0 not being defined
> as a space in lua ( either when using utf-8, latin-* or win1252 or
> others ).

And this is trivial to implement: just edit lctype.c.


Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Alysson Cunha
 >  And this is trivial to implement: just edit lctype.c.

I don't think it is as trivial as you expect.

The Unicode code point 0xA0 encoded as UTF-8 becomes the 2 bytes 0xC2 0xA0.
In my suggestion, Lua should handle those 2 bytes the way UTF-8 does, as just one code point/character.
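
The difference between the two cases can be sketched as follows. A per-byte classification can mark the single byte 0xA0 as a space, which covers Latin-1 input, whereas UTF-8 input requires the scanner to recognize a two-byte sequence. The Python below is only a model of that scanning step, not Lua's lexer; skip_whitespace and the constants are hypothetical names.

```python
ASCII_SPACE = b" \t\n\v\f\r"  # single-byte whitespace
NBSP_UTF8 = b"\xc2\xa0"       # U+00A0 encoded as UTF-8


def skip_whitespace(buf, pos):
    """Return the index of the first non-whitespace byte at or after pos.

    Treats ASCII whitespace and the UTF-8 sequence C2 A0 as whitespace.
    A lone 0xA0 byte (Latin-1 style) is deliberately not skipped.
    """
    while pos < len(buf):
        if buf[pos] in ASCII_SPACE:
            pos += 1
        elif buf.startswith(NBSP_UTF8, pos):
            pos += 2
        else:
            break
    return pos


src = "stream,\u00a0contentType".encode("utf-8")
start = skip_whitespace(src, len(b"stream,"))
print(src[start:])
```

The sketch handles exactly one extra character; supporting every Unicode space would mean more multi-byte patterns or a decoding step in front of the scanner.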


Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Francisco Olarte
In reply to this post by Luiz Henrique de Figueiredo
On Sat, Jul 7, 2018 at 7:03 PM, Luiz Henrique de Figueiredo
<[hidden email]> wrote:
>> And, anyway, it seems nobody has problems with 0xA0 not being defined
>> as a space in lua ( either when using utf-8, latin-* or win1252 or
>> others ).
> And this is trivial to implement: just edit lctype.c.

The horror! Now the cat is out of the bag and people will begin to
patch the interp., post code in latin-1 and publish libraries as ms
word files!

I miss punch cards: none of these problems.

Just kidding. FOS.


Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Dirk Laurie-2
2018-07-07 19:20 GMT+02:00 Francisco Olarte <[hidden email]>:

> On Sat, Jul 7, 2018 at 7:03 PM, Luiz Henrique de Figueiredo
> <[hidden email]> wrote:
>>> And, anyway, it seems nobody has problems with 0xA0 not being defined
>>> as a space in lua ( either when using utf-8, latin-* or win1252 or
>>> others ).
>> And this is trivial to implement: just edit lctype.c.
>
> The horror! Now the cat is out of the bag and people will begin to
> patch the interpreter, post code in Latin-1 and publish libraries as MS
> Word files!
>
> I miss punch cards: none of these problems.
>
> Just kidding. FOS.

But seriously, I have for years been using a patched Lua that
accepts single UTF-8 characters as variable names so that
I can have functions named ⍴, ⍳, ⌹, ⍉ etc, without insisting
that PUC Lua should also do that.


Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Gregg Reynolds-2
In reply to this post by Alysson Cunha


On Sat, Jul 7, 2018, 9:55 AM Alysson Cunha <[hidden email]> wrote:
> I am raising the question: should the future Lua 5.5 have Unicode support? A great many encoding issues were solved by Unicode, and we are observing an international trend toward UTF-8.
>
> In my opinion, Unicode is the future (actually, it has already been the present for years), and ASCII was developed in the 1960s. Today it is an old and very limited character encoding.
>
> I would love to see Lua keep up to date.

Unicode includes a great many code points whose only justification is either backwards compatibility (see, e.g., the Arabic presentation forms) or typesetting (non-breaking space, em space, etc.). IOW it is a hack, unavoidably.

It would be nice to support identifiers in multiple languages, but that would need only a subset of Unicode anyway. And the typesetting codes would never be needed for a programming language. You could support them, but only at a cost, and it makes more sense to put the burden on the programmer to normalize all space chars to one.
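
One possible reading of "normalize all space chars to one" is a preprocessing pass over the source text before it reaches the interpreter. The sketch below is in Python purely for illustration; the function name `normalize_spaces` is hypothetical, and it assumes the source has already been decoded to text. Note that it also rewrites spaces inside string literals and comments, which a real tool would have to decide about.

```python
import unicodedata


def normalize_spaces(source):
    """Replace every Unicode space separator (general category Zs) with U+0020."""
    return "".join(
        " " if unicodedata.category(ch) == "Zs" else ch
        for ch in source
    )


# U+00A0 is a no-break space, U+2003 is an em space; both are category Zs.
pasted = "function(stream,\u00a0contentType)\u2003return stream end"
cleaned = normalize_spaces(pasted)
print(cleaned)
```

Characters outside category Zs, such as the zero-width space (a format character), are left untouched by this sketch.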

G

Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Gregg Reynolds-2


On Sat, Jul 7, 2018, 2:16 PM Gregg Reynolds <[hidden email]> wrote:
> On Sat, Jul 7, 2018, 9:55 AM Alysson Cunha <[hidden email]> wrote:
>> I am raising the question: should the future Lua 5.5 have Unicode support? A great many encoding issues were solved by Unicode, and we are observing an international trend toward UTF-8.
>>
>> In my opinion, Unicode is the future (actually, it has already been the present for years), and ASCII was developed in the 1960s. Today it is an old and very limited character encoding.
>>
>> I would love to see Lua keep up to date.
>
> Unicode includes a great many code points whose only justification is either backwards compatibility (see, e.g., the Arabic presentation forms) or typesetting (non-breaking space, em space, etc.).

Not to mention support for scholarship. Does Lua really need to support identifiers in cuneiform or Egyptian hieroglyphs? Those code points are only there to support publishing, not programming.

> IOW it is a hack, unavoidably.
>
> It would be nice to support identifiers in multiple languages, but that would need only a subset of Unicode anyway. And the typesetting codes would never be needed for a programming language. You could support them, but only at a cost, and it makes more sense to put the burden on the programmer to normalize all space chars to one.

G

Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

Sean Conner
In reply to this post by Alysson Cunha
It was thus said that the Great Alysson Cunha once stated:
>  >>> However the correct space character is 0x20 (32).
>
> This is what I am saying. What? Who said that 0x20 is the correct space
> character? Answer: ASCII
>
> But in Unicode, we have more than one "correct space character", because it
> is Unicode, not ASCII. So the current Lua version does not support Unicode
> characters.

  There is more than one space character, but some would choose to
interpret each by its intent, which is not necessarily the same as ASCII SP.

  ASCII defined the following characters:

        FS 0x1C File Separator
        GS 0x1D Group Separator
        RS 0x1E Record Separator
        US 0x1F Unit Separator
        SP 0x20 Space

  The placement of Space isn't accidental: the creators of ASCII were very
deliberate in their choices, and Space is neither a control character nor a
graphic character, but it can act as either one.  One can choose to treat
Space as another separator character with a size smaller than a "unit".  Or
it can be treated as a (non-visible) graphic character.  So breaking on a
space is a valid interpretation here.

  In Unicode (and some other character sets) there is the concept of a
"non-breaking space".  The semantic meaning of this is "this is a non-graphic
character that is part of the sequence of characters before and after it" and
thus, no splitting should be allowed.

  I was able to use such semantics for filenames [1].  I use the command
line almost exclusively, and spaces (0x20) in filenames have always been
problematic because of the way the shell tokenizes input (breaks on Space).
But by replacing Space with Non-breaking-Space, everything just worked, even
filename completion.  
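
  A rough illustration of the tokenizing point, using Python's `shlex` module
as a stand-in for a shell. This is an assumption for the sake of the example:
a real shell splits on `IFS`, and the file name here is made up.

```python
import shlex

NBSP = "\u00a0"  # NO-BREAK SPACE

with_space = "cat my file.txt"
with_nbsp = "cat my" + NBSP + "file.txt"

# shlex.split breaks on ASCII whitespace only (space, tab, CR, LF),
# so the no-break space stays inside the file name token.
tokens_space = shlex.split(with_space)
tokens_nbsp = shlex.split(with_nbsp)

print(len(tokens_space), tokens_space)
print(len(tokens_nbsp), tokens_nbsp)
```

  With an ordinary space the file name is split into two arguments; with the
no-break space it stays a single argument.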

  So I could argue that Lua did "The Right Thing" when it encountered a
Non-breaking-Space [2].  And I'm not even going to go into the semantics of a
Zero-Width-Space [3].

  -spc

[1] http://boston.conman.org/2018/02/28.2

[2] The point might be clearer if Lua also supported Unicode in
        identifiers.

[3] But if you want to, https://en.m.wikipedia.org/wiki/Zero-width_space
