Changes in the validation of UTF-8

Daurnimator
I noticed the new commit that adds support for longer (deprecated in
2003) utf8 sequences:
https://github.com/lua/lua/commit/1e0c73d5b643707335b06abd2546a83d9439d14c

I'm curious why this changed? It seems like a backwards step to me.

Re: Changes in the validation of UTF-8

Dirk Laurie-2
On Sun, 17 Mar 2019 at 07:52, Daurnimator <[hidden email]> wrote:
>
> I noticed the new commit that adds support for longer (deprecated in
> 2003) utf8 sequences:
> https://github.com/lua/lua/commit/1e0c73d5b643707335b06abd2546a83d9439d14c
>
> I'm curious why this changed? It seems like a backwards step to me.

It seems to be in line with Lua's philosophy of providing capability,
not policy.

Re: Changes in the validation of UTF-8

Andrew Gierth
>>>>> "Dirk" == Dirk Laurie <[hidden email]> writes:

 >> I noticed the new commit that adds support for longer (deprecated in
 >> 2003) utf8 sequences:
 >> https://github.com/lua/lua/commit/1e0c73d5b643707335b06abd2546a83d9439d14c
 >>
 >> I'm curious why this changed? It seems like a backwards step to me.

 Dirk> It seems to be in line with Lua's philosophy of providing
 Dirk> capability, not policy.

When you're validating against an externally defined interchange
standard, then accepting data that the actual standard explicitly
rejects IS a lack of capability.

--
Andrew.

Re: Changes in the validation of UTF-8

Dirk Laurie-2
On Sun, 17 Mar 2019 at 09:03, Andrew Gierth
<[hidden email]> wrote:

>
> >>>>> "Dirk" == Dirk Laurie <[hidden email]> writes:
>
>  >> I noticed the new commit that adds support for longer (deprecated in
>  >> 2003) utf8 sequences:
>  >> https://github.com/lua/lua/commit/1e0c73d5b643707335b06abd2546a83d9439d14c
>  >>
>  >> I'm curious why this changed? It seems like a backwards step to me.
>
>  Dirk> It seems to be in line with Lua's philosophy of providing
>  Dirk> capability, not policy.
>
> When you're validating against an externally defined interchange
> standard, then accepting data that the actual standard explicitly
> rejects IS a lack of capability.

Lua in no way even comes close to validating against the current UTF-8
standard. We've been through this before. Marc Balmer in particular
has been quite trenchant on this point.

All that Lua does is to verify that a string satisfies the basic UTF-8
encoding: ASCII or a starting byte whose introductory string of 1's
says how many bytes in total are being encoded, followed by the right
number of 10... bytes. It's quick-and-dirty, and it doesn't get less
dirty by patching over one of the many ways in which it falls short.
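
The structural check described above can be sketched as follows. This is a Python illustration of that description only, not Lua's actual utf8 code, and the function names are hypothetical. It assumes lead bytes announcing two to six bytes are allowed and checks nothing beyond the 10xxxxxx form of the bytes that follow, so overlong forms and encoded surrogates pass.

```python
def expected_length(lead: int) -> int:
    """Total sequence length announced by a lead byte, or 0 if it cannot start one."""
    if lead < 0x80:
        return 1  # plain ASCII
    ones = 0
    mask = 0x80
    while lead & mask:  # count the run of leading 1 bits
        ones += 1
        mask >>= 1
    # A single leading 1 marks a continuation byte; more than six is undefined.
    return ones if 2 <= ones <= 6 else 0


def is_structurally_valid(data: bytes) -> bool:
    """Check lead bytes and continuation bytes only (hypothetical helper)."""
    i = 0
    while i < len(data):
        n = expected_length(data[i])
        if n == 0 or i + n > len(data):
            return False
        # Every following byte must look like 10xxxxxx.
        if any(b & 0xC0 != 0x80 for b in data[i + 1:i + n]):
            return False
        i += n
    return True


print(is_structurally_valid("caf\u00e9".encode("utf-8")))  # well-formed text
print(is_structurally_valid(b"\xc0\x80"))      # overlong NUL still passes
print(is_structurally_valid(b"\xed\xa0\x80"))  # encoded surrogate still passes
```

A check of this kind says nothing about whether the decoded value is a legal Unicode scalar value, which is the gap the rest of the thread argues about.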

There's no substitute for loading a genuine standard-conforming UTF8
library. Luarocks offers four; hopefully at least one is kept up to
date.

Re: Changes in the validation of UTF-8

Soni "They/Them" L.
In reply to this post by Daurnimator


On 2019-03-17 2:51 a.m., Daurnimator wrote:
> I noticed the new commit that adds support for longer (deprecated in
> 2003) utf8 sequences:
> https://github.com/lua/lua/commit/1e0c73d5b643707335b06abd2546a83d9439d14c
>
> I'm curious why this changed? It seems like a backwards step to me.
>

This is very useful for people like me who use a UTF-8-like encoding to
encode integers. Maybe you want the module renamed to "varint8" instead?

Re: Changes in the validation of UTF-8

Andrew Gierth
In reply to this post by Dirk Laurie-2
>>>>> "Dirk" == Dirk Laurie <[hidden email]> writes:

 Dirk> Lua in no way even comes close to validating against the current
 Dirk> UTF-8 standard. We've been through this before. Marc Balmer in
 Dirk> particular has been quite trenchant on this point.

Other than the fact that it fails to reject encoded surrogates, what
invalid sequence does the code in Lua 5.3.5 accept?

 Dirk> All that Lua does is to verify that a string satisfies the basic
 Dirk> UTF-8 encoding: ASCII or a starting byte whose introductory
 Dirk> string of 1's says how many bytes in total are being encoded,
 Dirk> followed by the right number of 10... bytes.

That's ... not what the 5.3.5 utf8_decode does. Did you read it? Test
it?

--
Andrew.

Re: Changes in the validation of UTF-8

Jay Carlson
In reply to this post by Dirk Laurie-2
On 2019-03-17, at 4:57 AM, Dirk Laurie <[hidden email]> wrote:

> Lua in no way even comes close to validating against the current UTF-8
> standard.

Do you mean UTF-8, or do you mean Unicode?

Jay

Re: Changes in the validation of UTF-8

Philippe Verdy
UTF-8 is a standard part of Unicode, defined in chapter 3 (Conformance) of the standard.
Unicode does not restrict which characters (code points) you can put in a string. If you are UTF-8 conforming, these code points can be any that have a scalar value. This means you cannot place unpaired surrogates, because surrogates have no scalar value (so an isolated surrogate cannot be encoded in any "valid" UTF, including UTF-16, UTF-32, BOCU, or even the Chinese GB18030 standard, which now has a stabilized definition that covers the whole UCS space).
The standard even gives a specification of the byte sequences that are "conforming" in UTF-8, and these sequences explicitly exclude the surrogate space (even though surrogates are assigned code points, they have no scalar value suitable for UTF-8). This is so that ALL conforming UTF-8 text can be bijectively converted to any other conforming UTF, including UTF-16. Unpaired surrogates are unsupported.
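
The scalar-value rule above can be illustrated with a short Python sketch. The helper names are hypothetical; the only library behaviour relied on is Python's own strict "utf-8" codec, which refuses lone surrogates, and its "surrogatepass" error handler, which lets them through.

```python
def is_scalar_value(cp: int) -> bool:
    """True for code points that a conforming UTF may encode."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def strict_utf8(cp: int):
    """Encode one code point with Python's strict codec, or return None."""
    try:
        return chr(cp).encode("utf-8")
    except ValueError:
        # chr() raises ValueError above 0x10FFFF; the codec raises
        # UnicodeEncodeError (a ValueError subclass) for surrogates.
        return None


for cp in (0x20AC, 0xD800, 0x10FFFF, 0x110000):
    print(hex(cp), is_scalar_value(cp), strict_utf8(cp))

# Opting out of the rule explicitly yields the three-byte surrogate form.
print(chr(0xD800).encode("utf-8", "surrogatepass"))
```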

On Sun, 17 Mar 2019 at 23:58, Jay Carlson <[hidden email]> wrote:
On 2019-03-17, at 4:57 AM, Dirk Laurie <[hidden email]> wrote:

> Lua in no way even comes close to validating against the current UTF-8
> standard.

Do you mean UTF-8, or do you mean Unicode?

Jay

Re: Changes in the validation of UTF-8

Roberto Ierusalimschy
In reply to this post by Daurnimator
> I noticed the new commit that adds support for longer (deprecated in
> 2003) utf8 sequences:
> https://github.com/lua/lua/commit/1e0c73d5b643707335b06abd2546a83d9439d14c
>
> I'm curious why this changed? It seems like a backwards step to me.

Why is rejecting surrogates a backwards step? Did you (or anyone else
in this discussion) bother reading the new documentation?

-- Roberto

Re: Changes in the validation of UTF-8

Soni "They/Them" L.


On 2019-03-18 9:49 a.m., Roberto Ierusalimschy wrote:

>> I noticed the new commit that adds support for longer (deprecated in
>> 2003) utf8 sequences:
>> https://github.com/lua/lua/commit/1e0c73d5b643707335b06abd2546a83d9439d14c
>>
>> I'm curious why this changed? It seems like a backwards step to me.
> Why is rejecting surrogates a backwards step? Did you (or anyone else
> in this discussion) bother reading the new documentation?
>
> -- Roberto
>

Thank you for this change, for I can now use utf8 as a 31-bit varint
library.

Re: Changes in the validation of UTF-8

Andrew Gierth
In reply to this post by Roberto Ierusalimschy
>>>>> "Roberto" == Roberto Ierusalimschy <[hidden email]> writes:

 >> I noticed the new commit that adds support for longer (deprecated in
 >> 2003) utf8 sequences:
 >> https://github.com/lua/lua/commit/1e0c73d5b643707335b06abd2546a83d9439d14c
 >>
 >> I'm curious why this changed? It seems like a backwards step to me.

 Roberto> Why is rejecting surrogates a backwards step?

Rejecting surrogates is a forward step, that's not the problem.

Accepting values over 10FFFF is the backward step.

--
Andrew.

Re: Changes in the validation of UTF-8

Roberto Ierusalimschy
>  Roberto> Why is rejecting surrogates a backwards step?
>
> Rejecting surrogates is a forward step, that's not the problem.
>
> Accepting values over 10FFFF is the backward step.

Did you read the documentation? By default the functions reject any
value over 10FFFF. They only accept these values if you give an explicit
parameter to that end. You explicitly say: I want invalid codes.
That, as others pointed out, may be useful for other purposes.

If you want to accept invalid codes, it is not the lack of this
parameter that will stop you.

(Again, did you read the documentation? Maybe that point is not
clear there?)

-- Roberto

Re: Changes in the validation of UTF-8

Daurnimator
On Tue, 19 Mar 2019 at 07:25, Roberto Ierusalimschy
<[hidden email]> wrote:

>
> >  Roberto> Why is rejecting surrogates a backwards step?
> >
> > Rejecting surrogates is a forward step, that's not the problem.
> >
> > Accepting values over 10FFFF is the backward step.
>
> Did you read the documentation? By default the functions reject any
> value over 10FFFF. They only accept these values if you give an explicit
> parameter to that end. You explicitly say: I want invalid codes.
> That, as others pointed out, may be useful for other purposes.
>
> If you want to accept invalid codes, it is not the lack of this
> parameter that will stop you.
>
> (Again, did you read the documentation? Maybe that point is not
> clear there?)

What I think is a backwards step is the lexer accepting "\u{110000}".
Unicode escapes >10FFFF should really be an error IMO.

UTF8PATT accepting deprecated 5 and 6 byte sequences is a similarly
undesirable change.

Accepting unpaired surrogates isn't odd, and is unfortunately required
when working with many badly designed APIs (e.g. windows file paths,
javascript). utf-8 with unpaired surrogates allowed is often called
"wtf-8". https://simonsapin.github.io/wtf-8/

Re: Changes in the validation of UTF-8

Dirk Laurie-2
On Wed, 20 Mar 2019 at 07:39, Daurnimator <[hidden email]> wrote:

>
> On Tue, 19 Mar 2019 at 07:25, Roberto Ierusalimschy
> <[hidden email]> wrote:
> >
> > >  Roberto> Why is rejecting surrogates a backwards step?
> > >
> > > Rejecting surrogates is a forward step, that's not the problem.
> > >
> > > Accepting values over 10FFFF is the backward step.
> >
> > Did you read the documentation? By default the functions reject any
> > value over 10FFFF. They only accept these values if you give an explicit
> > parameter to that end. You explicitly say: I want invalid codes.
> > That, as others pointed out, may be useful for other purposes.
> >
> > If you want to accept invalid codes, it is not the lack of this
> > parameter that will stop you.
> >
> > (Again, did you read the documentation? Maybe that point is not
> > clear there?)

> What I think is a backwards step is the lexer accepting "\u{110000}".
> Unicode escapes >10FFFF should really be an error IMO.
>
> UTF8PATT accepting deprecated 5 and 6 byte sequences is a similarly
> undesirable change.
>
> Accepting unpaired surrogates isn't odd, and is unfortunately required
> when working with many badly designed APIs (e.g. windows file paths,
> javascript). utf-8 with unpaired surrogates allowed is often called
> "wtf-8". https://simonsapin.github.io/wtf-8/

The Unicode and UTF-8 standards have changed over time. Who can say
that they will forever be frozen as they stand now?

On the other hand, non-negative 4-byte integers will always be
non-negative 4-byte integers.

The change to the utf8 library offers a one-to-one conversion
algorithm between non-negative four-byte integers and an encoding that
translates seven-bit integers to themselves and other integers to two
or more bytes. For historical reasons, and because of its ability to
translate valid Unicode to and from valid UTF-8, this library is
called utf8. It is by no means the only application of the encoding,
any more than counting sheep is the only application of integers.
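
A minimal Python sketch of such a one-to-one conversion, assuming the original UTF-8 layout of one to six bytes for values up to 0x7FFFFFFF. The names encode, decode, LIMITS and LEAD are hypothetical, and this is not the code of Lua's utf8 library.

```python
# Largest value representable with 1..6 bytes in the original UTF-8 layout.
LIMITS = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
# Lead-byte prefix for each length (the first entry covers plain ASCII).
LEAD = [0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC]


def encode(n: int) -> bytes:
    """Encode a non-negative 31-bit integer in the shortest possible form."""
    if not 0 <= n <= 0x7FFFFFFF:
        raise ValueError("value does not fit in 31 bits")
    length = next(i + 1 for i, limit in enumerate(LIMITS) if n <= limit)
    if length == 1:
        return bytes([n])
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (n & 0x3F))  # six payload bits per continuation byte
        n >>= 6
    return bytes([LEAD[length - 1] | n] + tail[::-1])


def decode(data: bytes) -> int:
    """Decode exactly one sequence, rejecting overlong forms."""
    if not data:
        raise ValueError("empty input")
    lead = data[0]
    if lead < 0x80:
        length, value = 1, lead
    else:
        length, mask = 0, 0x80
        while lead & mask:  # the run of leading 1 bits gives the length
            length += 1
            mask >>= 1
        if not 2 <= length <= 6:
            raise ValueError("invalid lead byte")
        value = lead & (mask - 1)  # payload bits below the run of 1s
    if len(data) != length:
        raise ValueError("wrong number of bytes")
    for b in data[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (b & 0x3F)
    if length > 1 and value <= LIMITS[length - 2]:
        raise ValueError("overlong encoding")
    return value


for n in (0x41, 0x20AC, 0x10FFFF, 0x110000, 0x7FFFFFFF):
    assert decode(encode(n)) == n
    print(hex(n), encode(n).hex())
```

Rejecting overlong forms in decode is what keeps the mapping one-to-one: every integer has exactly one accepted byte sequence.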

-- Dirk

Re: Changes in the validation of UTF-8

Roberto Ierusalimschy
In reply to this post by Daurnimator
> What I think is a backwards step is the lexer accepting "\u{110000}".
> Unicode escapes >10FFFF should really be an error IMO.

If you write "\u{110000}", you are explicitly asking for an
invalid code. If you want invalid codes, you might as well write
"\xf4\x90\x80\x80" or "\244\144\128\128". "\u{110000}" just makes
it easier.
(But "\u{110000}" is hardly as useful as utf8.char(0x110000).
I was wondering about removing this laxity in \u; but then surrogates
should be invalid too. Is that good?)
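
The byte values quoted above can be checked by hand. The Python sketch below is an illustration with hypothetical helper names: it unpacks the payload bits of the four-byte sequence and confirms that a strict decoder (here Python's own "utf-8" codec) refuses it. Note that Lua's \ddd escapes are decimal, so they are written here as a list of integers.

```python
hex_form = b"\xf4\x90\x80\x80"          # the "\xf4\x90\x80\x80" spelling
dec_form = bytes([244, 144, 128, 128])  # the "\244\144\128\128" spelling (decimal)


def payload_of_4_byte_sequence(seq: bytes) -> int:
    """Unpack 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx into its 21 payload bits."""
    b0, b1, b2, b3 = seq
    return ((b0 & 0x07) << 18) | ((b1 & 0x3F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)


def strictly_decodable(seq: bytes) -> bool:
    """True if Python's strict UTF-8 codec accepts the bytes."""
    try:
        seq.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(hex_form == dec_form)
print(hex(payload_of_4_byte_sequence(hex_form)))
print(strictly_decodable(hex_form))             # just past the Unicode range
print(strictly_decodable(b"\xf4\x8f\xbf\xbf"))  # U+10FFFF, the last valid value
```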


> UTF8PATT accepting deprecated 5 and 6 byte sequences is a similarly
> undesirable change.

UTF8PATT already accepted all kinds of wrong stuff, including
overlong sequences. 5 and 6 byte sequences are the least of the
problems here. The documentation is (and was) clear that you should use
it only on valid strings.
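
To see why a pattern of this kind is only safe on valid input, here is a Python sketch. The regular expression mirrors the general shape of such a pattern (one lead byte followed by any number of continuation bytes); its exact byte ranges are an assumption made for illustration and are not copied from Lua's UTF8PATT.

```python
import re

# Assumed shape: one lead byte, then zero or more continuation bytes.
CHAR_PATTERN = re.compile(rb"[\x00-\x7F\xC2-\xFD][\x80-\xBF]*")


def pattern_accepts(seq: bytes) -> bool:
    """True if the whole byte string matches as a single 'character'."""
    return CHAR_PATTERN.fullmatch(seq) is not None


def strictly_valid(seq: bytes) -> bool:
    """True if Python's strict UTF-8 codec accepts the bytes."""
    try:
        seq.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


samples = {
    "euro sign": "\u20ac".encode("utf-8"),
    "overlong 3-byte NUL": b"\xe0\x80\x80",
    "too many continuation bytes": b"\xc2\x80\x80\x80",
}
for name, seq in samples.items():
    print(name, pattern_accepts(seq), strictly_valid(seq))
```

The pattern splits already-valid text into characters correctly, but it cannot tell valid text from invalid text, which is the point being made about using it only on valid strings.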


> Accepting unpaired surrogates isn't odd, and is unfortunately required
> when working with many badly designed APIs (e.g. windows file paths,
> javascript). utf-8 with unpaired surrogates allowed is often called
> "wtf-8". https://simonsapin.github.io/wtf-8/

That's the whole point: It is useful to be able to work with invalid
codes.  Why is 110000 "more invalid" than a surrogate? If you are going
to accept surrogates, why not go the whole way and accept what UTF-8
was designed for?

-- Roberto


Re: Changes in the validation of UTF-8

Lorenzo Donati-3
In reply to this post by Daurnimator
On 20/03/2019 06:32, Daurnimator wrote:
> On Tue, 19 Mar 2019 at 07:25, Roberto Ierusalimschy
> <[hidden email]> wrote:
>>

[...]

> Accepting unpaired surrogates isn't odd, and is unfortunately required
> when working with many badly designed APIs (e.g. windows file paths,
> javascript). utf-8 with unpaired surrogates allowed is often called
> "wtf-8". https://simonsapin.github.io/wtf-8/
>
>

"wtf-8" sweet!!!

8-D :-]]

-- Lorenzo