Lua 5.4.0 beta announcement

13 messages
Lua 5.4.0 beta announcement

TonyMc
Hi,

in the recent beta announcement there is a link to the changes at
http://www.lua.org/work/doc/#changes .

There is a typo there: coersions should be coercions.

Thank you for the beta!

Tony


Re: Lua 5.4.0 beta announcement

Philippe Verdy
The change in utf8 makes it incompatible with the standards if it now accepts and decodes sequences up to 6 bytes.
The only standard UTF-8 is the universally adopted definition that limits sequences to the 17 planes, up to U+10FFFF.
The "original" specification of UTF-8 was only an informative RFC, which was deprecated many years ago and never adopted as a standard. All web standards use the version co-published in the Unicode Standard and in the RFC that replaced it (which was approved and adopted everywhere else).
I think this is a bad idea... Now applications will have to use their own libraries and check a lot of dependencies to make sure they conform and treat erroneous data as invalid.
The UTF-8 standard is so universal today that it is needed in almost all applications using the web or filesystems, including automated processes and systems without their own user interface that are used as middleware.
Having to rewrite it now is a bad idea, especially for small devices (including IoT).
You should not have changed this specification. The (extremely rare) situations where an application may need such an extension should be scoped in their own library, or could have used another variant of the library. This change also forces existing libraries that parse and validate international text (including the builtin pattern engine or alternate regular-expression engines) to deal with new complications.

Anyway, I really suggest that the compile-time options for the Lua 5.4 "utf8" library allow a setting so that the standard behavior can remain in place (i.e. 5-byte and 6-byte sequences, as well as their associated lead bytes which are invalid in the standard, should be treated like the other sequences that the library recognizes as invalid, such as a valid lead byte not followed by the correct number of trail bytes, or trail bytes without any lead byte): treating them as invalid is expected. If applications now have to make additional checks, this will just slow them down (or leave bugs in them, with undetected cases possibly creating security holes that can be exploited).

A "secure" compiled version of Lua should have this option set by default to keep the utf8 library conforming to the standard. The old RFC behavior should then not be supported, or could be added in an optional secondary library like "oldutf8" instead of "utf8". But I bet that almost no one will ever want to use that old library, which therefore need not be builtin in the engine but could be provided as an optional extension, loaded only on demand with explicit library loads.
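The strict validation being argued for here can be expressed with `utf8.len`; this is a minimal sketch, assuming the signature in the Lua 5.4 work manual, `utf8.len(s [, i [, j [, lax]]])`, where `lax` defaults to false and strict mode rejects overlong forms, surrogates, and code points above U+10FFFF:

```lua
-- Sketch: strict UTF-8 validation via utf8.len (Lua 5.4 work manual).
local function is_valid_utf8(s)
  -- On invalid input, utf8.len returns fail (nil) plus the position of
  -- the first invalid byte; a non-nil result means the string is valid.
  return utf8.len(s) ~= nil
end

print(is_valid_utf8("héllo"))                     -- true
print(is_valid_utf8("\xC0\x80"))                  -- false (overlong NUL)
print(is_valid_utf8("\xFD\xBF\xBF\xBF\xBF\xBF"))  -- false (6-byte sequence)
```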






Re: Lua 5.4.0 beta announcement

Roberto Ierusalimschy
In reply to this post by TonyMc
> in the recent beta announcement there is a link to the changes at
> http://www.lua.org/work/doc/#changes .
>
> There is a typo there: coersions should be coercions.

Thanks for the correction.

-- Roberto


Re: Lua 5.4.0 beta announcement

Soni "They/Them" L.
In reply to this post by Philippe Verdy


On 2019-10-03 11:14 a.m., Philippe Verdy wrote:

> The change in utf8 makes it incompatible with standards if it now
> accepts and decodes sequences up to 6 bytes.
> [...]

utf8 has always accepted all sorts of invalid sequences when matching
strings using the utf8 pattern.

a well-behaved program should never output invalid UTF-8 from valid
UTF-8, and you still need to *explicitly* request 31-bit utf8 for
decoding, so nothing has changed there.

but perhaps utf8 should be renamed to varint31? these changes are, after
all, meant to reuse the same code for a small yet useful data
interchange format based on 31-bit varints.

the only concern I have is over existing usage of e.g.
utf8.codes(s:gsub(...)). it would probably be beneficial to make
utf8.codes accept a start index before the lax switch, or otherwise
enforce that the lax switch is not a number. (the start index is more
appealing imo.)
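The hazard above comes from Lua's multiple return values; a sketch, assuming `utf8.codes(s [, lax])` as in the 5.4 work manual:

```lua
-- string.gsub returns TWO values: the result string and the number of
-- substitutions made.
local s = "abc\0def"
local cleaned, count = s:gsub("\0", "")
print(cleaned, count)  -- abcdef   1

-- When a call is the LAST argument to another call, all of its results
-- are passed along, so
--   utf8.codes(s:gsub("\0", ""))
-- is really utf8.codes(cleaned, count); count is a number, and anything
-- other than false/nil is truthy, which would silently enable lax mode.
-- Wrapping the call in parentheses truncates it to a single value:
--   utf8.codes((s:gsub("\0", "")))
```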




Re: Lua 5.4.0 beta announcement

Roberto Ierusalimschy
> the only concern I have is over existing usage of e.g.
> utf8.codes(s:gsub(...)). it would probably be beneficial to make utf8.codes
> accept a start index before the lax switch, or otherwise enforce that the
> lax switch is not a number. (the start index is more appealing imo.)

Accepting a start index would not help in that case, would it?

-- Roberto


Re: Lua 5.4.0 beta announcement

Soni "They/Them" L.


On 2019-10-03 6:05 p.m., Roberto Ierusalimschy wrote:
> > the only concern I have is over existing usage of e.g.
> > utf8.codes(s:gsub(...)). it would probably be beneficial to make utf8.codes
> > accept a start index before the lax switch, or otherwise enforce that the
> > lax switch is not a number. (the start index is more appealing imo.)
>
> Accepting a start index would not help in that case, would it?
>
> -- Roberto
>

While it would produce wrong results, that would probably be better than
producing unsafe results. Consider a sequence of gsubs that remove bad
sequences. Yeah, you aren't supposed to do it like this, but ppl do
things like this all the time. (For proper security you should always
operate on decoded data, where all the arguments about overlong encodings
and whatever being bad for security can be thrown out the window, but
that doesn't stop ppl doing it anyway, and I could rant about this all
day lol.)

Anyway, I digress. Consider a sequence of gsubs that remove bad
sequences. And then they switch to the new version. Now some things that
were invalid are suddenly valid because the 2nd return value from gsub
is being used as true and enabling unsafe mode, so some things they're
getting rid of are suddenly going through.

In this case, I think it'd be better to just crash or produce broken but
safe results than let invalid UTF-8 through. but maybe that's just me.


Re: Lua 5.4.0 beta announcement

Philippe Verdy
In reply to this post by Roberto Ierusalimschy
But the old 31-bit non-standard variant uses a very lax pattern that would be invalid in standard UTF-8.
It's not acceptable to validate any string that starts with a lead byte valid only for the old 31-bit variant.
The valid patterns are documented in TUS chapter 3 (conformance) and the standard RFC; don't trust the old proposed RFC, which was never accepted.
Otherwise there is the risk of validating input that will break when converted to UTF-16, creating dangerous collisions or unexpected errors.
Standard UTF-8 means it is FULLY interoperable with ALL standard UTFs (of which the old proposed RFC's 31-bit variant has never been part).
There are extremely limited uses for deviating from the UTF-8 standard, and those use cases should be strictly isolated, without forcing a rewrite of the large amount of code that depends on standard conformance; only those rare use cases should use their own encoding libraries (under another name than "utf8"). See for example the variant used in Java for JNI, which is NOT labelled "utf8" except in a legacy field name used in legacy C source code. Even Java no longer uses this encoding except internally, in the legacy storage format of compiled classes; elsewhere it uses another (de)coder, not the one named "UTF-8". Its only difference is the way it represents U+0000 NUL, as <0xC0,0x80> instead of the single byte <0x00>, so that it can be used with legacy C string types terminated by a null byte. Modern string libraries don't use null-byte termination but byte buffers with a length property, and modern JNI applications no longer use the old interfaces based on 8-bit code units, but 16-bit code units instead. This also applies to other languages like JavaScript/ECMAScript/TypeScript/VBScript, which likewise don't depend on null-byte termination to know the length of strings, for any encodings and code-unit sizes they support (including, notably, Python).

So I don't understand the rationale for changing the utf8 library to this "extended" but non-standard behavior, which can just cause havoc and create severe security problems in applications that would then need to be deeply modified in many places to revalidate their input and make sure they remain interoperable.
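The Java "modified UTF-8" detail mentioned above can be shown concretely; a sketch, assuming Lua 5.4's strict-by-default `utf8.len`:

```lua
-- Java's "modified UTF-8" encodes U+0000 as the overlong pair
-- 0xC0 0x80 so that embedded NULs survive C-style null-terminated
-- strings; standard UTF-8 forbids all overlong forms.
print(utf8.len("\xC0\x80"))  -- rejected: fail (nil) plus error position
print(utf8.len("\0"))        -- 1: standard UTF-8 encodes NUL as one byte
```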


On Thu, Oct 3, 2019 at 23:05, Roberto Ierusalimschy <[hidden email]> wrote:
> the only concern I have is over existing usage of e.g.
> utf8.codes(s:gsub(...)). it would probably be beneficial to make utf8.codes
> accept a start index before the lax switch, or otherwise enforce that the
> lax switch is not a number. (the start index is more appealing imo.)

Accepting a start index would not help in that case, would it?

-- Roberto


Re: Lua 5.4.0 beta announcement

Soni "They/Them" L.


On 2019-10-03 6:21 p.m., Philippe Verdy wrote:

> But the old 31-bit non-standard variant uses a very lax pattern, that
> would be invalid with standard UTF-8.
> [...]
> So I don't understand the rationale for changing the utf8 library to
> this "extended" but non-standard behavior, which can just cause havoc
> and create severe security problems in applications that would then
> need to be deeply modified in many places to revalidate their input
> and make sure they remain interoperable.
>

I could rant about this all day but tl;dr: never, ever ever ever ever
ever, sanitize before decoding. just don't.

this is how you get XSS, SQL injection, and all sorts of exploits - with
attempts to make code seem "simpler", sacrificing readability
(sanitizing a string obscures the real intent of sanitizing decoded
data), performance (sanitizing a string just so you can decode it is
slower than decoding it and sanitizing the data) and security (case in
point).
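The "decode first, then sanitize" discipline can be sketched in a few lines (the helper name here is hypothetical); `utf8.codes` raises an error on invalid byte sequences, so malformed input fails fast instead of slipping past a byte-level gsub:

```lua
-- Sanitize *decoded* code points rather than patching bytes.
local function strip_c0_controls(s)
  local out = {}
  for _, cp in utf8.codes(s) do        -- errors on invalid UTF-8: good
    if cp >= 0x20 and cp ~= 0x7F then  -- drop C0 controls and DEL
      out[#out + 1] = utf8.char(cp)
    end
  end
  return table.concat(out)
end

print(strip_c0_controls("ab\tc"))  -- abc
```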




Re: Lua 5.4.0 beta announcement

Philippe Verdy


On Thu, Oct 3, 2019 at 23:35, Soni "They/Them" L. <[hidden email]> wrote:
> On 2019-10-03 6:21 p.m., Philippe Verdy wrote:
> > So I don't undersand the rationale for changing the utf8 libary for
> > this "extended" but non-standard behavior that can just cause havoc
> > and create severe security problems in applications that would then
> > need to be deeply modified in many places to force them to revalidate
> > their input and make sure that theyr will remain interoperable.
>
> I could rant about this all day but tl;dr: never, ever ever ever ever
> ever, sanitize before decoding. just don't.
And that's not what I advocated. I never said "sanitize before decoding", but "validate". Because the new utf8 implementation will no longer validate, and will decode INVALID, non-standard UTF-8 as if it were standard, this will cause havoc everywhere downstream, effectively forcing existing apps to spread revalidation code everywhere.
There's absolutely no good reason for Lua 5.4 to accept the behavior of the old proposed non-standard RFC: there must not be any code point higher than 0x10FFFF, and the 31-bit "extension" is unsafe. I strongly vote against it. The rare applications that want the old RFC behavior should use their own separate implementation in a separate library, not the builtin "utf8" library, which should remain as clean as possible and safely reject non-standard codes (if needed, the "utf8" library may take an additional optional parameter to explicitly request the lax behavior).


Re: Lua 5.4.0 beta announcement

Gabriel Bertilson
On Thu, Oct 3, 2019 at 7:23 PM Philippe Verdy <[hidden email]> wrote:
> if needed the "utf8" library may use additional optional parameter to explicitly request the lax behavior

That's exactly what several of the utf8 functions do. See
https://www.lua.org/work/doc/manual.html#6.5.

— Gabriel


Re: Lua 5.4.0 beta announcement

Philippe Verdy
OK then... This is nearly OK, except for charpattern, which is very lax (including for the "extended" 31-bit definition, where the pattern is overlong). charpattern is only valid if you have first scanned the full text to validate its encoding; it cannot be used to scan the text correctly on its own. It will only enumerate each lead byte (including invalid ones), returning a sequence of arbitrary length that may not decode to a single valid code point: the sequence could map to a surrogate code point plus overlong trail bytes (not necessarily paired with a following surrogate in the correct range), its lead byte may be incorrect, it may be overlong, or the last sequence matched in the text may be too short.
This should be specified more clearly.
I have never used such a lax charpattern, which can match arbitrarily long sequences (containing an unlimited number of trail bytes), when it is possible to detect an overlong sequence much earlier, without even having to perform any (costly) loop. Non-lax patterns match at most 4 bytes and require no internal buffering or potentially unbounded scanning (such unlimited patterns can be used to mount time-based attacks even when there are no buffering problems, or could cause internal memory problems).
I cannot recommend using any code that depends on this charpattern (it is not needed at all to safely validate input against invalid content).

On Fri, Oct 4, 2019 at 02:30, Gabriel Bertilson <[hidden email]> wrote:
On Thu, Oct 3, 2019 at 7:23 PM Philippe Verdy <[hidden email]> wrote:
> if needed the "utf8" library may use additional optional parameter to explicitly request the lax behavior

That's exactly what several of the utf8 functions do. See
https://www.lua.org/work/doc/manual.html#6.5.

— Gabriel

Re: Lua 5.4.0 beta announcement

Gabriel Bertilson
On Thu, Oct 3, 2019 at 9:11 PM Philippe Verdy <[hidden email]> wrote:
>
> OK then... But this is nearly OK except the charpattern which is very lax (including for the "extended" 31-bit definition where the pattern is overlong: the charpattern is only valid if you have first scanned the full text to validate its encoding, but charpattern cannot be used to scan the text correctly, but it will only correctly allow enumerating each lead byte, including invalid one, returning a sequence of arbitrary length that may not decode correctly as a single valid codepoint, or could map to a surrogate codepoint plus overlong trail bytes, and not necessarily paired with a following surrogate in the correct range: each sequence matched by this pattern is not necessarily valid as its lead byte may still be incorrect, and the sequence may still be overlong, or too short for the last sequence matched in the given text).

Yeah, the pattern can't be used for validation. That would only be
possible if Lua patterns allowed alternation.

> I could not recommend using any code depending on this charpattern (not needed at all to safely validate any input against unsane contents)

utf8.charpattern works as advertised, matching a valid byte sequence
for a code point if you apply it to a valid UTF-8 string. I use it
often to match a code point in Lua patterns. There's not another
pattern that would do the same job. The Lua 5.4 version, which now
includes leading bytes for 5- and 6-byte sequences, will work just as
well under the same conditions, so I'll have no qualms using it.
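That usage can be sketched in two lines; this assumes, as the manual states, that the subject is already valid UTF-8:

```lua
-- utf8.charpattern matches the bytes of exactly one encoded code point
-- in a string that is already known to be valid UTF-8.
local count = 0
for ch in ("héllo"):gmatch(utf8.charpattern) do
  count = count + 1
end
print(count)  -- 5: "é" is two bytes but a single match
```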

I don't expect utf8.charpattern to validate my UTF-8 for me. utf8.len
can be used for that instead; in Lua 5.4 it rejects encodings of
surrogate code points (which Lua 5.3 doesn't) and encodings of values
too large to be code points (such as 0xFFFFFF), unless you ask for
the lax behavior.

It does weird me out that "\u{FFFFFF}" works now though.

> This should be specified more clearly.

The current description of utf8.charpattern, "matches exactly one
UTF-8 byte sequence, assuming that the subject is a valid UTF-8
string", is clear, for those who understand what "valid UTF-8" means,
and that if you apply the pattern to invalid UTF-8 you get undefined
behavior, in the tradition of C. What more would you like it to say?

— Gabriel


Re: Lua 5.4.0 beta announcement

Sean Conner
It was thus said that the Great Gabriel Bertilson once stated:
> On Thu, Oct 3, 2019 at 9:11 PM Philippe Verdy <[hidden email]> wrote:
> >
> > OK then... But this is nearly OK except the charpattern which is very lax (including for the "extended" 31-bit definition where the pattern is overlong: the charpattern is only valid if you have first scanned the full text to validate its encoding, but charpattern cannot be used to scan the text correctly, but it will only correctly allow enumerating each lead byte, including invalid one, returning a sequence of arbitrary length that may not decode correctly as a single valid codepoint, or could map to a surrogate codepoint plus overlong trail bytes, and not necessarily paired with a following surrogate in the correct range: each sequence matched by this pattern is not necessarily valid as its lead byte may still be incorrect, and the sequence may still be overlong, or too short for the last sequence matched in the given text).
>
> Yeah, the pattern can't be used for validation. That would only be
> possible if Lua patterns allowed alternation.

  Which is why we have LPEG.  Speaking of which, I do have a few modules
that deal with this.  All use LPEG.

        org.conman.parsers.ascii
                Matches one US-ASCII character (codes 0 to 127)

        org.conman.parsers.ascii.char
                Matches ASCII codes 32-126 (graphics set plus space)

        org.conman.parsers.ascii.control
                Matches the ASCII C0 control set (codes 0 to 31) plus delete
                (127---technically isn't part of the C0 set)

        org.conman.parsers.ascii.ctrl
                Matches the ASCII C0 set (plus DEL) and translates the
                character to its name:

                        0 - NUL
                        1 - SOH ...

        org.conman.parsers.utf8
                Matches one (or more) UTF-8 code points greater than or equal to 128
                (see org.conman.parsers.utf8.control for more information)

        org.conman.parsers.utf8.char
                Matches one Unicode codepoint greater than or equal to 160
                to the end of the Unicode defined codepoints (that is, if I
                have it defined correctly).

        org.conman.parsers.utf8.control
                Matches the C1 control set.  This includes multicode
                sequences like CSI, DCS, SOS, OSC, PM and APC.  If these
                don't mean anything to you, think terminal (or ANSI, even
                though they technically aren't ANSI) escape codes.

        org.conman.parsers.utf8.ctrl
                Parses the C1 control set, returning both the name of the
                sequence and any associated data.

  Since these are LPEG patterns, they can be used in larger expressions, and
they are all available via LuaRocks.

  You can check out the code at <https://github.com/spc476/LPeg-Parsers>
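  A hypothetical usage sketch of composing these (assuming each module
returns an LPEG pattern with the semantics described above: `ascii`
matching one code 0-127, `utf8` matching code points >= 128). LPEG's
ordered choice (`+`) provides exactly the alternation that plain Lua
patterns lack, so a whole-string validator composes directly:

```lua
local lpeg  = require "lpeg"
local ascii = require "org.conman.parsers.ascii"
local utf8p = require "org.conman.parsers.utf8"

-- Any mix of ASCII and non-ASCII code points, then end of input.
local valid = (ascii + utf8p)^0 * lpeg.P(-1)

print(valid:match("héllo") ~= nil)   -- expect a match for valid UTF-8
print(valid:match("\xC0\x80"))       -- expect nil for an overlong form
```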

  -spc