[PATCH] Quoted String "%q" non-ascii escaping (w/ hex).


Duane Leslie
Hi,

I had a problem with the quoted string format producing strings that
were not legal UTF-8 because it was not escaping non-ascii characters,
and then once I fixed that I wasn't able to read the strings back in
to a C program because C uses octal escapes and Lua uses decimal.

This patch ensures all control and non-ascii characters are escaped,
and uses the hexadecimal escape syntax instead of decimal to ensure
compatibility between Lua and C.

Technically it is still not safe to pass the strings as literals
directly into C because in C the hexadecimal production is not
automatically terminated at two characters but I figured this was
outside of the scope of the quoted string format specifier.  I solve
this instead by using `:gsub([[%f[\]\x%x%x]],'%0""')` to terminate the
hexadecimal escapes (triggering C's string literal concatenation
behaviour) at the point of export.
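
To make that export step concrete, here is a minimal sketch of the substitution, assuming Lua 5.2 or later (for `%f` and `\x`). The helper name `terminate_hex_escapes` is invented for illustration and is not part of the patch.

```lua
-- Hypothetical helper: close and reopen the C string literal after every
-- "\xHH" escape so a C compiler cannot read further hex digits into it.
local function terminate_hex_escapes(quoted)
  -- %f[\] matches only where a backslash follows a non-backslash character.
  -- The outer parentheses drop gsub's second result (the match count).
  return (quoted:gsub([[%f[\]\x%x%x]], '%0""'))
end

-- Without the inserted "" pairs, C would read \xA9ba as one hex escape.
print(terminate_hex_escapes([["\xC3\xA9bad"]]))
```

One caveat with this pattern: because the frontier only fires on the first backslash of a run, an escaped backslash directly followed by a hex escape (`\\\x41`) is left untouched, so a stricter scan may be needed for arbitrary input.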

Regards,

Duane.

Attachment: quoted_string.patch (1K)

Re: [PATCH] Quoted String "%q" non-ascii escaping (w/ hex).

Niccolo Medici
On 6/28/17, Duane Leslie <[hidden email]> wrote:
> Hi,
>
> I had a problem with the quoted string format producing strings that
> were not legal UTF-8 [...]

What do you mean? I don't see any problem:

`string.format("%q", "Здравствуйте")` returns '"Здравствуйте"'. That's
valid UTF-8.

>
> [...] because it was not escaping non-ascii characters,

(Why should non-ASCII chars be escaped? If you personally need this
escaping, it doesn't mean there's a "problem" with Lua's %q.)
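
Both positions can be checked with a small sketch, assuming Lua 5.3 or later (for the `utf8` library) and the default "C" locale, since `%q` decides what to escape through C's `iscntrl`. The function name `q_is_utf8` is made up for this example.

```lua
-- Hypothetical check: is the output of %q itself well-formed UTF-8?
local function q_is_utf8(s)
  -- utf8.len returns nil (plus a byte position) at the first invalid sequence.
  return utf8.len(string.format("%q", s)) ~= nil
end

print(q_is_utf8("Здравствуйте"))  -- input that is already UTF-8 text
print(q_is_utf8("\xFF"))          -- a lone byte that never occurs in UTF-8
```

With the stock interpreter this should print `true` and then `false`: `%q` passes bytes above 0x7F through unchanged, so the output is only valid UTF-8 when the input was.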


Re: [PATCH] Quoted String "%q" non-ascii escaping (w/ hex).

Duane Leslie
In reply to this post by Duane Leslie
> On 29 Jun 2017, at 03:18, Niccolo Medici <[hidden email]> wrote:

>
>> On 6/28/17, Duane Leslie <[hidden email]> wrote:
>> Hi,
>>
>> I had a problem with the quoted string format producing strings that
>> were not legal UTF-8 [...]
>
> What do you mean? I don't see any problem:
>
> `string.format("%q", "Здравствуйте")` returns '"Здравствуйте"'. That's
> valid UTF-8.

Attached is a UTF-8 aware version of the `addquoted` function.  I'm not sure if it really fits given that the UTF-8 behaviour is attached to a different library, but it might be useful to someone and it's better than my last proposal.

Also, I have noticed that the `utf8_decode` function passes the UTF-16 surrogates which are illegal codepoints, so this might also need to be fixed.

Regards,

Duane.




Attachment: addquoted.c (1K)

Lua utf8.len violates RFC 3629? (was Re: [PATCH] Quoted String "%q" non-ascii escaping (w/ hex).)

Jay Carlson
Let me back up a second. I believe

  o  Lua should be able to work with UTF-8 in some useful way;

  o  The support is for UTF-8, not Unicode;

  o  UTF-8 is an encoding for data exchange between systems, and is currently defined by RFC 3629;

  o  The core can't provide any support like string.lower outside of US-ASCII;

  o  Single-byte character sets like ISO-8859-1 may happen to work with your C locale functions, but that’s not a promise.

Please let me know if these beliefs are wrong.

On Jun 29, 2017, at 7:45 PM, Duane Leslie <[hidden email]> wrote:

> Also, I have noticed that the `utf8_decode` function passes the UTF-16 surrogates which are illegal codepoints, so this might also need to be fixed.

Ahh, now I remember why I kept my own UTF-8 validator. Lua’s behavior seems out of conformance with RFC 3629, and this isn’t just a SHOULD in the RFC, it’s a MUST.

Quoting https://tools.ietf.org/html/rfc3629#section-3 :

> The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF [...]
>
> Implementations of the decoding algorithm above MUST protect against decoding invalid sequences.  For instance, a naive implementation may decode the overlong UTF-8 sequence C0 80 into the character U+0000, or the surrogate pair ED A1 8C ED BE B4 into U+233B4.

Let’s see how we’re doing.

-- Print what utf8.len returns for t[2], followed by the expected result.
function u(t)
  if t.rfc then print("MUST from RFC 3629:") end
  print(t[1], utf8.len(t[2]))
  print("expected", table.unpack(t.expect))
  print()
end

u{"zero, encoded", "\xC0\x80",
  expect={nil, 1}, rfc=true}

u{"bad CESU pair", "\xED\xA1\x8C\xED\xBE\xB4",
  expect={nil, 1}, rfc=true}

u{"half pair", "\xED\xA1\x8C",
  expect={nil, 1}}

u{"half plus A", "\xED\xA1\x8C" .. "A",
  expect={nil, 1}}

u{"astral char", "\xEF\xBB\xBF\xF0\xA3\x8E\xB4",
  expect={1}, rfc=true}

===

MUST from RFC 3629:
zero, encoded   nil     1
expected        nil     1

MUST from RFC 3629:
bad CESU pair   2
expected        nil     1

half pair       1
expected        nil     1

half plus A     2
expected        nil     1

MUST from RFC 3629:
astral char     2
expected        1

===

Note that the surrogate behavior is explicitly called out in RFC 3629’s “Security Considerations”, https://tools.ietf.org/html/rfc3629#section-10 .

--
Jay Carlson
[hidden email]

Re: [PATCH] Quoted String "%q" non-ascii escaping (w/ hex).

Ross Berteig
In reply to this post by Duane Leslie



On 6/27/2017 6:56 PM, Duane Leslie wrote:
> Hi,
>
> I had a problem with the quoted string format producing strings that
> were not legal UTF-8 because it was not escaping non-ascii characters,
> and then once I fixed that I wasn't able to read the strings back in
> to a C program because C uses octal escapes and Lua uses decimal.

But string.format is not documented to produce a legal C string literal. See

   https://www.lua.org/manual/5.3/manual.html#pdf-string.format

where it says "The q option formats a string between double quotes, using escape sequences when necessary to ensure that it can safely be read back by the Lua interpreter." Note that it explicitly does not mention that the result could be safely read by a C compiler.
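
That documented guarantee can be exercised with a short round-trip sketch, assuming Lua 5.2 or later (where `load` accepts a string chunk). The name `survives_q` is hypothetical.

```lua
-- Hypothetical check: does a string come back unchanged after being
-- written with %q and read again by the Lua interpreter?
local function survives_q(s)
  local chunk = assert(load("return " .. string.format("%q", s)))
  return chunk() == s
end

print(survives_q("Здравствуйте"))        -- UTF-8 text
print(survives_q("\xFF\0\r\n\"\\"))      -- arbitrary bytes, quotes, backslash
```

Both calls should report `true`; the guarantee covers the bytes of the string, not the encoding of the intermediate text.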

Within string literals, Lua is already perfectly fine with UTF-8 content. It may also be fine with other extended ASCII forms, or with some other Unicode transformation formats, leaving most of those details up to the system outside of Lua. But don't use UTF-7. Just don't.

> This patch ensures all control and non-ascii characters are escaped,
> and uses the hexadecimal escape syntax instead of decimal to ensure
> compatibility between Lua and C.

Here you are using "Lua" to mean Lua 5.2 or later. The still widely used Lua 5.1 did not support hex escapes.

> Technically it is still not safe to pass the strings as literals
> directly into C because in C the hexadecimal production is not
> automatically terminated at two characters but I figured this was
> outside of the scope of the quoted string format specifier.  I solve
> this instead by using `:gsub([[%f[\]\x%x%x]],'%0""')` to terminate the
> hexadecimal escapes (triggering C's string literal concatenation
> behaviour) at the point of export.

You would be far better served by writing a Lua function that generates a proper C string literal and calling that instead of depending on string.format("%q") and additional processing.
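
As a rough illustration of such a function, here is a sketch that emits three-digit octal escapes, which C stops reading after three digits, so no literal-splitting trick is required. It assumes Lua 5.2 or later; `c_literal` is an invented name, and the choice to escape `?` (to sidestep trigraphs) is my own assumption rather than anything proposed in this thread.

```lua
-- Hypothetical helper: render a Lua string as a C string literal.
local function c_literal(s)
  local body = s:gsub(".", function(c)
    local b = c:byte()
    -- Escape anything outside printable ASCII, plus '"', '\' and '?'.
    if b < 32 or b > 126 or c == '"' or c == "\\" or c == "?" then
      return string.format("\\%03o", b)
    end
    -- Returning nothing keeps the original character.
  end)
  return '"' .. body .. '"'
end

print(c_literal("caf\xC3\xA9\n"))
```

The output is plain ASCII, so it also passes through strict UTF-8 tooling unchanged.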

Lua is designed to work well with C. It is also designed to be used by people who don't want to know anything about C or lower level programming issues.  While the choice of base-10 for \ddd escapes is occasionally a source of friction when switching back and forth between the languages, it is no worse than the choice of 1-based array indexing and numerous other details that differ.
-- 
Ross Berteig                               [hidden email]
Cheshire Engineering Corp.           http://www.CheshireEng.com/
+1 626 303 1602

Re: Lua utf8.len violates RFC 3629? (was Re: [PATCH] Quoted String "%q" non-ascii escaping (w/ hex).)

Ricardo Ramos Massaro
In reply to this post by Jay Carlson
On Fri, Jun 30, 2017 at 2:44 PM, Jay Carlson <[hidden email]> wrote:
> u{"astral char", "\xEF\xBB\xBF\xF0\xA3\x8E\xB4",
>   expect={1}, rfc=true}
>
> MUST from RFC 3629:
> astral char     2
> expected        1

These tests are nice examples of where Lua's utf8.len() diverges from
the RFC, but the last one confuses me.

It looks like that byte sequence encodes two code points: U+FEFF and
U+233B4. Do you mean to say that utf8.len() should not count U+FEFF
because it appears at the start of the string (and so should be
considered a BOM)? That doesn't look like it's mandated by the RFC,
and I don't think it would be a desired behavior for utf8.len().

- Ricardo


Re: Lua utf8.len violates RFC 3629? (was Re: [PATCH] Quoted String "%q" non-ascii escaping (w/ hex).)

Jay Carlson

> On Jun 30, 2017, at 9:59 PM, Ricardo Ramos Massaro <[hidden email]> wrote:
>
> On Fri, Jun 30, 2017 at 2:44 PM, Jay Carlson <[hidden email]> wrote:
>> u{"astral char", "\xEF\xBB\xBF\xF0\xA3\x8E\xB4",
>>  expect={1}, rfc=true}
>>
>> MUST from RFC 3629:
>> astral char     2
>> expected        1
>
> These tests are nice examples of where Lua's utf8.len() diverges from
> the RFC, but the last one confuses me.
>
> It looks like that byte sequence encodes two code points: U+FEFF and
> U+233B4.

Oops, that one confused me too! You are right, and I was not paying attention to what I was cutting and pasting. Thanks.

> Do you mean to say that utf8.len() should not count U+FEFF
> because it appears at the start of the string (and so should be
> considered a BOM)? That doesn't look like it's mandated by the RFC,
> and I don't think would be a desired behavior for utf8.len().

I agree. If there were to be any special behavior it would have been to return nil,1, since the BOM is a Unicode noncharacter.

I think knowledge of the BOM should stay outside the Lua core, on the basis that it is more a Unicode property than a UTF-8 property. Plus the BOM should die.

On the other hand, we could often use a little help with reading input anyway. A Lua library handling Unicode input processing could cover the BOM, plus deal with replacement characters in the face of invalid input. (See “Unicode Security Considerations”, UTR-36 http://www.unicode.org/reports/tr36/ for why doing this right can be important.)
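
A tiny sketch of the BOM half of that idea, kept outside the core as suggested. It assumes Lua 5.3 or later for `utf8.len`; `strip_bom` is a hypothetical name, and replacement-character handling is deliberately left out.

```lua
-- Hypothetical input helper: drop one UTF-8 encoded BOM (EF BB BF) if, and
-- only if, it is the very first thing in the input.
local function strip_bom(s)
  -- The outer parentheses drop gsub's second result (the match count).
  return (s:gsub("^\xEF\xBB\xBF", "", 1))
end

-- The "astral char" test string from earlier in the thread.
local input = "\xEF\xBB\xBF\xF0\xA3\x8E\xB4"
print(utf8.len(input), utf8.len(strip_bom(input)))
```

With the BOM removed, the remaining four bytes count as the single code point U+233B4.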

--
Jay Carlson
[hidden email]



Re: Lua utf8.len violates RFC 3629? (was Re: [PATCH] Quoted String "%q" non-ascii escaping (w/ hex).)

Daurnimator
In reply to this post by Jay Carlson
On 1 July 2017 at 03:44, Jay Carlson <[hidden email]> wrote:

> On Jun 29, 2017, at 7:45 PM, Duane Leslie <[hidden email]> wrote:
>
>> Also, I have noticed that the `utf8_decode` function passes the UTF-16 surrogates which are illegal codepoints, so this might also need to be fixed.
>
> Ahh, now I remember why I kept my own UTF-8 validator. Lua’s behavior seems out of conformance with RFC 3629, and this isn’t just a SHOULD in the RFC, it’s a MUST.
>
> Quoting https://tools.ietf.org/html/rfc3629#section-3 :
>
>> The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF [...]
>>
>> Implementations of the decoding algorithm above MUST protect against decoding invalid sequences.  For instance, a naive implementation may decode the overlong UTF-8 sequence C0 80 into the character U+0000, or the surrogate pair ED A1 8C ED BE B4 into U+233B4.

This is on purpose for interoperability. This diversion from utf8 is
even called out (twice) on the wikipedia article for utf8 as a common
choice
  - https://en.wikipedia.org/wiki/UTF-8#WTF-8
  - https://en.wikipedia.org/wiki/UTF-8#Invalid_code_points

Note that this does mean that if you need to validate a string is
totally valid utf8 then you need to check for surrogate pairs.
You can find an example of this in lua-http's websocket library:
https://github.com/daurnimator/lua-http/blob/0b54603bfc132dcb9add76a61d3b50b4439031b2/http/websocket.lua#L89
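
For readers who do not want to follow the link, a strict check along those lines might look like the sketch below. It is my own illustration and not the lua-http code; it assumes Lua 5.3 or later, and `is_strict_utf8` is an invented name.

```lua
-- Hypothetical validator: utf8.len rejects overlong and truncated sequences;
-- the extra pattern rejects encoded UTF-16 surrogates (U+D800..U+DFFF), which
-- always appear as lead byte ED followed by a byte in A0..BF.
local function is_strict_utf8(s)
  if not utf8.len(s) then return false end
  return s:find("\xED[\xA0-\xBF]") == nil
end

print(is_strict_utf8("\xC0\x80"))                  -- overlong zero
print(is_strict_utf8("\xED\xA1\x8C\xED\xBE\xB4"))  -- surrogate pair
print(is_strict_utf8("\xF0\xA3\x8E\xB4"))          -- U+233B4 encoded directly
```

The byte search is safe once `utf8.len` has succeeded, because 0xED can then only occur as the lead byte of a three-byte sequence.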


Re: [PATCH] Quoted String "%q" non-ascii escaping (w/ hex).

Duane Leslie
In reply to this post by Duane Leslie
> But don't use UTF-7. Just don't.

That was an oversight on my part which I fixed in a follow up email.  Part of the reason I sent it to the list was to get this kind of feedback.

> Here you are using "Lua" to mean Lua 5.2 or later. The still widely used Lua 5.1 did not support hex escapes.

That's fine; I'm not that interested in old versions, as I depend too much on the new features.

> You would be far better served by writing a Lua function that generates a proper C string literal and calling that instead of depending on string.format("%q") and additional processing.

I have a Lua runtime embedded in another application which I can modify as I see fit, and it's easier for me to extend or tweak existing functionality than write new functions.  I am currently running, I think, 8 modifications from the standard distribution.  I love how "hackable" Lua is.

Occasionally, if I have something that I think is either low impact or quite useful, I toss it out for feedback in the hope that there's some interest and that some of the features work their way into later versions.  I think Lua strings should be serialised as legal UTF-8, and since the hexadecimal and decimal escape formats are both typically going to be 4 characters and the hexadecimal format is more generally compatible, it should be preferred.  The strings produced by this patch can still be read back by Lua, but they'll survive travelling through a strict UTF-8 system without being clobbered by U+FFFD replacement characters, and they're more likely to be compatible with other programming languages too.
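
The same idea can be approximated in pure Lua without patching the runtime. The sketch below is not the attached patch: `quote_hex` is a hypothetical name, the exact set of escaped bytes is my assumption, and it needs Lua 5.2 or later for `\x` escapes and `load`.

```lua
-- Hypothetical stand-in for a hex-escaping %q: the output is pure ASCII and
-- can still be read back by Lua.
local function quote_hex(s)
  local body = s:gsub(".", function(c)
    local b = c:byte()
    if c == '"' or c == "\\" then
      return "\\" .. c                        -- keep the usual \" and \\
    elseif b < 32 or b > 126 then
      return string.format("\\x%02X", b)      -- control and non-ASCII bytes
    end
  end)
  return '"' .. body .. '"'
end

local original = "caf\xC3\xA9\0\n"
local quoted = quote_hex(original)
print(quoted)
print(load("return " .. quoted)() == original)
```

Lua always reads exactly two hex digits after `\x`, which is why the round trip works here even though the same text is ambiguous to a C compiler.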

Regards,

Duane.

Re: [PATCH] Quoted String "%q" non-ascii escaping (w/ hex).

Duane Leslie
In reply to this post by Duane Leslie
>> But don't use UTF-7. Just don't.
>
> That was an oversight on my part which I fixed in a follow up email.  Part of the reason I sent it to the list was to get this kind of feedback.

Wow, I just discovered that UTF-7 is actually a thing.  I had always assumed it was just a colloquialism for ASCII.  Yeah, I'm never going to touch that.  I meant that I had extended the ASCII support to also cover UTF-8.

Regards,

Duane

Re: Lua utf8.len violates RFC 3629? (was Re: [PATCH] Quoted String "%q" non-ascii escaping (w/ hex).)

Roberto Ierusalimschy
In reply to this post by Daurnimator
> > Quoting https://tools.ietf.org/html/rfc3629#section-3 :
> >
> >> The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF [...]
> >>
> >> Implementations of the decoding algorithm above MUST protect against decoding invalid sequences.  For instance, a naive implementation may decode the overlong UTF-8 sequence C0 80 into the character U+0000, or the surrogate pair ED A1 8C ED BE B4 into U+233B4.
>
> This is on purpose for interoperability. This diversion from utf8 is
> even called out (twice) on the wikipedia article for utf8 as a common
> choice
>   - https://en.wikipedia.org/wiki/UTF-8#WTF-8
>   - https://en.wikipedia.org/wiki/UTF-8#Invalid_code_points
>
> Note that this does mean that if you need to validate a string is
> totally valid utf8 then you need to check for surrogate pairs.
> You can find an example of this in lua-http's websocket library:
> https://github.com/daurnimator/lua-http/blob/0b54603bfc132dcb9add76a61d3b50b4439031b2/http/websocket.lua#L89

We are considering adding a "strict" option to 'utf8.len', to ensure the
above rules. Nevertheless, note that Lua does not decode the overlong
sequence C0-80 into U+0000, as it does not decode the pair ED-A1-8C
ED-BE-B4 into U+233B4.
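
Both points can be seen in a short sketch. It passes a fourth argument of `true`, which Lua 5.4 treats as its `lax` flag and Lua 5.3 simply ignores, so the results should be the same on either version; treat that cross-version behaviour as my assumption rather than something stated in this thread.

```lua
local overlong = "\xC0\x80"
local pair = "\xED\xA1\x8C\xED\xBE\xB4"

-- The overlong encoding of zero is rejected outright at byte 1.
print(utf8.len(overlong))

-- The CESU-style pair is read as two separate surrogate code points,
-- not combined into U+233B4.
print(utf8.len(pair, 1, -1, true))
print(string.format("U+%04X U+%04X", utf8.codepoint(pair, 1, -1, true)))
```

The last line should show U+D84C and U+DFB4, the two halves of the pair, which is the "lax" reading that a strict mode would refuse.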

-- Roberto