on invalid UTF-8 byte sequences

Stephan Hennig-2
Hi,

The manual says this about utf8.len():

    [...] If it finds any invalid byte sequence, returns a false
    value plus the position of the first invalid byte.

To me, this does not specify which byte counts as the first invalid byte
of an invalid byte sequence.  Is it the first byte that invalidates the
sequence, or the first byte of the whole invalid sequence?

Can an interested Lua user who has not carefully studied the Unicode
specs, which are an external resource, safely infer the output of

  $ lua -e "print(utf8.len('\xc3\xc4'))"

from the Lua manual alone?

The given string is the UTF-8 encoding of the German umlaut Ä (U+00C4,
bytes 0xc3 0x84), except that the second-most significant bit of the
second byte is flipped.  The first byte introduces a two-byte sequence,
but is no longer followed by a continuation byte.  As given, both bytes
are invalid: the first lacks its continuation byte, and the second is
itself a lead byte with nothing following it.  But which is the first
invalid one?
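
For anyone who wants to check the byte-level claims above, here is a
minimal sketch, assuming Lua 5.3 or later.  The helper names classify
and report are hypothetical and not part of the utf8 library, and the
classification looks only at the top two bits of each byte, so it is
not a full validity check.  It deliberately prints the result of
utf8.len() rather than stating it.

```lua
-- Assumes Lua 5.3+ (utf8 library and integer bitwise operators).
local valid   = "\xc3\x84"  -- "Ä" (U+00C4) encoded as UTF-8
local invalid = "\xc3\xc4"  -- same bytes, with bit 0x40 of the second byte flipped

-- The second byte of `invalid` differs from the valid one in exactly one bit.
assert((0x84 ~ 0x40) == 0xC4)

-- Hypothetical helper: rough byte class based on the top bits only.
local function classify(byte)
  if byte < 0x80 then
    return "ascii"
  elseif (byte & 0xC0) == 0x80 then
    return "continuation"
  else
    return "lead"
  end
end

-- Hypothetical helper: print each byte with its class, then whatever
-- utf8.len() returns for the whole string.
local function report(s)
  local parts = {}
  for i = 1, #s do
    local b = s:byte(i)
    parts[i] = string.format("%d:0x%02X(%s)", i, b, classify(b))
  end
  print(table.concat(parts, " "), "->", utf8.len(s))
end

report(valid)
report(invalid)
```

Running this under the interpreter of interest shows which position
that version of utf8.len() reports for the invalid string.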

Without giving a spoiler (hopefully): in this case[1], Lua seems to be
in line with the official Unicode/UTF-8 specs, which are clearer about
the handling of invalid UTF-8 material.  But a slight change in the Lua
manual could make it more self-contained.

Best regards,
Stephan Hennig

[1] In <http://lua-users.org/lists/lua-l/2015-12/msg00063.html>, it has
been reported that utf8.len() does not fully comply with the Unicode
specs with regard to flagging invalid UTF-8 input.  (Reproduced with Lua
5.3.4 here.)