Of Unicode in the next Lua version

25 messages
Of Unicode in the next Lua version

Pierre-Yves Gérardy
I just read Roberto's slides from the 2012 Lua workshop, and I have a
suggestion for the UTF-8 library.

It is efficient, and often practical, to deal with byte indices, even
in Unicode strings. It is the approach taken by Julia, and I use it in
LuLPeg. The API is simple:

    char, next_pos = getchar(subject, position)

    S = "∂ƒ"
    getchar(S, 1) --> '∂', 4
    getchar(S, 4) --> 'ƒ', 6
    getchar(S, 6) --> nil, nil

A similar function could return code points instead of strings.

What do you think about this?

-- Pierre-Yves

[0] http://www.lua.org/wshop12/Ierusalimschy.pdf
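
A minimal Lua sketch of this `getchar` API (my reconstruction, not LuLPeg's actual code; it assumes well-formed UTF-8 and ignores the obsolete 5- and 6-byte forms):

```lua
local byte, sub = string.byte, string.sub

-- Sequence length from the lead byte, assuming well-formed UTF-8.
local function seqlen(b)
    if b < 0x80 then return 1        -- ASCII
    elseif b < 0xE0 then return 2
    elseif b < 0xF0 then return 3
    else return 4 end
end

local function getchar(subject, position)
    if position > #subject then return nil, nil end
    local n = seqlen(byte(subject, position))
    return sub(subject, position, position + n - 1), position + n
end
```

With `S = "∂ƒ"` this reproduces the results above: `getchar(S, 1)` yields `'∂', 4` and `getchar(S, 6)` yields `nil, nil`.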

Re: Of Unicode in the next Lua version

Dirk Laurie-2
2013/6/12 Pierre-Yves Gérardy <[hidden email]>:

> I just read Roberto's slides from the 2012 Lua workshop, and I have a
> suggestion for the UTF-8 library.
>
> It is efficient, and often practical, to deal with byte indices, even
> in Unicode strings. It is the approach taken by Julia, and I use it in
> LuLPeg. The API is simple:
>
>     char, next_pos = getchar(subject, position)
>
>     S = "∂ƒ"
>     getchar(S, 1) --> '∂', 4
>     getchar(S, 4) --> 'ƒ', 6
>     getchar(S, 6) --> nil, nil
>
> A similar function could return code points instead of strings.
>
> What do you think about this?

If `pos` comes before `char`, one can write an iterator on the model
of `ipairs`:

    for pos,char in utf8(str) do ...
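
One way to sketch such an iterator is as a closure (stateful, unlike `ipairs`; `utf8_chars` is an illustrative name, and well-formed UTF-8 up to 4 bytes is assumed):

```lua
local function utf8_chars(s)
    local pos = 1
    return function()
        if pos > #s then return nil end
        local start = pos
        local b = s:byte(pos)
        -- sequence length from the lead byte
        local n = b < 0x80 and 1 or b < 0xE0 and 2 or b < 0xF0 and 3 or 4
        pos = pos + n
        return start, s:sub(start, pos - 1)
    end
end

-- for pos, char in utf8_chars("∂ƒ") do ... end  -- yields 1,'∂' then 4,'ƒ'
```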

Re: Of Unicode in the next Lua version

Pierre-Yves Gérardy
On Wed, Jun 12, 2013 at 5:02 PM, Dirk Laurie <[hidden email]> wrote:
> If `pos` comes before `char`, one can write an iterator on the model
> of `ipairs`:
>
>     for pos,char in utf8(str) do ...


Almost... but you end up with the position of the next character... So
you need some trickery. Assuming a valid UTF-8 string:

Usage:
    for finish, start, char in utf8_next_char, "˙†ƒ˙©√" do
        print(char)
    end

`start` and `finish` are the byte bounds of the character, and `char`
is the character itself as a UTF-8 string.
It produces:
    ˙
    †
    ƒ
    ˙
    ©
    √
    local s_byte, s_sub = string.byte, string.sub

    -- utf8_offset is assumed to map a lead byte to the number of
    -- continuation bytes that follow it (0 for ASCII, 1-5 otherwise).
    local function utf8_next_char (subject, i)
        i = i and i + 1 or 1
        if i > #subject then return end
        local offset = utf8_offset(s_byte(subject, i))
        return i + offset, i, s_sub(subject, i, i + offset)
    end

it has the annoying property of passing the end position before the
start position, but it is stateless.

-- Pierre-Yves

Re: Of Unicode in the next Lua version

Jay Carlson
In reply to this post by Pierre-Yves Gérardy
On Jun 12, 2013, at 9:53 AM, Pierre-Yves Gérardy wrote:

> I just read Roberto's slides from the 2012 Lua workshop, and I have a
> suggestion for the UTF-8 library.
>
> It is efficient, and often practical, to deal with byte indices, even
> in Unicode strings. It is the approach taken by Julia, and I use it in
> LuLPeg. The API is simple:
>
>    char, next_pos = getchar(subject, position)
>
>    S = "∂ƒ"
>    getchar(S, 1) --> '∂', 4

Don't forget getchar(S, 2) -> error("not defined at position 2").  I really like Julia's idea of strings as partial functions.

> A similar function could return code points instead of strings.

Would you use that much?

Miles Bader pointed out a lot of string iteration code is phrased in terms of gmatch--or should be. And in that case, there are no string positions at all. The major problem for UTF-8 then would be convincing the pattern matcher to consume an entire UTF-8 sequence for ".".

Jay

Re: Of Unicode in the next Lua version

Pierre-Yves Gérardy
On Thu, Jun 13, 2013 at 11:36 PM, Jay Carlson <[hidden email]> wrote:
> On Jun 12, 2013, at 9:53 AM, Pierre-Yves Gérardy wrote:
> Don't forget getchar(S, 2) -> error("not defined at position 2").  I really like Julia's idea of strings as partial functions.

I'd prefer getchar(S, 2) --> false, 3.

>> A similar function could return code points instead of strings.
>
> Would you use that much?

Yes, before I broke Unicode support in LuLPeg, that's what I was
using. It lets you check whether a character falls in a given range,
and it is barely slower than returning a sub-string (doing the
conversion in Lua). In LuaJIT, computing the code point with standard
arithmetic (modulo, division and floor) is faster than getting the
sub-string. It should be even faster by using the bit library.
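
The arithmetic version can be sketched like this (my reconstruction, not LuLPeg's actual code; assumes a well-formed sequence of up to 4 bytes):

```lua
local byte = string.byte

-- Decode the code point starting at byte index i using only
-- modulo and multiplication (no bit library).
local function getcodepoint(s, i)
    local b = byte(s, i)
    if b < 0x80 then return b end
    local n = b < 0xE0 and 2 or b < 0xF0 and 3 or 4
    local cp = b % 2 ^ (7 - n)               -- value bits of the lead byte
    for k = 1, n - 1 do
        cp = cp * 64 + byte(s, i + k) % 64   -- 6 bits per continuation byte
    end
    return cp
end
```

For example, `getcodepoint("∂", 1)` yields 0x2202.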

> Miles Bader pointed out a lot of string iteration code is phrased in terms of gmatch--or should be. And in that case, there are no string positions at all.

Well, in my case, it isn't, but an LPeg clone is probably not usual in
terms of string processing.

> The major problem for UTF-8 then would be convincing the pattern matcher to consume an entire UTF-8 sequence for ".".

In the 2012 Workshop presentation, Roberto talks about deprecating the
old patterns, so Unicode in gmatch will probably never see the light
of day... I don't know if/how he plans to handle Unicode in LPeg.

As posted in the other thread, I plan to tackle this in LuLPeg with
P8(), R8() and S8(), that will live alongside their byte-matching
cousins.

-- Pierre-Yves

Re: Of Unicode in the next Lua version

Dirk Laurie-2
2013/6/14 Pierre-Yves Gérardy <[hidden email]>:

> In the 2012 Workshop presentation, Roberto talks about deprecating the
> old patterns,

Could you specify on which slide that is? I can't find it.

Re: Of Unicode in the next Lua version

Craig Barnes
On 14 June 2013 09:58, Dirk Laurie <[hidden email]> wrote:
> 2013/6/14 Pierre-Yves Gérardy <[hidden email]>:
>
>> In the 2012 Workshop presentation, Roberto talks about deprecating the
>> old patterns,
>
> Could you specify on which slide that is? I can't find it.
>

It's mentioned in slide 6.

Re: Of Unicode in the next Lua version

Roberto Ierusalimschy
In reply to this post by Pierre-Yves Gérardy
> It is efficient, and often practical, to deal with byte indices, even
> in Unicode strings. It is the approach taken by Julia, and I use it in
> LuLPeg. The API is simple:
>
>     char, next_pos = getchar(subject, position)
>
>     S = "∂ƒ"
>     getchar(S, 1) --> '∂', 4
>     getchar(S, 4) --> 'ƒ', 6
>     getchar(S, 6) --> nil, nil
>
> A similar function could return code points instead of strings.
>
> What do you think about this?

You can already easily implement this `getchar' in standard Lua (except
that it assumes a well-formed string):

  S = "∂ƒ"
  print(string.match(S, "([^\x80-\xbf][\x80-\xbf]*)()", 1))  --> '∂', 4
  print(string.match(S, "([^\x80-\xbf][\x80-\xbf]*)()", 4))  --> 'ƒ', 6
  print(string.match(S, "([^\x80-\xbf][\x80-\xbf]*)()", 6))  --> nil

PiL3 discusses other patterns like this one.

Of course, the pattern "([^\x80-\xbf][\x80-\xbf]*)()" is not everyone's
cup of tea, but it does not sound very Lua-ish to "duplicate" this
functionality in a new function. Maybe we could have some patterns like
this one predefined in the library, like this:

   string.match(S, utf8.onechar, 1)

(The new library certainly would have a function to check whether a
utf-8 string is well formed.)

-- Roberto


Re: Of Unicode in the next Lua version

Pierre-Yves Gérardy
On Sat, Jun 15, 2013 at 3:52 PM, Roberto Ierusalimschy
<[hidden email]> wrote:
>
> You can already easily implement this `getchar' in standard Lua (except
> that it assumes a well-formed string):
>
>   S = "∂ƒ"
>   print(string.match(S, "([^\x80-\xbf][\x80-\xbf]*)()", 1))  --> '∂', 4
>   print(string.match(S, "([^\x80-\xbf][\x80-\xbf]*)()", 4))  --> 'ƒ', 6
>   print(string.match(S, "([^\x80-\xbf][\x80-\xbf]*)()", 6))  --> nil

Thanks for this pattern trick, it helped me improve my
`getcodepoint()` routine (although I eventually found a faster
method). A validation routine won't be of much help
if you deal with a document that sports multiple encodings.

If you want to validate characters on the fly and always get a
position as the second return value, you need something like the
function below. If the character is not valid, it returns
`false, position`. Past the end of the string, it returns
`nil, position`.

    local s_byte, s_match, s_sub = string.byte, string.match, string.sub

    function getchar(S, first)
        if #S < first then
            return nil, first
        end

        local match, next_pos = s_match(S, "^([^\128-\191][\128-\191]*)()", first)

        if not match then
            return false, first
        end

        local lead, n = s_byte(match), #match
        local success
            =  lead < 128 and n == 1
            or lead < 224 and n == 2
            or lead < 240 and n == 3
            or lead < 248 and n == 4
            or lead < 252 and n == 5
            or lead < 254 and n == 6
            -- UTF-16 surrogate code point checking left out for clarity.

        if success then
            return match, next_pos
        else
            return false, first
        end
    end


or this (idem in Lua 5.1/5.2, but twice as fast in LuaJIT, where
`gmatch()` is not compiled):


    function utf8_get_char_jit_valid2(subject, i)
        if i > #subject then
            return nil, i
        end

        local byte = s_byte(subject, i)

        if byte < 128 then
            return s_sub(subject, i, i), i + 1

        elseif byte < 192 then
            return false, i    -- continuation byte: invalid start

        elseif byte < 224 and s_match(subject, "^[\128-\191]", i + 1) then
            return s_sub(subject, i, i + 1), i + 2

        elseif byte < 240 and s_match(subject,
            "^[\128-\191][\128-\191]", i + 1) then
            return s_sub(subject, i, i + 2), i + 3

        elseif byte < 248 and s_match(subject,
            "^[\128-\191][\128-\191][\128-\191]", i + 1) then
            return s_sub(subject, i, i + 3), i + 4

        elseif byte < 252 and s_match(subject,
            "^[\128-\191][\128-\191][\128-\191][\128-\191]", i + 1) then
            return s_sub(subject, i, i + 4), i + 5

        elseif byte < 254 and s_match(subject,
            "^[\128-\191][\128-\191][\128-\191][\128-\191][\128-\191]", i + 1) then
            return s_sub(subject, i, i + 5), i + 6

        else
            return false, i
        end
    end


This is not that complex, but still rather slow in Lua, and the same
goes for getting the code point to perform a range query (useful to
test if a code point is part of some alphabet).

To that end, you could provide a `utf8.range(char, lower, upper)`, though.

This assumes you don't deprecate patterns in the next Lua version (or
the one after, to ease the transition?).

But I understand the need to balance features and light weight.
`getchar()` and `getcodepoint()` are damn useful to write parsers, but
if LPeg is part of the next version, the point is probably moot.

-- Pierre-Yves

Re: Of Unicode in the next Lua version

Jay Carlson
On Jun 15, 2013, at 2:13 PM, Pierre-Yves Gérardy wrote:

> On Sat, Jun 15, 2013 at 3:52 PM, Roberto Ierusalimschy
> <[hidden email]> wrote:
>>
>> You can already easily implement this `getchar' in standard Lua (except
>> that it assumes a well-formed string):
>>
>>  S = "∂ƒ"
>>  print(string.match(S, "([^\x80-\xbf][\x80-\xbf]*)()", 1))  --> '∂', 4
>>  print(string.match(S, "([^\x80-\xbf][\x80-\xbf]*)()", 4))  --> 'ƒ', 6
>>  print(string.match(S, "([^\x80-\xbf][\x80-\xbf]*)()", 6))  --> nil
>
> Thanks for this pattern trick, it helped me to improve my
> `getcodepoint()` routine (although I eventually found a faster
> method). A validation routine won't be of much help
> if you deal with a document that sports multiple encodings.

I tend to think, "strings are components of documents"; when using the XML Infoset as a model, encodings can be abstracted away. But concrete strings are different. In general, people don't have individual strings with multiple encodings; people have bugs.[1]

> If you want to validate the characters on the go and always get a position as
> second argument, you need something like this:
>
> If the character is not valid, it returns `false, position`. At the
> end of the stream, it returns nil, position + 1.

I don't understand where "false" instead of an error would be useful. Once you've decided to iterate over a string as UTF-8, it is a surprise when the string turns out not to be UTF-8, and it's unlikely your code will do anything useful. There could be a separate utf8.isvalid(s, [byteoffset [, bytelen]]) for when you're testing.

I am one of those "assert everything" fascists, though. Code encountering "false" in place of an expected string often blows up anyway (although convenience functions which auto-coerce to string can hide that). The question is how promptly the error is signaled.

>            --UTF-16 surrogate code point checking left out for clarity.

...plus the stuff over U+10FFFF...

> This is not that complex, but still rather slow in Lua, and the same
> goes for getting the code point to perform a range query (useful to
> test if a code point is part of some alphabet).
>
> To that, end, you could provide a `utf8.range(char, lower, upper)`, though.

UTF-8 is constructed such that Unicode code points are ordered lexicographically under 8-bit strcmp. So you can replace that with

function utf8.inrange(single_codepoint, lower_codepoint, upper_codepoint)
  return single_codepoint >= lower_codepoint and single_codepoint <= upper_codepoint
end

and you don't need to extract the codepoint from a longer string if you write "< upper_codepoint_plus_one"; this lets you test an arbitrary byte offset for range membership. All of these nice properties go to hell if invalid UTF-8 creeps in, though.

Languages like Lua tend to be very slow when operating character by character. I think there is some kind of map/collect primitive for working with codepoints which probably needs to be in C for speed. Because so many functions on Unicode are sparse, something like map_table[page][offset] is useful, especially if those tables have metatables which can answer with a function and optionally populate them lazily.
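
A sketch of such a lazily populated two-stage table (the names and the toy property are illustrative, not an existing library):

```lua
-- Two-stage lookup table for a sparse per-code-point property: pages of
-- 256 entries, built lazily by a metatable on first access.
local function make_property_table(classify)   -- classify(cp) -> value
    return setmetatable({}, {
        __index = function(t, page)
            local p = setmetatable({}, {
                __index = function(pt, off)
                    local v = classify(page * 256 + off)
                    pt[off] = v            -- memoize the answer
                    return v
                end,
            })
            t[page] = p
            return p
        end,
    })
end

-- illustrative property: "is an ASCII capital letter"
local is_upper = make_property_table(function(cp)
    return cp >= 65 and cp <= 90
end)
```

A lookup is then `is_upper[math.floor(cp / 256)][cp % 256]`, and each page is computed at most once.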

Jay

[1]: If somebody hands you a URL, you can't round-trip the %-encoding through Unicode; it must be preserved as US ASCII. Casual URL manipulation is full of string-typing bugs. I wrote a web server which used %23 in URLs. Broken web proxies would unescape the string, notice that there was now a "#foo", and truncate the URL at the mistaken fragment identifier.

Re: Of Unicode in the next Lua version

Tim Hill
The real problem is badly-formed UTF-8 .. and there is too much of it to just bail with errors. Some common oddities I have encountered:

-- UTF-16 surrogate pairs encoded as UTF-8 (rather than the underlying code point)
-- UTF-16 BOM encoded in UTF-8 (which of course has no use for a BOM)
-- Non-canonical UTF-8 encodings (for example, encoding in 5 bytes instead of 4)

To be honest, I'm not sure how I would approach an "IsValidUTF8()" function .. I always tend to fall back on the original TCP/IP philosophy: be rigorous in what you generate, and forgiving in what you accept.
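
One strict sketch of such a check, rejecting UTF-8-encoded surrogates, over-long (non-canonical) forms, truncated sequences, and anything above U+10FFFF. A BOM is left alone, since it is structurally valid UTF-8; the name is illustrative, not a reference implementation:

```lua
local byte = string.byte

local function is_valid_utf8(s)
    local i, n = 1, #s
    while i <= n do
        local b = byte(s, i)
        local len, min
        if b < 0x80 then len, min = 1, 0
        elseif b < 0xC2 then return false        -- stray continuation or over-long lead
        elseif b < 0xE0 then len, min = 2, 0x80
        elseif b < 0xF0 then len, min = 3, 0x800
        elseif b < 0xF5 then len, min = 4, 0x10000
        else return false end                    -- lead byte beyond U+10FFFF
        local cp = len == 1 and b or b % 2 ^ (7 - len)
        for k = 1, len - 1 do
            local c = byte(s, i + k)
            if not c or c < 0x80 or c > 0xBF then return false end
            cp = cp * 64 + c % 64
        end
        -- reject over-long forms, surrogates, and out-of-range values
        if cp < min or cp > 0x10FFFF or (cp >= 0xD800 and cp <= 0xDFFF) then
            return false
        end
        i = i + len
    end
    return true
end
```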

--Tim

On Jun 15, 2013, at 1:08 PM, Jay Carlson <[hidden email]> wrote:

I don't understand where "false" instead of an error would be useful. Once you've decided to iterate over a string as UTF-8, it is a surprise when the string turns out not to be UTF-8, and it's unlikely your code will do anything useful. There could be a separate utf8.isvalid(s, [byteoffset [, bytelen]]) for when you're testing.

Re: Of Unicode in the next Lua version

Paul K
Hi Tim,

> Some common oddities I have encountered:

I've also seen quotes copied from text with other encodings: ISO
8859-1 has grave and acute accents with codes 0x91 and 0x92 and
Windows CP1250 has single and double quotation marks with codes
0x91-0x94, with all these codes being invalid in UTF8 ("an unexpected
continuation byte").

> The real problem is badly-formed UTF-8 .. and there is too much of it to just bail with errors.

I've implemented fixUTF8 method in ZBS as described here:
http://notebook.kulchenko.com/programming/fixing-malformed-utf8-in-lua;
the same logic can be used for isValidUTF8(). It can probably be done
in a faster way, but it's been working well for me so far.

Paul.

On Sat, Jun 15, 2013 at 1:37 PM, Tim Hill <[hidden email]> wrote:

> The real problem is badly-formed UTF-8 .. and there is too much of it to
> just bail with errors. Some common oddities I have encountered:
>
> -- UTF-16 surrogate pairs encoded as UTF-8 (rather than the underlying code
> point)
> -- UTF-16 BOM encoded in UTF-8 (which of course has no use for a BOM)
> -- Non-canonical UTF-8 encodings (for example, encoding in 5 bytes instead
> of 4)
>
> To be honest, I'm not sure how I would approach an "IsValidUTF8()" function
> .. I always tend to fall back on the original TCP/IP philosophy: be rigorous
> in what you generate, and forgiving in what you accept.
>
> --Tim
>
> On Jun 15, 2013, at 1:08 PM, Jay Carlson <[hidden email]> wrote:
>
> I don't understand where "false" instead of an error would be useful. Once
> you've decided to iterate over a string as UTF-8, it is a surprise when the
> string turns out not to be UTF-8, and it's unlikely your code will do
> anything useful. There could be a separate utf8.isvalid(s, [byteoffset [,
> bytelen]]) for when you're testing.
>
>

Re: Of Unicode in the next Lua version

Jay Carlson
In reply to this post by Tim Hill
On Jun 15, 2013, at 4:37 PM, Tim Hill wrote:

> The real problem is badly-formed UTF-8 .. and there is too much of it to just bail with errors. Some common oddities I have encountered:
>
> -- UTF-16 surrogate pairs encoded as UTF-8 (rather than the underlying code point)
> -- UTF-16 BOM encoded in UTF-8 (which of course has no use for a BOM)
> -- Non-canonical UTF-8 encodings (for example, encoding in 5 bytes instead of 4)
>
> To be honest, I'm not sure how I would approach an "IsValidUTF8()" function .. I always tend to fall back on the original TCP/IP philosophy: be rigorous in what you generate, and forgiving in what you accept.

If you don't decide on ingress, IsValidUTF8() is still decided, but the definition will be a global property of the codebase.[1] Similarly, if you don't decide what to do with pseudo-UTF-8 surrogates ("CESU-8"), the program as a whole gets this knowledge smeared all over it.

RFC 3629 uses the normative "MUST protect against decoding invalid sequences"; its changelog from RFC 2279 is even stronger, saying, "Turned the note warning against decoding of invalid sequences into a normative MUST NOT."[2]

I think the right thing to do is to make the choice of how to be "forgiving in what you accept" explicit on ingress. Protocols are gonna have to nuke any BOM at start-of-stream anyway (well, the RFC has advice). There are a bunch of choices of what to do with CESU-8 and other coding errors, and they can have broad implications (notoriously including security-sensitive functions in a few old cases).

Regardless of whether you normalize on input, you will probably have to choose to do something if you will be rigorous in what you generate later.

I regularly do have to deal with documents impossible to interpret in any normal coded character set. And it's fun to watch line noise blow up Python 3 programs reading from serial ports. There are a lot of really dirty hacks I've written to recover meaning from arbitrary octet streams. I'd rather not think about it. If I can keep those octet-level hacks on the outer shell of the program, I can have a single consistent definition of UTF-8-ness on the inside.

I mostly do plumbing. People who spend more time working with *text* care about things like canonicalization forms; there are all kinds of correctness issues above the codepoint level. For most plumbing I can ignore them and treat Unicode as a stream of codepoints, since everybody working above that level[3] is already in a world of pain. I try not to make it worse.

Jay

[1]: Well, IsValidUTF8 may be defined in the same way much of C is: arbitrary program behavior. In Lua it won't bring down the runtime without help from C though.

[2]: Uh, WTF, IETF? Is the (presumptively non-normative) changelog making stronger claims about the normativeness of its subjects?

[3]: Above the codepoint level is where you need those character property tables to survive. I'm not counting on Lua getting them, even if slnunicode/tcl did get the size down. Canonicalization is big; the only way Lua would get that is with #ifdefs to haul in the platform code the same way dlopen has to work. Something like dynamic grapheme cluster coding does seem like a win for people who want to work at the "character" level, but I don't have enough experience to be sure.

Re: Of Unicode in the next Lua version

Pierre-Yves Gérardy
In reply to this post by Jay Carlson
On Sat, Jun 15, 2013 at 10:08 PM, Jay Carlson <[hidden email]> wrote:
> UTF-8 is constructed such that Unicode code points are ordered lexicographically under 8-bit strcmp. So you can replace that with
>
> function utf8.inrange(str single_codepoint, str lower_codepoint, str upper_codepoint)
>   return single_codepoint >= lower_codepoint and single_codepoint <= upper_codepoint;
> end

I hadn't realized this. I'm accreting knowledge as I go; I've yet to
rigorously explore Unicode... I find UTF-8 beautiful in lots of
regards. UTF-16 baffles me, though. Do you know why they reserved
code points, which are supposed to correspond to symbols, for the
implementation details of an encoding? I wish there was a UTF-16'
that followed the UTF-8 strategy.

> and you don't need to extract the codepoint from a longer string if you write "< upper_codepoint_plus_one"; this lets you test an arbitrary byte offset for range membership.

I don't understand what you mean here :-/

-- Pierre-Yves

Re: Of Unicode in the next Lua version

Peter Cawley
On Sat, Jun 15, 2013 at 11:56 PM, Pierre-Yves Gérardy <[hidden email]> wrote:
> UTF-16 baffles me, though. Do you know why they reserved
> codepoints, which are supposed to correspond to symbols, to the
> implementation details of an encoding?

Presumably to provide a migration path for UCS-2.

Re: Of Unicode in the next Lua version

Pierre-Yves Gérardy
In reply to this post by Jay Carlson
On Sun, Jun 16, 2013 at 12:05 AM, Jay Carlson <[hidden email]> wrote:
> If you don't decide on ingress, IsValidUTF8() is still decided, but the definition will be a global property of the codebase.[1] Similarly, if you don't decide what to do with pseudo-UTF-8 surrogates ("CESU-8"), the program as a whole gets this knowledge smeared all over it.
> [...snip...]

Reading this, I realize how out of my depth I am with regards to Unicode...

> For most plumbing I can ignore them and treat Unicode as a stream of codepoints, since everybody working above that level[3] is already in a world of pain. I try not to make it worse.

That's basically what I planned to do with LuLPeg: allow defining
leaf patterns as UTF-8 strings, ranges and sets of characters, plus a
special value to detect encoding errors.

I might also add a constructor that takes any indexable value. It
could receive two-stage tables for character classes, if someone (not
me!) were to implement them in Lua...

-- Pierre-Yves

Re: Of Unicode in the next Lua version

Jay Carlson
In reply to this post by Pierre-Yves Gérardy
On Jun 15, 2013, at 6:56 PM, Pierre-Yves Gérardy wrote:

> On Sat, Jun 15, 2013 at 10:08 PM, Jay Carlson <[hidden email]> wrote:
>> UTF-8 is constructed such that Unicode code points are ordered lexicographically under 8-bit strcmp. So you can replace that with
>>
>> function utf8.inrange(str single_codepoint, str lower_codepoint, str upper_codepoint)
>>  return single_codepoint >= lower_codepoint and single_codepoint <= upper_codepoint;
>> end
>
> I hadn't realized this. I'm accreting knowledge as I go; I've yet to
> rigorously explore Unicode... I find UTF-8 beautiful in lots of
> regards. UTF-16 baffles me, though. Do you know why they reserved
> code points, which are supposed to correspond to symbols, for the
> implementation details of an encoding? I wish there was a UTF-16'
> that followed the UTF-8 strategy.

Originally, Unicode was sold as "double-wide ASCII". "All you have to do to support the world's scripts is use 2-byte characters." Then they decided 64k codepoints was *not* enough for everyone. Before they ran out of space, they allocated the surrogate blocks to let existing software using 2-byte-characters have access to the other planes. It's a good design given the constraints.

UCS-2 was attractive; all codepoints were a fixed size. Adding UTF-16 to support the astral planes meant codepoints were *not* all the same size, and once you had to deal with that, other variable-width codes like UTF-8 were more competitive.
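
The surrogate machinery itself is plain arithmetic over a reserved range; a sketch of the code-point-to-pair mapping (illustrative name):

```lua
-- Split a code point above U+FFFF into a UTF-16 surrogate pair.
local function to_surrogates(cp)
    local v = cp - 0x10000                       -- 20 bits to distribute
    local hi = 0xD800 + math.floor(v / 0x400)    -- top 10 bits
    local lo = 0xDC00 + v % 0x400                -- low 10 bits
    return hi, lo
end
```

For example, U+1F600 maps to the pair 0xD83D, 0xDE00.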

>> and you don't need to extract the codepoint from a longer string if you write "< upper_codepoint_plus_one"; this lets you test an arbitrary byte offset for range membership.
>
> I don't understand what you mean here :-/

I'm sorry, that really was cryptic. Let's look at an ASCII range tester: is the first character of this string a capital letter?

first_member = "A"
last_member = "Z"
after_last = string.char(string.byte(last_member)+1)

-- outside the range
assert(not( "" >= first_member ))
assert(not( "@home" >= first_member ))

-- inside the range
assert( "Alphabet" >= first_member )
assert( "Yow" < last_member )
assert( "Zymurgy" < after_last )

-- outside
assert(not( "[" < after_last ))

Any string inside the range must be strictly less than the very first value after the range.

This same property is true of UTF-8 strings. Any string starting with a codepoint inside the range will be strictly less than the string consisting of the first codepoint outside the range. You can test membership of the codepoint at any byte-index without extracting it. You could use C code like this:

  strcmp(s+7, first_member) >= 0 && strcmp(s+7, after_last) < 0

as long as you know s+7 is a valid offset inside the string. (This mostly works for invalid offsets too.)
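
The same test works for UTF-8 directly in Lua, whose string comparison is byte-wise; here with the Greek and Coptic block U+0370..U+03FF (bounds hand-encoded, names illustrative):

```lua
local first = "\205\176"       -- U+0370 in UTF-8
local after_last = "\208\128"  -- U+0400, first code point past the block

-- Is the code point starting at byte offset i inside the block?
local function in_greek_block(s, i)
    local suffix = s:sub(i)    -- no decoding needed
    return suffix >= first and suffix < after_last
end

assert(in_greek_block("aπb", 2))      -- "π" is U+03C0
assert(not in_greek_block("aπb", 1))  -- "a" is below the range
```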

Jay

Re: Of Unicode in the next Lua version

David Heiko Kolf-2
In reply to this post by Tim Hill
Tim Hill wrote:

> The real problem is badly-formed UTF-8 .. and there is too much of it to
> just bail with errors. Some common oddities I have encountered:
>
> -- UTF-16 surrogate pairs encoded as UTF-8 (rather than the underlying
> code point)
> -- UTF-16 BOM encoded in UTF-8 (which of course has no use for a BOM)
> -- Non-canonical UTF-8 encodings (for example, encoding in 5 bytes
> instead of 4)
>
> To be honest, I'm not sure how I would approach an "IsValidUTF8()"
> function .. I always tend to fall back on the original TCP/IP
> philosophy: be rigorous in what you generate, and forgiving in what you
> accept.

The BOM in UTF-8 is mainly annoying for plain ASCII applications where
UTF-8 should be transparent in strings.  But as far as I remember it is
not invalid UTF-8 (though its only use is to show that text is indeed
UTF-8).  A Unicode-aware application can just ignore it.

The last point, the non-canonical UTF-8 encodings, is actually a huge
security risk that already opened holes in the field.

UTF-8 is quite often used as just an extension to ASCII (which it was
meant to be) and so some filters checked that URLs don't use "../" to
access upper directories.  They did not check all the non-canonical ways
of encoding dots and slashes so these paths went through the filter.  At
some point (I guess just before the OS API) the UTF-8 was converted
"forgivingly" to UTF-16 and suddenly the dangerous paths were used.

That is the reason why the standards say that the conversion of UTF-8 to
codepoints must not tolerate non-canonical encodings but either reject
the string completely or put some codepoint in there that signals an
encoding error (though I do not know which codepoint that was).

Best regards,

David Kolf

Re: Of Unicode in the next Lua version

Miles Bader-2
David Heiko Kolf <[hidden email]> writes:
> The BOM in UTF-8 is mainly annoying for plain ASCII applications where
> UTF-8 should be transparent in strings.  But as far as I remember it is
> not invalid UTF-8 (though its only use is to show that text is indeed
> UTF-8).  A Unicode-aware application can just ignore it.

What's annoying (and stupid) is applications that _require_ a BOM for
UTF-8 files (e.g. MS apps often do this), as this breaks one of
UTF-8's main advantages, its usability with apps that _don't_
specifically support it.

-Miles

--
Hers, pron. His.

Re: Of Unicode in the next Lua version

Jay Carlson
In reply to this post by David Heiko Kolf-2
On Jun 16, 2013, at 7:30 AM, David Heiko Kolf wrote:

> The BOM in UTF-8 is mainly annoying for plain ASCII applications where
> UTF-8 should be transparent in strings.  But as far as I remember it is
> not invalid UTF-8 (though its only use is to show that text is indeed
> UTF-8).  An Unicode-aware application can just ignore it.

Yeah, but it should be dropped to avoid ZWNBSPs just randomly littering the interior of concatenated texts, complicating things like search. It's harmless for applications which understand semantics above the codepoint level but I'm proceeding with the assumption Lua will not have those.

I think there may be some advantages to declaring U+FEFF to be not-valid even though it technically is; it has no business being in the interior of any recently generated text. I'll go see if my codebase vomits at the idea....

http://www.unicode.org/faq/utf_bom.html#bom6

Jay