Behavior of utf8.offset overflow

Hisham
The index argument of string.sub overflows gracefully and gives me an
empty string:

> string.sub("hello", 10)

> string.sub("hello", 5)
o

To match on UTF-8 characters, however, I need to convert the index via
utf8.offset.

> string.sub("héllo", 5)
lo
> string.sub("héllo", utf8.offset("héllo", 5))
o

However, the combination of string.sub and utf8.offset is not as
forgiving as the ASCII version:

> string.sub("héllo", utf8.offset("héllo", 10))
stdin:1: bad argument #2 to 'sub' (number expected, got nil)
stack traceback:
    [C]: in function 'string.sub'
    stdin:1: in main chunk
    [C]: in ?

Using utf8.offset to convert character indices to byte indices for
string.find produces an even more surprising result:

> string.find("hello", "o", 10)
nil

When we begin our search past the end of the string, we simply don't
find what we're looking for, as expected.

> string.find("hello", "o", utf8.offset("hello", 10))
5    5

When we replace the init argument with utf8.offset(s, init), we now
get a nil in case of overflow. A nil init is treated as absent and
defaults to 1, so the search silently starts from the beginning of
the string.

Since the utility of utf8.offset is essentially to give UTF-8-aware
indices to the other string.* functions, it might be nicer if
utf8.offset(s, i) returned #s+1 when the desired offset is past the
end of the string, instead of the current nil. This would make
utf8.offset a drop-in replacement for the indices in all string
functions.
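
Until something like that exists, the proposed behavior can be
approximated in user code. What follows is only a minimal sketch,
assuming Lua 5.3 or later (for the utf8 library) and a source file
saved as UTF-8; the helper name offset_or_end is hypothetical and not
part of the standard library.

```lua
-- Hypothetical helper (not in Lua's standard library): behaves like
-- utf8.offset, but yields #s + 1 instead of nil when character i
-- lies past the end of s.
local function offset_or_end(s, i)
  return utf8.offset(s, i) or (#s + 1)
end

print(string.sub("héllo", offset_or_end("héllo", 5)))         -- o
print(string.sub("héllo", offset_or_end("héllo", 10)))        -- (empty string)
print(string.find("hello", "o", offset_or_end("hello", 10)))  -- nil
print(string.byte("hello", offset_or_end("hello", 10)))       -- (no values)
```

With the fallback in place, each call behaves like its plain ASCII
counterpart when the index runs past the end of the string.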

(Side note: string.byte is a bit funny in that it returns _no value_
in case of overflow (not nil)

> string.byte("hello", 10)
> type(string.byte("hello", 10))
stdin:1: bad argument #1 to 'type' (value expected)
stack traceback:
    [C]: in function 'type'
    stdin:1: in main chunk
    [C]: in ?
> x = string.byte("hello", 10); print(x)
nil

It's not clear to me what are the criteria for functions to return no
value versus nil -- as seen above, string.find returns an explicit
nil.)

In any case, this is what we get if we don't check for nil with the
current behavior:

> string.byte("hello", utf8.offset("hello", 10))
104

For completeness, the other string function that uses an index is
string.match, and it behaves like string.find:

> string.match("hello", "o", 10)
nil
> string.match("hello", "o", utf8.offset("hello", 10))
o

I'm not sure if this is an improvement worth breaking compatibility
for, but I thought I'd share the observation. Adding extra nil-checks
in the code is annoying (and forgetting them is easy).

-- Hisham

Re: Behavior of utf8.offset overflow

Roberto Ierusalimschy
> (Side note: string.byte is a bit funny in that it returns _no value_
> in case of overflow (not nil)
>
> > string.byte("hello", 10)
> > type(string.byte("hello", 10))
> stdin:1: bad argument #1 to 'type' (value expected)
> stack traceback:
>     [C]: in function 'type'
>     stdin:1: in main chunk
>     [C]: in ?
> > x = string.byte("hello", 10); print(x)
> nil
>
> It's not clear to me what are the criteria for functions to return no
> value versus nil -- as seen above, string.find returns an explicit
> nil.)

string.byte has a variable number of returns:

> string.byte("hello", 3, 5)
  --> 108 108 111
> string.byte("hello", 5, 5)
  --> 111
> string.byte("hello", 6, 5)
  --> (nothing)

If there are no results to be returned, it returns nothing.

-- Roberto
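
One way to observe the difference between "no value" and an explicit
nil is to count the returned values with select("#", ...). This is
just an illustrative sketch using standard library calls only:

```lua
-- select("#", ...) counts the values actually passed to it, so it can
-- tell "no value" apart from an explicit nil.
print(select("#", string.byte("hello", 3, 5)))  -- 3
print(select("#", string.byte("hello", 6, 5)))  -- 0
print(select("#", string.find("hello", "a")))   -- 1 (a single nil)
```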

Re: Behavior of utf8.offset overflow

Hisham
On 24 April 2017 at 13:32, Roberto Ierusalimschy <[hidden email]> wrote:

>> (Side note: string.byte is a bit funny in that it returns _no value_
>> in case of overflow (not nil)
>>
>> > string.byte("hello", 10)
>> > type(string.byte("hello", 10))
>> stdin:1: bad argument #1 to 'type' (value expected)
>> stack traceback:
>>     [C]: in function 'type'
>>     stdin:1: in main chunk
>>     [C]: in ?
>> > x = string.byte("hello", 10); print(x)
>> nil
>>
>> It's not clear to me what are the criteria for functions to return no
>> value versus nil -- as seen above, string.find returns an explicit
>> nil.)
>
> string.byte has a variable number of returns:
>
>> string.byte("hello", 3, 5)
>   --> 108       108     111
>> string.byte("hello", 5, 5)
>   --> 111
>> string.byte("hello", 6, 5)
>   --> (nothing)
>
> If there are no results to be returned, it returns nothing.

Got it. Following that logic, I guessed that table.unpack({}) would
behave the same, and indeed it does. :)

Also, I had never noticed before that parentheses coerce nothing to
nil. Learn something new every day, even in a small language like Lua!

> type((string.byte("hello", 10)))
nil

-- Hisham

Re: Behavior of utf8.offset overflow

Jonathan Goble
On Mon, Apr 24, 2017 at 10:22 PM Hisham <[hidden email]> wrote:
> Also, I had never noticed before that parentheses coerce nothing to
> nil. Learn something new every day, even in a small language like Lua!
>
> > type((string.byte("hello", 10)))
> nil
>
> -- Hisham

More properly, parentheses adjust an arbitrary number of values to
exactly one value. Zero values become nil, and with multiple values
the second and subsequent ones are discarded. Example:

> string.byte("hello", 3, 5)
108 108 111
> (string.byte("hello", 3, 5))
108
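
The same adjustment shows up wherever a call appears last in an
expression list, such as a table constructor. A small sketch (the
variable names t1 and t2 are only for illustration):

```lua
-- A call in the last position of a table constructor contributes all
-- of its results; wrapping it in parentheses keeps exactly one.
local t1 = { string.byte("hello", 3, 5) }    -- three values
local t2 = { (string.byte("hello", 3, 5)) }  -- only the first value

print(#t1, #t2)                                 -- 3    1
print(select("#", (string.byte("hello", 10))))  -- 1 (nothing adjusted to nil)
```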

Re: Behavior of utf8.offset overflow

Dirk Laurie-2
In reply to this post by Roberto Ierusalimschy
2017-04-24 18:32 GMT+02:00 Roberto Ierusalimschy <[hidden email]>:

> If there are no results to be returned, it returns nothing.

This is a good habit to follow. I have noticed "return nil" in some
published code, some of which may even be mine. I'll be more
watchful from now on!

Re: Behavior of utf8.offset overflow

Hisham
On 25 April 2017 at 03:31, Dirk Laurie <[hidden email]> wrote:
> 2017-04-24 18:32 GMT+02:00 Roberto Ierusalimschy <[hidden email]>:
>
>> If there are no results to be returned, it returns nothing.
>
> This is a good habit to follow. I have noticed "return nil" in some
> published code, some of which may even be mine. I'll be more
> watchful from now on!

Using `return nil` is not a code smell. Please do not turn a specific
use-case into a blanket recommendation that turns into a cargo-cult.
Returning an explicit nil is a perfectly fine marker for error
conditions; it is not "considered harmful". In case you need an
appeal-to-authority to be convinced of that, Roberto does it in the
return of string.find() when a match is not found. One could just as
well make the case that in string.find("hello", "a") there are no
results to be returned (nothing was found by the 'find' function,
after all), but that function returns an explicit nil, so in the end
that observation would still be open to interpretation by the API
author.

A more specific recommendation that matches the use-case of
string.byte is to return nothing in functions with variadic results
when the arity of the return is 0. Note that this is also the result
we nicely get when we use an idiom such as `return
table.unpack(results)`.

-- Hisham
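
The two conventions discussed in this last message can be put side by
side. This is only a sketch: the functions evens and first_even are
hypothetical, invented here for illustration, and table.unpack
assumes Lua 5.2 or later.

```lua
-- Variadic result: returns every even number among its arguments.
-- With no matches, table.unpack(results) returns nothing at all.
local function evens(...)
  local results = {}
  for _, n in ipairs({...}) do
    if n % 2 == 0 then results[#results + 1] = n end
  end
  return table.unpack(results)
end

-- Single result: an explicit nil marks "not found".
local function first_even(...)
  for _, n in ipairs({...}) do
    if n % 2 == 0 then return n end
  end
  return nil
end

print(select("#", evens(1, 2, 3, 4)))  -- 2
print(select("#", evens(1, 3)))        -- 0
print(select("#", first_even(1, 3)))   -- 1
```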