Default UTF-8 encoding in strings

Robert Virding-2
This is for Lua 5.3.4

When exactly are characters/bytes interpreted as UTF-8 in literal strings? It seems that when you write a literal string that includes a Unicode character, the character is inserted into the string as its UTF-8 encoding, even if its code point is small enough to fit in one byte. For example, the string "aäb" has the bytes 97, 195, 164, 98, even though the ä character has the value 228 and so could fit in a byte. The same happens when printing a string: if it contains a legal UTF-8 sequence, the corresponding Unicode character is printed, whereas a byte with the value 228 is printed as ?.

While the examples in both the Lua 5.3 reference manual and the latest Lua book show this, the actual rules are not stated anywhere. At least I cannot find them.

As with my last string question, I am interested because I am implementing Lua and want it to behave the same way.

Robert

Re: Default UTF-8 encoding in strings

Hugo Musso Gualandi



>When exactly are characters/bytes interpreted as UTF-8 in literal
>strings?

This should depend solely on the text encoding your text editor used when saving the ".lua" file. If the text editor saves the file as UTF-8, then your Lua strings will be encoded as UTF-8 as well.

From the point of view of Lua, a string is just an array of bytes. Lua isn't aware of UTF-8 or any other character encoding, for that matter.
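
A quick way to see this is to dump the bytes of two strings that differ only in how the ä was stored. This is a minimal sketch, assuming Lua 5.3 or later (for the `\u{...}` escape and the `utf8` library); the variable names are only illustrative. Escapes are used instead of a literal ä so that the result does not depend on how this snippet itself gets saved.

```lua
-- "\u{E4}" asks the lexer to emit the UTF-8 encoding of U+00E4,
-- which is what a UTF-8 editor stores for a literal "ä".
local utf8_str = "a\u{E4}b"

-- "\228" is one raw byte, which is what a Latin-1 editor would store.
local latin1_str = "a\228b"

-- string.byte returns the raw bytes; no decoding is involved.
print(string.byte(utf8_str, 1, -1))    --> 97  195  164  98
print(string.byte(latin1_str, 1, -1))  --> 97  228  98

-- The length operator counts bytes, not characters.
print(#utf8_str, #latin1_str)          --> 4  3

-- Only the utf8 library interprets the bytes as UTF-8.
print(utf8.len(utf8_str))              --> 3
print(utf8.len(latin1_str))            --> nil  2  (byte 2 is not valid UTF-8)
```

What `print` shows for either string is up to the terminal: Lua writes the bytes unchanged, and a UTF-8 terminal will typically render the lone byte 228 as a replacement character.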

Re: Default UTF-8 encoding in strings

Sean Conner
In reply to this post by Robert Virding-2
It was thus said that the Great Robert Virding once stated:

> This is for Lua 5.3.4
>
> When exactly are characters/bytes interpreted as UTF-8 in literal strings?
> It seems that when you write a literal string that includes a Unicode
> character, the character is inserted into the string as its UTF-8 encoding,
> even if its code point is small enough to fit in one byte. For example, the
> string "aäb" has the bytes 97, 195, 164, 98, even though the ä character has
> the value 228 and so could fit in a byte. The same happens when printing a
> string: if it contains a legal UTF-8 sequence, the corresponding Unicode
> character is printed, whereas a byte with the value 228 is printed as ?.

  UTF-8 always encodes characters with code points above 127 as multiple
bytes (two or more), so ä (code point 228) becomes the two bytes 195, 164,
and a lone byte 228 is not a valid UTF-8 sequence.  A good starting place (I
think) is the Wikipedia page on UTF-8 encoding:

        https://en.wikipedia.org/wiki/UTF-8
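
  To make the multi-byte rule concrete, here is a rough sketch of the two-byte case only (code points U+0080 through U+07FF), assuming Lua 5.3 or later for the integer bitwise operators. The helper name encode_two_byte is made up for this example; it is not part of Lua.

```lua
-- Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes.
-- Layout: 110xxxxx 10yyyyyy, where xxxxx are the top five bits of the
-- code point and yyyyyy are the low six bits.
local function encode_two_byte(cp)
  assert(cp >= 0x80 and cp <= 0x7FF, "code point outside the two-byte range")
  local lead = 0xC0 | (cp >> 6)     -- 110xxxxx
  local cont = 0x80 | (cp & 0x3F)   -- 10yyyyyy
  return string.char(lead, cont)
end

local s = encode_two_byte(228)      -- 228 is U+00E4, the ä from the question
print(string.byte(s, 1, -1))        --> 195  164
print(s == utf8.char(228))          --> true
```

  For 228 (binary 11100100) the top bits 11 go into the lead byte, giving 0xC3 = 195, and the low six bits 100100 go into the continuation byte, giving 0xA4 = 164, which matches the bytes seen in "aäb".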

  -spc