luaL_loadfile doesn't like the UTF-8 BOM

Simon Orde
Hi - for reasons discussed earlier (see "Future Plans for Lua and Unicode"), I want to allow Lua scripts to be encoded in either ANSI or UTF-8.  Legacy scripts are currently ANSI, but with future ones, script authors will be able to choose between ANSI or UTF-8 (for interest, I plan to say that the encoding of strings passed to/returned from my app's API must match/will match the encoding of the script itself).  I use a UTF-8 BOM to allow me to distinguish between ANSI and UTF-8 scripts.  This seems to work fine for a simple script.  I use lua_load with my own supplied reader to load each script, check the BOM (which I need to do anyway, so I know how to handle strings), and then jump past the BOM if there is one. 
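
A minimal sketch of the BOM check described above, assuming the script bytes are already in memory before they are fed to the lua_load reader. The helper name `utf8_bom_length` is hypothetical and not part of the Lua API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 3 if buf starts with the UTF-8 BOM
   (bytes EF BB BF), otherwise 0. The caller would add the result to
   the buffer pointer before handing the chunk to its reader. */
static size_t utf8_bom_length(const char *buf, size_t len)
{
    static const char bom[3] = { '\xEF', '\xBB', '\xBF' };

    assert(buf != NULL || len == 0);
    if (len >= 3 && memcmp(buf, bom, 3) == 0)
        return 3;
    return 0;
}
```

A non-zero result doubles as the "this script is UTF-8" flag, so the same call could decide both where the chunk starts and how strings are to be handled.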
 
Script authors can also write Lua modules, however, and by default these are loaded by luaL_loadfile, which doesn't like the UTF-8 BOM and throws an error. This isn't actually that serious a problem, because one simple solution would be for me to specify that modules must always be encoded in ANSI. However, if possible, I would prefer to allow modules to be encoded in either UTF-8 or ANSI too (the rule about string encoding matching script encoding would not apply to modules). Can anyone suggest a way that I can do this, while still retaining the UTF-8 BOM? I had a look at the package.loaders section in the manual, but this seems to only provide a way to have a module-specific loader, whereas my requirement applies to all modules.
 
Simon

Re: luaL_loadfile doesn't like the UTF-8 BOM

Jerome Vuarand
2012/7/9 Simon Orde <[hidden email]>:

> Script authors can also write Lua modules, however, and by default these are
> loaded by luaL_loadfile, which doesn't like the UTF8-BOM and throws an
> error.   This isn't actually that serious a problem, because one simple
> solution would be for me to simply specify that modules must always be
> encoded in ANSI.  However, if possible, I would prefer to allow modules to
> be encoded in either UTF-8 or ANSI too (the rule about string encoding
> matching script encoding would not apply to modules).  Can anyone suggest a
> way that I can do this, while still retaining the UTF-8 BOM?  I had a look
> at the package.loaders section in the manual, but this seems to only provide
> a way to have a module-specific loader, whereas my requirement applies to
> all modules.

"package.loaders" doesn't contain module-specific loaders, that's the
role of "package.preload". "package.loaders" is an array of
"searchers", which are functions that take a module name and return
the module initialization function (which is called a "loader"). These
"searchers" and "loaders" are used by the require function.

You can write a "searcher" that looks at the BOM, skips it, and compiles
the module (but does not run it) with the proper load function. I put
an example of a "searcher" written in Lua on the wiki a while ago:

http://lua-users.org/wiki/LuaModulesLoader

What you have to do is replace the basic "return
assert(loadstring(assert(file:read("*a")), filename))" with something
specific to your situation (BOM skipping). For examples of "searchers"
written in C, look at the Lua source code (IIRC Lua itself has 4
predefined "searchers").

Note that all this applies to Lua 5.1, but I believe it wasn't changed
much in 5.2.
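
For a searcher written in C, the file-reading half might look like the sketch below. It is only an illustration: `read_module_source` is a hypothetical name, and the Lua side is left out so the block has no dependency on the Lua headers.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: reads the whole file into a malloc'd buffer.
   On success returns the buffer (the caller frees it), stores the total
   size in *size and the offset of the first byte after any UTF-8 BOM in
   *start. Returns NULL if the file cannot be opened or read. */
static char *read_module_source(const char *filename, size_t *start, size_t *size)
{
    FILE *f;
    char *buf = NULL;
    size_t cap = 0, len = 0, n;
    char tmp[4096];

    assert(filename != NULL && start != NULL && size != NULL);
    f = fopen(filename, "rb");
    if (f == NULL) return NULL;
    while ((n = fread(tmp, 1, sizeof tmp, f)) > 0) {
        if (len + n > cap) {
            char *grown;
            cap = (len + n) * 2;
            grown = realloc(buf, cap);
            if (grown == NULL) { free(buf); fclose(f); return NULL; }
            buf = grown;
        }
        memcpy(buf + len, tmp, n);
        len += n;
    }
    if (ferror(f)) { free(buf); fclose(f); return NULL; }
    fclose(f);
    if (buf == NULL) {              /* empty file: still return a valid buffer */
        buf = malloc(1);
        if (buf == NULL) return NULL;
    }
    *size = len;
    *start = (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0) ? 3 : 0;
    return buf;
}
```

The searcher would then pass `buf + start` and `size - start` to luaL_loadbuffer, with the file name as the chunk name, and return the compiled function as the loader.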

Re: luaL_loadfile doesn't like the UTF-8 BOM

Luiz Henrique de Figueiredo
In reply to this post by Simon Orde
> luaL_loadfile, which doesn't like the UTF8-BOM and throws an error.

luaL_loadfile does handle (=skip) BOM in Lua 5.2, doesn't it?
http://www.lua.org/source/5.2/lauxlib.c.html#skipBOM

Re: luaL_loadfile doesn't like the UTF-8 BOM

Philippe Lhoste
In reply to this post by Simon Orde
On 09/07/2012 11:14, Simon Orde wrote:

> Hi - for reasons discussed earlier (see "Future Plans for Lua and Unicode"), I want to
> allow Lua scripts to be encoded in either ANSI or UTF-8.  Legacy scripts are
> currently ANSI, but with future ones, script authors will be able to choose between ANSI
> or UTF-8 (for interest, I plan to say that the encoding of strings passed to/returned from
> my app's API must match/will match the encoding of the script itself).  I use a UTF-8 BOM
> to allow me to distinguish between ANSI and UTF-8 scripts.  This seems to work fine for a
> simple script.  I use lua_load with my own supplied reader to load each script, check the
> BOM (which I need to do anyway, so I know how to handle strings), and then jump past the
> BOM if there is one.
> Script authors can also write Lua modules, however, and by default these are loaded by
> luaL_loadfile, which doesn't like the UTF8-BOM and throws an error.   This isn't actually
> that serious a problem, because one simple solution would be for me to simply specify that
> modules must always be encoded in ANSI.  However, if possible, I would prefer to allow
> modules to be encoded in either UTF-8 or ANSI too (the rule about string encoding matching
> script encoding would not apply to modules).  Can anyone suggest a way that I can do this,
> while still retaining the UTF-8 BOM?  I had a look at the package.loaders section in the
> manual, but this seems to only provide a way to have a module-specific loader, whereas my
> requirement applies to all modules.
> Simon

 From what I have read in this very mailing list from people more knowledgeable than me,
there is no real, official UTF-8 BOM...
It is an invention of Microsoft, written, for example, by Notepad when saving in UTF-8.

Might I ask why you maintain two sets of encodings? It would be simpler to make everything
UTF-8. If a script is pure ASCII, no problem, it is identical to the UTF-8 version. If
a script is using "ANSI" encoding (hasn't somebody mentioned it is also not a real
encoding, just a Microsoft shortcut to ISO-8859-1 or -15 or something like that?), it
should be trivial to convert it.

Possible alternatives to this pseudo-BOM (Lua is probably not the only software barfing on
it) can be indicating the encoding in the name of the file, or using a special first line,
like the XML files do with encoding="UTF-8" attribute.


--
Philippe Lhoste
--  (near) Paris -- France
--  http://Phi.Lho.free.fr
--  --  --  --  --  --  --  --  --  --  --  --  --  --





Re: luaL_loadfile doesn't like the UTF-8 BOM

Simon Orde
Many thanks for the replies to this - all helpful.

Philippe - I have to think about legacy scripts written in ANSI, and the
possibility that people may want to write new scripts to be run against old
versions of my software, which expects ANSI.  Yes I can easily convert old
ANSI scripts if I know that that's what they are, but users can import
scripts written by others, and without a BOM it's tricky to distinguish the
encoding.  Also, internally my app will use UTF-16 so I have to do a lot of
string conversion anyway.  It's not significantly harder for me to add
support for ANSI as a conversion target, as well as UTF-8.  Also, I want
plugin authors to be able to work in ANSI anyway if they want to, because
it's easier.  I agree that there are ways of doing it which avoid the need
for a BOM.  I'm not sure if they are better though.

Simon


Re: luaL_loadfile doesn't like the UTF-8 BOM

Owen Shepherd
In reply to this post by Philippe Lhoste


On 09 July 2012 11:54, Philippe Lhoste wrote:
> From what I have read in this very mailing list from people more knowledgeable than me,
> there is no real, official UTF-8 BOM...
> It is an invention of Microsoft, written, for example, by Notepad when saving in UTF-8.

The Unicode byte order mark is defined by the Unicode standard to be the character U+FEFF (formerly "zero width no-break space"). The use of the BOM character to identify text files encoded using a Unicode Transformation Format is encouraged by the Unicode Consortium for cases where out-of-band signalling is not available.

Re: luaL_loadfile doesn't like the UTF-8 BOM

Owen Shepherd
In reply to this post by Simon Orde

On 09 July 2012 13:28, Simon Orde wrote:
> Many thanks for the replies to this - all helpful.
>
> Philippe - I have to think about legacy scripts written in ANSI, and the
> possibility that people may want to write new scripts to be run against old
> versions of my software, which expects ANSI.  Yes I can easily convert old
> ANSI scripts if I know that that's what they are, but users can import
> scripts written by others, and without a BOM it's tricky to distinguish the
> encoding.  Also, internally my app will use UTF-16 so I have to do a lot of
> string conversion anyway.  It's not significantly harder for me to add
> support for ANSI as a conversion target, as well as UTF-8.  Also, I want
> plugin authors to be able to work in ANSI anyway if they want to, because
> it's easier.  I agree that there are ways of doing it which avoid the need
> for a BOM.  I'm not sure if they are better though.
>
> Simon

Note that even if you have an "ANSI" source file, you have no idea what encoding that is. For example, for a Russian user the system "ANSI" encoding is Windows-1251, while for a speaker of a Western European language (such as English) it will be Windows-1252. And then there are some languages added more recently... for which there isn't even an ANSI locale! So, in other words... ANSI is a mess.

My recommended strategy for this sort of transition:
  • Read the first bytes of the file. If they're a UTF-8 BOM, assume UTF-8. If they're a UTF-16 BOM, assume UTF-16. If they're neither, then proceed...
  • Try to decode the text as UTF-8. UTF-8 without a BOM is not uncommon, and UTF-8's encoding has the nice property that text in another encoding is unlikely to be misidentified as UTF-8. It is therefore reasonable to try UTF-8 on unknown text, and if it decodes, it's very probably right.
  • Failing that, decode in the system ANSI locale.
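
The three steps above can be sketched in C as follows. This is an illustration under assumptions, not a reference implementation: the names `encoding`, `is_valid_utf8` and `detect_encoding` are hypothetical, and "decodes as UTF-8" is taken to mean "is well-formed UTF-8".

```c
#include <assert.h>
#include <stddef.h>

typedef enum { ENC_UTF8, ENC_UTF16, ENC_ANSI } encoding;

/* Returns 1 if the bytes are well-formed UTF-8 (rejecting overlong forms,
   surrogates and values above U+10FFFF), otherwise 0. */
static int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        unsigned char lo = 0x80, hi = 0xBF;    /* allowed range of 2nd byte */
        size_t extra, k;
        if (c < 0x80) { i++; continue; }       /* plain ASCII */
        else if (c >= 0xC2 && c <= 0xDF) extra = 1;
        else if (c >= 0xE0 && c <= 0xEF) {
            extra = 2;
            if (c == 0xE0) lo = 0xA0;          /* no overlong 3-byte forms */
            if (c == 0xED) hi = 0x9F;          /* no UTF-16 surrogates */
        } else if (c >= 0xF0 && c <= 0xF4) {
            extra = 3;
            if (c == 0xF0) lo = 0x90;          /* no overlong 4-byte forms */
            if (c == 0xF4) hi = 0x8F;          /* nothing above U+10FFFF */
        } else return 0;                       /* stray continuation or bad lead byte */
        if (len - i <= extra) return 0;        /* truncated sequence */
        if (s[i + 1] < lo || s[i + 1] > hi) return 0;
        for (k = 2; k <= extra; k++)
            if ((s[i + k] & 0xC0) != 0x80) return 0;
        i += extra + 1;
    }
    return 1;
}

/* Step 1: BOMs. Step 2: try UTF-8. Step 3: fall back to the ANSI code page. */
static encoding detect_encoding(const unsigned char *s, size_t len)
{
    assert(s != NULL || len == 0);
    if (len >= 3 && s[0] == 0xEF && s[1] == 0xBB && s[2] == 0xBF)
        return ENC_UTF8;
    if (len >= 2 && ((s[0] == 0xFF && s[1] == 0xFE) || (s[0] == 0xFE && s[1] == 0xFF)))
        return ENC_UTF16;
    if (is_valid_utf8(s, len))
        return ENC_UTF8;
    return ENC_ANSI;
}
```

Pure ASCII input is reported as UTF-8 here, which is harmless because the two encodings are byte-identical for such text.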

Whatever your result, I would suggest just converting your scripting interface to run on UTF-8 internally. You may break some scripts... but you'll also avoid the (frankly going to be very broken) case where you have, say, a UTF-8 script using an "ANSI" module.

With regard to luaL_loadfile: the only builtin Lua functions which load a file are
  • loadfile
  • dofile
  • package.searchers[2], as used by require (<-- you might want to check that index. A glance at the documentation suggests it is probably right)
It should be relatively easy to re-implement these in a fully Unicode-aware manner (i.e. following the rules suggested above).

Re: luaL_loadfile doesn't like the UTF-8 BOM

Miles Bader-2
In reply to this post by Philippe Lhoste
Philippe Lhoste <[hidden email]> writes:
> From what I have read in this very mailing list from people more
> knowledgeable than me, there is no real, official UTF-8 BOM...  It
> is an invention of Microsoft, written, for example, by Notepad when
> saving in UTF-8.

Yes, it's a Microsoftism, and causes many problems.

Unfortunately, they seem pretty keen about it (I guess they just don't
care about the effect on non-MS software), and there's MS software out
there (VSxxx Japanese edition, at least) that _forces_ you to use a
BOM if you want to use UTF-8.  As you can imagine, this makes writing
portable source code a bit miserable...

-miles

--
Dictionary, n.  A malevolent literary device for cramping the growth of
a language and making it hard and inelastic. This dictionary, however,
is a most useful work.

Re: luaL_loadfile doesn't like the UTF-8 BOM

Owen Shepherd

On 09 July 2012 14:27, Miles Bader wrote:
> Yes, it's a Microsoftism, and causes many problems.
>
> Unfortunately, they seem pretty keen about it (I guess they just don't
> care about the effect on non-MS software), and there's MS software out
> there (VSxxx Japanese edition, at least) that _forces_ you to use a
> BOM if you want to use UTF-8. As you can imagine, this makes writing
> portable source code a bit miserable...
>
> -miles

It shouldn't cause problems. Both GCC (since 4.4) and Clang will act on a UTF-8 BOM appropriately.

Re: luaL_loadfile doesn't like the UTF-8 BOM

Javier Guerra Giraldez
In reply to this post by Owen Shepherd
On Mon, Jul 9, 2012 at 7:55 AM, Owen Shepherd <[hidden email]> wrote:
>
> The Unicode byte order mark is defined by the Unicode standard to be the
> character U+FEFF (formerly "zero width non breaking space"). The use of the
> BOM character to identify text files encoded using a Unicode Transformation
> Format is encouraged by the Unicode Consortium for cases where out of band
> signalling is not available

... and they explicitly recommended against its use in UTF-8.

Of course, given the principle "be liberal in what you accept, conservative
in what you send", it's not unreasonable to implement BOM-skipping
code, even for UTF-8.


--
Javier

Re: luaL_loadfile doesn't like the UTF-8 BOM

Owen Shepherd

On 09 July 2012 19:49, Javier Guerra Giraldez wrote:
> .... and they explicitly recommended against its use in UTF-8
>
> of course, given the "be liberal in what you accept, conservative in
> what you send", it's not unreasonable to implement a BOM-skipping
> code, even in UTF-8.
>
> --
> Javier

http://unicode.org/faq/utf_bom.html#bom5 says otherwise. They just say its use is discouraged or invalid in some cases where the encoding is already known.

Re: luaL_loadfile doesn't like the UTF-8 BOM

Tom N Harris
On Monday, July 09, 2012 03:07:03 PM Owen Shepherd wrote:
> http://unicode.org/faq/utf_bom.html#bom5  says otherwise. They just say
> its use is discouraged or invalid in some cases where the encoding is
> already known.

Isn't design-by-committee wonderful?

The most sensible approach to encoding identification has been prefix lines as
introduced by Emacs and adopted by other editors and languages. Quoting from
Python PEP 263 [1]:

    the first or second line must match the regular
    expression "coding[:=]\s*([-\w.]+)". The first group of this
    expression is then interpreted as encoding name.

So a Lua script would start with

    --*- coding: utf-8 -*-

With or without a BOM. (If there is a BOM, the declared encoding is ignored.)
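
As a rough sketch of how a host application might read such a prefix line, the C function below hand-codes the quoted pattern instead of using a regex library. The name `find_coding_declaration` is hypothetical, and per the rule above the caller is assumed to have checked for a BOM first.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: scans the first two lines of src for
   coding[:=]\s*([-\w.]+) and copies the captured name into out
   (silently truncated to fit outsize). Returns 1 if found, else 0. */
static int find_coding_declaration(const char *src, size_t len,
                                   char *out, size_t outsize)
{
    size_t end = 0, lines = 0, i;

    assert(src != NULL && out != NULL && outsize > 0);
    /* Limit the search window to the first two lines. */
    while (end < len && lines < 2) {
        if (src[end] == '\n') lines++;
        end++;
    }
    for (i = 0; i + 7 <= end; i++) {
        size_t j, n = 0;
        if (memcmp(src + i, "coding", 6) != 0) continue;
        if (src[i + 6] != ':' && src[i + 6] != '=') continue;
        j = i + 7;
        /* Skip whitespace, but never cross into the next line. */
        while (j < end && src[j] != '\n' && isspace((unsigned char)src[j])) j++;
        /* Capture the run of [-\w.] characters. */
        while (j < end && n + 1 < outsize &&
               (isalnum((unsigned char)src[j]) || src[j] == '_' ||
                src[j] == '-' || src[j] == '.')) {
            out[n++] = src[j++];
        }
        if (n > 0) { out[n] = '\0'; return 1; }
    }
    return 0;
}
```

Because the pattern is unanchored, it also picks up editor variants such as `fileencoding=...`, which is the behaviour the quoted expression implies.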

Of course, as far as Lua is concerned, UTF-8 looks the same as ISO-8859-*.
Strings are just bytes as has often been discussed on the list. But knowing
the source encoding allows your application to apply a transformation before
displaying any strings to the user.

[1] http://www.python.org/dev/peps/pep-0263

--
tom <[hidden email]>