Is it possible to add utf-8 lua source file support in lua 5.2?

Xpol Wan
Hi guys,

    Is it possible to add UTF-8 Lua source support in Lua 5.2?
Just remove the BOM of the UTF-8 Lua source file when loading a Lua file or string.

I know it is just a little patch, but it would be nice to include this feature in the official Lua.


Best Regards!

Xpol Wan

Re: Is it possible to add utf-8 lua source file support in lua 5.2?

Pan, Shi Zhu

Lua has no problem with UTF-8 files that have no BOM.

According to the POSIX standard, you should *not* add a BOM to a UTF-8 file.

So UTF-8 with a BOM is not a standard file format.

BTW: GNU gcc does not support UTF-8+BOM source files either.

On Mon, Sep 27, 2010 at 9:54 AM, Xpol Wan <[hidden email]> wrote:
Hi guys,

    Is it possible to add UTF-8 Lua source support in Lua 5.2?
Just remove the BOM of the UTF-8 Lua source file when loading a Lua file or string.

I know it is just a little patch, but it would be nice to include this feature in the official Lua.

Re: Is it possible to add utf-8 lua source file support in lua 5.2?

Duncan Cross
In reply to this post by Xpol Wan
On Mon, Sep 27, 2010 at 2:54 AM, Xpol Wan <[hidden email]> wrote:

> Hi guys,
>
>     Is it possible to add UTF-8 Lua source support in Lua 5.2?
> Just remove the BOM of the UTF-8 Lua source file when loading a Lua file or
> string.
>
> I know it is just a little patch, but it would be nice to include this
> feature in the official Lua.
>
>
> Best Regards!
>
> Xpol Wan
>

To be honest, it seems only fair to me, since Lua's standard script
file loader already supports ignoring the first line if it begins with
a shebang. This sets a precedent that it is appropriate for the loader
to ignore common, OS-specific peculiarities at the beginning of a
script file.

-Duncan

Re: Is it possible to add utf-8 lua source file support in lua 5.2?

Miles Bader-2
Duncan Cross <[hidden email]> writes:
> To be honest, it seems only fair to me, since Lua's standard script
> file loader already supports ignoring the first line if it begins with
> a shebang. This sets a precedent that it is appropriate for the loader
> to ignore common, OS-specific peculiarities at the beginning of a
> script file.

OTOH, it just encourages this little microsoftian perversity, which
pervades their toolset, and encourages other lazy devs to just copy
MS's behavior in their own tools (thus spreading the disease).

-Miles

--
My books focus on timeless truths.  -- Donald Knuth

Re: Is it possible to add utf-8 lua source file support in lua 5.2?

Robert Raschke
In reply to this post by Pan, Shi Zhu

On Mon, Sep 27, 2010 at 3:16 AM, Pan Shi Zhu <[hidden email]> wrote:

Lua has no problem with UTF-8 files that have no BOM.

According to the POSIX standard, you should *not* add a BOM to a UTF-8 file.

So UTF-8 with a BOM is not a standard file format.

BTW: GNU gcc does not support UTF-8+BOM source files either.



Unfortunately,  MS insists on being completely inconsistent, adding the BOM in some tools, and stripping it in others. Complete nightmare.

I once added this to lauxlib.c function luaL_loadfile():

--- lauxlib-orig.c    Mon Sep 27 10:15:59 2010
+++ lauxlib.c    Mon Sep 27 10:16:28 2010
@@ -565,6 +565,21 @@
     if (lf.f == NULL) return errfile(L, "open", fnameindex);
   }
   c = getc(lf.f);
+
+  /* vvv RTR vvv: Check for UTF-8 BOM ef bb bf */
+  if (c == 0xef) {
+    if (getc(lf.f) == 0xbb && getc(lf.f) == 0xbf) {
+      /* do nothing, we've skipped the BOM and just continue with normal processing */
+    } else {
+     /* wasn't the UTF8 BOM, so reset everything again */
+      fclose(lf.f);
+      lf.f = fopen(filename, "r");  /* reopen */
+      if (lf.f == NULL) return errfile(L, "open", fnameindex); /* unable to reopen file */
+    }
+    c = getc(lf.f);
+  }
+  /* ^^^ RTR ^^^: Check for UTF-8 BOM ef bb bf */
+
   if (c == '#') {  /* Unix exec. file? */
     lf.extraline = 1;
     while ((c = getc(lf.f)) != EOF && c != '\n') ;  /* skip first line */


It's been good enough for me for a while.

Robby

Re: Is it possible to add utf-8 lua source file support in lua 5.2?

Daurnimator
On 27 September 2010 19:19, Robert Raschke <[hidden email]> wrote:

>
> On Mon, Sep 27, 2010 at 3:16 AM, Pan Shi Zhu <[hidden email]> wrote:
>>
>> Lua has no problem with UTF-8 files that have no BOM.
>>
>> According to the POSIX standard, you should *not* add a BOM to a UTF-8 file.
>>
>> So UTF-8 with a BOM is not a standard file format.
>>
>> BTW: GNU gcc does not support UTF-8+BOM source files either.
>>
>
> Unfortunately,  MS insists on being completely inconsistent, adding the BOM
> in some tools, and stripping it in others. Complete nightmare.
>
> I once added this to lauxlib.c function luaL_loadfile():
>
> --- lauxlib-orig.c    Mon Sep 27 10:15:59 2010
> +++ lauxlib.c    Mon Sep 27 10:16:28 2010
> @@ -565,6 +565,21 @@
>      if (lf.f == NULL) return errfile(L, "open", fnameindex);
>    }
>    c = getc(lf.f);
> +
> +  /* vvv RTR vvv: Check for UTF-8 BOM ef bb bf */
> +  if (c == 0xef) {
> +    if (getc(lf.f) == 0xbb && getc(lf.f) == 0xbf) {
> +      /* do nothing, we've skipped the BOM and just continue with normal
> processing */
> +    } else {
> +     /* wasn't the UTF8 BOM, so reset everything again */
> +      fclose(lf.f);
> +      lf.f = fopen(filename, "r");  /* reopen */
> +      if (lf.f == NULL) return errfile(L, "open", fnameindex); /* unable to
> reopen file */
> +    }
> +    c = getc(lf.f);
> +  }
> +  /* ^^^ RTR ^^^: Check for UTF-8 BOM ef bb bf */
> +
>    if (c == '#') {  /* Unix exec. file? */
>      lf.extraline = 1;
>      while ((c = getc(lf.f)) != EOF && c != '\n') ;  /* skip first line */
>
>
> It's been good enough for me for a while.
>
> Robby
>
>

why not use ungetc?

Re: Is it possible to add utf-8 lua source file support in lua 5.2?

Miles Bader-2
Quae Quack <[hidden email]> writes:
> why not use ungetc?

ungetc is only guaranteed to work for a single character.

-miles

--
`The suburb is an obsolete and contradictory form of human settlement'

Re: Is it possible to add utf-8 lua source file support in lua 5.2?

Rob Kendrick-2
In reply to this post by Daurnimator
On Mon, Sep 27, 2010 at 08:00:32PM +1000, Quae Quack wrote:

> why not use ungetc?

Because Lua is written in C, and C's ungetc is not guaranteed to work
for more than one byte.

B.
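
The one-byte limit mentioned here can be illustrated with a minimal sketch. `peek_byte` is a hypothetical helper, not part of Lua or of the patch earlier in the thread. It relies only on the single character of pushback that standard C guarantees for `ungetc`, which is enough to test for the first BOM byte but not to undo a partial three-byte match.

```c
#include <stdio.h>

/* Hypothetical helper: look at the next byte without consuming it.
   Uses exactly one ungetc() per read, which standard C guarantees. */
static int peek_byte(FILE *f) {
  int c = getc(f);
  if (c != EOF)
    ungetc(c, f);  /* one byte of pushback is always available */
  return c;        /* EOF if the stream is exhausted */
}
```

Anything that needs to look further ahead than this has to keep its own buffer or reposition the stream.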

Re: Is it possible to add utf-8 lua source file support in lua 5.2?

Daurnimator
On 27 September 2010 20:22, Rob Kendrick <[hidden email]> wrote:

> On Mon, Sep 27, 2010 at 08:00:32PM +1000, Quae Quack wrote:
>
>> why not use ungetc?
>
> Because Lua is written in C, and C's ungetc is not guaranteed to work
> for more than one byte.
>
> B.
>
>

Not guaranteed to work, but you can try, checking for EOF.

Re: Is it possible to add utf-8 lua source file support in lua 5.2?

Rob Kendrick-2
On Mon, Sep 27, 2010 at 08:45:01PM +1000, Quae Quack wrote:
> On 27 September 2010 20:22, Rob Kendrick <[hidden email]> wrote:
> > On Mon, Sep 27, 2010 at 08:00:32PM +1000, Quae Quack wrote:
> >
> >> why not use ungetc?
> >
> > Because Lua is written in C, and C's ungetc is not guaranteed to work
> > for more than one byte.
>
> Not guaranteed to work, but you can try, checking for EOF.

And then what?  Do precisely what the OP does. :)  Alternatively, I'd
suggest not advocating the support of broken files generated by broken
tools.  Fix the tools, not the language.

B.

Re: Is it possible to add utf-8 lua source file support in lua 5.2?

"J.Jørgen von Bargen"
In reply to this post by Robert Raschke
Robert Raschke <rtrlists <at> googlemail.com> writes:

>+     /* wasn't the UTF8 BOM, so reset everything again */
>+      fclose(lf.f);
>+      lf.f = fopen(filename, "r");  /* reopen */

I doubt this will work on pipes. (Characters are lost?)


Re: Is it possible to add utf-8 lua source file support in lua 5.2?

Robert Raschke

2010/9/27 J.Jørgen von Bargen <[hidden email]>
Robert Raschke <rtrlists <at> googlemail.com> writes:

>+     /* wasn't the UTF8 BOM, so reset everything again */
>+      fclose(lf.f);
>+      lf.f = fopen(filename, "r");  /* reopen */

I doubt this will work on pipes. (Characters are lost?)



Indeed, if you want to strip a UTF-8 BOM from a pipe stream you need to do something else. Essentially, you'd need to add in a read ahead buffer. That's a bit more work than the 10 extra lines for ditching a file BOM.

I haven't looked in detail, but do pipe opens even go via luaL_loadfile() ?

Robby
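
A read-ahead buffer of the kind described above might look roughly like the following sketch. It is not the lauxlib.c patch and not what LuaJIT does; `skip_utf8_bom` and its replay convention are assumptions made for illustration. The idea is to never seek or reopen: bytes that turn out not to be a BOM are handed back to the caller in a small buffer, so the same code works on pipes.

```c
#include <stdio.h>

/* Hypothetical helper, not part of Lua's API.
   Reads at most 3 bytes from f. If they are the UTF-8 BOM (EF BB BF),
   the BOM is dropped and 0 is returned. Otherwise the bytes already
   consumed are left in buf and their count is returned, so the caller
   can feed buf[0..n) to the parser before reading further from f. */
static size_t skip_utf8_bom(FILE *f, unsigned char buf[3]) {
  static const unsigned char bom[3] = {0xEF, 0xBB, 0xBF};
  size_t n = 0;
  int c;
  while (n < 3 && (c = getc(f)) != EOF) {
    buf[n] = (unsigned char)c;
    n++;
    if (buf[n - 1] != bom[n - 1])
      return n;  /* mismatch: replay everything read so far */
  }
  return (n == 3) ? 0 : n;  /* full BOM: drop it; short input: replay */
}
```

The cost compared with the reopen approach is that the reader callback has to drain `buf` first, which is the "bit more work" mentioned above.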

Re: Is it possible to add utf-8 lua source file support in lua 5.2?

Mike Pall-19
Robert Raschke wrote:
> 2010/9/27 J.Jørgen von Bargen <[hidden email]>
> > I doubt this will work on pipes. (Characters are lost?)
>
> Indeed, if you want to strip a UTF-8 BOM from a pipe stream you need to do
> something else. Essentially, you'd need to add in a read ahead buffer.
> That's a bit more work than the 10 extra lines for ditching a file BOM.

There's already such a buffer, but it's inside the lexer. That's
why I've moved the '#'-line detection from lauxlib.c into the core
of LuaJIT. I've also added UTF-8 BOM stripping, because trying to
educate Windows users _not_ to use notepad.exe is an absolutely
hopeless endeavor. The LuaJIT lexer has been modified to accept
non-ASCII UTF-8 characters as identifiers, too.

[
Yes, I prefer pragmatic solutions ('It helps my user base') over
dogmatic approaches (aka 'Someone is wrong on the Internet!' or
'Someone might have a ternary, one's complement computer without
IEEE-754-compliant floating-point hardware somewhere in his shed').
See also: http://xkcd.com/386/
]

> I haven't looked in detail, but do pipe opens even go via luaL_loadfile() ?

Well, yes, if you use a named pipe. Whether that's a good idea or
not is a different issue.

--Mike

Re: Is it possible to add utf-8 lua source file support in lua 5.2?

eugeny gladkih
In reply to this post by Rob Kendrick-2
On 27.09.2010 14:22, Rob Kendrick wrote:
> On Mon, Sep 27, 2010 at 08:00:32PM +1000, Quae Quack wrote:
>
>> why not use ungetc?
>
> Because Lua is written in C, and C's ungetc is not guaranteed to work
> for more than one byte.
>

man rewind
man fseek

Re: Is it possible to add utf-8 lua source file support in lua 5.2?

Rob Kendrick-2
On Mon, Sep 27, 2010 at 09:11:50PM +0400, john gladkih 599133195 wrote:

> On 27.09.2010 14:22, Rob Kendrick wrote:
> >On Mon, Sep 27, 2010 at 08:00:32PM +1000, Quae Quack wrote:
> >
> >>why not use ungetc?
> >
> >Because Lua is written in C, and C's ungetc is not guaranteed to work
> >for more than one byte.
>
> man rewind
> man fseek

Which doesn't solve the problem for pipes or other non-seekable file
systems.  The problem isn't in Lua; don't fix it there.

B.
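
For comparison, the `fseek` suggestion can be sketched as follows. `skip_bom_or_rewind` is a hypothetical name, and the sketch assumes the stream is positioned at its start when called. It also shows the limitation raised in this reply: when `fseek` fails, as it does on a pipe, the bytes already read are gone.

```c
#include <stdio.h>

/* Hypothetical helper; assumes f is positioned at the start of the stream.
   Returns  1 if a UTF-8 BOM (EF BB BF) was read and skipped,
            0 if there was no BOM and the stream was rewound to offset 0,
           -1 if there was no BOM and rewinding failed (non-seekable
              stream); up to three bytes have then been lost. */
static int skip_bom_or_rewind(FILE *f) {
  if (getc(f) == 0xEF && getc(f) == 0xBB && getc(f) == 0xBF)
    return 1;
  /* A successful fseek also clears the end-of-file indicator. */
  return (fseek(f, 0L, SEEK_SET) == 0) ? 0 : -1;
}
```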

Re: Is it possible to add utf-8 lua source file support in lua 5.2?

Tom N Harris
In reply to this post by Duncan Cross
On 9/27/2010 3:20 AM, Duncan Cross wrote:
>
> To be honest, it seems only fair to me, since Lua's standard script
> file loader already supports ignoring the first line if it begins with
> a shebang. This sets a precedent that it is appropriate for the loader
> to ignore common, OS-specific peculiarities at the beginning of a
> script file.
>
> -Duncan
>

A regrettable precedent, perhaps? Although it's an OS feature that is
well documented and used on multiple platforms. In any case, I think the
"correct" response to this problem is...

 From http://www.lua.org/manual/5.1/manual.html#pdf-load
 >
 > load(func [, chunkname])
 > Loads a chunk using function func to get its pieces.

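function utf8_dofile(filename)
   -- read the whole file, then strip a leading UTF-8 BOM (EF BB BF)
   local src = assert(io.open(filename)):read('*a')
   src = string.gsub(src, '^\239\187\191', '')
   return assert(load(
     function()
       -- load() calls this until it returns nil, so hand over the source once
       local piece = src
       src = nil
       return piece
     end,
     filename
   ))()
end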

Or something like that.

--
- tom
[hidden email]

Re: Is it possible to add utf-8 lua source file support in lua 5.2?

Pan, Shi Zhu
In reply to this post by Mike Pall-19

On Mon, Sep 27, 2010 at 10:50 PM, Mike Pall <[hidden email]> wrote:
I've also added UTF-8 BOM stripping, because trying to
educate Windows users _not_ to use notepad.exe is an absolutely
hopeless endeavor.


The UTF-8 BOM is, by Unicode's definition, actually a "space" character. Should we just treat the UTF-8 BOM like a normal space character instead of stripping it off? Would that be easier to handle in the lexer?


Re: Is it possible to add utf-8 lua source file support in lua 5.2?

"J.Jørgen von Bargen"
In reply to this post by Tom N Harris
Tom N Harris <telliamed <at> whoopdedo.org> writes:

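> function utf8_dofile(filename)
>    -- read the whole file, then strip a leading UTF-8 BOM (EF BB BF)
>    local src = assert(io.open(filename)):read('*a')
>    src = string.gsub(src, '^\239\187\191', '')
>    return assert(load(
>      function()
>        -- load() calls this until it returns nil, so hand over the source once
>        local piece = src
>        src = nil
>        return piece
>      end,
>      filename
>    ))()
> end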

This will close the file only at some unspecified time in the not-so-near future and
may lead to "Out of file handles" if called often with no GC in between.



Re: Is it possible to add utf-8 lua source file support in lua 5.2?

David Heiko Kolf
In reply to this post by Pan, Shi Zhu
Pan Shi Zhu schrieb:
> The UTF-8 BOM is, by Unicode's definition, actually a "space"
> character. Should we just treat the UTF-8 BOM like a normal space character
> instead of stripping it off? Would that be easier to handle in the lexer?

I guess this would be the cleanest solution and that's what I did in my
JSON module.

- David

Re: Is it possible to add utf-8 lua source file support in lua 5.2?

Luiz Henrique de Figueiredo
In reply to this post by Pan, Shi Zhu
> The UTF-8 BOM is, by Unicode's definition, actually a "space" character.
> Should we just treat the UTF-8 BOM like a normal space character instead of
> stripping it off? Would that be easier to handle in the lexer?

In Lua 5.2 you don't even have to patch the lexer: just edit lctype.c
and say that 0xFF and 0xFE are whitespace. This of course is not the
perfect solution, because BOM is a 2-byte entity, not a 1-byte one...
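
As a point of reference for the byte values involved: U+FEFF serialized as UTF-8 is the three-byte sequence EF BB BF that the lauxlib.c patch earlier in the thread tests for, whereas FF FE and FE FF are its UTF-16 byte orders. A whitespace-based approach for UTF-8 sources would therefore have to cover those three bytes. The following is a minimal sketch of the UTF-8 encoding rule, with `utf8_encode` as a hypothetical helper name (nothing here comes from lctype.c), showing where the three bytes come from.

```c
#include <stddef.h>

/* Hypothetical helper: encode code point cp (assumed <= 0x10FFFF) as
   UTF-8 into out and return the number of bytes written.
   Surrogates and out-of-range values are not rejected in this sketch. */
static size_t utf8_encode(unsigned long cp, unsigned char out[4]) {
  if (cp < 0x80) {
    out[0] = (unsigned char)cp;
    return 1;
  }
  if (cp < 0x800) {
    out[0] = (unsigned char)(0xC0 | (cp >> 6));
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp < 0x10000) {
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
  }
  out[0] = (unsigned char)(0xF0 | (cp >> 18));
  out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
  out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
  out[3] = (unsigned char)(0x80 | (cp & 0x3F));
  return 4;
}
```

Calling `utf8_encode(0xFEFF, out)` writes three bytes, EF BB BF, into `out`.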
