Support for Windows unicode paths

Thomas Harning Jr.
One problem I've recently run into is that accessing Windows Unicode paths is impossible from Lua through its standard interface.  I don't blame Lua for using the standard C APIs, but rather Windows for its bad fallbacks (not to mention that Outlook Express dies if you have a Unicode username)... in any case, it still needs to be dealt with.

Has anybody come up with solutions?  The only reasonable solution I can think of is to have 'replacement' APIs at the boundaries that send/receive certain text from the OS...

Ex:
  os.getenv()
  io.open()
  io.popen()
  os.execute()
  .... even the require machinery would need something to deal with the paths it uses

The replacement format would be:

 Lua -> WIN  : (some charset, ex UTF-8) => Wide
 WIN -> Lua  : wide => (some charset, ex UTF-8)

... any other ideas?
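
To make the Lua -> WIN direction concrete, here is a minimal C sketch (illustration only: the helper name utf8_to_wide is made up, and MultiByteToWideChar is the Win32 routine that does the actual conversion):

#include <windows.h>
#include <stdlib.h>

/* Convert a NUL-terminated UTF-8 string to a freshly allocated UTF-16
   string; returns NULL on failure.  The caller frees the result. */
wchar_t *utf8_to_wide(const char *utf8)
{
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
    if (len == 0) return NULL;                      /* invalid UTF-8 */
    wchar_t *wide = (wchar_t *)malloc(len * sizeof(wchar_t));
    if (wide == NULL) return NULL;
    MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, len);
    return wide;
}

The WIN -> Lua direction would be the mirror image using WideCharToMultiByte.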

--
Thomas Harning Jr.

Re: Support for Windows unicode paths

KHMan
Thomas Harning Jr. wrote:
> [snip]
> The replacement format would be:
>
>  Lua -> WIN  : (some charset, ex UTF-8) => Wide
>  WIN -> Lua  : wide => (some charset, ex UTF-8)
>
> ... any other ideas?

Yeah, in theory the Lua side should just see UTF-8 on all
platforms and be done with it. I've written some Lua file-handling
code that way myself.

Yet I think it is not so simple if you want to target broad
compatibility; look at some of the Cygwin discussions over the past
few weeks and the various charset issues still being worked
out for Cygwin 1.7. A limited feature set (like "UTF-8 or the
highway") would be easier to implement than what they are doing on
Cygwin.

--
Cheers,
Kein-Hong Man (esq.)
Kuala Lumpur, Malaysia

Re: Support for Windows unicode paths

Bulat Ziganshin
In reply to this post by Thomas Harning Jr.
Hello Thomas,

Wednesday, July 22, 2009, 6:55:56 PM, you wrote:

> Has anybody come up with solutions?  The only reasonable solution I
> can think of is to have 'replacement' APIs at the boundaries that
> send/receive certain text from the OS...

I don't understand what you mean by "replacement APIs". I think the
best way is to provide low-level C functions that implement the open,
popen, and getenv functionality using UTF-8-encoded strings (internally
converting them to/from UTF-16) and to use these functions in the io/os
packages instead of the usual ones.

Just adding an encoding/decoding layer is probably impossible, since
open/popen/... on Windows use the OEM/ANSI encoding by default, and
converting from UTF-8 to OEM/ANSI is lossy.
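
For illustration, a sketch of what one such low-level function could look like, built on a UTF-8 -> UTF-16 helper like the utf8_to_wide sketched earlier in the thread (_wfopen is the wide-character CRT open; the wrapper name is made up):

#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

wchar_t *utf8_to_wide(const char *utf8);    /* helper sketched earlier */

/* fopen replacement that accepts UTF-8 and calls the wide CRT function,
   so no lossy UTF-8 -> OEM/ANSI step is involved. */
static FILE *utf8_fopen(const char *path, const char *mode)
{
    FILE *f = NULL;
    wchar_t *wpath = utf8_to_wide(path);
    wchar_t *wmode = utf8_to_wide(mode);
    if (wpath != NULL && wmode != NULL)
        f = _wfopen(wpath, wmode);
    free(wpath);
    free(wmode);
    return f;
}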


--
Best regards,
 Bulat                            mailto:[hidden email]


Re: Support for Windows unicode paths

Thomas Harning Jr.
On Wed, Jul 22, 2009 at 11:26 AM, Bulat Ziganshin <[hidden email]> wrote:
> Hello Thomas,
>
> Wednesday, July 22, 2009, 6:55:56 PM, you wrote:
>
>> Has anybody come up with solutions?  The only reasonable solution I
>> can think of is to have 'replacement' APIs at the boundaries that
>> send/receive certain text from the OS...
>
> I don't understand what you mean by "replacement APIs". I think the
> best way is to provide low-level C functions that implement the open,
> popen, and getenv functionality using UTF-8-encoded strings (internally
> converting them to/from UTF-16) and to use these functions in the io/os
> packages instead of the usual ones.

What you've just described /is/ the replacement API I was thinking about.

Has anyone already performed this work (making a 'lightweight' wrapper for Windows to make Lua operate like this)?

--
Thomas Harning Jr.

Re: Support for Windows unicode paths

KHMan
Thomas Harning Jr. wrote:

> On Wed, Jul 22, 2009 at 11:26 AM, Bulat Ziganshin wrote:
>
>     Hello Thomas,
>
>     Wednesday, July 22, 2009, 6:55:56 PM, you wrote:
>
>      > Has anybody come up with solutions?  The only reasonable solution I
>      > can think of is to have 'replacement' APIs at the boundaries that
>      > send/receive certain text from the OS...
>
>     I don't understand what you mean by "replacement APIs". I think the
>     best way is to provide low-level C functions that implement the open,
>     popen, and getenv functionality using UTF-8-encoded strings (internally
>     converting them to/from UTF-16) and to use these functions in the io/os
>     packages instead of the usual ones.
>
> What you've just described /is/ the replacement API I was thinking about.
>
> Has anyone already performed this work (making a 'lightweight' wrapper
> for Windows to make Lua operate like this)?

Oh, I just forgot: it has already been done. Check APR, the Apache
Portable Runtime; its file I/O uses all sorts of UTF-something.
There might be a binding available, but I am too lazy to check the
wiki...

--
Cheers,
Kein-Hong Man (esq.)
Kuala Lumpur, Malaysia

Re: Support for Windows unicode paths

David Given
In reply to this post by Bulat Ziganshin
Bulat Ziganshin wrote:
[...]
> Just adding an encoding/decoding layer is probably impossible, since
> open/popen/... on Windows use the OEM/ANSI encoding by default, and
> converting from UTF-8 to OEM/ANSI is lossy.

It's worse than you think. On Windows, the ANSI file system APIs have a
much shorter maximum path length than the UTF-16 ones! Read this:

http://subversion.tigris.org/faq.html#long-paths

So a naive mapping of the Lua functions to the corresponding Windows
ones is not necessarily going to work --- it's entirely possible to be
in a situation where just calling getcwd() fails.

Ditching the ANSI APIs completely and using only the UTF-16 ones is
probably the best way to go, unfortunately...
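
For illustration, an untested sketch of that route: the wide file functions accept the \\?\ prefix on fully qualified paths, which lifts the MAX_PATH limit, while the ANSI variants never get past it.

#include <windows.h>
#include <stdlib.h>
#include <wchar.h>

/* Open a file via the wide API, prefixing the path with \\?\ so paths
   longer than MAX_PATH keep working.  wpath is assumed to be a fully
   qualified UTF-16 path (e.g. produced by a UTF-8 -> UTF-16 helper). */
static HANDLE open_long_path(const wchar_t *wpath)
{
    size_t len = wcslen(wpath);
    wchar_t *full = (wchar_t *)malloc((len + 5) * sizeof(wchar_t));
    if (full == NULL) return INVALID_HANDLE_VALUE;
    wcscpy(full, L"\\\\?\\");                    /* the \\?\ prefix */
    wcscat(full, wpath);
    HANDLE h = CreateFileW(full, GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    free(full);
    return h;
}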

--
┌─── dg@cowlark.com ───── http://www.cowlark.com ─────

│ "They laughed at Newton. They laughed at Einstein. Of course, they
│ also laughed at Bozo the Clown." --- Carl Sagan

Re: Support for Windows unicode paths

Thomas Harning Jr.
On Wed, Jul 22, 2009 at 11:44 AM, David Given <[hidden email]> wrote:
> Bulat Ziganshin wrote:
> [...]
>> Just adding an encoding/decoding layer is probably impossible, since
>> open/popen/... on Windows use the OEM/ANSI encoding by default, and
>> converting from UTF-8 to OEM/ANSI is lossy.
>
> It's worse than you think. On Windows, the ANSI file system APIs have a much shorter maximum path length than the UTF-16 ones! Read this:
>
> http://subversion.tigris.org/faq.html#long-paths
>
> So a naive mapping of the Lua functions to the corresponding Windows ones is not necessarily going to work --- it's entirely possible to be in a situation where just calling getcwd() fails.
>
> Ditching the ANSI APIs completely and using only the UTF-16 ones is probably the best way to go, unfortunately...

Yeah, that's what I'd figured I had to do.

Basically replace all ANSI APIs with the UTF-16 ones, and have an automatic conversion from UTF-8 => UTF-16  [and back when pulling data out]
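
Something along those lines, sketched for the getenv case (names made up; _wgetenv is the wide CRT getenv, WideCharToMultiByte handles the wide -> UTF-8 leg, and utf8_to_wide is the helper sketched earlier in the thread):

#include <windows.h>
#include <stdlib.h>
#include <wchar.h>

wchar_t *utf8_to_wide(const char *utf8);        /* helper sketched earlier */

/* Wide -> UTF-8 for data coming back out of the OS; caller frees. */
static char *wide_to_utf8(const wchar_t *wide)
{
    int len = WideCharToMultiByte(CP_UTF8, 0, wide, -1, NULL, 0, NULL, NULL);
    if (len == 0) return NULL;
    char *utf8 = (char *)malloc(len);
    if (utf8 == NULL) return NULL;
    WideCharToMultiByte(CP_UTF8, 0, wide, -1, utf8, len, NULL, NULL);
    return utf8;
}

/* getenv replacement: UTF-8 name in, UTF-8 value out (or NULL). */
static char *utf8_getenv(const char *name)
{
    wchar_t *wname = utf8_to_wide(name);
    if (wname == NULL) return NULL;
    wchar_t *wvalue = _wgetenv(wname);          /* do not free: CRT-owned */
    free(wname);
    return (wvalue != NULL) ? wide_to_utf8(wvalue) : NULL;
}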


--
Thomas Harning Jr.

Re: Support for Windows unicode paths

Jerome Vuarand
2009/7/22 Thomas Harning Jr. <[hidden email]>:

> On Wed, Jul 22, 2009 at 11:44 AM, David Given <[hidden email]> wrote:
>>
>> Bulat Ziganshin wrote:
>> [...]
>>>
>>> Just adding an encoding/decoding layer is probably impossible, since
>>> open/popen/... on Windows use the OEM/ANSI encoding by default, and
>>> converting from UTF-8 to OEM/ANSI is lossy.
>>
>> It's worse than you think. On Windows, the ANSI file system APIs have a
>> much shorter maximum path length than the UTF-16 ones! Read this:
>>
>> http://subversion.tigris.org/faq.html#long-paths
>>
>> So a naive mapping of the Lua functions to the corresponding Windows ones
>> is not necessarily going to work --- it's entirely possible to be in a
>> situation where just calling getcwd() fails.
>>
>> Ditching the ANSI APIs completely and using only the UTF-16 ones is
>> probably the best way to go, unfortunately...
>
> Yeah, that's what I'd figured I had to do.
>
> Basically replace all ANSI APIs with the UTF-16 ones, and have an automatic
> conversion from UTF-8 => UTF-16  [and back when pulling data out]

I have a patch (attached) replacing the win32 API calls with the wide
versions, with the help of a couple of helper functions (lua_towstring,
lua_pushwstring) that convert between UTF-8 (Lua side) and UTF-16
(win32 side). You can use that as a basis to replace all appropriate
C library calls. I have more complete wstring management functions in
another lib if you're interested (lua_tolwstring, lua_pushlwstring,
luaL_*, etc.).

I also have a partial win32 binding using the same mechanism (i.e.
UTF-8 in Lua and the wide versions of the win32 API calls). That could
be a basis for adding UTF-8 support to the Lua standard libraries
without patching the core. The sources are available here:
http://piratery.net/hg/lwin32/archive/tip.tar.gz.

Attachment: win32-unicode.patch (3K)
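
For readers who don't want to open the attachment, a rough sketch of what helpers along the lines of lua_towstring/lua_pushwstring could look like; the actual patch may differ in its details.

#include <windows.h>
#include <stdlib.h>
#include "lua.h"
#include "lauxlib.h"

/* UTF-8 string at a stack index -> freshly allocated UTF-16; caller frees. */
static wchar_t *lua_towstring(lua_State *L, int idx)
{
    const char *s = luaL_checkstring(L, idx);
    int len = MultiByteToWideChar(CP_UTF8, 0, s, -1, NULL, 0);
    if (len == 0) luaL_error(L, "invalid UTF-8 string");
    wchar_t *w = (wchar_t *)malloc(len * sizeof(wchar_t));
    if (w == NULL) luaL_error(L, "out of memory");
    MultiByteToWideChar(CP_UTF8, 0, s, -1, w, len);
    return w;
}

/* UTF-16 -> UTF-8, pushed onto the Lua stack as a string. */
static void lua_pushwstring(lua_State *L, const wchar_t *w)
{
    int len = WideCharToMultiByte(CP_UTF8, 0, w, -1, NULL, 0, NULL, NULL);
    if (len == 0) { lua_pushliteral(L, ""); return; }
    char *s = (char *)malloc(len);
    if (s == NULL) { luaL_error(L, "out of memory"); return; }
    WideCharToMultiByte(CP_UTF8, 0, w, -1, s, len, NULL, NULL);
    lua_pushlstring(L, s, len - 1);             /* len includes the NUL */
    free(s);
}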

Re: Support for Windows unicode paths

Joshua Jensen
----- Original Message -----
From: Jerome Vuarand
Date: 7/22/2009 10:18 AM
> I have a patch (attached) replacing win32 API calls with the wide
> versions, with the help of a couple helper functions (lua_towstring,
> lua_pushwstring) that convert between utf-8 (Lua side) and utf-16
> (win32 side). You can use that as a basis to replace all appropriate
> clib calls. I have more complete wstring management functions in
> another lib if you're interested (lua_tolwstring, lua_pushlwstring,
> luaL_*, etc.).
>  
FWIW, LuaPlus (http://luaplus.org) has wide string support.  It is
probably similar to your patch, but the wide string functionality is
built into the language itself.  It can even read UCS-2 .lua files.  Be
sure to grab the latest Subversion sources from
svn://svn.luaplus.org/LuaPlus/work51.  Everything is #ifdef'ed.  Look
for LUA_WIDESTRING and LUA_WIDESTRING_FILE.

Josh

Re: Support for Windows unicode paths

Shmuel Zeigerman
Joshua Jensen wrote:
> FWIW, LuaPlus (http://luaplus.org) has wide string support.  It is
> probably similar to your patch, but the wide string functionality is
> built into the language itself.  It can even read UCS-2 .lua files.  Be
> sure to grab latest Subversion from
> svn://svn.luaplus.org/LuaPlus/work51.  Everything is #ifdef'ed.  Look
> for LUA_WIDESTRING and LUA_WIDESTRING_FILE.

It would be great to have this functionality in Lua 5.2.

--
Shmuel

Re: Support for Windows unicode paths

Miles Bader-2
Shmuel Zeigerman <[hidden email]> writes:
>> FWIW, LuaPlus (http://luaplus.org) has wide string support.
>
> It would be great to have this functionality in Lua 5.2.

Why?

Lua shouldn't be infected with this crap.

-miles

--
Helpmate, n. A wife, or bitter half.

Re: Support for Windows unicode paths

Joshua Jensen
----- Original Message -----
From: Miles Bader
Date: 7/22/2009 8:47 PM
> Shmuel Zeigerman <[hidden email]> writes:
>>> FWIW, LuaPlus (http://luaplus.org) has wide string support.
>> It would be great to have this functionality in Lua 5.2.
> Why?  Lua shouldn't be infected with this crap.

LuaPlus's wide string support has, among other things, been used to ship a number of commercial console games (Xbox and others) in Far Eastern languages.

So should Lua be infected with this worldwide language-enabling "crap"?  That's not my call.  Unicode is far more than just 16-bit characters, but having those 16-bit characters was sufficient to ship properly localized string tables.

FWIW, I simply leave the #defines off most of the time.

Josh

Re: Support for Windows unicode paths

Shmuel Zeigerman
Joshua Jensen wrote:

> LuaPlus's wide string support has, among other things, been used to ship
> a number of commercial console games (Xbox and others) in Far Eastern
> languages.
>
> So should Lua be infected with this worldwide language-enabling
> "crap"?  That's not my call.  Unicode is far more than just 16-bit
> characters, but having those 16-bit characters was sufficient to ship
> properly localized string tables.
>
> FWIW, I simply leave the #defines off most of the time.

Here's an example of why I'd like to have Unicode support in Lua.
I maintain a library ("LuaFAR") that allows writing plugins for FAR
(www.farmanager.com) in Lua. A couple of years ago, FAR became a
Unicode application. This forced plugin writers to modify their plugins
to support Unicode too. That was probably not an easy task, but still
quite feasible for those writing plugins in C/C++, Pascal and
PowerShell. It seems that to do the same with Lua, both LuaFAR *and* Lua
itself have to support Unicode.

--
Shmuel

Re: Support for Windows unicode paths

Miles Bader-2
In reply to this post by Joshua Jensen
Joshua Jensen <[hidden email]> writes:
>>> It would be great to have this functionality in Lua 5.2.
>> Why?  Lua shouldn't be infected with this crap.
> LuaPlus's wide string support has, among other things, been used to ship
> a number of commercial console games (Xbox and others) in Far Eastern
> languages.

One doesn't need wide strings to use Unicode; UTF-8 works fine.

Microsoft made a bad call, and they're stuck with it, but Lua need not
do so.

-Miles

--
History, n. An account mostly false, of events mostly unimportant, which are
brought about by rulers mostly knaves, and soldiers mostly fools.

Re: Support for Windows unicode paths

Alex Queiroz
Hallo,

On 7/23/09, Miles Bader <[hidden email]> wrote:
>
>
> One doesn't need wide strings to use unicode, utf-8 works fine.
>
>  Microsoft made a bad call, and they're stuck with it, but Lua need not
>  do so.
>

     UTF-8 is best for serialisation (writing text to disk, to socket
etc.). For in-memory strings it makes a lot of algorithms harder.
UCS-2 was a bad idea, but UTF-16 works perfectly well. UTF-32 is even
better.

--
-alex
http://www.ventonegro.org/

Re: Support for Windows unicode paths

David Given
Alex Queiroz wrote:
[...]
>      UTF-8 is best for serialisation (writing text to disk, to socket
> etc.). For in-memory strings it makes a lot of algorithms harder.
> UCS-2 was a bad idea, but UTF-16 works perfectly well. UTF-32 is even
> better.

Not much, I'm afraid --- as each glyph can be composed of multiple
code points, having fixed-size code points doesn't help a great deal.
Your algorithms still have to cope with variable-sized groups of code
points. And if you're going to do that, you might as well use UTF-8 for
its ASCII interoperability features.

--
┌─── dg@cowlark.com ───── http://www.cowlark.com ─────

│ "They laughed at Newton. They laughed at Einstein. Of course, they
│ also laughed at Bozo the Clown." --- Carl Sagan

Re: Support for Windows unicode paths

Alex Queiroz
Hallo,

On 7/23/09, David Given <[hidden email]> wrote:

> Alex Queiroz wrote:
>  [...]
>
> >     UTF-8 is best for serialisation (writing text to disk, to socket
> > etc.). For in-memory strings it makes a lot of algorithms harder.
> > UCS-2 was a bad idea, but UTF-16 works perfectly well. UTF-32 is even
> > better.
> >
>
>  Not much, I'm afraid --- as each glyph can be composed of multiple code
> points, having fixed-size code points doesn't help a great deal. Your
> algorithms still have to cope with variable-sized groups of code points. And
> if you're going to do that, you might as well use UTF-8 for its ASCII
> interoperability features.
>

     This is an interesting point. And I thought I had everything
figured out for my VM's text handling...
--
-alex
http://www.ventonegro.org/

Re[2]: Support for Windows unicode paths

Bulat Ziganshin
In reply to this post by Alex Queiroz
Hello Alex,

Thursday, July 23, 2009, 5:20:27 PM, you wrote:

>      UTF-8 is best for serialisation (writing text to disk, to socket
> etc.). For in-memory strings it makes a lot of algorithms harder.
> UCS-2 was a bad idea, but UTF-16 works perfectly well. UTF-32 is even
> better.

UCS-4 is better because it's easier to use. UTF-16 doesn't buy us
anything. UTF-8 is really great since it provides an easy migration path
from ASCII.


--
Best regards,
 Bulat                            mailto:[hidden email]


Re: Support for Windows unicode paths

David Given
In reply to this post by Alex Queiroz
Alex Queiroz wrote:
[...]
>      This is an interesting point. And I thought I had everything
> figured out for my VM's text handling...

There's some good reading here (which I hadn't found before):

http://www.unicode.org/reports/tr29

It turns out to be possible to programmatically split a Unicode string
up into its component grapheme clusters (what I was incorrectly
referring to as glyphs, and what most people think of as characters).
So, it ought to be fairly simple to do a Lua addon where you can say:

for c in s:graphemes() do
   print(c)
end

...where c is a *string* containing a particular grapheme cluster (which
might be quite long; the link has an example of a four-code-point
cluster). This would actually allow a string to be broken down into an
array of grapheme clusters to give true random access, which I'd
previously thought was impossible. It'd be expensive, though...
possibly it'd be worth doing lazily.
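
For what it's worth, here's a rough C sketch of the splitting step such an addon would need. It assumes well-formed UTF-8 and a hypothetical is_grapheme_extend() predicate backed by the Unicode property data, and it only handles the simple base-character-plus-extends rule, not the full set of UAX #29 cases:

#include <stddef.h>

extern int is_grapheme_extend(unsigned long cp);   /* assumed: table lookup */

/* Decode one UTF-8 sequence (assumed well-formed); returns bytes consumed. */
static size_t utf8_decode(const unsigned char *s, unsigned long *cp)
{
    if (s[0] < 0x80)      { *cp = s[0]; return 1; }
    else if (s[0] < 0xE0) { *cp = ((s[0] & 0x1FUL) << 6) | (s[1] & 0x3F); return 2; }
    else if (s[0] < 0xF0) { *cp = ((s[0] & 0x0FUL) << 12) | ((s[1] & 0x3FUL) << 6)
                                  | (s[2] & 0x3F); return 3; }
    else                  { *cp = ((s[0] & 0x07UL) << 18) | ((s[1] & 0x3FUL) << 12)
                                  | ((s[2] & 0x3FUL) << 6) | (s[3] & 0x3F); return 4; }
}

/* Byte length of the grapheme cluster starting at s: one base character
   followed by any number of extending marks.  An iterator can then push
   each cluster as a Lua string with lua_pushlstring(L, s, grapheme_len(s, len)). */
static size_t grapheme_len(const unsigned char *s, size_t len)
{
    unsigned long cp;
    size_t pos = utf8_decode(s, &cp);
    while (pos < len) {
        size_t n = utf8_decode(s + pos, &cp);
        if (!is_grapheme_extend(cp)) break;      /* next cluster starts here */
        pos += n;
    }
    return pos;
}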

--
┌─── dg@cowlark.com ───── http://www.cowlark.com ─────

│ "They laughed at Newton. They laughed at Einstein. Of course, they
│ also laughed at Bozo the Clown." --- Carl Sagan

Re: Support for Windows unicode paths

Klaus Ripke
On Thu, Jul 23, 2009 at 03:05:00PM +0100, David Given wrote:
...
> It turns out to be possible to programmatically split a Unicode string
> up into its component grapheme clusters (what I was incorrectly
> referring to as glyphs, and what most people think of as characters).
> So, it ought to be fairly simple to do a Lua addon where you can say:
>
> for c in s:graphemes() do
>   print(c)
> end
It is, and actually slnunicode does this, with a few caveats:

"
See http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
for default grapheme clusters.
Lazy westerners we are (and lacking the Hangul_Syllable_Type data),
we care for base char + Grapheme_Extend, but not for Hangul syllable sequences.

For http://unicode.org/Public/UNIDATA/UCD.html#Grapheme_Extend
we use Mn (NON_SPACING_MARK) + Me (ENCLOSING_MARK),
ignoring the 18 mostly south asian Other_Grapheme_Extend (16 Mc, 2 Cf) from
http://www.unicode.org/Public/UNIDATA/PropList.txt
"

It provides multiple string libs, one of which operates on graphemes,
meaning length, substr, etc. all count grapheme clusters.

> ...where c is a *string* containing a particular grapheme cluster (which
> might be quite long; the link has an example of a four-code-point
> cluster). This would actually allow a string to be broken down into an
> array of grapheme clusters to give true random access, which I'd
> previously thought was impossible. It'd be expensive, though...
> possibly it'd be worth doing lazily.
It boils down to snippets like:
      if (MODE_GRAPH == mode)
        while (Grapheme_Extend(code) && p>s) code = utf8_oced(&p, s);

It is not much more expensive than plain UTF-8,
which in turn is not more expensive than UTF-16 done right,
i.e. with checking for the surrogate pairs that encode characters
beyond the BMP.
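
For reference, the surrogate-pair handling amounts to something like this (a sketch, assuming well-formed UTF-16):

#include <stddef.h>

/* Decode one code point from UTF-16; reports how many 16-bit units were
   consumed (1 for a BMP character, 2 for a surrogate pair). */
static unsigned long utf16_decode(const unsigned short *s, size_t *consumed)
{
    unsigned short hi = s[0];
    if (hi >= 0xD800 && hi <= 0xDBFF) {            /* high surrogate */
        unsigned short lo = s[1];
        *consumed = 2;
        return 0x10000UL + (((unsigned long)(hi - 0xD800) << 10) | (lo - 0xDC00));
    }
    *consumed = 1;
    return hi;
}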


enjoy