[ANN] Working with UTF-8 filenames on Windows in pure Lua


[ANN] Working with UTF-8 filenames on Windows in pure Lua

Egor Skriptunoff-2
Hi!

If you are creating a portable Lua script (Linux/Windows/macOS),
then you have a problem: standard Lua functions such as "io.open"
expect a filename encoded in UTF-8 on all "normal" OSes,
but on Windows a filename must be in a Windows-specific encoding
that depends on the locale.

There is a pure Lua solution to this problem.


This module checks the OS and, if Windows is detected,
modifies all 11 filename-related standard Lua functions
so that they accept filenames encoded in UTF-8.
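
Roughly, the idea looks like this (a simplified sketch, not the module's
actual code; the conversion function below is only a placeholder):

-- Sketch: detect Windows and wrap io.open so that a UTF-8 filename is
-- converted to the ANSI codepage before the original function is called.
local is_windows = package.config:sub(1, 1) == "\\"

local function utf8_to_acp(name)
  -- Placeholder: the real module maps each UTF-8 code point to its byte(s)
  -- in the active Windows ANSI codepage (1251, 1252, 932, ...).
  -- Here non-ASCII runs are simply replaced, just to keep the sketch short.
  return (name:gsub("[\128-\255]+", "?"))
end

if is_windows then
  local orig_open = io.open
  io.open = function(filename, ...)
    return orig_open(utf8_to_acp(filename), ...)
  end
  -- The same wrapping is applied to the other filename-related functions:
  -- io.lines, io.popen, os.remove, os.rename, loadfile, dofile, etc.
end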

-- Egor


Re: [ANN] Working with UTF-8 filenames on Windows in pure Lua

Viacheslav Usov
On Thu, Jan 17, 2019 at 10:50 PM Egor Skriptunoff <[hidden email]> wrote:
> Hi!
>
> If you are creating a portable Lua script (Linux/Windows/macOS),
> then you have a problem: standard Lua functions such as "io.open"
> expect a filename encoded in UTF-8 on all "normal" OSes,
> but on Windows a filename must be in a Windows-specific encoding
> that depends on the locale.
>
> There is a pure Lua solution to this problem.

There is no pure Lua solution to this problem.

You are aware of this fact and you mentioned a further constraint in your code:

-- Please note that filenames must contain only symbols from your Windows ANSI codepage (which depends on OS locale).
-- Unfortunately, it's impossible to work with a file having arbitrary UTF-8 symbols in its name.

Practically, if your code page is Cyrillic, you cannot specify a file with a Chinese name even though the file exists [1].

One would have to use the Windows API accepting UTF-16 strings, but that is impossible in a pure Lua library.
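
For illustration, this is roughly what the non-pure route looks like
(a sketch that assumes LuaJIT's ffi library, i.e. not plain Lua):

-- Convert the UTF-8 name to UTF-16 and hand it to a W-variant API.
local ffi = require("ffi")
ffi.cdef[[
int MultiByteToWideChar(unsigned int cp, unsigned long flags,
                        const char *src, int srclen,
                        wchar_t *dst, int dstlen);
void *_wfopen(const wchar_t *name, const wchar_t *mode);
]]
local CP_UTF8 = 65001

local function to_utf16(s)                    -- UTF-8 string -> wchar_t buffer
  local n = ffi.C.MultiByteToWideChar(CP_UTF8, 0, s, #s, nil, 0)
  local buf = ffi.new("wchar_t[?]", n + 1)    -- zero-filled, so NUL-terminated
  ffi.C.MultiByteToWideChar(CP_UTF8, 0, s, #s, buf, n)
  return buf
end

-- e.g. local handle = ffi.C._wfopen(to_utf16("отчёт.txt"), to_utf16("rb"))
-- ("отчёт.txt" is just an example name)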

Cheers,
V.

[1] I am not sure if any current Windows versions support UTF-8 as a system code page. If that were possible, that would invalidate my statement.

Re: [ANN] Working with UTF-8 filenames on Windows in pure Lua

Jerome Vuarand
On Sat, 19 Jan 2019 at 14:10, Viacheslav Usov <[hidden email]> wrote:
> There is no pure Lua solution to this problem.

I came to the same conclusion. I personally maintain a set of patches
[1] that make Lua (also LFS) use the W variants of all OS APIs
involving filenames. So in a sense all my Lua interpreters are impure,
but they're so much more usable for Windows scripting.

[1] https://bitbucket.org/doub/canopywater/src/default/srcweb/lua-5.3.4/patches/


Re: [ANN] Working with UTF-8 filenames on Windows in pure Lua

Egor Skriptunoff-2
In reply to this post by Viacheslav Usov
On Sat, Jan 19, 2019 at 5:10 PM Viacheslav Usov wrote:
> On Thu, Jan 17, 2019 at 10:50 PM Egor Skriptunoff wrote:
>> If you are creating a portable Lua script (Linux/Windows/macOS),
>> then you have a problem: standard Lua functions such as "io.open"
>> expect a filename encoded in UTF-8 on all "normal" OSes,
>> but on Windows a filename must be in a Windows-specific encoding
>> that depends on the locale.
>>
>> There is a pure Lua solution to this problem.
>
> There is no pure Lua solution to this problem.
>
> You are aware of this fact and you mentioned a further constraint in your code:
> -- Please note that filenames must contain only symbols from your Windows ANSI codepage (which depends on OS locale).
> -- Unfortunately, it's impossible to work with a file having arbitrary UTF-8 symbols in its name.
> Practically, if your code page is Cyrillic, you cannot specify a file with a Chinese name even though the file exists.


From the perfectionism point of view, you're correct :-)
But in practice, most use cases are covered by my module.

For example, on all my computers there are no files containing Chinese characters in their filenames.
Assuming you don't speak Chinese, let me ask: do you have such filenames on your machines?  ;-)

My module allows a user to use their native language in filenames ON THEIR OWN COMPUTER.
But it's not a good idea to have, for example, Chinese filenames on a computer where the user doesn't speak Chinese.
Although it's technically possible in all modern OSes, it's inconvenient for the user.
The user should be able to understand the meaning of a filename.
(If a filename is not supposed to be human-readable, it would be better composed of digits/hexadecimals/GUIDs/etc. instead of human-language words.)

P.S.  Sorry, I wasn't honest enough.
It appears that I do have some filenames containing non-Cyrillic symbols on my computer.
But anyway, such files shouldn't be taken seriously, as they are all inside the "porn" folder  :-)


Re: [ANN] Working with UTF-8 filenames on Windows in pure Lua

Philippe Verdy
East-Asian versions of Windows (Chinese, Japanese, Korean) define their "ANSI" legacy code page as a multibyte charset (characters are encoded on 1 to 2 bytes), so they CAN use filenames with Chinese characters in these legacy charsets (though with a limited repertoire, a subset of what is available in Unicode). In more recent versions of Windows, these charsets were updated to support GB18030, which is the extension of legacy GBK that covers the WHOLE repertoire of Unicode/ISO 10646 while remaining upward compatible with legacy GBK, by extending some codes to use up to 4 bytes per character.
GB18030 support is mandatory in China, and it was added to Windows about 20 years ago (so that Unicode encodings such as UTF-8 and UTF-16 are not the only option for Chinese).
Today China, like all other countries, favors the Unicode UTFs because they offer better interoperability and do not require the filesystem to maintain dual encodings (UTF-16 for Windows Unicode, and legacy ISO 8859-*/ANSI/OEM/GB*/HK*/KOI* charsets for the old console, all of them having Windows codepages). Even the console now supports Unicode ("CHCP 65001" for the UTF-8 codepage). You can still use the legacy Chinese charsets on Windows, but the old GBK codepage may sometimes be lossy when outputting the Unicode result of Windows console apps.
If you use NTFS, storing filenames in legacy charsets can be disabled (it is now disabled by default in Windows 10, and you could disable it in Windows 7/8/8.1 on NTFS volumes), and the generation of 8.3 short filenames is no longer necessary; with FAT32, the LFN extension for "long filenames" was made to use UTF-16 natively.
Legacy charsets are just there for compatibility with Windows XP when using external drives (such as USB) formatted with FAT32.
On NTFS, only UTF-16 is needed: the NTFS volume already contains a special hidden file named "\$UpCase" to support correct indexing and sorting of case-insensitive filenames for searches and directory listings, even if the Unicode version is later updated. The conversion from UTF-16 to legacy charsets is made on the fly by the kernel, using this mapping file when needed (because Unicode is versioned).
I see no real reasons to continue using any Windows app compiled with the legacy "ANSI" APIs of Win32. Everyone now compiles with "UNICODE" (when using the Win32 API), and Unicode is also the default for the .NET and UWP APIs. Now Windows 10 has started deprecating the Win32 API.

The Windows console is now fully compliant with Unicode. There just remain internal codepages used by legacy drivers at boot time, but they are also being migrated to support Unicode natively (this is now the case for all built-in Microsoft drivers and drivers from well-known vendors, in order to get WHQL certification; and with the security requirements of Windows 10, requiring signatures and WHQL certification to support Secure Boot, manufacturers have no choice: they must support Unicode or their devices won't be available at boot time, notably storage and input devices, as well as display devices. If these devices don't meet this requirement, Windows will only boot with its built-in compatibility drivers, using software emulation, and these devices will be slow, without acceleration, which will only be turned on after the graphics environment is loaded, provided that the user-mode helper drivers loaded afterwards are also compiled with Unicode).

You actually no longer need any filesystems with legacy charsets except for external storage drives used by small devices (such as USB flash keys or SD cards); but actually these devices do not even need those charsets, and the filenames they use (e.g. for naming photos, or for storing flashable firmware) are reduced to ASCII only. The only case is when you transfer some music/videos onto a flash drive to play them on a TV or audio system, so that they display the filenames correctly in their menu instead of just garbage boxes or "?" signs (most of these devices will only support FAT32; possibly they recognize the LFN extension, which is encoded with 16-bit Unicode, so they won't actually use the legacy 8.3 names encoded with the legacy 8-bit codepages; the legacy 8.3 names were not interoperable anyway, because FAT did not explicitly store in its volume metadata descriptor which codepage they were encoded with, so these filenames were already interpreted differently depending on the default system locale of the OS mounting the volume: modern OSes ignore these legacy filenames if Unicode LFN filenames are present, and the generation of new 8.3 filenames on these volumes is notoriously unreliable if you do that on a FAT32 volume remounted between different host systems; you can still run "CHKDSK" on these devices to fix the mixed encoding of these 8.3 filenames according to the LFN Unicode filenames).

So I see no interest in your development, except to support Windows 95/98/XP, whose support has been terminated by Microsoft and whose security is now too compromised, and to support the legacy compatibility modes of Windows 7/8/8.1, whose support is also terminated (even if they still receive security patches).

We could make the same remark about legacy charsets used in Linux and Unix and legacy protocols (like FTP, or HTML4): they are deprecated. Everyone should use Unicode by default (either UTF-8 or UTF-16). Modern installations of Linux all use UTF-8 by default now, and legacy charsets have limited support and cause bugs (which can turn into security risks caused by unexpected filename clashes, and security risks matter so much today that people don't want to take them on, and even manufacturers, OEMs, and OS providers don't want to take them on). Expect all these legacy charsets to die now. Unicode is now used for much more data than all the other legacy codepages combined (this is even true in China now, where the mandatory support of GB18030 in systems is no longer used, given that Unicode is fully interoperable with GB18030 but has lower cost).

There's still the default "OEM" charset of the Windows console that uses them, but we should switch to codepage 65001 (UTF-8) instead of everything else. Everything can work with Unicode only.




Re: [ANN] Working with UTF-8 filenames on Windows in pure Lua

Coda Highland
In reply to this post by Egor Skriptunoff-2
On Sun, Jan 20, 2019 at 5:13 AM Egor Skriptunoff
<[hidden email]> wrote:
> For example, on all my computers there are no files containing Chinese characters in their filenames.
> Assuming you don't speak Chinese, let me ask: do you have such filenames on your machines?  ;-)

How about, y'know, people who are bilingual? >.> I've got a LOT of
Japanese filenames on my machine.

/s/ Adam


Re: [ANN] Working with UTF-8 filenames on Windows in pure Lua

Soni "They/Them" L.
In reply to this post by Philippe Verdy


I guess this thing does help with ReactOS support. But I'm sure ReactOS
will catch up to the new stuff at some point and it'll no longer be
necessary.




Re: [ANN] Working with UTF-8 filenames on Windows in pure Lua

Egor Skriptunoff-2
In reply to this post by Philippe Verdy
On Sun, Jan 20, 2019 at 3:36 PM Philippe Verdy wrote:
> I see no real reasons to continue using any Windows app compiled with the legacy "ANSI" APIs of Win32.

Lua is compiled with the "ANSI" part of the WinAPI,
and I do have real reasons to continue using Lua on Windows  :-)


> So I see no interest in your development, except to support Windows 95/98/XP

The problem is not specific to old OSes.
Try running the latest Lua under the latest Windows and you'll see that nothing has changed since Win95 :-)
Under Windows 10 you still have to provide filenames in ANSI encoding for "io.open".
And your very long post doesn't suggest any solution to this problem,
except the arrogant phrase "I see no real reasons to continue using any Windows app compiled with the legacy ANSI API" (which obviously includes Lua).
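
A minimal illustration of what I mean (the filename is just an example):

-- On a stock Windows build of Lua, io.open hands the bytes straight to the
-- C runtime's fopen(), which interprets them in the ANSI codepage, so the
-- UTF-8 spelling of a non-ASCII name usually does not match the real file.
local f, err = io.open("отчёт.txt", "r")   -- UTF-8 bytes of a Cyrillic name
print(f, err)   -- typically: nil, plus an error message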


Re: [ANN] Working with UTF-8 filenames on Windows in pure Lua

Viacheslav Usov
In reply to this post by Egor Skriptunoff-2
On Sun, Jan 20, 2019 at 12:13 PM Egor Skriptunoff <[hidden email]> wrote:
 
> From the perfectionism point of view, you're correct :-)
> But in practice, most use cases are covered by my module.

Your claim is: "Working with UTF-8 filenames on Windows in pure Lua"
Your solution is: "Working with a subset-of-UTF-8 filenames on Windows in pure Lua, wherein the subset is selected by arcane system configuration".

Seeing this discrepancy is not perfectionism, while your "most use cases" is wishful thinking.

Here is what looks more like perfectionism: your solution seems to ignore malformed UTF-8 inputs, and that's not good.

> For example, on all my computers there are no files containing Chinese characters in their filenames.
> Assuming you don't speak Chinese, let me ask: do you have such filenames on your machines?  ;-)

Cyrillic and Chinese were a hypothetical example. If you really care about what I have, I certainly have file names with Cyrillic and accented Latin characters. Examples from one computer:

Азбука Морзе. Прием на слух и передача на ключе. 1927г..djvu (fun)

Рисунки для выжигания ТРОЕ ИЗ ПРОСТОКВАШИНО.doc (fun)

Brevet de technicien supérieur (BTS).pdf (boring)

Übersendung_Vergütungsvereinbarung.pdf (boring)

That computer, and I think just about every Windows computer I use, has code page 1252 (Latin 1/Western European) as its system code page, which makes your solution only good for the boring stuff in my case.

> My module allows a user to use their native language in filenames ON THEIR OWN COMPUTER.
> But it's not a good idea to have, for example, Chinese filenames on a computer where the user doesn't speak Chinese.

What about users who speak Chinese? Should they be allowed only 7-bit ASCII in addition to Chinese?
 
> Although it's technically possible in all modern OSes, it's inconvenient for the user.
> The user should be able to understand the meaning of a filename.

I certainly understand all those filenames.
 
> (If a filename is not supposed to be human-readable, it would be better composed of digits/hexadecimals/GUIDs/etc. instead of human-language words.)

What about servers that keep files for multiple users who collectively speak a bunch of languages?

Let's face it, your solution may be good for you, but it is not good for filenames with arbitrary Unicode characters, and your further argument is akin to "640K ought to be enough for anybody".

As I said originally, there is no way to address this with a pure Lua library, but perhaps we could hope the Lua team might recognise the importance of making the built-in libraries Unicode-friendly on Windows, even if that means using non-standard IO routines.

Cheers,
V.

Re: [ANN] Working with UTF-8 filenames on Windows in pure Lua

Viacheslav Usov
In reply to this post by Philippe Verdy
On Sun, Jan 20, 2019 at 1:36 PM Philippe Verdy <[hidden email]> wrote:

> Everyone now compiles with "UNICODE" (when using the Win32 API) 

False.

> Now Windows 10 has started deprecating the Win32 API.

False.

>  Everything can work with Unicode only.

For standard-conformant C code on Windows, false.

Cheers,
V.

Re: [ANN] Working with UTF-8 filenames on Windows in pure Lua

Egor Skriptunoff-2
In reply to this post by Coda Highland
On Sun, Jan 20, 2019 at 6:18 PM Coda Highland wrote:

>> For example, on all my computers there are no files containing Chinese characters in their filenames.
>> Assuming you don't speak Chinese, let me ask: do you have such filenames on your machines?  ;-)
>
> How about, y'know, people who are bilingual? >.> I've got a LOT of
> Japanese filenames on my machine.


What's your second language?
If it's English, then the 932 codepage in Windows will let you work in both languages simultaneously without a problem.
My module supports codepage 932 too.
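
For what it's worth, a pure Lua script can find out which ANSI codepage is
active, for example by querying the registry (a rough sketch of one possible
way, not necessarily how the module does it):

-- Ask reg.exe for the ACP value and parse it ("1251", "932", "1252", ...).
local function get_ansi_codepage()
  local cmd = 'reg query "HKLM\\SYSTEM\\CurrentControlSet\\Control\\Nls\\CodePage" /v ACP'
  local pipe = io.popen(cmd)
  if not pipe then return nil end
  local output = pipe:read("*a")
  pipe:close()
  return output:match("ACP%s+REG_SZ%s+(%d+)")
end

print(get_ansi_codepage())   -- e.g. "932" on a Japanese-locale system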


Re: [ANN] Working with UTF-8 filenames on Windows in pure Lua

Egor Skriptunoff-2
In reply to this post by Viacheslav Usov
On Sun, Jan 20, 2019 at 6:50 PM Viacheslav Usov wrote:
 
>> From the perfectionism point of view, you're correct :-)
>> But in practice, most use cases are covered by my module.
>
> Your claim is: "Working with UTF-8 filenames on Windows in pure Lua"
> Your solution is: "Working with a subset-of-UTF-8 filenames on Windows in pure Lua, wherein the subset is selected by arcane system configuration".


My solution is "subset-of-UTF-8 corresponding to the native language of the user".
Yes, unfortunately, this approach might not be suitable for a trilingual like you :-)

 

> Here is what looks more like perfectionism: your solution seems to ignore malformed UTF-8 inputs, and that's not good.


What would you prefer?  And why?
Raising an error? 
Returning "nil, err_message"?

BTW, in Lua on Linux you can create a file with a malformed UTF-8 filename (I just tested it): no error is raised, the operation succeeds.
Why should Lua on Windows check the correctness of UTF-8 filenames?
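
For reference, the "nil, err_message" option would amount to running a check
like this before converting the name (a rough sketch: it verifies only the
byte structure, not overlong forms or surrogate code points):

local function check_utf8(s)
  local i, n = 1, #s
  while i <= n do
    local c = s:byte(i)
    local extra
    if c < 0x80 then extra = 0                    -- ASCII
    elseif c >= 0xC2 and c <= 0xDF then extra = 1 -- 2-byte sequence
    elseif c >= 0xE0 and c <= 0xEF then extra = 2 -- 3-byte sequence
    elseif c >= 0xF0 and c <= 0xF4 then extra = 3 -- 4-byte sequence
    else return nil, "malformed UTF-8 at byte "..i end
    for j = i + 1, i + extra do
      local b = s:byte(j)
      if not b or b < 0x80 or b > 0xBF then       -- continuation byte check
        return nil, "malformed UTF-8 at byte "..i
      end
    end
    i = i + extra + 1
  end
  return true
end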

 

> Азбука Морзе. Прием на слух и передача на ключе. 1927г..djvu (fun)
> Рисунки для выжигания ТРОЕ ИЗ ПРОСТОКВАШИНО.doc (fun)
> Brevet de technicien supérieur (BTS).pdf (boring)
> Übersendung_Vergütungsvereinbarung.pdf (boring)

It's funny that Windows codepage 936 (Simplified Chinese) can represent all four of these filenames except the letter "Ü".


 
>> (If a filename is not supposed to be human-readable, it would be better composed of digits/hexadecimals/GUIDs/etc. instead of human-language words.)
>
> What about servers that keep files for multiple users who collectively speak a bunch of languages?


A server with a lot of clients all over the world?  On Windows?  Ha-ha-ha.
Even Microsoft refuses to use this OS on its own servers.
(The most popular operating system on Microsoft's Azure cloud today is -- drumroll please -- Linux.)



> As I said originally, there is no way to address this with a pure Lua library, but perhaps we could hope the Lua team might recognise the importance of making the built-in libraries Unicode-friendly on Windows, even if that means using non-standard IO routines.


You are asking for too much.
This Windows-specific part would take a lot of LOC in the Lua sources.

 

Re: [ANN] Working with UTF-8 filenames on Windows in pure Lua

Philippe Verdy
In reply to this post by Viacheslav Usov


On Sun, 20 Jan 2019 at 16:33, Viacheslav Usov <[hidden email]> wrote:
> On Sun, Jan 20, 2019 at 1:36 PM Philippe Verdy <[hidden email]> wrote:
>
>> Everyone now compiles with "UNICODE" (when using the Win32 API)
>
> False.

Who does not? Microsoft itself is slowly replacing most of its code (except the code supporting very old hardware or software no longer supported by its manufacturers, kept only to preserve some minimum level of compatibility). Software makers were slow to propose 64-bit apps, but the move is underway and most initial difficulties are solved (and the trend will accelerate because of the limitations of mitigations with 32-bit software and kernels).
Even smartphones are now 64-bit; the only remaining things are small IoT devices (but work is underway to hide this detail by using virtual machines on top of the small CPU capabilities).


>> Now Windows 10 has started deprecating the Win32 API.
>
> False.

True... You've not seen the latest development builds and the fact that they add many security restrictions, moving to isolate the Win32 API in a sandboxed subsystem for legacy apps...

>> Everything can work with Unicode only.
>
> For standard-conformant C code on Windows, false.

True; Unicode can be used with the Windows API (UTF-16) directly even if the application I/O uses UTF-8. The kernel itself provides the conversion when needed, without loss, for legacy 8-bit I/O using UTF-8; ANSI/OEM charsets have caused many interoperability problems and are no longer needed, even for the console (which has been improved to fully support Unicode/UTF-8, including at boot time).
Even Unix/Linux now deprecate the use of legacy 8-bit charsets and preinstall with UTF-8 locales (you are no longer required to install all the legacy 8-bit charsets, except possibly a single one for FAT-formatted external storage used with legacy external devices).
UEFI boot also no longer needs OEM charsets; Secure Boot now requires UEFI; old hardware with only BIOS boot mode is deprecated (it isn't secure enough and has a lot of legacy compatibility limits, including storage sizes; this was made for the old 16/32-bit mode, but with 64-bit systems, memory now largely exceeding 4 GB on every PC, storage exceeding the legacy BIOS limits, plus the need to support new 4 KB sector sizes, the old tricks no longer work reliably or carry a severe performance cost).



> Cheers,

Not so. You seem to like the 32-bit world, but it is now definitely broken and causes more and more problems once you use it (except when such things run in completely sandboxed virtual machines with very restricted I/O).


Re: [ANN] Working with UTF-8 filenames on Windows in pure Lua

Viacheslav Usov
In reply to this post by Egor Skriptunoff-2
On Sun, Jan 20, 2019 at 6:51 PM Egor Skriptunoff <[hidden email]> wrote:

> My solution is "subset-of-UTF-8 corresponding to the native language of the user".

I already demonstrated that this was not true, and so did your own remark about code page 936. You are not rational about this, and you avoided a civil discussion of the points in my previous message, so I am not willing to take this discussion any further.

> This Windows-specific part would take a lot of LOC in Lua source.

A patch has been posted in this same thread, so we are talking about homework already done.

Cheers,
V.

Re: [ANN] Working with UTF-8 filenames on Windows in pure Lua

Egor Skriptunoff-2
On Mon, Jan 21, 2019 at 12:55 PM Viacheslav Usov wrote:

>> My solution is "subset-of-UTF-8 corresponding to the native language of the user".
>
> I already demonstrated that this was not true, and so did your own remark about code page 936. You are not rational about this, and you avoided a civil discussion of the points in my previous message, so I am not willing to take this discussion any further.



I've already agreed in my previous post that my solution is not suitable
for multilinguals.
So the "civil discussion" should end here, as both sides came to the same conclusion.
I don't understand what you are arguing about now.
Are you waiting for a personal apology?  :-)
Here you are:
Sorry, Viacheslav, you're an exception
(a user who might have some specific tasks my module is unable to solve).
 
Yes, I was targeting "usual" users who know 1.1 languages:
their native language + 10% of English :-)
I believe this is the most numerous group of users.
If Windows is installed with default settings,
the ANSI encoding corresponds to the language of Windows,
and very few people change this setting afterwards.
Experienced users, of course, could change this setting,
but I believe they would clearly understand what may happen
in a non-Unicode Windows app (such as Lua)
when the ANSI codepage is set to a "wrong" language.

I hope the discussion is now closed without any offense.
 
BTW, the life of multilingual people is not easy anyway  :-)
What keyboards are you using?
How did you survive in the pre-Unicode era, without Unicode fonts available?
I can't imagine how you had those 4 files (from your example) on one disk :-)