Use SetConsoleOutputCP on windows

Marcus Mason
Hi, as I'm sure the Windows users on the list are aware, the Lua standalone application often struggles with mojibake when printing to the Windows terminal. To be clear, this is a problem with Windows consoles specifically; somewhat ironically, using Lua from within a WSL session produces no mojibake. The Windows function SetConsoleOutputCP "sets the output code page used by the console associated with the calling process", which may be useful for printing correctly from Lua by selecting the UTF-8 code page. Has anyone looked into something like this before?
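
For readers who want to see the call in question, here is a minimal sketch in Python, using ctypes as a stand-in for a small C binding. The function name set_console_output_utf8 is hypothetical; only SetConsoleOutputCP and the code page number 65001 come from the Windows API. On non-Windows systems the sketch simply reports that nothing was changed.

```python
import ctypes
import sys

CP_UTF8 = 65001  # Windows code page identifier for UTF-8


def set_console_output_utf8():
    """Try to switch the attached console's output code page to UTF-8.

    Returns True on success, False on failure or on non-Windows systems.
    (Hypothetical helper name; only SetConsoleOutputCP is a real API.)
    """
    if sys.platform != "win32":
        return False  # no console code page to change here
    kernel32 = ctypes.windll.kernel32
    # SetConsoleOutputCP returns nonzero on success and zero on failure.
    return bool(kernel32.SetConsoleOutputCP(CP_UTF8))


print("switched console output to UTF-8:", set_console_output_utf8())
```

Whether the text then renders correctly still depends on the console and its font, so this should be read as an experiment rather than a fix.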

Re: Use SetConsoleOutputCP on windows

Scott Morgan
On 13/01/2021 11:30, Marcus Mason wrote:
> Hi, as I'm sure the windows users on the list are aware, the lua
> standalone application often struggles with mojibake when printing to
> the windows terminal. To be clear, this is a problem with windows
> consoles specifically, and somewhat ironically using lua from within a
> WSL session leads to no mojibake. The windows function
> SetConsoleOutputCP "Sets the output code page used by the console
> associated with the calling process."  which may be useful in correctly
> printing from lua by selecting the utf8 codepage. Has anyone looked into
> something like this before?


This would only work on very recent versions of Windows (I think Win10
v2004, maybe), and not very reliably.

However, you don't need the direct Win32 system call; you can use the
`chcp` command:


-- Set console output to UTF8
os.execute"chcp 65001"
-- print 'Ναί' (Greek 'yes') in UTF8
print(string.char(0xce, 0x9d, 0xce, 0xb1, 0xce, 0xaf))
-- print '番号' (Japanese 'number', as in "No.") in UTF8
print(string.char(0xe7, 0x95, 0xaa, 0xe5, 0x8f, 0xb7))
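
The byte values in the snippet above can be checked independently of any console. The following Python sketch decodes the same two byte sequences as UTF-8 and lists the resulting code points; the variable names are illustrative only.

```python
# The same byte values that the Lua snippet passes to string.char.
greek_bytes = bytes([0xCE, 0x9D, 0xCE, 0xB1, 0xCE, 0xAF])
japanese_bytes = bytes([0xE7, 0x95, 0xAA, 0xE5, 0x8F, 0xB7])

# Decoding raises UnicodeDecodeError if the bytes are not valid UTF-8.
greek_text = greek_bytes.decode("utf-8")
japanese_text = japanese_bytes.decode("utf-8")

# Print code points instead of the characters themselves, so the output
# stays readable even on a console that cannot render them.
for name, text in (("greek", greek_text), ("japanese", japanese_text)):
    print(name, ["U+%04X" % ord(ch) for ch in text])
```

Each two-byte or three-byte group maps to exactly one code point, so the data is well-formed UTF-8 and any garbling comes from how the console interprets or renders it.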


And there are still issues with input.

Microsoft may fix things in the future, or with a replacement terminal
utility[1]. But who knows?

Scott

[1] https://github.com/microsoft/terminal

Re: Use SetConsoleOutputCP on windows

Matthias Kluwe
In reply to this post by Marcus Mason
Hi!

On Wed, 13 Jan 2021 at 12:30, Marcus Mason <[hidden email]> wrote:

>
> Hi, as I'm sure the windows users on the list are aware, the lua
> standalone application often struggles with mojibake when printing to
> the windows terminal. To be clear, this is a problem with windows
> consoles specifically, and somewhat ironically using lua from within a
> WSL session leads to no mojibake. The windows function
> SetConsoleOutputCP "Sets the output code page used by the console
> associated with the calling process."  which may be useful in
> correctly printing from lua by selecting the utf8 codepage. Has anyone
> looked into something like this before?

Setting the codepage to 65001 works most of the time, but this is not
a general solution, as output buffering can interfere.

You can try

    fputs( "\xc3\xbc", stdout );

vs.

    putc('\xc3', stdout); putc('\xbc', stdout);

See https://stackoverflow.com/questions/45575863/how-to-print-utf-8-strings-to-stdcout-on-windows
or https://stackoverflow.com/questions/1660492/utf-8-output-on-windows-console
for further discussion (ignoring the C++ parts).
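
Why the two C variants can behave differently is easier to see with a small model. The Python sketch below assumes a consumer that decodes every write in isolation; that is an assumption made for illustration, not a description of the console's actual implementation. It compares this with a decoder that keeps state between writes.

```python
import codecs

encoded = "\u00fc".encode("utf-8")  # U+00FC becomes the two bytes C3 BC

# Case 1: both bytes arrive in a single write and decode cleanly.
whole = encoded.decode("utf-8")

# Case 2: each byte arrives in its own write and is decoded in isolation.
# Neither byte is valid UTF-8 on its own, so each becomes U+FFFD.
piecewise = "".join(
    bytes([b]).decode("utf-8", errors="replace") for b in encoded
)

# Case 3: a stateful decoder buffers the lead byte until the sequence
# is complete, so splitting the writes does no harm.
decoder = codecs.getincrementaldecoder("utf-8")()
incremental = "".join(decoder.decode(bytes([b])) for b in encoded)

for label, text in (("whole", whole), ("piecewise", piecewise),
                    ("incremental", incremental)):
    print(label, ["U+%04X" % ord(ch) for ch in text])
```

Under this model, unbuffered byte-at-a-time output corresponds to the second case, which is one plausible reading of how buffering can interfere.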

In summary: No, UTF-8 is not supported by the Windows console.

Regards,
Matthias

Re: Use SetConsoleOutputCP on windows

Marcus Mason
Perhaps I should clarify my intent here. I want Lua to print correctly on Windows, like all the other programming language binaries I use on my computer. Chez Scheme, various JavaScript environments, Clojure, etc. have no issues printing for me. Why is lua.exe not able to do it?

Re: Use SetConsoleOutputCP on windows

Viacheslav Usov
On Wed, Jan 13, 2021 at 2:01 PM Marcus Mason <[hidden email]> wrote:
>
> I want lua to print correctly on windows [...] Why is lua.exe not able to do it?

What makes you think it is unable to?

Cheers,
V.

Re: Use SetConsoleOutputCP on windows

Diego Nehab-4
Is this the issue we sometimes solve using https://github.com/rprichard/winpty/?

Re: Use SetConsoleOutputCP on windows

Marcus Mason
In reply to this post by Viacheslav Usov
I have code that reads input using the utf8 library. I was testing it with some string literals and noticed my output was nonsense. I then tried the same code on Linux and it works correctly.

Re: Use SetConsoleOutputCP on windows

Scott Morgan
In reply to this post by Marcus Mason
On 13/01/2021 13:00, Marcus Mason wrote:
> Perhaps I should clarify my intent here. I want lua to print correctly
> on windows, like all the other programming language binaries I use on my
> computer. Chez scheme, various javascript environments, clojure, etc
> have no issues printing for me. Why is lua.exe not able to do it?

On Windows, Python doesn't[1] and Perl doesn't (I don't use Clojure or
JavaScript).

You need to stop and do some reading.

Ultimately, it's a problem with Windows, not Lua. *nix systems properly
support UTF-8 on the console; Windows doesn't.


[1] Trying the following:

print(u"Greek: 'Ναί', Japanese: '番号'")

Outputs something that looks like:

Greek: 'Ναί', Japanese: '□□'

But you can copy/paste that into an editor and see the correct text;
it's just the console messing up the rendering.

Re: Use SetConsoleOutputCP on windows

Marcus Mason
I believe some of those languages use other encodings in their source code, which I am not suggesting, but I am aware that in some cases it's just coincidence. I also do not believe that it is a Windows problem: Lua works on *nix because it outputs the same encoding those systems expect. If the encoding expected by those systems changed, would it not be the same situation? I just found something about setting the active code page in an application manifest, but to be honest I have no idea what that means.

Re: Use SetConsoleOutputCP on windows

Scott Morgan
On 13/01/2021 15:50, Marcus Mason wrote:
> I believe some of those languages are using other encodings in their
> source code, which I am not suggesting but I am aware that in some cases
> it's just coincidence. I also do not believe that it is a windows
> problem, the reason that lua works this way on *nix is because it's
> outputting the same encoding, if the encoding expected by those systems
> changed would it not be same situation? I just found something about
> setting the active code page in an application manifest, but I have no
> idea what that means to be honest.

It *IS* a Windows problem. The default Windows console does not properly
support UTF-8. This is a known issue!

https://devblogs.microsoft.com/commandline/windows-command-line-unicode-and-utf-8-output-text-buffer/
> Alas, the Windows Console is not (currently) able to support UTF-8 text!

They are working to fix it, but they are not there yet.

Alternatively, download the new Windows Terminal app from the MS App
Store and try that. It does support UTF8 properly.

Re: Use SetConsoleOutputCP on windows

Viacheslav Usov
In reply to this post by Marcus Mason
On Wed, Jan 13, 2021 at 2:20 PM Marcus Mason <[hidden email]> wrote:
>
> I have code that reads input using the utf8 library. I was testing it with some string literals and noticed my output was nonsense. I then tried the same code on linux and works correctly.

I am not sure what utf8 library that is. Lua's built-in utf8 library
does not read any inputs, it supports some UTF-8 manipulations. But I
suppose that means you have some strings with UTF-8 encoded data,
which you want to print.

What should be understood here is that Lua strings are (immutable)
sequences of bytes (pretty much like in C). Lua does not know nor does
it expect that these sequences should carry UTF-8 encoded data, ASCII
encoded data, or any other kind of "encoded text". It is binary data
to Lua and it does not transform the binary data in any way.

This is in contrast to Python (as one example), where strings are
Unicode text. Note I said "Unicode", not UTF-8 because its internal
representation is generally not UTF-8.

Lua expects the user to know what the user's strings contain. If the
user wants to print strings, then it assumes they are in whatever
encoding the user's underlying C + OS stack expects. On most Unix-like
systems, that would normally be UTF-8 these days. On most Windows
systems, normally NOT these days.
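
To make the point about expected encodings concrete, here is a small Python sketch. It takes one fixed byte sequence and interprets it under three encodings. Code pages 1252 and 437 are picked only as examples of legacy Windows code pages; which one a given console actually uses depends on the system locale.

```python
# One fixed byte sequence: the UTF-8 encoding of U+00FC.
data = "\u00fc".encode("utf-8")

# The bytes never change; only the interpretation does.
as_utf8 = data.decode("utf-8")      # one character, as intended
as_cp1252 = data.decode("cp1252")   # two unrelated Latin-1 range characters
as_cp437 = data.decode("cp437")     # two box-drawing characters

for label, text in (("utf-8", as_utf8), ("cp1252", as_cp1252),
                    ("cp437", as_cp437)):
    print(label, ["U+%04X" % ord(ch) for ch in text])

# Like Lua's # operator, len() on the raw bytes counts bytes, not characters.
print("bytes:", len(data), "characters:", len(as_utf8))
```

The last two interpretations are what mojibake looks like: a program that passes bytes through untouched, as Lua does, produces correct output only when the receiving side assumes the same encoding as the sender.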

Again, in the Python example, it knows that the user strings are
Unicode data (not UTF-8), so then it uses a way appropriate for the
given OS to print its Unicode strings as Unicode. On most Unix-like
systems, that would be by encoding the data as UTF-8. On Windows, that
would be by encoding the data as UTF-16.

So you cannot compare Lua to Python in this respect, nor can you say
that Lua works "correctly" on Linux and not so on Windows. It is
_your_ Lua code that is correct on Linux (assuming UTF-8 would work)
and it is also _your_ code that is not correct on Windows (under the
same assumption).

Cheers,
V.

Re: Use SetConsoleOutputCP on windows

Marcus Mason
Well, I suppose I can write a C library for outputting UTF-8 encoded Lua strings. I hadn't considered the case where people use `print` on raw bytes rather than text. When I said "reading input" I simply meant that I have a function that reads code points from an input string, and I print some substrings of interest. The specific case that provoked this was actually a string literal in a test source file.

Re: Use SetConsoleOutputCP on windows

bel
Windows internally uses either a code-page-dependent multibyte encoding (there are many locale-specific code pages, but not UTF-8) or UTF-16.
Working with UTF-16 also avoids the need to set a particular code page.

When you have a UTF-8 string on Windows, you can convert it to UTF-16 (wchar_t is 16 bits on Windows) using
MultiByteToWideChar(CP_UTF8, 0, utf8_buffer, -1, wchar_t_buffer, (int)wchar_t_buffer_length);

Windows cannot handle UTF-8 in any other way, but many APIs handle wide characters correctly.
To use this consistently, e.g. for file names, you also have to bear in mind that some standard functions, e.g. io.open, do not use the wide-character versions. For printing only, you might use the wprintf family with wide characters.
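
As a portable illustration of what that conversion produces, the Python sketch below turns UTF-8 bytes into a list of 16-bit UTF-16 code units. The helper name utf8_to_utf16_units is hypothetical, and the sketch does not call MultiByteToWideChar itself; it only mirrors the result such a conversion would be expected to give.

```python
import struct


def utf8_to_utf16_units(utf8_bytes):
    """Return the UTF-16 code units for a UTF-8 byte string.

    Hypothetical helper for illustration. Invalid UTF-8 input raises
    UnicodeDecodeError. No terminating zero unit is appended.
    """
    text = utf8_bytes.decode("utf-8")
    raw = text.encode("utf-16-le")  # little-endian, no byte order mark
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


# Two characters from the Basic Multilingual Plane: one unit each.
bmp_units = utf8_to_utf16_units(b"\xe7\x95\xaa\xe5\x8f\xb7")

# One character outside the BMP (U+1F600): a surrogate pair, two units.
astral_units = utf8_to_utf16_units(b"\xf0\x9f\x98\x80")

print(["0x%04X" % u for u in bmp_units])
print(["0x%04X" % u for u in astral_units])
```

The second case shows why the wide buffer cannot be sized by counting characters. With the real API, passing -1 as the input length also makes the returned count include the terminating null.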
