Re: [Iup-users] (UTF-8 / UTF-16 / Unicode) Request

On 2019-12-02 06:58, Antonio Scuri wrote:

>    Yes, the problem is not the conversion itself.
>    IUP already use Unicode strings when setting and retrieving native
> element strings. The problem is we can NOT change the IUP API without
> breaking compatibility with several thousands of lines of code of existing
> applications. To add a new API in parallel is something we don't have time
> to do. So we are still going to use char* in our API for some time.  UTF-16
> would imply in using wchar* in everything.
>    What we can do now is to provide some mechanism for the application to be
> able to use the returned string in IupFileDlg in another control and in
> fopen. Probably a different attribute will be necessary. Some applications
> I know are simply using UTF8MODE_FILE=NO, so they can use it in fopen, and
> before displaying it in another control, converting that string to UTF-8.
> Best,
> Scuri


GNU/Linux user here, using Lua and IM/CD/IUP across multiple distros,
mainly Debian-based (Ubuntu, Linux Mint), and hoping to expand the
supported-distro list over time.

Remember that there is a very wide gulf between:

         - Unicode (which defines abstract code points, their meaning, and
           ways to handle localisation such as left-to-right versus
           right-to-left rendering); and

        - Code point representation (UTF-8 is the clear leader here,
          and Lua itself (not the IUP* tools) now bundles a "utf8"
          library), UTF-7, UTF-16, UCS-4, etc.  Lua's utf8 library
          makes no attempt to move from the (relatively
          straightforward) job of representing code points through to
          the much, much more demanding job of interpreting and
          rendering the code point stream as a Unicode entity.

The choice of UTF-8 by POSIX-compliant OSes (for all interfaces that
are defined by POSIX), and Lua's decision to bundle the utf8 library
itself in version 5.3, both show a strong preference for UTF-8; see the
Lua reference manual section:

         "UTF-8 Support"

An early essay that lays out the UTF-8/Unicode landscape is the 2003
article on the blog "Joel on Software":

         "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"

However, the present IM/CD/IUP discussion has moved beyond POSIX
boundaries: from code point representation up to graphical-user-interface
interpretation and rendering.

For fixed-width multibyte encodings, issues such as big-endian versus
little-endian byte order can arise -- hence the Byte Order Mark at the
start of character streams (the code point U+FEFF, serialised as
0xFE 0xFF in big-endian UTF-16 and 0xFF 0xFE in little-endian UTF-16).
These become cumbersome when manipulating strings (e.g. appending two
UTF-16 strings of different endianness).  UTF-8 avoids this entirely:
it is a byte-oriented encoding with no endianness.

Unfortunately, this is where my knowledge stops.  I know that UTF-8
should work on all POSIX-defined interfaces, and that the Lua utf8
library is now defined and bundled, staying strictly at the code-point
level, away from the explosion in complexity that comes with a fully
conforming Unicode implementation.

My best guess, at this point: given that quite a large number of
programs now work on both Windows and *NIX platforms, study how they
handle this problem rather than reinvent the wheel.  For example, Tcl
uses UTF-8 encoding, and its graphical partner, Tk, takes UTF-8 strings
for the definition of Tk widgets and converts them as required.  From
the "I18n" page of the Tcl/Tk documentation (Version 8.1):

         "Fonts, Encodings, and Tk Widgets"

         Tk widgets that display text now require text strings in Unicode/UTF-8
         encoding. Tk automatically handles any encoding conversion necessary
         to display the characters in a particular font.

         If the master font that you set for a widget doesn't contain a glyph
         for a particular Unicode character that you want to display, Tk
         attempts to locate a font that does. Where possible, Tk attempts to
         locate a font that matches as many characteristics of the widget's
         master font as possible (for example, weight, slant, etc.). Once Tk
         finds a suitable font, it displays the character in that font. In other
         words, the widget uses the master font for all characters it is capable
         of displaying, and alternative fonts only as needed.

         In some cases, Tk is unable to identify a suitable font, in which case
         the widget cannot display the characters. (Instead, the widget displays
         a system-dependent fallback character such as "?") The process of
         identifying suitable fonts is complex, and Tk's algorithms don't always
         find a font even if one is actually installed on the system. Therefore,
         for best results, you should try to select as a widget's master font
         one that is capable of handling the characters you expect to display.
         For example, "Times" is likely to be a poor choice if you know that you
         need to display Japanese or Arabic characters in a widget.

         If you work with text in a variety of character sets, you may need to
         search out fonts to represent them. Markus Kuhn has developed a free
         6x13 font that supports essentially all the Unicode characters that can
         be displayed in a 6x13 glyph. This does not include Japanese, Chinese,
         and other Asian languages, but it does cover many others. The font is
         available at


         His site also contains many useful links to other sources of fonts and
         font information.

Hope this helps,

s-b etc etc