OpenBSD Journal

Heads up! OpenBSD now supports multi-byte characters!

Contributed by maxime from the charset-for-IRC-junkies dept.

On July 27th, Stefan Sperling (stsp@) added support for multi-byte characters to the OpenBSD libc. Thanks to the work of everyone involved in its development, the OpenBSD C library now supports UTF-8, the Unicode character encoding scheme. Read on for the full commit message and some words from Stefan about what needs to be tested and how to do so:

From: Stefan Sperling 
To: source-changes@cvs.openbsd.org
Subject: CVS: cvs.openbsd.org: src
Date: Tue, 27 Jul 2010 10:59:04 -0600 (MDT)

CVSROOT:	/cvs
Module name:	src
Changes by:	stsp@cvs.openbsd.org	2010/07/27 10:59:04

Modified files:
	distrib/special/libstubs: Makefile 
	lib/libc       : Makefile.inc 
	lib/libc/citrus: citrus_ctype.h citrus_ctype_local.h 
	lib/libc/locale: Makefile.inc runetable.c setrunelocale.c 
	share/locale/ctype: Makefile 
Added files:
	distrib/special/libstubs: mbrtowc_sb.c 
	lib/libc/citrus: Makefile.inc citrus_ctype.c citrus_none.c 
	                 citrus_none.h citrus_utf8.c citrus_utf8.h 
	lib/libc/locale: btowc.c mblen.c mbrlen.c mbstowcs.c mbtowc.c 
	                 multibyte.h multibyte_citrus.c wcscoll.c 
	                 wcstombs.c wcsxfrm.c wctob.c wctomb.c 
Removed files:
	lib/libc/locale: mbrtowc_sb.c multibyte_sb.c 

Log message:
Replace the single-byte placeholders for the multi-byte/wide-character
conversion interfaces of libc (mbrtowc(3) and friends) with new
implementations that internally call an API based on NetBSD's citrus.
This allows us to support locales with multi-byte character encodings.

Provide two implementations of the citrus-based API: one based on the old
single-byte placeholders for use with our existing single-byte character
locales (C, ISO8859-*, KOI8, CP1251, etc.), and one that provides support
for UTF-8 encoded characters (code based on FreeBSD's implementation).

Install the en_US.UTF-8 ctype locale support file, and allow the UTF-8
ctype locale to be enabled via setlocale(3) (export LC_CTYPE='en_US.UTF-8').

A lot of programs, especially from ports, will now start using UTF-8 if the
UTF-8 locale is enabled. Use at your own risk, and please report any breakage.
Note that ncurses-based programs cannot display UTF-8 right now, this is being
worked on.

To prevent install media growth, add vfprintf(3) and mbrtowc(3) to libstubs.
The mbrtowc stub was copied unchanged from its old single-byte placeholder.
vfprintf.c doesn't need to be copied, just put in .PATH (hint by fgsch@).

Testing by myself, naddy, sthen, nicm, espie, armani, Dmitrij D. Czarkoff.

ok matthieu espie millert sthen nicm deraadt

The next day, Christian Weisgerber (naddy@) enabled multibyte support in shells/bash:

From: Christian Weisgerber 
To: ports-changes@cvs.openbsd.org
Subject: CVS: cvs.openbsd.org: ports
Date: Wed, 28 Jul 2010 14:25:11 -0600 (MDT)

CVSROOT:	/cvs
Module name:	ports
Changes by:	naddy@cvs.openbsd.org	2010/07/28 14:25:11

Modified files:
	shells/bash    : Makefile 

Log message:
Enable multibyte support.  Makes regression tests happier.

People might want to read the mbrtowc(3) and vfprintf(3) man pages. As always, users are invited to test and to report any bugs. Stefan also provided undeadly with some notes on the testing this particular change requires:

My commit only provided foundations for UTF-8 support.
It makes a lot of things work, but there are still many pieces in the
system which need to be tweaked in order to make proper use of the UTF-8
support in libc.

It is unlikely that much of the higher-layer stuff will be enabled for 4.8.
We're already at ABI lock. But shipping 4.8 with the fundamentals built-in
means that the fundamentals can easily be tested by a lot of people to spot
fallout. It makes more sense to deal with the higher layers during the 4.9
cycle, because it is a lot of work.

Obviously, to use UTF-8 a terminal capable of displaying UTF-8 is needed.
Right now, and maybe forever, the only option is an X11 terminal emulator
like xterm(1), or the various Gnome/KDE/XFCE terminal emulators.

A suitable font that contains a lot of Unicode characters is also required.
The default 'fixed' font of xterm(1) includes quite a lot of characters.
I'm personally using a DejaVu font instead, which is included in xenocara.

In ~/.Xdefaults:
  XTerm*Font: -Misc-Fixed-Medium-R-Normal--18-120-100-100-C-90-ISO10646-1

Here is a screenshot of xterm(1) using this font to display Markus Kuhn's
UTF-8 demo test file.
The screenshot also shows that tmux(1) is ready for UTF-8.

It would be very hard to make the text console support UTF-8 display.
That is certainly out-of-scope for me. I won't touch that.
It's definitely not a good idea to run the entire system with
LC_CTYPE=en_US.UTF-8 in the environment.
It's a bad idea to set LC_CTYPE=en_US.UTF-8 in ~/.profile, because
it could cause gibberish being displayed on the text console.

I run my entire X session with LC_CTYPE=en_US.UTF-8, like this:

$ cat ~/.xsession
env LC_CTYPE="en_US.UTF-8" /usr/local/bin/startxfce4

I'd recommend doing it this way when helping with testing.
The most important thing to look out for is stuff that used to work
with single-byte character sets like ISO8859-1 but does not work with UTF-8.

In environments with high stability requirements, the UTF-8 locale
should not be used at all. 

The UTF-8 locale can also be used with specific applications only,
by starting the applications from uxterm(1) and using a non-UTF8 locale
for the rest of the xsession.

Hint: With mutt from ports, the -slang flavour is required for UTF-8
to work because of the current limitations in ncurses.

Also note that the only UTF-8 locale we currently have is en_US.UTF-8.
There is no de_DE.UTF-8, fr_FR.UTF-8, etc. This might be inconvenient
for people relying on localisation of program messages.

So, this is a work in progress, with only the very first step completed. Marc Espie (espie@) tells us more about what still needs to be done:

Independently of other libraries, there are also lots of wide-char functions
AND locale stuff which we don't yet have, such as wprintf or strcoll support.

Until we have these, a lot of software will simply not pick up any locale
support during the configure steps.

cursesw is the most visible "next step", but by no means is it the only
one...

Comments
  1. By Alexander (sasha) sashabsd@bsd.rurltd.ru

    This is an expected and necessary step in the development of OpenBSD. Thanks to Stefan Sperling and the other developers!
