Windows Console Character Encoding

Outputting characters beyond code page 437 to the Windows console turned out to be challenging. I created a conceptual model of how character sets and encodings are treated in the Windows console, to help in developing applications that require extended character set support. I don't claim that this is how Windows actually works; however, to the best of my knowledge the model accurately predicts its behavior (at least on Windows XP).

The Problem

The problem, in short, is the difficulty in predicting the effect of statements such as
char string[] = "\xc3\xab";
printf("%s\n", string);
in a Windows console application. The intent here, presumably, is to interpret the two-byte sequence $C3 $AB as a UTF-8 encoding of the Unicode codepoint $00EB (“ë”; LATIN SMALL LETTER E WITH DIAERESIS). However, the output typically will be something resembling “├½”.

Internet sources offered a number of answers which give part of the solution, but don't work consistently and don't interact predictably. The following is the result of my own attempt to rationalize and bring some clarification to this problem.

Conclusion

To cut to the chase, my recommendation for working with extended character sets:

Single-byte or Multibyte Output

Do not use setlocale(LC_CTYPE, ...); in this case chcp is also unnecessary. Use strings encoded in any appropriate code page, and call SetConsoleOutputCP(). Code page 65001 may be used for strings encoded as UTF-8. Make sure the console font supports the characters you're trying to display, and is not a raster font.
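As a minimal sketch of this recommendation, assuming the string is already encoded as UTF-8: the helper name print_utf8 is hypothetical, while GetConsoleOutputCP(), SetConsoleOutputCP() and CP_UTF8 are the Win32 names. The Windows calls are guarded so that the sketch also compiles on other platforms, and the previous code page is restored because the setting outlives the process.

```c
#include <stdio.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: print a UTF-8 encoded string followed by a newline.
   On Windows the console output code page is switched to 65001 (CP_UTF8)
   for the duration of the call and restored afterwards.
   No setlocale() call is made, in line with the recommendation above.
   Returns the value returned by printf(). */
int print_utf8(const char *utf8)
{
    int written;
#ifdef _WIN32
    UINT previous = GetConsoleOutputCP();   /* 0 on failure */
    SetConsoleOutputCP(CP_UTF8);
#endif
    written = printf("%s\n", utf8);
    fflush(stdout);                         /* flush before switching back */
#ifdef _WIN32
    if (previous != 0)
        SetConsoleOutputCP(previous);
#endif
    return written;
}
```

A call such as print_utf8("\xc3\xab") is intended to display “ë”, provided the console font is not a raster font and contains the glyph.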

Wide Character Output

Set the translation mode of stdout to UTF-16 with _setmode(_fileno(stdout), _O_U16TEXT) and use wide-character functions for output.
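A minimal sketch of the wide-character route follows. The helper name print_wide is hypothetical; _setmode(), _fileno() and _O_U16TEXT are the Microsoft C runtime names, guarded here so that the sketch compiles elsewhere too. Once a stream is in this mode, narrow-character functions such as printf() should no longer be used on it.

```c
#include <stdio.h>
#include <wchar.h>
#ifdef _WIN32
#include <fcntl.h>
#include <io.h>
#endif

/* Hypothetical helper: print a wide string followed by a newline.
   On Windows, stdout is first switched to UTF-16 translation mode;
   from then on only wide-character functions may be used on stdout.
   Returns the value returned by wprintf(). */
int print_wide(const wchar_t *text)
{
#ifdef _WIN32
    _setmode(_fileno(stdout), _O_U16TEXT);
#endif
    return wprintf(L"%ls\n", text);
}
```

For example, print_wide(L"\u00eb") passes the code point for “ë” directly, without any byte encoding appearing in the source.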

The Model

This is for those who are interested in the details. The diagrams below are used to illustrate the discussion by showing how bytes progress through the model and how they are ultimately interpreted as characters. Model state is shown in red boxes; byte, code point, and character data are in blue rounded rectangles. Each diagram shows the progression of two distinct example sequences ($EB above the lines, and $C3 $AB below the lines). The above-the-lines example was chosen because $EB is an ‘extended’ byte that represents different characters in different character sets, and is a valid Unicode code point but, on its own, not valid UTF-8. The below-the-lines example is the byte sequence that happens to be the UTF-8 representation of that same Unicode code point.

The model consists of three components: the locale character set, the console code page, and the console output code page.

With these three pieces of information, it's possible to nearly perfectly predict the characters that will appear on the console.

Locale Character Set

The locale character set is changed with the Standard C setlocale() function with LC_CTYPE (or LC_ALL) and a locale specifier such as “.1252”. In the diagrams, the character set implied by the locale is shown in the red box on the left. By default, a program has the ‘C’ locale, which implies that bytes are not interpreted as code points in any character set. This is different from saying that the ‘C’ locale is associated with some ‘neutral’ character set such as ASCII or even ISO-8859-1, as will hopefully become clear below.

The locale effectively associates characters and strings with a Windows code page in the context of C standard library functions. Aside from the <stdio.h> family of functions this should also affect <ctype.h> functions such as isalnum(), but I didn't verify that. Note that it is not possible to set a UTF locale by specifying “.65001” or “.65002”.
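The locale handling described above can be sketched as follows. Both helper names are hypothetical; only setlocale() and LC_CTYPE come from the Standard C library. Whether a specifier such as “.1252” is accepted depends on the C runtime, so the result is reported rather than assumed.

```c
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: try to select a character-set locale such as
   ".1252" for LC_CTYPE only.
   Returns 1 if the C library accepted the specifier, 0 otherwise. */
int select_ctype_locale(const char *specifier)
{
    return setlocale(LC_CTYPE, specifier) != NULL;
}

/* Hypothetical helper: return the name of the current LC_CTYPE locale
   without changing it (a null locale argument only queries). */
const char *current_ctype_locale(void)
{
    return setlocale(LC_CTYPE, NULL);
}
```

At program start current_ctype_locale() reports the ‘C’ locale; a program following the model would call select_ctype_locale(".1252") and check the result before relying on any conversion.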

Console Code Page

The console is independently associated with a code page specified by the chcp DOS command. Note that this is a state of the console that extends beyond the lifetime of the process that sets it. In the diagrams, the console code page is indicated by the ‘chcp’ boxes. If necessary, it can be indirectly set programmatically with system("chcp ...").

The effect of the console code page is to cause mapping, or conversion, of bytes as they are output to the console to maintain their character interpretation as closely as possible.
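Setting the console code page indirectly from a program, as mentioned above, might look like the following sketch. Both helper names are hypothetical, and snprintf() is assumed to be available (C99). Note that chcp itself prints a confirmation line to the console.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: build the command line "chcp <codepage>" in buffer.
   Returns 1 on success, 0 if the buffer is too small. */
int format_chcp_command(char *buffer, size_t size, unsigned codepage)
{
    int n = snprintf(buffer, size, "chcp %u", codepage);
    return n > 0 && (size_t)n < size;
}

/* Hypothetical helper: set the console code page through the shell.
   Returns the result of system(), or -1 if the command could not be built.
   The new code page stays in effect after the process exits. */
int set_console_code_page(unsigned codepage)
{
    char command[32];
    if (!format_chcp_command(command, sizeof command, codepage))
        return -1;
    return system(command);
}
```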

‘C’ locale

In the ‘C’ locale, bytes have no character interpretation and thus are unaffected by the console code page. It is not correct to say that the ‘C’ locale implies ASCII, because even though many byte values (e.g., the 8-bit upper half) have no interpretation in ASCII, they are still passed through unchanged. If the ‘C’ locale had some associated character set (rather than none), this would show up as a translation that depended on the console code page; however, as the diagram shows, the displayed characters are independent of the console code page in the ‘C’ locale.

“.1252” locale

Looking at the “.1252” locale, if the console code page is also 1252 then again, an identity mapping is applied.

When the console code page is different from the locale character set, in general bytes will need to be mapped: take the example of console code page 437. Looking at the above-the-lines example, in code page 1252 $EB is interpreted as Unicode $00EB (“ë”), which is represented as $89 in code page 437. Below the lines, $C3 and $AB are interpreted as $00C3 (“Ô; LATIN CAPITAL LETTER A WITH TILDE) and $00AB (“«”; LEFT-POINTING DOUBLE ANGLE QUOTATION MARK). Code page 437 has no representation for the former, so that character is mapped to $41 (“A”) instead; the latter character is represented as $AE.
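The mapping from code page 1252 to code page 437 described above can be written out as a small sketch. All function names are hypothetical, and the tables are deliberately incomplete: they cover ASCII and the three characters from the example, not the full code pages, and they mirror the behavior described in the text rather than any Windows conversion routine.

```c
#include <stdint.h>

/* Decode a code page 1252 byte to a Unicode code point.
   Simplification: only ASCII and 0xA0..0xFF are handled, where code page
   1252 coincides with Unicode; anything else yields U+FFFD. */
uint32_t cp1252_to_unicode(uint8_t byte)
{
    if (byte < 0x80 || byte >= 0xA0)
        return byte;
    return 0xFFFD;
}

/* Encode a Unicode code point in code page 437.
   Illustrative subset: ASCII plus the three example characters;
   everything else becomes a question mark. */
uint8_t unicode_to_cp437(uint32_t codepoint)
{
    if (codepoint < 0x80)
        return (uint8_t)codepoint;
    switch (codepoint) {
    case 0x00EB: return 0x89;  /* e with diaeresis */
    case 0x00AB: return 0xAE;  /* left-pointing double angle quotation mark */
    case 0x00C3: return 0x41;  /* no A with tilde in 437: approximated by 'A' */
    default:     return 0x3F;  /* '?' */
    }
}

/* Locale character set 1252, console code page 437. */
uint8_t cp1252_to_cp437(uint8_t byte)
{
    return unicode_to_cp437(cp1252_to_unicode(byte));
}
```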

Code page 65001 (UTF-8) is interesting, because translation may cause a multibyte conversion. Code page 1252 $EB is Unicode code point $00EB, which is represented as $C3 $AB in UTF-8. Similarly, the two-byte sequence in the below-the-lines example is converted to the four-byte UTF-8 sequence $C3 $83 $C2 $AB.
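To make the multibyte conversion concrete, here is a sketch of a generic UTF-8 encoder. The function name utf8_encode is hypothetical and this is not the routine Windows uses; it only shows how a code point such as $00EB turns into $C3 $AB.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 into out (room for 4 bytes).
   Returns the number of bytes written, or 0 for an invalid code point. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {                        /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)       /* surrogates are not encodable */
        return 0;
    if (cp < 0x10000) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {                    /* 11110xxx and three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Encoding $00C3 and $00AB one after the other yields four bytes in total, which is why the two-byte below-the-lines example grows when the console code page is 65001.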

“.1251” locale

The treatment of this locale (Cyrillic) is not fundamentally different, but it illustrates what happens when the console code page cannot represent characters in the locale character set.

For example, in code page 1251 $EB represents Unicode code point $043B (“л”; CYRILLIC SMALL LETTER EL), which does not exist in code pages 437 or 1252. In these cases, the character is simply mapped to a question mark, $3F. Similarly, $C3 ($0413, “Г”; CYRILLIC CAPITAL LETTER GHE) has no available representation in those code pages and is mapped to a question mark. However, $AB is Unicode $00AB (“«”; LEFT-POINTING DOUBLE ANGLE QUOTATION MARK), which is also $AB in code page 1252 and $AE in code page 437.
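The question-mark fallback can be sketched in the same spirit. The function names are hypothetical and the tables are simplified assumptions: the code page 1251 decoder handles only ASCII, $AB and the Cyrillic block $C0..$FF, and the code page 1252 encoder handles only ASCII and $00A0..$00FF.

```c
#include <stdint.h>

/* Decode a code page 1251 byte to a Unicode code point.
   Simplified subset: ASCII, 0xAB, and 0xC0..0xFF, which hold the
   Cyrillic letters U+0410..U+044F in order; others yield U+FFFD. */
uint32_t cp1251_to_unicode(uint8_t byte)
{
    if (byte < 0x80)
        return byte;
    if (byte == 0xAB)
        return 0x00AB;
    if (byte >= 0xC0)
        return 0x0410u + (uint32_t)(byte - 0xC0);
    return 0xFFFD;
}

/* Encode a Unicode code point in code page 1252.
   Simplified subset: ASCII and U+00A0..U+00FF map to the same byte value;
   anything else has no representation here and becomes '?'. */
uint8_t unicode_to_cp1252(uint32_t codepoint)
{
    if (codepoint < 0x80 || (codepoint >= 0xA0 && codepoint <= 0xFF))
        return (uint8_t)codepoint;
    return 0x3F;
}

/* Locale character set 1251, console code page 1252. */
uint8_t cp1251_to_cp1252(uint8_t byte)
{
    return unicode_to_cp1252(cp1251_to_unicode(byte));
}
```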

For the UTF-8 code page 65001, the byte sequences are mapped into their respective UTF-8 representations; for example, $EB becomes the two-byte sequence $D0 $BB.

Console Output Code Page

This code page determines the character set in which the console interprets the mapped bytes as code points, and ultimately the glyphs that are displayed. It is set programmatically by a call to SetConsoleOutputCP() and is also a state of the console that persists after the process that set it has terminated. Why it would make sense to display code points in a character set other than the one they were just converted into is beyond me; I can't think of any case where the console output code page should differ from the console code page. The console output code page can only be set when the console font is not a raster font, however; in a raster-font console the output code page is always implicitly 437, no matter what.

Note that what is conceivably the most useful setting, 65001 for UTF-8, appears to be buggy. If a locale is set and the UTF-8 console code page is specified then, as far as I am able to detect, the proper UTF-8 byte sequences are generated internally. On display, however, the rest of the line is clipped after one Unicode character is shown. I tried hard to come up with an explanation for this that did not amount to assuming it must be a bug, but could not.

printf and wprintf

I only tested narrow-character strings, but they appear to work almost equally well with the narrow- and wide-character formatted output functions:
char string[] = "\xc3\xab";

printf("$c3 $ab prints as: \"%s\"\n", string);
wprintf(L"$c3 $ab prints as: \"%hs\"\n", string);
Note the nonstandard format specification for selecting narrow-character strings within a wide-character format; it is, however, properly documented. Stream orientation (as set by fwide()) is documented as unsupported and appears to have no effect.

Posted on 2013/07/22


All pages under this domain © Copyright 1999-2013 by: Ben Hekster