Windows Console Character Encoding
Outputting characters beyond Code Page 437 to the Windows console turned out to be
challenging. I created a conceptual model of how the Windows console treats character sets
and encodings, to help in developing applications that require extended
character set support. I don't claim that this is how Windows actually works; however,
the model to the best of my knowledge accurately predicts behavior (at least on
The problem, in short, is the difficulty in predicting the effect of statements such as
char string[] = "\xc3\xab";
in a Windows console application. The intent here, presumably, is to interpret the
two-byte sequence $C3 $AB as the UTF-8 encoding of the Unicode code point
$00EB (“ë”; LATIN SMALL LETTER E WITH DIAERESIS). However,
the output typically will be something resembling “├½”, which is how
code page 437 displays the individual bytes $C3 and $AB.
Internet sources offered a number of answers which give part of the solution, but
don't work consistently and don't interact predictably. The following is the result
of my own attempt to rationalize and bring some clarification to this problem.
To cut to the chase, my recommendation for working with extended character sets:
Single-byte or Multibyte Output
Do not use setlocale(LC_CTYPE, ...); in this case chcp is
also unnecessary. Use strings encoded in any appropriate code page, and call
SetConsoleOutputCP(). Code page 65001 may be used for strings
encoded as UTF-8. Make sure the console font supports the characters you're
trying to display, and is not a raster font.
Wide Character Output
Set the console output mode to UTF-16 with _setmode(_fileno(stdout), _O_U16TEXT)
and use wide character functions for output.
This is for those who are interested in the details. The diagrams below
will be used to exemplify the discussion, by showing how bytes ‘progress
through the model’ and how they are ultimately interpreted as characters.
Model state is shown in red boxes; byte/codepoint/character data are in blue
rounded rectangles. Each diagram shows the progression of two distinct example
sequences (EB above the lines, and C3 AB below the lines).
The above-the-lines example was chosen because it is an ‘extended’
byte value that represents different characters in different character sets,
and is a valid Unicode code point but, as a single byte, invalid UTF-8. The
below-the-lines example is the byte sequence that happens to be the UTF-8
representation of that same Unicode code point.
The model consists of the following components:
- The locale character set;
- The console code page;
- The console output code page.
With these three pieces of information, it's possible to nearly perfectly
predict the characters that will appear on the console.
Locale Character Set
The locale character set is changed with the Standard C setlocale()
function with LC_CTYPE (or LC_ALL) and a locale string
such as “.1252”.
In the diagrams, the character set implied by the locale is shown in the
red box on the left. By default, a program has ‘C’ locale,
which implies that bytes are not interpreted as code points in any character set.
This is different from saying that ‘C’ locale is associated with
some ‘neutral’ character set such as ASCII or even ISO-8859-1,
as will hopefully become clear below.
The locale effectively associates characters and strings with a Windows code page
in the context of C standard library functions. Aside from the <stdio>
family of functions this should also affect <ctype> functions
such as isalnum(), but I didn't verify that. Note that it is not
possible to set a UTF locale by specifying “.65001” or
Console Code Page
The console is independently associated with a code page specified by the
chcp DOS command. Note that this is a state of the console
that extends beyond the lifetime of the process that sets it. In the diagrams,
the console code page is indicated by the ‘chcp’ boxes. If necessary,
it can be indirectly set programmatically with system("chcp ...").
The effect of the console code page is to cause mapping, or conversion, of bytes as
they are output to the console to maintain their character interpretation as closely
as possible.
In the ‘C’ locale, bytes have no character interpretation
and thus are unaffected by the console code page.
It is not correct to say that ‘C’ locale implies ASCII, because
even though many byte values (e.g., the 8-bit upper half) have no interpretation
in ASCII, they are still passed through unchanged. If it had been the case that
‘C’ locale had some or other associated character set
(rather than having none), this would have been detected by observing a translation
that depended on the console code page; however, as the diagram shows, the displayed
characters are independent of the console code page in ‘C’ locale.
Looking at the “.1252” locale, if the console code page is also 1252
then again, an identity mapping is applied.
When the console code page is different from the locale character set, in general
bytes will need to be mapped: take the example of console code page 437. Looking at
the above-the-lines example, in code page 1252 $EB is interpreted as
Unicode $00EB (“ë”), which is represented as $89
in code page 437. Below the lines, $C3 and $AB are interpreted as
$00C3 (“Ã”; LATIN CAPITAL LETTER A WITH TILDE) and
$00AB (“«”; LEFT-POINTING DOUBLE ANGLE QUOTATION MARK).
Code page 437 has no representation for the former, so that character is mapped to
$41 (“A”) instead; the latter character is represented as $AE.
Code page 65001 (UTF-8) is interesting, because translation may cause a
multibyte conversion. Code page 1252 $EB is Unicode code point
$00EB, which is represented as $C3 $AB in UTF-8.
Similarly, the two-byte sequence in the below-the-lines example is converted
to the four-byte UTF-8 sequence $C3 $83 $C2 $AB.
The treatment of the “.1251” locale (Cyrillic) is not fundamentally different, but it
illustrates what happens when the console code page cannot represent characters
in the locale character set.
For example, in code page 1251 $EB represents Unicode code point
$043B (“л”; CYRILLIC SMALL LETTER EL), which does
not exist in code pages 437 or 1252. In these cases, this character is simply
mapped to a question mark $3F. Similarly, $C3 (Unicode $0413, “Г”;
CYRILLIC CAPITAL LETTER GHE) has no available representation in those code pages
and is mapped to a question mark.
However, $AB is Unicode $00AB
(“«”; LEFT-POINTING DOUBLE ANGLE QUOTATION MARK)
which is also $AB in code page 1252 and $AE in code page 437.
For the UTF-8 code page 65001, the byte sequences are mapped into their respective
UTF-8 representations; for example, $EB becomes the two-byte sequence
Console Output Code Page
This code page governs which character set the console uses to interpret the
mapped bytes as code points, and ultimately the glyphs that are
displayed. It is set programmatically by a call to
SetConsoleOutputCP() and is also a state of the console
that persists after the process setting it has terminated.
Why it makes sense to display code points in a different character set
than the one they were just transformed into is beyond me. In general, therefore,
I can't think of any case where the console output code page should not be the
same as the console code page. The console output code page can only be set
when the console font is not a raster font, however; in a raster font console
the output code page is always implicitly 437, no matter what.
Note that the conceivably most useful setting of 65001 for UTF-8 appears to
be buggy. If a locale is set and the UTF-8 console code page is specified,
the proper UTF-8 byte sequences are generated internally, as far as I can
detect. However, on display, the rest of the line is clipped
after one Unicode character is shown. I tried hard to come up with an explanation
for this that did not amount to assuming it must be a bug, but could not.
printf and wprintf
I only tested narrow-character strings, but they appear to work almost equally
well with the narrow- and wide-character formatted output functions:
char string[] = "\xc3\xab";
printf("$c3 $ab prints as: \"%s\"\n", string);
wprintf(L"$c3 $ab prints as: \"%hs\"\n", string);
Note the nonstandard format specification for selecting narrow-character
strings within a wide-character format; this is, however properly
Stream orientation (as set by fwide()) is documented as unsupported
and appears to have no effect.
Posted on 2013/07/22