Windows Console Applications and Character Encoding
Results
This page describes my observations and best understanding of what is happening
in regard to text processing in Windows console applications.
It builds on an earlier (incomplete) analysis I did under Windows XP;
the present analysis relates to the current version of Windows 11.
This work was done with the help of a utility program
(the Encoding Explorer)
that let me quickly try different combinations of modes and environmental settings
and observe the results.
The discussion looks only at output from a native application to the console
and intentionally ignores the parallel discussion about input. There are a couple
of reasons why I chose to ignore the input side (not least avoiding the doubling of work),
but focusing on one side of the equation at least greatly simplifies the discussion
and makes it less confusing. I'm operating with the assumption that any observations
and statements that can be made about the output side of I/O should apply (mutatis mutandis)
to input as well.
Sample Data
I used the following input data for my investigations. The wide character code points
were specifically chosen so that their interpretation agrees with the corresponding narrow
character interpretation under Code Page 437.
| ordinal | display | narrow | wide | Unicode name |
|---|---|---|---|---|
| 0 | A | 41 | 0041 | LATIN CAPITAL LETTER A |
| 1 | ╬ | CE | 256C | BOX DRAWINGS DOUBLE VERTICAL AND HORIZONTAL |
| 2 | ú | A3 | 00FA | LATIN SMALL LETTER U WITH ACUTE |
| 3 | δ | EB | 03B4 | GREEK SMALL LETTER DELTA |
| 4 | î | 8C | 00EE | LATIN SMALL LETTER I WITH CIRCUMFLEX |
| 5 |  | 0A | 000A | LINE FEED (LF) |
Note that these characters are intentionally not uniformly representable in other
character sets, such as ASCII, ISO-8859-1 or -15, or Code Page 1252.
Methods
These represent the different APIs and variations that can be used in a native
console application to output text:
- Windows API
- ‘POSIX-style’
- Standard C I/O streams
- unformatted
- narrow-character formatted
- wide-character formatted
- Standard C++ I/O streams
- unformatted
- narrow-character formatted
- wide-character formatted
I further distinguish between
- terminal output (observed in the console)
- file-redirected output (console command redirecting output to a file)
- file output (through an explicitly opened file)
Windows API
There is no fundamental difference between WriteConsoleA() and WriteFile();
it is specifically not the case, say, that WriteConsole() does any kind of special
text-mode processing such as CR/LF conversion.
The differences are mostly utilitarian in that the Console API as a whole is specifically
aware of the fact that the output device is a console and provides corresponding control
over it (such as color settings);
and that there are the two WriteConsoleA() and WriteConsoleW() versions which may
possibly be of use when writing code that needs to be able to deal with both these worlds.
WriteConsole() will not work if standard output has been redirected to a file,
in which case the handle returned by GetStdHandle() is not a console but a file handle.
WriteConsoleA() and WriteFile() both appear to produce byte-for-byte what was
presented on input; the console itself displays these bytes as characters per the Console Output Code Page.
WriteConsoleW() assumes, as usual, that the wide-character input is UTF-16, which the
console interprets as Unicode regardless of the Console Output Code Page setting.
‘POSIX style’
In this case we use _open() and _write().
The different modes are selected by the oflag argument to
_open()
or the flag argument to
_setmode()
for an already-open file.
These functions, while they are provided by Microsoft in the C runtime library, are not actually part
of Standard C but derive originally from Unix/POSIX.
- binary mode (_O_BINARY): input is passed as-is
- text mode (_O_TEXT): CR/LF translation is applied
- wide-character text mode (_O_WTEXT): UTF-16; CR/LF translation and BOM
- Unicode mode (_O_U8TEXT): UTF-16 converted to UTF-8; CR/LF translation and BOM (see below)
- wide-character Unicode mode (_O_U16TEXT): UTF-16; CR/LF translation and BOM
Specifying nonsensical Unicode + binary mode (as _O_BINARY | _O_U8TEXT) does not trigger
an error and has the same effect as specifying just binary mode.
When writing directly to a file, a BOM is generated for the wide, Unicode, and wide
Unicode modes (but, obviously, not for text or binary mode). When redirecting to a file,
Command Prompt also generates a BOM for those three modes, but PowerShell does not.
NOTE The Windows header file fcntl.h claims that _O_WTEXT should produce a BOM
and that _O_U8TEXT and _O_U16TEXT should not; I found all three of those modes to
behave identically in that regard. In fact, overall I could not find any difference whatsoever
in behavior between _O_WTEXT and _O_U16TEXT.
Standard C
Standard C I/O is done with FILE stream handles represented by stdout or
created by fopen(). C file streams assume a ‘narrow’ or ‘wide’ orientation,
determined implicitly by whether fprintf() or fwprintf() is first called
on them; or, theoretically, by calling fwide() explicitly (note, though, that
fwide() is documented as being unimplemented). fwrite() is used for unformatted output.
The different modes are selected by the mode argument to
[fopen()](https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/fopen-wfopen)
and work exactly the same as the corresponding POSIX modes.
- binary: "wb"
- text: "w"
- wide text: "w,ccs=unicode"
- unicode: "w,ccs=utf-8"
- wide unicode: "w,ccs=utf-16le"
Again, specifying nonsensical Unicode + binary mode (as "wb,ccs=utf-8") does not trigger
an error and has the same effect as specifying just binary mode. All four text modes perform
CR/LF conversion, and all three wide-character modes insert a prefix BOM.
As documented, it's possible to change the mode of an already-open stream by extracting
the POSIX file descriptor and applying _setmode():
_setmode(fileno(stdout), _O_U8TEXT)
All this clearly suggests that the Standard C APIs are implemented on top of the
POSIX-style functions.
NOTE The BOM that is normally generated when writing to a file in one of the
wide-character modes is not generated when the standard output streams are set to one
of those modes with _setmode().
NOTE Unicode mode is not supported through narrow-character fprintf even with
the "%S" wide-character
[format specifier](https://learn.microsoft.com/en-us/cpp/c-runtime-library/format-specification-syntax-printf-and-wprintf-functions?view=msvc-170#type-field-characters)
(fwprintf has to be used).
Standard C++
Standard C++ uses I/O Streams based on std::basic_ostream.
Formatted I/O is done through the usual << operators;
unformatted I/O is done through .write().
An important point to make about C++ file stream I/O is that even though a wide-character
file stream does exist (i.e. std::wfstream) that does accept wide-character text,
the underlying C FILE of a C++ file stream (at least in Windows) is always narrow;
and there is no standard or specific non-standard API to change this.
This implies that any data sent through a wide-character C++ file
stream is converted from the wide character set (always UTF-16)
to some narrow character set.
This is counter to the behavior of the C wide-character streams which dynamically
assume either a narrow or wide orientation (as described above).
The narrow character set that is converted to is specified by means of a C++ locale;
this can be set either globally through std::locale::global() or on an individual
stream through .imbue(). Setting the global C++ locale sets the C locale as well,
but the converse is not generally true: i.e., setlocale() does not affect
the C++ locale, except for standard output (and presumably standard error and input as well).
This might be due to the C++ and C streams having been synchronized through something
like std::ios_base::sync_with_stdio(), but I haven't verified this.
It is important to understand that this is an actual conversion, not just a reinterpretation
of the byte stream; it happens whether the output is sent to the console or to a file.
If a character cannot be represented in the given narrow character set, it is generally approximated.
For example, our wide character sample string is converted as follows with the given locales:
| locale | displayed | converted bytes |
|---|---|---|
| C (default) | A | 41 |
| .437/.OCP | A╬úδî | 41 ce a3 eb 8c 0d 0a |
| .1252/.ACP | A+údî | 41 2b fa 64 ee 0d 0a |
| .65001/.utf8 | A╬úδî | 41 e2 95 ac c3 ba ce b4 c3 ae 0d 0a |
Only the Code Page 437 and UTF-8 locales are thus able to represent this UTF-16 sample string
correctly; the default C locale actually completely gives up after the first
nonrepresentable character.
A std::locale object can be constructed with a std::codecvt facet,
giving more precise and explicit control over the conversion. In particular,
a ‘non-converting’ character set conversion that converts UTF-16 into itself
(see codecvt_utf16) in theory seems to allow the construction of a truly
wide-character stream which maintains UTF-16; however, internally the stream
is still narrow.
This can be seen in any of the text modes by observing the effect of CR/LF translation
which interprets the supposed UTF-16 code point 000A as narrow-character 0A 00
and converts it to 0D 0A 00. This obviously messes up the attempted output of UTF-16;
which is why the codecvt_utf16 approach can only be made to work in binary mode.
This in turn requires the programmer to perform the BOM marking and CR/LF translation,
at which point one starts to question the purpose of using wfstream at all.
Note also that the codecvt classes and the entire <codecvt> header
are deprecated as of C++17.
A different ‘back door’ to avoid this and obtain an almost true wide-character C++ file stream
is to open the C FILE with the ccs=utf-16le mode and use the Windows STL
extension that allows constructing the std::wfstream from it. So while the C++
stream itself has a narrow-character ‘associated character sequence’, the underlying C
and POSIX streams are wide. Therefore, BOM insertion and CR/LF translation happen correctly
and as expected, and no locale is needed.
See the Locales section below for a reference on valid locale strings.
Modes
I've found that there are broadly four (nominally five) fundamental modes
in which character data can be interpreted by the various API methods.
Some of these modes have direct representations in the corresponding API methods;
others need explicit work by the programmer to achieve.
Binary Mode
This is the simplest mode in which data is not interpreted as text but as binary data
(either 'narrow' bytes or 'wide' 16-bit words) and has no character set interpretation.
In this mode we expect no conversions to be applied to the data. All the other modes
cause data to be interpreted as characters (i.e., text).
Text Mode (Narrow-Character)
The canonical text mode sees data as (narrow) byte-size characters.
In this, as in all the text modes, LF is translated to CR LF. The Standard C
locale determines the character set interpretation of the narrow characters.
Wide-Character Text Mode
In this case the input is interpreted as wide-character UTF-16 and presented on output
as UTF-16 as well. The locale and Console Output Code Page are ignored.
‘Unicode’ Mode
Several output methods in Windows support something that is vaguely and confusingly
referred to as ‘Unicode mode’.
This is like wide-character text mode in that the input is interpreted as UTF-16;
internally, however, it is converted to UTF-8.
Again, the locale and Console Output Code Page are ignored.
‘Wide Unicode’ Mode
Some of the Windows output methods distinguish between narrow and wide-character
Unicode modes; in practice, I haven't found any difference between regular and
‘Unicode’ wide-character modes. In the C++ methods, regular wide-character
mode refers to the standard wide-character streams, and ‘Unicode wide-character mode’
to a nonstandard method of achieving UTF-16 output.
This works with Standard C unformatted and POSIX-style output; with formatted
Standard C I/O, however, only fwprintf() can be used, or a run-time assertion
is triggered. This makes some sense given that the input is interpreted as wide characters
(though it doesn't explain why fprintf("%S") still doesn't work), and is exactly
as specified
in the Microsoft documentation for _setmode(), further confirming the equivalence between
Standard C and POSIX-style ‘Unicode mode’. Finally, applying _setmode() to
put standard output in Unicode mode seems to show the same behavior and displays "칁ઌ",
which visually corresponds with
- FEFF (ZERO WIDTH NO-BREAK SPACE; BYTE ORDER MARK)
- CE41 (specific Hangul syllable)
- EBA3 (Private Use Area)
- 0A8C (GUJARATI LETTER VOCALIC L)
Now convinced that _setmode() allows us to perform Unicode mode I/O to the console,
and given the observation that the Console Output Code Page appears to be overridden
by Unicode mode, I was wondering whether it was the presence of the Unicode BOM itself
that is recognized by the console: but that is not the case. Just outputting the same
UTF-8 string of bytes in binary mode does not trigger the console to interpret
UTF-8; and it also does not trigger the effect of ignoring the Output Code Page.
NOTE I actually don't understand why the BOM occurs in Unicode Mode when writing
directly to a file, but not when redirected to a file in the console. The former seems
to suggest the BOM is added in the POSIX layer and removed by the console; but I'm
not able to confirm this removal by explicitly writing a UTF-8 BOM to the console.
‘Wide-Character Unicode Mode’
The output method APIs do nominally support something that might be called
'wide-character Unicode mode' (corresponding to _O_U16TEXT in the POSIX API);
however, in all of my testing I could not find any difference between it and
'regular' wide-character text mode.
Summary
This table summarizes the available methods and modes.
| Method | Binary | Text | Wide | Unicode | Wide Unicode |
|---|---|---|---|---|---|
| Windows API | WriteFile() | WriteConsoleA() | WriteConsoleW() | n/a | n/a |
| ‘POSIX’ style | _open(_O_BINARY) | _open(_O_TEXT) | _open(_O_WTEXT) | _open(_O_U8TEXT) | _open(_O_U16TEXT) |
| C unformatted | fopen("wb") fwrite() | fopen("w") fwrite() | fopen("w,ccs=unicode") fwrite() | fopen("w,ccs=utf-8") fwrite() | fopen("w,ccs=utf-16le") fwrite() |
| C formatted | fopen("wb") fprintf() | fopen("w") fprintf() | fopen("w,ccs=unicode") fwprintf() | fopen("w,ccs=utf-8") fwprintf() | fopen("w,ccs=utf-16le") fwprintf() |
| C++ unformatted | ostream(ios::binary) .write() | ostream() .write() | wostream() .write() | fopen("w,ccs=utf-8") wostream(FILE) .write() | fopen("w,ccs=utf-16le") wostream(FILE) codecvt_utf16 .write() |
| C++ formatted | ostream(ios::binary) << | ostream() << | wostream() << | fopen("w,ccs=utf-8") wostream(FILE) << | fopen("w,ccs=utf-16le") wostream(FILE) codecvt_utf16 << |
Console
How the console displays characters is influenced by three properties.
Console Font
This used to be an issue: without a ‘good’ font being configured in the Command Prompt,
you would not get the correct characters displayed, no matter what. In Windows 11
(at least) this does not appear to be an issue anymore, as the default font
seems to have adequate character set coverage. On systems before Windows 11, check
to make sure something like Lucida Console is configured.
Console Code Page
There are two different code pages that appear like they should be associated with
console output:
- The console code page (set through SetConsoleCP()) doesn't appear to have
any effect on output; in fact, I can't find any effect it has at all.
- The console output code page (set through SetConsoleOutputCP())
determines the character repertoire that is available for display by the console;
it doesn't affect how character data is stored (say, in case output is redirected
to a file).
There is a difference in how these two are handled by Command Prompt, PowerShell,
and chcp, summarized in the following table:
- Command Prompt:
chcp affects both CCP and COCP; both CCP and COCP are persistent across
invocations of the program within the console session
- PowerShell:
chcp sets only the CCP; COCP appears to be a property of the
calling process and changes made by the process do not persist across invocations
but are always initially set to 437.
|  | Console Code Page | Console Output Code Page |
|---|---|---|
| Command Prompt | chcp/console | chcp/console |
| PowerShell | chcp/console | 437/process |
Here are some specific Code Pages I looked at; see
Code Page Identifiers
for a complete list:
- 437 (default): the ‘OEM character set’ used in DOS, pre-Windows.
Note that it both includes characters not in ASCII and
does not include all characters from ASCII
- 1252: defined by the original Windows;
roughly a superset of ISO-8859-1 (and therefore ASCII)
- 28591: ISO-8859-1
- 65001: UTF-8 encoding of Unicode
Locale
The Standard C locale setting (through setlocale()) interplays with
the Console Output Code Page in that it defines the character set of the output that is
sent to the console. Setting the Standard C locale in general has no effect on direct
or console-redirected file output (although see above for C++ I/O), but it affects the
presentation within the console itself in ‘text’ mode. This applies to the POSIX-style,
C (formatted or unformatted), and C++ (formatted or unformatted) methods,
but not to the Windows API.
It has no effect in any other mode; presumably because input
in those modes either already has a well-defined character set interpretation
(wide, unicode, and wide unicode) or has no character interpretation at all (binary).
Valid locale names
even include the ability to specify code pages explicitly, so this all appears
very similar to the effect of the Console Output Code Page;
however, it is a different mechanism as can be seen by the fact that they interact.
Note that on my system, .OCP (OEM Code Page) invokes Code Page 437 and
.ACP (ANSI Code Page) invokes Code Page 1252.
Examples
These show the display of the sample bytes under different combinations of locale
and Console Output Code Page:
| Locale | Console Output Code Page | Output |
|---|---|---|
| .437/.OCP/not set | 437/not set | A╬úδî |
| .437/.OCP/not set | 1252 | A+údî |
| .1252/.ACP | 1252 | AÎ£ëŒ |
| .1252/.ACP | 437/not set | AI£ëO |
When the locale and Console Output Code Page are aligned, the displayed output is
as expected and correct for that given Code Page.
Otherwise, the characters are interpreted as existing in the locale character set
but approximated with characters from the Console Output Code Page.
For example:
- The next-to-last character, with code point EB, corresponds to GREEK SMALL LETTER DELTA
in Code Page 437. When the locale is set to .437 and the Console Output Code Page
is also set to 437, the character is correctly displayed as "δ".
When the Console Output Code Page is set to 1252, which doesn't have that character
available, it is instead approximated by and displayed as "d".
- The final character, with code point 8C, corresponds to LATIN CAPITAL LIGATURE OE
in Code Page 1252. When the locale is set to .1252 and the Console Output Code Page
is also set to 1252, the character is correctly displayed as "Œ".
When the Console Output Code Page is set to 437, which doesn't have that character
available, it is instead approximated by and displayed as "O".
Note that setting the Console Output Code Page to 65001 (for UTF-8) allows
the characters to be correctly displayed according to the specified locale
in every case. This again serves to reinforce that the COCP does not dictate
the character set encoding of the data, but the available character repertoire.
Console File Redirection
When the output of a command is redirected in the console, note first of all that the
console APIs such as WriteConsole() no longer work; WriteFile() must be used. However,
the character set encoding of the file depends on a number of factors.
Command Prompt
The encoding supposedly depends on whether Command Prompt was invoked as CMD /U or CMD /A,
but I haven't been able to find any difference:
the input is passed through to the output without any character set interpretation or any CR/LF translation.
PowerShell
The ‘redirection operator’ > is a shortcut for
| Out-File.
The documentation claims that its default encoding is UTF-8-no-BOM, but my observation is
that UTF-16-with-BOM is the default.
Either way, supposedly this can be overridden with the
-Encoding
argument.
Also, CR/LF translation is applied and CR/LF is appended after the end of the last line if none is present.
It appears that the input byte stream is always interpreted as Code Page 437;
neither the Console Code Page nor the Console Output Code Page settings have any effect on this.
To illustrate the effect of PowerShell file redirection: our sample byte stream becomes
FF FE 41 00 6C 25 FA 00 B4 03 EE 00 0D 00 0A 00 0D 00 0A 00 00 00 0D 00 0A 00
corresponding to the UTF-16 code points
FEFF 0041 256C 00FA 03B4 00EE 000D 000A 000D 000A 0000 000D 000A
which are exactly the interpretations under Code Page 437. For example, CE translates
to 256C (BOX DRAWINGS DOUBLE VERTICAL AND HORIZONTAL).
Unicode Mode and File Redirection
Because of the multiple places where character set translations are done, you can
construct some really ludicrous situations. For example, if you open a file in POSIX
Unicode Mode, as described above, your input is internally interpreted as UTF-16
CE41 EBA3 0A8C and translated into UTF-8 ec b9 81 ee ae a3 e0 aa 8c. Redirecting this
in PowerShell then interprets the UTF-8 bytes as Code Page 437 and translates them
into UTF-16:
- EC → 221E
- B9 → 2563
- 81 → 00FC
- EE → 03B5
- AE → 00AB
- A3 → 00FA
- E0 → 03B1
- AA → 00AC
- 8C → 00EE
Adding a BOM and CR/LF results in the nonsensical monstrosity:
FF FE 1E 22 63 25 FC 00 B5 03 AB 00 FA 00 B1 03 AC 00 EE 00 0D 00 0A 00
which opens as "칁ઌ".
This is controlled by the PowerShell [Console]::OutputEncoding property:
- [Console]::OutputEncoding = [Text.ASCIIEncoding]::ASCII (Code Page 20127; 7-bit ASCII)
- [Console]::OutputEncoding = [Text.UTF8Encoding]::UTF8 (Code Page 65001; UTF-8)
- [Console]::OutputEncoding = [Text.UnicodeEncoding]::Unicode (Code Page 1200; UTF-16LE)
It seems these affect the Console Output Code Page that is seen by the application;
however, setting it through SetConsoleOutputCP() has no effect.
So, at a minimum, this could be used to check what encoding PowerShell is expecting.
Resources
All pages under this domain © Copyright 1999-2023 by:
Ben Hekster