Windows Console Applications and Character Encoding

Results

This describes my observations, and the best of my understanding, of what happens with text processing in Windows console applications. It builds on a much earlier (but incomplete) analysis I did under Windows XP; this analysis relates to the current version of Windows 11. The work was done with the help of a utility program (the Encoding Explorer) that let me quickly try different combinations of modes and environmental settings and observe the results.

The discussion looks only at output from a native application to the console and intentionally ignores the parallel discussion about input. There are a couple of reasons why I chose to ignore the input side (not least avoiding the doubling of work), but focusing on one side of the equation at least greatly simplifies the discussion and makes it less confusing. I'm operating with the assumption that any observations and statements that can be made about the output side of I/O should apply (mutatis mutandis) to input as well.

Sample Data

I used the following input data for my investigations. The wide character code points were specifically chosen so that their interpretation agrees with the corresponding narrow character interpretation under Code Page 437.

ordinal	display	narrow	wide	Unicode name
0	A	41	0041	LATIN CAPITAL LETTER A
1	╬	CE	256C	BOX DRAWINGS DOUBLE VERTICAL AND HORIZONTAL
2	ú	A3	00FA	LATIN SMALL LETTER U WITH ACUTE
3	δ	EB	03B4	GREEK SMALL LETTER DELTA
4	î	8C	00EE	LATIN SMALL LETTER I WITH CIRCUMFLEX
5		0A	000A	LINE FEED (LF)

Note that these characters are intentionally not uniformly representable in other character sets, such as ASCII, ISO-8859-1 or -15, or Code Page 1252.

Methods

These represent the different APIs and variations that can be used in a native console application to output text:

Windows API
‘POSIX-style’
Standard C I/O streams
- unformatted
- narrow-character formatted
- wide-character formatted
Standard C++ I/O streams
- unformatted
- narrow-character formatted
- wide-character formatted

I further distinguish between

terminal output (observed in the console)
file-redirected output (console command redirecting output to a file)
file output (through an explicitly opened file)

Windows API

There is no fundamental difference between WriteConsoleA() and WriteFile(); it is specifically not the case, say, that WriteConsole() does any kind of special text-mode processing such as CR/LF conversion. The differences are mostly utilitarian in that the Console API as a whole is specifically aware of the fact that the output device is a console and provides corresponding control over it (such as color settings); and that there are the two WriteConsoleA() and WriteConsoleW() versions which may possibly be of use when writing code that needs to be able to deal with both these worlds. WriteConsole() will not work if standard output has been redirected to a file, in which case the handle returned by GetStdHandle() is not a console but a file handle. WriteConsoleA() and WriteFile() appear to both produce byte-for-byte what was presented on input; and the console itself displays these as characters as per the Console Output Code Page.

WriteConsoleW() assumes, as usual, that the wide-character input is UTF-16. The console displays it as Unicode text regardless of the Console Output Code Page setting.
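As a minimal sketch of how a program might pick between the two narrow calls: GetConsoleMode() only succeeds on a real console handle, so it can be used to detect redirection before deciding whether WriteConsoleA() is usable. The function name write_bytes_to_stdout is hypothetical, and the non-Windows branch exists only so the sketch compiles elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#ifdef _WIN32
#include <windows.h>
#endif

// Writes raw bytes to standard output, using the console API only when
// standard output really is a console. (Hypothetical helper name.)
bool write_bytes_to_stdout(const char* bytes, std::size_t count) {
    assert(bytes != nullptr);
#ifdef _WIN32
    const HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    if (out == nullptr || out == INVALID_HANDLE_VALUE) return false;
    DWORD mode = 0;
    DWORD written = 0;
    if (GetConsoleMode(out, &mode)) {
        // Console handle: the bytes are displayed per the Console Output Code Page.
        return WriteConsoleA(out, bytes, static_cast<DWORD>(count), &written, nullptr) != 0;
    }
    // Redirected handle: WriteConsoleA() would fail, WriteFile() still works.
    return WriteFile(out, bytes, static_cast<DWORD>(count), &written, nullptr) != 0;
#else
    (void)count;
    return false;  // the console API only exists on Windows
#endif
}
```

Calling write_bytes_to_stdout("A\xCE\xA3\xEB\x8C\n", 6) sends the narrow sample bytes through whichever call applies; neither path performs CR/LF translation.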

‘POSIX style’

In this case we use _open() and _write(). The different modes are selected by the oflag argument to _open() or the flag argument to _setmode() for an already-open file. These functions, while they are provided by Microsoft in the C runtime library, are not actually part of Standard C but derive originally from Unix/POSIX.

binary mode (_O_BINARY): input is passed as-is
text mode (_O_TEXT): CR/LF translation is applied
wide-character text mode (_O_WTEXT): UTF-16; CR/LF translation and BOM
Unicode mode (_O_U8TEXT): UTF-16 converted to UTF-8; CR/LF translation and BOM (see below)
wide-character Unicode mode (_O_U16TEXT): UTF-16; CR/LF translation and BOM

Specifying nonsensical Unicode + binary mode (as _O_BINARY | _O_U8TEXT) does not trigger an error and has the same effect as specifying just binary mode.

When writing directly to a file, a BOM is generated for the wide, Unicode, and wide Unicode modes (but obviously not for text or binary). When redirecting to a file, Command Prompt also generates a BOM for those three modes but PowerShell does not.

NOTE The Windows header file fcntl.h claims that _O_WTEXT should produce a BOM and that _O_U8TEXT and _O_U16TEXT should not, but I found all three of those modes to behave identically in that regard. In fact, overall I could not find any difference whatsoever in behavior between _O_WTEXT and _O_U16TEXT.

Standard C

Standard C I/O is done with FILE stream handles represented by stdout or created by fopen(). C file streams assume a 'narrow' or 'wide' orientation determined implicitly by whether fprintf() or fwprintf() is first called on them, or theoretically by calling fwide() explicitly. Note though that fwide() is documented as being unimplemented. fwrite() is used for unformatted output.

The different modes are selected by the mode argument to [fopen()](https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/fopen-wfopen) and work exactly the same as the corresponding POSIX modes.

binary: "wb"
text: "w"
wide text: "w,ccs=unicode"
unicode: "w,ccs=utf-8"
wide unicode: "w,ccs=utf-16le"

Again, specifying nonsensical Unicode + binary mode (as "wb,ccs=utf-8") does not trigger an error and has the same effect as specifying just binary mode. All four text modes perform CR/LF conversion, and all three wide-character modes insert a prefix BOM.

As documented, it's possible to change the mode of an already-open stream by extracting the POSIX file descriptor and applying _setmode():

_setmode(_fileno(stdout), _O_U8TEXT);

All this clearly suggests that the Standard C APIs are implemented on top of the POSIX style functions.

NOTE The BOM that is normally generated when writing to a file in one of the wide-character modes is not generated when the standard output streams are set to one of those modes with _setmode().

NOTE Unicode mode is not supported through narrow-character fprintf even with the "%S" wide-character [format specifier](https://learn.microsoft.com/en-us/cpp/c-runtime-library/format-specification-syntax-printf-and-wprintf-functions?view=msvc-170#type-field-characters) (fwprintf has to be used).
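A minimal sketch of the _setmode() route, assuming the Microsoft C runtime: standard output is switched to Unicode mode, written to with a wide-character function, and then restored. The function name write_wide_in_unicode_mode is hypothetical, and the non-Windows branch is only there so the sketch compiles elsewhere.

```cpp
#include <cassert>
#include <cstdio>
#include <cwchar>
#ifdef _WIN32
#include <fcntl.h>
#include <io.h>
#endif

// Temporarily puts stdout in Unicode (_O_U8TEXT) mode and writes wide text.
// (Hypothetical helper name.)
bool write_wide_in_unicode_mode(const wchar_t* text) {
    assert(text != nullptr);
#ifdef _WIN32
    std::fflush(stdout);  // flush pending output before switching modes
    const int previous = _setmode(_fileno(stdout), _O_U8TEXT);
    if (previous == -1) return false;
    // Only wide-character functions are allowed in this mode.
    const bool ok = std::fputws(text, stdout) >= 0;
    std::fflush(stdout);
    _setmode(_fileno(stdout), previous);  // restore the earlier translation mode
    return ok;
#else
    return false;  // _setmode() and _O_U8TEXT are specific to the Microsoft CRT
#endif
}
```

For the sample data this would be called as write_wide_in_unicode_mode(L"A\u256C\u00FA\u03B4\u00EE\n").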

Standard C++

Standard C++ uses I/O Streams based on std::basic_ostream. Formatted I/O is done through the usual << operators; unformatted I/O is done through .write().

An important point to make about C++ file stream I/O is that even though a wide-character file stream does exist (i.e. std::wfstream) that does accept wide-character text, the underlying C FILE of a C++ file stream (at least in Windows) is always narrow; and there is no standard or specific non-standard API to change this. This implies that any data sent through a wide-character C++ file stream is converted from the wide character set (always UTF-16) to some narrow character set. This is counter to the behavior of the C wide-character streams which dynamically assume either a narrow or wide orientation (as described above).

The narrow character set that is converted to is specified by means of a C++ locale; this can be set either globally through std::locale::global() or on an individual stream through .imbue(). Setting the global C++ locale sets the C locale as well, but the converse is not generally true: i.e., setlocale() does not affect the C++ locale except for standard output and presumably standard error and input as well. This might be due to the C++ and C streams having been synchronized through something like std::ios_base::sync_with_stdio(), but I haven't verified this.

It is important to understand that this is an actual conversion, not just a reinterpretation of the byte stream; it happens whether the output is sent to the console or to a file. If a character cannot be represented in the given narrow character set, it is generally approximated. For example, our wide character sample string is converted as follows with the given locales:

locale	displayed	converted bytes
C (default)	A	41
.437/.OCP	A╬úδî	41 ce a3 eb 8c 0d 0a
.1252/.ACP	A+údî	41 2b fa 64 ee 0d 0a
.65001/.utf8	A╬úδî	41 e2 95 ac c3 ba ce b4 c3 ae 0d 0a

Only the Code Page 437 and UTF-8 locales are thus able to represent this UTF-16 sample string correctly; the default C locale actually completely gives up after the first nonrepresentable character.
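The UTF-8 row of the table can be checked by hand. Here is a minimal sketch of the UTF-16 to UTF-8 arithmetic for code units in the Basic Multilingual Plane; the function name is hypothetical, and it only models the encoding, not what the locale machinery does internally.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Encodes UTF-16 code units from the Basic Multilingual Plane as UTF-8.
// Surrogate pairs are deliberately not handled; the sample has none.
std::string bmp_utf16_to_utf8(const std::vector<std::uint16_t>& units) {
    std::string out;
    for (const std::uint16_t u : units) {
        assert(u < 0xD800 || u > 0xDFFF);  // no surrogates in this sketch
        if (u < 0x80) {
            out += static_cast<char>(u);                          // 1 byte
        } else if (u < 0x800) {
            out += static_cast<char>(0xC0 | (u >> 6));            // 2 bytes
            out += static_cast<char>(0x80 | (u & 0x3F));
        } else {
            out += static_cast<char>(0xE0 | (u >> 12));           // 3 bytes
            out += static_cast<char>(0x80 | ((u >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (u & 0x3F));
        }
    }
    return out;
}
```

Feeding it the wide sample 0041 256C 00FA 03B4 00EE 000A yields 41 e2 95 ac c3 ba ce b4 c3 ae 0a; the 0d in the table comes from text-mode CR/LF translation, which this sketch leaves out.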

A std::locale object can be constructed with a std::codecvt facet, giving more precise and explicit control over the conversion. In particular, using a ‘non-converting’ character set conversion that converts UTF-16 into itself (see codecvt_utf16) in theory seems to allow the construction of a truly wide-character stream which maintains UTF-16; however, internally the stream is still narrow. This can be seen in any of the text modes by observing the effect of CR/LF translation, which interprets the supposed UTF-16 code point 000A as the narrow-character sequence 0A 00 and converts it to 0D 0A 00. This obviously messes up the attempted output of UTF-16, which is why the codecvt_utf16 approach can only be made to work in binary mode. This in turn requires the programmer to perform the BOM marking and CR/LF translation, at which point one starts to question the purpose of using wfstream at all. Note also that codecvt_utf16 and the rest of the <codecvt> header are deprecated as of C++17.
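The damage done by narrow CR/LF translation to UTF-16 data can be modeled in a few lines. This is only a sketch of the observed behavior (insert a CR byte before every LF byte), not the actual runtime code, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Models narrow text-mode translation: a CR byte is inserted before each LF byte,
// with no awareness of 16-bit code units.
std::vector<std::uint8_t> narrow_text_mode(const std::vector<std::uint8_t>& bytes) {
    std::vector<std::uint8_t> out;
    for (const std::uint8_t b : bytes) {
        if (b == 0x0A) out.push_back(0x0D);  // LF becomes CR LF, byte by byte
        out.push_back(b);
    }
    assert(out.size() >= bytes.size());  // translation only ever inserts bytes
    return out;
}
```

Applied to the UTF-16LE bytes 41 00 0A 00 ("A" followed by LF), this gives 41 00 0D 0A 00: an odd number of bytes that no longer decodes as UTF-16. In this model any code unit containing a 0A byte is affected, not just 000A.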

A different ‘back door’ to avoid this and obtain an almost true wide-character C++ file stream is to open the C FILE with the ccs=utf-16le mode and use the Windows STL extension that allows constructing a std::wfstream from it. So while the C++ stream itself has a narrow-character 'associated character sequence', the underlying C and POSIX streams are wide. Therefore, BOM insertion and CR/LF translation happen correctly and as expected, and no locale is needed.

See the Locale section below for a reference on valid locale strings.

Modes

I've found that there are broadly four (nominally five) fundamental modes in which character data can be interpreted by the various API methods. Some of these modes have direct representations in the corresponding API methods; others need explicit work by the programmer to achieve.

Binary Mode

This is the simplest mode in which data is not interpreted as text but as binary data (either 'narrow' bytes or 'wide' 16-bit words) and has no character set interpretation. In this mode we expect no conversions to be applied to the data. All the other modes cause data to be interpreted as characters (i.e., text).

Text Mode (Narrow-Character)

The canonical text mode sees data as (narrow) byte-size characters. In this, as in all the text modes, LF is translated to CR LF. The Standard C locale determines the character set interpretation of the narrow characters.

Wide-Character Text Mode

In this case the input is interpreted as wide-character UTF-16 and presented on output as UTF-16 as well. The locale and Console Output Code Page are ignored.

‘Unicode’ Mode

Several output methods in Windows support something that is vaguely and confusingly referred to as ‘Unicode mode’. This is like wide-character text mode in that the input is interpreted as UTF-16; internally however it is converted to UTF-8. Again, the locale and Console Output Code Page are ignored.

‘Wide Unicode’ Mode

Some of the Windows output methods distinguish between narrow and wide-character Unicode modes; in practice, I haven't found any difference between regular and ‘Unicode’ wide-character modes. In the C++ methods, regular wide-character mode refers to the standard wide-character streams, and ‘Unicode wide-character mode’ to a nonstandard method of achieving UTF-16 output.

This works with Standard C unformatted and POSIX-style output; with formatted Standard C I/O, however, only fwprintf() works, and anything else triggers a run-time assertion. This makes some sense given that the input is interpreted as wide characters (though it doesn't explain why fprintf("%S") still doesn't work), and is exactly as specified in the Microsoft documentation for _setmode(), further confirming the equivalence between Standard C and POSIX-style ‘Unicode mode’. Finally, applying _setmode() to put standard output in Unicode mode seems to show the same behavior and displays "칁ઌ", which visually corresponds with

FEFF (ZERO WIDTH NO-BREAK SPACE; BYTE ORDER MARK)
CE41 (specific Hangul syllable)
EBA3 (Private Use Area)
0A8C (GUJARATI LETTER VOCALIC L)
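The last three code points in this list are simply the six narrow sample bytes (41 CE A3 EB 8C 0A) taken two at a time as little-endian 16-bit units. A minimal sketch of that pairing follows; the function name is hypothetical, and it only models the arithmetic, not the runtime's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reinterprets a byte sequence as little-endian UTF-16 code units.
std::vector<std::uint16_t> pair_bytes_as_utf16le(const std::vector<std::uint8_t>& bytes) {
    assert(bytes.size() % 2 == 0);  // an odd trailing byte has no partner
    std::vector<std::uint16_t> units;
    for (std::size_t i = 0; i < bytes.size(); i += 2) {
        // The first byte is the low half, the second byte the high half.
        units.push_back(static_cast<std::uint16_t>(bytes[i] | (bytes[i + 1] << 8)));
    }
    return units;
}
```

For the sample bytes this produces CE41, EBA3 and 0A8C, so the trailing LF byte is absorbed into the third code unit rather than acting as a line ending.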

Now convinced that _setmode() allows us to perform Unicode mode I/O to the console, and given the observation that the Console Output Code Page appears to be overridden by Unicode mode, I was wondering whether it was the presence of the Unicode BOM itself that is recognized by the console: but that is not the case. Just outputting the same UTF-8 string of bytes in binary mode does not trigger the console to interpret UTF-8; and it also does not trigger the effect of ignoring the Output Code Page.

NOTE I actually don't understand why the BOM occurs in Unicode Mode when writing directly to a file, but not when redirected to a file in the console. The former seems to suggest the BOM is added in the POSIX layer and removed by the console; but I'm not able to confirm this removal by explicitly writing a UTF-8 BOM to the console.

‘Wide-Character Unicode Mode’

The output method APIs do nominally support something that might be called 'wide-character Unicode mode' (corresponding to _O_U16TEXT in the POSIX API); however, in all of my testing I could not find any difference between it and 'regular' wide-character text mode.

Summary

This table summarizes the available methods and modes.

Method	Binary	Text	Wide	Unicode	Wide Unicode
Windows API	`WriteFile()`	`WriteConsoleA()`	`WriteConsoleW()`	n/a	n/a
‘POSIX’ style	`_open(_O_BINARY)`	`_open(_O_TEXT)`	`_open(_O_WTEXT)`	`_open(_O_U8TEXT)`	`_open(_O_U16TEXT)`
C unformatted	`fopen("wb")` `fwrite()`	`fopen("w")` `fwrite()`	`fopen("w,ccs=unicode")` `fwrite()`	`fopen("w,ccs=utf-8")` `fwrite()`	`fopen("w,ccs=utf-16le")` `fwrite()`
C formatted	`fopen("wb")` `fprintf()`	`fopen("w")` `fprintf()`	`fopen("w,ccs=unicode")` `fwprintf()`	`fopen("w,ccs=utf-8")` `fwprintf()`	`fopen("w,ccs=utf-16le")` `fwprintf()`
C++ unformatted	`ostream(ios::binary)` `.write()`	`ostream()` `.write()`	`wostream()` `.write()`	`fopen("w,ccs=utf-8")` `wostream(FILE)` `.write()`	`fopen("w,ccs=utf-16le")` `wostream(FILE)` `codecvt_utf16` `.write()`
C++ formatted	`ostream(ios::binary)` `<<`	`ostream()` `<<`	`wostream()` `<<`	`fopen("w,ccs=utf-8")` `wostream(FILE)` `<<`	`fopen("w,ccs=utf-16le")` `wostream(FILE)` `codecvt_utf16` `<<`

Console

How the console displays characters is influenced by three properties.

Console Font

This used to be an issue: without a ‘good’ font being configured in the Command Prompt, you would not get the correct characters displayed, no matter what. In Windows 11 (at least) this does not appear to be an issue anymore, as the default font seems to have adequate character set coverage. On systems before Windows 11, check to make sure something like Lucida Console is configured.

Console Code Page

There are two different code pages that appear like they should be associated with console output:

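The console code page (set through SetConsoleCP()) doesn't appear to have any effect on output; in fact I can't find any effect it has at all. Microsoft documents it as the console's input code page, which would be consistent with that, given that I only looked at output.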
The console output code page (set through SetConsoleOutputCP()) determines the character repertoire that is available for display by the console; it doesn't affect how character data is stored (say, in case output is redirected to a file).

There is a difference in how these two are handled by Command Prompt and PowerShell and chcp, explained here and summarized in the table:

Command Prompt: chcp affects both CCP and COCP; both CCP and COCP are persistent across invocations of the program within the console session
PowerShell: chcp sets only the CCP; COCP appears to be a property of the calling process and changes made by the process do not persist across invocations but are always initially set to 437.

	Console Code Page	Console Output Code Page
Command Prompt	chcp/console	chcp/console
PowerShell	chcp/console	437/process

Here are some specific Code Pages I looked at; see Code Page Identifiers for a complete list:

437 (default): The ‘OEM character set’ used in DOS pre-Windows. Note that it includes characters that are not in ASCII and also does not include all characters from ASCII
1252: Defined by the original Windows; roughly a superset of ISO-8859-1 (and therefore ASCII)
28591: ISO-8859-1
65001: UTF-8 encoding of Unicode

Locale

The Standard C locale setting (through setlocale()) interplays with the Console Output Code Page in that it defines the character set of the output that is sent to the console. Setting the Standard C locale in general has no effect on direct or console-redirected file output (although see above for C++ I/O) but affects the presentation within the console itself with the POSIX-style method in ‘text’ mode. This affects the POSIX-style, C (formatted or unformatted), and C++ (formatted or unformatted) methods but not the Windows API. It has no effect in any other mode, presumably because input in those modes either already has a well-defined character set interpretation (wide, unicode, and wide unicode) or has no character interpretation at all (binary).

Valid locale names even include the ability to specify code pages explicitly, so this all appears very similar to the effect of the Console Output Code Page; however, it is a different mechanism as can be seen by the fact that they interact. Note that on my system, .OCP (OEM Code Page) invokes Code Page 437 and .ACP (ANSI Code Page) invokes Code Page 1252.

Examples

These show the display of the sample bytes under different combinations of locale and Console Output Code Page:

Locale	Console Output Code Page	Output
`.437`/`.OCP`/not set	437/not set	A╬úδî
`.437`/`.OCP`/not set	1252	A+údî
`.1252`/`.ACP`	1252	AÎ£ëŒ
`.1252`/`.ACP`	437/not set	AI£ëO

When the locale and Console Output Code Page are aligned, the displayed output is as expected and correct for that given Code Page. Otherwise, the characters are interpreted as existing in the locale character set but approximated with characters from the Console Output Code Page. For example:

the next-to-last character with code point EB corresponds to GREEK SMALL LETTER DELTA in Code Page 437. When the locale is set to (.437) and the Console Output Code Page is also set to 437, the character is correctly displayed as "δ". When the Console Output Code Page is set to 1252, which doesn't have that character available, it is instead approximated by and displayed as "d".
the final character with code point 8C corresponds to LATIN CAPITAL LIGATURE OE in Code Page 1252. When the locale is set to (.1252) and the Console Output Code Page is also set to 1252, the character is correctly displayed as "Œ". When the Console Output Code Page is set to 437, which doesn't have that character available, it is instead approximated by and displayed as "O".
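The two aligned rows of the table are just the same bytes looked up in two different tables. A minimal sketch, with deliberately partial tables covering only the sample bytes (the names decode, kCp437 and kCp1252 are hypothetical, and the approximation Windows applies in the mismatched rows is not modeled):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

using HighHalf = std::map<std::uint8_t, std::uint16_t>;

// Decodes bytes to Unicode code points: ASCII maps to itself, anything else
// is looked up in the given (partial) code page table.
std::vector<std::uint16_t> decode(const std::vector<std::uint8_t>& bytes,
                                  const HighHalf& high_half) {
    std::vector<std::uint16_t> code_points;
    for (const std::uint8_t b : bytes) {
        if (b < 0x80) {
            code_points.push_back(b);
            continue;
        }
        const auto entry = high_half.find(b);
        assert(entry != high_half.end());  // the tables below are deliberately partial
        code_points.push_back(entry->second);
    }
    return code_points;
}

// Only the entries needed for the sample bytes CE A3 EB 8C.
const HighHalf kCp437 = {{0xCE, 0x256C}, {0xA3, 0x00FA}, {0xEB, 0x03B4}, {0x8C, 0x00EE}};
const HighHalf kCp1252 = {{0xCE, 0x00CE}, {0xA3, 0x00A3}, {0xEB, 0x00EB}, {0x8C, 0x0152}};
```

Decoding 41 CE A3 EB 8C with kCp437 gives the code points for A╬úδî; with kCp1252 it gives those for AÎ£ëŒ.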

Note that setting the Console Output Code Page to 65001 (for UTF-8) allows the characters to be correctly displayed according to the specified locale in every case. This again serves to reinforce that the COCP does not dictate the character set encoding of the data, but the available character repertoire.

Console File Redirection

When the output of a command is redirected in the console, note first of all that the console APIs such as WriteConsole() no longer work; WriteFile() must be used. However, the character set encoding of the file depends on a number of factors.

Command Prompt

The encoding supposedly depends on whether Command Prompt was invoked as CMD /U or CMD /A, but I haven't been able to find any difference: the input is passed through to the output without any character set interpretation or any CR/LF translation.

PowerShell

The ‘redirection operator’ > is a shortcut for | Out-File. The documentation claims that its default encoding is UTF-8-no-BOM, but my observation is that UTF-16-with-BOM is the default. Either way, supposedly this can be overridden with the -Encoding argument. Also, CR/LF translation is applied and CR/LF is appended after the end of the last line if none is present. It appears that the input byte stream is always interpreted as Code Page 437; neither the Console Code Page nor the Console Output Code Page settings have any effect on this.

To illustrate the effect of PowerShell file redirection: our sample byte stream becomes

FF FE 41 00 6c 25 FA 00 B4 03 EE 00 0D 00 0A 00 0D 00 0A 00 00 00 0D 00 0A 00

corresponding to the UTF-16 code points

FEFF 0041 256C 00FA 03B4 00EE 000D 000A 000D 000A 0000 000D 000A

which are exactly the interpretations under Code Page 437. For example, CE translates to 256C (BOX DRAWINGS DOUBLE VERTICAL AND HORIZONTAL).

Unicode Mode and File Redirection

Because of the multiple places where character set translations are done, you can construct some really ludicrous situations. For example, if you open a file in POSIX Unicode Mode, as described above, internally your input is interpreted as UTF-16 CE41 EBA3 0A8C and translated into UTF-8 ec b9 81 ee ae a3 e0 aa 8c. Redirecting this in PowerShell then interprets the UTF-8 bytes as Code Page 437 and translates them into UTF-16:

EC → 221E
B9 → 2563
81 → 00FC
EE → 03B5
AE → 00AB
A3 → 00FA
E0 → 03B1
AA → 00AC
8C → 00EE

Adding a BOM and CR/LF results in the nonsensical monstrosity:

FF FE 1E 22 63 25 FC 00 B5 03 AB 00 FA 00 B1 03 AC 00 EE 00 0D 00 0A 00

which opens as "∞╣üε«úα¬î". This is controlled by the PowerShell console output encoding setting, [Console]::OutputEncoding:

[Console]::OutputEncoding = [Text.ASCIIEncoding]::ASCII (Code Page 20127; 7-bit ASCII)
[Console]::OutputEncoding = [Text.UTF8Encoding]::UTF8 (Code Page 65001; UTF-8)
[Console]::OutputEncoding = [Text.UnicodeEncoding]::Unicode (Code Page 1200; UTF-16LE)

It seems they affect the Console Output Code Page that is seen by the application; however, setting it through SetConsoleOutputCP() has no effect. So at a minimum, this could be used to check what encoding PowerShell is expecting.
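The whole chain of translations in this section can be replayed offline. The sketch below is only a model of the observed behavior under the default Code Page 437 interpretation: the function names are hypothetical and the Code Page 437 table is partial, containing just the nine bytes that occur in the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Steps 1 and 2: pair the narrow bytes into little-endian UTF-16 units and
// encode each unit as UTF-8 (BMP only; surrogates are not handled here).
Bytes unicode_mode_output(const Bytes& narrow) {
    assert(narrow.size() % 2 == 0);
    Bytes utf8;
    for (std::size_t i = 0; i < narrow.size(); i += 2) {
        const unsigned unit = narrow[i] | (narrow[i + 1] << 8);
        if (unit < 0x80) {
            utf8.push_back(static_cast<std::uint8_t>(unit));
        } else if (unit < 0x800) {
            utf8.push_back(static_cast<std::uint8_t>(0xC0 | (unit >> 6)));
            utf8.push_back(static_cast<std::uint8_t>(0x80 | (unit & 0x3F)));
        } else {
            utf8.push_back(static_cast<std::uint8_t>(0xE0 | (unit >> 12)));
            utf8.push_back(static_cast<std::uint8_t>(0x80 | ((unit >> 6) & 0x3F)));
            utf8.push_back(static_cast<std::uint8_t>(0x80 | (unit & 0x3F)));
        }
    }
    return utf8;
}

// Step 3: decode every byte as Code Page 437, then serialize as UTF-16LE
// with a leading BOM and a trailing CR LF.
Bytes powershell_redirect(const Bytes& bytes) {
    // Partial table: only the nine bytes that occur in this example.
    static const std::map<std::uint8_t, std::uint16_t> cp437 = {
        {0xEC, 0x221E}, {0xB9, 0x2563}, {0x81, 0x00FC},
        {0xEE, 0x03B5}, {0xAE, 0x00AB}, {0xA3, 0x00FA},
        {0xE0, 0x03B1}, {0xAA, 0x00AC}, {0x8C, 0x00EE}};
    std::vector<std::uint16_t> units = {0xFEFF};  // byte order mark
    for (const std::uint8_t b : bytes) {
        const auto entry = cp437.find(b);
        assert(entry != cp437.end());  // byte outside the partial table
        units.push_back(entry->second);
    }
    units.push_back(0x000D);
    units.push_back(0x000A);
    Bytes out;
    for (const std::uint16_t u : units) {
        out.push_back(static_cast<std::uint8_t>(u & 0xFF));  // low byte first
        out.push_back(static_cast<std::uint8_t>(u >> 8));
    }
    return out;
}
```

Starting from the narrow sample bytes 41 CE A3 EB 8C 0A, unicode_mode_output() gives ec b9 81 ee ae a3 e0 aa 8c, and passing that to powershell_redirect() reproduces the byte sequence shown above, from FF FE through 0D 00 0A 00.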

Resources

UTF-8 decoder: good for working in hexadecimal
UTF8Tools: lots of (too many) tools
CMD Reference