
Mastering Character Encodings: From ANSI to UTF‑8 and Beyond

This guide explains the essential character set encodings—ANSI, ASCII, GB2312/GBK/GB18030, Unicode planes, UTF‑16, UTF‑32, and UTF‑8—and shows how they are used in MFC and Qt, providing code examples to avoid garbled text in software.

Liangxu Linux

Introduction

Software developers frequently run into problems such as garbled Chinese text or missing Japanese characters. These problems usually stem from an incomplete understanding of character set encodings.

1. ANSI and ASCII

On Windows, "ANSI" (named after the American National Standards Institute) does not denote one fixed encoding. It refers to the system's legacy code page, which differs by region: for example, code page 1252 for Western European locales and code page 936 (GBK) for Simplified Chinese.

An ANSI code page is a variable‑length multibyte character set (MBCS): each character may take one, two, or more bytes.

It covers both single‑byte (SBCS) and double‑byte (DBCS) character sets.

The double‑byte part of the Simplified Chinese code page follows the EUC/EUC‑CN layout, so the lead byte always comes first in the byte stream and there is no little‑endian variant.

Each regional code page has its own encoding rules, so the same bytes decode to different characters under different code pages.

1.1 ASCII

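ASCII (American Standard Code for Information Interchange) is a 7‑bit code that defines 128 characters: English letters, digits, punctuation, and control codes. Accented Western European letters are not part of ASCII and require extensions such as ISO/IEC 8859‑1. First published in 1963, substantially revised in 1967, and last updated in 1986, ASCII corresponds to ISO/IEC 646 and remains the most universal information‑exchange standard.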

2. GB2312, GBK, GB18030

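GB2312 is the Simplified Chinese encoding. It is compatible with ASCII: ASCII characters keep their single‑byte values, and each Chinese character occupies two bytes. GBK extends GB2312 with many more characters, including Traditional Chinese characters, while keeping the two‑byte form. GB18030 further extends GBK with four‑byte sequences, which lets it cover minority scripts as well as Japanese and Korean characters, and it remains backward compatible with GBK.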
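The one‑byte/two‑byte structure described above can be sketched as a small scanner. This is a minimal illustration, assuming the commonly documented GBK byte ranges (lead byte 0x81 to 0xFE, trail byte 0x40 to 0xFE except 0x7F). The function names are hypothetical, and GB18030 four‑byte sequences are not handled.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if `b` can start a two-byte GBK character.
bool isGbkLeadByte(unsigned char b) {
    return b >= 0x81 && b <= 0xFE;
}

// Hypothetical helper: true if `b` can be the second byte of a GBK character.
bool isGbkTrailByte(unsigned char b) {
    return b >= 0x40 && b <= 0xFE && b != 0x7F;
}

// Counts characters in a GBK byte string. An ASCII byte counts as one
// character and a valid lead/trail pair counts as one character.
// Returns -1 if the input is malformed or truncated.
long countGbkChars(const std::string& bytes) {
    long count = 0;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (b < 0x80) {
            ++count;  // single-byte ASCII
        } else if (isGbkLeadByte(b) && i + 1 < bytes.size() &&
                   isGbkTrailByte(static_cast<unsigned char>(bytes[i + 1]))) {
            ++count;  // two-byte character
            ++i;      // skip the trail byte
        } else {
            return -1;  // not valid two-byte GBK
        }
    }
    return count;
}
```

For example, the five bytes 41 D6 D0 CE C4 ("A" followed by two Chinese characters) are counted as three characters, which is why byte length and character count must not be confused in multibyte encodings.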

3. Unicode

Unicode provides a universal code space for the characters of all writing systems, divided into 17 planes (0‑16) with a total of 1,114,112 code points (U+0000 to U+10FFFF). Not every code point is assigned to a character.

Each plane contains 65,536 code points.

Plane 0 is the Basic Multilingual Plane (BMP).

Higher planes (1‑16) contain supplementary characters.

Some code points are currently unassigned.
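Because each plane holds 65,536 (2^16) code points, the plane of a code point is simply its value shifted right by 16 bits. The sketch below illustrates that arithmetic; the function and constant names are hypothetical and chosen only for this example.

```cpp
#include <cstdint>

// Highest code point in the Unicode code space.
constexpr std::uint32_t kMaxCodePoint = 0x10FFFF;

// 17 planes of 65,536 code points each give the total quoted above.
static_assert(17u * 65536u == 1114112u, "17 planes x 65,536 code points");

// True if `cp` lies inside the Unicode code space.
constexpr bool isValidCodePoint(std::uint32_t cp) {
    return cp <= kMaxCodePoint;
}

// Plane index (0 to 16) of a code point, or -1 if it is out of range.
constexpr int planeOf(std::uint32_t cp) {
    return isValidCodePoint(cp) ? static_cast<int>(cp >> 16) : -1;
}

// True if `cp` is in the Basic Multilingual Plane (plane 0).
constexpr bool isBmp(std::uint32_t cp) {
    return planeOf(cp) == 0;
}
```

With these helpers, U+4E2D falls in plane 0 (the BMP), U+1F600 in plane 1, and U+10FFFF in plane 16.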

3.1 UTF‑16

UTF‑16 originates from UCS‑2 and encodes each character in either two bytes (BMP characters) or four bytes (a surrogate pair, for supplementary‑plane characters). It can be stored big‑endian (UTF‑16BE) or little‑endian (UTF‑16LE), with or without a Byte Order Mark (BOM). Example: in UTF‑16BE without a BOM, the string "ABC" becomes the byte sequence 00 41 00 42 00 43; in UTF‑16LE it becomes 41 00 42 00 43 00.

When software only supports UCS‑2, it can handle only BMP characters and cannot represent supplementary plane characters.
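The surrogate‑pair rule can be sketched as follows: subtract 0x10000 from the code point, then split the remaining 20 bits into a high surrogate (0xD800 plus the top 10 bits) and a low surrogate (0xDC00 plus the bottom 10 bits). The function names are hypothetical, and returning an empty vector for invalid input is a choice made for this example only.

```cpp
#include <cstdint>
#include <vector>

// Encodes one code point as UTF-16 code units.
// Returns an empty vector for surrogate values or values above U+10FFFF.
std::vector<std::uint16_t> encodeUtf16(std::uint32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return {};
    }
    if (cp < 0x10000) {
        // BMP character: one 16-bit code unit, the same value as in UCS-2.
        return { static_cast<std::uint16_t>(cp) };
    }
    // Supplementary character: split the 20-bit offset into two halves.
    std::uint32_t v = cp - 0x10000;
    std::uint16_t high = static_cast<std::uint16_t>(0xD800 + (v >> 10));
    std::uint16_t low = static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF));
    return { high, low };
}

// Serializes code units as UTF-16BE bytes, without a BOM.
std::vector<std::uint8_t> toUtf16Be(const std::vector<std::uint16_t>& units) {
    std::vector<std::uint8_t> out;
    for (std::uint16_t u : units) {
        out.push_back(static_cast<std::uint8_t>(u >> 8));    // high byte first
        out.push_back(static_cast<std::uint8_t>(u & 0xFF));  // then low byte
    }
    return out;
}
```

Under this sketch, U+1F600 encodes to the two code units D83D DE00. Software limited to UCS‑2 would treat these as two separate 16‑bit values rather than as one character.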

Figure: UTF‑16 encoding examples

3.2 UTF‑32

UTF‑32 uses a fixed 32‑bit (4‑byte) unit for each Unicode code point, making it simple but space‑inefficient. It also supports both big‑endian and little‑endian byte orders.
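Since every code point occupies one fixed 32‑bit unit, serialization only has to decide the byte order. The following sketch, with hypothetical function names, writes one code point as UTF‑32BE or UTF‑32LE bytes and leaves out BOM handling.

```cpp
#include <array>
#include <cstdint>

// Serializes one code point as four big-endian bytes (UTF-32BE).
std::array<std::uint8_t, 4> toUtf32Be(std::uint32_t cp) {
    return {{ static_cast<std::uint8_t>((cp >> 24) & 0xFF),
              static_cast<std::uint8_t>((cp >> 16) & 0xFF),
              static_cast<std::uint8_t>((cp >> 8) & 0xFF),
              static_cast<std::uint8_t>(cp & 0xFF) }};
}

// Serializes one code point as four little-endian bytes (UTF-32LE).
std::array<std::uint8_t, 4> toUtf32Le(std::uint32_t cp) {
    return {{ static_cast<std::uint8_t>(cp & 0xFF),
              static_cast<std::uint8_t>((cp >> 8) & 0xFF),
              static_cast<std::uint8_t>((cp >> 16) & 0xFF),
              static_cast<std::uint8_t>((cp >> 24) & 0xFF) }};
}
```

The letter "A" (U+0041) becomes 00 00 00 41 in UTF‑32BE, so three of its four bytes are zero. This is the space cost mentioned above.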

3.3 UTF‑8

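UTF‑8 encodes Unicode characters in 1‑4 bytes (the original design allowed up to 6 bytes, but RFC 3629 restricts it to 4 so that it matches the U+0000 to U+10FFFF range) and is backward compatible with ASCII. It is the most widely used encoding on the web.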
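A minimal encoder makes the variable‑length scheme concrete. It is a sketch that follows the 1‑4 byte form of RFC 3629. The function name is hypothetical, and returning an empty string for invalid input is a choice made for this example only.

```cpp
#include <cstdint>
#include <string>

// Encodes one code point as UTF-8 bytes.
// Returns an empty string for surrogate values or values above U+10FFFF.
std::string encodeUtf8(std::uint32_t cp) {
    std::string out;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return out;
    }
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx (identical to ASCII)
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this sketch, U+0041 stays the single byte 41, U+4E2D becomes E4 B8 AD, and U+1F600 becomes F0 9F 98 80.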

Figure: UTF‑8 encoding diagram

4. Character Sets in Specific Frameworks

4.1 MFC

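When an MFC project is built with the multibyte character set option, strings use the system ANSI code page, which is GBK (code page 936) on Simplified Chinese Windows. When the Unicode character set option is selected, strings use UTF‑16.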

4.2 Qt

QString stores text as UTF‑16. In Qt 5 and later, constructing a QString from a narrow string literal decodes the bytes as UTF‑8, so the source file itself must be saved as UTF‑8. GBK‑encoded data has to be converted explicitly through a codec.

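#include <QString>
#include <QTextCodec>  // Qt 5; part of the Core5Compat module in Qt 6

QTextCodec *codec = QTextCodec::codecForName("UTF-8");
QTextCodec::setCodecForLocale(codec);  // codec used by fromLocal8Bit()
// Literal means "the text on the right is a UTF-8 encoded string".
QString str = "右边是UTF-8编码的字符串";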

For GBK:

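QTextCodec *codec = QTextCodec::codecForName("GBK");
QTextCodec::setCodecForLocale(codec);  // codec used by fromLocal8Bit()
// The source file must be saved as GBK. A plain literal would be decoded as
// UTF-8, so convert through the locale codec explicitly.
// Literal means "the text on the right is a GBK encoded string".
QString str = QString::fromLocal8Bit("右边是GBK编码的字符串");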

Direct conversion methods:

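// fromLocal8Bit() decodes with the locale codec (GBK on Simplified Chinese Windows).
QString str1 = QString::fromLocal8Bit("GBK编码字符串");  // "GBK encoded string"
// fromUtf8() always decodes as UTF-8, independent of the locale.
QString str2 = QString::fromUtf8("UTF-8编码字符串");  // "UTF-8 encoded string"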