Master Character Encodings: From ANSI to Unicode and Qt Implementation

This guide explains common character encodings such as ANSI, ASCII, GB2312/GBK/GB18030, Unicode, UTF‑8, UTF‑16, and UTF‑32, and shows how to handle them in MFC and Qt with practical code examples.

Liangxu Linux

Jun 13, 2023

Why Encoding Matters

Developers who lack a systematic understanding of character encodings often run into garbled Chinese text or missing Japanese characters.

Common Encodings Overview

Encodings you will meet in practice include GB2312, GBK, BIG5, UTF‑8, and UTF‑16, along with the umbrella terms ANSI and Unicode. The sections below cover the most common ones.

1. ANSI

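ANSI is not a single encoding. On Windows, the term refers to the system's default code page, which depends on the regional settings: code page 936 (GBK) on Simplified Chinese systems, code page 932 (Shift‑JIS) on Japanese systems, Windows‑1252 on Western European systems, and so on. Taken together these are multibyte character sets (MBCS): some code pages are single‑byte (SBCS), while others are double‑byte (DBCS) and mix one‑byte and two‑byte characters, which makes them variable‑length. For GB2312 characters, the Simplified Chinese code page uses the same byte layout as EUC‑CN.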

1.1 ASCII

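ASCII is a 7‑bit code that defines 128 characters: the English letters, digits, punctuation, and control codes. It does not cover the accented letters of Western European languages, which need extensions such as ISO/IEC 8859‑1. The standard was first published in 1963 and substantially revised in 1967; ISO/IEC 646 is its international counterpart.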

1.2 GB2312 and Extensions

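GB2312 is a Simplified Chinese encoding that uses two bytes per Chinese character and stays compatible with ASCII, whose characters keep their one‑byte form. GBK extends GB2312 with Traditional Chinese characters and other additional CJK ideographs, still using at most two bytes per character. GB18030 extends GBK further with four‑byte sequences so that it can represent every Unicode code point, including the scripts of China's minority languages as well as Japanese and Korean characters.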
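The variable‑length layout described above can be illustrated with a small character counter. This is a minimal sketch, assuming the commonly documented GBK byte ranges (lead byte 0x81‑0xFE, trail byte 0x40‑0xFE except 0x7F); the function name countGbkChars is hypothetical, and the sketch checks only byte ranges, not whether a pair is actually assigned to a character.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts the characters in a GBK-encoded byte string.
// Returns -1 if the bytes do not follow the GBK one-/two-byte layout.
// countGbkChars is an illustrative name, not part of any library.
long countGbkChars(const std::string& bytes) {
    long count = 0;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        if (lead <= 0x7F) {
            i += 1;  // ASCII range: a single byte
        } else if (lead >= 0x81 && lead <= 0xFE) {
            if (i + 1 >= bytes.size()) {
                return -1;  // lead byte without a trail byte
            }
            const unsigned char trail = static_cast<unsigned char>(bytes[i + 1]);
            if (trail < 0x40 || trail > 0xFE || trail == 0x7F) {
                return -1;  // trail byte outside the assumed range
            }
            i += 2;  // double-byte character
        } else {
            return -1;  // 0x80 and 0xFF are treated as invalid in this sketch
        }
        ++count;
    }
    assert(i == bytes.size());  // the loop consumed exactly the input
    return count;
}
```

For example, the five bytes 41 D6 D0 CE C4 (the letter A followed by the GBK codes for 中 and 文) count as three characters, while a lone D6 byte is reported as malformed.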

2. Unicode

Unicode aims to provide a unique code point for every character worldwide. The code space is organized into 17 planes (0‑16) of 65,536 code points each, for a total of 1,114,112 possible code points (U+0000 – U+10FFFF). Not every code point is assigned to a character.
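The plane arithmetic can be checked in a few lines. The following is a small sketch; the names planeOf and isBmp are hypothetical helpers introduced only for illustration.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kMaxCodePoint = 0x10FFFF;
constexpr std::uint32_t kPlaneSize = 0x10000;  // 65,536 code points per plane
constexpr std::uint32_t kPlaneCount = 17;

// 17 planes of 65,536 code points cover exactly U+0000 to U+10FFFF.
static_assert(kPlaneCount * kPlaneSize == 1114112, "total code points");
static_assert(kPlaneCount * kPlaneSize == kMaxCodePoint + 1, "range check");

// Returns the plane number (0-16) that a code point belongs to.
// planeOf is an illustrative name, not a library function.
std::uint32_t planeOf(std::uint32_t codePoint) {
    assert(codePoint <= kMaxCodePoint);  // precondition: inside the code space
    return codePoint >> 16;              // same as dividing by 65,536
}

// True when the code point lies in plane 0, the Basic Multilingual Plane.
bool isBmp(std::uint32_t codePoint) {
    return planeOf(codePoint) == 0;
}
```

Under these definitions, U+4E2D (中) falls in plane 0, while U+1F600 falls in plane 1.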

2.1 UTF‑16

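UTF‑16 derives from UCS‑2 and uses one 16‑bit code unit (two bytes) for each character in the Basic Multilingual Plane (BMP). Characters in the supplementary planes are encoded as a surrogate pair of two code units (four bytes), which makes UTF‑16 a variable‑length encoding. The byte order can be big‑endian (UTF‑16BE) or little‑endian (UTF‑16LE); when the byte order is not declared, a byte order mark (BOM) at the start of the data identifies it.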
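The surrogate pair mechanism is easier to follow in code. This is a minimal sketch of the standard UTF‑16 encoding rule; the function name encodeUtf16 is hypothetical, and invalid input is rejected with an assertion rather than handled gracefully.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encodes one Unicode code point as one or two UTF-16 code units.
// encodeUtf16 is an illustrative name, not a library function.
std::vector<std::uint16_t> encodeUtf16(std::uint32_t codePoint) {
    assert(codePoint <= 0x10FFFF);                      // inside the code space
    assert(codePoint < 0xD800 || codePoint > 0xDFFF);   // not a surrogate itself

    if (codePoint < 0x10000) {
        // BMP character: a single code unit holds the value directly.
        return { static_cast<std::uint16_t>(codePoint) };
    }

    // Supplementary character: split the 20-bit offset into two 10-bit halves.
    const std::uint32_t offset = codePoint - 0x10000;
    const std::uint16_t high = static_cast<std::uint16_t>(0xD800 + (offset >> 10));
    const std::uint16_t low = static_cast<std::uint16_t>(0xDC00 + (offset & 0x3FF));
    return { high, low };
}
```

With this rule, U+4E2D stays a single code unit 0x4E2D, while U+1F600 becomes the pair 0xD83D 0xDE00.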

2.2 UTF‑32

UTF‑32 uses a fixed 32‑bit (four‑byte) representation for each Unicode code point, offering simplicity at the cost of higher memory usage. It also supports both big‑ and little‑endian byte orders.
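Because every code point occupies exactly four bytes, the only decision left when serializing UTF‑32 is the byte order. The sketch below shows both orders; toUtf32Be and toUtf32Le are hypothetical helper names.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Utf32Bytes = std::array<std::uint8_t, 4>;

// Serializes a code point as UTF-32BE: most significant byte first.
// toUtf32Be is an illustrative name, not a library function.
Utf32Bytes toUtf32Be(std::uint32_t codePoint) {
    assert(codePoint <= 0x10FFFF);  // precondition: inside the code space
    return {{ static_cast<std::uint8_t>(codePoint >> 24),
              static_cast<std::uint8_t>((codePoint >> 16) & 0xFF),
              static_cast<std::uint8_t>((codePoint >> 8) & 0xFF),
              static_cast<std::uint8_t>(codePoint & 0xFF) }};
}

// Serializes a code point as UTF-32LE: least significant byte first.
Utf32Bytes toUtf32Le(std::uint32_t codePoint) {
    const Utf32Bytes be = toUtf32Be(codePoint);
    return {{ be[3], be[2], be[1], be[0] }};
}
```

For U+1F600 this gives the bytes 00 01 F6 00 in big‑endian order and 00 F6 01 00 in little‑endian order.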

2.3 UTF‑8

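UTF‑8 encodes each code point in 1 to 4 bytes. (The original design allowed sequences of up to 6 bytes, but RFC 3629 restricts UTF‑8 to the Unicode range, which needs at most 4.) ASCII characters keep their single‑byte values, so UTF‑8 is backward compatible with ASCII, and it has become the dominant encoding on the web.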
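How the byte count grows with the code point value can be sketched as follows. This is a minimal illustration of the UTF‑8 bit layout, not a production encoder; the function name encodeUtf8 is hypothetical, and invalid input is rejected with an assertion.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encodes one Unicode code point as a UTF-8 byte sequence of 1 to 4 bytes.
// encodeUtf8 is an illustrative name, not a library function.
std::string encodeUtf8(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);               // inside the code space
    assert(cp < 0xD800 || cp > 0xDFFF);   // surrogates are not encodable

    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx (identical to ASCII)
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, U+0041 (A) stays the single byte 41, U+4E2D (中) becomes E4 B8 AD, and U+1F600 becomes F0 9F 98 80.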

3. Encoding in MFC

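When an MFC project is built with the multibyte character set, strings use the system's ANSI code page, which is GBK (code page 936) on Simplified Chinese Windows; when it is built with the Unicode character set, strings use UTF‑16.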

4. Encoding in Qt

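QString stores text as UTF‑16. The following snippets show how to set the locale codec with QTextCodec and create strings from bytes in different encodings. Note that QTextCodec::setCodecForLocale() affects QString::fromLocal8Bit() and toLocal8Bit(); in Qt 5, constructing a QString directly from a char* literal always decodes it as UTF‑8.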

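#include <QTextCodec>  // Qt 5; in Qt 6 this class lives in the Core5Compat module
// Make the locale codec UTF-8, so fromLocal8Bit() decodes narrow strings as UTF-8.
QTextCodec *codec = QTextCodec::codecForName("UTF-8");
QTextCodec::setCodecForLocale(codec);
// Literal means "a UTF-8 encoded string"; assumes its bytes reach the program as UTF-8.
QString str = QString::fromLocal8Bit("右边是UTF-8编码的字符串");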

For GBK encoding:

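// Make the locale codec GBK, so fromLocal8Bit() decodes narrow strings as GBK.
QTextCodec *codec = QTextCodec::codecForName("GBK");
QTextCodec::setCodecForLocale(codec);
// Literal means "a GBK encoded string"; assumes its bytes reach the program as GBK.
QString str = QString::fromLocal8Bit("右边是GBK编码的字符串");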

Direct conversion methods:

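// fromLocal8Bit() decodes with the current locale codec (GBK only if that is the locale codec).
QString str1 = QString::fromLocal8Bit("GBK编码字符串");
// fromUtf8() always decodes UTF-8, regardless of the locale codec.
QString str2 = QString::fromUtf8("UTF-8编码字符串");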
