Why Chinese Text Gets Garbled and How to Fix It: A Deep Dive into Encoding Standards
This article explains why Chinese characters often appear as garbled text on Windows and Linux, introduces the history and hierarchy of Chinese encoding standards such as GB2312, GBK, GB18030 and Unicode, compares ASCII, UTF‑8/16/32, shows practical command‑line experiments, and offers guidance for handling Chinese text in C and Python programs.
Example of encoding problems
In Windows Notepad, saving the word “联通” as ANSI (on a Simplified Chinese system this means GBK, a superset of GB2312) and reopening the file as UTF‑8 produces garbled characters; selecting ANSI when opening restores the text. On Linux, viewing a file with cat also produces garbled output when the file's encoding does not match the locale settings of the terminal.
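The same mismatch can be reproduced without Notepad. The following is a minimal sketch in Python 3 (the choice of Python and the variable names are assumptions, not part of the original article): it encodes “联通” as GBK and then decodes the identical bytes twice, once with the wrong decoder and once with the right one.

```python
# Bytes that end up on disk when "联通" is saved in the legacy Chinese code page.
raw = "联通".encode("gbk")
print(" ".join(f"{b:02x}" for b in raw))   # c1 aa cd a8

# Decoding with a decoder that does not match the writer garbles the text.
# errors="replace" substitutes U+FFFD for bytes that are not valid UTF-8.
garbled = raw.decode("utf-8", errors="replace")
print(repr(garbled))

# Decoding with the encoding that was used for writing restores the text.
restored = raw.decode("gbk")
print(restored)                            # 联通
```

The bytes never change between the two reads; only the interpretation does. That is why choosing the correct encoding in the viewer is enough to fix the display.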
Why this article
The author records common Chinese encoding issues encountered in daily development, aiming to provide a clear, non‑technical explanation and a personal summary of key concepts.
Three levels of understanding
Concept: know the main encoding standards and solve typical problems.
Standard: master details such as ranges and conversion rules (not covered here).
Usage: understand binary storage of Chinese characters and choose appropriate encodings in programs.
Computers and binary representation
Computers store all data as binary numbers; characters are represented by numeric codes, which is the essence of “encoding”.
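As a small illustration of "characters are numbers", the Python 3 sketch below uses the built-in ord and chr functions, which map between a character and its Unicode code point. The dictionary name codes is chosen only for this example.

```python
# Map each character to the number that represents it (its Unicode code point).
codes = {ch: ord(ch) for ch in "A中"}

for ch, code in codes.items():
    # Show the same number in decimal, hexadecimal and binary.
    print(ch, code, hex(code), bin(code))
    # chr() is the inverse mapping: number -> character.
    assert chr(code) == ch
```

How that number is then laid out as bytes in a file is a separate decision, and it is the one the encoding standards discussed here make.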
ASCII – the ultimate solution for English
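ASCII defines 7‑bit codes (0 to 127) for English letters, digits, punctuation and control characters, providing a common baseline for text exchange. The various 8‑bit "extended ASCII" sets are separate extensions that assign the upper 128 values in different, mutually incompatible ways.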
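A short Python 3 sketch of what the 7‑bit limit means in practice follows. The helper is_ascii is a hypothetical name introduced for this example.

```python
def is_ascii(text: str) -> bool:
    """Return True if every character falls in ASCII's 7-bit range (0-127)."""
    return all(ord(ch) < 128 for ch in text)

print(is_ascii("Hello, world!"))   # True
print(is_ascii("中文"))             # False

# Pure ASCII text has identical bytes in ASCII, UTF-8 and GBK,
# which is why English text rarely shows encoding problems.
sample = "Hello"
same_bytes = sample.encode("ascii") == sample.encode("utf-8") == sample.encode("gbk")
print(same_bytes)                  # True

# Chinese characters have no ASCII representation at all.
try:
    "中文".encode("ascii")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)                   # False
```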
Chinese encoding history
GB2312 (1981) arranges its characters in a 94‑by‑94 grid addressed by zone‑position codes. It covers simplified Chinese characters as well as Latin letters, Greek, Japanese kana and Cyrillic, and stores each of them in two bytes. GBK extends GB2312, staying byte‑compatible with it, to add traditional and rarer characters. GB18030 in turn extends GBK with four‑byte sequences and adds full Unicode 3.1 coverage.
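The relationship between the three standards can be probed with Python 3's built-in codecs. This is a sketch under two assumptions: that the usual byte encoding of GB2312 is used, in which each byte is the zone or position number plus 0xA0, and that Python's gb2312 codec is acceptable as a stand-in for the standard. The function name zone_position is hypothetical.

```python
def zone_position(ch: str) -> tuple:
    """Return the GB2312 (zone, position) pair for one double-byte character."""
    high, low = ch.encode("gb2312")    # two bytes, each in the range 0xA1..0xFE
    return high - 0xA0, low - 0xA0     # undo the 0xA0 offset

print(zone_position("啊"))              # (16, 1)
print(zone_position("中"))              # (54, 48)

# The traditional character "龍" is outside GB2312 but inside GBK.
try:
    "龍".encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False
gbk_length = len("龍".encode("gbk"))
print(in_gb2312, gbk_length)           # False 2

# A character beyond the Basic Multilingual Plane (here U+20000)
# is only reachable through GB18030's four-byte form.
gb18030_length = len("\U00020000".encode("gb18030"))
print(gb18030_length)                  # 4
```

Characters that already exist in GB2312 keep the same two bytes in GBK and GB18030, so older files remain readable under the newer standards.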
Unicode and UTF encodings
Unicode assigns a unique code point in the range 0 to 0x10FFFF to each character. UTF‑8 encodes a code point in 1 to 4 bytes and is byte‑compatible with ASCII; UTF‑16 uses 2 bytes for code points up to 0xFFFF and 4 bytes (a surrogate pair) above that; UTF‑32 uses a fixed 4 bytes. A Byte Order Mark (BOM) at the start of a UTF‑16 or UTF‑32 stream identifies its endianness.
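To make the 1‑to‑4‑byte rule concrete, here is a hand-written UTF‑8 encoder in Python 3, checked against the built-in codec. It is an illustrative sketch (the name utf8_encode is hypothetical, and surrogate code points are not rejected), not a replacement for str.encode.

```python
import codecs

def utf8_encode(cp: int) -> bytes:
    """Encode one code point (0..0x10FFFF, surrogates not checked) as UTF-8."""
    if cp < 0x80:         # 1 byte:  0xxxxxxx (identical to ASCII)
        return bytes([cp])
    if cp < 0x800:        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

for ch in ("A", "é", "中", "\U0001F600"):
    manual = utf8_encode(ord(ch))
    assert manual == ch.encode("utf-8")       # agrees with the built-in codec
    print(f"U+{ord(ch):04X}",
          "utf-8:", len(manual),
          "utf-16:", len(ch.encode("utf-16-le")),
          "utf-32:", len(ch.encode("utf-32-le")))

# The UTF-16 little-endian BOM is the byte pair FF FE.
print(codecs.BOM_UTF16_LE.hex())              # fffe
```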
Practical experiments
Using od and iconv on a Red Hat 4 system, the author shows the byte patterns of the word “中文” in UTF‑8, UTF‑16LE (with BOM), UTF‑32LE, GB2312, GBK and GB18030.
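For readers without od and iconv at hand, a comparable experiment can be run in Python 3. This sketch is not the author's command sequence; it simply prints the bytes of “中文” under each encoding named above, with hexdump as a hypothetical helper.

```python
import codecs

text = "中文"

def hexdump(data: bytes) -> str:
    """Format bytes as space-separated two-digit hex values."""
    return " ".join(f"{b:02x}" for b in data)

encoded = {
    "UTF-8": text.encode("utf-8"),
    # The "-le" codecs emit no BOM, so it is prepended explicitly here.
    "UTF-16LE + BOM": codecs.BOM_UTF16_LE + text.encode("utf-16-le"),
    "UTF-32LE": text.encode("utf-32-le"),
    "GB2312": text.encode("gb2312"),
    "GBK": text.encode("gbk"),
    "GB18030": text.encode("gb18030"),
}

for name, data in encoded.items():
    print(f"{name:<15} {hexdump(data)}")
```

The three GB encodings print the same four bytes, because both characters belong to GB2312 and the later standards keep its byte values unchanged.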
C language Chinese handling
Internal vs. external encoding: internal representation in memory vs. external file/stream encoding.
Two approaches: keep them identical (no conversion) or convert between them.
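On Linux, the GNU C library provides wchar_t (a 32‑bit type that holds UTF‑32 code points) and functions such as wcslen, mbsrtowcs and wcsrtombs. The program must call setlocale(LC_ALL, "") so that these conversion functions use the locale from the environment, which in turn must match the external encoding.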
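The distinction between multibyte (external) and wide (internal) strings can be sketched in Python 3, used here only for consistency with the other examples. This is an analogy rather than C code: decode plays roughly the role of mbsrtowcs, encode that of wcsrtombs, and the two len calls correspond to what strlen and wcslen would count. UTF‑8 is assumed as the external encoding.

```python
# External form: a multibyte byte string, as it would sit in a file or pipe.
external = "中文abc".encode("utf-8")
byte_len = len(external)            # counts bytes, like strlen
print(byte_len)                     # 9

# Internal form: one element per character, like a wchar_t array.
internal = external.decode("utf-8")
char_len = len(internal)            # counts characters, like wcslen
print(char_len)                     # 5

# Converting back gives the original bytes.
assert internal.encode("utf-8") == external

# Working on the multibyte form by byte offset can split a character.
try:
    external[:4].decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False
print(split_ok)                     # False
```

This is the practical argument for converting to a wide internal form before counting, slicing or searching characters.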
Python Chinese handling
Python’s built‑in Unicode support makes it natural to read and write UTF‑8 externally while keeping Unicode strings (unicode objects in Python 2, str in Python 3) internally.
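A minimal Python 3 sketch of that pattern follows: decode at the input boundary, work on str, encode at the output boundary. The function name convert_file and the temporary file names are hypothetical, and GBK is assumed as the legacy input encoding.

```python
import os
import tempfile

def convert_file(src, dst, src_encoding, dst_encoding="utf-8"):
    """Decode src with src_encoding, then re-encode its text into dst."""
    # newline="" keeps line endings untouched during the round trip.
    with open(src, "r", encoding=src_encoding, newline="") as fin:
        text = fin.read()          # internal form: a Unicode str
    with open(dst, "w", encoding=dst_encoding, newline="") as fout:
        fout.write(text)           # external form: bytes in dst_encoding
    return text

with tempfile.TemporaryDirectory() as tmp:
    gbk_path = os.path.join(tmp, "legacy_gbk.txt")
    utf8_path = os.path.join(tmp, "converted_utf8.txt")

    # Create a small GBK-encoded input file for the demonstration.
    with open(gbk_path, "wb") as f:
        f.write("中文".encode("gbk"))

    text = convert_file(gbk_path, utf8_path, "gbk")

    with open(utf8_path, "rb") as f:
        utf8_bytes = f.read()

print(text, len(text), len(utf8_bytes))   # 中文 2 6
```

Passing encoding explicitly to open avoids depending on the platform default, which is a common source of garbled output when a script moves between Windows and Linux.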
Encoding selection advice
If the text is English only, use ASCII for both internal and external encoding.
If primarily Chinese and storage size matters, choose GB2312 or GBK and implement necessary string functions.
If portability and simplicity are priorities, use UTF‑8 externally and UTF‑8 or UTF‑32 internally.
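The storage trade-off behind this advice can be measured directly. The Python 3 sketch below compares the encoded size of a purely Chinese sample and a purely English one; the sample strings and the helper name sizes are made up for illustration.

```python
def sizes(text: str) -> dict:
    """Return the encoded size in bytes of text under several encodings."""
    return {enc: len(text.encode(enc))
            for enc in ("gbk", "utf-8", "utf-32-le")}

chinese = "汉字编码" * 1000     # 4000 Chinese characters
english = "encoding" * 1000     # 8000 ASCII characters

chinese_sizes = sizes(chinese)
english_sizes = sizes(english)

print(chinese_sizes)   # {'gbk': 8000, 'utf-8': 12000, 'utf-32-le': 16000}
print(english_sizes)   # {'gbk': 8000, 'utf-8': 8000, 'utf-32-le': 32000}
```

For text that is mostly Chinese, GBK needs two bytes per character where UTF‑8 needs three, while for ASCII text the two are the same size. UTF‑32 is the largest in both cases, which is the price of its fixed width.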
21CTO
