
Why Chinese Text Gets Garbled and How to Fix It: A Deep Dive into Encoding Standards

This article explains why Chinese characters often appear as garbled text on Windows and Linux. It introduces the history and hierarchy of the Chinese encoding standards GB2312, GBK and GB18030, compares ASCII, Unicode and UTF‑8/16/32, walks through command‑line experiments, and offers guidance for handling Chinese text in C and Python programs.

21CTO

Jan 4, 2016

Example of encoding problems

In Windows Notepad, saving the word “联通” as ANSI (GB2312) and then reopening the file as UTF‑8 yields garbled characters, because the stored bytes are interpreted under the wrong encoding; selecting ANSI again restores the text. On Linux, cat produces similarly garbled output when the file's encoding does not match the terminal's locale settings.
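The mismatch can be reproduced in a few lines. The following is a minimal sketch, assuming Python 3 and assuming that Notepad's ANSI mode on a Simplified Chinese system corresponds to Python's "gbk" codec; the variable names are illustrative only.

```python
# "联通" as it would be stored under the assumed ANSI (GBK) code page.
text = "联通"
raw = text.encode("gbk")

# Decoding the same bytes with the wrong codec garbles the text.
# errors="replace" substitutes U+FFFD for bytes that are not valid UTF-8.
garbled = raw.decode("utf-8", errors="replace")

# Decoding with the encoding the bytes were written in restores the text.
restored = raw.decode("gbk")

print(" ".join(f"{b:02x}" for b in raw))
print(repr(garbled))
print(restored)
```

The bytes on disk never change; only the decoder chosen by the reader does.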

Why this article

The author records common Chinese encoding issues encountered in daily development, aiming to provide a clear, non‑technical explanation and a personal summary of key concepts.

Three levels of understanding

Concept: know the main encoding standards and solve typical problems.

Standard: master details such as ranges and conversion rules (not covered here).

Usage: understand binary storage of Chinese characters and choose appropriate encodings in programs.

Computers and binary representation

Computers store all data as binary numbers; characters are represented by numeric codes, which is the essence of “encoding”.
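As a small illustration, assuming Python 3, the built-in ord and chr functions expose the number behind a character and the character behind a number:

```python
ch = "中"
code = ord(ch)            # the numeric code assigned to the character

print(code, hex(code), bin(code))

# Going the other way: from the number back to the character.
same = chr(code)
print(same)
```

Here the number is the character's Unicode code point; how that number is laid out as bytes depends on the encoding in use.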

ASCII – the ultimate solution for English

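ASCII is a 7‑bit code: its 128 values cover the English letters, digits, punctuation and control characters, and give text exchange a common baseline. The 8‑bit "extended ASCII" sets are separate code pages that reuse the upper 128 values; they are not part of ASCII itself.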
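A short sketch, assuming Python 3, of what the 7‑bit limit means in practice. The helper is_ascii is a hypothetical name introduced here for illustration.

```python
def is_ascii(text):
    """Return True if every character has a code below 128 (7 bits)."""
    return all(ord(c) < 128 for c in text)

english = "Hello"
english_bytes = list(english.encode("ascii"))   # one byte per character
print(english_bytes)

# Chinese characters have no ASCII code, so strict encoding fails.
try:
    "中文".encode("ascii")
    chinese_fits = True
except UnicodeEncodeError:
    chinese_fits = False
print(chinese_fits)
```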

Chinese encoding history

GB2312 (1981) arranges its characters in a 94‑by‑94 grid addressed by zone and position codes; besides simplified Chinese it covers Greek, Japanese kana and Cyrillic. GBK extended GB2312 with rarely used and traditional characters while keeping the existing GB2312 codes unchanged. GB18030 further extended GBK and added four‑byte sequences, which allow it to map the full Unicode code space.
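The hierarchy can be probed with Python 3's built-in codecs. This is a sketch under the assumption that Python's "gb2312", "gbk" and "gb18030" codecs follow the respective standards closely enough for illustration; zone_position and smallest_gb_standard are hypothetical helper names, not part of any library.

```python
def zone_position(ch):
    """Return the (zone, position) pair of a character in the GB2312 grid.

    Each stored byte is the zone or position number plus 0xA0.
    """
    hi, lo = ch.encode("gb2312")
    return hi - 0xA0, lo - 0xA0

def smallest_gb_standard(ch):
    """Return the first codec in the GB2312 -> GBK -> GB18030 chain that can encode ch."""
    for name in ("gb2312", "gbk", "gb18030"):
        try:
            ch.encode(name)
            return name
        except UnicodeEncodeError:
            continue
    return None

print(zone_position("中"))
print(smallest_gb_standard("中"))            # common simplified character
print(smallest_gb_standard("國"))            # traditional form, absent from GB2312
print(smallest_gb_standard("\U00020000"))    # outside the Basic Multilingual Plane
print(len("\U00020000".encode("gb18030")))   # stored as a four-byte sequence
```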

Unicode and UTF encodings

Unicode assigns a unique code point, in the range 0 to 0x10FFFF, to every character it covers. UTF‑8 encodes a code point in 1 to 4 bytes and leaves ASCII text unchanged; UTF‑16 uses 2 or 4 bytes; UTF‑32 always uses 4 bytes. A Byte Order Mark (BOM) at the start of a UTF‑16 or UTF‑32 stream identifies its endianness.
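To make the variable-length scheme concrete, here is a hand-written UTF‑8 encoder checked against Python 3's built-in codec. It is a teaching sketch, not a replacement for the standard library: utf8_encode is a hypothetical helper, and it does not reject surrogate code points the way a strict encoder would.

```python
import codecs

def utf8_encode(cp):
    """Encode one Unicode code point as UTF-8 bytes (simplified sketch)."""
    if cp < 0x80:                       # 1 byte: 0xxxxxxx (plain ASCII)
        return bytes([cp])
    if cp < 0x800:                      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:                  # 4 bytes: 11110xxx followed by three 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of Unicode range")

for ch in ("A", "é", "中", "\U00020000"):
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")

# The same character in the three UTF forms (little-endian, no BOM).
sizes = {enc: len("中".encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
print(sizes)

# The byte order marks a reader would look for at the start of a stream.
print(codecs.BOM_UTF16_LE.hex(), codecs.BOM_UTF16_BE.hex())
```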

Practical experiments

Using od and iconv on a Red Hat 4 system, the author shows the byte patterns of the word “中文” in UTF‑8, UTF‑16LE (with BOM), UTF‑32LE, GB2312, GBK and GB18030.
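For readers without od and iconv at hand, a rough equivalent can be sketched in Python 3. This is not the author's original experiment: hexdump is a hypothetical helper, and Python's "utf-16-le" and "utf-32-le" codecs do not write a BOM, so the output for those rows shows the character bytes only.

```python
def hexdump(text, encoding):
    """Return the bytes of text in the given encoding as space-separated hex."""
    return " ".join(f"{b:02x}" for b in text.encode(encoding))

for enc in ("utf-8", "utf-16-le", "utf-32-le", "gb2312", "gbk", "gb18030"):
    print(f"{enc:10} {hexdump('中文', enc)}")
```

For these two common characters the three GB encodings produce identical bytes, which reflects that each standard extends the previous one.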

C language Chinese handling

Internal vs. external encoding: internal representation in memory vs. external file/stream encoding.

Two approaches: keep them identical (no conversion) or convert between them.

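On Linux, the GNU C library provides wchar_t (a 32‑bit type holding UTF‑32 code points) and functions such as wcslen, mbsrtowcs and wcsrtombs. Call setlocale(LC_ALL, "") so that the conversion functions pick up the locale from the environment, which must match the external encoding.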
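The distinction between counting bytes and counting characters is the same one that separates strlen from wcslen in C. The sketch below illustrates it in Python 3 rather than C, as an analogy only: decode plays the role of mbsrtowcs, encode the role of wcsrtombs, and UTF‑8 is assumed as the external encoding.

```python
# External form: the bytes as they would sit in a file or stream.
external = "中文abc".encode("utf-8")

# Internal form: one element per character, comparable to a wchar_t array.
internal = external.decode("utf-8")

byte_count = len(external)    # what a byte-oriented length function sees
char_count = len(internal)    # what a character-oriented length function sees
print(byte_count, char_count)

# Cutting the external form at an arbitrary byte can split a character.
try:
    external[:4].decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False
print(split_ok)

# Converting back to the external form reproduces the original bytes.
round_tripped = internal.encode("utf-8")
```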

Python Chinese handling

Python’s built‑in Unicode support makes it natural to read/write using UTF‑8 externally while keeping Unicode objects internally.
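A minimal sketch of that pattern, assuming Python 3 (where str is the Unicode type) and a temporary file created only for the demonstration; roundtrip is a hypothetical helper name.

```python
import os
import tempfile

def roundtrip(text, encoding="utf-8"):
    """Write text to a temporary file, then return (raw bytes, decoded text)."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        # External encoding is named explicitly when writing...
        with open(path, "w", encoding=encoding) as f:
            f.write(text)
        # ...the file itself holds only bytes...
        with open(path, "rb") as f:
            raw = f.read()
        # ...and the same encoding is named again when reading back.
        with open(path, "r", encoding=encoding) as f:
            decoded = f.read()
    finally:
        os.remove(path)
    return raw, decoded

raw, decoded = roundtrip("中文")
print(raw, decoded)
```

Naming the encoding on every open call avoids depending on the platform default, which differs between systems.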

Encoding selection advice

If only English, use ASCII for both internal and external encoding.

If primarily Chinese and storage size matters, choose GB2312 or GBK and implement necessary string functions.

If portability and simplicity are priorities, use UTF‑8 externally and UTF‑8 or UTF‑32 internally.
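The storage trade-off behind this advice can be measured directly. The following sketch assumes Python 3; encoded_sizes is a hypothetical helper and the sample strings are arbitrary.

```python
def encoded_sizes(text, encodings=("gbk", "utf-8", "utf-32-le")):
    """Return the number of bytes text occupies in each encoding."""
    return {enc: len(text.encode(enc)) for enc in encodings}

chinese = encoded_sizes("中文编码")   # four common Chinese characters
english = encoded_sizes("abcd")       # four ASCII characters

print(chinese)
print(english)
```

For common Chinese characters GBK needs two bytes each against three for UTF‑8, while for ASCII text the two are the same size; UTF‑32 spends four bytes per character in both cases in exchange for fixed-width indexing.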
