Why Chinese Text Gets Garbled and How to Fix It – A Practical Encoding Guide
This article explains why Chinese characters often appear as mojibake on Windows and Linux, introduces the history and technical details of ASCII, GB2312, GBK, GB18030 and Unicode, and provides concrete examples and command‑line tools for inspecting and converting file encodings.
Encoding Issues Example
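When you type the word "联通" in Windows Notepad, save it, and reopen it, the characters are replaced by garbled symbols like "��ͨ". Notepad saved the file in "ANSI", which on Simplified Chinese Windows means GBK (a superset of GB2312), producing the bytes C1 AA CD A8. On reopening, Notepad guessed the encoding from those bytes and wrongly chose UTF‑8, because their bit patterns resemble two‑byte UTF‑8 sequences. Selecting "ANSI" explicitly when opening the file restores the original text.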
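The same failure can be reproduced outside Notepad. The following is a minimal Python sketch, assuming that "ANSI" on a Simplified Chinese system corresponds to Python's `gbk` codec; the variable names are illustrative only.

```python
# Bytes that a GBK ("ANSI" on Simplified Chinese Windows) save produces.
raw = "联通".encode("gbk")
print(raw.hex(" "))          # c1 aa cd a8

# Misreading those bytes as UTF-8: bytes that are not valid UTF-8
# are replaced with U+FFFD, the replacement character.
wrong = raw.decode("utf-8", errors="replace")
print(wrong)

# Reading them with the encoding they were written in restores the text.
right = raw.decode("gbk")
print(right)                 # 联通
```

No data is lost in the file itself; only the interpretation of the bytes changes.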
On Linux, using cat to view a file may also produce garbled output if the file’s encoding (A) does not match the current locale’s encoding (B). When A and B are incompatible, the characters cannot be displayed correctly.
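When it is unclear which encoding a file uses, one rough approach is to try a short list of candidates in strict mode and take the first that decodes cleanly. The sketch below is only a heuristic under that assumption: the function name and the candidate list are made up for illustration, and a clean decode does not prove the guess is right.

```python
def first_clean_decoding(data, candidates=("utf-8", "gb18030")):
    """Return the first candidate that decodes data without errors, else None."""
    for name in candidates:
        try:
            data.decode(name)      # strict mode: raises on any invalid byte
        except UnicodeDecodeError:
            continue
        return name
    return None

print(first_clean_decoding("中文".encode("utf-8")))  # utf-8
print(first_clean_decoding("中文".encode("gbk")))    # gb18030
```

UTF‑8 is tried first because it rejects most byte sequences produced by other encodings, whereas the GB family accepts a much wider range of bytes.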
Why This Article Was Written
Chinese encoding involves many standards, which can be confusing. Understanding the key concepts—what each encoding is for, its differences, and how to solve common problems—allows developers to handle most daily encoding issues without memorising every standard. This article records the author’s practical notes and aims to help readers with clear, everyday language.
The discussion is divided into three levels:
Concept level: Know the purpose of each encoding standard, their differences, and how to solve typical problems.
Standard level: (Not covered) Details such as code ranges and conversion rules, useful for building conversion tools.
Usage level: Understand how Chinese characters are stored in binary and choose the right encoding in programs.
All Because the Computer Doesn’t Read Text
Computers store everything as binary numbers. Early computers focused on scientific calculations, but as they entered everyday use, text processing became essential. To represent characters, computers assign numeric codes—this is called "encoding".
ASCII: The Ultimate English Solution
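ASCII (American Standard Code for Information Interchange) defines 128 codes (values 0 to 127, 7 bits each) covering English letters, digits, punctuation, and control characters. Each code is stored in a single byte with the high bit clear. Its international counterpart is ISO/IEC 646. ASCII itself was never extended to 256 symbols; instead, later 8‑bit character sets such as ISO 8859‑1 used the values 128 to 255, with the high bit set, for additional symbols.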
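A short Python sketch of the ASCII range follows; `is_ascii` is a hypothetical helper written for this illustration.

```python
# Every ASCII character has a code below 128 and fits in one byte.
for ch in "Az09~":
    code = ord(ch)
    print(ch, code, format(code, "07b"), ch.encode("ascii"))

def is_ascii(text):
    """True if every character in text has a code point below 128."""
    return all(ord(c) < 128 for c in text)

print(is_ascii("hello"))  # True
print(is_ascii("中文"))    # False
```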
Chinese Encoding Overview
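Because ASCII uses only one byte, it cannot represent the thousands of Chinese characters. The first Chinese standard, GB2312, was published in 1980 (in force from 1981) and covers 6,763 Chinese characters plus several hundred symbols. Characters are arranged in a 94 × 94 grid and identified by a "quwei" (area‑position) code. The two stored bytes are the area and position numbers, each with A0h added, so both bytes have the high bit set and cannot be mistaken for ASCII.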
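The A0h offset can be checked directly. This is a minimal sketch using Python's `gb2312` codec; `quwei` is a hypothetical helper and assumes its argument is a single two‑byte GB2312 character.

```python
def quwei(ch):
    """Return the (area, position) pair of one character encoded in GB2312."""
    hi, lo = ch.encode("gb2312")   # two bytes, each in the range 0xA1..0xFE
    return hi - 0xA0, lo - 0xA0

print(quwei("中"))  # (54, 48)
print(quwei("文"))  # (46, 36)
```

For "中", the stored bytes D6 D0 are 54 + A0h and 48 + A0h, so its quwei code is 5448.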
Later standards expanded the range:
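GBK (1995) – a superset of GB2312 that keeps the same bytes for every GB2312 character and adds the remaining CJK ideographs defined in Unicode at the time, roughly 21,000 Chinese characters in total.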
GB18030 – a superset of GBK that includes all Unicode 3.1 characters and uses 1, 2, or 4 bytes.
Other regions use Big5 for Traditional Chinese.
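The superset relationship among the three GB standards can be observed by finding the narrowest codec that accepts a given character. The sketch below uses Python's built‑in `gb2312`, `gbk`, and `gb18030` codecs; the helper name is hypothetical.

```python
def narrowest_gb_codec(ch):
    """Return the first of gb2312, gbk, gb18030 that can encode ch."""
    for name in ("gb2312", "gbk", "gb18030"):
        try:
            ch.encode(name)
        except UnicodeEncodeError:
            continue
        return name
    return None

print(narrowest_gb_codec("中"))   # common character, already in GB2312
print(narrowest_gb_codec("镕"))   # rarer character, added by GBK
print(narrowest_gb_codec("😀"))   # outside GBK, needs GB18030
print(len("😀".encode("gb18030")))  # 4
```

Characters shared by all three standards keep the same bytes in each, which is why GBK software can read GB2312 files unchanged.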
Unicode Unifies All Scripts
Unicode assigns a unique code point to every character it covers, so a single character set can serve all scripts instead of one national standard per language. A code point is only a number; UTF‑8, UTF‑16, and UTF‑32 are the transformation formats that map Unicode code points to byte streams.
Key points about UTF encodings:
UTF‑8 uses 1–4 bytes; ASCII characters remain a single byte.
UTF‑16 uses 2‑byte units (or surrogate pairs for code points above 0xFFFF).
UTF‑32 uses a fixed 4‑byte unit for every character.
Byte Order Marks (BOM) identify the byte order of UTF‑16/UTF‑32 streams.
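The points above can be checked with a few lines of Python. This is an illustrative sketch using only the standard `codecs` module and built‑in codec names.

```python
import codecs

for ch in ("A", "中", "😀"):
    print(
        f"U+{ord(ch):04X}",
        len(ch.encode("utf-8")),      # 1, 3, 4 bytes
        len(ch.encode("utf-16-le")),  # 2, 2, 4 bytes (surrogate pair for U+1F600)
        len(ch.encode("utf-32-le")),  # always 4 bytes
    )

# The plain "utf-16" codec prepends a BOM; the -le/-be variants do not.
with_bom = "中".encode("utf-16")
print(with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))  # True
print("中".encode("utf-16-be").hex(" "))  # 4e 2d
```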
Binary Storage of Chinese Characters
Experiments on a Linux machine show the byte representation of the characters "中" and "文" in various encodings. For example, in UTF‑8 they are E4 B8 AD and E6 96 87, while in GB2312 they are D6 D0 and CE C4.
# Generate a UTF‑8 file
echo -n "中文" > foo.utf8
od -t x1 foo.utf8
# → 0000000 e4 b8 ad e6 96 87
# Convert to UTF‑16 and inspect
iconv -f utf-8 -t utf-16 foo.utf8 > foo.utf16
od -t x1 foo.utf16
# → 0000000 ff fe 2d 4e 87 65
# Convert to GB2312 and inspect
iconv -f utf-8 -t gb2312 foo.utf8 > foo.gb2312
od -t x1 foo.gb2312
# → 0000000 d6 d0 ce c4
C Language Chinese Handling
Two approaches exist:
Keep internal and external encodings identical—no conversion needed, but you must implement string functions for multibyte encodings.
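Allow different internal (e.g., wide characters via wchar_t) and external encodings, converting with mbrtowc, wcrtomb, mbsrtowcs, and wcsrtombs. Call setlocale(LC_ALL, "") at startup so these functions use the encoding of the user's environment rather than the default "C" locale.
Using wchar_t simplifies handling at the cost of higher memory usage. Its width is platform‑dependent: 4 bytes (UTF‑32) on Linux with glibc, but 2 bytes (UTF‑16) on Windows, so code that assumes one wchar_t per character is not portable.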
Python Chinese Handling
Python’s built‑in Unicode support makes it easy: read/write using UTF‑8 (or another external encoding) and keep Unicode strings internally.
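In practice this means decoding at the input boundary and encoding at the output boundary. The sketch below re‑encodes a file in the same way as the iconv commands earlier; `transcode` and the file names are hypothetical, and Python 3 is assumed.

```python
import os
import tempfile

def transcode(src_path, dst_path, src_encoding, dst_encoding="utf-8"):
    """Re-encode a text file: decode on read, encode on write."""
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()                 # a Unicode str from here on
    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)
    return text

with tempfile.TemporaryDirectory() as tmp:
    gbk_file = os.path.join(tmp, "foo.gbk")
    utf8_file = os.path.join(tmp, "foo.utf8")
    with open(gbk_file, "wb") as f:
        f.write("中文".encode("gbk"))
    text = transcode(gbk_file, utf8_file, "gbk")
    with open(utf8_file, "rb") as f:
        converted = f.read()
    print(len(text), converted.hex(" "))  # 2 e4 b8 ad e6 96 87
```

Passing `encoding=` explicitly avoids depending on the locale's default, which is the mismatch described at the start of the article.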
Encoding Choice Recommendations
If the text is pure English, use ASCII for both internal and external encoding.
If the text is mainly Chinese and storage size matters, choose GB2312 or GBK and implement only the needed string functions.
For maximum portability and simplicity, use UTF‑8 externally and either UTF‑8 or UTF‑32 internally.
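To make the storage trade‑off concrete, the following sketch measures one mixed Chinese and English string in several encodings. The sample text and the helper name are illustrative, and actual ratios depend on how much of the text is ASCII.

```python
def encoded_sizes(text, encodings=("gbk", "utf-8", "utf-16-le", "utf-32-le")):
    """Return a dict mapping each encoding name to the encoded size in bytes."""
    return {name: len(text.encode(name)) for name in encodings}

sample = "中文编码 guide"   # 4 Chinese characters + 6 ASCII characters
for name, size in encoded_sizes(sample).items():
    print(name, size)
# gbk 14
# utf-8 18
# utf-16-le 20
# utf-32-le 40
```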
Source: Taobao Search Technology Blog – http://www.searchtb.com/2012/04/chinese_encode.html
