Why Does Text Become Garbled? A Deep Dive into UTF‑8, GBK, and Unicode
This article explains why characters appear as garbled text when encoding and decoding methods mismatch, explores how Excel defaults to GBK, shows how to convert files with iconv, and walks through the evolution from ASCII to GB2312, GBK, GB18030, and finally Unicode's UTF‑8 encoding.
When a file is opened with a different encoding than it was written, the characters become unreadable "garbled text". The root cause is a mismatch between the encoding used for reading and the one used for writing.
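The mismatch can be reproduced in a few lines of Python. This is a minimal sketch: the sample string is an arbitrary choice, and `errors="replace"` is used only so that the wrong decoding does not raise an exception on byte pairs that GBK cannot map.

```python
# Text is written (encoded) as UTF-8 ...
text = "中文编码"
raw = text.encode("utf-8")

# ... but read (decoded) as GBK: the bytes are regrouped into the wrong characters.
garbled = raw.decode("gbk", errors="replace")

# Reading with the same encoding that was used for writing restores the text.
restored = raw.decode("utf-8")

print(garbled)
print(restored)
```

The bytes on disk never change; only the interpretation does, which is why picking the right encoding when reading is enough to undo the damage.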
Excel and the GBK Default
On a Chinese‑locale Windows system, Excel typically assumes GBK when it opens a CSV file directly. A UTF‑8 encoded CSV file that carries no byte order mark is therefore parsed as GBK, producing a block of unreadable characters. The fix is simple: tell Excel to read the file as UTF‑8 when importing it, or convert the file to GBK (or its superset, GB18030) before opening it.
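The second fix, converting the file before opening it, can also be scripted. The sketch below is one possible approach, not the article's own method; `convert_encoding` and its parameter names are hypothetical, and it assumes the file is small enough to read into memory.

```python
from pathlib import Path


def convert_encoding(src, dst, src_enc="utf-8", dst_enc="gb18030"):
    """Re-encode a text file from src_enc to dst_enc (hypothetical helper)."""
    # Work on raw bytes so that line endings are preserved exactly.
    text = Path(src).read_bytes().decode(src_enc)
    # Raises UnicodeEncodeError if a character has no mapping in dst_enc.
    Path(dst).write_bytes(text.encode(dst_enc))
```

Calling `convert_encoding("test.csv", "test2.csv")` performs the same conversion as the following one-line iconv command: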
iconv -f UTF-8 -t GB18030 test.csv > test2.csv
From Binary to Characters
Computers store data as binary bits. One bit represents two states; eight bits form a byte, which can represent 256 different states. Early computers used the 7‑bit ASCII set (0‑127) to map these states to English letters, digits, and punctuation.
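The arithmetic above can be checked directly. This short sketch uses only Python built-ins; the variable names are illustrative.

```python
# One byte is eight bits, so it can hold 2**8 distinct values.
states_per_byte = 2 ** 8

# ASCII uses only the lower 7 bits, giving 2**7 code values (0-127).
ascii_values = 2 ** 7

# ord() returns the numeric code of a character; format() shows its bit pattern.
code_for_a = ord("A")
bits_for_a = format(code_for_a, "08b")

print(states_per_byte, ascii_values, code_for_a, bits_for_a)
# prints: 256 128 65 01000001
```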
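ASCII leaves the byte values 128‑255 unused. When Chinese users needed to store their characters, the GB2312 standard put this range to work: each Chinese character is encoded as two bytes, both above 127, while ASCII characters remain a single byte. GB2312 covers roughly 7,000 characters, 6,763 of them common Chinese characters.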
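A quick way to see the two-byte layout is to encode one ASCII letter and one common Chinese character with Python's built-in `gb2312` codec. The sample characters are arbitrary choices for illustration.

```python
ascii_bytes = "A".encode("gb2312")   # ASCII characters stay one byte
hanzi_bytes = "中".encode("gb2312")  # a common Chinese character takes two

# Both bytes of the Chinese character lie above the ASCII range (0-127).
both_high = all(b > 127 for b in hanzi_bytes)

print(ascii_bytes.hex(), hanzi_bytes.hex(), both_high)
# prints: 41 d6d0 True
```

Because every byte of a Chinese character has its high bit set, a decoder can tell ASCII bytes and two-byte sequences apart while scanning the stream.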
GB2312 soon proved insufficient, leading to the GBK extension, which covers more than 20,000 Chinese characters, including traditional forms, and later to GB18030, which adds four‑byte sequences so that it can cover the full Unicode range.
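The growing coverage of the three standards can be probed with Python's built-in codecs. This is an illustrative sketch: `can_encode` is a hypothetical helper, and the three sample characters (a simplified character, a traditional one, and an emoji) were picked only to land in different tiers.

```python
def can_encode(ch, encoding):
    """Return True if the character has a mapping in the given encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Simplified character, traditional character, emoji outside the BMP.
samples = ["中", "龍", "\U0001F600"]
encodings = ("gb2312", "gbk", "gb18030")

coverage = {ch: {enc: can_encode(ch, enc) for enc in encodings} for ch in samples}
for ch, result in coverage.items():
    print(f"U+{ord(ch):04X}", result)
```

Each later standard accepts everything the earlier one does, which is why converting to GB18030 is the safest choice when the target program expects a GB-family encoding.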
Unicode and UTF‑8
Unicode was created to assign a unique code point to every character in every writing system. UTF‑8 is a variable‑length encoding defined by RFC 3629 that represents Unicode code points using 1 to 4 bytes. Characters from the ASCII range remain a single byte, while other symbols use two, three, or four bytes; most Chinese characters take three.
Because UTF‑8 is backward compatible with ASCII and can represent any Unicode character, it has become the de‑facto standard for text interchange.
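The variable length and the ASCII compatibility are both easy to observe. In this sketch the four sample characters are arbitrary picks, one for each byte length, written as escapes where needed to avoid ambiguity.

```python
# One sample per UTF-8 length: ASCII letter, accented Latin letter,
# Chinese character, emoji outside the Basic Multilingual Plane.
samples = ["A", "\u00e9", "中", "\U0001F600"]

lengths = {}
for ch in samples:
    encoded = ch.encode("utf-8")
    lengths[ch] = len(encoded)
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")

# Pure ASCII text produces identical bytes under ASCII and UTF-8.
same_bytes = "hello".encode("ascii") == "hello".encode("utf-8")
print(same_bytes)
```

The loop reports 1, 2, 3, and 4 bytes respectively, and the final comparison prints `True`.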
Practical Example
Consider a Chinese phrase saved as UTF‑8. Opening it in a program that expects GBK will display garbled characters. Converting the file with iconv -f UTF-8 -t GBK or configuring the program to read UTF‑8 resolves the issue.
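When a program cannot know in advance which of the two encodings a file uses, one common heuristic is to try strict UTF‑8 first and fall back to a GB-family codec. The sketch below assumes only those two candidates; `decode_text` is a hypothetical helper, and the heuristic can misjudge short or unusual inputs.

```python
def decode_text(data: bytes) -> str:
    """Decode bytes as UTF-8 if valid, otherwise fall back to GB18030."""
    try:
        # Strict UTF-8 decoding rejects most GBK byte sequences.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # GB18030 is a superset of GBK and GB2312, so it reads all three.
        return data.decode("gb18030")


phrase = "中文"
from_utf8 = decode_text(phrase.encode("utf-8"))
from_gbk = decode_text(phrase.encode("gbk"))
print(from_utf8, from_gbk)
```

Trying UTF‑8 first works because its multi-byte sequences follow a strict bit pattern that text in GBK rarely satisfies by accident.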
Understanding how character sets evolved (ASCII → GB2312 → GBK → GB18030 → Unicode) helps developers choose the right encoding and avoid compatibility problems across platforms and languages.
Liangxu Linux
