Understanding Chinese Character Encodings: ASCII, GB2312, GBK, GB18030, UTF‑8, ANSI & Latin‑1
This article explains the purpose and mechanics of common Chinese character encodings—including ASCII, GB2312, GBK, GB18030, UTF‑8, ANSI and Latin‑1—detailing how they map characters to numbers, handle byte boundaries, and maintain compatibility to avoid garbled text.
What does character encoding do?
In a computer, all text is represented as a string of 0s and 1s. To display Chinese characters correctly, two tasks are required:
Assign each character a unique numeric identifier (the character set).
Encode that number into bits while indicating how many consecutive bytes belong to a single character.
The second task is not a simple binary representation; it must also solve the problem of separating characters when multiple bytes are involved.
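The two tasks can be seen separately in a minimal Python sketch. It uses Python's built-in `ord` and `str.encode`; note that `ord` returns the Unicode code point, so it illustrates the first task for Unicode only, while the two `encode` calls show the second task under two different encodings.

```python
# Task 1: the character set assigns the character a number (its code point).
ch = "鹅"
code_point = ord(ch)
print(hex(code_point))   # 0x9e45

# Task 2: an encoding turns the character into a concrete byte sequence.
utf8_bytes = ch.encode("utf-8")
gbk_bytes = ch.encode("gbk")
print(utf8_bytes.hex())  # e9b985
print(gbk_bytes.hex())   # b6ec
```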
Relationship among common Chinese encodings
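GB2312, GBK, and GB18030 form a chain of compatible supersets: each one encodes every character of its predecessor with the same bytes. This keeps a byte sequence unambiguous within the family, so a reader never has to guess whether two bytes are two separate characters or one double‑byte character.
ASCII is a subset of every encoding discussed here, so ASCII text reads the same in all of them. UTF‑8 and GBK agree only on that ASCII range, which is why GBK data decoded as UTF‑8 (or the reverse) often turns into garbled text.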
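A short Python sketch of that mismatch, assuming the standard `gbk` and `utf-8` codecs: ASCII bytes are identical under both, while the GBK bytes of a Chinese character are not valid UTF‑8 and either raise an error or decode to replacement characters.

```python
text = "鹅"
gbk_bytes = text.encode("gbk")

# ASCII text produces the same bytes under both encodings.
ascii_same = "abc".encode("gbk") == "abc".encode("utf-8")

# Strict decoding of GBK bytes as UTF-8 fails for this character.
try:
    gbk_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lenient decoding substitutes U+FFFD, which is what users see as garbled text.
garbled = gbk_bytes.decode("utf-8", errors="replace")
print(ascii_same, strict_ok, garbled != text)
```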
ASCII Encoding
Each ASCII character occupies one byte (8 bits) with the highest bit set to 0, allowing 128 symbols. Extended ASCII uses the high bit to represent an additional 128 symbols, but standard ASCII remains the first 128.
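The high-bit rule is easy to check in code. The helper below is a hypothetical name introduced for illustration; it only tests whether bit 0x80 is clear.

```python
def is_standard_ascii(byte: int) -> bool:
    """Return True when the high bit (0x80) is clear, i.e. the value is 0..127."""
    return (byte & 0x80) == 0

ascii_bytes = "Hi!".encode("ascii")
gbk_bytes = "鹅".encode("gbk")  # B6 EC: both bytes have the high bit set

all_ascii = all(is_standard_ascii(b) for b in ascii_bytes)
any_ascii_in_gbk = any(is_standard_ascii(b) for b in gbk_bytes)
print(all_ascii, any_ascii_in_gbk)
```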
GB2312, GBK, GB18030
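These three encodings share a common lineage, and each is a superset of the one before it. GB2312 is the earliest Chinese code set: it keeps ASCII as single bytes and uses 2 bytes for each of its 6,763 Chinese characters and 682 symbols. GBK expands GB2312 to roughly 21,000 Chinese characters (including traditional Chinese), still using 1 byte for ASCII and 2 bytes for everything else. GB18030 further expands the repertoire to over 70,000 characters by adding 4‑byte sequences for characters that cannot fit in 2 bytes.
None of the three is fixed‑length. They solve the byte‑boundary problem through the value of the leading byte: a byte below 0x80 is a single ASCII character, while a byte from 0x81 to 0xFE starts a multi‑byte character (GB2312 itself only uses lead bytes from 0xA1 upward). In GB18030, a second byte in the range 0x30 to 0x39 marks a 4‑byte sequence; otherwise the character is 2 bytes long.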
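In practice, a decoder for this family finds character boundaries from the byte values themselves: a byte below 0x80 stands alone as ASCII, and a byte with the high bit set starts a longer sequence. The sketch below splits GBK bytes on that rule. `split_gbk` is a hypothetical helper written for illustration; it does not validate trail bytes and does not handle GB18030's 4‑byte forms.

```python
from typing import List

def split_gbk(data: bytes) -> List[bytes]:
    """Split GBK-encoded bytes into per-character chunks of 1 or 2 bytes."""
    chunks = []
    i = 0
    while i < len(data):
        # High bit clear: single-byte ASCII. High bit set: lead byte of a pair.
        width = 1 if data[i] < 0x80 else 2
        chunks.append(data[i:i + width])
        i += width
    return chunks

chunks = split_gbk("a鹅b".encode("gbk"))
print([c.hex() for c in chunks])  # ['61', 'b6ec', '62']

# GB18030 uses 1, 2, or 4 bytes depending on the character.
widths = [len(c.encode("gb18030")) for c in ("A", "鹅", "\U00020000")]
print(widths)  # [1, 2, 4]
```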
UTF‑8 Encoding (Unicode Transformation Format)
UTF‑8 can represent every character that Unicode defines, because it encodes the Unicode code point assigned to each symbol. The number of leading 1 bits in the first byte gives the length of the sequence (110 for two bytes, 1110 for three, 11110 for four); a leading 0 marks a single byte, identical to ASCII. Bytes that are not the start of a character always begin with "10".
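The lead-byte rule can be sketched as a small function. `utf8_sequence_length` is a hypothetical helper name; it only inspects the bit pattern of the first byte and does not validate the bytes that follow.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes a UTF-8 character occupies, judged by its first byte."""
    if first_byte & 0x80 == 0x00:   # 0xxxxxxx: single byte, same as ASCII
        return 1
    if first_byte & 0xE0 == 0xC0:   # 110xxxxx: two-byte sequence
        return 2
    if first_byte & 0xF0 == 0xE0:   # 1110xxxx: three-byte sequence
        return 3
    if first_byte & 0xF8 == 0xF0:   # 11110xxx: four-byte sequence
        return 4
    # 10xxxxxx is a continuation byte, not the start of a character.
    raise ValueError("not a valid UTF-8 lead byte")

lengths = [utf8_sequence_length(b) for b in (0x41, 0xE9, 0xF0)]
print(lengths)  # [1, 3, 4]
```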
For example, the Chinese character "鹅" has Unicode code point U+9E45 (binary 1001111001000101). It requires three bytes in UTF‑8, resulting in the byte sequence E9 B9 85. Its GBK encoding is B6 EC, showing no relationship between the two schemes.
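The three bytes can be derived by hand: the 16 bits of U+9E45 are split into groups of 4, 6, and 6 bits and placed into the pattern 1110xxxx 10xxxxxx 10xxxxxx. The function below is an illustrative sketch under that assumption; it covers only the three-byte range and skips checks such as surrogate code points.

```python
def utf8_encode_3byte(code_point: int) -> bytes:
    """Encode a code point in U+0800..U+FFFF with the 3-byte UTF-8 pattern."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("only the 3-byte range is handled in this sketch")
    b1 = 0b11100000 | (code_point >> 12)          # 1110xxxx: top 4 bits
    b2 = 0b10000000 | ((code_point >> 6) & 0x3F)  # 10xxxxxx: middle 6 bits
    b3 = 0b10000000 | (code_point & 0x3F)         # 10xxxxxx: low 6 bits
    return bytes([b1, b2, b3])

encoded = utf8_encode_3byte(0x9E45)
print(encoded.hex())  # e9b985
```

The result matches what Python's built-in codec produces for the same character.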
Most commonly used Chinese characters take three bytes in UTF‑8 but two in GBK, so converting pure Chinese GBK text to UTF‑8 increases file size by roughly 50%. ASCII characters stay at one byte in both encodings, so mixed text grows less.
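A quick size comparison, using a made-up sample string, shows where the 50% figure comes from and why ASCII content lowers it.

```python
chinese = "鹅" * 1000
gbk_size = len(chinese.encode("gbk"))     # 2 bytes per character
utf8_size = len(chinese.encode("utf-8"))  # 3 bytes per character
growth = utf8_size / gbk_size - 1
print(gbk_size, utf8_size, growth)        # 2000 3000 0.5

# ASCII characters are 1 byte in both, so mixed text grows by less than 50%.
mixed = "abc鹅" * 1000
mixed_growth = len(mixed.encode("utf-8")) / len(mixed.encode("gbk")) - 1
print(mixed_growth)                       # 0.2
```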
Other Frequently Encountered Encodings
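ANSI: Not a specific encoding. On Windows it refers to the system's default code page, which depends on the locale (e.g., code page 936, based on GBK, for Simplified Chinese; code page 874, based on TIS‑620, for Thai; code page 949, based on EUC‑KR, for Korean).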
Latin‑1 (ISO‑8859‑1): A single‑byte encoding that was MySQL's default charset before version 8.0. It covers 256 symbols, the first 128 matching ASCII. When non‑Latin‑1 data (e.g., UTF‑8 Chinese) is written to a Latin‑1 column through a connection that also treats it as Latin‑1, the raw bytes are stored unchanged; they appear as garbled characters unless the client decodes them as UTF‑8.
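Because Latin‑1 is a single‑byte charset in which every byte value maps to some character, it never rejects a byte sequence, but the displayed text may be unreadable. The common solution is to standardize on UTF‑8 end to end. In MySQL, declare DEFAULT CHARSET=utf8mb4 when creating tables; the older utf8 charset stores at most three bytes per character and cannot hold four‑byte UTF‑8 sequences.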
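The "never rejects, but garbles" behavior can be reproduced in plain Python without a database. This is only a simulation of the byte-level effect using the `latin-1` codec, not of MySQL itself.

```python
original = "鹅"
raw = original.encode("utf-8")     # e9 b9 85

# Latin-1 accepts any byte, so decoding never fails, but yields 3 unrelated characters.
as_latin1 = raw.decode("latin-1")
print(len(as_latin1), as_latin1 == original)  # 3 False

# The bytes themselves are intact, so re-encoding as Latin-1 and decoding
# as UTF-8 recovers the original text.
recovered = as_latin1.encode("latin-1").decode("utf-8")
print(recovered == original)       # True
```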
Liangxu Linux
Liangxu, a self‑taught IT professional now working as a Linux development engineer at a Fortune 500 multinational, shares extensive Linux knowledge—fundamentals, applications, tools, plus Git, databases, Raspberry Pi, etc. (Reply “Linux” to receive essential resources.)
