
Decoding Chinese Text: ASCII, GB2312, GBK, GB18030, and UTF‑8 Explained

This article explains how computer text is represented by assigning unique numeric codes to characters and converting those codes into bytes, then compares the encodings most often encountered when handling Chinese text (ASCII, GB2312, GBK, GB18030, and UTF‑8), detailing their compatibility, byte lengths, and practical impact on software development.

Liangxu Linux

Jul 14, 2024

What Character Encoding Does

Computers store text as sequences of bits. To display Chinese characters correctly, two tasks are required:

Assign each character a unique numeric identifier (the character set, e.g., Unicode).

Encode that number into a sequence of bytes and indicate how many consecutive bytes belong to a single character (the encoding).

The second task is not a simple binary conversion because multi‑byte characters must be delimited to avoid ambiguity. Unicode provides the universal mapping (task 1); any concrete encoding implements task 2.
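The two tasks can be seen separately in a short sketch. Python is an assumption here (the article names no language), and the character "中" is just an illustrative choice: `ord` exposes the number assigned by the character set, and `str.encode` performs the byte serialization for a chosen encoding.

```python
# Task 1: the character set assigns one number (code point) to the character.
ch = "中"
code_point = ord(ch)
print(f"U+{code_point:04X}")  # U+4E2D

# Task 2: an encoding turns that number into a byte sequence.
# The same code point yields different bytes under different encodings.
utf8_bytes = ch.encode("utf-8")
gbk_bytes = ch.encode("gbk")
print(utf8_bytes.hex(" ").upper())  # E4 B8 AD
print(gbk_bytes.hex(" ").upper())   # D6 D0
```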

Compatibility Relationships Among Common Chinese Encodings

All listed encodings contain the ASCII range (0‑127) as a subset. UTF‑8 and the GB family share only the ASCII subset, which is why mixing them often produces garbled text.
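A quick way to check the shared-subset claim is to encode the same strings under every encoding and compare the bytes. The sketch below uses Python's built-in codec names; the sample strings are arbitrary.

```python
ENCODINGS = ["ascii", "gb2312", "gbk", "gb18030", "utf-8"]

# Pure ASCII text: every encoding produces exactly the same bytes.
ascii_text = "Hello, 123"
ascii_results = {enc: ascii_text.encode(enc) for enc in ENCODINGS}
all_same = len(set(ascii_results.values())) == 1
print(all_same)  # True

# Chinese text: the GB family agrees with itself but not with UTF-8.
chinese = "中文"
gb_bytes = chinese.encode("gbk")
utf8_bytes = chinese.encode("utf-8")
print(gb_bytes == chinese.encode("gb2312") == chinese.encode("gb18030"))  # True
print(gb_bytes == utf8_bytes)  # False
```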

ASCII

ASCII is a 7‑bit code that is stored one character per byte (8 bits) with the most‑significant bit set to 0, giving 128 symbols. So‑called extended ASCII encodings use the high bit for an additional 128 symbols, which differ from one code page to another, but the standard ASCII set remains the first 128 code points. Because ASCII is a subset of every encoding discussed here, pure ASCII text does not become garbled when one of them is mistaken for another.
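Since every ASCII byte has its top bit clear, testing for pure ASCII reduces to a bit mask. The helper name `is_pure_ascii` below is hypothetical, chosen for illustration.

```python
def is_pure_ascii(data: bytes) -> bool:
    """Return True if every byte has its most-significant bit set to 0."""
    return all((byte & 0x80) == 0 for byte in data)


print(is_pure_ascii(b"plain text"))            # True
print(is_pure_ascii("鹅".encode("utf-8")))      # False: all three bytes are >= 0x80
print(is_pure_ascii("é".encode("latin-1")))    # False: 0xE9 uses the high bit
```

Python also offers `bytes.isascii()` and `str.isascii()` (3.7 and later), which perform the same check.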

GB2312, GBK, GB18030

These three Chinese standards form a strict superset chain: GB18030 ⊇ GBK ⊇ GB2312.

GB2312 – the earliest standard. In its usual byte encoding, ASCII characters take one byte and every GB2312 character takes exactly two bytes. It covers 6 763 Chinese characters and 682 other symbols, such as full‑width letters and punctuation.

GBK – extends GB2312 to roughly 21 000 Chinese characters, including the 20 902 CJK ideographs of Unicode 1.1 and therefore many traditional forms. ASCII still takes one byte and every other character two bytes. It is not compatible with the Taiwanese Big5 encoding.

GB18030 – further extends the repertoire to over 70 000 characters by introducing a 4‑byte form for characters not representable in 2 bytes. It remains backward compatible with GBK.
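The superset chain can be probed by trying to encode a few characters under each standard. This is a sketch using Python's built-in codecs; the helper `encoded_length` and the sample characters are illustrative choices, not part of any standard.

```python
def encoded_length(ch: str, encoding: str):
    """Byte length of ch in the given encoding, or None if it cannot be encoded."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None


GB_FAMILY = ("gb2312", "gbk", "gb18030")
samples = ["A", "鹅", "鵝", "😀"]  # ASCII, simplified, traditional, emoji

table = {ch: {enc: encoded_length(ch, enc) for enc in GB_FAMILY} for ch in samples}
for ch, row in table.items():
    print(ch, row)
# A  -> 1 byte everywhere
# 鹅 -> 2 bytes everywhere
# 鵝 -> not in GB2312, 2 bytes in GBK and GB18030
# 😀 -> only GB18030 can encode it, using the 4-byte form
```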

UTF‑8 (Unicode Transformation Format)

UTF‑8 is the dominant encoding for web pages and databases because it can represent every Unicode code point. It uses a variable‑length scheme of 1 to 4 bytes per character. A first byte that starts with 0 is a complete single‑byte character; otherwise the number of leading 1 bits in the first byte gives the total byte count:

0xxxxxxx – 1 byte (identical to ASCII).

110xxxxx – start of a 2‑byte sequence.

1110xxxx – start of a 3‑byte sequence.

11110xxx – start of a 4‑byte sequence.

Continuation bytes always begin with 10xxxxxx.
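The bit patterns above are enough to split a UTF‑8 byte stream into characters without decoding it. A minimal sketch follows; the function names are hypothetical, and it only inspects lead bytes, so it does not validate continuation bytes the way a real decoder would.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the total byte count announced by a UTF-8 lead byte."""
    if (first_byte & 0b1000_0000) == 0b0000_0000:
        return 1  # 0xxxxxxx
    if (first_byte & 0b1110_0000) == 0b1100_0000:
        return 2  # 110xxxxx
    if (first_byte & 0b1111_0000) == 0b1110_0000:
        return 3  # 1110xxxx
    if (first_byte & 0b1111_1000) == 0b1111_0000:
        return 4  # 11110xxx
    raise ValueError(f"0x{first_byte:02X} is a continuation or invalid lead byte")


def split_characters(data: bytes) -> list:
    """Cut a UTF-8 byte string into one bytes object per character."""
    chunks = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks


print(split_characters("a鹅😀".encode("utf-8")))
```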

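Example: the Chinese character "鹅" has Unicode code point U+9E45, which is 1001 1110 0100 0101 in binary. Those 16 bits are split into groups of 4, 6, and 6 bits (1001, 111001, 000101) and placed into the pattern 1110xxxx 10xxxxxx 10xxxxxx, giving the three UTF‑8 bytes E9 B9 85. The same character in GBK is B6 EC, a value that comes from the GB2312 code table and has no numeric relationship to the Unicode code point.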
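The same bit packing can be written out by hand and compared with the built‑in encoder. This is a simplified sketch: the function name `encode_utf8` is hypothetical, and unlike a conforming encoder it does not reject surrogate code points (U+D800 to U+DFFF).

```python
def encode_utf8(code_point: int) -> bytes:
    """Pack a Unicode code point into UTF-8 bytes using shifts and masks."""
    if code_point < 0x80:
        return bytes([code_point])
    if code_point < 0x800:
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point out of Unicode range")


manual = encode_utf8(0x9E45)
print(manual.hex(" ").upper())          # E9 B9 85
print(manual == "鹅".encode("utf-8"))    # True
```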

Other Frequently Encountered Encodings

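ANSI – not a single encoding; on Windows it maps to the system’s default code page (e.g., code page 936, which is GBK, for Simplified Chinese; code page 874, based on TIS‑620, for Thai; code page 949, an extension of EUC‑KR, for Korean). The term is used in this sense only in the Windows environment.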
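Because "ANSI" depends on the system locale, the same bytes read differently on differently configured machines. The sketch below simulates that by decoding one byte pair with three of Python's Windows code page codecs; the byte pair and the dictionary labels are illustrative assumptions.

```python
# The GBK bytes for "鹅", as a file saved as "ANSI" on a Simplified Chinese system.
raw = b"\xb6\xec"

# Python codec names for the Windows code pages of three locales.
code_pages = {
    "Simplified Chinese": "cp936",
    "Thai": "cp874",
    "Korean": "cp949",
}

decoded = {
    locale_name: raw.decode(codec, errors="replace")
    for locale_name, codec in code_pages.items()
}
for locale_name, text in decoded.items():
    print(locale_name, repr(text))
# Only the Simplified Chinese code page yields "鹅".
```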

Latin‑1 (ISO‑8859‑1) – a single‑byte encoding covering the first 256 code points. The first 128 coincide with ASCII. MySQL historically used Latin‑1 as the default charset. Storing UTF‑8 bytes in a Latin‑1 column preserves the raw byte values but results in garbled display unless the client decodes them as UTF‑8.
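The byte-preserving behavior of Latin‑1 can be shown without a database: every byte value from 0 to 255 maps to exactly one Latin‑1 character, so decoding and re-encoding is lossless. This is a sketch of the principle only and does not model MySQL's connection-level conversions.

```python
original = "鹅"
utf8_bytes = original.encode("utf-8")        # E9 B9 85

# A Latin-1 reader turns each byte into one character: garbled, but nothing is lost.
as_latin1 = utf8_bytes.decode("latin-1")
print(len(as_latin1))                        # 3 characters instead of 1

# Writing those characters back as Latin-1 restores the raw bytes ...
stored = as_latin1.encode("latin-1")
print(stored == utf8_bytes)                  # True

# ... and a client that decodes them as UTF-8 sees the original text.
recovered = stored.decode("utf-8")
print(recovered == original)                 # True
```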

Practical Implications

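Because ASCII is a common subset, any pure‑ASCII text is safe across all of these encodings. Mixing UTF‑8 with GBK (or GB18030) without conversion produces mojibake: both use bytes in the range 0x80 to 0xFF for non‑ASCII characters, but the same byte sequence stands for a different character, or for no valid character at all, in the other encoding. GB18030 keeps the one‑ and two‑byte codes of GB2312 and GBK unchanged, which preserves backward compatibility, and uses its 4‑byte form only for characters beyond the GBK repertoire.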
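A minimal sketch of how mojibake arises and how to avoid it, using the arbitrary sample text "你好". `errors="replace"` is used so that invalid sequences become U+FFFD instead of raising an exception.

```python
text = "你好"
gbk_bytes = text.encode("gbk")
utf8_bytes = text.encode("utf-8")

# Wrong: reading bytes with the other encoding's rules.
gbk_read_as_utf8 = gbk_bytes.decode("utf-8", errors="replace")
utf8_read_as_gbk = utf8_bytes.decode("gbk", errors="replace")
print(repr(gbk_read_as_utf8))   # replacement characters
print(repr(utf8_read_as_gbk))   # unrelated Chinese and kana characters

# Right: decode with the encoding the bytes were written in, then re-encode.
converted = gbk_bytes.decode("gbk").encode("utf-8")
print(converted == utf8_bytes)  # True
```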

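UTF‑8 requires 3 bytes for common Chinese characters, whereas GBK uses 2 bytes. Consequently, converting a GBK‑encoded document that consists mostly of Chinese characters to UTF‑8 increases its size by up to roughly 50 %. ASCII characters take one byte in both encodings, so mixed text grows less.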
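The size difference is easy to measure. In this sketch the helper `size_report` and both sample strings are hypothetical; the first string is pure Chinese and the second mixes ASCII with Chinese.

```python
def size_report(text: str):
    """Return (GBK size, UTF-8 size, UTF-8/GBK ratio) for the given text."""
    gbk_size = len(text.encode("gbk"))
    utf8_size = len(text.encode("utf-8"))
    return gbk_size, utf8_size, utf8_size / gbk_size


chinese_only = "中文编码测试"        # 6 Chinese characters
mixed = "GBK 与 UTF-8 的比较"        # 11 ASCII characters, 4 Chinese characters

print(size_report(chinese_only))    # (12, 18, 1.5)
print(size_report(mixed))           # ratio well below 1.5
```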

When storing text in databases, using a single‑byte charset such as Latin‑1 will accept any byte sequence, but the data will be displayed as garbled characters unless the client interprets the bytes with the original encoding. Declaring the charset explicitly avoids such errors. In MySQL, CHARSET=utf8 is an alias for utf8mb3, which stores at most 3 bytes per character and therefore cannot hold 4‑byte UTF‑8 characters such as emoji; utf8mb4 provides full Unicode support and is the recommended choice.
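Whether a given string fits into a 3‑byte‑per‑character column can be checked before it reaches the database. The sketch below relies only on the fact that MySQL's utf8 (utf8mb3) stores at most 3 bytes per character; the function names are hypothetical and nothing here talks to MySQL.

```python
def max_utf8_bytes_per_char(text: str) -> int:
    """Largest number of UTF-8 bytes any single character in text needs."""
    return max((len(ch.encode("utf-8")) for ch in text), default=0)


def needs_utf8mb4(text: str) -> bool:
    """True if text contains a character that a 3-byte charset cannot store."""
    return max_utf8_bytes_per_char(text) > 3


print(needs_utf8mb4("鹅"))      # False: 3 bytes
print(needs_utf8mb4("鹅😀"))    # True: the emoji needs 4 bytes
```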


