Why Unicode Matters: Understanding UTF‑8, UTF‑16, and UTF‑32 Encoding
This article traces the evolution from ASCII to Unicode, explains how Unicode defines universal code points, compares the UTF‑8, UTF‑16 and UTF‑32 encoding schemes, discusses byte order and BOM, and shows practical fixes for common encoding problems in Redis and MySQL.
From ASCII to Unicode
Early computers standardized the relationship between English characters and binary bits with the 7‑bit ASCII code, defining 128 characters (0x00‑0x7F). As computers spread worldwide, regional encodings such as GB2312, BIG5, and Shift JIS caused interoperability issues, prompting the creation of Unicode as a universal character set.
Unicode Basics
Unicode aims to assign a unique code point to every character in every writing system. The code space covers 0x0000‑0x10FFFF, which is 1,114,112 code points. A code point is the number that identifies a character, independent of how that number is stored; e.g., the Chinese character "中" is 0x4E2D and the letter "A" is 0x41.
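As a minimal sketch of the character-to-code-point mapping described above, Python's built-in ord() and chr() convert between the two. The variable name total_code_points is introduced here only for illustration.

```python
# Map characters to their Unicode code points and back.
for ch in ["A", "中"]:
    cp = ord(ch)                      # character -> code point (an integer)
    print(f"{ch!r} -> U+{cp:04X}")    # 'A' -> U+0041, then '中' -> U+4E2D
    assert chr(cp) == ch              # code point -> character round-trips

# The code space runs from 0x0000 to 0x10FFFF inclusive.
total_code_points = 0x10FFFF + 1
print(total_code_points)              # 1114112
```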
Character Set vs. Character Encoding
A character set is a collection of symbols (e.g., GB2312 contains thousands of Chinese characters). An encoding is a concrete method that maps those symbols to byte sequences. Unicode is a set; UTF‑8, UTF‑16 and UTF‑32 are encodings.
Unicode Storage Strategies
Unicode itself does not prescribe how code points are stored as bytes. Three encodings are in common use: two are variable‑length and avoid the waste of storing every character at full width, and one is fixed‑width:
UTF‑8 : 1–4 bytes per code point; compatible with ASCII because single‑byte characters start with a 0 bit.
UTF‑16 : 2 or 4 bytes; uses surrogate pairs for code points above 0xFFFF.
UTF‑32 : always 4 bytes; simple but space‑inefficient.
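The size trade-off between the three encodings can be checked directly. The following is a small sketch using Python's built-in codecs; the helper name encoded_sizes is an assumption made for this example.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes `text` occupies in each encoding."""
    # The "-be" variants are used so that no BOM is prepended to the output.
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-be")),
        "utf-32": len(text.encode("utf-32-be")),
    }

# One ASCII letter, one CJK character, one character above 0xFFFF.
for sample in ["A", "中", "\U00010A6F"]:
    print(f"U+{ord(sample):04X}", encoded_sizes(sample))
```

For "中" this reports 3 bytes in UTF‑8, 2 in UTF‑16 and 4 in UTF‑32, which is why UTF‑8 is not always the most compact choice for CJK‑heavy text.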
UTF‑8 Encoding Rules
UTF‑8 encodes a code point according to its binary length:
For 1‑byte symbols, the leading bit is 0 and the remaining 7 bits hold the code point (identical to ASCII).
For n‑byte symbols (n>1), the first byte starts with n leading 1 bits followed by a 0, continuation bytes start with 10, and the remaining bits contain the code point.
Example: encoding the character "中" (0x4E2D) falls in the 0x0800‑0xFFFF range, requiring three bytes. The resulting UTF‑8 bytes are E4 B8 AD (hex).
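The bit layout above can be written out by hand. The function below is an illustrative sketch, not a replacement for str.encode(); the name utf8_encode is made up for this example.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative helper)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x80:                      # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

print(utf8_encode(0x4E2D).hex(" "))    # e4 b8 ad
```

The result can be compared against Python's own encoder, e.g. utf8_encode(0x4E2D) == "中".encode("utf-8").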
UTF‑16 Encoding Rules
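Code points below 0x10000 are stored directly as a single 16‑bit unit (two bytes); the range 0xD800‑0xDFFF is reserved for surrogates and never represents a character on its own. For code points between 0x10000 and 0x10FFFF, UTF‑16 uses a surrogate pair: subtract 0x10000 to get a 20‑bit value, then place its upper 10 bits after the prefix 110110 (high surrogate, 0xD800‑0xDBFF) and its lower 10 bits after the prefix 110111 (low surrogate, 0xDC00‑0xDFFF).
Example: the Old South Arabian letter at 0x10A6F gives 0x10A6F - 0x10000 = 0x0A6F; its upper 10 bits are 0x002 and its lower 10 bits are 0x26F, so the surrogate pair is 0xD802 0xDE6F, written as D8 02 DE 6F (hex) in big‑endian byte order.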
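The surrogate arithmetic can be sketched in a few lines. The helper name utf16_surrogates is an assumption for this example; in practice str.encode("utf-16-be") does the same work.

```python
def utf16_surrogates(cp: int) -> tuple:
    """Split a code point above 0xFFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError(f"no surrogate pair needed or possible: {cp:#x}")
    offset = cp - 0x10000              # 20-bit value
    high = 0xD800 | (offset >> 10)     # prefix 110110 + upper 10 bits
    low = 0xDC00 | (offset & 0x3FF)    # prefix 110111 + lower 10 bits
    return high, low

high, low = utf16_surrogates(0x10A6F)
print(f"{high:04X} {low:04X}")         # D802 DE6F
```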
UTF‑32
UTF‑32 stores each code point in a fixed 4‑byte word, directly using the Unicode value without transformation.
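Because no transformation is involved, a UTF‑32 encoder is little more than an integer-to-bytes conversion. The sketch below assumes big-endian output by default; the name utf32_encode is made up for this example.

```python
def utf32_encode(cp: int, byteorder: str = "big") -> bytes:
    """Store a code point as one 4-byte integer (illustrative helper)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    return cp.to_bytes(4, byteorder)

print(utf32_encode(0x4E2D).hex(" "))            # 00 00 4e 2d
print(utf32_encode(0x4E2D, "little").hex(" "))  # 2d 4e 00 00
```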
Byte Order and BOM
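Encodings whose code units are wider than one byte (UTF‑16, UTF‑32) need a defined byte order: big‑endian (high byte first) or little‑endian (low byte first). A Byte Order Mark (BOM), the code point 0xFEFF encoded at the start of a file, signals the encoding and, for UTF‑16 and UTF‑32, the endianness: FE FF for UTF‑16BE, FF FE for UTF‑16LE, 00 00 FE FF for UTF‑32BE, and FF FE 00 00 for UTF‑32LE. UTF‑8 has no byte‑order ambiguity; its optional BOM, EF BB BF, serves only as an encoding signature.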
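BOM detection can be sketched with the constants in Python's codecs module. The function name sniff_bom and its return shape are assumptions for this example, and a file without a BOM still needs another way to determine its encoding.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32LE BOM (FF FE 00 00)
# begins with the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def sniff_bom(data: bytes) -> tuple:
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

data = b"\xff\xfe\x2d\x4e"             # UTF-16LE BOM followed by "中"
encoding, skip = sniff_bom(data)
print(encoding, data[skip:].decode(encoding))
```

One caveat of this ordering: UTF‑16LE text that begins with a BOM followed by a NUL character would be misread as UTF‑32LE, so the result is best treated as a hint.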
Common Encoding Pitfalls
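Redis : Redis stores keys and values as raw bytes, so Chinese text is stored correctly. By default, however, redis-cli prints non‑ASCII bytes as escaped hex sequences (for example \xe4\xb8\xad for "中"), which looks garbled. Starting redis-cli with the --raw flag prints the raw bytes instead, so a UTF‑8 terminal shows the correct characters.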
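The garbled display is typically a run of \xNN escapes, one per UTF‑8 byte, rather than corrupted data. As a rough sketch, assuming the escaped text is pure ASCII and the underlying bytes are UTF‑8, it can be turned back into readable text; the helper name unescape_redis_cli is hypothetical.

```python
def unescape_redis_cli(escaped: str) -> str:
    """Turn an escaped string such as r'\xe4\xb8\xad' back into text."""
    # Step 1: interpret the backslash escapes; each \xNN becomes U+00NN.
    as_codepoints = escaped.encode("ascii").decode("unicode_escape")
    # Step 2: latin-1 maps U+0000..U+00FF one-to-one back to raw bytes.
    raw = as_codepoints.encode("latin-1")
    # Step 3: decode the raw bytes as the UTF-8 they were all along.
    return raw.decode("utf-8")

print(unescape_redis_cli(r"\xe4\xb8\xad"))   # 中
```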
MySQL : The legacy utf8 charset (an alias for utf8mb3) stores at most three bytes per character, so it cannot hold four‑byte characters such as 0x10A6F; inserts containing them fail or are truncated, depending on the SQL mode. Switching the table charset to utf8mb4 with a matching collation (e.g., utf8mb4_unicode_ci) resolves the issue. The change can be applied with
ALTER TABLE … CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
and by setting character_set_server=utf8mb4 in my.cnf so that newly created databases default to utf8mb4.
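Before migrating, it can help to know which strings actually contain characters that the three-byte utf8 charset cannot store. A minimal sketch, with the hypothetical helper name needs_utf8mb4: any code point above 0xFFFF takes four bytes in UTF‑8.

```python
def needs_utf8mb4(text: str) -> bool:
    """True if `text` has a character that takes 4 bytes in UTF-8."""
    return any(ord(ch) > 0xFFFF for ch in text)

print(needs_utf8mb4("中文"))           # False: 3 bytes per character
print(needs_utf8mb4("\U00010A6F"))     # True: 0x10A6F needs 4 bytes
```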
Liangxu Linux
Liangxu, a self‑taught IT professional now working as a Linux development engineer at a Fortune 500 multinational, shares extensive Linux knowledge—fundamentals, applications, tools, plus Git, databases, Raspberry Pi, etc. (Reply “Linux” to receive essential resources.)
