Why Unicode Matters: Understanding UTF‑8, UTF‑16, and UTF‑32 Encoding
This article explains the history and purpose of Unicode, describes how character sets differ from encodings, details the storage formats of UTF‑8, UTF‑16, and UTF‑32, discusses byte order and BOM, and shows common encoding pitfalls in Redis and MySQL with practical solutions.
Unicode Overview
Unicode is an international standard character set that aims to assign a unique code point to every character in the world's writing systems, enabling reliable cross‑language and cross‑platform text exchange.
Character Sets and Encodings
A character set is a collection of characters (e.g., GB2312 for Simplified Chinese). An encoding maps those characters to specific byte sequences; Unicode is the set, while UTF‑8, UTF‑16, and UTF‑32 are concrete encoding schemes.
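The distinction can be seen in a short Python sketch (Python is chosen here only for illustration; the variable names are arbitrary). The character has one code point in the Unicode set, but each encoding turns it into a different byte sequence:

```python
# One character, one code point, three different byte sequences.
ch = "中"
code_point = ord(ch)                  # position in the Unicode character set

utf8_bytes = ch.encode("utf-8")
utf16_bytes = ch.encode("utf-16-be")  # big-endian, no BOM
utf32_bytes = ch.encode("utf-32-be")  # big-endian, no BOM

print(hex(code_point))    # 0x4e2d
print(utf8_bytes.hex())   # e4b8ad
print(utf16_bytes.hex())  # 4e2d
print(utf32_bytes.hex())  # 00004e2d
```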
Unicode Storage
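Unicode defines code points ranging from 0x0000 to 0x10FFFF (the range 0xD800‑0xDFFF is reserved for UTF‑16 surrogates and is never assigned to characters). A code point needs at most 21 bits, so it occupies 1 to 4 bytes depending on the encoding. Different encodings use different binary formats to represent these code points.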
UTF‑8 Encoding
UTF‑8 is a variable‑length encoding that uses 1 to 4 bytes per code point. For single‑byte symbols the first bit is 0, matching ASCII. For multi‑byte symbols the first byte starts with n leading 1 bits followed by a 0, and continuation bytes start with "10".
Single‑byte symbols: first bit 0, remaining 7 bits hold the Unicode code point (compatible with ASCII).
n‑byte symbols (n>1): first byte has n leading 1s and a 0, following bytes start with "10"; the remaining bits contain the code point.
For example, the Chinese character "中" (code point 0x4E2D) falls in the 0x0800‑0xFFFF range and is encoded in three bytes as 0xE4 0xB8 0xAD.
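The bit layout described above can be written out by hand. The following is a minimal sketch rather than a production encoder (Python's built-in `str.encode` already does this); the function name `utf8_encode` is made up for illustration:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single Unicode code point to UTF-8 by hand."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(utf8_encode(0x4E2D).hex())  # e4b8ad
```

Each continuation byte carries 6 bits of the code point, which is why the masks are `0x3F` and the shifts step by 6.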
UTF‑16 Encoding
UTF‑16 is also variable‑length, using either 2 or 4 bytes. Code points below 0x10000 are stored directly in one 2‑byte code unit. Code points from 0x10000 to 0x10FFFF are stored as a surrogate pair: subtract 0x10000 to get a 20‑bit value, then place its upper 10 bits in a high surrogate in the range 0xD800‑0xDBFF (binary prefix 110110) and its lower 10 bits in a low surrogate in the range 0xDC00‑0xDFFF (binary prefix 110111).
Characters with code point < 0x10000 use two bytes, identical to the Unicode value.
Characters ≥ 0x10000 use four bytes split into two 2‑byte surrogates as described.
Values above 0x10FFFF cannot be encoded in UTF‑16.
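Thus the character "中" (0x4E2D) is stored as the two bytes 0x4E 0x2D (in big‑endian order), while the Old South Arabian character 0x10A6F becomes the surrogate pair 0xD802 0xDE6F: 0x10A6F - 0x10000 = 0x0A6F, whose upper 10 bits are 0x002 and whose lower 10 bits are 0x26F.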
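The surrogate arithmetic can be sketched in a few lines of Python. The helper name `utf16_code_units` is hypothetical, and it returns 16-bit code units rather than bytes so that byte order stays out of the picture:

```python
def utf16_code_units(cp: int) -> list:
    """Return the UTF-16 code unit(s) for one code point."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x10000:
        # Stored directly in a single 16-bit unit.
        return [cp]
    v = cp - 0x10000              # 20-bit "adjusted" value
    high = 0xD800 | (v >> 10)     # prefix 110110 + upper 10 bits
    low = 0xDC00 | (v & 0x3FF)    # prefix 110111 + lower 10 bits
    return [high, low]


print([hex(u) for u in utf16_code_units(0x4E2D)])   # ['0x4e2d']
print([hex(u) for u in utf16_code_units(0x10A6F)])  # ['0xd802', '0xde6f']
```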
UTF‑32 Encoding
UTF‑32 uses a fixed length of four bytes for every code point, directly storing the Unicode value without transformation. This wastes space, especially for text that is mostly ASCII, but simplifies processing: the n‑th code point always starts at byte offset 4n.
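Because every code point takes exactly four bytes, indexing into UTF‑32 data is plain arithmetic. A small sketch using the standard `struct` module, with big-endian order assumed and both function names invented for this example:

```python
import struct


def utf32_encode_be(text: str) -> bytes:
    """Store each code point as a 4-byte big-endian unsigned integer."""
    return b"".join(struct.pack(">I", ord(ch)) for ch in text)


def utf32_code_point_at(data: bytes, index: int) -> int:
    """Read the index-th code point directly at byte offset 4 * index."""
    (cp,) = struct.unpack_from(">I", data, index * 4)
    return cp


data = utf32_encode_be("a中")
print(data.hex())                         # 0000006100004e2d
print(hex(utf32_code_point_at(data, 1)))  # 0x4e2d
```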
Conversion Between UTF‑8, UTF‑16, and UTF‑32
All three encodings can be converted by first decoding the byte sequence to obtain the Unicode code point, then re‑encoding that code point using the target format's rules.
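In Python this decode-then-encode path is two method calls. The wrapper below is only a sketch; `transcode` is a name chosen for this example, while the codec names are Python's standard ones:

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes from one Unicode encoding to another."""
    text = data.decode(source)   # bytes -> code points
    return text.encode(target)   # code points -> bytes


utf8 = b"\xe4\xb8\xad"                          # "中" in UTF-8
utf16 = transcode(utf8, "utf-8", "utf-16-be")
utf32 = transcode(utf16, "utf-16-be", "utf-32-be")

print(utf16.hex())  # 4e2d
print(utf32.hex())  # 00004e2d
```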
Byte Order (Endianness) and BOM
Encodings whose code units are wider than one byte (UTF‑16, UTF‑32) require a byte order. Big‑endian stores the most significant byte first; little‑endian stores the least significant byte first. A Byte Order Mark (BOM) at the start of a file indicates the encoding and order: FE FF for UTF‑16BE, FF FE for UTF‑16LE, 00 00 FE FF for UTF‑32BE, and FF FE 00 00 for UTF‑32LE. UTF‑8 is byte‑oriented and has no byte order, so its optional BOM, EF BB BF, serves only as an encoding signature.
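A BOM check can be sketched with the constants in Python's standard `codecs` module. This is a heuristic, not a full encoding detector: the function name `sniff_bom` is made up, and data that starts with FF FE 00 00 is treated as UTF‑32LE even though it could in principle be UTF‑16LE followed by a NUL character.

```python
import codecs

# Check the 4-byte BOMs first: the UTF-32LE BOM begins with the UTF-16LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def sniff_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


raw = b"\xff\xfe\x2d\x4e"              # UTF-16LE BOM followed by "中"
encoding, skip = sniff_bom(raw)
print(encoding)                        # utf-16-le
print(raw[skip:].decode(encoding))     # 中
```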
Common Encoding Issues in Redis and MySQL
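Redis itself stores keys and values as raw bytes, but redis-cli prints non‑ASCII bytes as escaped hex sequences by default (for example, the Chinese character "中" appears as \xe4\xb8\xad), which looks garbled. Starting the client with the --raw option prints the bytes unescaped so the terminal can render them.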
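The escaped output is not corrupted data; it is the UTF‑8 bytes written in hex. The sketch below assumes the escaped form looks like `\xe4\xb8\xad` (only `\xNN` escapes mixed with plain ASCII) and turns it back into text. The helper name `unescape_hex_bytes` is hypothetical and is not part of any Redis client:

```python
def unescape_hex_bytes(shown: str) -> str:
    """Turn a string such as r'\\xe4\\xb8\\xad' back into readable text.

    Assumes the input contains only ASCII characters and \\xNN escapes,
    and that the underlying bytes are valid UTF-8.
    """
    # Interpret the \xNN escapes, giving one character per original byte ...
    per_byte = shown.encode("ascii").decode("unicode_escape")
    # ... map those characters back to the raw bytes, then decode as UTF-8.
    return per_byte.encode("latin-1").decode("utf-8")


print(unescape_hex_bytes(r"\xe4\xb8\xad"))       # 中
print(unescape_hex_bytes(r"user:\xe4\xb8\xad"))  # user:中
```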
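MySQL's utf8 charset (an alias for utf8mb3) stores at most three bytes per character, so it covers only code points up to 0xFFFF. Inserting a character that requires four bytes (e.g., code point 0x10A6F) fails with an error or is truncated with a warning, depending on the SQL mode. Using utf8mb4 with an appropriate collation resolves the issue.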
mysql> show create table test\G
*************************** 1. row ***************************
Table: test
Create Table: CREATE TABLE `test` (
`name` char(32) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
After altering the table to utf8mb4 and setting the collation to utf8mb4_unicode_ci, inserting the four‑byte character succeeds.
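On the application side, it can help to find out in advance which characters would not fit into a three-byte utf8 column. The following sketch assumes the application is written in Python; the helper name `four_byte_chars` is invented for this example and is not a MySQL or driver API:

```python
def four_byte_chars(text: str) -> list:
    """Return the characters that need four bytes in UTF-8.

    These are the code points above 0xFFFF, which a three-byte
    utf8 (utf8mb3) column cannot store.
    """
    return [ch for ch in text if ord(ch) > 0xFFFF]


sample = "中" + chr(0x10A6F)
for ch in four_byte_chars(sample):
    # Report the code point and its UTF-8 length.
    print(hex(ord(ch)), len(ch.encode("utf-8")))  # 0x10a6f 4
```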
Conclusion
The article traced the evolution from ASCII to Unicode, explained the three main Unicode encodings (UTF‑8, UTF‑16, UTF‑32), covered byte order and BOM, and demonstrated practical solutions for encoding problems in Redis and MySQL.
Open Source Linux
Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.