
Why Unicode Matters: Understanding UTF‑8, UTF‑16, and UTF‑32 Encoding

This article explains the history and purpose of Unicode, describes how character sets differ from encodings, details the storage formats of UTF‑8, UTF‑16, and UTF‑32, discusses byte order and BOM, and shows common encoding pitfalls in Redis and MySQL with practical solutions.

Open Source Linux

Jul 2, 2021


Unicode Overview

Unicode is an international standard character set that assigns a unique number, called a code point, to each character across the world's writing systems, enabling reliable cross‑language and cross‑platform text exchange.

Character Sets and Encodings

A character set is a collection of characters (e.g., GB2312 for Simplified Chinese). An encoding maps those characters to specific byte sequences; Unicode is the set, while UTF‑8, UTF‑16, and UTF‑32 are concrete encoding schemes.
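As a small illustration of this distinction, the Python sketch below takes one character, prints its single Unicode code point, and then prints the different byte sequences the three encodings produce for it. The choice of "中" and the big‑endian codec variants is only for demonstration.

```python
# One character, one code point, three different byte sequences.
ch = "中"
code_point = ord(ch)  # the character's number in the Unicode character set

# Encode the same code point with three encoding schemes (big-endian, no BOM).
encodings = {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}

for name, data in encodings.items():
    print(f"U+{code_point:04X} in {name}: {data.hex(' ')}")
```

The code point stays the same in every line of output; only the bytes differ.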

Unicode Storage

Unicode defines code points ranging from 0x0000 to 0x10FFFF, so a code point needs at most 21 bits. Depending on the encoding, one code point occupies 1 to 4 bytes, and each encoding uses a different binary format to represent it.

UTF‑8 Encoding

UTF‑8 is a variable‑length encoding that uses 1 to 4 bytes per code point. The leading bits of each byte tell a decoder how to read it:

Single‑byte symbols: the first bit is 0 and the remaining 7 bits hold the code point, so code points 0x00‑0x7F are byte‑for‑byte identical to ASCII.

n‑byte symbols (n>1): the first byte starts with n 1 bits followed by a 0, and each following byte starts with "10"; the remaining bits, read in order, hold the code point.

For example, the Chinese character "中" (code point 0x4E2D) falls in the 0x0800‑0xFFFF range and is encoded in three bytes as 0xE4 0xB8 0xAD.
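The bit layout described above can be sketched as a hand‑written encoder. This is a minimal illustration, not a replacement for Python's built‑in `str.encode`; the function name `utf8_encode` is hypothetical.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point to UTF-8 by hand (illustrative sketch)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(utf8_encode(0x4E2D).hex(" "))  # e4 b8 ad
```

Comparing the result with `chr(cp).encode("utf-8")` for a few code points is an easy way to check the sketch.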

UTF‑16 Encoding

UTF‑16 is also variable‑length, using either 2 or 4 bytes. Code points below 0x10000 are stored directly in two bytes (the range 0xD800‑0xDFFF is reserved for surrogates and never represents a character on its own). For code points from 0x10000 to 0x10FFFF, 0x10000 is first subtracted, leaving a 20‑bit value that is stored as a surrogate pair: the high surrogate lies in 0xD800‑0xDBFF (binary prefix 110110) and carries the upper 10 bits, and the low surrogate lies in 0xDC00‑0xDFFF (binary prefix 110111) and carries the lower 10 bits.

Characters with code point < 0x10000 use two bytes, identical to the Unicode value.

Characters ≥ 0x10000 use four bytes split into two 2‑byte surrogates as described.

Values above 0x10FFFF cannot be encoded in UTF‑16.

Thus the character "中" (0x4E2D) is stored as the two bytes 0x4E 0x2D (in big‑endian order), while the Old South Arabian character 0x10A6F becomes the surrogate pair 0xD802 0xDE6F.
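The surrogate arithmetic can be worked through in a few lines. The following is a minimal sketch; `utf16_units` is a hypothetical helper name, and it returns 16‑bit code units rather than bytes so that byte order stays out of the picture.

```python
def utf16_units(cp: int) -> list[int]:
    """Return the UTF-16 code units for one code point (illustrative sketch)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x10000:
        # Stored directly as a single 16-bit unit.
        return [cp]
    v = cp - 0x10000                # 20-bit value
    high = 0xD800 | (v >> 10)       # prefix 110110 + upper 10 bits
    low = 0xDC00 | (v & 0x3FF)      # prefix 110111 + lower 10 bits
    return [high, low]


print([hex(u) for u in utf16_units(0x4E2D)])   # ['0x4e2d']
print([hex(u) for u in utf16_units(0x10A6F)])  # ['0xd802', '0xde6f']
```

For 0x10A6F the subtraction gives 0x00A6F; its upper 10 bits are 0x002 and its lower 10 bits are 0x26F, which yields 0xD802 and 0xDE6F.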

UTF‑32 Encoding

UTF‑32 uses a fixed length of four bytes for every code point, directly storing the Unicode value without transformation. This wastes space but simplifies processing.
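To make the trade‑off concrete, the sketch below packs code points as 4‑byte big‑endian integers and then reads the n‑th code point with a single offset calculation. The helper names are hypothetical, and big‑endian order is an assumption made for the example.

```python
import struct


def utf32be_encode(text: str) -> bytes:
    """Store each code point as a 4-byte big-endian unsigned integer."""
    return b"".join(struct.pack(">I", ord(ch)) for ch in text)


def code_point_at(data: bytes, index: int) -> int:
    """Fixed width means the n-th code point always starts at byte 4 * n."""
    return struct.unpack_from(">I", data, 4 * index)[0]


data = utf32be_encode("a中\U00010A6F")
print(len(data))                    # 12 bytes for 3 code points
print(hex(code_point_at(data, 2)))  # 0x10a6f
```

The same text takes 8 bytes in UTF‑8 (1 + 3 + 4), but finding its third code point there requires scanning from the start.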

Conversion Between UTF‑8, UTF‑16, and UTF‑32

All three encodings can be converted by first decoding the byte sequence to obtain the Unicode code point, then re‑encoding that code point using the target format's rules.
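In Python this two‑step conversion maps directly onto `bytes.decode` followed by `str.encode`. The wrapper below is a minimal sketch; the function name `transcode` is hypothetical.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes between encodings by going through code points."""
    text = data.decode(source)   # bytes -> Unicode code points
    return text.encode(target)   # code points -> bytes in the target encoding


utf8_bytes = b"\xe4\xb8\xad"  # "中" in UTF-8
print(transcode(utf8_bytes, "utf-8", "utf-16-be").hex(" "))  # 4e 2d
print(transcode(utf8_bytes, "utf-8", "utf-32-be").hex(" "))  # 00 00 4e 2d
```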

Byte Order (Endianness) and BOM

UTF‑16 and UTF‑32 use code units wider than one byte, so the order of the bytes within each unit must be known. Big‑endian stores the most significant byte first; little‑endian stores the least significant byte first. A Byte Order Mark (BOM) at the start of a file identifies the encoding and byte order: FE FF for UTF‑16BE, FF FE for UTF‑16LE, 00 00 FE FF for UTF‑32BE, and FF FE 00 00 for UTF‑32LE. UTF‑8 has no byte‑order ambiguity; its optional BOM, EF BB BF, serves only as an encoding signature.

Common Encoding Issues in Redis and MySQL

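Redis stores keys and values as raw bytes. By default the redis-cli client prints non‑ASCII bytes, such as UTF‑8 encoded Chinese keys, as \x escape sequences, which looks garbled; starting the client with the --raw option prints the bytes as they are.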
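The escaped form is still the original UTF‑8 data, just written out byte by byte. As a minimal sketch, assuming the client output looks like `\xe4\xb8\xad` (the exact output format is an assumption here, and `unescape_cli_output` is a hypothetical helper), the text can be recovered like this:

```python
import re


def unescape_cli_output(escaped: str) -> str:
    """Turn \\xHH escape sequences back into bytes, then decode them as UTF-8.

    Only \\xHH escapes are handled; other escape forms are left untouched.
    """
    raw = re.sub(
        rb"\\x([0-9a-fA-F]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        escaped.encode("ascii"),
    )
    return raw.decode("utf-8")


print(unescape_cli_output(r"key:\xe4\xb8\xad"))  # key:中
```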

MySQL's utf8 charset (an alias for utf8mb3) stores at most three bytes per character, causing errors for characters that need four bytes in UTF‑8, that is, any code point above 0xFFFF (e.g., 0x10A6F). Using utf8mb4 with an appropriate collation resolves the issue.

mysql> show create table test\G
*************************** 1. row ***************************
       Table: test
Create Table: CREATE TABLE `test` (
  `name` char(32) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)

After altering the table to utf8mb4 and setting the collation to utf8mb4_unicode_ci, inserting the four‑byte character succeeds.
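Before migrating, it can help to know which strings actually need utf8mb4. The check below is a minimal sketch that runs on the application side; `needs_utf8mb4` is a hypothetical helper and makes no database calls.

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any character needs four bytes in UTF-8 (code point > 0xFFFF)."""
    return any(ord(ch) > 0xFFFF for ch in text)


for sample in ("中", "\U00010A6F"):
    size = len(sample.encode("utf-8"))
    print(f"U+{ord(sample):04X}: {size} bytes, needs utf8mb4: {needs_utf8mb4(sample)}")
```

The character "中" takes three bytes and fits in the three‑byte utf8 charset, while 0x10A6F takes four bytes and does not.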

Conclusion

This article introduced Unicode, distinguished character sets from encodings, explained the three main Unicode encodings (UTF‑8, UTF‑16, UTF‑32), covered byte order and BOM, and demonstrated practical solutions for encoding problems in Redis and MySQL.


