Why MySQL’s utf8 Isn’t True UTF‑8 and How utf8mb4 Fixes It
A collection of Zhihu answers explains that MySQL’s original utf8 charset stores at most three bytes per character, which causes data loss for emoji and rare symbols, and shows how the newer utf8mb4 charset provides full Unicode support and became the default in MySQL 8.0.
Background
MySQL’s original utf8 character set was implemented as a three‑byte UTF‑8 variant (now named utf8mb3). It only supports characters in the Basic Multilingual Plane (BMP), code points U+0000 through U+FFFF, which covers most characters in everyday use but not all of Unicode. Characters that require four bytes in UTF‑8 (most emoji, many historic scripts, and other supplementary‑plane symbols) cannot be stored: depending on the SQL mode, the insert is rejected with an error or the value is stored damaged with only a warning.
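The failure can be reproduced on a scratch schema. The following is a minimal sketch, not taken from the original answers: the table emoji_demo and its column names are hypothetical, and the exact outcome of the second insert depends on the server's sql_mode.

```sql
-- Hypothetical scratch table; names are for illustration only.
CREATE TABLE emoji_demo (
  id       INT AUTO_INCREMENT PRIMARY KEY,
  text_mb3 VARCHAR(20) CHARACTER SET utf8mb3,
  text_mb4 VARCHAR(20) CHARACTER SET utf8mb4
);

-- Make sure the connection itself can carry four-byte characters.
SET NAMES utf8mb4;

-- Accepted: U+1F600 fits in a utf8mb4 column.
INSERT INTO emoji_demo (text_mb4) VALUES ('hi 😀');

-- With strict SQL mode this is rejected with
-- ERROR 1366 "Incorrect string value". In non-strict mode the row is
-- stored with the emoji lost and only a warning is raised.
INSERT INTO emoji_demo (text_mb3) VALUES ('hi 😀');
```

Running SHOW WARNINGS right after the second insert on a non-strict server shows what was lost.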
Why the limitation persisted
When MySQL added UTF‑8 support, the three‑byte implementation was released widely. Changing it later would have broken existing schemas and applications, so MySQL kept the buggy version and introduced a new charset, utf8mb4, that fully implements Unicode (up to four bytes per character).
Technical differences
utf8 / utf8mb3: maximum 3 bytes per character, BMP only.
utf8mb4: maximum 4 bytes per character, full Unicode range (including emoji, mathematical symbols, rare CJK glyphs).
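The byte counts can be checked directly in MySQL. This sketch uses hex literals with the _utf8mb4 introducer so the result does not depend on the client's own encoding; the column aliases are arbitrary.

```sql
SELECT
  LENGTH(_utf8mb4 X'41')            AS ascii_bytes,  -- 'A', U+0041: 1 byte
  LENGTH(_utf8mb4 X'E4B8AD')        AS cjk_bytes,    -- U+4E2D: 3 bytes, inside the BMP
  LENGTH(_utf8mb4 X'F09F9880')      AS emoji_bytes,  -- U+1F600: 4 bytes, outside the BMP
  CHAR_LENGTH(_utf8mb4 X'F09F9880') AS emoji_chars;  -- still a single character
```

LENGTH counts bytes and CHAR_LENGTH counts characters, so the emoji reports 4 and 1: it is a single character that the three-byte charset has no room for.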
Default charset changes in MySQL 8.0
Starting with MySQL 8.0, the default server character set changed from latin1 to utf8mb4, and the default collation became utf8mb4_0900_ai_ci. utf8mb4_unicode_ci and utf8mb4_general_ci remain common choices, particularly when a schema must also load on servers older than 8.0, where the 0900 collations are not available.
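To see what a given server and session actually use, the standard system variables can be inspected. A short sketch; the output varies with the server version and configuration.

```sql
-- Server-wide defaults (utf8mb4 / utf8mb4_0900_ai_ci on an unmodified MySQL 8.0).
SELECT @@character_set_server, @@collation_server;

-- Character sets and collations in effect for the current session.
SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';

-- Collations this server offers for utf8mb4.
SHOW COLLATION WHERE Charset = 'utf8mb4';
```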
Deprecation timeline
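In MySQL 8.0 the utf8mb3 character set is deprecated, and utf8 is still an alias for it. From MySQL 8.0.28 onward the server reports the name as utf8mb3 rather than utf8 in SHOW output and INFORMATION_SCHEMA. The three‑byte charset is expected to be removed in a future major release, so new schemas should explicitly use utf8mb4, and existing tables should be migrated.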
Migration guidance
Identify columns defined with CHAR, VARCHAR, TEXT etc. that use utf8 or utf8mb3.
Alter the table to convert the charset, e.g.:
ALTER TABLE my_table CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
Update the server and client configuration (my.cnf) to set character-set-server=utf8mb4 and collation-server=utf8mb4_unicode_ci.
Verify that the maximum row size does not exceed the MySQL limit (65,535 bytes) after conversion, especially for tables with many VARCHAR(255) columns.
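The identification and verification steps above can be sketched with information_schema queries. The schema name my_db is a placeholder, and the byte estimate is deliberately rough: it assumes four bytes per character and ignores length prefixes and non-character columns.

```sql
-- 1. Find text columns still on the three-byte charset.
--    'my_db' is a placeholder schema name.
SELECT table_name, column_name, column_type,
       character_set_name, collation_name
FROM information_schema.columns
WHERE table_schema = 'my_db'
  AND character_set_name IN ('utf8', 'utf8mb3')
ORDER BY table_name, ordinal_position;

-- 2. Rough worst-case size of CHAR/VARCHAR columns per table once each
--    character may take 4 bytes; compare against the 65,535-byte row limit.
SELECT table_name,
       SUM(character_maximum_length * 4) AS worst_case_bytes
FROM information_schema.columns
WHERE table_schema = 'my_db'
  AND data_type IN ('char', 'varchar')
GROUP BY table_name
ORDER BY worst_case_bytes DESC;

-- 3. After running ALTER TABLE, confirm the new charset and collation.
SHOW CREATE TABLE my_table;
```

CONVERT TO CHARACTER SET may also widen TEXT column types (for example TEXT to MEDIUMTEXT) so that existing values still fit, which is why the table definition is worth re-reading after the conversion.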
Practical impact
Real‑world reports show data loss when storing emoji or rare Chinese characters with the three‑byte charset. Switching to utf8mb4 resolves these issues without affecting existing BMP data.
dbaplus Community
