Why MySQL’s utf8 Isn’t True UTF‑8 and How utf8mb4 Solves It
MySQL’s original utf8 character set stores at most three bytes per character, so it cannot hold emoji and other symbols that need four bytes, and attempts to store them lead to errors or data loss. MySQL later introduced utf8mb4 as a complete UTF‑8 implementation and now recommends it as the default encoding.
Background
On Zhihu a user asked why MySQL discourages the use of utf8. The answer lies in MySQL’s early implementation of UTF‑8, which included optimizations that unintentionally broke full UTF‑8 support.
What Went Wrong
MySQL’s original utf8 stores each character in at most three bytes, whereas the UTF‑8 standard allows up to four. This “optimized” version cannot represent characters that require four bytes, that is, code points above U+FFFF, such as many emoji and some rare Chinese characters. After the problem was discovered, MySQL could not simply remove the broken implementation because many deployments already relied on it.
Consequently, MySQL kept the buggy utf8 for backward compatibility and introduced a new character set, utf8mb4, which fully complies with the UTF‑8 standard.
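The three-byte limit is easy to see outside MySQL. The following is a minimal Python sketch, not MySQL code: it uses Python's own standard UTF‑8 encoder to show how many bytes a character needs. The function names are illustrative and do not come from the original article.

```python
from typing import List


def utf8_length(ch: str) -> int:
    """Return the number of bytes one character occupies in standard UTF-8."""
    return len(ch.encode("utf-8"))


def four_byte_characters(text: str) -> List[str]:
    """Return the characters in text that need four bytes in UTF-8.

    These are the code points above U+FFFF, which a character set limited
    to three bytes per character (such as MySQL's legacy utf8) cannot store.
    """
    return [ch for ch in text if ord(ch) > 0xFFFF]


if __name__ == "__main__":
    # ASCII letter, accented Latin letter, common Chinese character,
    # emoji, and a rare Chinese character outside the Basic Multilingual Plane.
    for ch in ["A", "é", "中", "😀", "𠮷"]:
        print(f"U+{ord(ch):04X} needs {utf8_length(ch)} byte(s)")
    print(four_byte_characters("order 😀 #42"))
```

Running a check like four_byte_characters over incoming text is one way to find out whether existing data would be affected before choosing a character set.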
Key Lessons
Do not deviate from established standards without a clear migration path.
Personal assumptions about correctness can cause serious problems.
If your implementation does not match the public standard, it is effectively wrong.
Community Experiences
“We once stored orders containing emoji; the content was truncated after the emoji.”
“MySQL’s utf8 is a trimmed version; utf8mb4 is the real UTF‑8.”
“Our project used MySQL with utf8 for six months, then an ID with a rare character caused errors. Switching to utf8mb4 fixed it.”
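The truncation described in the first report can be imitated in a few lines. This is only a simulation under an assumption: that the server runs in a non-strict SQL mode, where an unsupported character cuts the value off with a warning rather than rejecting the statement. In strict mode the insert would be expected to fail with an “Incorrect string value” error instead. The function name is hypothetical.

```python
def simulate_legacy_utf8_store(text: str) -> str:
    """Imitate storing text in a three-byte-only column in non-strict mode.

    Assumption: everything from the first four-byte character onward is lost.
    This is a simulation for illustration, not a call into MySQL.
    """
    for index, ch in enumerate(text):
        if ord(ch) > 0xFFFF:  # needs four bytes in UTF-8
            return text[:index]
    return text


if __name__ == "__main__":
    note = "Thanks 😀 please deliver after 6pm"
    print(repr(simulate_legacy_utf8_store(note)))
```

Note that the delivery instruction after the emoji disappears along with the emoji itself, which is why this kind of loss is easy to miss until someone reads the stored value.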
Technical Details
MySQL 8.0 officially recommends utf8mb4. This encoding supports up to four bytes per character, allowing proper storage of all Unicode characters, including emoji and rare glyphs. Starting with MySQL 8.0, the default character set changed from latin1 to utf8mb4, and the default collation changed from latin1_swedish_ci to utf8mb4_0900_ai_ci. On earlier versions, the collations commonly used with utf8mb4 are utf8mb4_unicode_ci and utf8mb4_general_ci; utf8mb4_general_ci is faster but less accurate when comparing and sorting, so utf8mb4_unicode_ci is usually the safer choice of the two.
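For an existing schema, switching usually means altering the database default and converting each table. The sketch below only builds the SQL text and does not connect to a server. The database and table names are hypothetical, and the collation defaults to utf8mb4_unicode_ci as an assumption that should be adjusted to the server version in use.

```python
from typing import List


def quote_identifier(name: str) -> str:
    """Quote a MySQL identifier with backticks, doubling embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def build_conversion_statements(
    database: str,
    tables: List[str],
    collation: str = "utf8mb4_unicode_ci",
) -> List[str]:
    """Build ALTER statements that move a database and its tables to utf8mb4."""
    statements = [
        f"ALTER DATABASE {quote_identifier(database)} "
        f"CHARACTER SET utf8mb4 COLLATE {collation};"
    ]
    for table in tables:
        statements.append(
            f"ALTER TABLE {quote_identifier(table)} "
            f"CONVERT TO CHARACTER SET utf8mb4 COLLATE {collation};"
        )
    return statements


if __name__ == "__main__":
    # "shop", "orders" and "customers" are placeholder names.
    for statement in build_conversion_statements("shop", ["orders", "customers"]):
        print(statement)
```

Statements like these rewrite table data, so they are best reviewed and tried on a backup first. Indexed string columns may also run into index length limits, because each character can now take four bytes instead of three.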
Configuration Example
Typical my.ini settings (my.cnf on Linux) to enforce utf8mb4:
[client]
default-character-set=utf8mb4
[mysql]
default-character-set=utf8mb4
[mysqld]
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
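A configuration like this can be checked automatically before a deployment. The sketch below uses Python's standard configparser module to confirm that the settings shown above are present. The function name and the expected-value table are illustrative assumptions, and real option files may contain directives such as !includedir that this simple parser does not handle.

```python
import configparser
from typing import List

# Settings this check expects, mirroring the example configuration.
EXPECTED = {
    ("client", "default-character-set"): "utf8mb4",
    ("mysql", "default-character-set"): "utf8mb4",
    ("mysqld", "character-set-server"): "utf8mb4",
}

SAMPLE = """
[client]
default-character-set=utf8mb4
[mysql]
default-character-set=utf8mb4
[mysqld]
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
"""


def charset_problems(config_text: str) -> List[str]:
    """Return a description of each expected utf8mb4 setting that is missing or different."""
    parser = configparser.ConfigParser(
        allow_no_value=True, strict=False, interpolation=None
    )
    # MySQL accepts "-" and "_" interchangeably in option names, so normalize.
    parser.optionxform = lambda key: key.lower().replace("_", "-")
    parser.read_string(config_text)

    problems = []
    for (section, key), wanted in EXPECTED.items():
        actual = parser.get(section, key, fallback=None)
        if actual != wanted:
            problems.append(f"[{section}] {key} is {actual!r}, expected {wanted!r}")
    return problems


if __name__ == "__main__":
    print(charset_problems(SAMPLE) or "all expected utf8mb4 settings found")
```

The check only reads the file. The values a running server actually uses can differ from the file, so they are still worth confirming on the server itself, for example with SHOW VARIABLES LIKE 'character_set%'.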
