
Why MySQL’s utf8 Isn’t True UTF‑8 and How utf8mb4 Solves It

MySQL’s original utf8 character set stores at most three bytes per character, so it cannot hold emoji and other characters that need four bytes in UTF‑8. MySQL later added utf8mb4 as a complete UTF‑8 implementation, and MySQL 8.0 uses it as the default character set.

Efficient Ops

Background

On Zhihu, a user asked why MySQL discourages the use of utf8. The answer lies in MySQL’s early implementation of UTF‑8, which capped characters at three bytes as an optimization and therefore could not cover the full UTF‑8 range.

What Went Wrong

MySQL’s original utf8 stores each character in at most three bytes, whereas the UTF‑8 standard allows up to four. Three bytes are enough for the Basic Multilingual Plane (code points up to U+FFFF) but not for anything above it, which includes most emoji and some rare Chinese characters. By the time the limitation was recognized, MySQL could not simply change or remove the implementation because many deployments already relied on it.

Consequently, MySQL kept the buggy utf8 for backward compatibility and introduced a new character set, utf8mb4, which fully complies with the UTF‑8 standard.
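
The three-byte limit is easy to see outside the database. The following is a minimal Python sketch, not anything MySQL ships: the helper names `utf8_byte_length` and `fits_in_three_bytes` are made up for illustration, and the sample characters are arbitrary.

```python
def utf8_byte_length(ch: str) -> int:
    """Return how many bytes one character occupies in standard UTF-8."""
    return len(ch.encode("utf-8"))


def fits_in_three_bytes(text: str) -> bool:
    """True if every character needs at most three UTF-8 bytes,
    i.e. the text stays within what the legacy utf8 charset can hold."""
    return all(utf8_byte_length(ch) <= 3 for ch in text)


# Sample characters, written as escapes so the source stays ASCII:
# Latin letter, accented letter, common CJK character, emoji, rare CJK character.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600", "\U00020BB7"]
for ch in samples:
    print(f"U+{ord(ch):04X} -> {utf8_byte_length(ch)} byte(s)")
# U+0041 -> 1 byte(s)
# U+00E9 -> 2 byte(s)
# U+4E2D -> 3 byte(s)
# U+1F600 -> 4 byte(s)
# U+20BB7 -> 4 byte(s)
```

The last two samples are the kind of character that the legacy utf8 character set cannot store and utf8mb4 can.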

Key Lessons

Do not deviate from established standards without a clear migration path.

Personal assumptions about correctness can cause serious problems.

If your implementation does not match the public standard, it is effectively wrong.

Community Experiences

“We once stored orders containing emoji; the content was truncated after the emoji.”

“MySQL’s utf8 is a trimmed version; utf8mb4 is the real UTF‑8.”

“Our project used MySQL with utf8 for six months, then an ID with a rare character caused errors. Switching to utf8mb4 fixed it.”
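
Problems like these can be caught before the data reaches a legacy utf8 column. Here is a rough sketch of such a pre-flight check, assuming the application can inspect strings before inserting them; the function name `find_supplementary_chars` and the sample order note are hypothetical.

```python
from typing import List, Tuple


def find_supplementary_chars(text: str) -> List[Tuple[int, str]]:
    """Return (index, code point label) for every character above U+FFFF.

    These are the characters that need four bytes in UTF-8 and therefore
    do not fit in MySQL's legacy three-byte utf8 character set.
    """
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ord(ch) > 0xFFFF
    ]


# Hypothetical order note containing one emoji.
order_note = "Leave at door \U0001F600 thanks"
print(find_supplementary_chars(order_note))
# [(14, 'U+1F600')]
```

An empty list means the text is safe for a legacy utf8 column; anything else points to the exact positions that would cause trouble.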

Technical Details

MySQL 8.0 officially recommends utf8mb4. This character set uses up to four bytes per character, so it can store all Unicode characters, including emoji and rare glyphs. The legacy utf8 name is an alias for utf8mb3, which is deprecated in MySQL 8.0. Starting with MySQL 8.0, the default character set changed from latin1 to utf8mb4, and the default collation changed from latin1_swedish_ci to utf8mb4_0900_ai_ci. On 8.0 that default collation is usually a reasonable choice. utf8mb4_unicode_ci and utf8mb4_general_ci remain common, particularly on older versions where utf8mb4_0900_ai_ci is not available; utf8mb4_general_ci trades some sorting accuracy for speed.

Configuration Example

Typical my.ini settings (my.cnf on Linux) to enforce utf8mb4:

[client]
default-character-set=utf8mb4

[mysql]
default-character-set=utf8mb4

[mysqld]
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
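
The server settings above only affect defaults for new objects; existing databases and tables keep their old character set until they are converted. As a rough sketch, the small Python helper below generates the usual conversion statements. The function name `utf8mb4_migration_sql` and the `shop` and `orders` names are hypothetical, identifier quoting is deliberately naive, and the output should be reviewed and tried on a copy of the data first.

```python
from typing import List


def utf8mb4_migration_sql(
    database: str,
    tables: List[str],
    collation: str = "utf8mb4_unicode_ci",
) -> List[str]:
    """Build SQL statements that move one database and its tables to utf8mb4.

    Assumes the names are trusted identifiers; no escaping is performed.
    """
    # Change the default for tables created in this database from now on.
    statements = [
        f"ALTER DATABASE `{database}` CHARACTER SET = utf8mb4 COLLATE = {collation};"
    ]
    # Convert existing tables, including their text columns.
    for table in tables:
        statements.append(
            f"ALTER TABLE `{database}`.`{table}` "
            f"CONVERT TO CHARACTER SET utf8mb4 COLLATE {collation};"
        )
    # Finish by checking what the server and the connection actually use.
    statements.append("SHOW VARIABLES LIKE 'character_set%';")
    return statements


# Hypothetical database and table names.
for stmt in utf8mb4_migration_sql("shop", ["orders"]):
    print(stmt)
```

CONVERT TO CHARACTER SET rewrites the table, so on large tables it is worth scheduling the change rather than running it ad hoc.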
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
