Why MySQL’s “utf8” Isn’t Real UTF‑8 and How utf8mb4 Fixes It

MySQL’s character set named “utf8” is really utf8mb3, a truncated implementation that stores at most three bytes per character and cannot hold the full Unicode range, which leads to bugs with emoji and rare characters. The newer utf8mb4 provides true UTF‑8 support and is the default character set in MySQL 8.0.

ITPUB

Background

A question on Zhihu asked why MySQL does not recommend using its utf8 character set. The confusion stems from MySQL’s historical implementation of utf8 as a three‑byte, limited‑range encoding rather than the full Unicode UTF‑8.

MySQL’s original utf8 (utf8mb3)

When MySQL first added UTF‑8 support, it capped the encoding at three bytes per character, which covers only the Basic Multilingual Plane (BMP, code points U+0000 to U+FFFF). This “utf8” (later renamed utf8mb3) therefore cannot represent characters outside the BMP, which need four bytes in UTF‑8, such as most emoji and many rare CJK glyphs. The limitation caused bugs that were hard to trace, for example when storing order notes containing emoji or rare Chinese characters.

又甜又甜: This bug caused large losses; many orders containing emoji were truncated after the emoji.
风之帆: MySQL’s utf8 is a trimmed version; utf8mb4 is the true UTF‑8.
邵NewBee: We were once stuck because the project used utf8 and could not store replies containing emoji.
精灵福将马国成: After half a year of using utf8, we hit an error with a rare Chinese name; switching to utf8mb4 solved it.
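
The three‑byte limit can be checked on the application side before a string ever reaches the database. The following is a minimal Python sketch, not part of the original article: the function names and the sample order note are hypothetical, and it relies only on the fact that characters above U+FFFF lie outside the BMP.

```python
def fits_in_utf8mb3(text: str) -> bool:
    """True when every character is in the BMP (code point <= U+FFFF)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


def find_non_bmp(text: str) -> list[tuple[int, str, str]]:
    """Return (index, character, code point label) for each character above U+FFFF."""
    return [
        (i, ch, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ord(ch) > 0xFFFF
    ]


# Hypothetical order note containing one emoji (U+1F600).
note = "Leave at the door \U0001F600 thanks"

print(fits_in_utf8mb3(note))  # False
print(find_non_bmp(note))     # one entry, at index 18
```

A check like this only reports which characters a utf8mb3 column would reject or truncate; it does not replace converting the column to utf8mb4.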

Introduction of utf8mb4

In 2010, MySQL 5.5 introduced utf8mb4, which stores up to four bytes per character and fully supports the Unicode standard, including supplementary planes, emoji, mathematical symbols, and other special characters. Starting with MySQL 8.0, utf8mb4 is the default character set: the server default changed from latin1 to utf8mb4, and the default collation became utf8mb4_0900_ai_ci, with utf8mb4_unicode_ci and utf8mb4_general_ci as common alternatives.

SQL garbled characters illustration

Differences Between utf8, utf8mb3, and utf8mb4

In MySQL, utf8 is an alias for utf8mb3: both store up to three bytes per character and cover only the BMP (U+0000 to U+FFFF), which holds most characters in everyday use but only a small part of the Unicode code space. utf8mb4 stores up to four bytes per character and covers the entire Unicode range, including emoji and supplementary characters. The chart below shows the byte‑size and character‑range differences.

utf8 vs utf8mb4 comparison chart
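
The byte‑size difference can also be reproduced directly. This short Python sketch (an illustration added here, with a hypothetical helper name and arbitrarily chosen sample characters) prints the UTF‑8 length of a few characters and whether a three‑byte character set could hold them.

```python
def describe(ch: str) -> tuple[str, int, bool]:
    """Return (code point label, UTF-8 byte length, storable in utf8mb3)."""
    size = len(ch.encode("utf-8"))
    return (f"U+{ord(ch):04X}", size, size <= 3)


# Samples: ASCII letter, accented Latin letter, common CJK character,
# an emoji, and a rare CJK character from a supplementary plane.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600", "\U00020BB7"]

for ch in samples:
    label, size, ok = describe(ch)
    print(f"{label:>8}  {size} byte(s)  utf8mb3: {'yes' if ok else 'no'}  utf8mb4: yes")
```

The first three samples need one, two, and three bytes and fit in either character set; the last two need four bytes and fit only in utf8mb4.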

Practical Recommendations

For new projects, always use utf8mb4 to ensure full Unicode compatibility. If you only need BMP characters, the older utf8 may be sufficient, but it is on its way out: utf8mb3 is deprecated in MySQL 8.0 and is expected to be removed in a future release, and the utf8 alias is expected to eventually refer to utf8mb4 instead. Explicitly specifying utf8mb4 in DDL and connection settings avoids this ambiguity.
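
As a rough illustration of what “explicit” means in practice, the Python sketch below builds the relevant MySQL statements as strings. The helper names, the table name, and the columns are hypothetical; the collation assumes MySQL 8.0 or later (on older servers, utf8mb4_unicode_ci is a common substitute). Table names are interpolated without escaping, so this is suitable only for trusted, hard‑coded identifiers.

```python
CHARSET = "utf8mb4"
COLLATION = "utf8mb4_0900_ai_ci"  # MySQL 8.0 default collation for utf8mb4


def create_table_sql(table: str) -> str:
    """DDL for a new table with the character set and collation spelled out."""
    return (
        f"CREATE TABLE `{table}` (\n"
        "  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,\n"
        "  note VARCHAR(255) NOT NULL\n"
        f") DEFAULT CHARACTER SET {CHARSET} COLLATE {COLLATION};"
    )


def convert_table_sql(table: str) -> str:
    """Statement that converts an existing table and its text columns to utf8mb4."""
    return f"ALTER TABLE `{table}` CONVERT TO CHARACTER SET {CHARSET} COLLATE {COLLATION};"


def session_sql() -> str:
    """Statement that sets the connection character set for the current session."""
    return f"SET NAMES {CHARSET} COLLATE {COLLATION};"


print(create_table_sql("orders"))   # "orders" is a hypothetical table name
print(convert_table_sql("orders"))
print(session_sql())
```

Converting a large existing table rewrites its data, so it is worth testing on a copy first. Most client drivers also accept a charset option at connect time, which achieves the same effect as SET NAMES.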
