
Why MySQL’s “utf8” Isn’t Real UTF‑8 and How utf8mb4 Fixes It

MySQL’s legacy “utf8” charset stores at most three bytes per character, so inserts of true UTF‑8 data that contains four‑byte characters fail. This article explains where that limit came from, how the “utf8mb4” charset removes it, and where to find migration guidance.

Liangxu Linux

When a Rails application tries to store a UTF‑8 string (e.g., “😃 …”) in a MariaDB column that uses the “utf8” charset, the insert fails with the error:

Incorrect string value: '😃 <…' for column 'summary' at row 1

The root cause is that MySQL’s legacy “utf8” charset is not a true UTF‑8 implementation. It stores at most three bytes per character, which covers only code points up to U+FFFF, so four‑byte characters such as most emoji cannot be stored.
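The three‑byte ceiling is easy to see by encoding a few characters and counting bytes. The following is a minimal sketch rather than anything MySQL itself runs: the helper names are hypothetical, and it assumes the only constraint that matters is the number of bytes each character needs in UTF‑8.

```python
def needs_four_bytes(ch: str) -> bool:
    """Return True if this single character encodes to four bytes in UTF-8."""
    return len(ch.encode("utf-8")) == 4


def fits_legacy_utf8(text: str) -> bool:
    """Return True if every character fits the three-byte limit of legacy "utf8".

    Hypothetical helper for illustration; it only models the byte-length limit.
    """
    return not any(needs_four_bytes(ch) for ch in text)


# Samples written as escapes so the code points are unambiguous:
# U+00E9 (e with acute), U+4E2D (a CJK ideograph), U+1F603 (the emoji above).
for sample in ["hello", "\u00e9", "\u4e2d", "\U0001F603"]:
    encoded = sample.encode("utf-8")
    print(repr(sample), len(encoded), fits_legacy_utf8(sample))
```

A check like this could flag problem input before an insert, but it is only a diagnostic; the column still cannot hold the text until its charset changes.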

Why the “utf8” charset is limited

MySQL originally followed the earlier UTF‑8 specification (RFC 2279), which allowed up to six bytes per character. In 2002 the developers trimmed the implementation to a maximum of three bytes per character, effectively creating a MySQL‑specific “utf8” that excludes every code point above U+FFFF.

Because the charset’s name suggests full UTF‑8 support, countless tutorials and articles still recommend using “utf8”, and developers who follow them run into hard‑to‑diagnose “Incorrect string value” errors.

The proper solution: utf8mb4

In 2010 MySQL introduced the utf8mb4 charset (in version 5.5.3), which stores up to four bytes per character and therefore covers the full Unicode range. It matches the UTF‑8 that other software reads and writes. All MySQL and MariaDB users should migrate their databases from “utf8” to “utf8mb4”.
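To make the difference between the two limits concrete, the sketch below maps a code point to the number of bytes UTF‑8 needs for it, using the ranges from RFC 3629. The function and constant names are assumptions made for this example, and the three‑ and four‑byte limits are simply the maximums described above for “utf8” and “utf8mb4”.

```python
# Highest code point that fits in a UTF-8 sequence of each length (RFC 3629).
MAX_CODE_POINT = {1: 0x7F, 2: 0x7FF, 3: 0xFFFF, 4: 0x10FFFF}

# Maximum bytes per character for each charset, as described in the text.
CHARSET_MAX_BYTES = {"utf8": 3, "utf8mb4": 4}


def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for this code point."""
    if code_point < 0:
        raise ValueError("code points are non-negative")
    for length, limit in MAX_CODE_POINT.items():
        if code_point <= limit:
            return length
    raise ValueError("beyond the Unicode range (U+10FFFF)")


def storable(charset: str, code_point: int) -> bool:
    """Return True if the charset's byte limit can hold this code point."""
    return utf8_length(code_point) <= CHARSET_MAX_BYTES[charset]


# Cross-check the table against Python's own encoder for a few code points.
for cp in (0x41, 0xE9, 0x4E2D, 0xFFFF, 0x10000, 0x1F603):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))

print(storable("utf8", 0x1F603), storable("utf8mb4", 0x1F603))
```

Everything from U+10000 upward falls in the four‑byte row, which is exactly the range the legacy charset leaves out.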

Historical background

MySQL added UTF‑8 support in version 4.1 (2003), before the current RFC 3629 standard was published. The early implementation followed the older RFC 2279 definition, and a code change made in September 2002, while 4.1 was still in development, limited the charset to three bytes. The exact motivation for this change is unclear, but it appears to have been driven by performance considerations for fixed‑length CHAR columns.

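The apparent intent was to help users who define fixed‑length CHAR columns for space and speed, since such a column typically has to reserve the maximum number of bytes for every character. In practice the restriction served neither group: “utf8” CHAR columns still took more space and ran slower than those users expected, while anyone who needed correct UTF‑8 handling got a charset that cannot store the full Unicode range.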

Migration guidance

To convert an existing database, follow the guide at https://mathiasbynens.be/notes/mysql-utf8mb4#utf8-to-utf8mb4, which details the necessary ALTER TABLE statements and configuration changes.
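As a rough illustration of what such a conversion involves, the sketch below builds the kind of ALTER statements that move a database and its tables to utf8mb4. It is not a replacement for the guide: the function name, the database and table names, and the choice of the utf8mb4_unicode_ci collation are all assumptions for this example, and the script only produces SQL text without connecting to a server.

```python
def quote_identifier(name: str) -> str:
    """Quote a MySQL identifier with backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def utf8mb4_migration_sql(database, tables, collation="utf8mb4_unicode_ci"):
    """Build ALTER statements for a utf8mb4 conversion (hypothetical helper).

    The database default is changed first, then each table is converted.
    """
    statements = [
        f"ALTER DATABASE {quote_identifier(database)} "
        f"CHARACTER SET = utf8mb4 COLLATE = {collation};"
    ]
    for table in tables:
        statements.append(
            f"ALTER TABLE {quote_identifier(table)} "
            f"CONVERT TO CHARACTER SET utf8mb4 COLLATE {collation};"
        )
    return statements


# "app_db", "articles" and "comments" are placeholder names.
for statement in utf8mb4_migration_sql("app_db", ["articles", "comments"]):
    print(statement)
```

Statements like these rewrite table data, so take a backup and review the output before running it. The client connection and server configuration also need to use utf8mb4, which is the part the guide covers in more detail.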

In summary, never use MySQL’s “utf8” charset; always prefer “utf8mb4” for correct Unicode support.
