Why MySQL’s “utf8” Isn’t Real UTF‑8 and How utf8mb4 Fixes It
Discover why MySQL’s legacy “utf8” charset stores at most three bytes per character, causing storage errors for true UTF‑8 data, and learn how the “utf8mb4” charset resolves these issues, with historical context, technical details, and migration guidance.
Storing a UTF‑8 string that contains an emoji (e.g., “😃 …”) in a MariaDB database from a Rails application fails with the error:
Incorrect string value: ‘😃 <…’ for column ‘summary’ at row 1
The root cause is that MySQL’s legacy “utf8” charset is not a true UTF‑8 implementation; it supports at most three bytes per character, so four‑byte characters such as many emojis cannot be stored.
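As a rough illustration of which input triggers the error, the sketch below flags characters that need four bytes in UTF‑8, that is, code points above U+FFFF. The function name chars_outside_utf8mb3 and the sample string are hypothetical and not part of Rails, MySQL, or MariaDB; this is only a pre-flight check one might run in application code.

```python
def chars_outside_utf8mb3(text: str) -> list[str]:
    """Return the characters in `text` that need four bytes in UTF-8."""
    # Code points above U+FFFF encode to four bytes in UTF-8, which the
    # legacy three-byte "utf8" charset cannot hold.
    return [ch for ch in text if ord(ch) > 0xFFFF]


# Hypothetical sample value for a "summary" column.
summary = "Release notes 😃"

print(chars_outside_utf8mb3(summary))  # characters a legacy "utf8" column would reject
print(len("😃".encode("utf-8")))       # byte length of the emoji in UTF-8
```

Text made up only of characters such as “é” or “€” passes this check, which is why the problem often goes unnoticed until the first emoji arrives.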
Why the “utf8” charset is limited
MySQL originally followed the older UTF‑8 specification, RFC 2279, which allowed up to six bytes per character. In 2002 the developers trimmed the implementation to a maximum of three bytes per character, effectively creating a proprietary “utf8” that excludes every Unicode code point above U+FFFF.
Because the documentation incorrectly claimed full UTF‑8 support, countless tutorials and articles still recommend using “utf8”, leading developers to encounter mysterious “Incorrect string value” errors.
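To make the three‑byte ceiling concrete, here is a minimal sketch of how many bytes standard UTF‑8 (RFC 3629) needs per code point. The helper utf8_length is written for illustration only and is not a MySQL or Python API; the loop cross-checks it against Python's built-in encoder.

```python
def utf8_length(code_point: int) -> int:
    """Bytes that UTF-8 (RFC 3629) uses for one code point.

    Surrogate code points (U+D800..U+DFFF) are not treated specially here.
    """
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3  # the largest sequence the legacy "utf8" charset accepts
    if code_point <= 0x10FFFF:
        return 4  # requires utf8mb4
    raise ValueError("code point is beyond the Unicode range")


for ch in ("A", "é", "€", "😃"):
    # Cross-check the table above against Python's own encoder.
    assert utf8_length(ord(ch)) == len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} needs {utf8_length(ord(ch))} byte(s)")
```

Everything up to U+FFFF fits in three bytes, so the legacy charset handles most everyday text and fails only on code points such as U+1F603 (😃).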
The proper solution: utf8mb4
In 2010 MySQL introduced the utf8mb4 charset, which implements the full Unicode range (up to four bytes per character) and matches the UTF‑8 that other software produces and expects. All MySQL and MariaDB users should migrate their databases from “utf8” to “utf8mb4”.
Historical background
MySQL added UTF‑8 support in version 4.1 (2003), before the current RFC 3629 standard was published. The first implementation followed the older RFC 2279 definition, and a code change in September 2002, made while 4.1 was still in development, limited the charset to three bytes per character. The exact motivation for this change is unclear, but it appears to have been driven by performance considerations for fixed‑length CHAR columns.
The developers apparently wanted to help users who chose fixed‑length CHAR columns for space and speed optimizations. The restriction broke true UTF‑8 handling, however, and “utf8” CHAR columns still ended up larger and slower than they needed to be.
Migration guidance
To convert an existing database, follow the guide at https://mathiasbynens.be/notes/mysql-utf8mb4#utf8-to-utf8mb4, which details the necessary ALTER TABLE statements and configuration changes.
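As a hedged sketch of what the conversion involves, the snippet below only builds the SQL text for the database-level and table-level changes; it does not connect to a server. The function name, the database name blog_production, the table names, and the utf8mb4_unicode_ci collation default are assumptions for illustration. Take a backup first and treat the linked guide as the authority on the remaining steps, such as index length limits and connection settings.

```python
def utf8mb4_migration_sql(database: str, tables: list[str],
                          collation: str = "utf8mb4_unicode_ci") -> list[str]:
    """Build ALTER statements that switch a database and its tables to utf8mb4."""
    for name in (database, *tables):
        # This sketch quotes identifiers naively, so reject embedded backticks.
        if "`" in name:
            raise ValueError(f"unsupported identifier: {name!r}")

    # Change the default for tables created in this database from now on.
    statements = [
        f"ALTER DATABASE `{database}` CHARACTER SET = utf8mb4 COLLATE = {collation};"
    ]
    # Convert each existing table, including its character columns.
    for table in tables:
        statements.append(
            f"ALTER TABLE `{table}` CONVERT TO CHARACTER SET utf8mb4 COLLATE {collation};"
        )
    return statements


# Hypothetical schema names, for illustration only.
for statement in utf8mb4_migration_sql("blog_production", ["posts", "comments"]):
    print(statement)
```

Review the generated statements before running them; the client connection must also use utf8mb4, otherwise four‑byte characters can still be rejected or mangled on the way in.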
In summary, never use MySQL’s “utf8” charset; always prefer “utf8mb4” for correct Unicode support.
Liangxu Linux
Liangxu, a self‑taught IT professional now working as a Linux development engineer at a Fortune 500 multinational, shares extensive Linux knowledge—fundamentals, applications, tools, plus Git, databases, Raspberry Pi, etc. (Reply “Linux” to receive essential resources.)
