Never Use MySQL "utf8" – Switch to "utf8mb4" for Real UTF‑8 Support
The article explains why MySQL's legacy "utf8" charset only supports three‑byte characters, causing storage errors for true four‑byte UTF‑8 symbols, and advises developers to migrate all MySQL/MariaDB databases to the proper "utf8mb4" charset.
While trying to store a UTF‑8 string in a Rails application backed by a MariaDB database configured with the "utf8" charset, the author encountered the error Incorrect string value: '\xF0\x9F\x98\x83 <…>' for column 'summary' at row 1, despite using UTF‑8 everywhere.
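The four bytes quoted in that error message can be decoded to see which character the database rejected. The following is a minimal sketch in Python, used here only for illustration (the original application was written in Rails); the variable names are made up for the example.

```python
import unicodedata

# The four bytes quoted in the error message.
rejected = b"\xF0\x9F\x98\x83"

# Decode them as UTF-8 and inspect the resulting character.
char = rejected.decode("utf-8")
code_point = ord(char)

print(f"code point: U+{code_point:04X}")
print(f"name: {unicodedata.name(char)}")
print(f"UTF-8 length: {len(rejected)} bytes")
```

The sequence is valid UTF‑8 for a single emoji, U+1F603, and it is four bytes long, which is exactly what the legacy "utf8" charset cannot hold.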
What Is UTF‑8 and Why MySQL’s "utf8" Is Not Real UTF‑8
MySQL’s "utf8" charset stores at most three bytes per character, whereas the official UTF‑8 standard (RFC 3629) uses up to four bytes per character. Three bytes are only enough for code points up to U+FFFF, the Basic Multilingual Plane (BMP). Consequently, characters outside the BMP (e.g., many emojis) cannot be stored with "utf8".
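One way to check whether a given string would be rejected by a "utf8" column is to look for code points above U+FFFF. The helper below is a hypothetical sketch, not part of MySQL or any library; it only reports which characters fall outside the BMP.

```python
from typing import List


def chars_needing_utf8mb4(text: str) -> List[str]:
    """Return the characters in text that lie outside the BMP.

    These are the code points above U+FFFF, which take four bytes in
    UTF-8 and therefore do not fit MySQL's three-byte "utf8" charset.
    """
    return [ch for ch in text if ord(ch) > 0xFFFF]


# Accented letters and the euro sign are inside the BMP, so they are fine.
print(chars_needing_utf8mb4("héllo €"))

# An emoji is outside the BMP and is reported.
print(chars_needing_utf8mb4("hi \U0001F603"))
```

A check like this could be run over existing application data before a migration to estimate how much content has already been affected.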
In 2010 MySQL introduced the "utf8mb4" charset, which implements the full UTF‑8 range and resolves the bug.
Encoding Basics
Computers store text as binary data; a character like "C" is stored as the byte 01000011. To display it, the computer reads that byte as the number 67, which is the Unicode code point of "C", and then looks up the corresponding glyph.
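The mapping between the character, its code point, and its bits can be reproduced in a few lines. This is an illustrative Python sketch with made-up variable names:

```python
char = "C"

# Character -> Unicode code point.
code_point = ord(char)

# Code point -> eight-bit binary string.
bits = format(code_point, "08b")

# Character -> bytes as stored in UTF-8 (a single byte for ASCII).
raw = char.encode("utf-8")

print(code_point, bits, raw)

# And back again: bits -> code point -> character.
print(chr(int(bits, 2)))
```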
Unicode has room for more than a million code points. UTF‑32 spends a fixed 32 bits (four bytes) on every character, which is wasteful for text that is mostly ASCII. UTF‑8 is more space‑efficient: common ASCII characters use one byte, while less common characters use two to four bytes.
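The size difference is easy to measure. The sketch below compares the encoded length of a few sample characters; the sample list and the helper name encoded_sizes are chosen for this example only.

```python
from typing import Tuple


def encoded_sizes(ch: str) -> Tuple[int, int]:
    """Return (UTF-8 byte count, UTF-32 byte count) for one character."""
    # "utf-32-le" is used so that no byte-order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-32-le"))


samples = ["C", "é", "€", "\U0001F603"]

for ch in samples:
    utf8_len, utf32_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8 = {utf8_len} bytes, UTF-32 = {utf32_len} bytes")
```

The four samples take one, two, three, and four bytes in UTF‑8 respectively, while each takes four bytes in UTF‑32. Only the last one exceeds what MySQL's "utf8" can store.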
MySQL’s Historical Decisions
MySQL added UTF‑8 support in version 4.1 (2003) based on the older RFC 2279, which allowed up to six bytes per character. Later that year the developers limited the charset to three‑byte sequences for performance reasons, encouraging the use of fixed‑length CHAR columns.
This change was not well documented, leading many developers to believe that MySQL’s "utf8" was the true UTF‑8 implementation.
The restriction caused two problems: (1) space and speed gains were not realized because CHAR columns still stored variable‑length data, and (2) users could not store characters that require four bytes, such as many emojis.
Because fixing the charset would have required a massive migration, MySQL kept the broken "utf8" until the introduction of "utf8mb4" in 2010.
Why This Matters
The author spent a week troubleshooting the issue, only to discover that the root cause was the misuse of the legacy charset. The same mistake appears in countless online articles that incorrectly recommend "utf8".
Conclusion
If you are using MySQL or MariaDB, stop using the "utf8" charset and migrate all tables to "utf8mb4". A detailed migration guide is available at https://mathiasbynens.be/notes/mysql-utf8mb4#utf8-to-utf8mb4
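As a rough idea of what such a migration involves, the sketch below generates the ALTER statements for a database and its tables. It is an illustration under stated assumptions, not a complete procedure: the function names, the database name "blog", the table name "articles", and the choice of the utf8mb4_unicode_ci collation are all hypothetical. The script only builds SQL text and does not connect to a server. Take a backup first, and check index length limits and the connection charset as described in the guide.

```python
from typing import Iterable, List


def quote_identifier(name: str) -> str:
    """Quote a MySQL identifier with backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def utf8mb4_migration_sql(
    database: str,
    tables: Iterable[str],
    collation: str = "utf8mb4_unicode_ci",  # assumed default; pick what suits your data
) -> List[str]:
    """Build ALTER statements that switch a database and its tables to utf8mb4."""
    # The collation is interpolated unquoted, so accept only simple names.
    if not collation.replace("_", "").isalnum():
        raise ValueError(f"unexpected collation name: {collation!r}")

    statements = [
        f"ALTER DATABASE {quote_identifier(database)} "
        f"CHARACTER SET = utf8mb4 COLLATE = {collation};"
    ]
    for table in tables:
        statements.append(
            f"ALTER TABLE {quote_identifier(table)} "
            f"CONVERT TO CHARACTER SET utf8mb4 COLLATE {collation};"
        )
    return statements


# Hypothetical database and table names, for illustration only.
for statement in utf8mb4_migration_sql("blog", ["articles"]):
    print(statement)
```

Generating the statements as text makes it possible to review them before running anything against a production database.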
Java Captain
Focused on Java technologies: SSM, the Spring ecosystem, microservices, MySQL, MyCat, clustering, distributed systems, middleware, Linux, networking, multithreading; occasionally covers DevOps tools like Jenkins, Nexus, Docker, ELK; shares practical tech insights and is dedicated to full‑stack Java development.