Why MySQL’s “utf8” Isn’t Real UTF‑8 and How to Switch to utf8mb4
The article explains that MySQL’s legacy “utf8” charset supports at most three bytes per character, causing errors when storing true four‑byte UTF‑8 symbols like emojis, and shows why switching to the proper “utf8mb4” charset is essential for correct Unicode handling.
When trying to store the emoji "😃" in a MariaDB table declared with the "utf8" charset, Rails raises the following error, even though the client, server and database are all set to UTF‑8:
Incorrect string value: '😃 …' for column 'summary' at row 1
The root cause is that MySQL’s historic "utf8" charset is limited to three bytes per character, so it cannot represent characters that require four bytes in true UTF‑8. MySQL never fixed this limitation; instead, in 2010 it introduced the "utf8mb4" charset, which fully supports the Unicode standard.
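A quick way to see which strings are affected is to look for code points above U+FFFF, since those are the ones that take four bytes in UTF‑8. The following is a minimal Python sketch; the function name needs_utf8mb4 is hypothetical and used only for illustration.

```python
def needs_utf8mb4(text: str) -> bool:
    """Return True if any character in text takes four bytes in UTF-8."""
    # Code points above U+FFFF lie outside the Basic Multilingual Plane
    # and encode to four bytes, which the legacy "utf8" charset rejects.
    return any(ord(ch) > 0xFFFF for ch in text)


print(len("😃".encode("utf-8")))    # 4
print(needs_utf8mb4("😃"))          # True
print(needs_utf8mb4("café, 漢字"))  # False: every character fits in three bytes
```

A check like this could be run over application input before it reaches a "utf8" column, although migrating the column is the actual fix.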
All MySQL and MariaDB users are therefore advised to abandon the legacy "utf8" charset and migrate every database, table, and column to "utf8mb4" to avoid data loss and storage inefficiencies.
What Is an Encoding?
Computers store text as binary numbers; for example, the character "C" is stored as the byte 01000011. Unicode is a character set that assigns a number (a code point) to each character, and an encoding (UTF‑8, UTF‑32, etc.) determines how those numbers are written as bytes and how many bytes each character occupies.
Unicode has room for more than a million code points. UTF‑32 uses a fixed 32‑bit (four‑byte) representation for every character, which is simple but wasteful. UTF‑8 is variable‑length: ASCII characters use one byte, while other characters use two, three, or four bytes, saving up to three‑quarters of the space compared to UTF‑32 for text that is mostly ASCII.
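The size difference can be checked directly. The sketch below uses Python's built-in codecs; the helper name encoded_sizes is an assumption made for this example, not something from the original article.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-32 byte count) for text."""
    # "utf-32-be" is used so that no byte-order mark is added to the count.
    return len(text.encode("utf-8")), len(text.encode("utf-32-be"))


print(format(ord("C"), "08b"))  # 01000011, the byte shown above

for ch in ["C", "é", "漢", "😃"]:
    utf8_len, utf32_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} byte(s), UTF-32 {utf32_len} bytes")

print(encoded_sizes("hello"))  # (5, 20): UTF-8 is a quarter of the size
```

The four sample characters take one, two, three, and four bytes in UTF‑8 respectively, and always four bytes in UTF‑32. Only the last one is out of reach for MySQL's "utf8".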
MySQL’s UTF‑8 History
MySQL added UTF‑8 support during the development of version 4.1 (first released in 2003), based on the old RFC 2279, which allowed up to six bytes per character. In September 2002, before that release, the developers changed the implementation to limit UTF‑8 sequences to a maximum of three bytes, effectively creating a proprietary charset that could not store many modern Unicode symbols.
The change was not well documented, and the commit history is obscure because the project switched from BitKeeper to Git, losing many author names. The motivation appears to have been performance: by forcing CHAR columns to a fixed byte width, MySQL could store and compare rows more quickly, assuming users would pad or truncate data to fit the defined length.
Unfortunately, this decision broke compatibility with the true UTF‑8 standard. Users who chose CHAR columns with "utf8" for speed and compactness ended up with columns that were larger and slower than variable‑length alternatives, while those needing full Unicode support could not store characters such as emojis.
Why It Matters
The proprietary "utf8" charset caused widespread confusion because most documentation and tutorials incorrectly presented it as genuine UTF‑8. As a result, many developers experienced baffling errors when their applications attempted to save four‑byte characters.
Since the charset cannot be corrected without rebuilding every affected database, MySQL introduced "utf8mb4" in 2010 to provide a proper UTF‑8 implementation.
How to Migrate
To convert an existing database from "utf8" to "utf8mb4", follow the migration guide at https://mathiasbynens.be/notes/mysql-utf8mb4#utf8-to-utf8mb4, which details the necessary ALTER TABLE statements and configuration changes.
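As a rough sketch of what such a migration involves, the Python helper below only builds the SQL text for the database-level and table-level conversions; it does not connect to a server. The function name migration_statements, the database name "blog", the table names, and the choice of the utf8mb4_unicode_ci collation are all assumptions for illustration. A real migration should also follow the guide's steps for backups, column and index length checks, and client and server configuration.

```python
def migration_statements(database, tables, collation="utf8mb4_unicode_ci"):
    """Build ALTER statements that move a database and its tables to utf8mb4."""

    def quote(identifier):
        # Backtick-quote an identifier, doubling any embedded backticks.
        return "`" + identifier.replace("`", "``") + "`"

    # Change the default charset used for tables created in the future.
    statements = [
        f"ALTER DATABASE {quote(database)} "
        f"CHARACTER SET = utf8mb4 COLLATE = {collation};"
    ]
    # Convert each existing table, including its text columns.
    for table in tables:
        statements.append(
            f"ALTER TABLE {quote(table)} "
            f"CONVERT TO CHARACTER SET utf8mb4 COLLATE {collation};"
        )
    return statements


# Hypothetical database and table names.
for statement in migration_statements("blog", ["posts", "comments"]):
    print(statement)
```

Generating the statements first, rather than running them immediately, makes it possible to review them and apply them one table at a time.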
