Why MySQL’s “utf8” Isn’t Real UTF‑8 and How to Switch to utf8mb4
This article explains why MySQL’s legacy "utf8" charset only supports three‑byte characters, why it’s not true UTF‑8, and provides a clear guide to migrating databases to the proper "utf8mb4" charset for full Unicode support.
Recently I encountered a bug when trying to save a UTF‑8 string in a MariaDB database using Rails with the "utf8" charset, receiving an error like:
Incorrect string value: ‘\xF9\x98\x83 …’ for column ‘summary’ at row 1
Even though the client, server, and database were all set to UTF‑8, the insert failed. MySQL’s "utf8" charset is not true UTF‑8: it supports at most three bytes per character, while real UTF‑8 allows up to four.
MySQL never fixed this limitation in "utf8" itself. Instead, in 2010 (MySQL 5.5.3) it introduced a separate charset, "utf8mb4", which fully implements UTF‑8.
MySQL’s "utf8mb4" is genuine UTF‑8.
MySQL’s "utf8" is a proprietary, limited‑capacity charset.
All MySQL and MariaDB users should stop using "utf8" and switch to "utf8mb4".
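To make the three‑byte limit concrete, here is a minimal Python sketch that checks whether a string contains characters MySQL’s "utf8" cannot store. The function names are hypothetical and used only for illustration; the check relies on the fact that code points above U+FFFF take four bytes in UTF‑8.

```python
from typing import List


def fits_in_mysql_utf8(text: str) -> bool:
    """Return True if every character needs at most three bytes in UTF-8.

    Code points above U+FFFF take four bytes in UTF-8, which is more than
    MySQL's legacy "utf8" charset accepts.
    """
    return all(ord(ch) <= 0xFFFF for ch in text)


def four_byte_characters(text: str) -> List[str]:
    """List the characters that need four bytes in UTF-8."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


if __name__ == "__main__":
    emoji = "\U0001F603"  # a character outside the three-byte range
    print(len(emoji.encode("utf-8")))            # 4
    print(fits_in_mysql_utf8("caf\u00e9"))       # True
    print(fits_in_mysql_utf8("hello " + emoji))  # False
```

A check like this can be run over incoming text to find out in advance which values a "utf8" column would reject.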
What Is Encoding? What Is UTF‑8?
Computers store text as binary. For example, the character "C" is stored as 01000011. The computer reads those bits as the number 67, looks up 67 in Unicode, and displays "C".
Unicode has room for over a million code points. UTF‑32 uses 32 bits (four bytes) for every character, which is simple but wasteful. UTF‑8 saves space: common characters like "C" use 1 byte, while less common characters use 2 to 4 bytes, so text made up mostly of common characters takes about a quarter of the space UTF‑32 would need.
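The size difference is easy to verify with Python’s built‑in codecs. This is only an illustrative sketch; the helper name `describe` is made up for the example.

```python
def describe(ch: str) -> dict:
    """Summarize how a single character is encoded in UTF-8 and UTF-32."""
    utf8 = ch.encode("utf-8")
    return {
        "code_point": ord(ch),
        "utf8_bytes": len(utf8),
        # "utf-32-be" is used so that no byte-order mark is prepended.
        "utf32_bytes": len(ch.encode("utf-32-be")),
        "utf8_bits": " ".join(format(b, "08b") for b in utf8),
    }


if __name__ == "__main__":
    for ch in ["C", "\u00e9", "\u20ac", "\U0001F603"]:
        print(repr(ch), describe(ch))
```

For "C" this reports code point 67, one UTF‑8 byte with the bits 01000011, and four UTF‑32 bytes. The last sample character needs four bytes in both encodings, which is exactly the case MySQL’s "utf8" rejects.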
MySQL History
MySQL first shipped UTF‑8 support in version 4.1 (2003). The UTF‑8 standard of the time (RFC 2279) allowed up to six bytes per character, but in September 2002, while 4.1 was still in development, MySQL limited its "utf8" to a maximum of three‑byte sequences, deviating from the standard.
The reason appears to be performance: early MySQL encouraged defining text columns as CHAR with fixed byte lengths, padding or truncating as needed. The developers initially allowed six bytes per character, but later cut this to three so that CHAR columns would reserve less space.
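The trade‑off can be sketched with simple arithmetic. The example below assumes that a fixed‑width CHAR(n) column reserves n times the charset’s maximum bytes per character; that is an assumption about the storage layout of the time, and `reserved_bytes` is a hypothetical helper, not a MySQL function.

```python
def reserved_bytes(char_length: int, max_bytes_per_char: int) -> int:
    """Bytes reserved per value, assuming fixed-width CHAR storage."""
    return char_length * max_bytes_per_char


if __name__ == "__main__":
    # CHAR(10) under three different per-character limits.
    print(reserved_bytes(10, 6))  # 60 with the original six-byte model
    print(reserved_bytes(10, 3))  # 30 with MySQL's three-byte "utf8"
    print(reserved_bytes(10, 4))  # 40 with four-byte "utf8mb4"
```

Under this assumption, halving the per‑character limit halves the space every CHAR value occupies, which would explain the choice.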
Because the proprietary "utf8" charset was already documented and widely referenced, many developers continued to use it, unaware that it could not store true four‑byte Unicode characters.
Why This Is Frustrating
The limitation caused weeks of wasted debugging time for developers who believed "utf8" was real UTF‑8. The charset’s incompatibility introduces hidden bugs and performance penalties.
Summary
If you are using MySQL or MariaDB, stop using the "utf8" charset and migrate to "utf8mb4". A migration guide is available at https://mathiasbynens.be/notes/mysql-utf8mb4#utf8-to-utf8mb4
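As a rough sketch of what the core of such a migration looks like, the Python helper below builds the ALTER statements for a database and its tables. The helper name, the example database and table names, and the choice of the utf8mb4_unicode_ci collation are assumptions made for illustration; the script only prints SQL and does not connect to a server. Take a backup and review the statements before running them.

```python
from typing import List


def quote_identifier(name: str) -> str:
    """Quote a MySQL identifier with backticks, doubling embedded ones."""
    return "`" + name.replace("`", "``") + "`"


def utf8mb4_migration_sql(
    database: str,
    tables: List[str],
    collation: str = "utf8mb4_unicode_ci",  # assumed default for this sketch
) -> List[str]:
    """Build statements that switch a database and its tables to utf8mb4."""
    statements = [
        f"ALTER DATABASE {quote_identifier(database)} "
        f"CHARACTER SET = utf8mb4 COLLATE = {collation};"
    ]
    for table in tables:
        # CONVERT TO also changes the charset of existing text columns.
        statements.append(
            f"ALTER TABLE {quote_identifier(table)} "
            f"CONVERT TO CHARACTER SET utf8mb4 COLLATE {collation};"
        )
    return statements


if __name__ == "__main__":
    # "blog", "posts" and "comments" are placeholder names.
    for statement in utf8mb4_migration_sql("blog", ["posts", "comments"]):
        print(statement)
```

The client connection has to use utf8mb4 as well, otherwise four‑byte characters can still be rejected or mangled on the way in. In a Rails application this is typically the encoding setting in database.yml.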
English original: https://medium.com/@adamhooper/in-mysql-never-use-utf8-use-utf8mb4-11761243e434
Original author: Adam Hooper
Translation author: Wu Ming
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
