Why MySQL’s “utf8” Isn’t Real UTF‑8 and How utf8mb4 Fixes It
The article explains that MySQL’s legacy "utf8" charset only supports three‑byte sequences, causing errors with true four‑byte Unicode characters, and shows how the later "utf8mb4" charset provides proper UTF‑8 support along with historical context and migration guidance.
Background and the bug
A Rails application tried to insert a UTF‑8 string containing a four‑byte emoji into a MariaDB column declared with the utf8 charset and received the error:
Incorrect string value: '\xF0\x9F\x98\x83 …' for column 'summary' at row 1
Both the client and the server were configured for UTF‑8, but MySQL’s historic utf8 charset only supports characters up to three bytes, so the four‑byte emoji could not be stored.
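The bytes quoted in the error message can be decoded by hand to see which character the server refused. The following is a minimal sketch in Python, used here purely for illustration (the original application was written in Rails); the variable names are not from the source.

```python
import unicodedata

# The four bytes quoted in the "Incorrect string value" error.
raw = b"\xF0\x9F\x98\x83"

ch = raw.decode("utf-8")             # decodes to a single character
code_point = f"U+{ord(ch):04X}"      # its Unicode code point as text
name = unicodedata.name(ch)          # its official Unicode name

# One character, four bytes, and a code point above U+FFFF.
print(ch, code_point, name, f"{len(raw)} bytes")
```

The code point lies above U+FFFF, which is exactly the range the legacy utf8 charset cannot store.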
What UTF‑8 is
Unicode assigns a unique code point to every character. UTF‑8 encodes each code point using a variable number of bytes: one byte for the ASCII range, two or three bytes for the rest of the Basic Multilingual Plane (BMP), and four bytes for code points outside the BMP (e.g., most emoji). For ASCII‑heavy text this is far more space‑efficient than a fixed‑width encoding such as UTF‑32, which always uses four bytes per code point.
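The variable width is easy to observe directly. The sketch below, with sample characters chosen only for illustration, prints how many bytes UTF‑8 needs for one character from each length class.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 uses to encode a single character."""
    return len(ch.encode("utf-8"))

# One sample per UTF-8 length class (illustrative choices).
samples = {
    "A": "ASCII letter",
    "\u00e9": "Latin small letter e with acute",
    "\u20ac": "euro sign, still inside the BMP",
    "\U0001F603": "emoji, outside the BMP",
}

for ch, description in samples.items():
    utf32_bytes = len(ch.encode("utf-32-le"))  # fixed width, no byte-order mark
    print(f"U+{ord(ch):04X} ({description}): "
          f"{utf8_length(ch)} byte(s) in UTF-8, {utf32_bytes} in UTF-32")
```

Only the last sample needs four bytes, and it is the only one the legacy utf8 charset cannot hold.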
MySQL’s “utf8” limitation
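The MySQL‑specific utf8 charset (also available under the alias utf8mb3) encodes only code points that fit in at most three bytes, that is, the Basic Multilingual Plane (U+0000 to U+FFFF). It cannot represent any code point that requires four bytes, so many modern characters, including most emoji, are rejected with “Incorrect string value” errors in strict SQL mode, or truncated with a warning when strict mode is off.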
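An application can check ahead of time whether a string would hit this limit. The helper below is a hypothetical sketch (the function names are illustrative, not part of MySQL or any driver): it flags any character above U+FFFF, which is the same set of characters that needs four bytes in UTF‑8.

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any character lies outside the BMP (above U+FFFF),
    i.e. needs four bytes in UTF-8 and will not fit in legacy utf8."""
    return any(ord(ch) > 0xFFFF for ch in text)

def four_byte_characters(text: str) -> list[str]:
    """Return the offending characters as 'U+XXXX' strings, in order."""
    return [f"U+{ord(ch):04X}" for ch in text if ord(ch) > 0xFFFF]

for candidate in ["hello", "caf\u00e9 costs 3 \u20ac", "great post \U0001F603"]:
    print(repr(candidate), needs_utf8mb4(candidate), four_byte_characters(candidate))
```

A check like this is only a diagnostic aid; the fix is still to store the data in a charset that can represent it.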
History of MySQL character sets
MySQL added UTF‑8 support during the development of version 4.1 (2003), initially following the older RFC 2279 specification, which allowed sequences of up to six bytes. In September 2002, before that version shipped, the implementation was deliberately changed to cap sequences at three bytes, creating the limited utf8 charset. The change appears to have been motivated by performance: fixed‑length CHAR columns work best when every character reserves the same small number of bytes. The result was an encoding that shares UTF‑8’s name but is incompatible with the standard.
Why the issue matters to developers
Many tutorials still recommend utf8, leading to unexpected “Incorrect string value” failures when four‑byte characters are used.
Fixed‑length CHAR columns declared as utf8 can reserve three bytes per character, so they may take more space and be slower than expected while still being unable to hold four‑byte characters.
Correct Unicode data cannot be stored, breaking internationalization and data integrity.
Resolution: use utf8mb4
In 2010, MySQL 5.5.3 introduced the utf8mb4 charset, which implements the full UTF‑8 range (up to four bytes per character); MariaDB supports it as well. All new databases should be created with utf8mb4, and existing installations should be migrated.
Migration steps
Set server‑wide defaults (e.g., in my.cnf or my.ini).
[mysqld]
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
Convert each database, table, and column to utf8mb4.
-- Change a database's default charset (applies to tables created afterwards)
ALTER DATABASE db_name CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
-- Convert a specific table
ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
Verify the conversion.
SHOW VARIABLES LIKE 'character_set%';
SHOW CREATE TABLE tbl_name;
Reference: https://mathiasbynens.be/notes/mysql-utf8mb4#utf8-to-utf8mb4
The same URL contains a detailed guide for converting existing databases from utf8 to utf8mb4.
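When many tables are involved, the ALTER statements from the migration steps can be generated rather than typed by hand. The following is a minimal sketch under stated assumptions: the helper names, the database name blog, and the table names are hypothetical, and the script only builds SQL text for review; it does not connect to a server.

```python
def quote_identifier(name: str) -> str:
    """Quote a MySQL identifier with backticks, doubling embedded backticks."""
    return "`" + name.replace("`", "``") + "`"

def build_conversion_statements(database, tables,
                                charset="utf8mb4",
                                collation="utf8mb4_unicode_ci"):
    """Build the ALTER statements from the migration steps as strings.

    Hypothetical helper: it only formats SQL and never executes it.
    """
    statements = [
        f"ALTER DATABASE {quote_identifier(database)} "
        f"CHARACTER SET = {charset} COLLATE = {collation};"
    ]
    for table in tables:
        statements.append(
            f"ALTER TABLE {quote_identifier(table)} "
            f"CONVERT TO CHARACTER SET {charset} COLLATE {collation};"
        )
    return statements

# Example with made-up names; review the output before running it on a server.
for statement in build_conversion_statements("blog", ["posts", "comments"]):
    print(statement)
```

Generating the statements first makes it possible to review them, and to take a backup, before anything touches the data.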
Liangxu Linux
Liangxu, a self‑taught IT professional now working as a Linux development engineer at a Fortune 500 multinational, shares extensive Linux knowledge: fundamentals, applications, tools, plus Git, databases, Raspberry Pi, etc.
