Why MySQL’s utf8 Isn’t Real UTF‑8 and How utf8mb4 Fixes Emoji Errors
The article explains why MySQL’s built‑in utf8 charset only supports up to three‑byte characters, causing insert errors with four‑byte emojis, and shows how switching tables, system, and connection settings to utf8mb4 resolves the issue while detailing the historical reasons behind this limitation.
1. Error Recap
Inserting an emoji directly with a MySQL INSERT statement caused an error:
<code>INSERT INTO `csjdemo`.`student` (`ID`, `NAME`, `SEX`, `AGE`, `CLASS`, `GRADE`, `HOBBY`)
VALUES ('20', '陈哈哈😓', '男', '20', '181班', '9年级', '看片儿');</code>
[Err] 1366 - Incorrect string value: '\xF0\x9F\x98\x93' for column 'NAME' at row 1
Changing the database, table, and column character set to utf8mb4 allowed the insert to succeed:
<code>INSERT INTO `student` (`ID`, `NAME`, `SEX`, `AGE`, `CLASS`, `GRADE`, `HOBBY`)
VALUES (null, '陈哈哈😓😓', '男', '20', '181班', '9年级', '看片儿');</code>
2. MySQL’s utf8 Quirk
MySQL’s utf8 charset is not true UTF‑8: it stores at most three bytes per character, while real UTF‑8 uses up to four bytes, which emojis and some rarer characters require.
Common Chinese characters occupy three bytes, but emojis occupy four, so the insert fails unless the charset is changed to utf8mb4.
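One way to see the difference on a live server is to list its character sets. This is a small sketch, assuming a MySQL or MariaDB server; recent versions report utf8 as an alias of utf8mb3, so the exact names in the output may vary.

```sql
-- List the UTF-8 variants the server knows about.
-- The Maxlen column is the maximum number of bytes per character:
-- 3 for utf8 (utf8mb3), 4 for utf8mb4.
SHOW CHARACTER SET LIKE 'utf8%';
```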
After switching to utf8mb4, the stored data shows the correct byte counts, confirming that four‑byte characters can now be saved.
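The byte counts can be checked with a query along these lines. It is a minimal sketch, assuming the `student` table from the example above has already been converted to utf8mb4 and contains the row inserted earlier.

```sql
-- CHAR_LENGTH counts characters; LENGTH counts bytes as stored
-- in the column's character set; HEX shows the raw bytes.
SELECT `NAME`,
       CHAR_LENGTH(`NAME`) AS char_count,
       LENGTH(`NAME`)      AS byte_count,
       HEX(`NAME`)         AS raw_bytes
FROM `student`
WHERE `NAME` LIKE '陈哈哈%';
```

For the value '陈哈哈😓😓' this should report 5 characters and 17 bytes: three Chinese characters at three bytes each plus two emojis at four bytes each. Each emoji appears in the hex output as F09F9893, the same bytes quoted in the error message.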
MySQL introduced the utf8mb4 charset in 2010 to work around this limitation, but the older utf8 charset remains widely (and incorrectly) recommended.
2.1 utf8mb4 Is the Real UTF‑8
All MySQL and MariaDB users should migrate to utf8mb4 and stop using utf8.
2.2 Brief History of MySQL utf8
MySQL added UTF‑8 support in version 4.1 (2003), before the current UTF‑8 standard, RFC 3629, capped the encoding at four bytes; the RFC 2279 standard in force at the time allowed up to six bytes per character. In September 2002, while version 4.1 was still in development, the developers limited MySQL’s utf8 to a maximum of three‑byte sequences, likely to improve storage and performance for fixed‑length CHAR columns.
This change was undocumented, leading many developers to believe that MySQL’s utf8 was the full UTF‑8 standard. The limitation persisted until MySQL released utf8mb4 in 2010.
3. Conclusion
Most online articles still treat MySQL’s utf8 as true UTF‑8, which causes failures when storing four‑byte characters like emojis. When creating new MySQL or MariaDB databases, always set the server, database, table, and column character sets to utf8mb4 to ensure proper handling of all Unicode characters.
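For an existing schema, the switch might look like the following. This is a sketch rather than a complete migration procedure: it reuses the `csjdemo` database and `student` table from the example above, and the collation utf8mb4_unicode_ci is only an illustrative choice.

```sql
-- Database default: applies to tables created afterwards.
ALTER DATABASE `csjdemo`
  CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Existing table: converts the table default and its character columns.
ALTER TABLE `csjdemo`.`student`
  CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Current connection: client, connection, and result character sets.
SET NAMES utf8mb4;

-- Check what the session and server are actually using.
SHOW VARIABLES LIKE 'character_set%';
```

The server-wide default is normally set in the server configuration file with the character-set-server option instead of per session, and client libraries usually have their own connection charset setting that also needs to be utf8mb4.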
macrozheng
Dedicated to Java tech sharing and dissecting top open-source projects. Topics include Spring Boot, Spring Cloud, Docker, Kubernetes and more. Author’s GitHub project “mall” has 50K+ stars.