Why MySQL’s utf8 Isn’t Real UTF‑8 and How utf8mb4 Fixes Emoji Insertion Errors
This article explains why MySQL's utf8 charset cannot store four‑byte characters such as emojis and demonstrates the resulting insertion error. It then shows how switching to the utf8mb4 charset resolves the issue, and covers the historical reasons behind MySQL's limited utf8 implementation.
1. Error Review
Running an INSERT statement whose values contain an emoji fails with error 1366:
INSERT INTO `csjdemo`.`student` (`ID`, `NAME`, `SEX`, `AGE`, `CLASS`, `GRADE`, `HOBBY`) VALUES ('20','陈哈哈😓','男','20','181班','9年级','看片儿');
[Err] 1366 - Incorrect string value: '\xF0\x9F\x98\x93' for column 'NAME' at row 1
After changing the character set of the database, the connection, and the column to utf8mb4, the insertion succeeds:
INSERT INTO `student` (`ID`, `NAME`, `SEX`, `AGE`, `CLASS`, `GRADE`, `HOBBY`) VALUES (null,'陈哈哈😓😓','男','20','181班','9年级','看片儿');
2. Fun Facts About utf8 in MySQL
MySQL's "utf8" is not true UTF‑8.
In MySQL, the "utf8" charset only supports up to three bytes per character, while real UTF‑8 supports up to four bytes.
Common Chinese characters occupy three bytes and ASCII characters one byte, but emojis require four bytes, so inserting them fails unless the charset is changed to utf8mb4.
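The byte counts can be checked from any client session. The following is a minimal sketch, assuming a UTF‑8 terminal and a connection switched to utf8mb4 so that the emoji reaches the server intact; the column aliases are illustrative only.

```sql
-- Assumption: the client terminal sends UTF-8 and the session uses utf8mb4.
SET NAMES utf8mb4;

-- LENGTH() counts bytes; CHAR_LENGTH() counts characters.
SELECT
  LENGTH('A')       AS ascii_bytes,
  LENGTH('陈')      AS cjk_bytes,
  LENGTH('😓')      AS emoji_bytes,
  CHAR_LENGTH('😓') AS emoji_chars;

-- HEX() exposes the raw bytes of the emoji.
SELECT HEX('😓') AS emoji_hex;
```

Under those assumptions the three LENGTH() values should be 1, 3, and 4, and the hex string should correspond to the bytes \xF0\x9F\x98\x93 quoted in the error message above.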
The following diagram shows the byte count before and after switching to utf8mb4, illustrating why four‑byte characters cannot be stored in the old utf8 charset.
MySQL introduced the utf8mb4 charset in 2010 (MySQL 5.5.3) to work around this limitation, but many online articles still recommend using "utf8".
2.1 utf8mb4 Is the Real UTF‑8
MySQL's "utf8mb4" is the true UTF‑8 implementation. The older "utf8" charset is a proprietary subset that cannot represent many Unicode characters.
All MySQL and MariaDB users should migrate to utf8mb4 and stop using "utf8".
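A migration of the example schema might look like the sketch below. It reuses the `csjdemo` database and `student` table from the error above; the utf8mb4_unicode_ci collation is an assumption, and any utf8mb4 collation available on the server could be substituted. Take a backup first, since the conversion rewrites the table.

```sql
-- Inspect the current server, database, and connection character sets.
SHOW VARIABLES LIKE 'character_set%';

-- Change the default used for tables created in this database from now on.
ALTER DATABASE `csjdemo` CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Convert the existing table, including all of its character columns.
-- The collation here is an assumption; choose one that suits the data.
ALTER TABLE `csjdemo`.`student`
  CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Make the current connection send and receive utf8mb4.
SET NAMES utf8mb4;
```

SET NAMES only affects the current session, so application connections also need utf8mb4 configured in their driver or connection settings. On older InnoDB row formats, indexed VARCHAR(255) columns may exceed the 767-byte index prefix limit once each character can take four bytes, so index definitions are worth checking before converting.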
2.2 Brief History of utf8
MySQL added UTF‑8 support in version 4.1 (2003), before the current UTF‑8 standard, RFC 3629, was published in November 2003. RFC 3629 caps a character at four bytes, whereas the earlier RFC 2279 allowed up to six bytes per character.
In September 2002, MySQL limited its "utf8" to three‑byte sequences, effectively creating a non‑standard charset.
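The reason was never documented. A plausible motive is space and speed for fixed‑length CHAR columns: such a column has to reserve room for the largest possible encoding of every character, so capping "utf8" at three bytes instead of six halves the space a CHAR(n) column needs. The cost was true UTF‑8 support: emojis and some rarer CJK characters need four bytes and cannot be stored.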
Because fixing the charset would require users to rebuild their databases, MySQL kept the flawed "utf8" for years and only introduced the proper utf8mb4 charset in 2010.
3. Summary
Many online articles mistakenly treat MySQL's "utf8" as real UTF‑8, leading developers to encounter insertion errors with four‑byte characters. When creating new MySQL or MariaDB databases, always set the database, tables, columns, and client connection to utf8mb4 to ensure full Unicode compatibility.
Doing so prevents future "Incorrect string value" errors caused by four‑byte characters.
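For a new schema, the charset can be declared at creation time. This is a sketch only: the database name `demo_db`, the column types, and the utf8mb4_unicode_ci collation are hypothetical choices for illustration, not taken from the original table definition.

```sql
-- Hypothetical database name; the charset is set as the database default.
CREATE DATABASE IF NOT EXISTS `demo_db`
  CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Hypothetical column types; the table default applies to its character columns.
CREATE TABLE IF NOT EXISTS `demo_db`.`student` (
  `ID`   INT NOT NULL AUTO_INCREMENT,
  `NAME` VARCHAR(64) NOT NULL,
  PRIMARY KEY (`ID`)
) DEFAULT CHARSET = utf8mb4 COLLATE = utf8mb4_unicode_ci;

-- Confirm which charset each column actually received.
SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'demo_db' AND TABLE_NAME = 'student';
```

Declaring the charset explicitly at both levels avoids depending on the server default, which differs between MySQL and MariaDB versions.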
Open Source Linux
