
Why MySQL’s utf8 Isn’t Real UTF‑8 and How utf8mb4 Solves Emoji Errors

This article explains why MySQL's built‑in utf8 charset is not true UTF‑8, how inserting 4‑byte emoji characters triggers errors, and why switching the database, tables, and connection to utf8mb4 resolves the issue, along with a brief history of MySQL's encoding decisions.

MaGe Linux Operations

Last year I tried to store emoji characters in a MySQL table and kept getting errors. The problem was solved simply by changing the character set from utf8 to utf8mb4, but I never investigated why.

Later I read that each emoji takes four bytes in UTF‑8 and can only be stored in a character set that implements the full UTF‑8 encoding; a charset that stops at three bytes rejects it. That made me realize that MySQL’s utf8 is not actually UTF‑8.

1. Error Recap

Running an INSERT statement whose values contained an emoji produced the following error:

INSERT INTO `csjdemo`.`student` (`ID`, `NAME`, `SEX`, `AGE`, `CLASS`, `GRADE`, `HOBBY`)
VALUES ('20', '陈哈哈😓', '男', '20', '181班', '9年级', '看片儿');
[Err] 1366 - Incorrect string value: '\xF0\x9F\x98\x93' for column 'NAME' at row 1

After changing the character set of the server, the database, and the table columns to utf8mb4, the insert succeeded:

INSERT INTO `student` (`ID`, `NAME`, `SEX`, `AGE`, `CLASS`, `GRADE`, `HOBBY`)
VALUES (null, '陈哈哈😓😓', '男', '20', '181班', '9年级', '看片儿');
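
As a rough sketch of what that change can look like, the statements below convert the csjdemo database and its student table and then switch the current connection. The collation utf8mb4_unicode_ci is an assumption chosen for illustration; the original post does not say which collation was used. Back up the data first, since CONVERT TO rewrites the table.

```sql
-- Sketch only: the database and table names come from the example above;
-- the collation is an assumed choice, not taken from the original post.

-- Default character set for tables created in this database from now on.
ALTER DATABASE `csjdemo` CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Convert the existing table, including its text columns.
ALTER TABLE `csjdemo`.`student`
  CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Make the current client connection send and receive utf8mb4.
SET NAMES utf8mb4;
```

The server-wide default is configured separately, typically with character-set-server=utf8mb4 in the [mysqld] section of the MySQL configuration file.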

2. MySQL’s utf8 Quirk

MySQL’s utf8 charset supports at most three bytes per character, whereas the real UTF‑8 standard allows up to four. English letters and digits take one byte and common Chinese characters take three, so they fit; emoji and many less common characters need four bytes and do not.

Therefore, inserting four‑byte characters fails unless the charset is changed to utf8mb4.
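
One way to see the limit for yourself is to ask the server for the maximum number of bytes per character in each charset. This is a minimal sketch; the exact rows depend on the server version, and newer releases list the three‑byte charset under the name utf8mb3.

```sql
-- Lists the UTF-8 family of charsets; the Maxlen column is the
-- maximum number of bytes a single character may occupy.
SHOW CHARACTER SET LIKE 'utf8%';

-- The same information read from information_schema.
SELECT CHARACTER_SET_NAME, MAXLEN
FROM information_schema.CHARACTER_SETS
WHERE CHARACTER_SET_NAME LIKE 'utf8%';
```

Maxlen should read 3 for utf8 (or utf8mb3) and 4 for utf8mb4.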

[Image: byte counts before and after converting to utf8mb4]

MySQL introduced utf8mb4 in 2010 (in version 5.5.3) to work around this limitation, but the change was not widely advertised, so many developers still use the misleadingly named utf8 setting.

3. utf8mb4 Is the Real UTF‑8

Computers store text as binary data. For example, the character "C" is stored as the byte 01000011. Displaying it takes two steps:

The computer reads 01000011 and obtains the number 67.

It looks up code point 67 in the Unicode table and finds the character "C".

Almost all network applications use Unicode because it covers virtually all characters. Unicode can be encoded as UTF‑32 (four bytes per character), UTF‑16, or UTF‑8. UTF‑8 is space‑efficient: common ASCII characters use one byte, while characters like "😓" need four bytes.
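
These byte counts can be checked from a MySQL client. The sketch below assumes the client and terminal really send UTF‑8 and that the connection charset is utf8mb4 (hence the SET NAMES line); with a different connection charset the results will differ.

```sql
SET NAMES utf8mb4;

-- 'C' is code point 67, stored as the single byte 0x43 = 01000011.
SELECT ORD('C')                    AS code_point,  -- 67
       HEX('C')                    AS hex_byte,    -- 43
       LPAD(BIN(ORD('C')), 8, '0') AS bits;        -- 01000011

-- LENGTH() counts bytes, CHAR_LENGTH() counts characters.
SELECT LENGTH('C')       AS ascii_bytes,  -- 1
       LENGTH('陈')      AS cjk_bytes,    -- 3
       LENGTH('😓')      AS emoji_bytes,  -- 4
       CHAR_LENGTH('😓') AS emoji_chars,  -- 1
       HEX('😓')         AS emoji_hex;    -- F09F9893
```

The last value, F09F9893, is the same byte sequence that MySQL reported in the error message in section 1.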

4. A Brief History of MySQL’s utf8

MySQL added UTF‑8 support in version 4.1 (2003). At that time the current UTF‑8 standard, RFC 3629, which limits a character to four bytes, had not yet been published; the older RFC 2279 allowed up to six bytes per character.

In September 2002 MySQL developers limited the charset to a maximum of three‑byte sequences, likely to improve performance for fixed‑length CHAR columns. This decision was never fully documented, and the original committers remain unknown.

The limitation meant that inserting four‑byte characters such as emoji would fail, while using utf8mb4 (introduced in 2010) restores full UTF‑8 compatibility.

5. Conclusion

Many online articles treat MySQL’s utf8 as if it were true UTF‑8. To avoid data loss and errors, anyone setting up MySQL or MariaDB should configure the server, databases, tables, and connections to use utf8mb4 instead of utf8.
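
To find out where an existing installation still relies on the three‑byte charset, a quick audit along these lines may help. It is a sketch rather than a complete migration check, and the list of system schemas to skip is an assumption that may need adjusting for your server.

```sql
-- Server, database, and connection charsets currently in effect.
SHOW VARIABLES LIKE 'character_set%';

-- Columns that still use the three-byte charset
-- (reported as 'utf8' or 'utf8mb3' depending on the server version).
SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME
FROM information_schema.COLUMNS
WHERE CHARACTER_SET_NAME IN ('utf8', 'utf8mb3')
  AND TABLE_SCHEMA NOT IN ('mysql', 'information_schema',
                           'performance_schema', 'sys');
```

Any row returned by the second query is a column that will still reject four‑byte characters.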

