Why MySQL’s “utf8” Isn’t Real UTF‑8 and How to Fix It with utf8mb4

The article explains that MySQL’s legacy “utf8” charset supports characters of at most three bytes, which causes errors when storing true UTF‑8 data that needs four bytes, and shows how switching to the proper “utf8mb4” charset resolves the issue.

ITPUB

While trying to save a UTF‑8 string through Rails into a MariaDB database configured with the "utf8" charset, the author encountered the error:

Incorrect string value: '\xF0\x9F\x98\x83 …' for column 'summary' at row 1

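The root cause is that MySQL’s "utf8" is not a true UTF‑8 implementation. It stores at most three bytes per character, which covers only code points up to U+FFFF (the Basic Multilingual Plane), whereas the official UTF‑8 standard (RFC 3629) allows up to four bytes per character. The bytes in the error above, F0 9F 98 83, are a single four‑byte character (an emoji), so the column rejects it.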
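
As a rough illustration of the three‑byte limit, the following Python sketch decodes the bytes from the error message and checks whether a string contains any character that needs four bytes in UTF‑8. The helper name needs_utf8mb4 is made up for this example and is not part of MySQL or Rails.

```python
def needs_utf8mb4(text: str) -> bool:
    """Return True if any character in text takes four bytes in UTF-8.

    Hypothetical helper: code points above U+FFFF are exactly the ones
    that UTF-8 encodes with four bytes.
    """
    return any(ord(ch) > 0xFFFF for ch in text)


# The byte sequence quoted in the error message is one character.
emoji = b"\xF0\x9F\x98\x83".decode("utf-8")

print(f"U+{ord(emoji):04X}", len(emoji.encode("utf-8")))  # U+1F603 4
print(needs_utf8mb4("naïve café"))      # False: every character fits in 1-2 bytes
print(needs_utf8mb4("great " + emoji))  # True: the emoji needs 4 bytes
```

A check like this could flag strings that a legacy "utf8" column would reject, though the real fix is to change the charset rather than filter the input.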

MySQL never fixed this limitation in the original "utf8" charset. In 2010 it introduced a new charset called "utf8mb4" that fully supports four‑byte Unicode characters, effectively bypassing the bug.

Key takeaways:

MySQL’s "utf8mb4" is the correct, full‑UTF‑8 charset.

MySQL’s legacy "utf8" is a proprietary, limited encoding that cannot represent many Unicode characters.

All MySQL and MariaDB users should migrate from "utf8" to "utf8mb4" and never use the former again.

What is an encoding? What is UTF‑8?

Computers store text as binary numbers (e.g., the character "C" as 01000011). Storing text takes two steps: each character is mapped to a numeric code point (Unicode), and that number is then encoded into a byte sequence. UTF‑8 encodes ASCII characters in a single byte and uses two to four bytes for everything else, which saves space compared to fixed‑width encodings like UTF‑32, where every character takes four bytes.
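
The two steps (character to code point, code point to bytes) can be seen directly in Python. This is only a small sketch; the function name describe and the sample characters are chosen for illustration.

```python
def describe(ch: str) -> tuple[int, bytes, int]:
    """Return (code point, UTF-8 bytes, UTF-32 byte count) for one character."""
    code_point = ord(ch)              # step 1: character -> Unicode code point
    utf8 = ch.encode("utf-8")         # step 2: code point -> variable-width bytes
    utf32 = ch.encode("utf-32-be")    # fixed width, big-endian, no byte-order mark
    return code_point, utf8, len(utf32)


print(format(ord("C"), "08b"))  # 01000011

for ch in ["C", "é", "€", "😃"]:
    code_point, utf8, utf32_len = describe(ch)
    print(f"{ch!r} U+{code_point:04X} "
          f"UTF-8: {utf8.hex(' ')} ({len(utf8)} bytes), UTF-32: {utf32_len} bytes")
```

The four sample characters take 1, 2, 3, and 4 bytes in UTF‑8 respectively, while each takes 4 bytes in UTF‑32.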

MySQL history

MySQL added UTF‑8 support in version 4.1 (2003), while the UTF‑8 standard was still evolving. The older RFC 2279 allowed sequences of up to six bytes, and MySQL initially implemented that; RFC 3629, which caps sequences at four bytes, was not published until November 2003. In September 2002, while version 4.1 was still in development, the MySQL source was altered to limit UTF‑8 to three‑byte sequences. It is not known who made that decision.

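The change appears to have been motivated by performance considerations. Fixed‑length CHAR columns must reserve the maximum number of bytes for every character (under the six‑byte scheme, CHAR(1) would reserve 6 bytes; capped at three bytes, it reserves 3), and fixed‑length rows let MySQL store and compare data more quickly. However, the resulting “utf8” charset is incompatible with true UTF‑8: it still wastes space in fixed‑length columns, and it causes errors or data loss for characters that require four bytes.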

Because fixing the charset would have required every user to rebuild their databases, MySQL left the broken "utf8" charset in place. In 2010 it finally released the proper "utf8mb4" charset alongside it, and the old one remains available.

Why this matters

The broken charset leads to frustrating bugs, wasted development time, and incorrect assumptions in countless tutorials that treat MySQL’s "utf8" as genuine UTF‑8.

Conclusion

If you are using MySQL or MariaDB, stop using the "utf8" charset and migrate to "utf8mb4". A detailed migration guide is available at https://mathiasbynens.be/notes/mysql-utf8mb4#utf8-to-utf8mb4.
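
As a minimal sketch of what the core of such a migration looks like, the Python snippet below builds the ALTER DATABASE and ALTER TABLE ... CONVERT TO CHARACTER SET statements for a list of tables. The database name "blog", the table names, and the choice of the utf8mb4_unicode_ci collation are assumptions made for illustration; the snippet only prints SQL and does not connect to a server.

```python
def quote_identifier(name: str) -> str:
    """Backtick-quote a MySQL identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def migration_statements(database: str, tables: list[str],
                         collation: str = "utf8mb4_unicode_ci") -> list[str]:
    """Build SQL that switches a database and its tables to utf8mb4.

    Hypothetical helper: review the generated SQL before running it.
    """
    if not collation.startswith("utf8mb4_"):
        raise ValueError("collation must belong to the utf8mb4 charset")
    statements = [
        f"ALTER DATABASE {quote_identifier(database)} "
        f"CHARACTER SET = utf8mb4 COLLATE = {collation};"
    ]
    for table in tables:
        statements.append(
            f"ALTER TABLE {quote_identifier(table)} "
            f"CONVERT TO CHARACTER SET utf8mb4 COLLATE {collation};"
        )
    return statements


# Example with made-up names; back up the data before running real statements.
for sql in migration_statements("blog", ["posts", "comments"]):
    print(sql)
```

Converting the tables is only part of the job. The client connection must also use utf8mb4 (in Rails, typically encoding: utf8mb4 in database.yml), and on older InnoDB row formats indexed VARCHAR(255) columns may exceed the index key length limit once each character can take four bytes. The linked guide covers these steps in more detail.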

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

ITPUB

Official ITPUB account sharing technical insights, community news, and exciting events.
