
Why MySQL’s ‘utf8’ Isn’t Real UTF‑8 and How utf8mb4 Fixes It

Discover why MySQL’s legacy ‘utf8’ charset only supports three‑byte characters, causing storage errors for true UTF‑8 data, and learn how switching to the proper ‘utf8mb4’ charset resolves these issues, with a brief history and practical migration guidance.

21CTO

Storing a UTF‑8 string in a MariaDB database from a Rails application can fail with an unexpected error:

Incorrect string value: ‘😃 <…>’ for column ‘summary’ at row 1

The root cause is that MySQL’s “utf8” charset is not true UTF‑8; it only supports up to three bytes per character, while real UTF‑8 allows up to four bytes.
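To see the three‑byte limit concretely, here is a minimal Python sketch that finds the characters in a string whose UTF‑8 encoding is four bytes long. The helper names `four_byte_chars` and `fits_legacy_utf8` are made up for illustration and are not part of MySQL, MariaDB, or Rails.

```python
def four_byte_chars(text: str) -> list:
    """Return the characters in `text` whose UTF-8 encoding is four bytes long."""
    return [ch for ch in text if len(ch.encode("utf-8")) == 4]


def fits_legacy_utf8(text: str) -> bool:
    """True if every character needs at most three bytes in UTF-8.

    This only checks byte lengths; it is an illustration, not a
    replacement for the database's own validation.
    """
    return not four_byte_chars(text)


if __name__ == "__main__":
    emoji = "\U0001F603"          # the smiley from the error message
    print(emoji.encode("utf-8"))  # b'\xf0\x9f\x98\x83' (four bytes)
    print(fits_legacy_utf8("caf\u00e9"))           # True
    print(fits_legacy_utf8("caf\u00e9 " + emoji))  # False
```

A check like this could be run on input before it reaches a column that still uses the legacy charset, although converting the column to “utf8mb4” is the real fix.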

MySQL introduced the “utf8mb4” charset in 2010 to work around this limitation, but the older “utf8” charset remains widely recommended despite being incorrect.

MySQL’s “utf8mb4” is genuine UTF‑8.

MySQL’s “utf8” is a proprietary limited charset that cannot represent many Unicode characters.

All MySQL and MariaDB users should migrate to “utf8mb4” and stop using “utf8”.

What Is an Encoding? What Is UTF‑8?

Computers store text as binary data. For example, the character “C” is stored as the single byte 01000011. The computer reads the byte, interprets it as a number (67), and looks up the corresponding Unicode character.

Unicode has room for over a million characters (1,114,112 code points). The simplest encoding, UTF‑32, uses 32 bits for every character, which wastes space for most text. UTF‑8 is more space‑efficient: common characters like “C” use one byte, while less common characters use up to four bytes.
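The size difference between the two encodings can be checked directly. The following sketch uses a hypothetical helper, `encoded_sizes`, to report the code point of a character and how many bytes it takes in UTF‑8 and UTF‑32; the sample characters are chosen only for illustration.

```python
def encoded_sizes(ch: str) -> tuple:
    """Return (code point, UTF-8 byte count, UTF-32 byte count) for one character."""
    # "utf-32-be" is used so that no byte-order mark is added to the count.
    return ord(ch), len(ch.encode("utf-8")), len(ch.encode("utf-32-be"))


if __name__ == "__main__":
    # "C" as a number and as a bit pattern
    print(ord("C"), format(ord("C"), "08b"))  # 67 01000011

    # C, e with acute accent, euro sign, smiley emoji
    for ch in ["C", "\u00e9", "\u20ac", "\U0001F603"]:
        code_point, utf8_len, utf32_len = encoded_sizes(ch)
        print(f"U+{code_point:04X}: UTF-8 {utf8_len} byte(s), UTF-32 {utf32_len} bytes")
```

UTF‑32 always takes four bytes, while UTF‑8 takes one, two, three, and four bytes for these four characters. Only the last one exceeds what MySQL’s “utf8” can store.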

Brief History of MySQL’s UTF‑8 Support

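MySQL’s first UTF‑8 implementation, written during the development of version 4.1, followed the older RFC 2279 standard, which allowed up to six bytes per character. In September 2002, before version 4.1 was released in 2003, MySQL limited UTF‑8 to sequences of at most three bytes, which left the charset unable to store any character outside Unicode’s Basic Multilingual Plane.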

The change appears to have been motivated by performance considerations: if every character is known to take at most three bytes, MySQL can reserve a fixed amount of space for each CHAR value and store CHAR columns more efficiently. However, the decision made the charset incompatible with real UTF‑8, and developers who chose “utf8” unknowingly ended up with the flawed charset.

Because fixing the charset would require users to rebuild their databases, MySQL left the broken “utf8” in place until it introduced “utf8mb4” in 2010, which fully supports UTF‑8.

Why This Issue Is So Frustrating

Developers are often misled by documentation and tutorials that treat MySQL’s “utf8” as real UTF‑8, leading to baffling bugs when trying to store characters like emojis.

Conclusion

If you are using MySQL or MariaDB, stop using the “utf8” charset and switch to “utf8mb4”. A migration guide is available at https://mathiasbynens.be/notes/mysql-utf8mb4#utf8-to-utf8mb4 to convert existing databases.
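As a rough illustration of what the conversion involves, the sketch below builds the `ALTER DATABASE` and `ALTER TABLE ... CONVERT TO CHARACTER SET` statements for a database and a list of tables. The function name, the database name `blog`, the table names, and the default collation `utf8mb4_unicode_ci` are assumptions chosen for the example. The script only prints SQL and does not connect to a server.

```python
def conversion_statements(database, tables, collation="utf8mb4_unicode_ci"):
    """Build SQL statements that switch a database and its tables to utf8mb4.

    All names passed in are placeholders; back up the data and review the
    generated SQL before running it against a real server.
    """
    def quote(name):
        # Quote an identifier with backticks, doubling any embedded backtick.
        return "`" + name.replace("`", "``") + "`"

    statements = [
        f"ALTER DATABASE {quote(database)} "
        f"CHARACTER SET = utf8mb4 COLLATE = {collation};"
    ]
    for table in tables:
        statements.append(
            f"ALTER TABLE {quote(table)} "
            f"CONVERT TO CHARACTER SET utf8mb4 COLLATE {collation};"
        )
    return statements


if __name__ == "__main__":
    # Hypothetical database and table names
    for statement in conversion_statements("blog", ["articles", "comments"]):
        print(statement)
```

This covers only the schema statements. Connection and server charset settings, as well as any index length limits affected by the wider characters, still need to be checked separately.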

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
