Why MySQL’s “utf8” Isn’t Real UTF‑8 and How to Switch to utf8mb4

This article explains why MySQL’s legacy "utf8" charset stores at most three bytes per character, why that makes it something less than true UTF‑8, and how to migrate databases to the proper "utf8mb4" charset for full Unicode support.

Programmer DD

Recently I hit a bug while saving a UTF‑8 string from Rails into a MariaDB database that used the "utf8" charset. The insert failed with an error like:

Incorrect string value: ‘\xF0\x9F\x98\x83 …’ for column ‘summary’ at row 1

The client, server, and database were all set to UTF‑8, so the error made no sense until I learned that MySQL’s "utf8" charset is not true UTF‑8: it stores at most three bytes per character, while real UTF‑8 needs up to four.
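To make the three-byte limit concrete, here is a minimal Python sketch that measures how many bytes a character needs in real UTF‑8 and flags text that would not fit in MySQL’s "utf8". The helper names utf8_len and fits_legacy_utf8 are made up for this illustration; they are not part of MySQL, MariaDB, or Rails.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes a single character occupies in real UTF-8."""
    return len(ch.encode("utf-8"))


def fits_legacy_utf8(text: str) -> bool:
    """True if every character needs at most three bytes in UTF-8,
    which is the most that MySQL's legacy "utf8" charset can store."""
    return all(utf8_len(ch) <= 3 for ch in text)


if __name__ == "__main__":
    for ch in ["C", "é", "€", "😃"]:
        print(repr(ch), utf8_len(ch), "byte(s)")
    print(fits_legacy_utf8("héllo €"))  # no four-byte characters
    print(fits_legacy_utf8("hi 😃"))    # the emoji needs four bytes
```

A check like this can be run over incoming strings to find out whether an application already handles four-byte characters, such as emoji, before the charset is changed.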

MySQL never fixed this limitation. In 2010 it introduced the "utf8mb4" charset, which fully implements UTF‑8.

MySQL’s "utf8mb4" is genuine UTF‑8.

MySQL’s "utf8" is a proprietary, limited‑capacity charset.

All MySQL and MariaDB users should stop using "utf8" and switch to "utf8mb4".

What Is Encoding? What Is UTF‑8?

Computers store text as binary. For example, the character "C" is stored as 01000011. To display it, the computer reads the bits as the number 67, looks up 67 in the Unicode table, finds "C", and draws it.
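The round trip between a character, its number, and its bits can be sketched in a few lines of Python. The function names to_bits and from_bits are hypothetical, and the sketch only covers characters that fit in a single byte, like "C".

```python
def to_bits(ch: str) -> str:
    """Return the code point of a character as an 8-digit binary string."""
    return format(ord(ch), "08b")


def from_bits(bits: str) -> str:
    """Read a binary string as a number and map it back to a character."""
    return chr(int(bits, 2))


if __name__ == "__main__":
    print(ord("C"))               # 67
    print(to_bits("C"))           # 01000011
    print(from_bits("01000011"))  # C
```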

Unicode has room for over a million code points. UTF‑32 stores every character in 32 bits (4 bytes), which is simple but wasteful. UTF‑8 saves space: common characters like "C" take 1 byte, while less common ones take 2 to 4 bytes. For text that is mostly ASCII, a UTF‑8 file is about a quarter the size of its UTF‑32 equivalent.
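The size difference is easy to measure. The sketch below encodes the same text both ways and compares byte counts; encoded_sizes is a made-up helper name, and "utf-32-le" is used so that Python does not prepend a byte order mark that would skew the count.

```python
def encoded_sizes(text: str) -> dict:
    """Byte counts for the same text in UTF-8 and in UTF-32 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-32": len(text.encode("utf-32-le")),
    }


if __name__ == "__main__":
    print(encoded_sizes("Hello"))     # pure ASCII: 5 bytes vs 20 bytes
    print(encoded_sizes("héllo 😃"))  # mixed text: 11 bytes vs 28 bytes
```

The one-quarter figure only holds for ASCII-heavy text. The more multi-byte characters a string contains, the smaller the saving.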

MySQL History

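MySQL has supported UTF‑8 since version 4.1, released in 2003. The standard in force while that version was being developed, RFC 2279, allowed up to six bytes per character (today’s standard, RFC 3629, caps it at four). In September 2002, before 4.1 shipped, MySQL’s developers limited "utf8" to sequences of at most three bytes, deviating from the standard.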

The reason appears to be performance: early MySQL encouraged defining text columns as CHAR with fixed byte lengths, padding or truncating values as needed. The developers initially allowed six bytes per character, but later cut this to three to keep CHAR columns efficient.

Because the proprietary "utf8" charset was already documented and widely referenced, many developers continued to use it, unaware that it could not store true four‑byte Unicode characters.

Why This Is Frustrating

The limitation caused weeks of wasted debugging time for developers who believed "utf8" was real UTF‑8. The charset’s incompatibility introduces hidden bugs and performance penalties.

Summary

If you are using MySQL or MariaDB, stop using the "utf8" charset and migrate to "utf8mb4". A migration guide is available at https://mathiasbynens.be/notes/mysql-utf8mb4#utf8-to-utf8mb4
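As a rough idea of what the migration involves, the Python sketch below builds the ALTER DATABASE and ALTER TABLE ... CONVERT TO CHARACTER SET statements for a database and its tables. It only generates SQL text and does not connect to a server. The function names, the example database and table names, and the choice of utf8mb4_unicode_ci as the default collation are assumptions made for this illustration, so adjust them to your own schema.

```python
import re


def quote_ident(name: str) -> str:
    """Quote a MySQL identifier with backticks, doubling embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def utf8mb4_statements(database, tables, collation="utf8mb4_unicode_ci"):
    """Build SQL that switches a database and its tables to utf8mb4.

    The names passed in are hypothetical examples; nothing is executed here.
    """
    # Reject anything that is not a plain utf8mb4 collation name.
    if not re.fullmatch(r"utf8mb4_[A-Za-z0-9_]+", collation):
        raise ValueError(f"not a utf8mb4 collation: {collation!r}")
    statements = [
        f"ALTER DATABASE {quote_ident(database)} "
        f"CHARACTER SET = utf8mb4 COLLATE = {collation};"
    ]
    for table in tables:
        statements.append(
            f"ALTER TABLE {quote_ident(table)} "
            f"CONVERT TO CHARACTER SET utf8mb4 COLLATE {collation};"
        )
    return statements


if __name__ == "__main__":
    for sql in utf8mb4_statements("blog", ["posts", "comments"]):
        print(sql)
```

Review the generated statements and take a backup before running them. Because each character can now take up to four bytes, column and index length limits are worth checking first.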

English original: https://medium.com/@adamhooper/in-mysql-never-use-utf8-use-utf8mb4-11761243e434
Original author: Adam Hooper
Translation author: Wu Ming