
Why MySQL’s “utf8” Is Not Real UTF‑8 and You Should Use utf8mb4

This article explains that MySQL’s legacy “utf8” character set supports sequences of at most three bytes and therefore cannot store every valid UTF‑8 character, describes the historical reasons behind this limitation, and advises all MySQL/MariaDB users to migrate to the proper utf8mb4 charset.

Top Architect

Jul 21, 2020

Recently I hit a bug while saving a UTF‑8 string through Rails into a MariaDB database configured with the "utf8" charset. The insert failed with a puzzling error:

Incorrect string value: '\xF0\x9F\x98\x83 <…>' for column 'summary' at row 1

I was using a UTF‑8 client, the server was set to UTF‑8, and even the string itself was valid UTF‑8, yet the insert failed.

The root cause is that MySQL’s "utf8" is not true UTF‑8.

MySQL’s "utf8" only supports up to three bytes per character, while the official UTF‑8 standard allows up to four bytes.
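
To make the three‑byte limit concrete, here is a minimal Python sketch that decodes the four bytes quoted in the error message and counts how many bytes a few sample characters need in standard UTF‑8. The helper name utf8_len and the sample characters are my own choices for illustration, not part of MySQL or Rails.

```python
# The four bytes quoted in the error message.
raw = b"\xF0\x9F\x98\x83"

# They are valid UTF-8 and decode to exactly one character.
char = raw.decode("utf-8")
print(hex(ord(char)))  # the character's Unicode code point
print(len(raw))        # bytes it needs in standard UTF-8


def utf8_len(ch: str) -> int:
    """Return how many bytes one character occupies in standard UTF-8."""
    return len(ch.encode("utf-8"))


# Illustrative samples: ASCII, accented Latin, CJK, and the emoji above.
for ch in ["C", "\u00e9", "\u4e2d", char]:
    print(repr(ch), utf8_len(ch))
```

Only the last character needs four bytes, and that is exactly the case MySQL’s "utf8" rejects.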

MySQL never fixed this bug; instead, in 2010 it introduced a new charset called "utf8mb4" that correctly implements full UTF‑8.

Unfortunately the new charset was not widely advertised, so many developers still receive the incorrect advice to use "utf8".

In short:

MySQL’s "utf8mb4" is the real UTF‑8.

MySQL’s "utf8" is a proprietary limited encoding that cannot represent many Unicode characters.

All MySQL and MariaDB users who are still using "utf8" should switch to "utf8mb4" and never use the old "utf8" again.

What Is Encoding? What Is UTF‑8?

Computers store text as binary bits. For example, the character "C" is stored as the byte pattern "01000011". To display the character, the computer performs two steps:

The computer reads "01000011" and interprets it as the number 67, because that bit pattern is the binary representation of 67.

The computer looks up 67 in the Unicode character set and finds the character "C".

Conversely, the computer maps "C" to Unicode code point 67, encodes 67 back to "01000011", and sends it to a web server.
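
The round trip described above can be sketched in a few lines of Python. This is only an illustration of the two steps using built‑in functions; the variable names are my own.

```python
bits = "01000011"

# Step 1: interpret the bit pattern as a number.
number = int(bits, 2)

# Step 2: look the number up in the Unicode character set.
char = chr(number)
print(number, char)

# The reverse direction: character -> code point -> encoded bits.
code_point = ord("C")
encoded = "C".encode("utf-8")          # a single byte for this character
bits_again = format(encoded[0], "08b")
print(code_point, bits_again)
```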

Almost all web applications use the Unicode character set because there is no reason to use another set.

Unicode has room for more than a million characters. The simplest encoding, UTF‑32, uses 32 bits per character, which is wasteful. UTF‑8 saves space: the character "C" needs only 8 bits, while less common characters need 16, 24, or 32 bits. An article like this one encoded in UTF‑8 occupies roughly a quarter of the space required by UTF‑32.
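
A quick way to check the size claim is to encode the same text both ways and compare. The sketch below uses a plain ASCII sample sentence of my own choosing, so the ratio comes out at exactly one quarter; text with many non‑ASCII characters would save less.

```python
# Illustrative sample: plain English text, every character is ASCII.
text = "Computers store text as binary bits. " * 100

utf8_size = len(text.encode("utf-8"))
# "utf-32-le" fixes the byte order so no byte-order mark is added.
utf32_size = len(text.encode("utf-32-le"))

ratio = utf8_size / utf32_size
print(utf8_size, utf32_size, ratio)

# Bits needed per character in UTF-8 for an ASCII letter and an emoji.
bits_c = len("C".encode("utf-8")) * 8
bits_emoji = len("\U0001F603".encode("utf-8")) * 8
print(bits_c, bits_emoji)
```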

MySQL’s "utf8" charset is incompatible with the UTF‑8 that other programs use: it can encode only characters that fit in three bytes, the code points U+0000 through U+FFFF, so it covers only a subset of Unicode.
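
If you are stuck with a "utf8" column for now, one defensive option is to check a string before inserting it. The following is a minimal sketch, assuming the only constraint that matters is the three‑byte limit (code points above U+FFFF); the function name is hypothetical and it does not talk to a database.

```python
from typing import List


def unsupported_by_mysql_utf8(text: str) -> List[str]:
    """Return the characters in text that need four bytes in UTF-8.

    Assumption: a character is rejected by the legacy "utf8" charset
    exactly when its code point is above U+FFFF.
    """
    return [ch for ch in text if ord(ch) > 0xFFFF]


print(unsupported_by_mysql_utf8("plain text, \u00e9, \u4e2d"))
print(unsupported_by_mysql_utf8("summary with \U0001F603"))
```

The first call finds nothing to complain about, while the second returns the emoji that triggered my original error.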

MySQL History

Why did MySQL developers cripple "utf8"? Looking at commit logs gives some clues.

MySQL started supporting UTF‑8 in version 4.1 (2003), but the UTF‑8 standard used today (RFC 3629) was published later, in November 2003.

The older UTF‑8 standard (RFC 2279) allowed up to six bytes per character. On 28 Mar 2002, MySQL developers used RFC 2279 in the first MySQL 4.1 preview.

In September of the same year they changed the source code so that "UTF8" would support at most three‑byte sequences.

Who made this change, and why, remains unknown: many of the original commit author names were lost when MySQL’s source repository moved to Git.

My speculation: in 2002 MySQL wanted to boost performance by assuming every row used the same number of bytes, encouraging users to define text columns as fixed‑length CHAR. This required padding or truncation, and the early implementation used six bytes per character (CHAR(1) = 6 bytes, CHAR(2) = 12 bytes, etc.).
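
The arithmetic behind that speculation is simple enough to write down. This sketch only multiplies the declared length by the worst‑case bytes per character, as the paragraph above assumes; it is not a model of how any MySQL storage engine actually lays out rows, and the function name is my own.

```python
def reserved_bytes(char_length: int, max_bytes_per_char: int) -> int:
    """Bytes a fixed-length CHAR(char_length) column would reserve,
    assuming every character is padded to the worst-case size."""
    return char_length * max_bytes_per_char


# Compare the six-byte limit of RFC 2279 with a three-byte cap.
for n in (1, 2, 255):
    print(n, reserved_bytes(n, 6), reserved_bytes(n, 3))
```

Under this assumption, capping characters at three bytes halves the space reserved for every CHAR column.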

That early behavior was technically correct, but it was never released. The documentation nevertheless claimed that MySQL’s "utf8" was standard UTF‑8, and the misinformation spread.

Later, I suspect, developers feared that users would both define CHAR columns and set their charset to "utf8". At six bytes per character that combination would defeat the intended space and speed benefits, so "utf8" was capped at three bytes.

Consequently, there were no winners: users who chose CHAR with "utf8" still got columns that were larger and slower than they wanted, while users who needed correct encoding could not store characters such as the smiling‑face emoji.

Because fixing the broken charset would have required every user to rebuild their databases, MySQL left the buggy "utf8" in place until 2010, when it finally released "utf8mb4" to support true UTF‑8.

Why This Drives People Crazy

This issue drove me crazy for a week. I was fooled by the name "utf8" and spent a lot of time tracking down the bug, and I’m sure many others have suffered the same.

"utf8" is essentially a proprietary charset: it created new problems that were left unfixed for years.

Summary

If you are using MySQL or MariaDB, stop using the "utf8" charset and migrate to "utf8mb4". A guide for converting existing databases is available at https://mathiasbynens.be/notes/mysql-utf8mb4#utf8-to-utf8mb4 .
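
As a rough idea of what the conversion involves, the sketch below builds the ALTER statements for a database and its tables. The database name "blog", the table name "articles", and the choice of the utf8mb4_unicode_ci collation are assumptions for illustration only. The script just prints SQL and connects to nothing; take a backup and read the guide above before running anything like this, since it covers caveats this sketch ignores.

```python
from typing import List


def conversion_statements(database: str, tables: List[str],
                          collation: str = "utf8mb4_unicode_ci") -> List[str]:
    """Build ALTER statements that switch a database and tables to utf8mb4."""

    def quote(name: str) -> str:
        # Quote an identifier with backticks, doubling any embedded backtick.
        return "`" + name.replace("`", "``") + "`"

    statements = [
        f"ALTER DATABASE {quote(database)} "
        f"CHARACTER SET = utf8mb4 COLLATE = {collation};"
    ]
    for table in tables:
        statements.append(
            f"ALTER TABLE {quote(table)} "
            f"CONVERT TO CHARACTER SET utf8mb4 COLLATE {collation};"
        )
    return statements


# Hypothetical names, used only to show the generated SQL.
for statement in conversion_statements("blog", ["articles"]):
    print(statement)
```

Remember that the client connection must also use utf8mb4, otherwise four‑byte characters can still be rejected or mangled on the way in.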
