Never Use MySQL “utf8” – Switch to “utf8mb4” for Real UTF‑8 Support

The article explains why MySQL’s legacy “utf8” character set stores at most three bytes per character, which causes errors with valid four‑byte UTF‑8 characters such as emoji, and how the “utf8mb4” charset resolves the issue. It covers the historical background, a practical example, and migration guidance.

Architecture Digest

Jun 23, 2020

While trying to store a UTF‑8 string in a MariaDB database using Rails, the author encountered an error indicating an incorrect string value for the column summary, caused by MySQL’s misleading “utf8” charset.

Incorrect string value: '\xF0\x9F\x98\x83 <…>' for column 'summary' at row 1

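The root cause is that MySQL’s “utf8” charset stores at most three bytes per character, whereas true UTF‑8 (as defined by RFC 3629) uses up to four. Three bytes are enough for the Basic Multilingual Plane (code points up to U+FFFF) but not for anything above it, which includes most emoji. MySQL never fixed this limitation; instead, in 2010 it introduced the “utf8mb4” charset (in MySQL 5.5.3), which fully implements UTF‑8.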
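The bytes quoted in the error message can be checked directly. The following is a minimal Python sketch, not part of the original article; the helper name needs_utf8mb4 is hypothetical and only illustrates the three‑byte limit described above.

```python
# The four bytes quoted in the error message above.
raw = b"\xF0\x9F\x98\x83"

ch = raw.decode("utf-8")   # valid UTF-8: decodes to a single character
code_point = ord(ch)


def needs_utf8mb4(text: str) -> bool:
    """Hypothetical helper: True if any character lies above U+FFFF.

    Such characters need four bytes in UTF-8, so they cannot fit into a
    charset that stores at most three bytes per character.
    """
    return any(ord(c) > 0xFFFF for c in text)


print(f"U+{code_point:04X}", len(raw), needs_utf8mb4(ch))
# prints: U+1F603 4 True
```

Text made only of characters up to U+FFFF, such as accented Latin letters or the euro sign, passes the check; a single emoji is enough to fail it.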

All developers using MySQL or MariaDB should abandon the legacy “utf8” charset and migrate to “utf8mb4”, because the former cannot store many valid Unicode characters and leads to data loss or errors.

What is an encoding? What is UTF‑8?

Computers store text as binary numbers; for example, the character “C” is stored as the byte 01000011. Getting there takes two steps: the character is mapped to a number, its Unicode code point, and that code point is converted to a byte sequence. The rule for the second step is what we call an encoding.
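As a small illustration of those two steps, here is a Python sketch (an addition for clarity, not from the original article) that takes “C” from character to code point to bits:

```python
ch = "C"

code_point = ord(ch)               # step 1: character -> Unicode code point
encoded = ch.encode("utf-8")       # step 2: code point -> byte sequence
bits = format(encoded[0], "08b")   # the single byte, written as 8 bits

print(code_point, encoded, bits)
# prints: 67 b'C' 01000011
```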

Unicode has room for over a million code points (U+0000 through U+10FFFF). UTF‑32 uses 32 bits for every character, which is simple but wasteful. UTF‑8 is a variable‑length encoding that uses one to four bytes per character, saving space while remaining compatible with ASCII.
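The size difference is easy to see by encoding a few characters both ways. This is an illustrative Python sketch; the sample characters and the function name encoded_lengths are chosen here and do not come from the original article.

```python
def encoded_lengths(ch: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-32 byte count) for one character."""
    # "utf-32-be" is used so that no byte-order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-32-be"))


# One sample per UTF-8 length: ASCII letter, accented letter, euro sign, emoji.
samples = ["C", "\u00e9", "\u20ac", "\U0001F603"]

for ch in samples:
    utf8_len, utf32_len = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}: UTF-8 = {utf8_len} bytes, UTF-32 = {utf32_len} bytes")
```

The UTF‑8 lengths come out as 1, 2, 3 and 4 bytes, while UTF‑32 always uses 4. Only the last sample, the four‑byte one, is out of reach for MySQL’s “utf8”.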

MySQL’s short history with UTF‑8

MySQL has supported UTF‑8 since version 4.1 (2003). At that time the official UTF‑8 standard was RFC 2279, which allowed up to six bytes per character; RFC 3629, which caps sequences at four bytes, was published in November 2003. MySQL developers limited their charset to three‑byte sequences, effectively creating a proprietary “utf8” that is not true UTF‑8.

The change appears to have been motivated by performance considerations: a fixed‑length CHAR column can be stored and retrieved faster when every row has the same known byte size, and a three‑byte cap keeps that size small. However, this decision broke compatibility with the real UTF‑8 standard and caused many developers to hit truncation or errors when storing characters such as emoji.
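The space argument can be made concrete with some rough arithmetic. The Python sketch below assumes, purely for illustration, that a fixed‑width CHAR(n) column reserves n times the charset’s maximum bytes per character; actual storage depends on the storage engine and row format, and the function name reserved_bytes is hypothetical.

```python
def reserved_bytes(char_len: int, max_bytes_per_char: int) -> int:
    """Worst-case bytes for a fixed-width column (illustrative assumption)."""
    return char_len * max_bytes_per_char


# Maximum bytes per character under each scheme mentioned in the text.
schemes = [
    ("RFC 2279 UTF-8", 6),
    ("utf8mb4", 4),
    ("MySQL utf8", 3),
]

for label, width in schemes:
    print(f"CHAR(255) with {label}: {reserved_bytes(255, width)} bytes")
# prints 1530, 1020 and 765 bytes respectively
```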

Because fixing the charset would have required users to rebuild their databases, MySQL kept the broken “utf8” for years. The proper solution, “utf8mb4”, was finally released in 2010 to support the full Unicode range.

Why this matters

Developers who unknowingly use “utf8” may spend weeks debugging mysterious errors, as the author experienced. The charset’s limitation is widely misunderstood, and many tutorials still recommend “utf8” incorrectly.

Conclusion

If you are using MySQL or MariaDB, stop using the “utf8” charset and convert all tables and columns to “utf8mb4”. A helpful migration guide is available at https://mathiasbynens.be/notes/mysql-utf8mb4#utf8-to-utf8mb4 .
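As a rough sketch of what the conversion involves, the Python snippet below only builds the SQL statements as strings; it does not connect to a database. ALTER DATABASE ... CHARACTER SET and ALTER TABLE ... CONVERT TO CHARACTER SET are standard MySQL statements, but the database name blog, the table names, and the choice of the utf8mb4_unicode_ci collation are assumptions made for illustration.

```python
def quote_ident(name: str) -> str:
    """Backtick-quote a MySQL identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def migration_statements(database, tables, collation="utf8mb4_unicode_ci"):
    """Build ALTER statements that switch a database and its tables to utf8mb4.

    The collation is inserted as given and is not validated here.
    """
    stmts = [
        f"ALTER DATABASE {quote_ident(database)} "
        f"CHARACTER SET = utf8mb4 COLLATE = {collation};"
    ]
    for table in tables:
        stmts.append(
            f"ALTER TABLE {quote_ident(table)} "
            f"CONVERT TO CHARACTER SET utf8mb4 COLLATE {collation};"
        )
    return stmts


# Hypothetical names, for illustration only.
for stmt in migration_statements("blog", ["posts", "comments"]):
    print(stmt)
```

Take a backup before running statements like these. Because characters can now take four bytes instead of three, indexed string columns may run into index key length limits, and the application’s connection charset has to be switched to utf8mb4 as well; the linked guide covers these steps.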
