
Mastering Character Sets: Diagnose and Fix MySQL Garbled Text

This article explains character sets and encodings, shows why MySQL data can become garbled, and provides step‑by‑step methods—including proper configuration, common pitfalls, and reliable repair techniques—to prevent and correct encoding issues such as emoji storage failures.

dbaplus Community

Character Sets and Encodings

A character set defines which textual symbols exist and assigns each one a number (its code point). A character encoding specifies how those code points are represented as bytes. Common examples include ASCII, GBK, and Unicode with its encodings UTF‑8 and UTF‑16. UTF‑8 is a variable‑length encoding in which the leading bits of the first byte indicate the total number of bytes in the character.

If the first bit is 0, the byte represents a single‑byte character (7 bits of data).

If the first three bits are 110, the byte starts a two‑byte character (5 bits + 6 bits from the following byte).

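If the first four bits are 1110, the byte starts a three‑byte character (4 bits + 6 bits from each of the two following bytes).

If the first five bits are 11110, the byte starts a four‑byte character (3 bits + 6 bits from each of the three following bytes).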

Bytes beginning with 10 are continuation bytes.
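The lead-byte rules above can be sketched in a few lines of Python. This is a minimal illustration rather than a full UTF‑8 validator; the function name utf8_sequence_length is hypothetical, and it also handles the 11110 prefix that UTF‑8 uses for four-byte characters.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return the total byte count announced by a UTF-8 lead byte."""
    if lead_byte >> 7 == 0b0:        # 0xxxxxxx: single-byte character
        return 1
    if lead_byte >> 5 == 0b110:      # 110xxxxx: two-byte character
        return 2
    if lead_byte >> 4 == 0b1110:     # 1110xxxx: three-byte character
        return 3
    if lead_byte >> 3 == 0b11110:    # 11110xxx: four-byte character
        return 4
    # 10xxxxxx is a continuation byte; anything else is not a valid lead byte
    raise ValueError(f"0x{lead_byte:02X} is not a valid UTF-8 lead byte")


# Compare the announced length with what Python's own encoder produces.
for ch in ("A", "\u00e9", "\u5c4c", "\U0001F601"):
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", encoded.hex(), utf8_sequence_length(encoded[0]))
```

Only the first byte is inspected, which is why a decoder can tell how many continuation bytes to expect before reading them.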

Example encodings for the Chinese character "屌":

UTF‑8: 0xE5B18C  (11100101 10110001 10001100)
UTF‑16: 0x5C4C  (01011100 01001100)
GBK: 0x8CC5   (10001100 11000101)
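These byte values can be reproduced with Python's built-in codecs. The snippet below is only a quick check; it uses the codec name "utf-16-be" so that no byte-order mark is prepended, which is a choice made here and not something the article specifies.

```python
ch = "屌"

# The Unicode code point is independent of any encoding.
print(f"code point: U+{ord(ch):04X}")

# The same character produces different bytes under each encoding.
for codec in ("utf-8", "utf-16-be", "gbk"):
    print(f"{codec}: 0x{ch.encode(codec).hex().upper()}")
```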

Unicode is the abstract coded character set; UTF‑8 is one way to encode Unicode code points. UTF‑8 can represent all Unicode characters using one to four bytes, while older encodings like GBK use one or two bytes per character and cover a much smaller repertoire.

Emoji characters reside in Unicode planes beyond the Basic Multilingual Plane (e.g., U+1F601‑U+1F64F) and require four‑byte UTF‑8 sequences. Storing them in MySQL columns that use the legacy utf8 charset (an alias for utf8mb3, which allows at most three bytes per character) results in errors such as ERROR 1366: Incorrect string value. The solution is to use utf8mb4 or to filter/replace emojis before insertion.
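A minimal sketch of the filter/replace option, assuming the application can inspect strings before sending them to MySQL. The helper names needs_utf8mb4 and replace_supplementary are hypothetical; the check relies only on the fact that code points above U+FFFF take four bytes in UTF‑8.

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any character lies outside the Basic Multilingual Plane."""
    return any(ord(ch) > 0xFFFF for ch in text)


def replace_supplementary(text: str, placeholder: str = "\uFFFD") -> str:
    """Replace every character above U+FFFF with a placeholder."""
    return "".join(placeholder if ord(ch) > 0xFFFF else ch for ch in text)


sample = "ok \U0001F601"                      # ends with an emoji
print(len("\U0001F601".encode("utf-8")))      # UTF-8 byte count of the emoji
print(needs_utf8mb4(sample))
print(ascii(replace_supplementary(sample, "?")))
```

Replacing characters loses information, so switching the column to utf8mb4 is usually preferable when the schema can be changed.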

MySQL Garbled Text

Garbling occurs when the client, MySQL server, and table character sets are not consistent. The data flow involves three conversion steps:

Client encodes input into a binary stream.

MySQL server decodes the stream using character_set_client.

Server re‑encodes the data to the table’s charset before storage.

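When retrieving data, the reverse process occurs: the server converts the stored bytes from the table's charset to the charset the client has asked for, and the client decodes them for display. A mismatch at any step produces mojibake (garbled text). Common causes are writing data with one client encoding and reading it back with another, and inconsistent settings between character_set_client, the connection, and the table definition.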
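The data flow can be modelled with plain byte conversions. This is a simplified simulation, not MySQL itself: the functions store and fetch are hypothetical, the intermediate character_set_connection step is ignored, and Python's "latin-1" codec stands in for MySQL's latin1 (which is actually closer to cp1252).

```python
def store(client_bytes: bytes, character_set_client: str, column_charset: str) -> bytes:
    """Server decodes the incoming bytes, then re-encodes them for the column."""
    return client_bytes.decode(character_set_client).encode(column_charset)


def fetch(stored: bytes, column_charset: str, character_set_results: str) -> bytes:
    """Server decodes the stored bytes, then re-encodes them for the client."""
    return stored.decode(column_charset).encode(character_set_results)


original = "屌"
sent = original.encode("utf-8")          # the terminal really sends UTF-8

# Consistent settings: the text survives the round trip.
stored_ok = store(sent, "utf-8", "utf-8")
shown_ok = fetch(stored_ok, "utf-8", "utf-8").decode("utf-8")

# Written over a connection declared as latin1 into a latin1 column:
# the UTF-8 bytes are stored unchanged, but labelled as latin1.
stored_bad = store(sent, "latin-1", "latin-1")

# Read back with the same wrong setting: looks fine, which hides the problem.
shown_same = fetch(stored_bad, "latin-1", "latin-1").decode("utf-8")

# Read back over a utf8 connection: the server "converts" and garbles it.
shown_bad = fetch(stored_bad, "latin-1", "utf-8").decode("utf-8")

print(shown_ok == original, shown_same == original, shown_bad == original)
```

The middle case explains why such data often goes unnoticed for a long time: it only looks wrong once a client with different settings reads it.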

How to Avoid Garbling

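Ensure that the encoding the client actually uses, the session settings (character_set_client and the related connection and results variables), and the table charset are all set to the same encoding, preferably utf8mb4 for full Unicode support.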
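One way to check this in practice is to compare the values reported by SHOW VARIABLES LIKE 'character_set_%' with the intended encoding. The sketch below assumes those values have already been fetched into a dictionary; the function find_charset_mismatches and the sample values are made up for illustration, while the three variable names are real MySQL session variables.

```python
def find_charset_mismatches(settings: dict, expected: str = "utf8mb4") -> dict:
    """Return every setting whose value differs from the expected charset."""
    return {name: value for name, value in settings.items() if value != expected}


# Hypothetical values, as a client might collect them from the server.
example = {
    "character_set_client": "utf8mb4",
    "character_set_connection": "utf8mb4",
    "character_set_results": "latin1",
    "table charset": "utf8mb4",
}

print(find_charset_mismatches(example))
```

In MySQL, the statement SET NAMES utf8mb4 sets character_set_client, character_set_connection, and character_set_results together for the current session.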

Incorrect Fixes

Method 1 – ALTER TABLE ... CHARSET=xxx only changes the default charset for future columns; it does not convert existing data, so it cannot fix garbled text.

Method 2 – ALTER TABLE tbl_name CONVERT TO CHARACTER SET charset_name assumes the data is currently stored correctly. Applying it to tables that already contain mojibake can corrupt the data further.
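Why the second method makes things worse can be seen in a small simulation. The function convert_to is a hypothetical model of what a charset conversion does to each value (decode with the column's current label, re-encode to the target); Python's "latin-1" codec is used as a stand-in for MySQL's latin1, so the exact damaged bytes may differ on a real server.

```python
def convert_to(stored: bytes, column_charset: str, target_charset: str) -> bytes:
    """Model of a conversion: trust the column's label, re-encode to the target."""
    return stored.decode(column_charset).encode(target_charset)


# Healthy case: a gbk column that really holds GBK bytes converts cleanly.
healthy = convert_to("屌".encode("gbk"), "gbk", "utf-8")
print(healthy.decode("utf-8") == "屌")

# Mojibake case: the column is labelled latin1 but holds UTF-8 bytes.
mislabeled = "屌".encode("utf-8")
damaged = convert_to(mislabeled, "latin-1", "utf-8")

# Each original byte was treated as a separate latin1 character and
# re-encoded, so the value is now double-encoded and longer than before.
print(mislabeled.hex(), "->", damaged.hex())
```

The conversion is correct only when the column's label matches the bytes it holds, which is exactly what is not true for a table that already contains mojibake.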

Correct Fixes

Method 1 – Dump & Reload

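Export the corrupted data with the client charset set to the same (incorrect) encoding that was in effect when the data was written (for example, mysqldump --default-character-set=latin1 if it was written over a latin1 connection), so that the dump contains the original bytes.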

Create a new table with the correct charset (e.g., utf8mb4).

Import the dump into the new table, allowing MySQL to reinterpret the bytes correctly.

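-- Create an empty copy of the table and switch its columns to utf8mb4
CREATE TABLE new_tbl LIKE old_tbl;
ALTER TABLE new_tbl CONVERT TO CHARACTER SET utf8mb4;
# From the shell: reload the dump, declaring the encoding its bytes are really in
# (the INSERT statements in dump.sql must target new_tbl)
mysql --default-character-set=utf8mb4 -u user -p db < dump.sql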

Method 2 – Convert to Binary & Convert Back

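-- Read the stored bytes without charset conversion, then reinterpret them
-- as utf8mb4 while copying into the new table.
-- This assumes the stored bytes are valid UTF-8 under a wrong column label.
INSERT INTO new_tbl (col)
SELECT CONVERT(CONVERT(col USING BINARY) USING utf8mb4)
FROM tbl;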
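The same idea, stripping the wrong label to get the raw bytes and then applying the right one, can be shown on a single string. The helper reinterpret is hypothetical, and "latin-1" is assumed to be the wrong label; the real wrong charset depends on how the data was originally written.

```python
def reinterpret(garbled: str, wrong_charset: str = "latin-1",
                real_charset: str = "utf-8") -> str:
    """Undo a wrong decode: recover the raw bytes, then decode them correctly."""
    raw = garbled.encode(wrong_charset)   # back to the bytes as stored
    return raw.decode(real_charset)       # interpret them with the real encoding


# Simulate mojibake: UTF-8 bytes that were read as latin1.
garbled = "屌".encode("utf-8").decode("latin-1")

print(ascii(garbled))                      # what the garbled text looks like
print(reinterpret(garbled) == "屌")
```

If the bytes are not valid in the assumed real charset, decode raises UnicodeDecodeError, which is a useful signal that the guess about the original encoding is wrong.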

