Mastering Character Sets: Diagnose and Fix MySQL Garbled Text
This article explains character sets and encodings, shows why MySQL data can become garbled, and provides step‑by‑step methods—including proper configuration, common pitfalls, and reliable repair techniques—to prevent and correct encoding issues such as emoji storage failures.
Character Sets and Encodings
A character set defines which symbols can be represented and assigns each one a number, its code point. A character encoding specifies how those code points are serialized as bytes. Unicode is a character set; UTF‑8 and UTF‑16 are encodings of Unicode; ASCII and GBK each define both a repertoire and its byte representation. UTF‑8 is a variable‑length encoding in which the leading bits of the first byte indicate how many bytes the character occupies:
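If the first bit is 0, the byte is a complete single‑byte character (7 bits of data).
If the first three bits are 110, the byte starts a two‑byte character (5 bits + 6 bits from the following byte).
If the first four bits are 1110, the byte starts a three‑byte character (4 bits + 6 bits from each of the two following bytes).
If the first five bits are 11110, the byte starts a four‑byte character (3 bits + 6 bits from each of the three following bytes).
Bytes beginning with 10 are continuation bytes and never start a character.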
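The bit patterns above can be checked with a few shifts. The following is a minimal Python sketch, not part of the original article; the function name utf8_sequence_length is made up for illustration. It also covers the lead pattern 11110, which UTF‑8 uses for four‑byte characters.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return the total byte count announced by a UTF-8 lead byte."""
    if lead_byte >> 7 == 0b0:        # 0xxxxxxx: single-byte character
        return 1
    if lead_byte >> 5 == 0b110:      # 110xxxxx: starts a two-byte character
        return 2
    if lead_byte >> 4 == 0b1110:     # 1110xxxx: starts a three-byte character
        return 3
    if lead_byte >> 3 == 0b11110:    # 11110xxx: starts a four-byte character
        return 4
    # 10xxxxxx is a continuation byte; 11111xxx never appears in UTF-8
    raise ValueError(f"0x{lead_byte:02X} is not a valid lead byte")


# Compare the announced length with what Python's own encoder produces.
for ch in ("A", "é", "屌", "😁"):
    encoded = ch.encode("utf-8")
    assert utf8_sequence_length(encoded[0]) == len(encoded)
    print(ch, encoded.hex(), len(encoded))
```

The function only inspects the lead byte; it does not validate the continuation bytes that follow.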
Example encodings for the Chinese character "屌" (U+5C4C):
UTF‑8: 0xE5B18C (11100101 10110001 10001100)
UTF‑16: 0x5C4C (01011100 01001100)
GBK: 0x8CC5 (10001100 11000101)
Unicode is the abstract coded character set; UTF‑8 is one way to encode Unicode code points. UTF‑8 can represent every Unicode character using one to four bytes, while older encodings such as GBK use one or two bytes per character and cover a much smaller repertoire.
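To see how one code point turns into different byte sequences, the example values can be reproduced with Python's built‑in codecs. This is an illustrative sketch only; "utf-16-be" is chosen here so that no byte order mark is added, which is an assumption about how the UTF‑16 value above was written.

```python
ch = "屌"
print(f"code point: U+{ord(ch):04X}")

utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")   # big-endian, no byte order mark
gbk = ch.encode("gbk")

# Print each encoding as hex and as groups of eight bits.
for name, raw in (("UTF-8", utf8), ("UTF-16", utf16), ("GBK", gbk)):
    bits = " ".join(f"{b:08b}" for b in raw)
    print(f"{name}: 0x{raw.hex().upper()} ({bits})")
```

The same character yields three unrelated byte sequences, which is why bytes written under one encoding cannot be read back under another.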
Most emoji live outside the Basic Multilingual Plane (e.g., U+1F601 to U+1F64F) and therefore require four‑byte UTF‑8 sequences. MySQL’s legacy utf8 charset (an alias for utf8mb3) stores at most three bytes per character, so inserting an emoji into such a column fails with an error such as ERROR 1366: Incorrect string value. The solution is to use utf8mb4, or to filter or replace emoji before insertion.
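When switching to utf8mb4 is not possible, the filter or replace option can be sketched in application code. The helper names needs_utf8mb4 and replace_supplementary below are hypothetical, and the sketch assumes that "emoji" can be approximated as "any code point above U+FFFF", which also catches other supplementary‑plane characters such as rare CJK ideographs.

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any character lies above U+FFFF (four bytes in UTF-8)."""
    return any(ord(ch) > 0xFFFF for ch in text)


def replace_supplementary(text: str, placeholder: str = "\uFFFD") -> str:
    """Swap every character above U+FFFF for a placeholder."""
    return "".join(placeholder if ord(ch) > 0xFFFF else ch for ch in text)


message = "Nice work 😁"
print(len("😁".encode("utf-8")))            # 4
print(needs_utf8mb4(message))               # True
print(replace_supplementary(message, "?"))  # Nice work ?
```

Replacing characters loses information, so this is best treated as a stopgap for columns that cannot yet be migrated.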
MySQL Garbled Text
Garbling occurs when the client, MySQL server, and table character sets are not consistent. The data flow involves three conversion steps:
Client encodes input into a binary stream.
MySQL server decodes the stream using character_set_client.
Server re‑encodes the data to the table’s charset before storage.
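When data is retrieved, the reverse process occurs: the server decodes the stored bytes using the table’s charset, re‑encodes them using character_set_results, and the client decodes what it receives. A mismatch at any step produces mojibake (garbled text). Typical causes are a client that writes in one encoding and reads in another, and a character_set_client value that does not match the bytes the client actually sends.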
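The write and read paths can be simulated without a database. The sketch below is a simplified model, not MySQL code: store and fetch are hypothetical helpers, Python's "latin-1" codec stands in for MySQL's latin1, and character_set_results is the session variable the server uses when sending results. It shows a client that really sends UTF‑8 while the session is declared as latin1.

```python
def store(text: str, actual_client_encoding: str,
          character_set_client: str, table_charset: str) -> bytes:
    """Simulate a write: client encodes, server decodes, server re-encodes."""
    wire_bytes = text.encode(actual_client_encoding)       # step 1
    server_text = wire_bytes.decode(character_set_client)  # step 2
    return server_text.encode(table_charset)               # step 3


def fetch(stored: bytes, table_charset: str,
          character_set_results: str, actual_client_encoding: str) -> str:
    """Simulate a read: the same steps in reverse."""
    server_text = stored.decode(table_charset)
    wire_bytes = server_text.encode(character_set_results)
    return wire_bytes.decode(actual_client_encoding)


text = "中文"
# The client sends UTF-8, but the session claims latin1; the table is UTF-8.
stored = store(text, "utf-8", "latin-1", "utf-8")
print(len(text.encode("utf-8")), len(stored))  # 6 bytes sent, 12 bytes stored

# The same misconfigured client reads its own data back intact...
same_client = fetch(stored, "utf-8", "latin-1", "utf-8")
# ...while a correctly configured client sees mojibake.
correct_client = fetch(stored, "utf-8", "utf-8", "utf-8")
print(same_client == text, correct_client == text)  # True False
```

The two wrong conversions cancel out for the misconfigured client, which is why this kind of corruption often goes unnoticed until a second client reads the table.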
How to Avoid Garbling
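Ensure that the client’s actual encoding, the session variables (character_set_client, character_set_connection, character_set_results), and the table charset all agree, preferably on utf8mb4 for full Unicode support. Running SET NAMES utf8mb4 at the start of a session sets the three session variables together.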
Incorrect Fixes
Method 1 – ALTER TABLE ... CHARSET=xxx only changes the default charset for future columns; it does not convert existing data, so it cannot fix garbled text.
Method 2 – ALTER TABLE tbl_name CONVERT TO CHARACTER SET charset_name assumes the data is currently stored correctly. Applying it to tables that already contain mojibake can corrupt the data further.
Correct Fixes
Method 1 – Dump & Reload
Export the corrupted data using the client’s current (incorrect) charset.
Create a new table with the correct charset (e.g., utf8mb4).
Import the dump into the new table, allowing MySQL to reinterpret the bytes correctly.
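-- In the mysql client: create an empty copy of the table and switch it to utf8mb4
CREATE TABLE new_tbl LIKE old_tbl;
ALTER TABLE new_tbl CONVERT TO CHARACTER SET utf8mb4;
# In the shell: load the dump while declaring its bytes as utf8mb4
mysql --default-character-set=utf8mb4 -u user -p db < dump.sql
Method 2 – Convert to Binary & Convert Back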
-- Inspect the raw bytes first
SELECT CONVERT(col USING binary) AS bin_col FROM tbl;
-- Re‑import with the proper charset: drop each value to raw bytes, then relabel the bytes as utf8mb4.
-- Assumes the bytes stored in tbl.col are valid UTF‑8 that was labeled with the wrong charset.
INSERT INTO new_tbl (col)
SELECT CONVERT(CONVERT(col USING binary) USING utf8mb4) FROM tbl;
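The difference between the binary round trip of Method 2 and the CONVERT TO approach listed under Incorrect Fixes can be sketched at the byte level. This is a simplified Python model under stated assumptions: the column is declared latin1 but actually holds UTF‑8 bytes, Python's "latin-1" codec stands in for MySQL's latin1, and transcode and relabel are hypothetical names.

```python
def transcode(stored: bytes, declared_charset: str, target_charset: str) -> bytes:
    """Model of CONVERT TO CHARACTER SET: trust the declared charset, then re-encode."""
    return stored.decode(declared_charset).encode(target_charset)


def relabel(stored: bytes, target_charset: str) -> bytes:
    """Model of the binary round trip: keep the bytes, only change the label."""
    stored.decode(target_charset)  # raises UnicodeDecodeError if the bytes do not fit
    return stored


# UTF-8 bytes sitting in a column that is declared as latin1.
stored = "中文".encode("utf-8")

converted = transcode(stored, "latin-1", "utf-8")
relabeled = relabel(stored, "utf-8")

print(len(stored), len(converted), len(relabeled))  # 6 12 6
print(converted.decode("utf-8") == "中文")  # False: the mojibake is now permanent
print(relabeled.decode("utf-8") == "中文")  # True: the original text is recovered
```

Transcoding faithfully preserves the wrong interpretation, whereas relabeling discards it, which is why the binary step has to come first.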
dbaplus Community