Why MySQL Strings Get Garbled: Mastering Charset and Collation
This article dives deep into MySQL's charset and collation system, explaining concepts, configuration levels, system variables, string literals, conversion rules, Unicode sorting algorithms, binary collations, and practical tips to avoid common encoding pitfalls and ensure correct string handling.
Introduction
When using MySQL you may encounter garbled strings, unexpected sorting results, or mismatched character sets. This article explains the concepts of charset and collation, how they are configured at various levels, and how they affect string storage and comparison.
Basic Concepts
A charset (character set) defines a mapping between characters and byte sequences. A collation defines the rules for comparing characters within a charset. MySQL supports many charsets (e.g., utf8mb4, latin1) and collations (e.g., utf8mb4_0900_ai_ci, utf8mb4_bin).
Viewing Supported Charsets and Collations
SHOW CHARACTER SET; SHOW COLLATION WHERE Charset='utf8mb4';System Variables
The most important variables for connections are character_set_client, character_set_results, character_set_connection, and collation_connection. They control how the server interprets incoming literals, how it stores data, and how it returns results.
Setting Variables
Use SET NAMES 'charset_name' to set character_set_client, character_set_results, and character_set_connection to the same charset. Use SET NAMES 'charset_name' COLLATE 'collation_name' to also set the collation.
Creating Databases, Tables, and Columns
CREATE DATABASE db_name CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci; CREATE TABLE t1 (a CHAR(5) CHARACTER SET utf8mb4, b CHAR(5) CHARACTER SET latin1);If a charset or collation is omitted, the object inherits the defaults from its parent (server → database → table → column).
String Literals and Introducers
String literals have an implicit charset defined by character_set_connection. You can override it with an introducer, e.g., _latin1'text'. Introducers do not convert the literal; they only tell the parser which charset to assume.
Common Pitfalls
Using the wrong client charset leads to double conversion and garbled data.
Introducing a literal with the wrong charset (e.g., _latin1'张') can insert raw UTF‑8 bytes into a latin1 column, producing unreadable data.
Binary literals ( X'...', b'...', _binary'...') bypass charset checks and can store any byte sequence, even if the target column’s charset does not support those characters.
Unicode Collation Algorithm (UCA)
MySQL uses the Unicode Collation Algorithm to compare strings. Collations such as utf8mb4_0900_as_cs are based on UCA 9.0.0 and consider three levels: primary (base character), secondary (accent), and tertiary (case). The algorithm builds a sort key from collation elements for each character and compares the keys.
SELECT WEIGHT_STRING(_utf8mb4'aáA' COLLATE utf8mb4_0900_as_cs);Collations ending with _ai ignore accents, _as are accent‑sensitive, _ci ignore case, and _cs are case‑sensitive.
Binary Collations
Binary collations compare strings byte‑by‑byte. For non‑binary charsets, the _bin collation first converts characters to their Unicode code points (often padded to a fixed width) before comparing. The newer utf8mb4_0900_bin compares the raw UTF‑8 bytes directly, offering better performance.
SELECT WEIGHT_STRING(_utf8mb4'ç' COLLATE utf8mb4_bin);Coercibility Rules
When an expression mixes strings with different collations, MySQL chooses the collation with the lowest coercibility value. Explicit COLLATE clauses have coercibility 0, column values 2, literals 4, and NULL 6. If two collations have the same coercibility, an error is raised unless one is a Unicode charset, in which case the Unicode charset wins.
Changing Column Charsets
Use ALTER TABLE t MODIFY col VARCHAR(10) CHARACTER SET latin1; or ALTER TABLE t CONVERT TO CHARACTER SET latin1;. Be aware that binary columns copy bytes directly, while character columns attempt conversion and may replace unsupported characters with '?'.
Functions and Case Conversion
Functions like LOWER() and UPPER() work on non‑binary collations. They do not work on binary strings unless you first convert them to a non‑binary charset.
Practical Recommendations
Set the client, connection, and result charsets consistently using SET NAMES.
Avoid mixing introducers with a client charset that differs from the introducer.
Prefer utf8mb4 with utf8mb4_0900_ai_ci for general use; use utf8mb4_0900_as_cs if you need accent‑ and case‑sensitive sorting.
Use utf8mb4_0900_bin when you need exact byte comparison while still being able to call UPPER() / LOWER().
Inspect sort behavior with WEIGHT_STRING() before committing to a collation.
Never store non‑BMP characters in a collation based on UCA 4.0.0 (e.g., utf8mb4_general_ci) if you need them to be distinct.
Conclusion
Understanding MySQL's charset and collation system is essential for reliable string storage, comparison, and sorting. By configuring the appropriate charset, choosing the right collation, and being aware of conversion rules and binary behavior, you can avoid garbled data and achieve the desired ordering semantics.
References
Charset
UCA 9.0.0
UCA 5.2.0
UCA 4.0.0
Unicode
UTF-8
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
