Understanding Character Encoding: ASCII, Unicode, UTF‑8, GBK and Common Garbled‑Text Issues
This article explains the fundamentals of character encoding—from Morse‑code analogies and ASCII to Unicode, UTF‑8/UTF‑16/UTF‑32, and Chinese encodings like GB2312, GBK, and GB18030—illustrating how mismatched encodings cause garbled text and how to avoid them.
The article begins with a casual anecdote to introduce the problem of garbled text and states that understanding computer character encoding is essential to avoid it.
It uses the analogy of Morse code in spy dramas to explain how characters are transformed into a series of signals, just as computers convert human‑readable text into binary code.
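To make the analogy concrete, here is a minimal sketch of text becoming bits. The class name TextToBits and the method toBits are made up for this illustration, and the sketch assumes the input contains only ASCII characters.

```java
import java.nio.charset.StandardCharsets;

class TextToBits {
    // Encodes the text as ASCII and renders each byte as eight binary digits.
    static String toBits(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.US_ASCII)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            String bits = Integer.toBinaryString(b & 0xFF);
            // Left-pad with zeros to a full 8-bit byte.
            sb.append(String.format("%8s", bits).replace(' ', '0'));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBits("SOS")); // 01010011 01001111 01010011
    }
}
```

Like a Morse operator's code book, the mapping only works if sender and receiver agree on it, which is the root of every garbled-text problem discussed later.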
ASCII is introduced as the first standardized 7‑bit encoding for English characters, defining 128 symbols including letters, digits, punctuation, and control codes.
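As a rough illustration of the 7‑bit range, the sketch below prints a few ASCII code values and checks whether a string fits in ASCII at all. AsciiRange and isAscii are hypothetical names; U+6F2B is the first character of the demo string used later in the article.

```java
import java.nio.charset.StandardCharsets;

class AsciiRange {
    // True when every character can be encoded as 7-bit ASCII (values 0 to 127).
    static boolean isAscii(String text) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println((int) 'A');  // 65
        System.out.println((int) 'a');  // 97
        System.out.println((int) '0');  // 48
        System.out.println((int) '\n'); // 10, a control code (line feed)
        System.out.println(isAscii("Hello, world!")); // true
        System.out.println(isAscii("\u6F2B"));        // false, a Chinese character
    }
}
```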
Unicode is then presented as a universal character set that can represent virtually all world scripts, providing a common foundation for modern software and internationalization.
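Unicode assigns each character a number called a code point, conventionally written as U+ followed by hexadecimal digits. The following sketch lists the code points in a string; CodePoints and describe are illustrative names, not part of the original article.

```java
import java.util.stream.Collectors;

class CodePoints {
    // Lists the Unicode code point of each character in U+XXXX notation.
    static String describe(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // A Latin letter followed by a Chinese character.
        System.out.println(describe("A\u6F2B"));      // U+0041 U+6F2B
        // An emoji that Java stores as two char values (a surrogate pair).
        System.out.println(describe("\uD83D\uDE00")); // U+1F600
    }
}
```

Note that a code point is only a number; how that number is laid out in bytes is a separate decision.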
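The article describes how Unicode defines code points but does not dictate how they are stored, leading to transformation formats such as UTF‑8, UTF‑16, and UTF‑32. UTF‑8 uses one to four bytes per character and is backward compatible with ASCII, UTF‑16 uses two or four bytes, and UTF‑32 always uses four bytes, so each format trades compatibility against storage efficiency differently.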
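The size differences between the three formats can be checked directly. This sketch compares encoded lengths for a Latin letter, a Chinese character (U+6F2B), and an emoji (U+1F600). EncodedLengths and byteLength are hypothetical names, and the lookup of "UTF-32BE" assumes the running JDK ships that charset, which is common but not guaranteed by the Java specification.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedLengths {
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Assumption: this JDK provides UTF-32BE.
        Charset utf32 = Charset.forName("UTF-32BE");
        String[] samples = {"A", "\u6F2B", "\uD83D\uDE00"};
        for (String s : samples) {
            // The big-endian variants are used so that no byte order mark is added.
            System.out.printf("UTF-8: %d, UTF-16: %d, UTF-32: %d%n",
                    byteLength(s, StandardCharsets.UTF_8),
                    byteLength(s, StandardCharsets.UTF_16BE),
                    byteLength(s, utf32));
        }
        // Prints:
        // UTF-8: 1, UTF-16: 2, UTF-32: 4
        // UTF-8: 3, UTF-16: 2, UTF-32: 4
        // UTF-8: 4, UTF-16: 4, UTF-32: 4
    }
}
```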
Chinese-specific encodings—GB2312, GBK, and GB18030—are detailed, highlighting their historical development, character coverage, advantages, and limitations compared to Unicode.
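One practical difference between the GB family and UTF‑8 is size and coverage. The sketch below is an illustration under assumptions: GbFamily and its methods are made-up names, and it assumes the JDK provides the GBK and GB18030 charsets, as the article's own example does. GB2312 is left out because fewer Java runtimes are certain to include it.

```java
import java.nio.charset.Charset;

class GbFamily {
    static int byteLength(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    static boolean canEncode(String text, String charsetName) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String han = "\u6F2B";         // a common Chinese character
        String emoji = "\uD83D\uDE00"; // U+1F600, outside GBK's repertoire

        System.out.println(byteLength("A", "GBK"));       // 1, ASCII stays one byte
        System.out.println(byteLength(han, "GBK"));       // 2
        System.out.println(byteLength(han, "UTF-8"));     // 3
        System.out.println(canEncode(emoji, "GBK"));      // false
        System.out.println(canEncode(emoji, "GB18030"));  // true
        System.out.println(byteLength(emoji, "GB18030")); // 4
    }
}
```

GBK is more compact than UTF‑8 for Chinese text but cannot represent every Unicode character, whereas GB18030 covers the full Unicode range by adding four-byte sequences.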
To demonstrate garbled text, the article shows a Java example where a string encoded in GBK is incorrectly decoded with UTF‑8, producing unreadable characters; the full code is shown below:
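import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
// GarbledTextDemo is an illustrative class name; the original shows only main().
public class GarbledTextDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "漫话编程!";
        // Encode once with GBK, then decode the same bytes three different ways.
        byte[] bytes = s.getBytes(Charset.forName("GBK"));
        System.out.println("GBK-encoded, GBK-decoded: " + new String(bytes, "GBK"));
        System.out.println("GBK-encoded, GB18030-decoded: " + new String(bytes, "GB18030"));
        System.out.println("GBK-encoded, UTF-8-decoded: " + new String(bytes, "UTF-8"));
    }
}

The program's output shows correct decoding with GBK and GB18030, since GB18030 is backward compatible with GBK. Decoded as UTF‑8, the Chinese characters come out as replacement characters (�), which some consoles print as question marks, illustrating how mismatched encodings generate garbled output.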
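One way to catch such a mismatch instead of silently printing garbage is to decode strictly. The constructor new String(bytes, charset) replaces invalid input, while a CharsetDecoder configured with CodingErrorAction.REPORT throws. This is only a sketch of that idea; StrictDecode and isValidUtf8 are hypothetical names, and the escaped string is intended to be the same sample text as in the example above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecode {
    // Returns true only if the bytes form well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String s = "\u6F2B\u8BDD\u7F16\u7A0B!";
        byte[] gbkBytes = s.getBytes(Charset.forName("GBK"));
        byte[] utf8Bytes = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(isValidUtf8(gbkBytes));  // false
        System.out.println(isValidUtf8(utf8Bytes)); // true
    }
}
```

A strict check like this cannot tell which encoding the bytes were written in, only that UTF‑8 is the wrong guess, so it works best alongside an explicit, agreed-upon charset at every read and write.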
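Finally, the article explains the Unicode replacement character (�, U+FFFD), which a decoder substitutes for byte sequences it cannot interpret, and mentions other classic garbled patterns: "锟斤拷", which appears when replacement characters saved as UTF‑8 are read back as GBK, and memory‑initialization artifacts like "烫" and "屯", which appear when uninitialized memory filled with a repeated byte (0xCC or 0xCD in Microsoft Visual C++ debug builds) is displayed as GBK text.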
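These patterns can be reproduced in a few lines. The sketch below assumes the JDK provides the GBK charset; ClassicMojibake and its method names are invented for illustration. The first method should yield "锟斤拷", and the second yields runs of "烫" for the fill byte 0xCC and "屯" for 0xCD.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ClassicMojibake {
    private static final Charset GBK = Charset.forName("GBK");

    // Two U+FFFD characters saved as UTF-8 are the bytes EF BF BD EF BF BD.
    // Read as GBK, they regroup into three two-byte characters.
    static String kunJinKao() {
        byte[] utf8 = "\uFFFD\uFFFD".getBytes(StandardCharsets.UTF_8);
        return new String(utf8, GBK);
    }

    // Simulates uninitialized memory: a run of one fill byte read as GBK text.
    // The count should be even because each GBK character here takes two bytes.
    static String fillBytesAsGbk(int fill, int count) {
        byte[] bytes = new byte[count];
        Arrays.fill(bytes, (byte) fill);
        return new String(bytes, GBK);
    }

    public static void main(String[] args) {
        System.out.println(kunJinKao());
        System.out.println(fillBytesAsGbk(0xCC, 4));
        System.out.println(fillBytesAsGbk(0xCD, 4));
    }
}
```

Once text has passed through the replacement character, the original bytes are gone, so "锟斤拷" cannot be repaired by re-decoding; the fix has to happen where the data was first misread.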
Full-Stack Internet Architecture
Introducing full-stack Internet architecture technologies centered on Java