Understanding Character Encoding: ASCII, Unicode, UTF‑8, GBK and Common Garbled‑Text Issues

This article explains the fundamentals of character encoding—from Morse‑code analogies and ASCII to Unicode, UTF‑8/UTF‑16/UTF‑32, and Chinese encodings like GB2312, GBK, and GB18030—illustrating how mismatched encodings cause garbled text and how to avoid them.

Full-Stack Internet Architecture

Aug 28, 2019

The article begins with a casual anecdote to introduce the problem of garbled text and states that understanding computer character encoding is essential to avoid it.

It uses the analogy of Morse code in spy dramas to explain how characters are transformed into a series of signals, just as computers convert human‑readable text into binary code.

ASCII is introduced as the first standardized 7‑bit encoding for English characters, defining 128 symbols including letters, digits, punctuation, and control codes.
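As a small illustration of the 7‑bit range described above, the following sketch prints the ASCII code of each character in a short string and checks whether a string stays within 0 to 127. The class name `AsciiDemo` and the helper `isAscii` are hypothetical names chosen for this example, not part of the original article.

```java
import java.nio.charset.StandardCharsets;

public class AsciiDemo {
    // Returns true if every char in the text fits in the 7-bit ASCII range (0..127).
    static boolean isAscii(String text) {
        return text.chars().allMatch(c -> c < 128);
    }

    public static void main(String[] args) {
        // Each ASCII character is stored as exactly one byte.
        byte[] bytes = "Hi!".getBytes(StandardCharsets.US_ASCII);
        for (byte b : bytes) {
            System.out.printf("%c -> %d (0x%02X)%n", (char) b, b, b);
        }
        System.out.println(isAscii("Hi!"));          // plain English text
        System.out.println(isAscii("\u6F2B\u8BDD")); // two Chinese characters, outside ASCII
    }
}
```

The Chinese characters are written as `\u` escapes (here 漫话) so the source file itself stays pure ASCII and compiles the same way regardless of the compiler's default encoding.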

Unicode is then presented as a universal character set that can represent virtually all world scripts, providing a common foundation for modern software and internationalization.

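The article describes how Unicode defines code points but does not dictate how they are stored, which leads to transformation formats such as UTF‑8, UTF‑16, and UTF‑32. UTF‑8 uses one to four bytes per code point and is byte-compatible with ASCII, UTF‑16 uses two or four bytes, and UTF‑32 always uses four, so each format trades compatibility against storage efficiency differently.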
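To make the size differences between the three formats concrete, here is a minimal sketch that counts the bytes each one needs for a few sample characters (A, é, 漫, and the emoji U+1F600). The class `UtfLengths` and the helper `byteLength` are hypothetical names. It assumes the JDK provides a `UTF-32BE` charset, which is not among the charsets Java guarantees on every platform, although common JDK builds ship it. The big-endian variants are used so that no byte order mark is added to the count.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class UtfLengths {
    // Number of bytes needed to store the text in the given charset.
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Assumption: UTF-32BE is available on this JDK (it is not in StandardCharsets).
        Charset utf32 = Charset.forName("UTF-32BE");
        String[] samples = {
            "A",            // U+0041, ASCII
            "\u00E9",       // U+00E9, Latin small e with acute
            "\u6F2B",       // U+6F2B, a Chinese character
            "\uD83D\uDE00"  // U+1F600, an emoji outside the Basic Multilingual Plane
        };
        for (String s : samples) {
            System.out.printf("U+%04X  UTF-8: %d  UTF-16: %d  UTF-32: %d%n",
                    s.codePointAt(0),
                    byteLength(s, StandardCharsets.UTF_8),
                    byteLength(s, StandardCharsets.UTF_16BE),
                    byteLength(s, utf32));
        }
    }
}
```

The UTF‑8 column grows from 1 to 4 bytes across the samples, UTF‑16 stays at 2 bytes until the emoji needs a surrogate pair (4 bytes), and UTF‑32 is 4 bytes throughout.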

Chinese-specific encodings—GB2312, GBK, and GB18030—are detailed, highlighting their historical development, character coverage, advantages, and limitations compared to Unicode.
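The difference in character coverage can be probed from Java. The sketch below asks each of the three charsets whether it can encode a sample and, if so, how many bytes it takes. `GbCoverage` and `encodedLength` are hypothetical names, and the example assumes the JDK in use ships the `GB2312`, `GBK`, and `GB18030` charsets (some minimal runtimes omit them). The samples are the letter A, the character 漫, and the emoji U+1F600.

```java
import java.nio.charset.Charset;

public class GbCoverage {
    // Returns the encoded length in bytes, or -1 if the charset
    // cannot represent the text at all.
    static int encodedLength(String text, String charsetName) {
        Charset charset = Charset.forName(charsetName);
        if (!charset.newEncoder().canEncode(text)) {
            return -1;
        }
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String[] charsets = {"GB2312", "GBK", "GB18030"};
        String[] samples = {
            "A",            // ASCII letter
            "\u6F2B",       // a common Chinese character
            "\uD83D\uDE00"  // U+1F600, an emoji outside the Basic Multilingual Plane
        };
        for (String sample : samples) {
            StringBuilder line =
                    new StringBuilder(String.format("U+%04X", sample.codePointAt(0)));
            for (String name : charsets) {
                line.append("  ").append(name).append(": ")
                    .append(encodedLength(sample, name));
            }
            System.out.println(line);
        }
    }
}
```

The ASCII letter takes one byte and the common Chinese character two bytes in all three encodings, while the emoji is only representable in GB18030, where it takes four bytes. That is the practical sense in which GB18030 covers the full Unicode range and the older encodings do not.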

To demonstrate garbled text, the article shows a Java example where a string encoded in GBK is incorrectly decoded with UTF‑8, producing unreadable characters; the full code is shown below:

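import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

public class GarbledTextDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "漫话编程！";
        // Encode once as GBK, then decode the same bytes three different ways.
        byte[] bytes = s.getBytes(Charset.forName("GBK"));
        System.out.println("GBK-encoded, GBK-decoded: " + new String(bytes, "GBK"));
        System.out.println("GBK-encoded, GB18030-decoded: " + new String(bytes, "GB18030"));
        System.out.println("GBK-encoded, UTF-8-decoded: " + new String(bytes, "UTF-8"));
    }
}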

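The program's output shows correct decoding with GBK and GB18030 (GB18030 is designed to be backward compatible with GBK, so the same bytes decode to the same text). Decoding as UTF‑8 instead yields mostly replacement characters (�), which some consoles render as question marks, illustrating how mismatched encodings generate garbled output.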
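One way to catch such a mismatch early, rather than discovering it as garbled output, is to decode strictly. `new String(bytes, charset)` silently substitutes replacement characters for invalid input, whereas a `CharsetDecoder` configured with `CodingErrorAction.REPORT` raises an error. The sketch below wraps that in a hypothetical helper `isValid`. It is one possible approach, not something the original article prescribes, and it assumes the JDK ships the GBK charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    // Returns true only if the bytes form a well-formed sequence in the given
    // charset; malformed or unmappable input is reported instead of replaced.
    static boolean isValid(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Same sample text as the article's demo, written with escapes.
        String original = "\u6F2B\u8BDD\u7F16\u7A0B\uFF01";
        byte[] gbkBytes = original.getBytes(Charset.forName("GBK"));

        System.out.println("valid as GBK:   " + isValid(gbkBytes, Charset.forName("GBK")));
        System.out.println("valid as UTF-8: " + isValid(gbkBytes, StandardCharsets.UTF_8));
    }
}
```

A successful strict decode does not prove the guess was right, since some byte sequences are valid in several encodings. A failed one does prove the guess was wrong, which is usually enough to reject bad input at a system boundary.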

Finally, the article explains the Unicode replacement character (�, U+FFFD) used to represent unrecognizable bytes, and mentions other classic garbled patterns such as "锟斤拷" and memory‑initialization artifacts like "烫" and "屯" that arise from specific byte values.
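These classic patterns can be reproduced in a few lines. The sketch below rests on two facts that the summary above does not spell out and that are supplied here as background: U+FFFD is stored in UTF‑8 as the bytes EF BF BD, and Microsoft's debug builds are widely known to fill uninitialized stack memory with 0xCC and uninitialized heap memory with 0xCD. Reading those bytes as GBK produces the familiar characters. `ClassicMojibake` and its methods are hypothetical names, and the GBK charset is assumed to be available.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ClassicMojibake {
    static final Charset GBK = Charset.forName("GBK");

    // Simulates text that was already damaged: replacement characters (U+FFFD)
    // are saved as UTF-8 (EF BF BD each) and those bytes are later read as GBK.
    static String replacementCharsReadAsGbk(String replaced) {
        byte[] utf8 = replaced.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, GBK);
    }

    // Simulates an uninitialized buffer: every byte has the same fill value,
    // and the buffer is interpreted as GBK text.
    static String fillReadAsGbk(int fill, int length) {
        byte[] buffer = new byte[length];
        Arrays.fill(buffer, (byte) fill);
        return new String(buffer, GBK);
    }

    public static void main(String[] args) {
        // Two U+FFFD -> EF BF BD EF BF BD -> GBK pairs EFBF, BDEF, BFBD.
        System.out.println(replacementCharsReadAsGbk("\uFFFD\uFFFD"));
        // 0xCC 0xCC and 0xCD 0xCD are each one two-byte GBK character.
        System.out.println(fillReadAsGbk(0xCC, 6));
        System.out.println(fillReadAsGbk(0xCD, 6));
    }
}
```

On a console that can display Chinese text, the three lines print 锟斤拷, 烫烫烫, and 屯屯屯. The first case also shows why this kind of damage is irreversible: once the original bytes have been replaced by U+FFFD, the information needed to recover the text is gone.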
