Why Do You See Garbled Text? Master Character Encoding (ASCII, Unicode, UTF‑8, GBK)

This article explains the fundamentals of character encoding, covering character sets, encoding rules, and common schemes such as ASCII, Unicode, UTF‑8, GB2312, GBK, and GB18030, helping developers prevent and resolve garbled text issues in web and database applications.

ITPUB

Garbled text is a common headache when developing web applications or working with databases. The root cause is almost always a character‑encoding mismatch. Character encoding involves two concepts: the character set (the collection of abstract characters) and the encoding (the rules that map those characters to binary representations).

What Is a Character Set?

A character set is the complete collection of symbols a system supports, including letters, numbers, punctuation, and graphical symbols from various languages.

What Is Character Encoding?

Encoding defines how characters are translated into numeric codes that computers can store and process. Since computers operate on binary 0s and 1s, each character must be represented by a specific sequence of bits. Decoding reverses this process; mismatched encoding and decoding produce the dreaded "garbled" output.
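The mismatch described above can be reproduced in a few lines. The following is a minimal sketch, assuming Python 3 and its built-in codecs; the sample string and variable names are illustrative only.

```python
# Sample text; any non-ASCII string would show the same effect.
text = "汉字"

# Encode with UTF-8, the encoding the writer intended.
utf8_bytes = text.encode("utf-8")

# Decoding with the same encoding restores the original text.
correct = utf8_bytes.decode("utf-8")

# Decoding the same bytes with a different encoding yields garbled text.
# errors="replace" substitutes U+FFFD for any sequence the codec rejects.
garbled_gbk = utf8_bytes.decode("gbk", errors="replace")
# Latin-1 maps every byte to one character, so it never fails,
# but the result is still not the original text.
garbled_latin1 = utf8_bytes.decode("latin-1")

print(utf8_bytes.hex())            # the raw bytes, in hexadecimal
print(correct == text)             # matching codec: round trip succeeds
print(garbled_gbk == text)         # mismatched codec: text is garbled
print(len(garbled_latin1))         # one character per byte
```

The bytes never change; only the interpretation does, which is why garbled text can usually be fixed by finding the right decoder rather than by repairing the data.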

ASCII Encoding

ASCII is the earliest widely used encoding. It is a 7‑bit code that defines 128 characters: control characters, basic English letters, digits, and common symbols. Each character is normally stored in a single byte with the high bit set to 0. The following image illustrates the ASCII table:

[Image: ASCII table]

While sufficient for English, ASCII cannot represent characters from other languages, prompting the development of extended encodings.
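A quick way to see this limit is to try encoding text as ASCII. The sketch below assumes Python 3; the helper name is_ascii is hypothetical and introduced only for illustration.

```python
def is_ascii(text: str) -> bool:
    """Return True if every character in text can be encoded as ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        # Raised for any character whose code is 128 or higher.
        return False


# Each ASCII character maps to a number below 128 and to exactly one byte.
code_of_a = ord("A")
bytes_of_a = "A".encode("ascii")

print(code_of_a, bytes_of_a)
print(is_ascii("Hello, world!"))
print(is_ascii("汉字"))
```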

Unicode and UTF‑8

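Unicode assigns a unique numeric identifier, called a code point, to every character in the writing systems it covers, enabling cross‑language and cross‑platform text handling. Code points range from U+0000 to U+10FFFF, so a fixed 2 bytes (16 bits) cannot hold every one of them. Unicode itself does not prescribe a byte layout; that is the job of encodings such as UTF‑8, UTF‑16 (one or two 16‑bit units per code point), and UTF‑32 (4 bytes per code point). The term DBCS (double‑byte character set) refers to legacy encodings such as GBK, not to Unicode.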
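Code points are plain numbers, independent of how they are later stored as bytes. As a small sketch, assuming Python 3, ord() and chr() convert between a character and its code point, and encoding to UTF‑16 shows how many 16‑bit units a given character needs; the helper name utf16_units is hypothetical.

```python
def utf16_units(ch: str) -> int:
    """Number of 16-bit code units needed to store ch in UTF-16."""
    # "utf-16-le" omits the byte order mark, so the length is exact.
    return len(ch.encode("utf-16-le")) // 2


samples = ["A", "汉", "\U0001F600"]  # the last one is an emoji outside the 16-bit range

for ch in samples:
    code_point = ord(ch)
    # chr() is the inverse of ord().
    assert chr(code_point) == ch
    print(f"U+{code_point:04X} needs {utf16_units(ch)} UTF-16 unit(s)")
```

A character whose code point is above U+FFFF does not fit in a single 16‑bit unit, which is why a fixed two‑byte representation cannot cover all of Unicode.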

UTF‑8 is a variable‑length encoding for Unicode. It uses 1 byte for code points 0x0000‑0x007F, 2 bytes for 0x0080‑0x07FF, 3 bytes for 0x0800‑0xFFFF, and 4 bytes for 0x10000‑0x10FFFF. For example, the Chinese character "汉" (U+6C49) falls in the 0x0800‑0xFFFF range and is encoded as three bytes: E6 B1 89. UTF‑8 is backward compatible with ASCII (any ASCII text is already valid UTF‑8, byte for byte) and has become the de facto standard for web content.
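The byte ranges above follow from a fixed bit layout: the first byte signals the sequence length, and each continuation byte carries 6 payload bits behind a 10 prefix. The following is a simplified sketch of that layout, assuming Python 3; the function name utf8_encode is hypothetical, the 4‑byte branch for code points above U+FFFF is included for completeness, and in practice str.encode("utf-8") should be used instead.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode a single Unicode code point as UTF-8 by hand.

    Simplified: does not reject surrogate code points (U+D800 to U+DFFF),
    which the built-in encoder refuses.
    """
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if code_point < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


han = utf8_encode(0x6C49)
print(han.hex().upper())                 # E6B189
print(han == "汉".encode("utf-8"))       # True
```

For U+6C49 (binary 0110 1100 0100 1001), the bits split into 0110, 110001, and 001001, giving 11100110, 10110001, and 10001001, that is E6 B1 89.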

GB2312, GBK, and GB18030

In China, national encodings were created to represent Chinese characters. GB2312 (1980) uses two bytes per Chinese character and covers 6,763 simplified Chinese characters plus several hundred symbols. GBK extends GB2312 to roughly 21,000 Chinese characters, including traditional forms, while remaining compatible with GB2312. GB18030 is a superset of GBK that uses 1, 2, or 4 bytes per character; it adds minority‑language scripts and can represent every Unicode code point.
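The difference in coverage can be checked directly with the gb2312, gbk, and gb18030 codecs that ship with Python 3. This is a small sketch; the helper name encoded_length and the sample characters are chosen for illustration.

```python
from typing import Optional


def encoded_length(ch: str, codec: str) -> Optional[int]:
    """Return the number of bytes ch takes in codec, or None if unsupported."""
    try:
        return len(ch.encode(codec))
    except UnicodeEncodeError:
        return None


samples = {
    "simplified": "汉",         # U+6C49, present in GB2312
    "traditional": "漢",        # U+6F22, added by GBK
    "emoji": "\U0001F600",      # outside GBK, reachable in GB18030
}

for label, ch in samples.items():
    sizes = {codec: encoded_length(ch, codec)
             for codec in ("gb2312", "gbk", "gb18030")}
    print(label, sizes)

# Characters shared by all three keep the same two-byte code.
same_code = "汉".encode("gb2312") == "汉".encode("gbk") == "汉".encode("gb18030")
print(same_code)
```

A None in the output marks a character the codec cannot represent, which is exactly the case where data would be lost or an error raised on save.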

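When a website targets only Chinese‑language users, GBK can reduce storage compared with UTF‑8, because most Chinese characters take 2 bytes in GBK but 3 bytes in UTF‑8. For international audiences or mixed‑language content, UTF‑8 is recommended, since it can represent text in any language and avoids garbled output.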
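The storage difference is easy to measure. The sketch below assumes Python 3; the helper name encoded_sizes and the sample strings are illustrative.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of text in GBK and in UTF-8."""
    return {codec: len(text.encode(codec)) for codec in ("gbk", "utf-8")}


chinese = encoded_sizes("汉字编码")   # four Chinese characters
english = encoded_sizes("encoding")  # eight ASCII characters

print(chinese)   # {'gbk': 8, 'utf-8': 12}
print(english)   # {'gbk': 8, 'utf-8': 8}
```

The saving applies only to the Chinese characters themselves; ASCII content such as markup, URLs, and digits takes one byte per character in both encodings, so the overall difference on a real page is usually smaller than the per‑character ratio suggests.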

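Understanding these encoding schemes and declaring the character set consistently (for example, <meta charset="UTF-8"> in the HTML head, charset=UTF-8 in the HTTP Content-Type header, and a matching encoding on the database side) is essential for preventing encoding mismatches in both front‑end and back‑end development.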
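On the back end, the same principle means naming the encoding explicitly at every boundary instead of relying on platform defaults. The following is a minimal sketch, assuming Python 3; the file name page.html and the variable names are hypothetical, and a real application would set the header through its web framework.

```python
import os
import tempfile

# The page declares its own encoding in the HTML head.
html = (
    "<!DOCTYPE html>\n"
    '<html><head><meta charset="UTF-8"><title>Demo</title></head>\n'
    "<body>汉字</body></html>\n"
)

# Value a server would send in the HTTP Content-Type header.
content_type = "text/html; charset=UTF-8"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "page.html")

    # Always pass encoding= explicitly; the platform default may differ.
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)

    # Reading with the same encoding restores the text exactly.
    with open(path, "r", encoding="utf-8") as f:
        round_trip = f.read()

    # The bytes on disk are what a browser would receive.
    with open(path, "rb") as f:
        raw = f.read()

print(round_trip == html)
print(content_type)
```

The declared charset, the bytes actually written, and the decoder used to read them all have to agree; if any one of the three differs, the result is garbled text.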
