Why Do You See Garbled Text? Master Character Encoding (ASCII, Unicode, UTF‑8, GBK)
This article explains the fundamentals of character encoding, covering character sets, encoding rules, and common schemes such as ASCII, Unicode, UTF‑8, GB2312, GBK, and GB18030, helping developers prevent and resolve garbled text issues in web and database applications.
When developing web applications or handling databases, encountering garbled text is a common headache. The root cause lies in character encoding, which consists of two concepts: the character set (the collection of abstract characters) and the encoding (the rules that map those characters to binary representations).
What Is a Character Set?
A character set is the complete collection of symbols a system supports, including letters, numbers, punctuation, and graphical symbols from various languages.
What Is Character Encoding?
Encoding defines how characters are translated into numeric codes that computers can store and process. Since computers operate on binary 0s and 1s, each character must be represented by a specific sequence of bits. Decoding reverses this process; mismatched encoding and decoding produce the dreaded "garbled" output.
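The mismatch described above can be reproduced in a few lines. This is a minimal sketch using Python's built-in codecs; the sample string and variable names are chosen purely for illustration.

```python
# Encode text with one scheme, then decode it with another.
text = "汉字"

utf8_bytes = text.encode("utf-8")   # 6 bytes: 3 per character

# Correct round trip: decode with the same encoding used to encode.
round_trip = utf8_bytes.decode("utf-8")

# Mismatched decode: the same bytes read as GBK map to different characters.
# errors="replace" substitutes U+FFFD for any byte sequence GBK cannot decode.
garbled = utf8_bytes.decode("gbk", errors="replace")

print(round_trip)
print(garbled)
```

The bytes never change between the two decodes; only the interpretation does, which is why garbled text can usually be repaired by decoding again with the right encoding.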
ASCII Encoding
ASCII is one of the earliest widely used encodings. It defines 128 characters using 7 bits each; in practice each character is stored in a single byte (8 bits) with the highest bit set to 0. It covers basic English letters, digits, common symbols, and control characters.
While sufficient for English, ASCII cannot represent characters from other languages, prompting the development of extended encodings.
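To make the limitation concrete, the sketch below prints a few ASCII values and then tries to encode a non-English character. It assumes only Python's standard `ascii` codec; the sample characters are arbitrary.

```python
# ASCII code values lie in the range 0-127, so each one fits in 7 bits.
for ch in ("A", "a", "0", "~"):
    print(ch, ord(ch), format(ord(ch), "07b"))

ascii_bytes = "Hello".encode("ascii")   # one byte per character

# A character outside the ASCII range has no ASCII code,
# so strict encoding raises UnicodeEncodeError.
try:
    "é".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(len(ascii_bytes), ascii_ok)
```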
Unicode and UTF‑8
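Unicode aims to assign a unique numeric identifier, called a code point, to every character in every writing system, enabling cross‑language and cross‑platform text handling. Code points range from U+0000 to U+10FFFF, so they do not all fit in 2 bytes (16 bits). Unicode itself does not dictate how code points are stored; that is the job of encoding forms such as UTF‑8, UTF‑16, and UTF‑32. UTF‑16, the usual "wide character" representation on Windows and in Java, stores the most common characters in 2 bytes and the rest in 4. The term DBCS (double‑byte character set) refers to legacy encodings such as GBK, not to Unicode.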
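A code point is only a number; how many bytes it occupies depends on the encoding form chosen. The following sketch, using Python's built-in `ord`, `chr`, and standard codecs, illustrates the distinction. The emoji is an arbitrary example of a code point above U+FFFF.

```python
# A code point is just a number; ord() and chr() convert between the two.
han = "汉"
code_point = ord(han)               # 0x6C49
assert chr(0x6C49) == han

# Code points are not limited to 16 bits: this emoji lies above U+FFFF.
emoji = "\U0001F600"
emoji_code_point = ord(emoji)       # 0x1F600

# The same code point takes a different number of bytes in each encoding form.
for name in ("utf-8", "utf-16-le", "utf-32-le"):
    print(name, len(han.encode(name)), len(emoji.encode(name)))
```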
UTF‑8 is a variable‑length encoding for Unicode. It uses 1 byte for code points 0x0000‑0x007F, 2 bytes for 0x0080‑0x07FF, 3 bytes for 0x0800‑0xFFFF, and 4 bytes for 0x10000‑0x10FFFF. For example, the Chinese character "汉" (U+6C49) falls in the 0x0800‑0xFFFF range and is encoded as three bytes: E6 B1 89. Because the 1‑byte range is identical to ASCII, any ASCII text is already valid UTF‑8, and UTF‑8 has become the de‑facto standard for web content.
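The byte ranges above translate directly into bit operations. The function below is a hypothetical helper written for illustration, not a library API; it also includes the 4-byte form used for code points above 0xFFFF, and it omits validation (for example, it does not reject surrogate values or numbers above 0x10FFFF).

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative, no validation)."""
    if code_point < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# "汉" is U+6C49, which falls in the 3-byte range.
print(utf8_encode(0x6C49).hex(" "))
```

Comparing the result with Python's own `"汉".encode("utf-8")` is a quick way to check the helper against the built-in codec.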
GB2312, GBK, and GB18030
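In China, a family of national standards was created to represent Chinese characters. GB2312 encodes each Chinese character in two bytes and covers 6,763 simplified Chinese characters plus several hundred symbols. GBK extends GB2312 while remaining compatible with it, bringing the total to roughly 21,000 Chinese characters, including traditional forms. GB18030 further extends GBK with a four‑byte form, which lets it represent every Unicode code point, including the scripts of China's minority languages. In all three, ASCII characters still occupy a single byte.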
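The difference in coverage between the three encodings can be probed with Python's built-in `gb2312`, `gbk`, and `gb18030` codecs. This is a rough sketch: `can_encode` is a hypothetical helper, and the three sample characters are chosen for illustration only.

```python
han = "汉"           # simplified character, present in GB2312
trad = "漢"          # traditional form, absent from GB2312 but present in GBK
tibetan = "\u0f40"   # Tibetan letter KA, outside GBK's repertoire


def can_encode(text: str, encoding: str) -> bool:
    """Return True if text is representable in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


for enc in ("gb2312", "gbk", "gb18030"):
    print(enc, [can_encode(c, enc) for c in (han, trad, tibetan)])

# Each standard extends the previous one, so shared characters keep the same bytes.
assert han.encode("gb2312") == han.encode("gbk") == han.encode("gb18030")

# GB18030 falls back to a four-byte form for characters GBK cannot represent.
print(len(tibetan.encode("gb18030")))
```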
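When a website targets only domestic Chinese users, GBK can reduce storage compared to UTF‑8, because it stores each Chinese character in 2 bytes where UTF‑8 needs 3. The saving shrinks as the share of ASCII content (markup, digits, English text) grows, since both encodings use 1 byte for ASCII. For international accessibility, UTF‑8 is recommended to avoid garbled text.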
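The storage trade-off is easy to measure. The sketch below compares encoded sizes for two arbitrary sample strings; `encoded_sizes` is a hypothetical helper, and real savings depend on how much of the content is Chinese text versus ASCII.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded length in bytes of text under GBK and UTF-8."""
    return {enc: len(text.encode(enc)) for enc in ("gbk", "utf-8")}


chinese = "字符编码"    # 4 Chinese characters
mixed = "UTF-8 编码"    # 6 ASCII characters plus 2 Chinese characters

print(encoded_sizes(chinese))
print(encoded_sizes(mixed))
```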
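Understanding these encoding schemes and declaring the character set consistently (e.g., <meta charset="UTF-8"> in the HTML head, charset=UTF-8 in the HTTP Content-Type header, and the same encoding on the database connection) is essential for preventing encoding mismatches in both front‑end and back‑end development.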
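On the receiving side, a declared charset is only useful if the code actually honors it. As one possible sketch, the snippet below reads the charset parameter from a Content-Type header value using the standard library's `email.message.Message` and decodes the body accordingly. The helper name, the header values, and the UTF-8 fallback are assumptions made for illustration.

```python
from email.message import Message


def charset_from_content_type(header_value: str, default: str = "utf-8") -> str:
    """Extract the charset parameter from a Content-Type value (illustrative helper)."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset in lower case, or None if absent.
    return msg.get_content_charset() or default


# Hypothetical response: a GBK-encoded body with a matching declaration.
body = "汉字".encode("gbk")
charset = charset_from_content_type("text/html; charset=GBK")
text = body.decode(charset)

print(charset, text)
```

Decoding with the declared charset rather than a hard-coded one keeps the reader and the writer in agreement, which is the condition for avoiding garbled output.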
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
