Why Web Pages Show Garbled Text: Charsets, Encodings & HTTP Headers
This article explains how computers store and display characters using binary, defines character sets and encodings such as ASCII, GB2312, GBK, GB18030, BIG5 and Unicode, compares UTF‑8, UTF‑16, UTF‑32, and describes related HTTP headers like Accept‑Charset, Content‑Type, and Content‑Encoding.
Have you ever opened a webpage only to see garbled characters like "бЇЯАзЪСЯ" or "�????????"? This article explores the HTTP header fields related to character sets and encodings such as Accept-Charset, Accept-Encoding, Accept-Language, Content-Encoding, and Content-Language.
1. Basic Knowledge
Computers store information as binary numbers; the characters we see on screen are the result of converting those binary values according to a specific rule. Turning characters into bytes is called "encoding", and the reverse process is "decoding". Decoding bytes with a different rule from the one used to encode them produces garbled output.
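As a minimal sketch of this round trip in Python: the sample string and the choice of Latin-1 as the "wrong" rule are illustrative assumptions, not taken from the original article.

```python
# Encode text to bytes, then decode with the right and the wrong rule.
text = "\u7f16\u7801"            # the two Chinese characters for "encoding"

data = text.encode("utf-8")      # encoding: characters -> bytes
right = data.decode("utf-8")     # decoding with the same rule restores the text
wrong = data.decode("latin-1")   # decoding with another rule yields mojibake

print(data)
print(right)
print(wrong)
```

Latin-1 accepts every byte value, so the wrong decode does not fail; it silently returns unrelated characters, which is what a garbled page looks like.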
A charset is the collection of abstract characters a system supports. Character encoding is a set of rules that maps characters to numeric codes that computers can store and process.
2. Common Charsets and Encodings
Typical charsets include ASCII, GB2312, BIG5, GB18030, and Unicode. To handle them, computers use corresponding encodings.
2.1 ASCII Charset & Encoding
ASCII (American Standard Code for Information Interchange) is a single‑byte encoding designed for English text. It defines 128 characters using 7 bits; the various "extended ASCII" encodings use the eighth bit to reach 256 characters, but they do not agree on what the upper 128 values mean. The table below shows the mapping.
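The 7-bit limit can be checked with a short Python sketch. The helper name `is_ascii` and the sample strings are hypothetical, chosen only for illustration.

```python
def is_ascii(text: str) -> bool:
    """Return True if every character fits in ASCII's 7-bit range (0..127)."""
    return all(ord(ch) < 128 for ch in text)

code_of_a = ord("A")                 # numeric code assigned to "A"
encoded = "Hello".encode("ascii")    # one byte per character

try:
    "caf\u00e9".encode("ascii")      # U+00E9 is outside the 128 ASCII characters
    accented_ok = True
except UnicodeEncodeError:
    accented_ok = False

print(code_of_a, encoded, accented_ok)
```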
2.2 GBxxxx Charsets & Encoding
GB2312 is the Chinese national standard for simplified Chinese, covering 6,763 Chinese characters plus several hundred symbols. GBK extends GB2312 by adding characters from GB13000.1‑93, including traditional characters, while GB18030 further expands to over 70,000 Chinese characters and can represent every Unicode code point.
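The growing coverage of the three standards can be probed with Python's built-in `gb2312`, `gbk`, and `gb18030` codecs. This is a rough sketch; the helper `try_encode` and the three sample characters are assumptions made for the example.

```python
def try_encode(text: str, codec: str):
    """Return the encoded bytes, or None if the codec cannot represent the text."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None

samples = {
    "simplified": "\u4e2d",        # common simplified character
    "traditional": "\u9f8d",       # traditional "dragon"
    "emoji": "\U0001F600",         # outside the Basic Multilingual Plane
}

for label, ch in samples.items():
    for codec in ("gb2312", "gbk", "gb18030"):
        print(label, codec, try_encode(ch, codec))
```

Each later standard keeps the byte values of the earlier one, so text encoded as GB2312 is also valid GBK and GB18030.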
2.3 BIG5 Charset & Encoding
BIG5 is the dominant charset for traditional Chinese used in Taiwan, Hong Kong, and Macau. It is a double‑byte encoding with specific ranges for user‑defined characters, punctuation, common Chinese characters, and less‑common characters.
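Because BIG5 and the GB family assign different byte pairs to the same character, mixing them up garbles text. The sketch below uses Python's built-in `big5` and `gbk` codecs; the sample character is an assumption chosen because it exists in both charsets.

```python
ch = "\u4e2d"                          # present in both BIG5 and GBK

big5_bytes = ch.encode("big5")         # two bytes under BIG5
gbk_bytes = ch.encode("gbk")           # two different bytes under GBK

# Reading BIG5 bytes as if they were GBK gives some other character.
misread = big5_bytes.decode("gbk", errors="replace")

print(big5_bytes, gbk_bytes, misread)
```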
3. The Great Idea: Unicode
Unicode deserves a separate discussion.
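Unicode was created to provide a single universal character set for all languages. It assigns a unique code point to each character, eliminating the incompatibility problems of multiple regional encodings. How those code points are stored as bytes is defined separately by encoding forms such as UTF‑8, UTF‑16, and UTF‑32.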
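In Python, `ord` and `chr` convert between a character and its code point, which makes the "one number per character" idea easy to see. The sample characters are illustrative assumptions.

```python
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]

# Map each character to its code point in the usual U+XXXX notation.
code_points = [f"U+{ord(ch):04X}" for ch in samples]
print(code_points)

# chr() goes the other way: code point -> character.
restored = [chr(ord(ch)) for ch in samples]
```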
3.1 UCS & Unicode
The Universal Character Set (UCS) defined by ISO/IEC 10646 and the Unicode Standard are developed in sync and share the same code charts, so a given code point means the same character in both.
3.2 UTF‑32
UTF‑32 uses a fixed 4‑byte representation for every Unicode code point, offering constant‑time indexing but at the cost of high memory usage.
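A small sketch of the fixed-width property: with 4 bytes per code point, the n-th character always starts at byte offset 4 × n. The helper `code_point_at` and the sample string are hypothetical.

```python
text = "A\u4e2d\U0001F600"
data = text.encode("utf-32-be")      # big-endian, no byte order mark

def code_point_at(buf: bytes, index: int) -> str:
    """Constant-time lookup: slice the 4 bytes at offset index * 4."""
    start = index * 4
    return chr(int.from_bytes(buf[start:start + 4], "big"))

print(len(data), code_point_at(data, 2))
```

The price is visible in the length: three characters take 12 bytes, even though the first one would fit in a single byte.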
3.3 UTF‑16
UTF‑16 uses 2 bytes for code points in the Basic Multilingual Plane and a 4‑byte surrogate pair for higher code points. It is more compact than UTF‑32 for most text, but because of surrogate pairs it is not truly fixed‑width.
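The surrogate-pair arithmetic can be sketched as follows and compared against Python's own `utf-16-be` codec. The function name `to_surrogates` is a hypothetical helper.

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need surrogates")
    offset = code_point - 0x10000            # 20 significant bits remain
    high = 0xD800 + (offset >> 10)           # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)          # bottom 10 bits
    return high, low

high, low = to_surrogates(0x1F600)
manual = high.to_bytes(2, "big") + low.to_bytes(2, "big")
builtin = "\U0001F600".encode("utf-16-be")

print(hex(high), hex(low), manual == builtin)
```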
3.4 UTF‑8
UTF‑8 is a variable‑length encoding compatible with ASCII. It uses 1‑4 bytes per character, making it the dominant encoding on the web and in many protocols.
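To illustrate the 1 to 4 byte range, the sketch below encodes one sample character from each length class. The samples are assumptions chosen for the example.

```python
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]

# Number of bytes each character occupies in UTF-8.
lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)

# ASCII characters keep exactly the same byte value.
ascii_same = "A".encode("utf-8") == "A".encode("ascii")
```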
Advantages of UTF‑8
ASCII‑compatible, so existing ASCII text remains valid.
Byte‑wise sorting yields the same order as Unicode code‑point sorting.
Supported by all modern markup languages and protocols.
Byte‑oriented search algorithms work directly on UTF‑8 data.
Text in other encodings rarely happens to be valid UTF‑8, so UTF‑8 data can be recognized and validated reliably.
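Two of the properties listed above, byte-wise sort order and strict validation, can be checked with a short sketch. The word list and the helper `looks_like_utf8` are illustrative assumptions.

```python
words = ["zebra", "\u00e9clair", "\u4e2d\u6587", "apple", "\U0001F600"]

# Python compares str values by code point; compare with raw UTF-8 byte order.
by_code_point = sorted(words)
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))

def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes form a valid UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

gbk_data = "\u4e2d\u6587".encode("gbk")     # same text, different encoding
utf8_data = "\u4e2d\u6587".encode("utf-8")

print(by_code_point == by_bytes, looks_like_utf8(gbk_data), looks_like_utf8(utf8_data))
```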
Disadvantages of UTF‑8
Variable‑length encoding makes random access O(N) and requires extra bit‑manipulation to decode.
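The O(N) cost comes from having to scan from the start and count lead bytes. The following is a rough sketch of that scan; `utf8_char_at` is a hypothetical helper and assumes its input is already valid UTF-8.

```python
def utf8_char_at(data: bytes, index: int) -> str:
    """Return the index-th character by scanning the bytes from the start."""
    if index < 0:
        raise IndexError("negative index not supported in this sketch")
    count = -1
    start = None
    for pos, byte in enumerate(data):
        # Continuation bytes look like 10xxxxxx; anything else starts a character.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == index:
                start = pos
            elif count == index + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("character index out of range")
    return data[start:].decode("utf-8")

data = "A\u4e2d\U0001F600".encode("utf-8")
print(utf8_char_at(data, 1))
```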
4. Related HTTP Header Fields
Request headers, sent by the client:
Accept-Charset tells the server which charsets the client can receive.
Accept-Encoding indicates the compression methods the client supports (e.g., gzip, deflate).
Accept-Language specifies the client's preferred languages.
Response headers, sent by the server:
Content-Type includes the MIME type and charset of the body (e.g., text/html; charset=gb2312).
Content-Encoding declares the compression applied to the response body.
Content-Language indicates the language of the response content.
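A minimal sketch of how a client might apply Content-Encoding and Content-Type when reading a response body, using only Python's standard library. The function `decode_body`, the simulated headers, and the UTF-8 fallback are assumptions for illustration; real HTTP clients handle many more cases.

```python
import gzip
from email.message import Message

def decode_body(headers: dict, body: bytes, default_charset: str = "utf-8") -> str:
    """Undo gzip compression if declared, then decode with the declared charset."""
    if headers.get("Content-Encoding", "").lower() == "gzip":
        body = gzip.decompress(body)
    # Reuse the stdlib header parser to pull the charset parameter out of Content-Type.
    msg = Message()
    msg["Content-Type"] = headers.get("Content-Type", "application/octet-stream")
    charset = msg.get_content_charset() or default_charset
    return body.decode(charset)

# Simulated response: gb2312 text, gzip-compressed on the wire.
simulated_headers = {
    "Content-Type": "text/html; charset=gb2312",
    "Content-Encoding": "gzip",
}
simulated_body = gzip.compress("<p>\u4e2d\u6587</p>".encode("gb2312"))

page = decode_body(simulated_headers, simulated_body)
print(page)
```

Note the order: the compression named in Content-Encoding is removed first, and only then are the resulting bytes decoded with the charset from Content-Type.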
21CTO
