
Demystifying Character Encoding: From ASCII to Unicode and Beyond

This article walks through the evolution of character encoding—from the early ASCII standard, through Chinese extensions like GB2312 and GBK, to the universal Unicode system and its UTF‑32, UTF‑16, and UTF‑8 encodings—explaining their structures, usage, and common pitfalls.

vivo Internet Technology

Aug 11, 2021


ASCII – Early 7‑bit encoding

Computers store data as binary bits; eight bits form a byte. ASCII (American Standard Code for Information Interchange) is a 7-bit code: it assigns the 128 values 0-127 to English letters, digits, punctuation and control characters, so every ASCII character fits in one byte whose highest bit is 0.

Chinese non‑ASCII encodings

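As computing spread beyond English-speaking countries, other languages needed characters that ASCII lacks. The unused byte range 128-255 became an “extended character set”, but 128 extra values are far too few for Chinese, which led to the multi-byte encodings that follow.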

GB2312 uses two-byte sequences (high byte 0xA1-0xF7, low byte 0xA1-0xFE) to encode 6,763 simplified Chinese characters plus several hundred full-width punctuation marks and symbols, roughly 7,400 characters in total.

GBK extends GB2312 by retaining all its mappings and widening the two-byte ranges, bringing the total to roughly 21,000 Chinese characters, including traditional Chinese.

GB18030 further expands GBK with four-byte sequences and becomes a superset that covers the entire Unicode repertoire, including the scripts of China's minority languages.
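The GB2312 byte ranges quoted above are enough to split a byte stream into characters. The following is a minimal sketch, not a full decoder: the class and method names are hypothetical, and it only checks ranges rather than mapping bytes to actual characters.

```java
// Sketch: walk a byte stream using the GB2312 ranges described in the text.
class Gb2312Ranges {
    // True when (high, low) lies in the two-byte GB2312 area:
    // high byte 0xA1-0xF7, low byte 0xA1-0xFE.
    static boolean isGb2312Pair(int high, int low) {
        return high >= 0xA1 && high <= 0xF7 && low >= 0xA1 && low <= 0xFE;
    }

    // Counts characters: a byte below 0x80 is one ASCII character,
    // anything else must start a valid two-byte pair.
    static int countCharacters(byte[] data) {
        int count = 0;
        int i = 0;
        while (i < data.length) {
            int b = data[i] & 0xFF;
            if (b < 0x80) {
                i += 1;
            } else if (i + 1 < data.length && isGb2312Pair(b, data[i + 1] & 0xFF)) {
                i += 2;
            } else {
                throw new IllegalArgumentException("Not valid GB2312 at offset " + i);
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'v' followed by 0xD6 0xD0, the GB2312 bytes for "中".
        byte[] data = {0x76, (byte) 0xD6, (byte) 0xD0};
        System.out.println(countCharacters(data)); // 2
    }
}
```

Because the lead byte always has its high bit set, ASCII text remains valid GB2312, which is the same compatibility idea GBK and GB18030 preserve.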

Unicode – Global code point space

The Unicode Consortium maintains Unicode in step with ISO/IEC 10646, the Universal Coded Character Set (UCS) published by the International Organization for Standardization (ISO). Unicode assigns a unique code point in the range U+0000 to U+10FFFF to each character, symbol and emoji. The first 65,536 code points form the Basic Multilingual Plane (BMP); the remaining 16 planes hold supplementary characters. A code point is only a number: how many bytes it occupies depends on the encoding used to store it (UTF-32, UTF-16 or UTF-8).

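Example: the string “v维” consists of the code points U+0076 (“v”) and U+7EF4 (“维”). Written naively as the bytes 0x76 0x7E 0xF4, the sequence is ambiguous, because a reader cannot tell where one character ends and the next begins. That is why an encoding scheme is needed on top of the code points.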
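To inspect the code points of a string in Java, something like the following should work. It is a small sketch; the class and method names are made up for illustration, and the Chinese character is written as a Unicode escape so the result does not depend on the source file's encoding.

```java
// Sketch: list the Unicode code points of a string in U+XXXX notation.
class CodePoints {
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        // codePoints() yields whole code points, not UTF-16 code units.
        for (int cp : text.codePoints().toArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("v\u7EF4")); // U+0076 U+7EF4
    }
}
```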

Transport encodings (storage and transmission)

UTF‑32

Each Unicode code point is stored directly as a 32‑bit integer. This representation is simple but wasteful; the ASCII character “A” occupies four bytes.
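The fixed-width layout can be sketched by hand. The example below writes each code point as a big-endian 32-bit integer; the class and method names are hypothetical, and it deliberately avoids relying on a built-in UTF-32 charset.

```java
import java.nio.ByteBuffer;

// Sketch: UTF-32 (big-endian) stores every code point as one 32-bit integer.
class Utf32Sketch {
    static byte[] encodeUtf32Be(String text) {
        int count = text.codePointCount(0, text.length());
        // ByteBuffer is big-endian by default.
        ByteBuffer buf = ByteBuffer.allocate(4 * count);
        for (int cp : text.codePoints().toArray()) {
            buf.putInt(cp);
        }
        return buf.array();
    }

    public static void main(String[] args) {
        // "A" is U+0041, so it becomes 00 00 00 41: four bytes for one ASCII letter.
        System.out.println(encodeUtf32Be("A").length); // 4
    }
}
```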

UTF‑16

Code points in the BMP are stored as a single 16-bit unit (two bytes). Characters outside the BMP use a surrogate pair (two 16-bit units, four bytes). A Byte Order Mark (BOM), the code point U+FEFF, may be prefixed to indicate endianness: it appears as the bytes FE FF in big-endian data and as FF FE in little-endian data.
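Both the byte order and the surrogate arithmetic can be checked in a few lines. This is an illustrative sketch with hypothetical class and method names; it uses only the standard UTF-16 charsets and computes the surrogate pair by hand for U+1F600, a code point chosen here purely as an example.

```java
import java.nio.charset.StandardCharsets;

// Sketch: UTF-16 byte order, the BOM, and surrogate pairs.
class Utf16Sketch {
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Splits a supplementary code point (above U+FFFF) into high and low surrogates.
    static char[] toSurrogates(int codePoint) {
        int v = codePoint - 0x10000;
        char high = (char) (0xD800 + (v >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (v & 0x3FF));  // bottom 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16BE))); // 00 41
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16LE))); // 41 00
        // Java's plain UTF-16 encoder writes a big-endian BOM first.
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41
        char[] pair = toSurrogates(0x1F600);
        System.out.println(String.format("%04X %04X", (int) pair[0], (int) pair[1])); // D83D DE00
    }
}
```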

UTF‑8

UTF‑8 is a variable‑length encoding using 1‑4 bytes per code point. It is backward compatible with ASCII (code points 0‑127 encode as a single byte) and does not require a BOM.

Example: the Chinese character “知” (U+77E5) encodes to the three‑byte sequence E7 9F A5.
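The bit layout behind that example can be reproduced by hand. The sketch below is a simplified encoder written for illustration (hypothetical class and method names); in real code, String.getBytes(StandardCharsets.UTF_8) does the same job.

```java
// Sketch: encode one Unicode code point as UTF-8.
class Utf8Sketch {
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("Not a Unicode scalar value: " + cp);
        }
        if (cp < 0x80) {            // 0xxxxxxx
            return new byte[] {(byte) cp};
        }
        if (cp < 0x800) {           // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        if (cp < 0x10000) {         // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        return new byte[] {         // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))};
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (byte b : encode(0x77E5)) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(sb.toString().trim()); // E7 9F A5
    }
}
```

The leading bits of the first byte tell a decoder how many continuation bytes follow, which is what removes the ambiguity of raw code point bytes.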

Legacy ANSI code pages

On Windows, “ANSI” refers to the system locale’s legacy code page (e.g., GBK for Simplified Chinese). These encodings map the first 128 bytes to ASCII and use the remaining bytes for locale‑specific characters.

Practical Q&A

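Can a Java char store a Chinese character? In most cases, yes. Java strings use UTF-16 internally; a char is 16 bits and can represent any BMP code point, which covers the commonly used Chinese characters. Rare characters outside the BMP need two chars (a surrogate pair).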
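The difference between chars and code points is easy to observe. The following is a small sketch with a hypothetical class name; U+20BB7 is used here only as an example of a Chinese character outside the BMP.

```java
// Sketch: a BMP character fits in one char, a supplementary one needs two.
class CharDemo {
    // Number of 16-bit char units the string occupies.
    static int charUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points (user-visible characters, roughly speaking).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "\u7EF4";                                          // U+7EF4, inside the BMP
        String supplementary = new String(Character.toChars(0x20BB7)); // outside the BMP
        System.out.println(charUnits(bmp));            // 1
        System.out.println(charUnits(supplementary));  // 2
        System.out.println(codePoints(supplementary)); // 1
    }
}
```

Code that iterates with charAt may therefore split a rare character in half; iterating with codePoints() avoids that.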

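Tomcat falls back to ISO-8859-1 in several places (request bodies without a declared charset and, in Tomcat 7 and earlier, query strings), causing garbled Chinese text. How to fix it?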

Method 1 – edit conf/server.xml:

<Connector port="8080" protocol="HTTP/1.1" connectionTimeout="20000" redirectPort="8443" URIEncoding="UTF-8" useBodyEncodingForURI="true"/>

Method 2 – set response encoding in code:

response.setCharacterEncoding("UTF-8");
response.setContentType("text/html;charset=UTF-8");

For request parameters:

POST – ensure client and server use the same encoding, e.g., call request.setCharacterEncoding("utf-8") before reading any request parameter; once a parameter has been read, the call has no effect.

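GET – if the container has already decoded the query string as ISO-8859-1 (for example, when URIEncoding is not set on an older Tomcat), re-decode the parameter manually: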

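import java.nio.charset.StandardCharsets;
String name = request.getParameter("name");
// Recover the raw bytes from the ISO-8859-1 string and reinterpret them as UTF-8.
if (name != null) name = new String(name.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);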
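The re-decoding trick works because ISO-8859-1 maps every byte to exactly one char, so no information is lost. The sketch below simulates the problem without a servlet container; the class and method names are hypothetical, and the garbling step stands in for what a container does when it assumes ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

// Sketch: simulate a container mis-decoding UTF-8 bytes, then repair the value.
class QueryRepair {
    // What a container effectively does when it treats UTF-8 bytes as ISO-8859-1.
    static String misdecode(String original) {
        return new String(original.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // The manual fix: recover the raw bytes, then decode them as UTF-8.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u7EF4";               // one Chinese character, three UTF-8 bytes
        String garbled = misdecode(original);
        System.out.println(garbled.length());     // 3
        System.out.println(repair(garbled).equals(original)); // true
    }
}
```

Applying the repair to a value that was decoded correctly in the first place would corrupt it, so the fix should be used only when the container is known to have used ISO-8859-1.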

Conclusion

Character encoding evolved from 7-bit ASCII to multi-byte schemes that support every written language. Unicode defines the universal character set, the mapping from characters to code points; UTF-32, UTF-16 and UTF-8 are the encodings that turn those code points into bytes for storage and transmission.
