
Demystifying Character Encoding: From ASCII to Unicode and Beyond

This article walks through the evolution of character encoding—from the early ASCII standard, through Chinese extensions like GB2312 and GBK, to the universal Unicode system and its UTF‑32, UTF‑16, and UTF‑8 encodings—explaining their structures, usage, and common pitfalls.

vivo Internet Technology

ASCII – Early 7‑bit encoding

Computers store data as binary bits; eight bits form a byte. The first 128 byte values (0‑127) were assigned to English letters, digits, punctuation and control characters, creating the ASCII (American Standard Code for Information Interchange) standard.
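As a minimal sketch of the 0‑127 range described above, the following Java snippet checks whether a string stays within ASCII and prints the byte values of a short ASCII string. The class and method names (AsciiDemo, isAscii) are illustrative only.

```java
import java.nio.charset.StandardCharsets;

class AsciiDemo {
    // True when every char of the text lies in the 7-bit ASCII range 0-127.
    static boolean isAscii(String text) {
        return text.chars().allMatch(c -> c < 128);
    }

    public static void main(String[] args) {
        // Each ASCII character encodes to exactly one byte below 128.
        byte[] bytes = "Hi!".getBytes(StandardCharsets.US_ASCII);
        for (byte b : bytes) {
            System.out.print(b + " "); // 72 105 33
        }
        System.out.println();

        System.out.println(isAscii("Hi!"));    // true
        System.out.println(isAscii("\u7EF4")); // false: U+7EF4 is a Chinese character
    }
}
```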

ASCII keyboard layout

Chinese non‑ASCII encodings

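As computing spread beyond English‑speaking regions, languages needed characters beyond ASCII. Single‑byte schemes used the byte range 128‑255 as an “extended character set”, but 128 extra slots are far too few for Chinese, which led to multi‑byte encodings.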

GB2312 uses two‑byte sequences (high byte 0xA1‑0xF7, low byte 0xA1‑0xFE) to encode 6,763 simplified Chinese characters plus 682 full‑width punctuation marks and symbols.
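The two‑byte layout can be observed directly. This sketch assumes the running JRE ships the GB2312 charset (standard JDK builds do, but it is not one of the charsets every Java platform must support); Gb2312Demo and its methods are illustrative names.

```java
import java.nio.charset.Charset;

class Gb2312Demo {
    // Encodes text with the JRE's GB2312 charset (assumed to be installed).
    static byte[] encode(String text) {
        return text.getBytes(Charset.forName("GB2312"));
    }

    // Checks the two-byte ranges quoted above: high 0xA1-0xF7, low 0xA1-0xFE.
    static boolean inGb2312Range(byte high, byte low) {
        int h = high & 0xFF; // convert the signed byte to 0-255
        int l = low & 0xFF;
        return h >= 0xA1 && h <= 0xF7 && l >= 0xA1 && l <= 0xFE;
    }

    public static void main(String[] args) {
        byte[] b = encode("\u4E2D"); // U+4E2D, the character "zhong" (middle)
        System.out.printf("%02X %02X%n", b[0], b[1]);  // D6 D0
        System.out.println(inGb2312Range(b[0], b[1])); // true
    }
}
```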

GB2312 table

GBK extends GB2312 by retaining all its mappings and widening the two‑byte ranges, bringing the total to roughly 21,000 Chinese characters, including traditional forms.

GB18030 further expands GBK with four‑byte sequences, becoming a superset that can represent every Unicode code point, including the scripts of China's minority languages.

GBK/GB18030 overview

Unicode – Global code point space

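The International Organization for Standardization (ISO) defined the Universal Multiple‑Octet Coded Character Set (UCS) in ISO/IEC 10646, and the Unicode Consortium maintains the Unicode Standard in step with it; the shared repertoire is commonly just called Unicode. Unicode assigns a unique code point in the range U+0000‑U+10FFFF to every character, symbol and emoji. The first 65,536 code points (U+0000‑U+FFFF) form the Basic Multilingual Plane (BMP) and fit in 16 bits; the 16 supplementary planes need up to 21 bits. A code point is only a number: how many bytes it occupies depends on the encoding (UTF‑32, UTF‑16 or UTF‑8) covered in the sections that follow.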

Unicode planes diagram

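Example: the string “v维” consists of the code points U+0076 (“v”) and U+7EF4 (“维”). Writing those numbers out as the raw bytes 0x76 0x7E 0xF4 would be ambiguous, because a reader cannot tell whether they stand for three one‑byte characters or for a one‑byte character followed by a two‑byte character. Removing that ambiguity is the job of the encodings in the next section.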
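To see code points as plain numbers, independent of any byte encoding, here is a small Java sketch. CodePointDemo and describe are illustrative names; String.codePoints and Character.isBmpCodePoint are standard JDK methods.

```java
class CodePointDemo {
    // Formats each code point of the text as U+XXXX, separated by spaces.
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // "v" followed by the Chinese character U+7EF4
        System.out.println(describe("v\u7EF4")); // U+0076 U+7EF4

        // BMP membership is a property of the number, not of any byte layout.
        System.out.println(Character.isBmpCodePoint(0x7EF4));  // true
        System.out.println(Character.isBmpCodePoint(0x1F600)); // false (an emoji)
    }
}
```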

Transport encodings (storage and transmission)

UTF‑32

Each Unicode code point is stored directly as a 32‑bit integer. This representation is simple but wasteful; the ASCII character “A” occupies four bytes.
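The fixed four‑byte cost is easy to observe. This sketch assumes the JRE provides the UTF-32BE charset (OpenJDK does, although it is not among the charsets every Java platform is required to support); Utf32Demo, hex and utf32be are illustrative names.

```java
import java.nio.charset.Charset;

class Utf32Demo {
    // Renders bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    // Encodes text as big-endian UTF-32 without a byte order mark.
    static String utf32be(String text) {
        return hex(text.getBytes(Charset.forName("UTF-32BE")));
    }

    public static void main(String[] args) {
        System.out.println(utf32be("A"));      // 00 00 00 41
        System.out.println(utf32be("\u7EF4")); // 00 00 7E F4
    }
}
```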

UTF-32 layout

UTF‑16

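Code points in the BMP are stored as one 16‑bit code unit (two bytes). Characters outside the BMP use a surrogate pair: two 16‑bit units, four bytes in total, with the high unit in 0xD800‑0xDBFF and the low unit in 0xDC00‑0xDFFF. A Byte Order Mark (BOM), the code point U+FEFF, may be prefixed to indicate endianness: it appears as the bytes FE FF in big‑endian data and FF FE in little‑endian data.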
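The following sketch shows byte order, the BOM, and the surrogate arithmetic in Java. Utf16Demo, hex and toSurrogates are illustrative names; toSurrogates assumes its argument is above U+FFFF and does no validation.

```java
import java.nio.charset.StandardCharsets;

class Utf16Demo {
    // Renders bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    // Splits a supplementary code point (above U+FFFF) into its surrogate pair.
    static char[] toSurrogates(int codePoint) {
        int v = codePoint - 0x10000;               // 20 significant bits remain
        char high = (char) (0xD800 + (v >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (v & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        String wei = "\u7EF4"; // a BMP character
        System.out.println(hex(wei.getBytes(StandardCharsets.UTF_16BE))); // 7E F4
        System.out.println(hex(wei.getBytes(StandardCharsets.UTF_16LE))); // F4 7E
        // Java's plain "UTF-16" encoder writes a big-endian BOM first.
        System.out.println(hex(wei.getBytes(StandardCharsets.UTF_16)));   // FE FF 7E F4

        char[] pair = toSurrogates(0x1F600); // an emoji outside the BMP
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);   // D83D DE00
    }
}
```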

UTF-16 byte order

UTF‑8

UTF‑8 is a variable‑length encoding using 1‑4 bytes per code point. It is backward compatible with ASCII (code points 0‑127 encode as a single byte) and does not require a BOM.

UTF-8 encoding table

Example: the Chinese character “知” (U+77E5) encodes to the three‑byte sequence E7 9F A5.
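The bit packing behind that result can be written out by hand. This is a teaching sketch, not a replacement for the JDK encoder: encodeUtf8 is an illustrative name, and it does not reject surrogates or values above U+10FFFF.

```java
class Utf8Demo {
    // Encodes one code point into 1-4 UTF-8 bytes (no validation of the input).
    static byte[] encodeUtf8(int cp) {
        if (cp < 0x80) {            // 0xxxxxxx
            return new byte[] { (byte) cp };
        }
        if (cp < 0x800) {           // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        if (cp < 0x10000) {         // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        return new byte[] {         // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F)) };
    }

    public static void main(String[] args) {
        for (byte b : encodeUtf8(0x77E5)) {
            System.out.printf("%02X ", b); // E7 9F A5
        }
        System.out.println();
    }
}
```

In production code, String.getBytes(StandardCharsets.UTF_8) does the same work and also handles malformed input.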

UTF-8 encoding of 知

Legacy ANSI code pages

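On Windows, “ANSI” refers to the system locale’s legacy code page (e.g., GBK for Simplified Chinese). These encodings map byte values 0‑127 to ASCII and use values 128‑255, alone or as parts of multi‑byte sequences, for locale‑specific characters. The same bytes therefore decode to different text under different code pages.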
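A short sketch of why the code page matters: the same two bytes read as one Chinese character under GBK and as two Western European letters under ISO‑8859‑1. It assumes the JRE ships the GBK charset (standard JDK builds do); CodePageDemo and decode are illustrative names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CodePageDemo {
    // Interprets the same raw bytes under the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] bytes = { (byte) 0xD6, (byte) 0xD0 };

        String asGbk = decode(bytes, Charset.forName("GBK"));
        String asLatin1 = decode(bytes, StandardCharsets.ISO_8859_1);

        // Under GBK the pair is one character, U+4E2D.
        System.out.println(asGbk.equals("\u4E2D"));          // true
        // Under ISO-8859-1 each byte is its own character: U+00D6, U+00D0.
        System.out.println(asLatin1.equals("\u00D6\u00D0")); // true
    }
}
```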

Windows ANSI code page illustration

Practical Q&A

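Can a Java char store a Chinese character? Yes, in most cases. Java strings use UTF‑16 internally; a char is one 16‑bit code unit and can hold any BMP code point, which covers the commonly used Chinese characters. Rare characters in the supplementary planes (such as CJK Extension B) need a surrogate pair, that is, two char values.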
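The difference between char count and character count can be sketched as follows. JavaCharDemo and its two helper methods are illustrative names; the emoji U+1F600 stands in for any supplementary‑plane character.

```java
class JavaCharDemo {
    // Number of 16-bit char units the string occupies.
    static int charUnits(String text) {
        return text.length();
    }

    // Number of Unicode code points (user-visible characters, roughly).
    static int codePointCount(String text) {
        return text.codePointCount(0, text.length());
    }

    public static void main(String[] args) {
        char wei = '\u7EF4'; // a BMP Chinese character fits in one char
        System.out.println((int) wei == 0x7EF4); // true

        // A supplementary character needs two chars (a surrogate pair).
        String emoji = new String(Character.toChars(0x1F600));
        System.out.println(charUnits(emoji));      // 2
        System.out.println(codePointCount(emoji)); // 1
    }
}
```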

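Tomcat decodes request data as ISO‑8859‑1 by default, causing garbled Chinese text. How to fix it? (For query strings this default applies to Tomcat 7 and earlier; Tomcat 8 and later decode URIs as UTF‑8 unless configured otherwise.)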

Method 1 – edit conf/server.xml:

<Connector port="8080" protocol="HTTP/1.1" connectionTimeout="20000" redirectPort="8443" URIEncoding="UTF-8" useBodyEncodingForURI="true"/>

Method 2 – set response encoding in code:

response.setCharacterEncoding("UTF-8");
response.setContentType("text/html;charset=UTF-8");

For request parameters:

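POST – ensure client and server use the same encoding, e.g., call request.setCharacterEncoding("utf-8") before the first read of any parameter or of the request body; calling it afterwards has no effect.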

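GET – if the connector cannot be configured with URIEncoding, re‑decode the parameter manually. Do this only when the container really decoded the query string as ISO‑8859‑1; applied to text that is already correct, it corrupts it:

String name = request.getParameter("name");
// Undo the ISO-8859-1 decoding, then reinterpret the same bytes as UTF-8 (needs java.nio.charset.StandardCharsets).
if (name != null) name = new String(name.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);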
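Why this re‑decoding works can be reproduced without a servlet container. The sketch below simulates the failure (UTF‑8 bytes read as ISO‑8859‑1) and then reverses it; MojibakeDemo, misdecode and repair are illustrative names. The round trip is lossless because ISO‑8859‑1 maps every byte value to exactly one character.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Simulates a container that reads UTF-8 bytes as ISO-8859-1.
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Recovers the original bytes and decodes them as UTF-8.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u7EF4"; // one Chinese character, three UTF-8 bytes
        String garbled = misdecode(original);

        System.out.println(garbled.length());                 // 3
        System.out.println(repair(garbled).equals(original)); // true
    }
}
```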

Conclusion

Character encoding evolved from 7‑bit ASCII to multi‑byte schemes that support every written language. Unicode defines the universal character set, assigning each character a code point; UTF‑32, UTF‑16 and UTF‑8 are the encodings that turn those code points into bytes for storage and transmission, each trading simplicity against size.

final encoding summary diagram