Demystifying Character Encoding: From ASCII to Unicode and Beyond

This article explains the fundamentals of character encoding, covering concepts such as information, symbols, character sets, various encoding schemes like ASCII, GB2312, UTF‑8, Unicode planes, common pitfalls, and practical examples to help developers avoid garbled text.

Seewo Tech Circle

Aug 30, 2019

Character Encoding Concepts

Character encoding issues are common in software development. This article introduces the concept of character encoding and explains typical problems.

In Shannon's sense, information is what eliminates random uncertainty. A symbol is a visible mark that stands for something, and text is a carrier of information built from symbols. A character is a single symbol used in text, such as "a" or "柯".

A character set is a defined collection of characters, e.g., the set of all Chinese characters.

Character encoding defines the mapping between characters and numbers, e.g., assigning the number 356 to the character "美".

An encoding method is one specific way of performing character encoding.

Demo

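Imagine a simple world with only three characters: "you", "me", and "him". We define a "mischief character set" containing them and assign codes in three different ways, which shows that one character set can have several encodings: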

Method 1: 1 → "me", 2 → "you", 3 → "him" (named "sequential code").

Method 2: 1 → "him", 2 → "you", 3 → "me" (named "reverse code").

Method 3: 1 → "you", 2 → "me", 3 → "him" (named "random code").
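The three methods above can be sketched as lookup tables. The table and function names are made up for illustration; the point is that the same stored numbers decode to different characters depending on which table the reader assumes.

```python
# Three hypothetical encodings of the same three-character set.
SEQUENTIAL = {1: "me", 2: "you", 3: "him"}
REVERSE = {1: "him", 2: "you", 3: "me"}
RANDOM = {1: "you", 2: "me", 3: "him"}


def encode(chars, table):
    """Map characters to numbers using the inverse of a code table."""
    inverse = {char: code for code, char in table.items()}
    return [inverse[char] for char in chars]


def decode(codes, table):
    """Map numbers back to characters using a code table."""
    return [table[code] for code in codes]


message = ["you", "me", "him"]
stored = encode(message, SEQUENTIAL)     # [2, 1, 3]
same_table = decode(stored, SEQUENTIAL)  # round-trips to the original message
wrong_table = decode(stored, REVERSE)    # ['you', 'him', 'me'], i.e. garbled
```

Decoding with a table other than the one used for encoding is, in miniature, how garbled text arises.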

Encoding Types

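Region Code – identifies a character by its row and column in a character table, as GB2312 does. Like the sequential, reverse, and random codes above, it is simply the name of one encoding method, and programmers rarely need to deal with it directly.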

Exchange Code – an encoding used for data exchange between different systems to avoid garbled text caused by inconsistent internal codes.

Internal Code – the encoding a computer system uses internally to store and process text. In many cases the exchange code doubles as the internal code, as with ASCII, UTF‑8, UTF‑16, and UTF‑32. GB2312 is an exception: its internal code differs from its exchange code.

Modern operating systems (Windows, macOS, Linux) typically use a Unicode encoding as the internal code, for example UTF‑16 in the Windows API and UTF‑8 on most Linux systems.

Unicode Encoding

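A Unicode code point (e.g., U+4E00) is the number that Unicode assigns to a character. A code point is not itself an internal code: how it is stored as bytes depends on the encoding (UTF‑8, UTF‑16, or UTF‑32), although in some encodings the stored value is numerically equal to the code point.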
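As a small illustration of the gap between a code point and its stored bytes, the sketch below encodes U+4E00 three ways using Python's built-in codecs. The variable names are illustrative only.

```python
code_point = 0x4E00
char = chr(code_point)  # the character "一"

utf32 = char.encode("utf-32-be")  # bytes 00 00 4E 00
utf16 = char.encode("utf-16-be")  # bytes 4E 00
utf8 = char.encode("utf-8")       # bytes E4 B8 80

# In UTF-32 (and in UTF-16 for this character) the stored value equals the code point.
utf32_matches = int.from_bytes(utf32, "big") == code_point
utf16_matches = int.from_bytes(utf16, "big") == code_point
# In UTF-8 the stored bytes, read as one number, differ from the code point.
utf8_matches = int.from_bytes(utf8, "big") == code_point
```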

Unicode code points range from U+0000 to U+10FFFF, a space of 1,114,112 code points divided into 17 planes of 65,536 code points each. The first plane, plane 0 (U+0000–U+FFFF), is the Basic Multilingual Plane (BMP), which contains the most commonly used characters; the remaining 16 planes are supplementary planes.
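The plane arithmetic can be checked in a few lines. This is a minimal sketch; `plane_of` is a hypothetical helper name, not a standard library function.

```python
def plane_of(char):
    """Return the Unicode plane (0-16) that a single character belongs to."""
    return ord(char) >> 16  # each plane holds 0x10000 code points


total_code_points = 0x10FFFF + 1       # 1,114,112
planes = total_code_points // 0x10000  # 17

bmp_example = plane_of("\u4e00")                # 0: a CJK ideograph in the BMP
supplementary_example = plane_of("\U0001F600")  # 1: an emoji in a supplementary plane
```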

Common Chinese Encoding Rules

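GB2312 uses region codes: characters are laid out in a 94×94 grid (94 rows, 94 columns), and a character's region code is its row number followed by its column number. Adding 32 (0x20) to the row and column numbers produces the exchange code, and adding a further 128 (0x80, which sets the high bit of each byte) yields the internal code.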
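The rule above can be sketched as follows. The helper name `region_to_codes` is hypothetical, and the example relies on two assumptions that are not stated in this article: that "啊" sits at row 16, column 1 of the GB2312 table, and that Python's `gb2312` codec emits the internal code.

```python
def region_to_codes(row, col):
    """Derive the exchange code and internal code from a GB2312 region code."""
    exchange = bytes([row + 0x20, col + 0x20])      # add 32 to row and column
    internal = bytes([b + 0x80 for b in exchange])  # set the high bit of each byte
    return exchange, internal


# Assumed region code for "啊": row 16, column 1.
exchange, internal = region_to_codes(16, 1)
# exchange is 30 21 in hex, internal is B0 A1 in hex.

# Cross-check against Python's built-in codec.
codec_bytes = "啊".encode("gb2312")
```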

UTF‑8 Encoding Rules
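As a sketch of the standard UTF‑8 rules: code points up to U+007F take one byte (0xxxxxxx), up to U+07FF two bytes (110xxxxx 10xxxxxx), up to U+FFFF three bytes (1110xxxx 10xxxxxx 10xxxxxx), and up to U+10FFFF four bytes (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx). The function name `utf8_encode` is hypothetical, and for brevity it does not reject surrogate code points as a real encoder would.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point into UTF-8 bytes by hand."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if code_point <= 0x7F:
        return bytes([code_point])
    if code_point <= 0x7FF:
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Compare the hand-rolled encoder with Python's built-in one
# for a 1-, 2-, 3-, and 4-byte character.
samples = ["A", "\u00e9", "\u7f8e", "\U0001F600"]
matches_builtin = all(utf8_encode(ord(ch)) == ch.encode("utf-8") for ch in samples)
```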

Common Encoding Issues

1. Even with exchange codes, garbled text can still occur, because different programs may use different exchange codes: bytes written in one encoding are then decoded with another.
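Item 1 can be reproduced with Python's built-in codecs. In this minimal sketch the sample string and the choice of GBK as the mismatched decoder are arbitrary; `errors="replace"` is used so that byte sequences invalid in GBK do not raise an exception.

```python
text = "编码"

data = text.encode("utf-8")                     # the writer uses UTF-8
garbled = data.decode("gbk", errors="replace")  # the reader wrongly assumes GBK
restored = data.decode("utf-8")                 # the reader uses the right encoding
```

The bytes themselves are never damaged; only the interpretation is wrong, which is why decoding again with the correct encoding recovers the text.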

2. Notepad saves "Unicode" (UTF‑16) files with a leading byte order mark (BOM) so that readers can tell the byte order: the bytes FF FE indicate little‑endian and FE FF indicate big‑endian. The BOM is the character U+FEFF (zero‑width no‑break space), which is legal in Unicode; writing it is not a Notepad quirk but a practice defined by the Unicode standard.
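A reader can use the BOM described in item 2 to pick the byte order. The sketch below uses the BOM constants from Python's `codecs` module; `detect_utf16_byte_order` is a hypothetical helper, and it deliberately ignores UTF‑32, whose little‑endian BOM also begins with FF FE.

```python
import codecs


def detect_utf16_byte_order(data):
    """Guess the UTF-16 byte order from a leading BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little-endian"
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big-endian"
    return "unknown"


le_file = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")  # FF FE 41 00
be_file = codecs.BOM_UTF16_BE + "A".encode("utf-16-be")  # FE FF 00 41

le_order = detect_utf16_byte_order(le_file)
be_order = detect_utf16_byte_order(be_file)
```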

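3. Character sets and character encodings are not in one‑to‑one correspondence: a single character set can have several encodings (Unicode, for instance, has UTF‑8, UTF‑16, and UTF‑32). Knowing the encoding therefore tells you the character set, but not vice versa.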

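4. Differences among UTF‑8, UTF‑16, and UTF‑32, with GB2312 for comparison:

UTF‑8: variable length, 1–4 bytes per character.

UTF‑16: variable length, 2 or 4 bytes per character.

UTF‑32: fixed length, 4 bytes per character.

GB2312: fixed length, 2 bytes per Chinese character (ASCII characters mixed into GB2312 text still take 1 byte each).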
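The lengths listed in item 4 can be measured directly. This is a minimal sketch using Python's built-in codecs; `byte_lengths` is a hypothetical helper, and the explicit `-le` variants are chosen so that no BOM is counted.

```python
def byte_lengths(char):
    """Return the encoded length of one character in several encodings."""
    lengths = {}
    for name in ["utf-8", "utf-16-le", "utf-32-le", "gb2312"]:
        try:
            lengths[name] = len(char.encode(name))
        except UnicodeEncodeError:
            lengths[name] = None  # the character is not in this encoding's character set
    return lengths


ascii_lengths = byte_lengths("A")           # utf-8: 1, utf-16-le: 2, utf-32-le: 4, gb2312: 1
hanzi_lengths = byte_lengths("\u7f8e")      # "美": utf-8: 3, utf-16-le: 2, utf-32-le: 4, gb2312: 2
emoji_lengths = byte_lengths("\U0001F600")  # utf-8: 4, utf-16-le: 4, utf-32-le: 4, gb2312: None
```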

5. UCS‑2 is an early Unicode encoding that uses one fixed 2‑byte unit per character. The stored value equals the Unicode code point, so UCS‑2 can represent only the first plane (the BMP).

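6. UTF‑16 is a superset of UCS‑2: for characters in the first plane the two encodings are identical, while UTF‑16 additionally encodes supplementary‑plane characters in 4 bytes as a surrogate pair.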
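To make items 5 and 6 concrete, the sketch below computes UTF‑16 code units by hand using the standard surrogate‑pair arithmetic. `utf16_units` is a hypothetical helper name.

```python
def utf16_units(code_point):
    """Return the list of 16-bit UTF-16 code units for one code point."""
    if code_point <= 0xFFFF:
        # First plane: a single unit equal to the code point, exactly as in UCS-2.
        return [code_point]
    offset = code_point - 0x10000     # a 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits go into the low surrogate
    return [high, low]


bmp_units = utf16_units(0x4E00)            # [0x4E00]
supplementary_units = utf16_units(0x1F600)  # [0xD83D, 0xDE00]
```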

7. "Unicode" often appears alongside UTF‑8 in lists of encodings, even though Unicode itself is a character set rather than an encoding. This is because early Unicode implementations used UCS‑2 (effectively UTF‑16), and some tools still label the UTF‑16 encoding as "Unicode".

8. BOM in UTF‑8: UTF‑8 has no byte‑order ambiguity, so the byte order mark (U+FEFF, encoded as the bytes EF BB BF) is optional. Windows Notepad adds it for consistency, but it can cause problems for tools on Linux and macOS, so web files usually omit the BOM.
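One way to handle the UTF‑8 BOM from item 8 in Python is the built-in `utf-8-sig` codec, which writes the BOM when encoding and strips it when decoding. This is a minimal sketch with illustrative variable names.

```python
with_bom = "hi".encode("utf-8-sig")   # bytes EF BB BF followed by "hi"

naive = with_bom.decode("utf-8")      # the BOM survives as the character U+FEFF
clean = with_bom.decode("utf-8-sig")  # the BOM is stripped

# utf-8-sig also accepts input that has no BOM at all.
no_bom = b"hi".decode("utf-8-sig")
```

The stray U+FEFF left by the plain `utf-8` decoder is invisible when printed, which is why it tends to surface as mysterious parse errors rather than visible garbage.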

Conclusion

This article introduced various theoretical aspects of character encoding without delving into concrete applications; a forthcoming article will explore practical encoding problems in software development.
