Understanding Character Encoding: From GBK and UTF-8 to Unicode

This tutorial explains the origins and evolution of character encoding, covering early ASCII, Chinese GBK/GB18030, the universal Unicode standard, UTF‑8 variable‑length encoding, and practical differences between Python 2 and Python 3 with code examples.

IT Services Circle

Mar 4, 2022

Encoding issues often trouble Python learners, especially when encountering GBK, UTF‑8, GB2312, or GB18030. This article tells the story of how character encoding originated and evolved, using a narrative approach.

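The "fire-soldier" analogy illustrates binary representation. Each signal fire is either unlit (0) or lit (1), so one fire can represent 0 or 1, two fires can represent 0-3, and three fires 0-7. In general, n fires distinguish 2^n values, which is exactly how computers use bits.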
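The counting rule behind the analogy can be checked with a few lines of Python. This is only an illustrative sketch; the helper name `distinct_values` is made up for this example and does not come from the original article.

```python
def distinct_values(n_bits: int) -> int:
    """Number of different values that n on/off signals (bits) can represent."""
    return 2 ** n_bits


for n in (1, 2, 3, 8):
    highest = distinct_values(n) - 1
    # Show the range 0..highest and the highest value written in binary.
    print(f"{n} bit(s): {distinct_values(n)} values, 0..{highest}, max = {highest:b}")
```

Eight bits, one byte, give 256 values, which is why a single byte is the natural unit for the encodings discussed next.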

Early computers used ASCII, a 7-bit code that defines 128 symbols. Each symbol is stored in an 8-bit byte whose highest bit is left at 0 (later extended encodings put that bit to use). This is sufficient only for English letters, digits, punctuation, and control characters.
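A short sketch of what this means in practice, using Python 3's built-in `ord`, `chr`, and `str.encode`. The sample characters and variable names are chosen only for illustration.

```python
# ord() returns a character's code point; chr() is the inverse.
for ch in "Az09!":
    code = ord(ch)
    # Printed as 8 binary digits: the leading bit is 0 for every ASCII character.
    print(ch, code, format(code, "08b"))

ascii_bytes = "Hello".encode("ascii")
all_seven_bit = all(b < 128 for b in ascii_bytes)

# ASCII has no slot for Chinese characters, so encoding one fails.
try:
    "中".encode("ascii")
    non_ascii_ok = True
except UnicodeEncodeError:
    non_ascii_ok = False

print(all_seven_bit, non_ascii_ok)
```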

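When computers spread to China, the need for Chinese characters led first to GB2312, which stores each Chinese character in two bytes and covers 6,763 commonly used characters. GBK extended it to more than 20,000 characters, still using one byte for ASCII and two bytes per Chinese character; two bytes allow at most 65,536 combinations, of which only part is usable. GB18030 came later, added four-byte sequences, and offers the largest character set. Each newer standard was designed to be backward-compatible with the earlier ones.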
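The difference in coverage can be probed with Python 3's standard `gb2312`, `gbk`, and `gb18030` codecs. This is a rough sketch: the helper `try_encode` and the sample characters (a Latin letter, a simplified character, a traditional character, and an emoji) are assumptions chosen for illustration, not part of the original article.

```python
def try_encode(text: str, codec: str):
    """Return the encoded bytes, or None if the codec cannot represent the text."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None


samples = ["A", "中", "龍", "😀"]
for ch in samples:
    for codec in ("gb2312", "gbk", "gb18030"):
        data = try_encode(ch, codec)
        size = "not representable" if data is None else f"{len(data)} byte(s)"
        print(f"{ch!r} in {codec}: {size}")
```

Characters that one codec rejects but the next accepts show where each standard widened the character set.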

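To support multiple languages worldwide, Unicode (also called "Universal Code") was introduced. It assigns every character a unique number, called a code point. UCS-2 stored each code point in 2 bytes (65,536 slots), while UCS-4 used 4 bytes (over 4 billion slots), which wastes space on text that is mostly ASCII. UTF-8, a variable-length Unicode encoding, became the practical solution: 1 byte for ASCII, 2 bytes for many European scripts, 3 bytes for most Chinese characters, and 4 bytes for characters outside the Basic Multilingual Plane, such as emoji.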
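The variable-length behaviour is easy to observe by encoding one character from each range. The sketch below uses Python 3's `utf-8` and `utf-32-be` codecs (the latter stands in for a fixed 4-byte encoding); the sample characters are picked for illustration.

```python
samples = {
    "A": "ASCII letter",
    "\u00e9": "accented Latin letter",
    "中": "Chinese character",
    "😀": "emoji outside the Basic Multilingual Plane",
}

for ch, label in samples.items():
    utf8 = ch.encode("utf-8")
    utf32 = ch.encode("utf-32-be")  # fixed width: always 4 bytes per character
    print(f"U+{ord(ch):04X} {label}: UTF-8 {len(utf8)} byte(s) [{utf8.hex()}], "
          f"UTF-32 {len(utf32)} bytes")

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_lengths)
```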

In Python, the default source-file encoding differs between versions. Python 2 assumes ASCII, so a source file containing non-ASCII characters needs an explicit # -*- coding:utf-8 -*- comment at the top. Python 3 assumes UTF-8 and treats all string literals as Unicode, keeping text (str) separate from encoded data (bytes).
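A small Python 3 sketch of the text/bytes split described above. The variable names are illustrative; `sys.getdefaultencoding()` is the standard-library call that reports the default string encoding.

```python
import sys

default_encoding = sys.getdefaultencoding()  # 'utf-8' on Python 3

text = "黄伟"                 # str: a sequence of Unicode code points
data = text.encode("utf-8")   # bytes: the same text in encoded form

print(default_encoding)
print(type(text).__name__, len(text))  # length counted in characters
print(type(data).__name__, len(data))  # length counted in bytes
```

The two lengths differ because each of these characters takes three bytes in UTF-8.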

Example of encoding and decoding in Python:

# Python 2: the u prefix marks a Unicode string literal
my_name = u"黄伟"
# Python 3: every string literal is already Unicode, so no prefix is needed
my_name = "黄伟"

# Encode and decode examples
name1 = "我是你们的teacher老师"
name1_encode = name1.encode("utf-8")
print(name1_encode)  # b'...'
print(name1_encode.decode("utf-8"))

name2 = "你们是我的student学生"
name2_encode = name2.encode("gbk")
print(name2_encode)
print(name2_encode.decode("gbk"))

# Decoding UTF-8 bytes as GBK produces garbled text (mojibake);
# with other input it can also raise UnicodeDecodeError
print(name1_encode.decode("gbk"))
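The opposite mismatch, GBK bytes read as UTF-8, usually fails outright rather than producing garbled text. The following sketch, with an illustrative sample string, shows the strict failure and the standard `errors="replace"` option of `bytes.decode`, which substitutes U+FFFD for undecodable bytes.

```python
gbk_bytes = "你好".encode("gbk")

# Strict decoding (the default) raises when the bytes are not valid UTF-8.
try:
    gbk_bytes.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print("strict decode failed:", exc.reason)

# errors="replace" never raises; undecodable bytes become U+FFFD.
lenient = gbk_bytes.decode("utf-8", errors="replace")
print(lenient)
```

Replacement hides the error but does not recover the text; the reliable fix is to decode with the encoding the bytes were actually written in.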

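The key takeaway is that every conversion between two encodings passes through Unicode: decode the bytes from the source encoding into a Unicode string, then encode that string into the target encoding. Understanding the history and mechanics of encodings helps avoid common pitfalls such as garbled text.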
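As a minimal sketch of that rule, converting GBK bytes to UTF-8 is a decode followed by an encode. The function name `transcode` is hypothetical and introduced only for this example.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes from one encoding to another by way of a Unicode str."""
    text = data.decode(source)   # bytes -> Unicode
    return text.encode(target)   # Unicode -> bytes


gbk_bytes = "你们是我的student学生".encode("gbk")
utf8_bytes = transcode(gbk_bytes, "gbk", "utf-8")

print(len(gbk_bytes), len(utf8_bytes))  # same text, different byte counts
print(utf8_bytes.decode("utf-8"))
```

The round trip only works when the source encoding is named correctly, and when the target encoding can represent every character in the text.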
