Demystifying Python Encoding: From ASCII to Unicode and UTF‑8

This article explains the fundamentals of character encoding in Python, covering concepts like ASCII, GB2312/GBK, Unicode, UTF‑8, and the differences between Python 2 and Python 3, while illustrating the encoding lifecycle of a .py file with clear examples and images.

MaGe Linux Operations

Jun 18, 2017

Encoding (Python version)

While learning Python, I was confused by the various encodings, so I compiled these notes, along with my own understanding, to help others who are struggling with the same topic.

Concept of Encoding

Encoding converts information from one format to another. Computers only work with binary, so converting visible text into binary is encoding, and converting binary back into readable text is decoding. An encoding scheme is the mapping that assigns letters, digits, and symbols to specific binary values.
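To make the encode/decode round trip concrete, here is a minimal sketch. The sample string "Hi" and the variable names are illustrative choices, not from the original notes.

```python
# Encode a short string to bytes, then show each byte as 8 binary digits.
text = "Hi"
encoded = text.encode("ascii")                      # str -> bytes (encoding)
bits = [format(byte, "08b") for byte in encoded]    # one 8-bit string per byte
print(bits)                                         # ['01001000', '01101001']

decoded = encoded.decode("ascii")                   # bytes -> str (decoding)
print(decoded == text)                              # True
```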

In Python, you can view the default encoding by calling sys.getdefaultencoding().
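The original snippet did not survive, so the following is a small sketch of that check using sys.getdefaultencoding() from the standard library, assuming a Python 3 interpreter.

```python
import sys

# Default encoding used for implicit str <-> bytes conversions.
# Python 3 reports 'utf-8'; Python 2 reported 'ascii'.
default_encoding = sys.getdefaultencoding()
print(default_encoding)
```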

ASCII

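ASCII (American Standard Code for Information Interchange) is an early encoding developed in the United States. It covers the 26 English letters in upper and lower case, the digits, punctuation, and control characters. ASCII is a 7‑bit code with values 0‑127, usually stored one character per byte (8 bits). Later "extended ASCII" encodings such as Latin‑1 use the highest bit to add values 128‑255, but they are separate standards rather than part of ASCII itself.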

In Python, ord() converts a character to its numeric value, and chr() converts a numeric value back to a character.
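A short sketch of both functions, assuming Python 3, where they work on any Unicode code point and not only on ASCII:

```python
# ord(): character -> integer code point
print(ord("A"))       # 65
print(ord("中"))      # 20013

# chr(): integer code point -> character
print(chr(65))        # A
print(chr(0x4E2D))    # 中
```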

GB2312 and GBK

When computers reached China, ASCII could not represent Chinese characters, so the GB2312 standard was introduced in 1980, covering 7,445 characters (including 6,763 Chinese characters). Each Chinese character uses two bytes, with high bytes ranging B0‑F7 and low bytes A1‑FE.

GBK extends GB2312 and is backward compatible with it. It defines 23,940 code points, including 21,003 Chinese characters, and remains the default encoding on many Chinese‑language systems (for example, Chinese editions of Windows).
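The byte ranges and the compatibility claim can be checked with Python's built-in gb2312 and gbk codecs. This is a sketch, assuming Python 3. The sample characters are my own choices, and the second one is only assumed to lie outside GB2312, so the code reports what the codec does instead of asserting it.

```python
ch = "中"
gb2312_bytes = ch.encode("gb2312")
gbk_bytes = ch.encode("gbk")
print(gb2312_bytes.hex())            # d6d0
print(gbk_bytes == gb2312_bytes)     # True: GBK keeps the GB2312 byte values

# Check the two bytes against the ranges quoted for Chinese characters.
high, low = gb2312_bytes
in_range = 0xB0 <= high <= 0xF7 and 0xA1 <= low <= 0xFE
print(in_range)                      # True

# Assumption: this character is outside GB2312 but inside GBK.
rare = "镕"
try:
    rare.encode("gb2312")
    print("encodable in GB2312")
except UnicodeEncodeError:
    print("not in GB2312; GBK bytes:", rare.encode("gbk").hex())
```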

Unicode and UTF‑8

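Various national encodings (e.g., Shift‑JIS, BIG5) were incompatible with one another. Unicode was created as a universal character set that assigns every character a unique number, called a code point. Early versions fit in two bytes (65,536 code points), but the standard has since grown to the range U+0000 to U+10FFFF, more than a million code points, which is enough for the world's scripts.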

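Because storing every code point at a fixed width (as UTF‑32 does, with four bytes per character) wastes space, variable‑length encodings like UTF‑8 were developed. UTF‑8 uses one byte for ASCII characters, two bytes for accented Latin letters, Greek, and Cyrillic, three bytes for most Chinese characters, and four bytes for code points above U+FFFF, making it efficient for storage and transmission.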
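The variable width is easy to observe by encoding one character from each range. This is a sketch, assuming Python 3. The four sample characters are illustrative choices.

```python
samples = ["A", "é", "中", "😀"]

# Number of UTF-8 bytes needed for each character.
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_lengths)                  # [1, 2, 3, 4]

# The emoji's code point lies above U+FFFF, so two bytes cannot hold it.
print(ord("😀") > 0xFFFF)            # True

# UTF-32 spends four bytes on every character, ASCII included.
utf32_lengths = [len(ch.encode("utf-32-le")) for ch in samples]
print(utf32_lengths)                 # [4, 4, 4, 4]
```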

Python 2 Encoding

Python 2 uses ASCII as its default encoding. It has two string types: str, which stores raw bytes, and unicode, which stores Unicode code points. Converting unicode to str is encoding; converting str to unicode is decoding.

Decoding with the wrong encoding (e.g., decoding UTF‑8 data as Big5) either raises an error or produces garbled text. The fix is to decode with the same encoding that was used to encode the data.
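The effect can be reproduced in a few lines. This sketch is written in Python 3 syntax so it runs on a current interpreter, and it uses GBK as the mismatched codec instead of Big5. Both choices are mine, not the original article's. errors="replace" is passed so that invalid byte pairs become replacement characters instead of raising.

```python
text = "中文"
data = text.encode("utf-8")                      # bytes produced with UTF-8

# Wrong codec: the bytes are regrouped into unrelated characters.
garbled = data.decode("gbk", errors="replace")
print(garbled == text)                           # False

# Matching codec: the original text comes back.
restored = data.decode("utf-8")
print(restored == text)                          # True
```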

Python 3 Encoding

Python 3 also defines two string types: str (Unicode) and bytes. str stores Unicode data, while bytes stores raw byte sequences.
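A quick way to see the two types side by side (a sketch; the sample text is an illustrative choice):

```python
text = "中文"                    # str: a sequence of Unicode code points
data = text.encode("utf-8")      # bytes: a sequence of integers 0-255

print(type(text).__name__, len(text))   # str 2   (two characters)
print(type(data).__name__, len(data))   # bytes 6 (three bytes per character)
print(data[0])                          # 228, indexing bytes yields an int
```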

Conversion between different encodings always passes through Unicode as an intermediate step. For example, converting UTF‑8 data to GBK requires decoding UTF‑8 to Unicode, then encoding Unicode to GBK.
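In code, the two-step conversion looks like this. It is a minimal sketch, assuming Python 3, with illustrative variable names.

```python
utf8_data = "中文".encode("utf-8")       # pretend this arrived as UTF-8 bytes

# Step 1: decode UTF-8 bytes into a Unicode str.
unicode_text = utf8_data.decode("utf-8")

# Step 2: encode the Unicode str into GBK bytes.
gbk_data = unicode_text.encode("gbk")

print(len(utf8_data), len(gbk_data))     # 6 4
print(gbk_data.decode("gbk") == unicode_text)   # True
```

There is no direct bytes-to-bytes step between the two encodings; the str in the middle is what makes the conversion well defined.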

Python 2 implicitly decodes str (bytes) to unicode, using ASCII, when the two types are mixed. Python 3 never converts between bytes and str implicitly, which gives a clearer separation between raw bytes and Unicode text.
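The Python 3 side of this difference can be demonstrated directly. The sketch below assumes Python 3 and shows only that behavior, since the Python 2 behavior cannot be run on a current interpreter.

```python
# Mixing str and bytes is rejected instead of being decoded behind the scenes.
try:
    "abc" + b"def"
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)                          # False

# The conversion has to be spelled out, with an explicit encoding.
joined = "abc" + b"def".decode("ascii")
print(joined)                            # abcdef
```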

The Life Cycle of a .py File

1. When creating a .py file, the editor (e.g., PyCharm) sets a default encoding, typically UTF‑8.

2. Code written in the file is stored in memory as Unicode.

3. Upon saving, Unicode data is encoded to UTF‑8 and written to disk.

4. When the file is executed, the interpreter reads the bytes from disk and decodes them. Python 2 assumes ASCII by default, so a file containing Chinese characters fails with a SyntaxError unless an encoding is declared.

Therefore, a Python 2 script that contains non‑ASCII characters needs an explicit encoding declaration (e.g., # -*- coding: utf-8 -*-) on its first or second line, whereas Python 3 assumes UTF‑8 source by default.
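The save and load steps above can be approximated in Python 3 without touching the disk. This is a sketch under stated assumptions: the source strings are made up for illustration, the ASCII decode only imitates what Python 2's default did, and tokenize.detect_encoding() is the standard-library helper that reads a coding declaration the way Python 3 does.

```python
import io
import tokenize

# Step 3: saving encodes the in-memory Unicode text to UTF-8 bytes.
source_bytes = "print('中文')\n".encode("utf-8")

# Step 4 under an ASCII default (imitating Python 2): decoding fails.
try:
    source_bytes.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print(ascii_ok)                          # False

# Python 3 falls back to UTF-8 when no declaration is present...
enc_default, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
print(enc_default)                       # utf-8

# ...and honors a declaration when there is one.
declared = b"# -*- coding: gbk -*-\nprint('hi')\n"
enc_declared, _ = tokenize.detect_encoding(io.BytesIO(declared).readline)
print(enc_declared)                      # gbk
```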
