Mastering Character Encoding in Python: From ASCII to UTF‑8
This article explains the fundamental concepts of characters, character sets, and encodings, compares common encodings such as ASCII, Unicode, and UTF‑8, and shows how Python 2 and Python 3 handle default encodings, string types, and common Unicode errors with practical code examples.
Basic Concepts
In computing, a character is a unit of textual information such as a letter, digit, punctuation mark, or symbol. A character set is a collection of characters, for example ASCII, GB2312, and Unicode. A character encoding maps each character in a set to a sequence of bytes so that computers can store and process it. Common encodings include ASCII, UTF‑8, and GBK.
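To make the distinction between a character, its number in the character set, and its encoded bytes concrete, here is a minimal Python 3 sketch. The variable names are illustrative only; the point is that the same character yields different bytes under different encodings.

```python
# Python 3 sketch: one character, one code point, different byte sequences.
ch = "中"

code_point = ord(ch)              # number assigned by the Unicode character set
utf8_bytes = ch.encode("utf-8")   # three bytes under UTF-8
gbk_bytes = ch.encode("gbk")      # two bytes under GBK

print(hex(code_point))   # 0x4e2d
print(utf8_bytes)        # b'\xe4\xb8\xad'
print(gbk_bytes)         # b'\xd6\xd0'

# Decoding with the encoding that produced the bytes recovers the character.
from_utf8 = utf8_bytes.decode("utf-8")
from_gbk = gbk_bytes.decode("gbk")
```

Decoding the GBK bytes as UTF‑8 (or the reverse) would either fail or produce the wrong text, which is why the encoding has to travel with the bytes.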
Common Character Encodings
The most frequently discussed encodings are ASCII, Unicode, and UTF‑8.
ASCII
Developed in the 1960s in the United States, ASCII is a 7‑bit code that defines 128 characters (English letters, digits, punctuation, and control characters) and their binary representations, e.g., A = 01000001 (65), a = 01100001 (97), SPACE = 00100000 (32).
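The values above can be checked in Python 3 with the built-in ord and format functions. This is only a small sketch; the dictionary name ascii_bits is made up for the example.

```python
# Map each sample character to (decimal value, 8-bit binary string).
ascii_bits = {ch: (ord(ch), format(ord(ch), "08b")) for ch in ("A", "a", " ")}

for ch, (value, bits) in ascii_bits.items():
    print(repr(ch), value, bits)

# Every ASCII value fits in 7 bits, so it is always below 128.
all_seven_bit = all(value < 128 for value, _ in ascii_bits.values())
```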
Unicode
ASCII is insufficient for non‑English languages. Unicode provides a universal character set covering virtually all languages. It assigns a unique code point to each character, written as U+XXXX, e.g., U+0041 for "A" and U+4E25 for the Chinese character "严". Unicode has been continuously expanded since its first release in 1991.
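In Python 3, ord and chr convert between a character and its code point, so the U+XXXX notation can be reproduced with a short helper. The function name code_point_label is hypothetical, chosen for this sketch.

```python
def code_point_label(ch):
    """Return the U+XXXX label for a single character (hypothetical helper)."""
    return "U+{:04X}".format(ord(ch))

label_a = code_point_label("A")     # U+0041
label_yan = code_point_label("严")  # U+4E25

print(label_a, label_yan)

# chr is the inverse of ord: code point back to character.
round_trip_ok = chr(0x4E25) == "严"
```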
UTF‑8
UTF‑8 is a variable‑length encoding of Unicode that uses one to four bytes per character. ASCII characters remain one byte, many non‑Latin scripts use two or three bytes, and characters outside the Basic Multilingual Plane (such as emoji) need four. For ASCII‑heavy text this avoids the waste of UTF‑32, which always uses four bytes per character, and of UTF‑16, which uses at least two.
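The byte counts can be compared directly in Python 3. The sketch below uses four sample characters picked for illustration and the little-endian codec variants, which do not prepend a byte order mark.

```python
samples = ["A", "é", "严", "😀"]

# Number of bytes each character occupies under each encoding.
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
utf16_lengths = [len(ch.encode("utf-16-le")) for ch in samples]
utf32_lengths = [len(ch.encode("utf-32-le")) for ch in samples]

for ch, n8, n16, n32 in zip(samples, utf8_lengths, utf16_lengths, utf32_lengths):
    # Print the code point rather than the character to stay console-safe.
    print("U+{:04X}".format(ord(ch)), n8, n16, n32)
```

Only UTF‑32 is truly fixed‑width here; UTF‑16 switches from two to four bytes for the emoji.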
Python’s Default Encoding
The default encoding Python uses for implicit conversions between byte strings and Unicode strings is ascii in Python 2 and utf‑8 in Python 3. You can query it with sys.getdefaultencoding():
# Python 2
>>> import sys
>>> sys.getdefaultencoding()
'ascii'

# Python 3
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
String Types in Python 2
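Python 2 distinguishes str (byte strings) and unicode (Unicode strings). A str holds raw bytes in whatever encoding produced them; Python assumes ascii only when it has to convert implicitly. Unicode literals are written with a leading u, e.g., u'中文'. Use decode to turn a str into unicode and encode to go the other way: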
>>> '中文'.decode('utf-8')    # str -> unicode (source bytes are UTF-8)
u'\u4e2d\u6587'
>>> u'中文'.encode('utf-8')   # unicode -> str
'\xe4\xb8\xad\xe6\x96\x87'
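For comparison, Python 3 renames the two types: text is str and raw bytes are bytes, with the same encode and decode methods connecting them. A minimal sketch of the round trip in Python 3 (variable names are illustrative):

```python
text = "中文"                     # str: a sequence of Unicode code points
data = text.encode("utf-8")       # bytes: the UTF-8 encoded form
round_trip = data.decode("utf-8")

print(type(text).__name__, len(text))   # str 2
print(type(data).__name__, len(data))   # bytes 6
```

The length difference shows the distinction: two characters, six bytes.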
Root Causes of UnicodeEncodeError & UnicodeDecodeError
These errors occur when str and unicode objects are mixed and Python implicitly tries to encode or decode using the default ascii codec.
Example of a UnicodeDecodeError:
>>> s = '你好'    # str, UTF-8 encoded
>>> u = u'世界'   # unicode
>>> s + u         # implicit s.decode('ascii') + u
Traceback (most recent call last):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
Fix by explicitly decoding with utf‑8:
>>> s = '你好'
>>> u = u'世界'
>>> s.decode('utf-8') + u
u'\u4f60\u597d\u4e16\u754c'
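Python 3 removes the implicit step altogether: adding bytes and str raises a TypeError instead of guessing an encoding. The sketch below, written for Python 3, shows that behavior and then reproduces by hand the ascii decode that Python 2 attempted implicitly. The variable names are assumptions for the example.

```python
s = "你好".encode("utf-8")   # bytes, like the Python 2 str above
u = "世界"                   # str, like the Python 2 unicode above

mix_error = None
try:
    s + u                    # Python 3 refuses to mix the two types
except TypeError as exc:
    mix_error = exc

decode_failure = None
try:
    s.decode("ascii")        # what Python 2 attempted implicitly
except UnicodeDecodeError as exc:
    # Record the offending byte and its position.
    decode_failure = (exc.object[exc.start], exc.start)

joined = s.decode("utf-8") + u   # the explicit fix
```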
Similarly, a UnicodeEncodeError appears when a unicode object is implicitly encoded to ascii:
>>> u_str = u'你好'
>>> str(u_str)    # tries ascii encoding
Traceback (most recent call last):
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
Explicitly encode to utf‑8 before converting:
>>> u_str.encode('utf-8')
'\xe4\xbd\xa0\xe5\xa5\xbd'
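The same failure can be triggered explicitly in Python 3 by asking for the ascii codec, which makes the error's fields easy to inspect. This is a sketch, not the article's original code; note that the exception's end attribute is exclusive, so start 0 and end 2 correspond to "position 0-1" in the message.

```python
u_str = "你好"

failure = None
try:
    u_str.encode("ascii")    # neither character exists in ASCII
except UnicodeEncodeError as exc:
    failure = (exc.encoding, exc.start, exc.end)

encoded = u_str.encode("utf-8")   # explicit UTF-8 encoding succeeds

print(failure)   # ('ascii', 0, 2)
```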
Practical Tips
Always decode str to unicode using the correct encoding before processing.
When a function expects a str, encode the unicode value explicitly.
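Be aware that print encodes unicode with the console's encoding. In Python 2, when output is redirected to a file or pipe, sys.stdout.encoding is None, so print falls back to ascii and may raise UnicodeEncodeError. Encode explicitly when redirecting, e.g., print(u_str.encode('utf-8')).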
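One way to apply these tips when writing to files is to let the file object do the encoding and decoding. The sketch below uses io.open, which accepts an encoding argument in both Python 2 and Python 3; the helper name write_then_read and the temporary file are assumptions made for the example.

```python
import io
import os
import tempfile

def write_then_read(text, encoding="utf-8"):
    """Write text to a temporary file and read it back (hypothetical helper)."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with io.open(path, "w", encoding=encoding) as fh:
            fh.write(text)              # encoded on the way out
        with io.open(path, "rb") as fh:
            raw = fh.read()             # bytes as stored on disk
        with io.open(path, "r", encoding=encoding) as fh:
            decoded = fh.read()         # decoded on the way in
    finally:
        os.remove(path)
    return raw, decoded

raw, decoded = write_then_read(u"你好")
```

Inside the program the value stays Unicode text; bytes appear only at the file boundary.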
Summary
UTF‑8 is a variable‑length encoding of Unicode that uses one to four bytes per character.
Unicode is a universal character set with several encoding forms (UTF‑8, UTF‑16, UTF‑32).
In Python 2, mixing str and unicode forces an implicit ascii conversion that often causes errors.
Explicitly specify utf‑8 for decoding and encoding to avoid UnicodeEncodeError and UnicodeDecodeError.
MaGe Linux Operations
