
Mastering Character Encoding in Python: From ASCII to UTF‑8

This article explains the fundamental concepts of characters, character sets, and encodings, compares common encodings such as ASCII, Unicode, and UTF‑8, and shows how Python 2 and Python 3 handle default encodings, string types, and common Unicode errors with practical code examples.

MaGe Linux Operations

Jan 15, 2018


Basic Concepts

In computing, a character is a unit of information such as a letter, digit, punctuation mark, or symbol. A character set is a collection of characters, for example ASCII, GB2312, and Unicode. A character encoding maps each character in a set to a sequence of bytes so that computers can store and transmit it. Common encodings include ASCII, UTF‑8, and GBK.
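The distinction between a character, its number in a character set, and its encoded bytes can be sketched in Python 3 as follows. The variable names are illustrative, and the choice of "中" as the sample character is an assumption made for this example.

```python
# Python 3 sketch: one character, one code point, two different byte encodings.
char = "中"
code_point = ord(char)              # the character's number in the Unicode character set
utf8_bytes = char.encode("utf-8")   # the same character encoded as UTF-8
gbk_bytes = char.encode("gbk")      # the same character encoded as GBK

print(hex(code_point), utf8_bytes, gbk_bytes)

# Decoding with the matching encoding recovers the original character.
print(utf8_bytes.decode("utf-8") == gbk_bytes.decode("gbk") == char)
```

The byte sequences differ because each encoding defines its own mapping, which is why bytes must always be decoded with the encoding that produced them.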

Common Character Encodings

The most frequently discussed encodings are ASCII, Unicode, and UTF‑8.

ASCII

Developed in the 1960s in the United States, ASCII defines 128 characters (English letters, digits, punctuation, and control characters) and their binary representations, e.g., A = 01000001 (65), a = 01100001 (97), SPACE = 00100000 (32).
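The values above can be checked with a short Python 3 sketch. The helper name ascii_bits is hypothetical and exists only for this illustration.

```python
# Python 3 sketch: print each character's ASCII number and 8-bit binary form.
def ascii_bits(ch):
    """Return the 8-bit binary string for an ASCII character (illustrative helper)."""
    return format(ord(ch), "08b")

for ch in ("A", "a", " "):
    print(repr(ch), ord(ch), ascii_bits(ch))
```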

Unicode

ASCII is insufficient for non‑English languages. Unicode provides a universal character set covering virtually all languages. It assigns a unique code point to each character, written as U+XXXX, e.g., U+0041 for "A" and U+4E25 for the Chinese character "严". Unicode has been continuously expanded since its first release in 1991.
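As a small illustration of the U+XXXX notation, the following Python 3 sketch converts between characters and code points. The helper code_point_label is a hypothetical name introduced here, not a standard library function.

```python
# Python 3 sketch: convert between characters and their Unicode code points.
def code_point_label(ch):
    """Return the U+XXXX notation for a single character (illustrative helper)."""
    return "U+{:04X}".format(ord(ch))

print(code_point_label("A"))    # U+0041
print(code_point_label("严"))   # U+4E25
print(chr(0x4E25))              # 严
```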

UTF‑8

UTF‑8 is a variable‑length encoding of Unicode that uses one to four bytes per character. ASCII characters remain one byte, while many non‑Latin scripts use two or three bytes, and rare characters may need four. For ASCII‑heavy text this avoids the space that a fixed‑width encoding such as UTF‑32 spends on every character (always four bytes). UTF‑16 is also variable‑length, using two or four bytes per character.
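To make the size differences concrete, this Python 3 sketch counts the bytes each sample character needs in UTF‑8, UTF‑16, and UTF‑32. The sample characters and the helper name encoded_sizes are assumptions chosen for illustration; the little‑endian codec variants are used so that no byte order mark is counted.

```python
# Python 3 sketch: bytes needed per character in three Unicode encodings.
samples = ["A", "\u00e9", "严", "\U0001F600"]  # ASCII, Latin-1 letter, CJK, emoji

def encoded_sizes(ch):
    """Return the encoded length of ch in bytes for each encoding (illustrative helper)."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

for ch in samples:
    print(ch, encoded_sizes(ch))
```

Only UTF‑32 uses the same width for every character; UTF‑8 grows from one to four bytes as the code point increases.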

Python’s Default Encoding

Python 2 defaults to ascii, while Python 3 defaults to utf‑8. You can query the default encoding:

>>> import sys  # Python 2
>>> sys.getdefaultencoding()
'ascii'

>>> import sys  # Python 3
>>> sys.getdefaultencoding()
'utf-8'

String Types in Python 2

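Python 2 distinguishes str (byte strings) and unicode (Unicode strings). A str holds bytes in whatever encoding produced them, and Python assumes ascii whenever it has to convert one implicitly. Unicode literals are written with a leading u, e.g., u'中文'. decode turns a str into unicode, and encode does the reverse: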

>>> '中文'.decode('utf-8')  # str (UTF-8 bytes) -> unicode
u'\u4e2d\u6587'

>>> u'中文'.encode('utf-8')  # unicode -> str
'\xe4\xb8\xad\xe6\x96\x87'
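For comparison, here is a sketch of the same round trip in Python 3, where text is str and raw bytes are bytes. This mapping is standard Python 3 behavior rather than something taken from the examples above, and the variable names are illustrative.

```python
# Python 3 sketch: str is text, bytes is encoded data.
text = "中文"                      # str (Unicode text), like Python 2's unicode
data = text.encode("utf-8")        # bytes, like Python 2's str
print(data)                        # b'\xe4\xb8\xad\xe6\x96\x87'

restored = data.decode("utf-8")    # back to text
print(restored == text)            # True
```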

Root Causes of UnicodeEncodeError & UnicodeDecodeError

These errors occur when str and unicode objects are mixed and Python implicitly tries to encode or decode using the default ascii codec.

Example of a UnicodeDecodeError:

>>> s = '你好'   # str, UTF-8 encoded
>>> u = u'世界'  # unicode
>>> s + u       # implicit s.decode('ascii') + u
Traceback (most recent call last):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

Fix by explicitly decoding with utf‑8:

>>> s = '你好'
>>> u = u'世界'
>>> s.decode('utf-8') + u
u'\u4f60\u597d\u4e16\u754c'
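The implicit step can be reproduced explicitly. The following is a Python 3 sketch, under the assumption that a bytes object stands in for a Python 2 str; it performs by hand the ascii decode that Python 2 attempted automatically.

```python
# Python 3 sketch: make Python 2's hidden ascii decode explicit.
s = "你好".encode("utf-8")   # bytes, standing in for a Python 2 str
u = "世界"                   # text

try:
    s.decode("ascii") + u    # what Python 2 tried to do implicitly
except UnicodeDecodeError as exc:
    error_message = str(exc)
    print(error_message)

combined = s.decode("utf-8") + u   # decoding with the right codec succeeds
print(combined)
```

Python 3 itself refuses to concatenate bytes and str and raises a TypeError, so the ambiguity has to be resolved explicitly.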

Similarly, a UnicodeEncodeError appears when a unicode object is implicitly encoded to ascii:

>>> u_str = u'你好'
>>> str(u_str)  # tries ascii encoding
Traceback (most recent call last):
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

Explicitly encode to utf‑8 before converting:

>>> u_str.encode('utf-8')
'\xe4\xbd\xa0\xe5\xa5\xbd'
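The same failure can be triggered on purpose in Python 3 by encoding to ascii explicitly. This is a sketch for illustration; the name encode_error is an assumption introduced here to keep the exception available after the except block.

```python
# Python 3 sketch: an explicit ascii encode fails the same way the implicit one did.
u_str = "你好"

try:
    u_str.encode("ascii")
except UnicodeEncodeError as exc:
    encode_error = exc       # keep a reference; exc is cleared after the block
    print(exc)

utf8 = u_str.encode("utf-8")  # an encoding that covers the characters succeeds
print(utf8)                   # b'\xe4\xbd\xa0\xe5\xa5\xbd'
```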

Practical Tips

Always decode str to unicode using the correct encoding before processing.

When a function expects a str, encode the unicode value explicitly.

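Be aware that print uses the console’s encoding. When output is redirected to a file or pipe, Python 2 has no console encoding to consult (sys.stdout.encoding is None), falls back to ascii, and raises UnicodeEncodeError for non‑ASCII text. Use print(... .encode('utf-8')) when redirecting.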
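One way to apply these tips is to decode at the input boundary, work with text internally, and encode at the output boundary. The sketch below assumes UTF‑8 files; the helpers read_text and write_text and the file name greeting.txt are hypothetical. It uses io.open, which accepts an encoding argument in both Python 2 and Python 3.

```python
# Sketch: decode on input, keep text in memory, encode on output.
import io
import os
import tempfile

def read_text(path, encoding="utf-8"):
    """Read a file and return decoded text (illustrative helper)."""
    with io.open(path, "r", encoding=encoding) as f:
        return f.read()

def write_text(path, text, encoding="utf-8"):
    """Encode text and write it to a file (illustrative helper)."""
    with io.open(path, "w", encoding=encoding) as f:
        f.write(text)

path = os.path.join(tempfile.mkdtemp(), "greeting.txt")
write_text(path, u"你好世界")       # text is encoded only at the output boundary
round_tripped = read_text(path)     # bytes are decoded only at the input boundary
print(round_tripped == u"你好世界")
```

Naming the encoding at every boundary means the result no longer depends on the interpreter's default or on the console.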

Summary

UTF‑8 is a variable‑length encoding of Unicode.

Unicode provides a universal character set that can be stored using several encodings (UTF‑8, UTF‑16, UTF‑32).

In Python 2, mixing str and unicode forces an implicit ascii conversion that often causes errors.

Explicitly specify utf‑8 for decoding and encoding to avoid UnicodeEncodeError and UnicodeDecodeError.

