Understanding Encoding Issues in Python: Bytes, Unicode, and Best Practices
This article explains why encoding problems frequently arise with Chinese characters in development, clarifies core concepts such as bytes, characters, ASCII, Unicode and UTF encodings, compares Python 2 and Python 3 handling of strings and bytes, and provides practical best‑practice recommendations to avoid encoding bugs.
Encoding problems tend to surface when Chinese characters are involved. This article explores why that happens and how to specify encodings explicitly to avoid such issues, with a focus on how Python handles encoding.
1. Basic concepts – A byte is a unit of 8 bits; a character is a unit of information representing a letter or symbol. ASCII is a 7‑bit encoding that covers English letters, digits, and punctuation. Unicode is a universal standard that assigns a code point (written U+hhhh) to each character. A code point is an abstract number, and how many bytes it occupies depends on the encoding used. UTF‑8 is a popular Unicode transformation format that encodes each code point as a variable‑length sequence of 1 to 4 bytes, while UTF‑16 uses 2 or 4 bytes.
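The distinction between a code point and its byte representation can be seen with a short sketch. The sample characters are chosen here for illustration and do not come from the original article.

```python
# A code point is an abstract number; the encoding decides how many bytes it takes.
samples = ["A", "中", "😀"]  # illustrative characters, not from the article

rows = []
for ch in samples:
    code_point = f"U+{ord(ch):04X}"          # e.g. U+4E2D
    utf8_len = len(ch.encode("utf-8"))       # 1 to 4 bytes per code point
    utf16_len = len(ch.encode("utf-16-be"))  # 2 or 4 bytes per code point
    rows.append((code_point, utf8_len, utf16_len))

# Print only ASCII-safe values so the output does not depend on the terminal encoding.
for code_point, utf8_len, utf16_len in rows:
    print(code_point, "UTF-8 bytes:", utf8_len, "UTF-16 bytes:", utf16_len)
```

The same character costs a different number of bytes in each encoding, which is why "Unicode" alone does not determine a byte length.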
2. Why these concepts exist – Computers store data as binary, but humans work with strings. Encoding standards like ASCII and Unicode bridge this gap, allowing consistent translation between bytes and readable text. ASCII cannot represent Chinese, prompting the development of Unicode.
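A minimal sketch of the limitation described above, using Python 3 and an illustrative two-character string: ASCII has no byte sequence for Chinese characters, while UTF‑8 can round-trip them.

```python
text = "中文"  # illustrative sample, not from the article

# ASCII only defines 128 characters, so encoding Chinese text with it fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# A Unicode encoding such as UTF-8 maps the same text to bytes and back.
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

print(ascii_ok)    # False
print(utf8_bytes)  # b'\xe4\xb8\xad\xe6\x96\x87'
```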
3. Encoding is everywhere – Issues can arise in terminals, editors, source files, data files, and variables. For example, Python 2 source files often start with # coding=utf-8 to declare UTF‑8 encoding.
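For data files, the same principle applies: the reader has to use the encoding the writer used. The sketch below assumes a hypothetical file named sample.txt written as GBK (a common legacy encoding for Chinese text) and shows what happens when it is read back with the wrong and the right codec.

```python
import os
import tempfile

text = "编码"  # illustrative content

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # hypothetical file name

    # Write the file with an explicit, non-UTF-8 encoding.
    with open(path, "w", encoding="gbk") as f:
        f.write(text)

    # Reading with a mismatched encoding fails for these bytes.
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        wrong_codec_failed = False
    except UnicodeDecodeError:
        wrong_codec_failed = True

    # Reading with the encoding that was used for writing recovers the text.
    with open(path, encoding="gbk") as f:
        recovered = f.read()
```

Passing encoding= explicitly to open() avoids relying on the platform default, which differs between systems.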
4. Python 2 encoding pitfalls – In Python 2, bytes is simply an alias for str, so both hold raw byte sequences, which leads to confusion. A str and a bytes value with the same content compare equal and have the same length, and their __repr__() shows the raw byte sequence. Unicode strings (u'…') behave differently: comparing one to a non‑ASCII str or bytes value can trigger a UnicodeWarning, because Python 2 implicitly tries to decode the byte string as ASCII. The default source encoding is also ASCII, so a source file containing non‑ASCII characters without an encoding declaration fails with a SyntaxError.
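Python 2 code cannot be run on a current interpreter, so the sketch below uses Python 3 to reproduce explicitly what Python 2 did implicitly. It assumes a UTF‑8 source file, in which a Python 2 str literal would hold the UTF‑8 bytes of the text.

```python
# In Python 2, the literal '中文' in a UTF-8 source file held these six bytes.
raw = "中文".encode("utf-8")

byte_length = len(raw)                  # what Python 2 len('中文') reported
char_length = len(raw.decode("utf-8"))  # what Python 2 len(u'中文') reported

# When str met unicode, Python 2 implicitly decoded the str with the ASCII
# codec. Doing that step by hand shows why it failed for non-ASCII data.
try:
    raw.decode("ascii")
    implicit_decode_ok = True
except UnicodeDecodeError:
    implicit_decode_ok = False

print(byte_length, char_length, implicit_decode_ok)  # 6 2 False
```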
5. Best practices – Adopt the “Unicode sandwich” principle: decode input bytes to strings as early as possible, work with strings throughout, and encode to bytes only when producing output. Python 3 makes this easier by default: source files are assumed to be UTF‑8, str is a sequence of Unicode code points, and bytes is a separate type that never mixes implicitly with str. As a result, len() counts code points for a str but bytes for a bytes value.
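The sandwich can be sketched as a single function in Python 3. The function name shout and the sample input are hypothetical, chosen only to make the three layers visible.

```python
def shout(raw_input: bytes, encoding: str = "utf-8") -> bytes:
    """Hypothetical helper showing the three layers of the sandwich."""
    text = raw_input.decode(encoding)   # bottom slice: bytes -> str at the input boundary
    processed = text.strip().upper()    # filling: only str operations in the middle
    return processed.encode(encoding)   # top slice: str -> bytes at the output boundary


incoming = " héllo 世界\n".encode("utf-8")  # stands in for bytes read from a file or socket
outgoing = shout(incoming)

# str and bytes are distinct types, and len() measures different things for each.
char = "中"
char_count = len(char)                  # 1 code point
byte_count = len(char.encode("utf-8"))  # 3 bytes
```

Keeping all decoding and encoding at the edges means the processing code in the middle never needs to know which encoding the data arrived in.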
6. Byte Order Mark (BOM) – A BOM is the code point U+FEFF placed at the start of a stream to mark its byte order. UTF‑8 has no endianness and therefore does not need one, but a UTF‑8 BOM (the three bytes EF BB BF) can still signal that a file is UTF‑8 encoded.
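A short sketch of how the UTF‑8 BOM looks from Python 3, using the standard codecs module and the built-in utf-8-sig codec. The payload string is illustrative.

```python
import codecs

payload = "text".encode("utf-8")       # illustrative content
with_bom = codecs.BOM_UTF8 + payload   # the three BOM bytes followed by the data

# The plain utf-8 codec keeps the BOM as a leading U+FEFF character.
plain = with_bom.decode("utf-8")

# The utf-8-sig codec strips a leading BOM when decoding and writes one when encoding.
stripped = with_bom.decode("utf-8-sig")
re_encoded = "text".encode("utf-8-sig")

print(codecs.BOM_UTF8)  # b'\xef\xbb\xbf'
```

Reading files of unknown origin with utf-8-sig is a common way to tolerate an optional BOM, since it also accepts input that has none.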
7. Recommendations – For Python 2 users: treat str as a byte sequence, always identify whether a value is bytes or Unicode, prefer working with Unicode, and use the unicode-escape codec for strings that contain escaped Unicode literals such as \u4e2d\u6587. That codec converts between the escaped form and the characters it denotes, in both the decoding and the encoding direction.
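The unicode-escape codec also exists in Python 3 (spelled unicode_escape or unicode-escape), so its behavior can be sketched there. The escaped sample below is illustrative; in Python 2 the same calls would be made on a str and a unicode object respectively.

```python
# Twelve ASCII bytes: a literal backslash, 'u', and four hex digits, twice.
escaped = b"\\u4e2d\\u6587"

# Decoding interprets the escape sequences and yields the real characters.
decoded = escaped.decode("unicode_escape")

# Encoding goes the other way and produces the escaped ASCII form again.
re_escaped = decoded.encode("unicode_escape")

print(len(escaped), len(decoded))  # 12 2
print(re_escaped)                  # b'\\u4e2d\\u6587'
```

One caveat: unicode_escape treats any non-escape bytes as Latin‑1, so it is only safe on input that is pure ASCII apart from the escapes.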
Beike Product & Technology
As Beike's official product and technology account, we are committed to building a platform for sharing Beike's product and technology insights, targeting internet/O2O developers and product professionals. We share high-quality original articles, tech salon events, and recruitment information weekly. Welcome to follow us.