Fundamentals 12 min read

Understanding Encoding Issues in Python: Bytes, Unicode, and Best Practices

This article explains why encoding problems frequently arise with Chinese characters in development, clarifies core concepts such as bytes, characters, ASCII, Unicode and UTF encodings, compares Python 2 and Python 3 handling of strings and bytes, and provides practical best‑practice recommendations to avoid encoding bugs.

Beike Product & Technology

Dec 6, 2018

Understanding Encoding Issues in Python: Bytes, Unicode, and Best Practices

Encoding problems often appear when Chinese characters are involved; this article explores why and how to specify encodings to avoid issues, focusing on Python's handling of encoding.

1. Basic concepts – A byte is an 8‑bit binary sequence; characters are information units representing letters or symbols. ASCII is a 7‑bit encoding for English characters, while Unicode is a universal standard that assigns code points (U+hhhh) to characters, typically using 2–4 bytes. UTF‑8 is a popular Unicode transformation format that encodes characters into variable‑length byte sequences.

2. Why these concepts exist – Computers store data as binary, but humans work with strings. Encoding standards like ASCII and Unicode bridge this gap, allowing consistent translation between bytes and readable text. ASCII cannot represent Chinese, prompting the development of Unicode.

3. Encoding is everywhere – Issues can arise in terminals, editors, source files, data files, and variables. For example, Python 2 source files often start with # coding=utf-8 to declare UTF‑8 encoding.

4. Python 2 encoding pitfalls – In Python 2, str and bytes are essentially the same, leading to confusion. Example code demonstrates that str and bytes compare equal and have the same length, but their __repr__() shows raw byte sequences. Unicode strings ( u'…') behave differently and can trigger UnicodeWarning when compared to str or bytes. The default source encoding is ASCII, causing errors for non‑ASCII characters.

5. Best practices – Adopt the “Unicode sandwich” principle: decode input bytes to strings early, work with strings throughout, and encode to bytes only when outputting. Python 3 follows this by default, using UTF‑8 for source files and representing str as Unicode. Example code shows type differences and length behavior in Python 3.

6. Byte Order Mark (BOM) – BOM marks the byte order at the start of a stream. While unnecessary for UTF‑8 (which has no endianness), a UTF‑8 BOM (three bytes) can signal that the file is UTF‑8 encoded.

7. Recommendations – For Python 2 users: treat str as byte sequences, identify whether data is bytes or Unicode, prefer Unicode handling, and use the unicode-escape codec for strings containing escaped Unicode literals. Sample code illustrates decoding and encoding with unicode-escape.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Python encoding Unicode UTF-8 bytes python2 python3

Written by

Beike Product & Technology

As Beike's official product and technology account, we are committed to building a platform for sharing Beike's product and technology insights, targeting internet/O2O developers and product professionals. We share high-quality original articles, tech salon events, and recruitment information weekly. Welcome to follow us.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.