Fundamentals 6 min read

Understanding Python String Encoding: ASCII, Unicode, and UTF-8 Explained

This article explains why string encoding errors occur in Python, outlines the evolution from ASCII to GB2312 and Unicode, describes the variable‑length UTF‑8 format, and shows how to correctly convert between Unicode and UTF‑8 when reading or writing files.

Python Crawling & Data Mining

Dec 5, 2018

Understanding Python String Encoding: ASCII, Unicode, and UTF-8 Explained

String encoding problems frequently appear in Python when writing files or transmitting data over networks, often triggering error messages that require a clear understanding of what the encodings actually do.

Computers process only numbers, so text must be converted to numeric codes. The original 8‑bit ASCII code can represent up to 255 characters, which suffices for English but not for Chinese, Japanese, Korean, etc. To handle Chinese, China created GB2312, using two or three bytes per character and including ASCII as a subset. Other countries devised similar national encodings.

Unicode was introduced to unify all language characters into a single encoding scheme. For example, the letter “A” has ASCII decimal 65 (binary 01000001), while the Chinese character “中” requires Unicode code point 20013 (binary 01001110 00101101). Unicode thus resolves both the original ASCII limitation and the fragmentation of national encodings.

Although Unicode provides a universal solution, it doubles storage size for pure English text compared to ASCII and doubles transmission bandwidth. When handling large files (e.g., terabytes), this overhead becomes significant. To mitigate this, the variable‑length UTF‑8 encoding was created: English characters occupy 1 byte, most Chinese characters 3 bytes, and rare symbols 4–6 bytes, offering substantial savings for predominantly English data.

Because UTF‑8’s length varies per character, it can complicate memory handling and code logic. Unicode, with its fixed length, simplifies in‑memory processing. Therefore, best practice is to convert Unicode to UTF‑8 when saving files or transmitting data, and convert UTF‑8 back to Unicode when loading data into memory.

In Python, you must explicitly specify the file encoding when opening files for reading or writing; the language’s standard libraries then handle the conversion between Unicode and UTF‑8 automatically.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Python Fundamentals String Encoding

Written by

Python Crawling & Data Mining

Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.