Fixing Chinese Character Garbling in Python Web Scraping

This article walks through a real‑world Python web‑scraping issue where Chinese characters appear as garbled text, explains why the default encoding fails, and shows how setting the response’s apparent encoding resolves the problem, complete with sample code and practical tips for posting questions.

Python Crawling & Data Mining

Nov 1, 2024

Problem Overview

In a Python community group, a user ran into garbled Chinese characters while scraping the novel "三国演义" (Romance of the Three Kingdoms) with requests and BeautifulSoup. The original code fetched the chapter list, requested the first chapter page, and printed its h1 tag, but the output was unreadable.

Original Code

import requests
from bs4 import BeautifulSoup

def main():
    href_lists = []
    # Collect the URLs of all chapters from the index page
    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36 Edg/130.0.0.0"
    }
    url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
    page_text = requests.get(url, headers=head).text
    bs = BeautifulSoup(page_text, 'lxml')
    url_lists = bs.find_all('a', class_="tabli")
    for href in url_lists:
        href_lists.append(href['href'])
    for i in range(len(href_lists)):
        href = href_lists[i]
        detail_url = 'https://www.shicimingju.com' + href
        response = requests.get(detail_url, headers=head).text
        bs2 = BeautifulSoup(response, 'lxml')
        print(bs2.h1)
        break

if __name__ == '__main__':
    main()

Solution

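The garbling comes from decoding the response bytes with the wrong encoding. requests derives response.encoding from the HTTP headers, and when the Content-Type header of a text response names no charset it falls back to ISO-8859-1, which cannot represent Chinese text. Setting response.encoding to response.apparent_encoding (the encoding detected from the response body) before reading response.text resolves the issue.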
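The effect of a wrong response encoding can be reproduced offline. The sketch below assumes, without confirmation from the source, that the site sends UTF-8 bytes and that they were decoded as ISO-8859-1, the fallback requests uses for text responses whose Content-Type header names no charset. The variable names are illustrative only.

```python
title = "三国演义"

# Bytes as a UTF-8 server would send them (assumed encoding of the page).
raw = title.encode("utf-8")

# Decoding those bytes with the wrong codec yields garbled text:
# every byte is mapped to a separate Latin-1 character.
garbled = raw.decode("iso-8859-1")

# Decoding with the matching codec recovers the original string.
correct = raw.decode("utf-8")

print(garbled == title)   # False
print(correct == title)   # True
print(len(title), len(garbled))  # 4 characters become 12
```

No data is lost at this stage: the bytes are intact and only the codec used to read them is wrong. In requests, the equivalent correction is to override the encoding before reading the text, as in the following snippet.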

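import requests

headers = {"User-Agent": "Mozilla/5.0"}  # any browser-like User-Agent, e.g. the one in the original code
response = requests.get('https://www.shicimingju.com/book/sanguoyanyi.html', headers=headers)
response.encoding = response.apparent_encoding  # use the encoding detected from the body
print(response.text)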

This adjustment correctly displays the Chinese characters.
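The original script reads .text in two places, so the fix may be easier to apply through a small helper. The following is a sketch, not the method from the community thread: the name text_with_detected_encoding is hypothetical, and it only overrides the encoding when the current one is missing or is the ISO-8859-1 fallback, which is a design choice added here. It expects an object that behaves like requests.Response.

```python
# Encoding that requests assumes for text/* responses whose Content-Type
# header carries no charset parameter.
FALLBACK_ENCODING = "iso-8859-1"


def text_with_detected_encoding(response):
    """Return response.text, re-detecting the encoding if it looks like a fallback.

    `response` is expected to behave like requests.Response: it needs the
    attributes `encoding`, `apparent_encoding`, and `text`.
    (Hypothetical helper, not part of requests.)
    """
    declared = (response.encoding or "").lower()
    if declared in ("", FALLBACK_ENCODING):
        detected = response.apparent_encoding
        # Detection can fail (for example on an empty body); keep the
        # current encoding in that case.
        if detected:
            response.encoding = detected
    return response.text
```

In the original script, requests.get(detail_url, headers=head).text would then become text_with_detected_encoding(requests.get(detail_url, headers=head)). One trade-off: a server that explicitly declares ISO-8859-1 is also re-detected, so the unconditional assignment shown above may be preferable when the site's encoding is already known.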

Additional Tips

When posting code‑related questions, include a minimal reproducible example as copyable text, attach a complete screenshot of the error, and keep any sample data small enough to share. For longer scripts, share the .py file instead of pasting the code.

Conclusion

The article demonstrates how to diagnose and fix character encoding problems in Python web crawlers, providing a practical code fix and encouraging community collaboration.
