
Analyzing and Fixing Encoding Issues in Python Requests, Scrapy, and Golang Charset Libraries

The article examines how Python Requests, Scrapy, and Go's charset package detect page encodings, reveals why they often mis-decode Chinese GB-series pages, and proposes a unified strategy (prefer the header charset, then the HTML meta declaration, and finally a reliable heuristic) to eliminate garbled text in web scraping.

Sohu Tech Products

This article demonstrates how to investigate the source code of the Python Requests and Scrapy libraries as well as the Golang charset package to understand why web pages sometimes appear as garbled text and how to improve encoding‑guess accuracy.

Garbled-text phenomenon: when crawling a Chinese news page with requests.get(), the printed resp.text shows unreadable characters because the library fails to detect the correct charset (GB2312/GBK/GB18030).

Example code that reproduces the issue:

import requests

# The page is encoded as GB2312, but the response header does not spell
# out a charset, so Requests decodes the body with the wrong codec.
resp = requests.get("http://news.inewsweek.cn/society/2022-05-30/15753.shtml")
print(resp.text)  # mojibake instead of readable Chinese text

The expectation is that Requests should automatically infer the encoding from the response headers or the HTML meta tag, but in practice it often falls back to an incorrect guess.
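The mismatch itself is easy to reproduce offline, without any network access. The following minimal illustration (not taken from the article) shows why the bug goes unnoticed: Latin-1 maps every byte to some character, so decoding GB-series bytes with it raises no error, it just yields mojibake.

```python
# GB-series bytes, as a Chinese news server would send them
raw = "今日新闻".encode("gbk")

# Decoding with the wrong codec yields mojibake; ISO-8859-1 assigns every
# byte a character, so no exception signals the mistake.
wrong = raw.decode("iso-8859-1")

# Decoding with the right codec recovers the original text.
right = raw.decode("gbk")

print(wrong)
print(right)  # 今日新闻
```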

Inspecting the Python Requests source: by following the call chain

1. requests.get
2. request
3. session.request
4. self.send
5. adapter.send
6. self.build_response
7. get_encoding_from_headers

the library first uses self.encoding, which build_response copies from get_encoding_from_headers. When the Content-Type header names a charset explicitly, that value is used; when the header only says text/* with no charset, the helper returns ISO-8859-1 (the historical HTTP default), and the content-based heuristic apparent_encoding is consulted only if no encoding could be derived from the headers at all. For Chinese GB-series pages served without an explicit charset, this ISO-8859-1 default, together with the fact that Requests never reads the HTML meta tag, is the main cause of the garbled output.
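This header-only behavior can be reenacted offline with Requests' own helper, get_encoding_from_headers (assuming the requests package is installed):

```python
from requests.utils import get_encoding_from_headers

# An explicit charset in the Content-Type header is honored as-is.
print(get_encoding_from_headers({"content-type": "text/html; charset=gbk"}))
# -> gbk

# A text/* Content-Type without a charset falls back to ISO-8859-1,
# which is wrong for GB-series pages.
print(get_encoding_from_headers({"content-type": "text/html"}))
# -> ISO-8859-1
```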

Inspecting the Golang charset source: charset.DetermineEncoding first checks for a byte-order mark, then looks for a charset parameter in the Content-Type header, and then prescans the first 1024 bytes of the HTML for a meta-declared charset; each label is resolved to an encoding via Lookup → htmlindex.Get. If nothing matches, it falls back to windows-1252, the encoding the HTML standard substitutes for ISO-8859-1. This logic mirrors the Python approach but suffers from the same problem when the charset declared in the HTML differs from the one indicated in the HTTP header.

Scrapy handling: Scrapy's w3lib.encoding module first checks the Content-Type header for a charset; if absent, it parses the HTML body for a declared encoding. This two-step strategy is similar to the one described for Requests and Golang, but in the author's experience Scrapy rarely produces garbled output.

Testing and results: the author collected dozens of Chinese web pages and ran all three libraries against them using the same detection logic as Scrapy. After applying the combined logic (prefer the header charset, fall back to the HTML meta declaration, and finally to a reliable heuristic), every test passed. The original Requests implementation showed a high error rate, while Golang's charset had only two failures before the fix.

Conclusion : By understanding the source code of these libraries and aligning their encoding‑detection strategies, developers can significantly reduce garbled‑text issues in web‑scraping projects. The article encourages readers to replicate the analysis and adapt the improved logic to their own codebases.

Tags: Python, Golang, encoding, web scraping, Scrapy, Requests
Written by

Sohu Tech Products

A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.
