Why Your Web Scraper Returns Duplicate Text: Uncovering Python’s Garbage Collection Pitfall
This article explains a puzzling bug in the General News Extractor where cached HTML elements produce inconsistent results, analyzes how Python's garbage collection and memory reuse cause duplicate text extraction, and shows how switching the cache key to XPath resolves the issue.
Problem Background
GNE (General News Extractor) version 0.2.1 significantly speeds up article body extraction, but a strange bug appeared during its development. The extractor reads an HTML file (tests/163/9.html) and attempts to collect the text inside all <a> tags that are children of <div> and <h2> elements.
<body>
    <div>
        <a href="/xx">你好</a>
    </div>
    <h2>
        <a>世界</a>
    </h2>
</body>

The code correctly extracts "你好" from the <div> branch, but when processing the <h2> branch it mistakenly returns the same <a> element, causing duplicate extraction.
Duplicate Extraction Issue
In lines 15‑20 of the original script, the extraction loop processes the same <a> element twice, which is redundant work.
To improve efficiency, a cache was introduced to store the analysis result of each <a> tag. If a tag has already been analyzed, the cached result is reused.
After modifying the code, the cache key was generated using str(element), whose output includes the memory address of the element object.
Unexpectedly, the cached result sometimes differed from the freshly extracted result, and the bug disappeared when the cache stored a list containing both the text list and the element itself.
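The caching pattern described above can be sketched roughly as follows. This is a minimal illustration, not GNE's actual code: FakeElement, text_cache, and get_text_cached are hypothetical names, and a plain Python class stands in for the parsed element. It shows that the default str() of an object embeds its address, so the key identifies a memory location rather than a node.

```python
class FakeElement:
    """Hypothetical stand-in for a parsed <a> element (not an lxml class)."""

    def __init__(self, text):
        self.text = text


text_cache = {}  # maps str(element) -> list of extracted text


def get_text_cached(element):
    """Return the text list for element, reusing a cached result if present."""
    key = str(element)  # e.g. '<__main__.FakeElement object at 0x...>'
    if key not in text_cache:
        text_cache[key] = [element.text]
    return text_cache[key]


first = FakeElement("你好")
print(str(first))              # the address is part of the key
print(get_text_cached(first))  # computed once, then served from the cache
```

The key stays valid only while the object it was built from is still alive.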
Schrödinger’s Element
The phenomenon traces back to how Python reclaims memory. In a simple for loop, each iteration creates a new element object and rebinds the loop variable to it; once nothing else refers to the previous object, its reference count drops to zero and its memory is reclaimed. When that address is later reused for a different element, str(element) yields the same string for two distinct nodes, and the cache keys collide.
for element in element_list:
    a = element.xpath('//xxx')
    b = element.xpath('.//text()')
    c = 1 + 1

Keeping a reference to each element (e.g., appending it to a cache list) prevents its memory from being reclaimed, ensuring each node occupies a unique address.
cache = []
for element in element_list:
    a = element.xpath('//xxx')
    b = element.xpath('.//text()')
    c = 1 + 1
    cache.append(element)

Thus, using str(element) as a cache key is unreliable because the same address can correspond to different HTML nodes after garbage collection.
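The effect of holding references can be reproduced without any HTML. The sketch below uses a hypothetical Node class in place of real elements and relies on CPython behaviour: address reuse in the first loop is likely but not guaranteed, so only the second result is certain.

```python
class Node:
    """Hypothetical stand-in for an element object."""


def keys_without_references(count):
    """Build str() keys while letting each object die after its iteration."""
    keys = []
    for _ in range(count):
        node = Node()           # rebinding drops the last reference to the previous Node
        keys.append(str(node))  # the key embeds the object's address
    return keys


def keys_with_references(count):
    """Build str() keys while keeping every object alive in a list."""
    alive = []
    keys = []
    for _ in range(count):
        node = Node()
        alive.append(node)      # the extra reference blocks reclamation
        keys.append(str(node))
    return keys, alive


reused = keys_without_references(1000)
unique, alive = keys_with_references(1000)
print(len(set(reused)))  # usually far fewer than 1000 on CPython
print(len(set(unique)))  # always 1000 while `alive` holds the objects
```

Keeping the objects alive does avoid the collision, but it ties correctness to object lifetime, which is easy to break by accident.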
Solution
The fix is to replace the memory‑address key with a stable identifier such as the element’s XPath. After changing the cache key to XPath, the extractor produces correct results.
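With lxml, an element's XPath can be obtained with element.getroottree().getpath(element). To stay dependency-free, the sketch below instead builds a positional path with the standard library's xml.etree.ElementTree. The names iter_with_path and text_cache are hypothetical, and this illustrates a position-based key rather than GNE's actual implementation.

```python
import xml.etree.ElementTree as ET

SAMPLE = '<body><div><a href="/xx">你好</a></div><h2><a>世界</a></h2></body>'


def iter_with_path(element, path):
    """Yield (path, element) pairs, with paths like /body/div[1]/a[1]."""
    yield path, element
    seen = {}  # how many children with each tag have been visited so far
    for child in element:
        seen[child.tag] = seen.get(child.tag, 0) + 1
        child_path = f"{path}/{child.tag}[{seen[child.tag]}]"
        yield from iter_with_path(child, child_path)


root = ET.fromstring(SAMPLE)
text_cache = {}  # maps positional path -> list of extracted text
for path, element in iter_with_path(root, "/" + root.tag):
    if element.tag == "a":
        text_cache[path] = ["".join(element.itertext())]

for path, texts in text_cache.items():
    print(path, texts)
```

Because the path depends only on where the node sits in the document, the two <a> tags get different keys regardless of which objects are alive.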
Reference
[1] GNE: General News Extractor, a general-purpose extractor for news page body text – https://github.com/kingname/GeneralNewsExtractor
