Why Your Web Scraper Returns Duplicate Text: Uncovering Python’s Garbage Collection Pitfall

This article explains a puzzling bug in the General News Extractor where cached HTML elements produce inconsistent results, analyzes how Python's garbage collection and memory reuse cause duplicate text extraction, and shows how switching the cache key to XPath resolves the issue.

Python Crawling & Data Mining

Jul 12, 2020

Problem Background

GNE (General News Extractor) version 0.2.1 significantly sped up article body extraction, but a strange bug appeared during its development. The extractor reads an HTML file (tests/163/9.html) and tries to collect the text inside every <a> tag that is a child of a <div> or <h2> element.

<body>
    <div>
        <a href="/xx">你好</a>
    </div>
    <h2>
        <a>世界</a>
    </h2>
</body>

The code correctly extracts "你好" from the <div> branch, but when processing the <h2> branch it mistakenly returns the same <a> element, causing duplicate extraction.
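For reference, the intended result for the markup above can be sketched with the standard library. This is only an illustration of the expected output, not GNE's code: it uses xml.etree.ElementTree rather than the extractor's own parser, and the names SAMPLE and expected_texts are made up for the example.

```python
import xml.etree.ElementTree as ET

# The sample markup from the article, held in a string for the example.
SAMPLE = """<body>
    <div>
        <a href="/xx">你好</a>
    </div>
    <h2>
        <a>世界</a>
    </h2>
</body>"""

root = ET.fromstring(SAMPLE)

expected_texts = []
for parent_tag in ("div", "h2"):
    # Select every <a> that is a direct child of the given parent tag.
    for link in root.findall(f".//{parent_tag}/a"):
        expected_texts.append("".join(link.itertext()).strip())

print(expected_texts)  # ['你好', '世界']
```

Each <a> contributes its own text exactly once; the bug described here breaks that property.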

Duplicate Extraction Issue

The duplicate comes from the extraction loop (lines 15‑20 of the original script), where the <a> element already handled in the <div> branch is returned a second time for the <h2> branch.

To improve efficiency, a cache had been introduced to store the analysis result of each <a> tag. If a tag has already been analyzed, the cached result is reused instead of being recomputed.

The cache key was generated with str(element). For an lxml element this string has the form <Element a at 0x...>, so the only thing that distinguishes two <a> elements is the memory address of the Python object.

Unexpectedly, the cached result sometimes differed from the freshly extracted result. Even more puzzling, the bug disappeared when the cache stored a list containing both the text list and the element itself.
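The caching pattern can be imitated without any HTML parser. The sketch below is not GNE's code: FakeElement, iter_links, and extract_with_cache are hypothetical names, and a plain Python object whose default repr embeds its address stands in for the parsed element.

```python
class FakeElement:
    """Hypothetical stand-in for a parsed <a> node; its default repr embeds the address."""

    def __init__(self, text):
        self.text = text


def iter_links(texts):
    # Objects are created on demand, so nothing else keeps them alive.
    for text in texts:
        yield FakeElement(text)


def extract_with_cache(texts, store_element):
    cache = {}
    results = []
    for element in iter_links(texts):
        key = str(element)  # e.g. "<__main__.FakeElement object at 0x...>"
        if key not in cache:
            # Storing the element next to its text keeps the object alive,
            # so its address cannot be handed to a later object.
            kept = element if store_element else None
            cache[key] = (element.text, kept)
        results.append(cache[key][0])
    return results


texts = ["你好", "世界", "再见", "谢谢"]
fragile = extract_with_cache(texts, store_element=False)
stable = extract_with_cache(texts, store_element=True)
print(fragile)  # may contain repeated entries, depending on address reuse
print(stable)
```

With store_element=True every key is distinct, so the output matches the input. With store_element=False the result depends on whether the interpreter reuses an address, which is why the bug seemed to come and go.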

Schrödinger’s Element

The phenomenon traces back to Python's garbage collection, specifically reference counting in CPython. In a simple for loop, each iteration binds the loop variable to a new element object; once nothing else references the previous object, its reference count drops to zero and its memory is reclaimed immediately. The interpreter is free to hand that same address to the next object it creates, so str(element) can yield an identical string for two distinct nodes, and the cache treats them as the same key.

for element in element_list:
    a = element.xpath('//xxx')
    b = element.xpath('.//text()')
    c = 1 + 1

Keeping a reference to each element (e.g., appending it to a cache list) prevents its memory from being reclaimed, ensuring each node occupies a unique address.

cache = []
for element in element_list:
    a = element.xpath('//xxx')
    b = element.xpath('.//text()')
    c = 1 + 1
    cache.append(element)
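The effect of holding references can be measured directly. This is a minimal sketch with plain Python objects rather than real HTML elements; Node, make_nodes, and distinct_keys are hypothetical names, and the amount of address reuse depends on the interpreter.

```python
class Node:
    """Hypothetical stand-in for an element; str() falls back to the default repr."""

    def __init__(self, text):
        self.text = text


def make_nodes(count):
    # Generator: each Node lives only as long as the consumer holds it.
    for index in range(count):
        yield Node(f"text-{index}")


def distinct_keys(count, keep_alive):
    keys = set()
    alive = []
    for node in make_nodes(count):
        keys.add(str(node))     # default repr contains the object's address
        if keep_alive:
            alive.append(node)  # a held reference pins the address
    return len(keys)


without_refs = distinct_keys(1000, keep_alive=False)
with_refs = distinct_keys(1000, keep_alive=True)
print(without_refs, with_refs)
```

When every object is kept alive, all 1000 keys are distinct, because two live objects never share an address. Without the extra references, the first number is typically far smaller, since freed addresses are handed out again.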

Thus, using str(element) as a cache key is unreliable because the same address can correspond to different HTML nodes after garbage collection.

Solution

The fix is to replace the memory‑address key with a stable identifier such as the element’s XPath. After changing the cache key to XPath, the extractor produces correct results.
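A key derived from the node's position in the tree does not depend on object lifetime. The sketch below builds such a key by hand with the standard library's xml.etree.ElementTree; it is an assumption-laden illustration, not GNE's implementation, and iter_with_path and extract_link_texts are hypothetical helpers. With lxml, a comparable key could presumably be obtained from element.getroottree().getpath(element); the source does not show which call GNE uses.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<body>
    <div>
        <a href="/xx">你好</a>
    </div>
    <h2>
        <a>世界</a>
    </h2>
</body>"""


def iter_with_path(node, path):
    """Yield (path, element) pairs; the path plays the role of the XPath key."""
    yield path, node
    counts = {}
    for child in node:
        # Number siblings per tag, XPath style: div[1], div[2], ...
        counts[child.tag] = counts.get(child.tag, 0) + 1
        child_path = f"{path}/{child.tag}[{counts[child.tag]}]"
        yield from iter_with_path(child, child_path)


def extract_link_texts(root):
    cache = {}  # structural path -> list of text fragments
    for path, node in iter_with_path(root, "/" + root.tag):
        if node.tag == "a" and path not in cache:
            cache[path] = [t.strip() for t in node.itertext() if t.strip()]
    return cache


root = ET.fromstring(SAMPLE)
link_texts = extract_link_texts(root)
for key, texts in link_texts.items():
    print(key, texts)
```

The two <a> elements get the keys /body/div[1]/a[1] and /body/h2[1]/a[1], so they can never collide no matter when their Python objects are created or freed.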

Reference

[1] GNE: 新闻网页正文通用抽取器 – https://github.com/kingname/GeneralNewsExtractor
