Why a Python HTML Extractor’s Cache Failed: Garbage Collection Got You

The article explains how using an element's string representation as a cache key caused duplicate extraction in a Python news‑page parser, reveals the role of Python's garbage collection and memory reuse, and shows how switching to XPath keys resolves the bug.

Python Crawling & Data Mining

Aug 13, 2020

Problem Background

GNE (General News Extractor) version 0.2.1 improved extraction speed but introduced a puzzling bug caused by Python's garbage collection and memory reuse. The extractor reads an HTML file (e.g., tests/163/9.html) and collects the text of every <a> tag under the <body>, <div>, and <h2> elements. Consider the following structure:

<body>
    <div>
        <a href="/xx">你好</a>
    </div>
    <h2>
        <a>世界</a>
    </h2>
</body>

The extractor correctly retrieves "你好" from the <div> branch, but when it processes the <h2> branch it treats the second <a> element as if it were the first and returns the first element's result again, duplicating the work done on lines 15‑20 of the original code.

Root Cause: Cache Key Misuse

The implementation cached the result for each <a> element in a dictionary keyed by str(element). That string contains the object's memory address (e.g., <Element a at 0x1087ba638>). In CPython an object is freed as soon as its reference count drops to zero, and the freed memory can be handed to the next object that is created. Two different nodes that are never alive at the same time can therefore produce the same key, and the cached analysis for one node is incorrectly applied to the other.
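The collision can be reproduced without GNE. The following is a minimal sketch, not the extractor's code: it uses the standard library's xml.etree.ElementTree as a stand-in for the real parser (its elements print in a similar `<Element 'a' at 0x...>` form), and the function name `count_false_cache_hits` is hypothetical. Every element in the loop is new, so any cache hit is a false one. How many occur depends on the interpreter and allocator, so no particular count is guaranteed.

```python
import xml.etree.ElementTree as ET


def count_false_cache_hits(texts):
    """Cache text under str(element) for elements that are discarded after each step."""
    cache = {}
    false_hits = 0
    for text in texts:
        element = ET.Element("a")  # a brand-new node on every iteration
        element.text = text
        key = str(element)         # looks like "<Element 'a' at 0x10f3c2a40>"
        if key in cache:
            false_hits += 1        # new node, yet its key is already in the cache
        else:
            cache[key] = text
        del element                # last reference dropped, the address may be reused
    return false_hits, cache


texts = [f"link {i}" for i in range(100)]
false_hits, cache = count_false_cache_hits(texts)
print(f"{false_hits} of {len(texts)} new elements hit a stale cache entry")
```

On CPython the count is usually well above zero, because the memory of the element freed at the end of one iteration is often handed straight to the element created in the next.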

Changing the cache to store only the element itself (or its XPath) eliminates the duplication:

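cache = []  # holds a reference to every processed element
for element in element_list:  # element_list: the parsed nodes being analyzed
    a = element.xpath('//xxx')      # placeholder query
    b = element.xpath('.//text()')  # all text nodes under this element
    c = 1 + 1                       # stand-in for further processing
    cache.append(element)           # the stored reference keeps the element alive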

Keeping a reference to each element prevents its reference count from dropping to zero, so the element is never freed while it sits in the cache. Because no two live objects can share an address, each cached node keeps a unique memory address.
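The effect of holding a reference can be observed with a weak reference, which reports whether its target still exists. This is a small illustration under stated assumptions: `Node` is a hypothetical stand-in for a parsed element, and `kept` plays the role of the `cache` list above.

```python
import gc
import weakref


class Node:
    """Hypothetical stand-in for a parsed element."""

    def __init__(self, text):
        self.text = text


kept = []                      # plays the role of the cache list

held = Node("held")
kept.append(held)              # the list now references the node
held_ref = weakref.ref(held)
del held                       # the name is gone, the list reference remains

dropped = Node("dropped")
dropped_ref = weakref.ref(dropped)
del dropped                    # no references remain
gc.collect()                   # not required on CPython, covers other interpreters

print(held_ref() is kept[0])   # True: the node is still alive inside the list
print(dropped_ref() is None)   # True: the node has been freed
```

Only the freed node's address is available for reuse. The node held by the list keeps its address for as long as the list keeps the node.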

Solution

The fix replaces the string‑based key with an XPath‑based key, ensuring a one‑to‑one mapping between cache entries and HTML nodes. After this change the extractor produces correct results.

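# self::node() selects the element itself, so the key is the node object,
# not a string that contains its address
element_flag = element.xpath('self::node()')[0]
# the dict now references the element, so it stays alive while it is cached
element_text_cache[element_flag] = [element_text_list, element]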
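A key built from the element's position in the document does not depend on object lifetime at all. The sketch below illustrates that idea and is not GNE's implementation: it uses the standard library's xml.etree.ElementTree on the sample markup from the Problem Background section, and `build_paths` and `extract_link_texts` are hypothetical helpers. With lxml, `element.getroottree().getpath(element)` returns a comparable positional path. Real news pages are rarely well-formed XML, so they would need an HTML parser rather than `ET.fromstring`.

```python
import xml.etree.ElementTree as ET

HTML = """<body>
    <div>
        <a href="/xx">你好</a>
    </div>
    <h2>
        <a>世界</a>
    </h2>
</body>"""


def build_paths(root):
    """Map each element to a positional path such as /body/div[1]/a[1]."""
    paths = {root: f"/{root.tag}"}
    for parent in root.iter():  # document order: parents come before children
        seen = {}               # per-parent counter for each child tag
        for child in parent:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            paths[child] = f"{paths[parent]}/{child.tag}[{seen[child.tag]}]"
    return paths


def extract_link_texts(html):
    """Cache the text of every <a> element under a path-based key."""
    root = ET.fromstring(html)
    paths = build_paths(root)
    text_cache = {}
    for a in root.iter("a"):
        key = paths[a]
        if key not in text_cache:
            text_cache[key] = "".join(a.itertext()).strip()
    return text_cache


print(extract_link_texts(HTML))
# {'/body/div[1]/a[1]': '你好', '/body/h2[1]/a[1]': '世界'}
```

The two <a> elements get different keys because they sit at different positions in the tree, regardless of where their objects live in memory. Such a key stays valid only while the tree is not restructured.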

Key Takeaways

Never use str(object) as a persistent cache key when the string contains the object's memory address.

Understand Python's reference counting and garbage collection when designing caches.

A key tied to the node itself, such as its XPath in the document, stays stable across iterations as long as the tree is not modified.

Reference

[1] GNE: General News Extractor – https://github.com/kingname/GeneralNewsExtractor

