Why a Python HTML Extractor’s Cache Failed: Garbage Collection Got You
The article explains how using an element's string representation as a cache key caused duplicate extraction in a Python news‑page parser, reveals the role of Python's garbage collection and memory reuse, and shows how switching to XPath keys resolves the bug.
Problem Background
GNE (General News Extractor) version 0.2.1 improved extraction speed but introduced a puzzling bug caused by Python's garbage collection and memory reuse. The extractor reads an HTML file (e.g., tests/163/9.html) and attempts to collect the text of all <a> tags under <body>, <div>, and <h2> elements.
<body>
    <div>
        <a href="/xx">你好</a>
    </div>
    <h2>
        <a>世界</a>
    </h2>
</body>
The extractor correctly retrieves "你好" from the <div> branch, but when it processes the <h2> branch it mistakenly returns the same <a> element, repeating the work done on lines 15‑20 of the original code.
Root Cause: Cache Key Misuse
The implementation cached each <a> element using str(element) as the dictionary key. str(element) includes the object's memory address (e.g., <Element a at 0x1087ba638>). CPython frees an object as soon as its reference count drops to zero, and the freed memory can then be handed to a newly created element, so two different nodes can produce the same key. The analysis cached for one node is then incorrectly applied to another.
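The address reuse described above can be reproduced without any HTML library. The following is a minimal sketch, not GNE's code: `Node` is a hypothetical stand-in whose repr imitates the `<Element a at 0x...>` style, and `distinct_keys` is an illustrative helper. How often an address is reused depends on the interpreter and allocator, so only the keep-alive case has a guaranteed result.

```python
class Node:
    """Hypothetical stand-in for a parsed element; repr imitates lxml's style."""

    def __init__(self, tag):
        self.tag = tag

    def __repr__(self):
        # Like the real element repr, this embeds the object's memory address.
        return f"<Element {self.tag} at 0x{id(self):x}>"


def distinct_keys(n, keep_alive):
    """Create n nodes one by one and count the distinct str(node) keys seen."""
    keys = set()
    alive = []
    for _ in range(n):
        node = Node("a")
        keys.add(str(node))
        if keep_alive:
            alive.append(node)  # an extra reference keeps the address occupied
        del node  # with no other reference left, the object is freed here
    return len(keys)


print(distinct_keys(1000, keep_alive=True))   # always 1000: live objects never share an address
print(distinct_keys(1000, keep_alive=False))  # typically far fewer on CPython
```

When the second count is lower than the number of nodes, several different nodes produced the same key, which is exactly the collision that corrupted the cache.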
Holding a reference to each element while the loop runs (or keying the cache by the element's XPath) eliminates the duplication:
cache = []
for element in element_list:
    a = element.xpath('//xxx')       # placeholder for the real work on the element
    b = element.xpath('.//text()')
    c = 1 + 1
    cache.append(element)            # keep a reference so the element is not freed
Keeping a reference to each element prevents its reference count from dropping to zero, so the element is not freed and its memory address cannot be handed to another node while the loop is running.
Solution
The fix replaces the string‑based key with an XPath‑based key, ensuring a one‑to‑one mapping between cache entries and HTML nodes. After this change the extractor produces correct results.
# Example of using XPath as cache key
element_flag = element.xpath('self::node()')[0]
element_text_cache[element_flag] = [element_text_list, element]
Key Takeaways
Never use str(object) (which contains memory addresses) as a persistent cache key.
Understand Python's reference counting and garbage collection when designing caches.
XPath provides a stable identifier for HTML elements across iterations.
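To illustrate the last takeaway, here is a minimal sketch that keys a cache by a positional path string instead of by `str(element)`. It uses only the standard library's `xml.etree.ElementTree` on the sample markup from the problem description. `build_path_keys` is a hypothetical helper, not part of GNE, and the path format is an assumption chosen for readability. With lxml, `element.getroottree().getpath(element)` returns a comparable path.

```python
import xml.etree.ElementTree as ET

HTML = """<body>
<div><a href="/xx">你好</a></div>
<h2><a>世界</a></h2>
</body>"""


def build_path_keys(root):
    """Map every element to a positional path such as /body/div[1]/a[1]."""
    paths = {root: f"/{root.tag}"}
    stack = [root]
    while stack:
        parent = stack.pop()
        seen = {}  # per-parent counter for each child tag
        for child in parent:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            paths[child] = f"{paths[parent]}/{child.tag}[{seen[child.tag]}]"
            stack.append(child)
    return paths


root = ET.fromstring(HTML)
paths = build_path_keys(root)

# Cache keyed by the path string rather than by the address-bearing str(element).
element_text_cache = {}
for element in root.iter("a"):
    element_text_cache[paths[element]] = "".join(element.itertext())

for key, text in element_text_cache.items():
    print(key, text)
```

Because the key is derived from the node's position in the tree rather than from its memory address, two different nodes in the same document cannot collide even if an address is reused.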
Reference
[1] GNE: General News Extractor – https://github.com/kingname/GeneralNewsExtractor
