How to Fix Common XPath Errors in Python Web Scraping: A Step-by-Step Guide
This article walks through a real‑world Python web‑scraping problem, shows why the original XPath selector fails, provides corrected code with a working XPath expression, and highlights best practices such as adding request headers for reliable crawling.
1. Introduction
The author shares a question from a Python community about an XPath selector that returns incorrect results when crawling a jokes website.
2. Problem Code
The original script uses requests and lxml.etree with an XPath expression that does not match the desired elements, leading to unexpected output.
from lxml import etree
import requests
url = "http://www.xiaohua.com/duanzi/"
resp = requests.get(url)
html = etree.HTML(resp.text)
print('*---*' * 20)
result = html.xpath("/html/body/div[@class='main']/div[@class='content']/div[@class='grid clearfix']/div[@class='content-left']/div[@class='one-cont'][*]/p[@class='fonts']")
print(type(result))
print(result)
print('*-*' * 20)
b = 0
for i in result:
    b += 1
    print(i, len(result))
    print(b, etree.tostring(i).decode('utf-8'))
    if b > 1:
        break
The output confirms the XPath is incorrect.
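One way to find where a long absolute path stops matching is to evaluate it one step at a time. The sketch below does this against a small, made-up HTML fragment. The markup in `SAMPLE_HTML` and the helper name `first_failing_step` are assumptions for illustration; they are not the real page and not part of lxml.

```python
from lxml import etree

# Hypothetical markup standing in for the real page (an assumption).
SAMPLE_HTML = """
<html><body>
  <div class="main wrapper">
    <div class="content-left">
      <div class="one-cont"><p class="fonts"><a href="#">First joke</a></p></div>
      <div class="one-cont"><p class="fonts"><a href="#">Second joke</a></p></div>
    </div>
  </div>
</body></html>
"""


def first_failing_step(tree, steps):
    """Return the shortest path prefix that matches nothing, or None if every step matches."""
    for n in range(1, len(steps) + 1):
        prefix = "/" + "/".join(steps[:n])
        if not tree.xpath(prefix):
            return prefix
    return None


tree = etree.HTML(SAMPLE_HTML)
steps = [
    "html",
    "body",
    "div[@class='main']",  # exact match: fails when class is "main wrapper"
    "div[@class='content-left']",
    "div[@class='one-cont']",
    "p[@class='fonts']",
]
failing = first_failing_step(tree, steps)
print("First step with no match:", failing)
```

In this made-up fragment the third step is the one that breaks, because `@class='main'` compares the whole attribute value and the element carries two classes. The same check can be run against `resp.text` from the real site to see which step of the original expression returns an empty list.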
3. Solution
A community member provided a revised XPath and minor adjustments, allowing the script to extract the joke text correctly.
Running the revised script now yields the expected joke text.
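A minimal sketch of what a revision along these lines might look like, assuming the joke text still sits in `p` elements with class `fonts` inside `div` elements whose class list includes `one-cont`. The selector, the helper name `extract_jokes`, and the sample markup are illustrative assumptions, not the community member's actual code.

```python
from lxml import etree

# Assumed selector: relative rather than absolute, and tolerant of extra
# class names on the container element.
JOKE_XPATH = (
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' one-cont ')]"
    "//p[@class='fonts']"
)


def extract_jokes(html_text):
    """Return the stripped text of every matching paragraph."""
    tree = etree.HTML(html_text)
    jokes = []
    for p in tree.xpath(JOKE_XPATH):
        # string(.) flattens nested tags such as <a> into plain text.
        text = p.xpath("string(.)").strip()
        if text:
            jokes.append(text)
    return jokes


# Hypothetical markup used only to exercise the function offline.
sample = """
<html><body><div class="wrapper">
  <div class="one-cont first"><p class="fonts"><a href="#">First joke</a></p></div>
  <div class="one-cont"><p class="fonts"><a href="#">Second joke</a></p></div>
</div></body></html>
"""
jokes = extract_jokes(sample)
print(jokes)
```

Starting the path with `//` avoids depending on every ancestor between `body` and the target, so a change to an outer wrapper does not break the selector. To use it on the live page, pass `resp.text` from the `requests.get` call to `extract_jokes`.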
Another participant shared notes summarizing the fix.
4. Conclusion
The article demonstrates how to diagnose and correct XPath issues in Python web crawlers and reminds readers to set appropriate request headers and follow good scraping practices.
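As a rough illustration of the request-header advice, the sketch below builds a `requests.Session` that sends browser-like headers on every call. The header values and the helper names `build_session` and `fetch_html` are assumptions chosen for the example; the site may require different or additional headers.

```python
import requests


def build_session():
    """Create a session whose default headers resemble a desktop browser."""
    session = requests.Session()
    session.headers.update({
        # Example values only (assumptions); adjust to what the target site expects.
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/120.0 Safari/537.36"
        ),
        "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    })
    return session


def fetch_html(session, url, timeout=10):
    """Download a page, failing loudly on HTTP errors."""
    resp = session.get(url, timeout=timeout)
    resp.raise_for_status()
    # Let requests guess the encoding from the body to avoid garbled text.
    resp.encoding = resp.apparent_encoding
    return resp.text


session = build_session()
# html_text = fetch_html(session, "http://www.xiaohua.com/duanzi/")
```

Setting a timeout and calling `raise_for_status()` makes a blocked or failed request visible immediately, which helps separate network problems from XPath problems when a selector returns nothing.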
Python Crawling & Data Mining
