Fixing XPath Errors in Python Web Scraping: A Step-by-Step Guide

This article walks through diagnosing and fixing an XPath extraction issue in a Python web scraper. It shows the original faulty code, the corrected selector, and the final working script, and closes with a reminder to set request headers.

Python Crawling & Data Mining

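Hello everyone, I'm Pipi.

1. Preface

A few days ago, in the Python Diamond discussion group, someone asked about a selector extraction problem in a Python web scraper.

At first glance the code has no obvious errors, but the output is wrong.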

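# Original (faulty) version, kept as posted apart from the added comments.
from lxml import etree
import requests

url = "http://zw.hainan.gov.cn/wssc/emalls.html"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
}
html = requests.get(url, headers=headers)
html = html.content.decode('utf-8')
doc = etree.HTML(html)
# Problem 1: text() here matches only text nodes directly under <ul>,
# not the text nested inside its <li> and <a> children.
res = doc.xpath('/html/body/div[5]/ul/text()')
print('*-*--'*20)
for item in res:
    print(type(item))
    # Problem 2: item[0] is only the first character of each matched string.
    print(item[0])
print('*-*--'*20)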

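The initial diagnosis is that the XPath expression is the problem: ul/text() selects only the text nodes that are direct children of the ul element, not the text nested inside its li and a elements. In addition, print(item[0]) prints only the first character of each matched string.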
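
The failure is easier to see on a small sample. The sketch below uses made-up markup (SAMPLE_HTML is an assumption, not the real page) to show that text() directly under a ul matches only the whitespace between its li elements, and that indexing a string with [0] yields a single character.

```python
from lxml import etree

# Hypothetical markup for illustration only; the real page's structure may differ.
SAMPLE_HTML = """
<html><body>
<div>
  <ul>
    <li><a href="/cat">Category</a><a href="/item1">First item</a></li>
    <li><a href="/cat">Category</a><a href="/item2">Second item</a></li>
  </ul>
</div>
</body></html>
"""

doc = etree.HTML(SAMPLE_HTML)

# text() directly under <ul> matches only the whitespace between the <li> tags.
direct_text = doc.xpath('//div/ul/text()')
print([t.strip() for t in direct_text])

# Descending to the second <a> of each <li> reaches the text that is wanted.
link_text = doc.xpath('//div/ul/li/a[2]/text()')
print(link_text)  # → ['First item', 'Second item']

# Indexing a string returns a single character, not the whole string.
print(link_text[0][0])  # → F
```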

2. Implementation

After the extraction rule is changed to match the requirement, the scraper returns the expected text.

The final code is as follows:

from lxml import etree
import requests

url = "http://zw.hainan.gov.cn/wssc/emalls.html"
# Send a browser-like User-Agent with the request.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
}
response = requests.get(url, headers=headers)
html = response.content.decode('utf-8')
doc = etree.HTML(html)
# Select the text of the second <a> inside each <li> of a <ul> that sits in a <div>.
res = doc.xpath('.//div/ul/li/a[2]/text()')
print('*-*--'*20)
for item in res:
    print(type(item))
    # Print the whole matched string rather than its first character.
    print(item)
print('*-*--'*20)
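
The same selector can be wrapped in a small helper so it can be tried against saved HTML without hitting the network. This is only a sketch, not the author's code: extract_link_texts and the sample markup are hypothetical, and the assumption that the wanted text sits in the second a element of each li comes solely from the selector above.

```python
from lxml import etree


def extract_link_texts(html, xpath='.//div/ul/li/a[2]/text()'):
    """Return the stripped, non-empty strings matched by xpath (hypothetical helper)."""
    doc = etree.HTML(html)
    return [text.strip() for text in doc.xpath(xpath) if text.strip()]


# Hypothetical markup for illustration only; the real page may differ.
sample = """
<div>
  <ul>
    <li><a href="/a">Tag</a><a href="/1"> Notice one </a></li>
    <li><a href="/b">Tag</a><a href="/2">Notice two</a></li>
    <li><a href="/c">Only one link</a></li>
  </ul>
</div>
"""

titles = extract_link_texts(sample)
print(titles)  # → ['Notice one', 'Notice two']
```

List items that have no second link, like the last one in the sample, simply contribute nothing to the result.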

When writing a scraper, remember good habits such as adding request headers.
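
One way to make that habit stick is to attach the headers to a requests.Session so every request carries them. The sketch below is an illustration under assumptions: build_session is a hypothetical helper, and the request is only prepared, not sent, so the outgoing headers can be inspected offline.

```python
import requests


def build_session(user_agent):
    """Hypothetical helper: a session that sends the same User-Agent every time."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


session = build_session(
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
)

# Prepare the request without sending it, to see which headers would go out.
request = requests.Request("GET", "http://zw.hainan.gov.cn/wssc/emalls.html")
prepared = session.prepare_request(request)
print(prepared.headers["User-Agent"])

# To actually fetch the page, a timeout and a status check are reasonable additions:
# response = session.get(prepared.url, timeout=10)
# response.raise_for_status()
```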

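3. Summary

This article addressed an XPath extraction problem in a Python web scraper, giving the reasoning behind the fix and the complete code that retrieves the data correctly.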
