Master Python Web Scraping with pandas.read_html and XPath – A Step-by-Step Guide
This article walks through a real‑world Python web‑scraping problem, demonstrating how to extract clean table data using pandas.read_html and an alternative XPath approach, complete with code snippets, output screenshots, and practical tips for robust data collection.
Introduction
Hello, I’m PiPi. A few days ago a member of the Python Diamond group asked a question about a Python web‑scraping issue, which I’m sharing here.
The asker's original extraction picked up an extra line and the page numbers along with the table data, leaving a lot of useless information in the result.
Implementation
One contributor suggested using pd.read_html for efficient extraction. The code is shown below:
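import io
import requests
import pandas as pd
url = "https://zw.hainan.gov.cn/ggzy/ggzy/jgzbgg/index.jhtml"
# Placeholders (not shown in the original post): use the headers and cookies your browser sends.
headers = {"User-Agent": "Mozilla/5.0"}
cookies = {}
resp = requests.get(url, headers=headers, cookies=cookies)
# read_html returns one DataFrame per <table>; StringIO avoids the literal-HTML deprecation in pandas 2.1+.
df = pd.read_html(io.StringIO(resp.text))[0]
# Drop the last row, which appears to be the page-number row the asker wanted removed.
df.drop([len(df) - 1], inplace=True)
print(df)
The script solved the asker's problem, as illustrated by a screenshot in the original post.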
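To see what dropping the last row does without hitting the live site, here is a minimal sketch on an inline table. The HTML, column names, and the "Page 1 of 10" row are made-up sample data, not the structure of the real page; the sketch only assumes that the pagination text arrives as the final table row.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: two data rows followed by a pager row.
SAMPLE_HTML = """
<table>
  <tr><th>Title</th><th>Date</th></tr>
  <tr><td>Notice A</td><td>2023-01-05</td></tr>
  <tr><td>Notice B</td><td>2023-01-06</td></tr>
  <tr><td colspan="2">Page 1 of 10</td></tr>
</table>
"""

# read_html parses every <table> it finds and returns a list of DataFrames.
tables = pd.read_html(StringIO(SAMPLE_HTML))
df = tables[0]

# Slice off the final row (the pager) and renumber the index.
clean = df.iloc[:-1].reset_index(drop=True)
print(clean)
```

Slicing with `iloc[:-1]` is used here instead of `drop` by label so the sketch does not depend on the index being a default 0..n-1 range.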
There are many ways to extract web information. Another contributor offered an XPath‑based solution; its code appeared as an image in the original post.
The XPath method includes an anonymous function (a lambda) that may be confusing at first; the original post explained it in an accompanying illustration.
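Since the contributor's lambda is not reproduced here, the following is only a guess at the kind of anonymous function involved: one that strips whitespace and line breaks from scraped cell text. The sample strings are invented, and `clean` and `clean_named` are illustrative names.

```python
# Raw cell strings as they might come back from an XPath text query (made-up sample data).
raw_cells = ["\n    Notice A\r\n  ", "\t2023-01-05 ", "  Notice\n   B"]

# Anonymous function: split on any run of whitespace, then rejoin with single spaces.
clean = lambda s: " ".join(s.split())

# The same logic written as a named function, for comparison.
def clean_named(s):
    return " ".join(s.split())

cleaned = list(map(clean, raw_cells))
print(cleaned)  # ['Notice A', '2023-01-05', 'Notice B']
```

A lambda is just a function without a name, so the two forms behave identically; the named version is easier to reuse and to document.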
After cleaning whitespace and line breaks, the code was further optimized for robustness; the final version also appeared as an image in the original post.
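As a rough idea of what a more defensive XPath version could look like, here is a sketch using `lxml`. It is not the contributor's code: the table `id`, the column count, and the sample HTML are all assumptions chosen for illustration, and `extract_rows` is a hypothetical helper.

```python
from lxml import html

# Hypothetical sample page: a header row, two data rows with messy whitespace, and a pager row.
SAMPLE_HTML = """<html><body>
<table id="notices">
  <tr><th>Title</th><th>Date</th></tr>
  <tr><td>
      Notice A
  </td><td> 2023-01-05 </td></tr>
  <tr><td>Notice B</td><td>2023-01-06</td></tr>
  <tr><td colspan="2">Page 1 of 10</td></tr>
</table>
</body></html>"""


def extract_rows(page_source, expected_cols=2):
    """Return cleaned cell text for each data row of the sample table."""
    tree = html.fromstring(page_source)
    rows = []
    for tr in tree.xpath('//table[@id="notices"]//tr'):
        # text_content() gathers text from nested tags; split/join collapses whitespace.
        cells = [" ".join(td.text_content().split()) for td in tr.xpath("./td")]
        # Keep only rows with the expected number of non-empty cells,
        # which skips the header row (no <td>) and the pager row (one <td>).
        if len(cells) == expected_cols and all(cells):
            rows.append(cells)
    return rows


rows = extract_rows(SAMPLE_HTML)
print(rows)
```

Filtering on the cell count, rather than always dropping the last row, means the function keeps working if a page has no pager row.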
Conclusion
This article presented a Python web‑scraping problem, provided detailed analysis, and shared concrete code implementations using both pandas and XPath, helping the original asker resolve the issue effectively.
Python Crawling & Data Mining
