
Master Python Web Scraping with pandas.read_html and XPath – A Step-by-Step Guide

This article walks through a real‑world Python web‑scraping problem, demonstrating how to extract clean table data using pandas.read_html and an alternative XPath approach, complete with code snippets, output screenshots, and practical tips for robust data collection.

Python Crawling & Data Mining

Mar 9, 2023

Introduction

Hello, I’m PiPi. A few days ago a member of the Python Diamond group asked a question about a Python web‑scraping issue, which I’m sharing here.

The asker's original extraction also captured an extra row and the page numbers, so the result contained a lot of useless information.

Implementation

One contributor suggested using pd.read_html for efficient extraction. The code is shown below:

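from io import StringIO
import requests
import pandas as pd

url = "https://zw.hainan.gov.cn/ggzy/ggzy/jgzbgg/index.jhtml"
# headers and cookies were not shown in the original post; these are placeholders to replace with your own.
headers = {"User-Agent": "Mozilla/5.0"}
cookies = {}
resp = requests.get(url, headers=headers, cookies=cookies)
# read_html returns one DataFrame per <table>; newer pandas versions expect a file-like object, hence StringIO.
df = pd.read_html(StringIO(resp.text))[0]
# Drop the last row (the unwanted extra row described above).
df.drop([len(df)-1], inplace=True)
print(df)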

The script solved the asker's problem (the original post shows the output as a screenshot).
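To try the same idea without requesting the live site, here is a minimal sketch that runs against an inline HTML string. SAMPLE_HTML and its contents (including the pager row) are invented for illustration and are not taken from the real page; pd.read_html also needs an HTML parser such as lxml installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the page: a data table whose last row is a pager.
SAMPLE_HTML = """
<table>
  <tr><th>Title</th><th>Date</th></tr>
  <tr><td>Project A award notice</td><td>2023-03-01</td></tr>
  <tr><td>Project B award notice</td><td>2023-03-02</td></tr>
  <tr><td colspan="2">Page 1 of 10 Next</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> element.
tables = pd.read_html(StringIO(SAMPLE_HTML))
df = tables[0]

# Drop the trailing pager row by position, then rebuild a clean 0..n-1 index.
clean = df.iloc[:-1].reset_index(drop=True)
print(clean)
```

Slicing with iloc[:-1] removes the last row by position, so it still works if the index is not the default 0..n-1 range that df.drop([len(df)-1]) relies on.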

There are many ways to extract information from a web page. Another contributor offered an XPath-based solution (the original post shows that code as an image).
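The contributor's XPath code is not reproduced in this text, so the following is only a rough sketch of what an XPath approach can look like, assuming the lxml library. The sample HTML and the XPath expression are illustrative assumptions, not the contributor's actual code or the real page structure.

```python
from lxml import html

# Hypothetical stand-in for the page: a data table whose last row is a pager.
SAMPLE_HTML = """
<html><body><table>
  <tr><th>Title</th><th>Date</th></tr>
  <tr><td>Project A award notice</td><td>2023-03-01</td></tr>
  <tr><td>Project B award notice</td><td>2023-03-02</td></tr>
  <tr><td colspan="2">Page 1 of 10 Next</td></tr>
</table></body></html>
"""

tree = html.fromstring(SAMPLE_HTML)

rows = []
# Keep only rows that have <td> cells and no cell with a colspan attribute,
# which excludes both the <th> header row and the pager row in this sample.
for tr in tree.xpath("//table//tr[td and not(td/@colspan)]"):
    cells = [td.text_content() for td in tr.xpath("./td")]
    rows.append(cells)

print(rows)
```

Compared with pd.read_html, XPath lets the unwanted rows be excluded at selection time, at the cost of writing the row and cell handling by hand.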

The XPath method includes an anonymous function (a lambda) that may be confusing at first; the original post explains it with an illustration.
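Since the illustration is not available here, the sketch below shows in general terms how an anonymous function behaves. It assumes the lambda was used to tidy up each extracted cell, which is a guess; raw_cells and strip_cell are hypothetical names.

```python
# Hypothetical raw cell texts, as they might come back from an XPath query.
raw_cells = ["\n  Project A award notice \r\n", "\t2023-03-01\n"]

# Anonymous (lambda) form: a one-off function defined inline and applied to each item.
cleaned_lambda = list(map(lambda s: s.strip(), raw_cells))

# The same logic written as a named function.
def strip_cell(s):
    return s.strip()

cleaned_named = list(map(strip_cell, raw_cells))

# The same logic as a list comprehension, which many readers find easier to follow.
cleaned_comp = [s.strip() for s in raw_cells]

print(cleaned_lambda)
```

All three forms produce the same list; the lambda is just a function without a name.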

After cleaning whitespace and line breaks, the code was further optimized for robustness (the original post shows the final version as an image).
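The optimized code itself is not included in this text. As one possible reading of "cleaning whitespace and line breaks" plus a robustness check, the sketch below collapses whitespace inside each cell and skips rows that do not have the expected number of cells. The helper names clean_text and extract_rows, the expected_cells parameter, and the sample HTML are all assumptions made for illustration.

```python
from lxml import html

# Hypothetical sample with line breaks and padding inside cells, plus a pager row.
SAMPLE_HTML = """
<html><body><table>
  <tr><th>Title</th><th>Date</th></tr>
  <tr><td>
      Project A
      award notice
  </td><td> 2023-03-01 </td></tr>
  <tr><td>Project B award notice</td><td>2023-03-02
  </td></tr>
  <tr><td colspan="2">Page 1 of 10 Next</td></tr>
</table></body></html>
"""


def clean_text(text):
    """Collapse runs of spaces, tabs, and line breaks into single spaces."""
    return " ".join(text.split())


def extract_rows(page_source, expected_cells=2):
    """Return cleaned cell texts for rows that have exactly expected_cells <td> cells."""
    tree = html.fromstring(page_source)
    rows = []
    for tr in tree.xpath("//table//tr"):
        cells = [clean_text(td.text_content()) for td in tr.xpath("./td")]
        if len(cells) != expected_cells:
            # Skips the <th> header row (no <td>) and the single-cell pager row.
            continue
        rows.append(cells)
    return rows


result = extract_rows(SAMPLE_HTML)
print(result)
```

Checking the cell count instead of dropping the last row means the extraction does not depend on the pager appearing at a fixed position.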

Conclusion

This article presented a Python web‑scraping problem, provided detailed analysis, and shared concrete code implementations using both pandas and XPath, helping the original asker resolve the issue effectively.
