
Mastering Python Web Scraping: Clean Price Extraction with XPath Tricks

This article walks through a Python web-scraping problem, shows why the original XPath extraction returns noisy or missing price data, and presents several refined solutions, including filtering out empty entries, correcting the XPath selector, and using map/filter, to produce a clean, formatted price list.


1. Introduction

Hello, I am Pipi. A few days ago, a follower asked a question about a Python web-crawling problem in the "Python Diamond" group. The initial code fetched a page, parsed it with lxml.etree, and tried to extract price information via an XPath expression, but the result was incorrect.

Initial code:

import requests
from lxml import etree

# headers was defined elsewhere in the original snippet;
# a minimal User-Agent is assumed here
headers = {"User-Agent": "Mozilla/5.0"}

url = "http://zw.hainan.gov.cn/wssc/ec/jlyhnkj.html"
resp = requests.get(url, headers=headers)
text = resp.text
parse = etree.HTML(text)
price = parse.xpath("//div[@class='productlist']/ul/li/div[4]/text()")
print(type(price), len(price))
for i in price:
    print(i.strip())

The output contained newline characters and many empty entries, making the data hard to read.

2. Solution Process

Community members offered several improvements:

First attempt replaced newline characters directly:
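The snippet itself was not preserved in the discussion, so here is a minimal, runnable sketch of that first attempt. The inline `html` string is an assumed stand-in for the live page's structure, and the `replace` call is the direct newline substitution described above:

```python
from lxml import etree

# Stand-in HTML mirroring the page structure described in the article
# (the live hainan.gov.cn page is not fetched here)
html = """
<div class="productlist"><ul>
  <li><div>name</div><div>img</div><div>desc</div><div class="product_price">
      ¥128.00
  </div></li>
</ul></div>
"""
parse = etree.HTML(html)
# Original selector: positional index instead of the class name
price = parse.xpath("//div[@class='productlist']/ul/li/div[4]/text()")
# First attempt: replace the newline characters directly
cleaned = [p.replace("\n", "") for p in price]
print(cleaned)  # newlines are gone, but surrounding spaces survive
```

This removes the line breaks but leaves the leading and trailing spaces intact, which is why further cleanup was still needed.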

Second attempt filtered out empty strings:
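Again as a sketch over assumed markup: here the sample `html` places a child element inside the price cell, which is one plausible reason the original XPath returned whitespace-only text nodes that stripped down to empty strings:

```python
from lxml import etree

# Stand-in HTML: a child element splits the price cell into several
# text nodes, some of which are pure whitespace
html = """
<div class="productlist"><ul>
  <li><div>name</div><div>img</div><div>desc</div><div class="product_price">
      <span>Sale</span>
      ¥128.00
  </div></li>
</ul></div>
"""
parse = etree.HTML(html)
price = parse.xpath("//div[@class='productlist']/ul/li/div[4]/text()")
# Second attempt: strip each entry, then drop the ones that end up empty
cleaned = [p.strip() for p in price if p.strip()]
print(cleaned)
```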

Root cause fix – the original XPath selector was wrong. The correct selector should target the element with class product_price:
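A runnable sketch of the corrected selector. Only the class name product_price is given in the discussion, so the surrounding markup below is an assumption; note that when the price div is not the fourth child, the original positional selector silently matches nothing:

```python
from lxml import etree

# Stand-in HTML (assumed): the price lives in a div identified by its
# class, not necessarily as the fourth div of each <li>
html = """
<div class="productlist"><ul>
  <li><div>name</div><div>img</div>
      <div class="product_price">¥128.00</div></li>
  <li><div>name</div><div>img</div>
      <div class="product_price">¥256.00</div></li>
</ul></div>
"""
parse = etree.HTML(html)
# Root-cause fix: select by class instead of by position
price = parse.xpath(
    "//div[@class='productlist']/ul/li/div[@class='product_price']/text()"
)
for p in price:
    print(p.strip())
```

Selecting by class is also more robust against layout changes: inserting or removing an earlier sibling div no longer breaks the extraction.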

Final robust solution uses map and filter to strip whitespace and remove empty items:
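A sketch of that final cleanup over the same assumed markup: `map(str.strip, ...)` strips every entry, and `filter(None, ...)` drops the strings that are empty afterwards:

```python
from lxml import etree

# Stand-in HTML with the whitespace-only text nodes seen earlier
html = """
<div class="productlist"><ul>
  <li><div>name</div><div>img</div>
      <div class="product_price">
        <span>Sale</span>
        ¥128.00
      </div></li>
</ul></div>
"""
parse = etree.HTML(html)
price = parse.xpath(
    "//div[@class='productlist']/ul/li/div[@class='product_price']/text()"
)
# map strips every entry; filter(None, ...) removes the empty strings
cleaned = list(filter(None, map(str.strip, price)))
print(cleaned)
```

`filter(None, iterable)` keeps only truthy items, so empty strings are discarded in one pass; a list comprehension with an `if` clause, as in the second attempt, is equivalent.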

Tags: Python, data cleaning, lxml
Written by

Python Crawling & Data Mining

Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!
