
How to Clean and Extract Product Prices with Python Web Scraping (XPath Tips)

This article walks through a Python web-scraping problem in which price data extracted via XPath contains empty lines and unwanted characters. It presents several code solutions, including list comprehensions, map/filter, and conditional printing, to clean and correctly display the product prices.

Python Crawling & Data Mining

Aug 30, 2022

Introduction

A member of a Python community asked about a web-scraping problem: the price data extracted from a government website came back messy, with newline characters and empty entries.

Initial Attempt

The original code used requests and lxml.etree to fetch the page and extract prices with the XPath expression //div[@class='productlist']/ul/li/div[4]/text(). The result included newline characters and blank strings, so the printed output was interleaved with empty lines.

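import requests
from lxml import etree

# Assumption: the original post did not show `headers`; this is a placeholder value.
headers = {"User-Agent": "Mozilla/5.0"}

url = "http://zw.hainan.gov.cn/wssc/ec/jlyhnkj.html"
resp = requests.get(url, headers=headers)
text = resp.text
parse = etree.HTML(text)
# Select the text nodes of the fourth <div> inside each product <li>
price = parse.xpath("//div[@class='productlist']/ul/li/div[4]/text()")
print(type(price), len(price))
for i in price:
    print(i.strip())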

First Fix

A contributor suggested stripping newline characters directly:

for i in price:
    print(i.strip().replace('\n', ''))

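This reduced some noise but still left empty entries: strip() already removes leading and trailing newlines, and a text node that contains only whitespace becomes an empty string, which still prints as a blank line.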
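To see why the empty entries survive, here is a minimal sketch using a made-up list shaped like the XPath result. The sample values are illustrative assumptions, not data taken from the site.

```python
# Hypothetical sample of what a text() query might return:
# whitespace-only nodes mixed with the real price strings.
raw = ["\n    ", "¥1,299.00\n", "\n", "  ¥85.50  ", "\n\t"]

# Same cleaning as the loop above, collected into a list instead of printed.
stripped = [s.strip().replace("\n", "") for s in raw]

print(stripped)            # ['', '¥1,299.00', '', '¥85.50', '']
print(stripped.count(""))  # 3
```

Stripping shortens each string but never drops an item from the list, so the whitespace-only nodes remain as empty strings until they are filtered out.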

Filtering Empty Values

Another solution selected the price element by its class instead of by position, then dropped whitespace-only strings using filter and map:

import requests
from lxml import etree
url = "http://zw.hainan.gov.cn/wssc/ec/jlyhnkj.html"
resp = requests.get(url)
text = resp.text
parse = etree.HTML(text)
price = parse.xpath('//div[@class="product_price"]/text()')
for i in map(str.strip, filter(str.strip, price)):
    print(i)
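The filter/map pair can be read in two steps. The sketch below runs on a made-up list (the values are assumptions for illustration) so that it works without network access.

```python
# Hypothetical sample shaped like the XPath result.
raw = ["\n    ", "¥1,299.00\n", "\n", "  ¥85.50  "]

# Step 1: filter() calls str.strip on each item and keeps the item when the
# result is truthy (non-empty). The kept items are the ORIGINAL strings.
kept = list(filter(str.strip, raw))
print(kept)     # ['¥1,299.00\n', '  ¥85.50  ']

# Step 2: map() applies str.strip to the survivors to produce clean values.
cleaned = list(map(str.strip, kept))
print(cleaned)  # ['¥1,299.00', '¥85.50']
```

Because filter returns the unstripped originals, the second strip in map is still needed.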

Alternative Approaches

Other contributors offered different methods:

# Iterate over each product node and extract the price directly
# (index [1] assumes the price is the second text node of the price div)
resp = requests.get(url, headers=headers)
text = resp.text
parse = etree.HTML(text)
price = parse.xpath("//div[@class='productlist']/ul/li")
for i in price:
    print(i.xpath('./div[@class="product_price"]/text()')[1].strip())
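Indexing a fixed text node with [1] raises IndexError for any product whose price div has fewer text nodes. A more defensive variant of the per-node idea is sketched below. To keep it runnable offline it swaps in the standard library's xml.etree.ElementTree and a small, well-formed, hypothetical snippet; the markup, the product_name class, and the extract_prices helper are assumptions, not the real page or the contributors' code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup loosely modelled on the selectors above.
SAMPLE = """
<div class="productlist">
  <ul>
    <li>
      <div class="product_name">Item A</div>
      <div class="product_price">
        <span class="label"></span>
        ¥1,299.00
      </div>
    </li>
    <li>
      <div class="product_name">Item B</div>
      <div class="product_price">¥85.50</div>
    </li>
    <li>
      <div class="product_name">Item C (no price listed)</div>
    </li>
  </ul>
</div>
"""


def extract_prices(root):
    """Return one cleaned price string per product that has a price."""
    prices = []
    for item in root.findall(".//li"):
        node = item.find("div[@class='product_price']")
        if node is None:
            continue  # product without a price div: skip, do not crash
        # Join every text fragment under the div, then trim the whitespace.
        text = "".join(node.itertext()).strip()
        if text:
            prices.append(text)
    return prices


root = ET.fromstring(SAMPLE.strip())
print(extract_prices(root))  # ['¥1,299.00', '¥85.50']
```

Joining all text fragments avoids depending on which text node holds the price. With lxml, the same loop could use xpath('./div[@class="product_price"]') on each item and check that the result list is non-empty before reading it.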

# Simple loop with conditional stripping
resp = requests.get(url, headers=headers)
text = resp.text
parse = etree.HTML(text)
price = parse.xpath("//div[@class='product_price']/text()")
for i in price:
    if i.strip():
        print(i.replace('\n', '').strip())

# List comprehension to remove empty entries
price = [i.strip() for i in price if i.strip()]
print(price)
# Optional conversion to numbers (commented out)
# price = [int(float(i.replace('¥', '').replace(',', ''))) for i in price]
print(price)
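The commented-out conversion uses int(float(...)), which truncates the fractional part, so '85.50' would become 85. If the decimals matter, one possible alternative is the standard library's Decimal. The parse_price helper below and its sample inputs are hypothetical, a sketch rather than part of the original answers.

```python
from decimal import Decimal, InvalidOperation


def parse_price(text, currency_symbols="¥￥"):
    """Convert a scraped price string to Decimal, or None if it is not numeric."""
    cleaned = text.replace(",", "")       # drop thousands separators
    for symbol in currency_symbols:       # drop currency signs
        cleaned = cleaned.replace(symbol, "")
    cleaned = cleaned.strip()
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None                       # empty or non-numeric text


# Made-up inputs for illustration.
samples = ["¥1,299.00", "  ¥85.50\n", "N/A", ""]
print([parse_price(s) for s in samples])
# [Decimal('1299.00'), Decimal('85.50'), None, None]
```

Returning None for unparseable text keeps the list aligned with the scraped items, so the caller can decide whether to drop or report those entries.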

Conclusion

The article demonstrates several practical ways to clean scraped price data in Python, emphasizing the use of strip(), list comprehensions, map / filter, and correct XPath selection to obtain accurate results.
