How to Clean and Extract Product Prices with Python Web Scraping (XPath Tips)
This article walks through a Python web‑scraping problem where price data extracted via XPath contains empty lines and unwanted characters, presenting multiple code solutions—including list comprehensions, map‑filter, and conditional printing—to clean and correctly display the product prices.
Introduction
In a Python community, a user asked about a web‑scraping issue where the extracted price data from a government website was messy, containing newline characters and empty entries.
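To see where such entries come from, here is a minimal sketch. The HTML fragment is invented for illustration and is not the markup of the real page; only the class names are borrowed from the XPath expressions discussed in this article. With lxml, text() returns each text node separately, so the indentation around a child element typically shows up as its own whitespace-only string.

```python
from lxml import etree

# Hypothetical fragment; the real page's markup may differ.
html = """
<div class="productlist"><ul>
  <li><div class="product_price">
    <span>Price</span>
    ¥1,200.00
  </div></li>
  <li><div class="product_price">
    <span>Price</span>
    ¥350.00
  </div></li>
</ul></div>
"""

tree = etree.HTML(html)
raw = tree.xpath('//div[@class="product_price"]/text()')

# repr() makes the newlines and any whitespace-only nodes visible
for node in raw:
    print(repr(node))

# Only the nodes that are non-empty after stripping carry a price
non_blank = [node.strip() for node in raw if node.strip()]
print(non_blank)  # ['¥1,200.00', '¥350.00']
```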
Initial Attempt
The original code used requests and lxml.etree to fetch the page and extract prices with the XPath expression //div[@class='productlist']/ul/li/div[4]/text(). The output included newline characters and blank strings.
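import requests
from lxml import etree
url = "http://zw.hainan.gov.cn/wssc/ec/jlyhnkj.html"
# The original post did not show `headers`; a browser-like User-Agent is assumed here.
headers = {"User-Agent": "Mozilla/5.0"}
resp = requests.get(url, headers=headers)
text = resp.text
parse = etree.HTML(text)
# text() returns every text node under the fourth div of each list item
price = parse.xpath("//div[@class='productlist']/ul/li/div[4]/text()")
print(type(price), len(price))
for i in price:
    print(i.strip())
First Fix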
A contributor suggested stripping newline characters directly:
for i in price:
    print(i.strip().replace('\n', ''))
This reduced some noise but still left empty entries.
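A minimal sketch of why the empty entries survive, using made-up sample values rather than data from the real page: strip() shortens a whitespace-only string to an empty string, but nothing removes that string from the list, so the loop still prints a blank line for it.

```python
# Hypothetical raw values standing in for what the XPath query returned
raw = ["\n        ", "\n        ¥1,200.00\n    ", "\n        ", "\n        ¥350.00\n    "]

# Same cleaning as the loop above, collected into a list for inspection
cleaned = [s.strip().replace("\n", "") for s in raw]
print(cleaned)  # ['', '¥1,200.00', '', '¥350.00']
```

Since strip() already removes leading and trailing newlines, the extra replace() only matters for newlines in the middle of a value.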
Filtering Empty Values
Another solution selected the price div by its class name and filtered out empty strings using filter and map:
import requests
from lxml import etree
url = "http://zw.hainan.gov.cn/wssc/ec/jlyhnkj.html"
resp = requests.get(url)
text = resp.text
parse = etree.HTML(text)
price = parse.xpath('//div[@class="product_price"]/text()')
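# filter(str.strip, ...) keeps only strings that are non-empty after stripping;
# map(str.strip, ...) then strips the survivors
for i in map(str.strip, filter(str.strip, price)):
    print(i)
Alternative Approaches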
Other contributors offered different methods:
# Iterate over each product node and extract the price directly
resp = requests.get(url, headers=headers)
text = resp.text
parse = etree.HTML(text)
price = parse.xpath("//div[@class='productlist']/ul/li")
for i in price:
    # [1] selects the second text node of the price div;
    # it raises IndexError if a div has fewer than two text nodes
    print(i.xpath('./div[@class="product_price"]/text()')[1].strip())

# Simple loop with conditional stripping
resp = requests.get(url, headers=headers)
text = resp.text
parse = etree.HTML(text)
price = parse.xpath("//div[@class='product_price']/text()")
for i in price:
    if i.strip():
        print(i.replace('\n', '').strip())

# List comprehension to remove empty entries
price = [i.strip() for i in price if i.strip()]
print(price)
# Optional conversion to numbers (commented out); int() drops any fractional part
# price = [int(float(i.replace('¥', '').replace(',', ''))) for i in price]
print(price)
Conclusion
The article demonstrates several practical ways to clean scraped price data in Python, emphasizing the use of strip(), list comprehensions, map / filter, and correct XPath selection to obtain accurate results.
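As one way to combine these steps, here is a minimal sketch. The helper names clean_prices and to_decimal are hypothetical, the sample strings are made up, and Decimal is used in place of the int(float(...)) conversion from the commented-out line so that fractional amounts are kept.

```python
from decimal import Decimal


def clean_prices(raw_texts):
    """Strip whitespace and drop entries that are empty after stripping."""
    return [t.strip() for t in raw_texts if t.strip()]


def to_decimal(price_text, currency_symbol="¥"):
    """Convert a string such as '¥1,200.00' to a Decimal amount."""
    # Assumes the symbol and thousands separator used in the article's example
    return Decimal(price_text.replace(currency_symbol, "").replace(",", ""))


# Hypothetical text nodes, shaped like the messy XPath output described above
raw = ["\n    ", "\n    ¥1,200.00\n  ", "\n    ", "\n    ¥350.50\n  "]

prices = clean_prices(raw)
amounts = [to_decimal(p) for p in prices]
print(prices)   # ['¥1,200.00', '¥350.50']
print(amounts)  # [Decimal('1200.00'), Decimal('350.50')]
```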
Python Crawling & Data Mining