Master XPath to Scrape Precise JD.com Product Data with Python
This tutorial shows how to encode a JD.com search URL, fetch the page, and use XPath expressions in Python to reliably extract product names, links, images, and prices, offering a clearer alternative to regular‑expression or BeautifulSoup scraping.
After earlier tutorials that crawled JD.com products with regular expressions and BeautifulSoup, this article introduces XPath as a more direct way to accurately locate product information in the site's HTML.
HTML pages are composed of nested tags forming a tree, and XPath selects nodes with path expressions. On JD.com search results, each product's data resides inside a <li class="gl-item"> element.
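To illustrate how a path expression walks this tag tree, here is a minimal sketch using lxml on a toy fragment that mimics JD.com's list structure (the fragment itself is invented for illustration; real pages contain far more nesting and attributes):

```python
from lxml import etree

# A tiny HTML fragment mimicking JD.com's product-list structure.
html = '''
<ul>
  <li class="gl-item"><div class="p-img"><a title="Dog Food A" href="//item.jd.com/1.html"></a></div></li>
  <li class="gl-item"><div class="p-img"><a title="Dog Food B" href="//item.jd.com/2.html"></a></div></li>
</ul>
'''

tree = etree.HTML(html)
# //li[@class="gl-item"] matches every <li> anywhere in the document
# whose class attribute is exactly "gl-item".
items = tree.xpath('//li[@class="gl-item"]')
print(len(items))  # 2
```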
First, construct the search URL by inserting the desired keyword (e.g., "dog food") into the keyword query parameter and URL‑encode it with urllib.parse.quote. Send an HTTP GET request to the resulting URL and obtain the response HTML.
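The URL step can be sketched as follows; the `https://search.jd.com/Search` endpoint is an assumption based on JD's public search page, while the `keyword` parameter and `urllib.parse.quote` call come from the steps above:

```python
import urllib.parse

def build_search_url(keyword: str) -> str:
    """Percent-encode the keyword and splice it into JD's search URL.
    (Endpoint is an assumption; the keyword parameter is per the tutorial.)"""
    return "https://search.jd.com/Search?keyword=" + urllib.parse.quote(keyword)

url = build_search_url("dog food")
print(url)  # https://search.jd.com/Search?keyword=dog%20food

# Fetching the page (requires the third-party requests library):
# import requests
# html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10).text
```

A browser-like User-Agent header, as in the commented fetch, helps avoid trivial bot blocking.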
Parse the HTML with an XPath-capable selector (e.g., Scrapy's Selector or lxml). Use the following expression to collect all product items:

items = selector.xpath('//li[@class="gl-item"]')

Then iterate over the items and extract the individual fields. Example snippet:
title = selector.xpath('//div[@class="p-img"]/a')[i].get('title')

Similar XPath queries retrieve the product link, image URL, and price. The loop typically uses range(len(items)) to process each product sequentially.
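Putting the pieces together, the extraction loop can be sketched as below with lxml. The HTML here is a simplified stand-in for a fetched search page (the p-price class, the <i> price tag, and the attribute names are assumptions modeled on JD's markup; real pages use lazy-loaded image attributes and more nesting):

```python
from lxml import etree

# Simplified stand-in for a fetched JD.com search page.
html = '''
<ul>
  <li class="gl-item">
    <div class="p-img"><a title="Dog Food A" href="//item.jd.com/1.html">
      <img src="//img.jd.com/a.jpg"/></a></div>
    <div class="p-price"><i>99.00</i></div>
  </li>
  <li class="gl-item">
    <div class="p-img"><a title="Dog Food B" href="//item.jd.com/2.html">
      <img src="//img.jd.com/b.jpg"/></a></div>
    <div class="p-price"><i>59.00</i></div>
  </li>
</ul>
'''

selector = etree.HTML(html)
items = selector.xpath('//li[@class="gl-item"]')

products = []
for i in range(len(items)):
    # Index into parallel node lists, mirroring the range(len(items))
    # pattern described above.
    link_node = selector.xpath('//div[@class="p-img"]/a')[i]
    products.append({
        "title": link_node.get("title"),
        "url": "https:" + link_node.get("href"),
        "image": "https:" + selector.xpath('//div[@class="p-img"]//img')[i].get("src"),
        "price": selector.xpath('//div[@class="p-price"]/i/text()')[i],
    })

for p in products:
    print(p)
```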
Running the script produces a list of product details, as illustrated by the screenshots below.
The final output displays the extracted product names, URLs, images, and prices, confirming that XPath offers a cleaner and more maintainable approach than regular‑expression based scraping for JD.com.