How to Scrape JD.com Product Data with Python Regex: A Step‑by‑Step Guide
This tutorial shows how to build a keyword-driven web crawler for JD.com in Python. It uses urllib to encode the search keyword and open the resulting URL, applies regular expressions to extract product information (dog food listings in the example), and explains how to extend the scraper to collect data from multiple pages.
JD.com is China's largest self-operated e-commerce platform; in Q1 2015 it held a 56.3% share of China's self-operated B2C e-commerce market.
To retrieve product data, enter a keyword on JD.com, for example 狗粮 ("dog food"). The search generates a URL like
https://search.jd.com/Search?keyword=%E7%8B%97%E7%B2%AE&enc=utf-8. The keyword parameter is the percent-encoded UTF-8 form of the search term, so requesting this URL directly opens the target page.
Using Python 3, the urllib.parse.quote function encodes the keyword, and urllib.request.urlopen fetches the page source. After obtaining the HTML, regular expressions are applied to extract the desired fields.
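The encoding and fetching steps can be sketched as follows. This is a minimal illustration, not the tutorial's original code: the helper names build_search_url and fetch_html, the User-Agent header, and the timeout are assumptions added for the example. Only urllib.parse.quote, urllib.request.urlopen, and the URL shape come from the text above.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

# URL shape taken from the example search URL in the article.
SEARCH_URL = "https://search.jd.com/Search?keyword={}&enc=utf-8"


def build_search_url(keyword):
    """Percent-encode the keyword as UTF-8 and insert it into the search URL."""
    return SEARCH_URL.format(quote(keyword))


def fetch_html(url, timeout=10):
    """Download the page and decode it as UTF-8 text.

    The User-Agent header is an assumption: some sites reject the
    default urllib one, and the site may still require a login or cookies.
    """
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    url = build_search_url("狗粮")  # "dog food"
    print(url)
    # html = fetch_html(url)  # needs network access; uncomment to run
```

Keeping URL construction separate from the network call makes the encoding step easy to check without touching the site.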
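The core regex patterns used are [\w\W]+? and [\s\S]+?. Each character class matches any character at all, including line breaks, which the dot . does not match unless the re.DOTALL flag is set. The trailing +? makes the match lazy, so it consumes as few characters as possible before the rest of the pattern matches.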
Explanation of the patterns: [\s\S] matches any whitespace (\s) or non-whitespace (\S) character, so together the class matches every possible character. Similarly, [\w\W] matches any word character or non-word character. These constructs are useful when a match must be able to span several lines of HTML.
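A small comparison shows the difference in practice. The HTML fragment below is made up for the demonstration and is not JD.com's real markup.

```python
import re

# Hypothetical fragment with line breaks inside each <li> element.
snippet = (
    "<li>\n  <em>Dog Food 10kg</em>\n</li>"
    "<li>\n  <em>Cat Food 2kg</em>\n</li>"
)

# The dot does not match "\n" by default, so this cannot span the lines.
dot_match = re.search(r"<li>.+?</li>", snippet)

# [\s\S] also matches "\n"; the lazy +? stops at the first closing tag.
wild_match = re.search(r"<li>[\s\S]+?</li>", snippet)

# With re.DOTALL the dot behaves like [\s\S].
dotall_match = re.search(r"<li>.+?</li>", snippet, re.DOTALL)

# Without the ?, the greedy + runs on to the last closing tag.
greedy_match = re.search(r"<li>[\s\S]+</li>", snippet)

print(dot_match)                                    # None
print(wild_match.group(0) == dotall_match.group(0)) # True
print(greedy_match.group(0) == snippet)             # True
```

The lazy quantifier matters as much as the character class: without it, a single match would swallow every listing on the page.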
After processing, the scraper outputs the extracted product information (in the example, four fields from a single page). The final result is shown in the screenshot below.
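The source does not name the four fields or show the page markup, so the following is only a sketch of the extraction step under stated assumptions: the sample HTML, its class names, and the fields (sku, price, name, shop) are all hypothetical. A real pattern has to be written against the page source actually returned by the site.

```python
import re

# Hypothetical listing markup; the real page structure will differ.
SAMPLE_HTML = """
<li class="item" data-sku="1001">
  <div class="price"><i>129.00</i></div>
  <div class="name"><em>Brand A Adult Dog Food 10kg</em></div>
  <div class="shop"><a title="Brand A Store">Brand A Store</a></div>
</li>
<li class="item" data-sku="1002">
  <div class="price"><i>59.90</i></div>
  <div class="name"><em>Brand B Puppy Food 2kg</em></div>
  <div class="shop"><a title="Brand B Store">Brand B Store</a></div>
</li>
"""

# One capture group per field; [\s\S]+? lazily skips the markup
# (including line breaks) between one field and the next.
ITEM_PATTERN = re.compile(
    r'<li class="item" data-sku="(\d+)">[\s\S]+?'
    r'<div class="price"><i>([\d.]+)</i>[\s\S]+?'
    r'<div class="name"><em>([\s\S]+?)</em>[\s\S]+?'
    r'<div class="shop"><a title="([^"]+)"'
)

FIELDS = ("sku", "price", "name", "shop")


def extract_products(html):
    """Return one dict per listing found in the HTML."""
    return [dict(zip(FIELDS, groups)) for groups in ITEM_PATTERN.findall(html)]


if __name__ == "__main__":
    for product in extract_products(SAMPLE_HTML):
        print(product)
```

Because re.findall returns a tuple of groups for each match, zipping it with the field names yields records that are easy to print or write to a file.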
The tutorial notes that this is a basic single‑page scraper; readers can modify the regular expressions and add pagination logic to collect more data. In the next article, BeautifulSoup will be introduced for more robust HTML parsing.
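Pagination could be added along these lines. The source does not say how JD.com numbers its result pages, so the page query parameter and its 1, 2, 3 numbering are assumptions; compare them with the URL the browser shows after clicking through to the next page. The fetch and parse arguments are placeholders for whatever download and extraction functions the scraper already has.

```python
import time
from urllib.parse import urlencode

BASE_URL = "https://search.jd.com/Search"


def build_page_urls(keyword, pages, page_param="page"):
    """Build one search URL per page.

    page_param and the simple 1..pages numbering are assumptions,
    not confirmed behaviour of the site.
    """
    urls = []
    for number in range(1, pages + 1):
        query = urlencode({"keyword": keyword, "enc": "utf-8", page_param: number})
        urls.append(f"{BASE_URL}?{query}")
    return urls


def crawl(keyword, pages, fetch, parse, delay=1.0):
    """Fetch each page, parse it, and collect the results in one list.

    fetch(url) must return HTML text and parse(html) must return a list.
    The delay between requests keeps the load on the site low.
    """
    results = []
    for url in build_page_urls(keyword, pages):
        results.extend(parse(fetch(url)))
        time.sleep(delay)
    return results


if __name__ == "__main__":
    for url in build_page_urls("狗粮", 3):
        print(url)
```

Passing fetch and parse in as arguments means the loop can be tested with stand-in functions before it is pointed at the live site.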
Finally, a brief introduction to regular expressions is provided, emphasizing that while they can seem complex for beginners, understanding when and how to use specific patterns is sufficient for effective data extraction.
Python Crawling & Data Mining
