How to Scrape JD.com Product Data with Python Regex: A Step‑by‑Step Guide
This tutorial shows how to build a JD.com search URL, encode the keyword, fetch the page with Python's urllib, and extract product details with regular expressions. Code snippets, regex explanations, and sample output are included for beginners.
JD.com is China’s largest self‑operated e‑commerce platform. By entering a keyword such as “狗粮” (dog food) into the search box, the resulting URL looks like
https://search.jd.com/Search?keyword=%E7%8B%97%E7%B2%AE&enc=utf-8. The keyword parameter is URL‑encoded, so any desired search term can be inserted after encoding.
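The encoding step can be verified directly with urllib.parse.quote, which round-trips with unquote:

```python
from urllib.parse import quote, unquote

# '狗粮' percent-encodes to the same bytes visible in the search URL
print(quote('狗粮'))                  # -> %E7%8B%97%E7%B2%AE
print(unquote('%E7%8B%97%E7%B2%AE'))  # -> 狗粮
```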
Constructing the Search URL
Use Python’s urllib.parse.quote to encode the keyword and concatenate it with the base URL and the enc=utf-8 flag.
import urllib.parse
keyword = '狗粮'
search_url = 'https://search.jd.com/Search?keyword=' + urllib.parse.quote(keyword) + '&enc=utf-8'
Fetching the Page Source
Retrieve the HTML with urllib.request.urlopen and decode it as UTF‑8.
import urllib.request

# JD.com may reject requests carrying urllib's default User-Agent,
# so send a browser-like one
request = urllib.request.Request(search_url, headers={'User-Agent': 'Mozilla/5.0'})
response = urllib.request.urlopen(request)
html = response.read().decode('utf-8')
Extracting Information with Regular Expressions
Patterns like [\w\W]+? or [\s\S]+? act as a full-character wildcard: they match any character, including line breaks, which the dot . does not do by default. The regexes below use this idea to capture fields such as product titles and prices.
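A quick sketch of the difference (the HTML snippet here is hypothetical, just to show the behavior):

```python
import re

snippet = '<li class="gl-item">\ndog food\n</li>'

# '.' stops at newlines by default, so this pattern cannot span the block
print(re.search(r'<li.*?</li>', snippet))       # -> None

# [\s\S] matches any character, including '\n'
print(re.search(r'<li[\s\S]*?</li>', snippet))  # matches the whole block
```

Passing re.S (re.DOTALL) to the regex call achieves the same effect by letting . match newlines, which is what the example below does.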
import re
# Example pattern that matches a product block (simplified)
pattern = r'"skuName":"(.*?)".*?"price":"(.*?)"'
matches = re.findall(pattern, html, re.S)
for title, price in matches:
    print(f'Title: {title}, Price: {price}')
Sample Output
The script prints the extracted titles and prices to the console.
Next Steps
Only a few fields were captured here, and only from a single results page; try modifying the regexes and adding pagination logic to collect more data. A follow-up article will demonstrate BeautifulSoup for more robust parsing.
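As a starting point for pagination, here is a minimal sketch. Note the page parameter and its odd-numbered values (1, 3, 5 for successive result pages) are an assumption about JD's search URL, not something established above; verify against the live site before relying on it.

```python
import urllib.parse

keyword = '狗粮'
base = 'https://search.jd.com/Search?keyword=' + urllib.parse.quote(keyword) + '&enc=utf-8'

# Assumption: successive result pages use odd page values 1, 3, 5, ...
urls = [base + '&page=' + str(2 * n - 1) for n in range(1, 4)]
for url in urls:
    print(url)
```

Each URL can then be fetched and parsed with the same regex logic shown earlier.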
Finally, a word on regular expressions: beginners need not memorize every pattern. Understanding when and how to use common constructs like [\s\S] is sufficient for effective web scraping.
Python Crawling & Data Mining
Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!