Scrapy Crawl Template for Automatically Extracting JD.com Product Information

This article provides a step‑by‑step guide on using Scrapy’s crawl template to automatically scrape product details such as ID, title, shop name, shop link, and price from JD.com, including source analysis, project setup, code snippets, and result verification.

Python Programming Learning Circle

The project aims to automatically collect product information from JD.com using Scrapy’s crawl template. The workflow begins with analyzing the JD homepage source to determine whether product links are directly available or require packet capture.

By inspecting the homepage source, the required product links are found, eliminating the need for packet capture. Subsequent pages (category, sub‑category, and product detail pages) are similarly examined, confirming that each level’s links are present in the HTML source.
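
As a rough illustration of what "links are present in the HTML source" means, the sketch below pulls product-detail links out of a fragment of HTML using only the standard library. The sample HTML, the `ProductLinkParser` class, and the `find_product_links` helper are hypothetical names introduced here. In the actual project, Scrapy's crawl template follows links itself, so this is only a stand-in for that behaviour.

```python
import re
from html.parser import HTMLParser

# Assumed shape of a product detail URL, based on the item.jd.com/<id>.html
# pattern used later in this article.
PRODUCT_URL = re.compile(r'item\.jd\.com/\d+\.html')


class ProductLinkParser(HTMLParser):
    """Collects href values of <a> tags that look like product detail links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        # An attribute without a value is reported as None, so fall back to ''.
        href = dict(attrs).get('href') or ''
        if PRODUCT_URL.search(href):
            self.links.append(href)


def find_product_links(html):
    """Return all product-detail hrefs found in the given HTML text."""
    parser = ProductLinkParser()
    parser.feed(html)
    return parser.links


# Hypothetical sample markup, not copied from JD.com.
SAMPLE_HTML = '''
<ul>
  <li><a href="//item.jd.com/100001.html">Phone</a></li>
  <li><a href="//list.jd.com/list.html?cat=1">Category</a></li>
  <li><a href="https://item.jd.com/100002.html">Laptop</a></li>
</ul>
'''

print(find_product_links(SAMPLE_HTML))
```

If a check like this finds the links in the raw source, no packet capture is needed for that page level.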

Implementation steps include creating a Scrapy project, generating a spider, extracting product URLs, and defining extraction rules for product ID, title, shop name, shop link, and price.

scrapy startproject jingdong
cd jingdong
scrapy genspider -t crawl jd jd.com

Product ID extraction:

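import re

thisurl = response.url
# Escape the dots so they match literally rather than any character
pat = r'item\.jd\.com/(.*?)\.html'
thisid = re.compile(pat).findall(thisurl)  # findall() returns a list of matches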
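
The ID extraction can also be tried outside a spider. The following is a minimal sketch, assuming that product IDs are purely numeric. The helper name `extract_product_id` and the sample URLs are hypothetical.

```python
import re

# Assumption: product IDs in item.jd.com URLs consist of digits only.
PRODUCT_ID_PATTERN = re.compile(r'item\.jd\.com/(\d+)\.html')


def extract_product_id(url):
    """Return the product ID as a string, or None if the URL is not a product page."""
    match = PRODUCT_ID_PATTERN.search(url)
    return match.group(1) if match else None


print(extract_product_id('https://item.jd.com/100001.html'))
print(extract_product_id('https://www.jd.com/'))
```

Returning a single string or `None` avoids having to index into a list later, for example when building the price URL.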

Product title extraction:

title = response.xpath('/html/head/title/text()').extract()

Shop name extraction:

shop = response.xpath('//div[@class="name"]/a/@title').extract()

Shop link extraction:

shoplink = response.xpath('//div[@class="name"]/a/@href').extract()

Price extraction (the price is not in the page source; it comes from a separate request found through packet capture):

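import re
import urllib.request

# findall() returned a list, so use its first element rather than str() on the whole list
priceurl = 'https://p.3.cn/prices/mgets?callback=jQuery&skuIds=J_' + thisid[0]
pricedata = urllib.request.urlopen(priceurl).read().decode('utf-8', 'ignore')
pricepat = '"p":"(.*?)"'
price = re.compile(pricepat).findall(pricedata)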
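
Because the price URL passes `callback=jQuery`, the response is presumably JSONP, meaning JSON wrapped in a function call. As an alternative to matching the `"p"` field with a regular expression, the wrapper can be stripped and the payload parsed as JSON. The sketch below runs on a hypothetical sample payload. The exact response shape, including the `id` field, is an assumption and should be checked against a real capture.

```python
import json
import re


def parse_jsonp_prices(payload):
    """Strip the callback wrapper and return a {sku_id: price_string} dict."""
    # Capture everything between the first '(' and the last ')'.
    match = re.search(r'\((.*)\)', payload, re.S)
    if not match:
        return {}
    records = json.loads(match.group(1))
    return {rec['id']: rec['p'] for rec in records if 'id' in rec and 'p' in rec}


# Hypothetical payload, modelled on the "p" field used in the article.
SAMPLE_PAYLOAD = 'jQuery([{"id":"J_100001","p":"1999.00"}]);'

print(parse_jsonp_prices(SAMPLE_PAYLOAD))
```

Parsing the JSON keeps each price tied to its SKU, which matters if several `skuIds` are ever requested at once.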

Once the extraction rules are defined, the database and table are created and the jd.py spider file is written. The spider is then run with:

scrapy crawl jd --nolog

Results can be viewed directly in the command-line output and verified in the database, confirming that the desired JD.com product data was extracted.
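
The article does not say which database is used or what the table looks like. As a minimal sketch of the storage step, the example below uses Python's built-in `sqlite3` module as a stand-in. The table name `jd_products`, its columns, and the sample record are all hypothetical.

```python
import sqlite3


def create_table(conn):
    """Create the (hypothetical) product table if it does not exist yet."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS jd_products (
               id TEXT PRIMARY KEY,
               title TEXT,
               shop TEXT,
               shoplink TEXT,
               price TEXT
           )"""
    )


def save_product(conn, product):
    """Insert one product dict; a re-crawled ID replaces the older row."""
    # Named placeholders keep scraped text from being interpreted as SQL.
    conn.execute(
        "INSERT OR REPLACE INTO jd_products (id, title, shop, shoplink, price) "
        "VALUES (:id, :title, :shop, :shoplink, :price)",
        product,
    )
    conn.commit()


# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(':memory:')
create_table(conn)

sample = {
    'id': '100001',
    'title': 'Sample phone',
    'shop': 'Sample shop',
    'shoplink': '//mall.jd.com/index-1.html',
    'price': '1999.00',
}
save_product(conn, sample)
# Saving the same ID again with a new price updates the row instead of duplicating it.
save_product(conn, dict(sample, price='1899.00'))

rows = conn.execute('SELECT id, price FROM jd_products').fetchall()
print(rows)
```

Querying the table after a crawl, as in the last two lines, is one way to carry out the database verification mentioned above.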
