Scrapy Crawl Template for Automatically Extracting JD.com Product Information
This article provides a step‑by‑step guide on using Scrapy’s crawl template to automatically scrape product details such as ID, title, shop name, shop link, and price from JD.com, including source analysis, project setup, code snippets, and result verification.
The project aims to automatically collect product information from JD.com using Scrapy’s crawl template. The workflow begins with analyzing the JD homepage source to determine whether product links are directly available or require packet capture.
By inspecting the homepage source, the required product links are found, eliminating the need for packet capture. Subsequent pages (category, sub‑category, and product detail pages) are similarly examined, confirming that each level’s links are present in the HTML source.
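The source check described above can be sketched offline as follows. This is a minimal illustration using only the standard library, not the article's own code: the `LinkCollector` class, the `find_product_links` helper, and the sample markup are all hypothetical, and the sketch assumes product links may appear either with a scheme or in protocol-relative form (`//item.jd.com/...`).

```python
from html.parser import HTMLParser
import re

# Matches product detail links such as //item.jd.com/<digits>.html
PRODUCT_LINK = re.compile(r"^(?:https?:)?//item\.jd\.com/(\d+)\.html")


class LinkCollector(HTMLParser):
    """Collects every href found on <a> tags in a page's raw HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_product_links(html):
    """Return absolute product URLs that appear directly in the HTML source."""
    collector = LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        if PRODUCT_LINK.match(href):
            # Protocol-relative links (//item.jd.com/...) get an explicit scheme.
            url = "https:" + href if href.startswith("//") else href
            if url not in links:
                links.append(url)
    return links


# Hypothetical markup standing in for a saved copy of a listing page.
sample_html = """
<ul>
  <li><a href="//item.jd.com/1000001.html">Item A</a></li>
  <li><a href="https://item.jd.com/1000002.html">Item B</a></li>
  <li><a href="//list.jd.com/list.html?cat=1">Category</a></li>
</ul>
"""
print(find_product_links(sample_html))
```

If a saved page yields an empty list here while the browser shows products, the links are probably loaded by a separate request, which is the case where packet capture becomes necessary.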
Implementation steps include creating a Scrapy project, generating a spider, extracting product URLs, and defining extraction rules for product ID, title, shop name, shop link, and price.
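Because both `re.findall()` and Scrapy's `.extract()` return lists, it can help to decide up front how the five fields are collapsed into one record. The sketch below is an assumption-laden illustration rather than the article's code: `ProductRecord`, `first_or_none`, and `build_record` are hypothetical names, and a plain dataclass stands in for the `scrapy.Item` that a real project would typically declare in items.py, so that the example runs without Scrapy installed.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional


def first_or_none(values: List[str]) -> Optional[str]:
    """Return the first non-empty, stripped value from an extract()-style list."""
    for value in values:
        cleaned = value.strip()
        if cleaned:
            return cleaned
    return None


@dataclass
class ProductRecord:
    """Hypothetical container for the five fields the spider collects."""
    product_id: Optional[str]
    title: Optional[str]
    shop: Optional[str]
    shop_link: Optional[str]
    price: Optional[str]


def build_record(ids, titles, shops, shop_links, prices) -> ProductRecord:
    """Collapse the list results of each extraction rule into one record."""
    return ProductRecord(
        product_id=first_or_none(ids),
        title=first_or_none(titles),
        shop=first_or_none(shops),
        shop_link=first_or_none(shop_links),
        price=first_or_none(prices),
    )


# Made-up values shaped like the lists that findall() and extract() return.
record = build_record(
    ids=["1000001"],
    titles=["  Example product title  "],
    shops=["Example Shop"],
    shop_links=["//mall.jd.com/index-1.html"],
    prices=[],  # an empty list means the rule matched nothing
)
print(asdict(record))
```

Mapping an empty result to `None` keeps a missing field (for example, a page without a shop block) from raising an `IndexError` when the record is assembled.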
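Create the project, then generate the spider from the crawl template inside the project directory:
scrapy startproject jingdong
cd jingdong
scrapy genspider -t crawl jd jd.com
Product ID extraction: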
# requires `import re` at the top of jd.py
thisurl = response.url
# escape the dots so they match literal "." characters only
pat = r'item\.jd\.com/(.*?)\.html'
# findall returns a list of matches, not a single string
thisid = re.compile(pat).findall(thisurl)
Product title extraction:
title = response.xpath('/html/head/title/text()').extract()
Shop name extraction:
shop = response.xpath('//div[@class="name"]/a/@title').extract()
Shop link extraction:
shoplink = response.xpath('//div[@class="name"]/a/@href').extract()
Price extraction (the price request must be found through packet capture):
# requires `import urllib.request` at the top of jd.py
# thisid is the list returned by findall, so use its first element;
# str(thisid) would embed the list's brackets and quotes in the URL
priceurl = 'https://p.3.cn/prices/mgets?callback=jQuery&skuIds=J_' + thisid[0]
pricedata = urllib.request.urlopen(priceurl).read().decode('utf-8', 'ignore')
pricepat = '"p":"(.*?)"'
price = re.compile(pricepat).findall(pricedata)
After the extraction rules are defined, the database and table are created and the jd.py spider file is written. The spider is then executed:
scrapy crawl jd --nolog
Results can be viewed directly in the command-line output and verified in the database, confirming that the desired JD.com product data was extracted.
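The price parsing and database steps can be sketched together as below. Several things here are assumptions rather than details from the article: the sample payload is hand-written to match the shape implied by `callback=jQuery` and the `"p":"(.*?)"` pattern, the article does not name its database so SQLite is used as a stand-in, and the `goods` table, its columns, and the `parse_price` and `save_product` helpers are hypothetical.

```python
import json
import re
import sqlite3


def parse_price(payload):
    """Extract the "p" field from a JSONP-style price response, or None."""
    # Strip the callback wrapper, e.g. jQuery([...]); and keep the JSON inside.
    match = re.search(r"\((.*)\)", payload, re.S)
    if not match:
        return None
    try:
        entries = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    if isinstance(entries, list) and entries and isinstance(entries[0], dict):
        return entries[0].get("p")
    return None


def save_product(conn, product_id, title, shop, shop_link, price):
    """Create the table if needed and insert or update one product row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS goods ("
        "id TEXT PRIMARY KEY, title TEXT, shop TEXT, shoplink TEXT, price TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO goods VALUES (?, ?, ?, ?, ?)",
        (product_id, title, shop, shop_link, price),
    )
    conn.commit()


# Hand-written payload; the real response format is an assumption.
sample_payload = 'jQuery([{"id":"J_1000001","p":"59.00","m":"99.00"}]);'

conn = sqlite3.connect(":memory:")  # in-memory database for the demo only
save_product(conn, "1000001", "Example product title", "Example Shop",
             "//mall.jd.com/index-1.html", parse_price(sample_payload))
rows = conn.execute("SELECT id, price FROM goods").fetchall()
print(rows)
```

Parsing the payload as JSON instead of matching it with a regular expression makes a changed response format fail visibly (the helper returns `None`) rather than silently capturing the wrong text. Using the product ID as the primary key with `INSERT OR REPLACE` means a repeated crawl updates existing rows instead of duplicating them.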
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
