How to Scrape JD.com Reviews and Analyze Color & Size Trends with Python
This article demonstrates building a Python web scraper to collect JD.com product reviews, extract product IDs, store comments in MongoDB, clean color and size data, and visualize the most popular colors and sizes using matplotlib, providing a complete end‑to‑end data analysis pipeline.
I wrote a small crawler example that fetches user reviews from JD.com, then analyzes the collected data to answer questions such as which bra color is most popular among Chinese women and what the average size is.
Finding the request pattern
By opening the browser’s developer tools → Network tab on the review page, we see a request that contains three main parameters: productId, page, and pageSize. The last two are pagination parameters, while productId uniquely identifies each product.
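As a rough sketch of how those three parameters fit together, the snippet below assembles a review URL with the standard library. The endpoint is the one requested later in this article; the helper name `build_comment_url` and the sample product ID are made up for illustration, and a real request carries extra parameters that are left out here.

```python
from urllib.parse import urlencode

# Endpoint taken from the request used later in this article.
COMMENT_ENDPOINT = 'https://sclub.jd.com/comment/productPageComments.action'


def build_comment_url(product_id, page, page_size=10):
    """Return a review URL for one page of one product (hypothetical helper)."""
    query = urlencode({
        'productId': product_id,  # identifies the product
        'page': page,             # which page of reviews
        'pageSize': page_size,    # reviews per page
    })
    return COMMENT_ENDPOINT + '?' + query


# '1234567' is a placeholder, not a real product ID.
print(build_comment_url('1234567', 1))
```

Changing only `page` walks through the reviews of one product, while changing `productId` switches to another product.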
Inspecting the search result page’s source reveals that each product is rendered inside an <li> element with a data-pid attribute; this attribute’s value is the productId we need.
Extracting product IDs
import re

import requests


def find_product_id(key_word):
    """Return the product IDs found in the search results for key_word."""
    jd_url = 'https://search.jd.com/Search'
    product_ids = []
    # Crawl the first three pages of search results
    for i in range(1, 4):
        param = {'keyword': key_word, 'enc': 'utf-8', 'page': i}
        response = requests.get(jd_url, params=param)
        # Each product's ID sits in a data-pid attribute
        ids = re.findall(r'data-pid="(.*?)"', response.text, re.S)
        product_ids += ids
    return product_ids

The function above crawls the first three pages of search results for a given keyword (e.g., "胸罩", meaning "bra") and returns a list of product IDs.
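To see what the regular expression captures without touching the network, it can be run against a small HTML fragment. The fragment below is invented for this sketch (the class name and the IDs are placeholders), and the de-duplication step is an optional extra, in case the same ID shows up more than once.

```python
import re

# Hypothetical fragment imitating the <li data-pid="..."> structure described above.
sample_html = '''
<ul>
  <li class="item" data-pid="1000001">...</li>
  <li class="item" data-pid="1000002">...</li>
  <li class="item" data-pid="1000001">...</li>
</ul>
'''

ids = re.findall(r'data-pid="(.*?)"', sample_html, re.S)
print(ids)         # ['1000001', '1000002', '1000001']

# dict.fromkeys drops repeats while keeping first-seen order.
unique_ids = list(dict.fromkeys(ids))
print(unique_ids)  # ['1000001', '1000002']
```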
Fetching comments
The comment API returns a string that prefixes a JSON payload. By stripping the prefix we can parse the JSON and obtain the comments array.
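A minimal offline sketch of that prefix stripping, using a made-up payload in place of a real response. It slices between the first `(` and the last `)` rather than deleting substrings, so a review whose text happens to contain `);` is left intact. The helper name `strip_jsonp` is an assumption of this sketch.

```python
import json


def strip_jsonp(text):
    """Return the JSON object wrapped by a JSONP call such as name({...});"""
    start = text.index('(') + 1  # first '(' opens the callback call
    end = text.rindex(')')       # last ')' closes it
    return json.loads(text[start:end])


# Made-up payload in the shape described above; real responses carry more fields.
sample = 'fetchJSON_comment98vv53282({"comments": [{"content": "ok :);"}]});'
data = strip_jsonp(sample)
print(data['comments'][0]['content'])  # ok :);
```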
import json
import threading


def get_comment_message(product_id):
    # Build the URLs for review pages 1 to 10 of this product
    urls = [
        'https://sclub.jd.com/comment/productPageComments.action?'
        'callback=fetchJSON_comment98vv53282&'
        'productId={}'
        '&score=0&sortType=5&'
        'page={}'
        '&pageSize=10&isShadowSku=0&rid=0&fold=1'.format(product_id, page)
        for page in range(1, 11)
    ]
    for url in urls:
        response = requests.get(url)
        html = response.text
        # Keep only the JSON between the callback's outer parentheses
        html = html[html.index('(') + 1:html.rindex(')')]
        data = json.loads(html)
        comments = data['comments']
        # Save each page of comments on its own thread
        t = threading.Thread(target=save_mongo, args=(comments,))
        t.start()

Storing data in MongoDB
import pymongo

# MongoDB server
client = pymongo.MongoClient('mongodb://127.0.0.1:27017/')
# "jd" database
db = client.jd
# "product" collection; MongoDB creates it on first insert if it does not exist
product_db = db.product


# Save one page of comments to MongoDB
def save_mongo(comments):
    for comment in comments:
        product_data = {}
        # Color
        product_data['product_color'] = flush_data(comment['productColor'])
        # Size
        product_data['product_size'] = flush_data(comment['productSize'])
        # Review text
        product_data['comment_content'] = comment['content']
        # Creation time
        product_data['create_time'] = comment['creationTime']
        # insert_one stores a single document
        product_db.insert_one(product_data)

Simple data cleaning
Because color and size descriptions vary, a quick cleaning function normalizes common Chinese color words.
# (substring to look for, normalized color name), checked in this order
COLOR_RULES = [
    ('肤', '肤色'),    # skin tone
    ('黑', '黑色'),    # black
    ('紫', '紫色'),    # purple
    ('粉', '粉色'),    # pink
    ('蓝', '蓝色'),    # blue
    ('白', '白色'),    # white
    ('灰', '灰色'),    # gray
    ('槟', '香槟色'),  # champagne
    ('琥', '琥珀色'),  # amber
    ('红', '红色'),    # red
]


def flush_data(data):
    for keyword, color in COLOR_RULES:
        if keyword in data:
            return color
    # Anything else, including single-letter sizes (A, B, C, D), passes through unchanged
    return data

Multithreaded crawling
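import threading

# Lock guarding the shared list of product IDs
lock = threading.Lock()


# Worker: take one product ID at a time and fetch its comments
def spider_jd(ids):
    while True:
        # Check and pop under the lock so two threads never take the same ID
        with lock:
            if not ids:
                return
            product_id = ids.pop(0)
        get_comment_message(product_id)


product_ids = find_product_id('胸罩')
# Start four worker threads
for i in range(1, 5):
    t = threading.Thread(target=spider_jd, args=(product_ids,))
    t.start()

The lock prevents multiple threads from consuming the same product ID. The emptiness check happens under the same lock, so a thread cannot find the list drained between the check and the pop.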
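The same hand-out of work can also be done with the standard library's `queue.Queue`, which does its own locking. This is an alternative sketch, not the crawler's own code: `run_workers` and the `handle` callback are hypothetical names, and `seen.append` stands in for the real comment fetcher so the example runs without network access.

```python
import queue
import threading


def run_workers(items, handle, num_workers=4):
    """Process every item exactly once using num_workers threads."""
    tasks = queue.Queue()
    for item in items:
        tasks.put(item)

    def worker():
        while True:
            try:
                item = tasks.get_nowait()  # atomic: no two threads get the same item
            except queue.Empty:
                return                     # nothing left, let the thread finish
            handle(item)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                           # wait until all work is done


# Stand-in for the comment fetcher: just record which IDs were handled.
seen = []
run_workers(['1000001', '1000002', '1000003'], seen.append)
print(sorted(seen))
```

Joining the threads at the end also gives a clear point at which all comments have been requested, which is useful before starting the analysis.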
Visualizing results
After the data is stored, we use matplotlib to create a pie chart for color distribution and a bar chart for size distribution.
import matplotlib.pyplot as plt
import pymongo

# Use a font with Chinese glyphs for the labels (assumes SimHei is installed)
plt.rcParams['font.sans-serif'] = ['SimHei']

client = pymongo.MongoClient('mongodb://127.0.0.1:27017/')
db = client.jd
product_db = db.product

color_arr = ['肤色', '黑色', '紫色', '粉色', '蓝色', '白色', '灰色', '香槟色', '红色']
color_num_arr = []
for i in color_arr:
    # Number of stored reviews with this normalized color
    num = product_db.count_documents({'product_color': i})
    color_num_arr.append(num)

# Draw the pie chart (color-mapping details omitted)
patches, l_text, p_text = plt.pie(color_num_arr, labels=color_arr, autopct='%3.1f%%', startangle=90)
plt.axis('equal')
# Title: "Underwear color share"
plt.title('内衣颜色比例图', fontproperties='SimHei')
plt.show()

The chart shows that "肤色" (skin tone) is the most popular color, followed by black.
# Count reviews per size and draw a bar chart
index = ["A", "B", "C", "D"]
value = []
for i in index:
    num = product_db.count_documents({'product_size': i})
    value.append(num)
# The categories go in as the first positional argument (x)
plt.bar(index, value, color="green", width=0.5)
plt.show()

The bar chart indicates that size B is slightly more common among the sampled reviews.
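Raw counts are easier to compare once they are turned into percentages. The sketch below does this with `collections.Counter` on a handful of made-up records shaped like the stored documents; the records and the resulting shares are illustrative only and are not the article's results.

```python
from collections import Counter

# Made-up records in the shape of the stored documents; not real results.
records = [
    {'product_size': 'B'}, {'product_size': 'A'},
    {'product_size': 'B'}, {'product_size': 'C'},
]

counts = Counter(r['product_size'] for r in records)
total = sum(counts.values())
# Share of each size as a percentage of all records
shares = {size: round(100 * n / total, 1) for size, n in counts.items()}
print(shares)  # {'B': 50.0, 'A': 25.0, 'C': 25.0}
```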
Overall, this end‑to‑end example demonstrates how to collect, store, clean, and visualize e‑commerce review data using Python.
MaGe Linux Operations
