How to Scrape JD.com Reviews and Analyze Color & Size Trends with Python

This article demonstrates building a Python web scraper to collect JD.com product reviews, extract product IDs, store comments in MongoDB, clean color and size data, and visualize the most popular colors and sizes using matplotlib, providing a complete end‑to‑end data analysis pipeline.

MaGe Linux Operations

Jul 25, 2019

I wrote a small crawler that fetches user reviews from JD.com, then analyzed the collected data to answer two questions: which bra color is most popular among Chinese women, and which size is most common.

Finding the request pattern

By opening the browser’s developer tools → Network tab on the review page, we see a request that contains three main parameters: productId, page, and pageSize. The last two are pagination parameters, while productId uniquely identifies each product.
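
As a minimal sketch of that request pattern, the query string can be assembled from the three parameters with the standard library. The helper name `build_comment_url` is hypothetical, and the endpoint is the one used by the fetching code in this article; the real request carries further parameters that are left out here.

```python
from urllib.parse import urlencode

# Endpoint as used by the fetching code in this article
COMMENT_API = 'https://sclub.jd.com/comment/productPageComments.action'


def build_comment_url(product_id, page, page_size=10):
    """Assemble a review-page URL from productId, page and pageSize.

    Hypothetical helper for illustration; only the three parameters
    discussed above are included.
    """
    query = urlencode({'productId': product_id, 'page': page, 'pageSize': page_size})
    return COMMENT_API + '?' + query


print(build_comment_url('12345', 1))
```

Changing only `page` walks through the reviews of one product; changing `product_id` switches to another product.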

Inspecting the search result page’s source reveals that each product is rendered inside an <li> element with a data-pid attribute; this attribute’s value is the productId we need.

Extracting product IDs

import re

import requests


def find_product_id(key_word):
    """Return the product IDs found on the first three search result pages."""
    jd_url = 'https://search.jd.com/Search'
    product_ids = []
    # Crawl the first 3 pages of results; range(1, 4) yields 1, 2, 3
    for i in range(1, 4):
        param = {'keyword': key_word, 'enc': 'utf-8', 'page': i}
        response = requests.get(jd_url, params=param)
        # Each product <li> carries its ID in a data-pid attribute
        ids = re.findall('data-pid="(.*?)"', response.text, re.S)
        product_ids += ids
    return product_ids

The function above crawls the first three pages of search results for a given keyword (e.g., "胸罩") and returns a list of product IDs.
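
If the same product appears more than once across the pages, the returned list will contain duplicate IDs and those products would be crawled twice. A small sketch of de-duplicating while keeping page order follows; `extract_product_ids` and the sample markup are hypothetical and are not real JD.com output.

```python
import re


def extract_product_ids(html):
    """Return unique, non-empty data-pid values in page order."""
    ids = re.findall(r'data-pid="(.*?)"', html)
    # dict.fromkeys keeps the first occurrence of each ID and preserves order
    return [pid for pid in dict.fromkeys(ids) if pid]


# Made-up markup for illustration only
sample = (
    '<li data-pid="1001"></li>'
    '<li data-pid="1002"></li>'
    '<li data-pid="1001"></li>'
)
print(extract_product_ids(sample))  # ['1001', '1002']
```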

Fetching comments

The comment API returns JSONP: the JSON payload is wrapped in a call to the callback named in the request (here fetchJSON_comment98vv53282). Stripping that wrapper leaves plain JSON, from which we read the comments array.

import json
import threading

import requests


def get_comment_message(product_id):
    # Build the URLs for review pages 1 to 10 (10 reviews per page)
    urls = [
        'https://sclub.jd.com/comment/productPageComments.action?'
        'callback=fetchJSON_comment98vv53282&'
        'productId={}'
        '&score=0&sortType=5&'
        'page={}'
        '&pageSize=10&isShadowSku=0&rid=0&fold=1'.format(product_id, page)
        for page in range(1, 11)
    ]
    for url in urls:
        response = requests.get(url)
        html = response.text
        # Strip the JSONP wrapper: keep what lies between the first '(' and the last ')'
        payload = html[html.find('(') + 1:html.rfind(')')]
        data = json.loads(payload)
        comments = data['comments']
        # Save each page of comments on its own thread
        t = threading.Thread(target=save_mongo, args=(comments,))
        t.start()
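
Removing the wrapper by plain text replacement is fragile, because a review may itself contain the characters `);`. One possible alternative, sketched here with a hypothetical helper `parse_jsonp` and a made-up sample response, is to match the whole `callback( ... )` shape and decode only the part inside the outer parentheses.

```python
import json
import re

# Matches  name( ... )  with an optional trailing semicolon
JSONP_RE = re.compile(r'^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$', re.S)


def parse_jsonp(text):
    """Parse a JSONP response body and return the decoded payload."""
    match = JSONP_RE.match(text)
    if match is None:
        raise ValueError('response is not JSONP')
    return json.loads(match.group(1))


# Made-up response whose review text contains ");"
sample = 'fetchJSON_comment98vv53282({"comments": [{"content": "ok);"}]});'
data = parse_jsonp(sample)
print(data['comments'][0]['content'])  # ok);
```

Because the pattern does not hard-code the callback name, it keeps working if the `callback` parameter in the URL changes.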

Storing data in MongoDB

import pymongo

# Connect to the local MongoDB server
client = pymongo.MongoClient('mongodb://127.0.0.1:27017/')
# The jd database
db = client.jd
# The product collection; MongoDB creates it on first insert if it does not exist
product_db = db.product


def save_mongo(comments):
    """Save one page of comments to MongoDB."""
    for comment in comments:
        product_data = {}
        # Color
        product_data['product_color'] = flush_data(comment['productColor'])
        # Size
        product_data['product_size'] = flush_data(comment['productSize'])
        # Review text
        product_data['comment_content'] = comment['content']
        # Creation time
        product_data['create_time'] = comment['creationTime']
        # insert_one is the current PyMongo call (Collection.insert was removed in PyMongo 4)
        product_db.insert_one(product_data)
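
`save_mongo` indexes each field directly, so a comment that lacks one of them would raise `KeyError` and end that saving thread. Whether JD.com ever omits these fields is an assumption on my part; if it does, a tolerant conversion along these lines might help. The helper `to_document` and the sample comment are hypothetical.

```python
def to_document(comment, normalize=lambda value: value):
    """Build the document to store from one raw comment dict.

    Missing fields become empty strings instead of raising KeyError.
    `normalize` stands in for a cleaning function such as flush_data.
    """
    return {
        'product_color': normalize(comment.get('productColor', '')),
        'product_size': normalize(comment.get('productSize', '')),
        'comment_content': comment.get('content', ''),
        'create_time': comment.get('creationTime', ''),
    }


# Made-up comment with no productSize field
raw = {'productColor': '黑色', 'content': 'nice', 'creationTime': '2019-07-01 10:00:00'}
doc = to_document(raw)
print(doc['product_size'] == '')  # True: the missing size becomes ''
```

Keeping the conversion separate from the database call also makes it testable without a running MongoDB server.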

Simple data cleaning

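Because sellers describe colors and sizes in many different ways, a quick cleaning function maps each color description to a canonical Chinese color word. Any other value, including the size letters A to D, passes through unchanged.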

# (keyword, canonical color) pairs, checked in order; the first match wins
COLOR_KEYWORDS = [
    ('肤', '肤色'),    # skin tone
    ('黑', '黑色'),    # black
    ('紫', '紫色'),    # purple
    ('粉', '粉色'),    # pink
    ('蓝', '蓝色'),    # blue
    ('白', '白色'),    # white
    ('灰', '灰色'),    # gray
    ('槟', '香槟色'),  # champagne
    ('琥', '琥珀色'),  # amber
    ('红', '红色'),    # red
]


def flush_data(data):
    for keyword, color in COLOR_KEYWORDS:
        if keyword in data:
            return color
    # Anything else, including the size letters A-D, is returned unchanged
    return data
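
The size chart later counts documents whose `product_size` is exactly `A`, `B`, `C`, or `D`, so a size stored as a longer string would not be counted. If sizes arrive in a form such as `75B` (an assumption about the data, which the article does not show), the cup letter could be pulled out before storing. `cup_letter` and `SIZE_RE` are hypothetical names.

```python
import re

# A band number followed by a cup letter, e.g. "75B" (assumed format)
SIZE_RE = re.compile(r'\d{2,3}\s*([A-D])', re.I)


def cup_letter(size_text):
    """Return the cup letter found in size_text, or None if there is none."""
    match = SIZE_RE.search(size_text)
    return match.group(1).upper() if match else None


for text in ['75B', '80c', '均码']:
    print(text, cup_letter(text))
```

Strings without a number-plus-letter pattern return `None`, which the caller can skip or store as an "unknown" bucket.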

Multithreaded crawling

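import threading

# Lock guarding the shared list of product IDs
lock = threading.Lock()


def spider_jd(ids):
    """Worker: take product IDs off the shared list until it is empty."""
    while True:
        # Check and pop under the lock so no other thread can empty the list in between
        with lock:
            if not ids:
                return
            product_id = ids.pop(0)
        get_comment_message(product_id)


# '胸罩' is the search keyword for "bra"
product_ids = find_product_id('胸罩')
# Start four worker threads
for i in range(1, 5):
    t = threading.Thread(target=spider_jd, args=(product_ids,))
    t.start()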

The lock prevents multiple threads from consuming the same product ID.
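
The same hand-out of work can also be written with the standard library's `queue.Queue`, which does its own locking. This is a sketch of an alternative and not the article's method; `crawl_all` is a hypothetical name, and `fetch` stands in for a function such as `get_comment_message`.

```python
import queue
import threading


def crawl_all(product_ids, fetch, num_workers=4):
    """Run fetch(product_id) for every ID using num_workers threads."""
    tasks = queue.Queue()
    for product_id in product_ids:
        tasks.put(product_id)

    def worker():
        while True:
            try:
                # get_nowait is thread-safe; Empty means no work is left
                product_id = tasks.get_nowait()
            except queue.Empty:
                return
            fetch(product_id)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    # Wait for every worker to finish before returning
    for t in threads:
        t.join()


# Stand-in fetch function that only records which IDs were handled
seen = []
crawl_all(['1', '2', '3'], seen.append)
print(sorted(seen))  # ['1', '2', '3']
```

Joining the threads means the caller knows when crawling has finished, which is useful before starting the analysis step.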

Visualizing results

After the data is stored, we use matplotlib to create a pie chart for color distribution and a bar chart for size distribution.

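import matplotlib.pyplot as plt
import pymongo

client = pymongo.MongoClient('mongodb://127.0.0.1:27017/')
db = client.jd
product_db = db.product

# Use a font with Chinese glyphs so the labels render (assumes SimHei is installed)
plt.rcParams['font.sans-serif'] = ['SimHei']

color_arr = ['肤色','黑色','紫色','粉色','蓝色','白色','灰色','香槟色','红色']
color_num_arr = []
for i in color_arr:
    # count_documents is the current PyMongo call (Collection.count was removed in PyMongo 4)
    num = product_db.count_documents({'product_color': i})
    color_num_arr.append(num)

# Draw the pie chart (color mapping details omitted)
patches, l_text, p_text = plt.pie(color_num_arr, labels=color_arr, autopct='%3.1f%%', startangle=90)
plt.axis('equal')
# Title: "Underwear color share"
plt.title('内衣颜色比例图', fontproperties='SimHei')
plt.show()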

The chart shows that the "肤色" (skin tone) color is the most popular, followed by black.

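# Count reviews per size and draw a bar chart
index = ["A", "B", "C", "D"]
value = []
for i in index:
    num = product_db.count_documents({'product_size': i})
    value.append(num)
# Pass the x positions positionally: current matplotlib no longer accepts the old `left` keyword
plt.bar(index, value, color="green", width=0.5)
plt.show()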

The bar chart indicates that size B is slightly more common among the sampled reviews.
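
The counting step behind both charts can be tried without a database. The sketch below tallies a field over a list of in-memory documents and converts the counts to percentages, similar to what the pie chart's `autopct` displays. `share_by_value` is a hypothetical helper and the documents are made-up sample data, not results from the crawl.

```python
from collections import Counter


def share_by_value(documents, field):
    """Return {value: percentage} for one field across documents."""
    counts = Counter(doc[field] for doc in documents)
    total = sum(counts.values())
    return {value: round(100 * n / total, 1) for value, n in counts.items()}


# Made-up documents shaped like the ones save_mongo stores
docs = [
    {'product_color': '黑色', 'product_size': 'B'},
    {'product_color': '肤色', 'product_size': 'B'},
    {'product_color': '肤色', 'product_size': 'C'},
    {'product_color': '肤色', 'product_size': 'A'},
]
print(share_by_value(docs, 'product_color'))  # {'黑色': 25.0, '肤色': 75.0}
```

Unlike the fixed `color_arr` list, a `Counter` also picks up values that were not anticipated, which makes gaps in the cleaning step easier to spot.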

Overall, this end‑to‑end example demonstrates how to collect, store, clean, and visualize e‑commerce review data using Python.
