Backend Development 16 min read

How to Scrape and Analyze Tmall Lingerie Sales Data with Python

This tutorial walks you through using Python to crawl Tmall lingerie product and comment data, store it in MySQL, clean and analyze cup‑size distribution and popular colors, and finally visualize the results with pyecharts.

Python Programming Learning Circle

Dec 17, 2019

How to Scrape and Analyze Tmall Lingerie Sales Data with Python

Overview

With a professional and rigorous mindset, we demonstrate how to use Python to fetch Tmall lingerie sales data, extract common bra cup sizes, identify the most popular colors, and analyze comment keywords, ultimately enabling you to choose the perfect bra for a partner.

Step‑by‑Step Process

Research the Tmall website – Open a product page that shows comments, press F12, go to the Network tab, refresh, and locate a request ending with list_detail_rate.htm?itemId=…. This URL returns a JSON payload containing the comment list.

Scrape Tmall comment data – Build a function getCommentDetail(itemId, currentPage) that requests the URL, strips the surrounding JSONP wrapper, converts the string to a Python dictionary, and returns it.

Store and analyze data – Use MySQL to create a table bra (fields: color, size, source, comment, comment_time). Insert scraped comments in batches, clean the data with simple UPDATE statements, and run aggregation queries to compute cup‑size ratios and color popularity.

Data visualization – Employ the pyecharts library to generate pie charts for cup‑size distribution and color frequency, rendering them as HTML files.

Key Code Snippets

# Get product comment data
import json, re
def getCommentDetail(itemId, currentPage):
    url = ('https://rate.tmall.com/list_detail_rate.htm?itemId=' + str(itemId) +
           '&sellerId=2451699564&order=3¤tPage=' + str(currentPage) + '&append=0callback=jsonp336')
    html = common.getUrlContent(url)
    html = html.replace('jsonp128(', '').replace(')', '')
    html = html.replace('false', '"false"').replace('true', '"true"')
    return json.loads(html)

# Get maximum comment page
def getLastPage(itemId):
    return getCommentDetail(itemId, 1)['rateDetail']['paginator']['lastPage']

# Get list of product IDs from a search result
from bs4 import BeautifulSoup
def getProductIdList():
    url = 'https://list.tmall.com/search_product.htm?q=内衣'
    html = common.getUrlContent(url)
    soup = BeautifulSoup(html, 'html.parser')
    ids = []
    for prod in soup.find_all('div', {'class': 'product'}):
        ids.append(prod['data-id'])
    return ids

The main script iterates over the first 30 product IDs, retrieves up to 20 pages of comments per product (capped at 400 comments), extracts color, size, source, comment text, and timestamp, and inserts the records into the MySQL table.

if __name__ == '__main__':
    productIdList = getProductIdList()
    for idx, itemId in enumerate(productIdList[:30]):
        try:
            maxPage = getLastPage(itemId)
            for page in range(1, min(maxPage, 20) + 1):
                try:
                    data = getCommentDetail(itemId, page)
                    # process and batch‑insert comments
                except Exception as e:
                    continue
        except Exception as e:
            continue

Database Setup

CREATE TABLE `bra` (
  `bra_id` int(11) NOT NULL AUTO_INCREMENT COMMENT 'id',
  `bra_color` varchar(25) NULL COMMENT '颜色',
  `bra_size` varchar(25) NULL COMMENT '罩杯',
  `resource` varchar(25) NULL COMMENT '数据来源',
  `comment` varchar(500) CHARACTER SET utf8mb4 NULL COMMENT '评论',
  `comment_time` datetime NULL COMMENT '评论时间',
  PRIMARY KEY (`bra_id`)
) CHARACTER SET utf8;

After populating the table, clean the bra_color and bra_size fields with simple UPDATE statements, then run aggregation queries to calculate the percentage and count of each cup size and each color.

SELECT 'A罩杯' AS cup, CONCAT(ROUND(COUNT(*)/(SELECT COUNT(*) FROM bra)*100,2), '%') AS ratio, COUNT(*) AS sales
FROM bra WHERE bra_size LIKE '%A'
UNION ALL
SELECT 'B罩杯', CONCAT(ROUND(COUNT(*)/(SELECT COUNT(*) FROM bra)*100,2), '%'), COUNT(*)
FROM bra WHERE bra_size LIKE '%B'
-- (continue for C, D, … H)
ORDER BY sales DESC;

Visualization with pyecharts

from pyecharts import Pie
# Cup size pie chart
attr = ['A罩杯','B罩杯','C罩杯','D罩杯','E罩杯','F罩杯','G罩杯','H罩杯']
values = [/* results from SQL */]
pie = Pie('内衣罩杯', width=1300, height=620)
pie.add('', attr, values, is_label_show=True)
pie.render('size.html')

# Color pie chart
color_attr = ['肤色','灰色','黑色','蓝色','粉色','红色','紫色','绿色','白色','褐色','黄色']
color_vals = [/* results from SQL */]
pieColor = Pie('内衣颜色', width=1300, height=620)
pieColor.add('', color_attr, color_vals, is_label_show=True)
pieColor.render('color.html')

Running the script produces visualizations of bra cup‑size distribution and color popularity, providing actionable insights into Chinese women’s lingerie preferences.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Python data analysis mysql visualization

Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.