How to Scrape and Analyze Tmall Lingerie Sales Data with Python
This tutorial walks you through using Python to crawl Tmall lingerie product and comment data, store it in MySQL, clean and analyze cup‑size distribution and popular colors, and finally visualize the results with pyecharts.
Overview
With a professional and rigorous mindset, we demonstrate how to use Python to fetch Tmall lingerie sales data, extract common bra cup sizes, identify the most popular colors, and analyze comment keywords, ultimately enabling you to choose the perfect bra for a partner.
Step‑by‑Step Process
Research the Tmall website – Open a product page that shows comments, press F12, go to the Network tab, refresh, and locate a request ending with list_detail_rate.htm?itemId=…. This URL returns a JSON payload containing the comment list.
Scrape Tmall comment data – Build a function getCommentDetail(itemId, currentPage) that requests the URL, strips the surrounding JSONP wrapper, converts the string to a Python dictionary, and returns it.
Store and analyze data – Use MySQL to create a table bra (fields: color, size, source, comment, comment_time). Insert scraped comments in batches, clean the data with simple UPDATE statements, and run aggregation queries to compute cup‑size ratios and color popularity.
Data visualization – Employ the pyecharts library to generate pie charts for cup‑size distribution and color frequency, rendering them as HTML files.
Key Code Snippets
# Get product comment data
import json, re
def getCommentDetail(itemId, currentPage):
url = ('https://rate.tmall.com/list_detail_rate.htm?itemId=' + str(itemId) +
'&sellerId=2451699564&order=3¤tPage=' + str(currentPage) + '&append=0callback=jsonp336')
html = common.getUrlContent(url)
html = html.replace('jsonp128(', '').replace(')', '')
html = html.replace('false', '"false"').replace('true', '"true"')
return json.loads(html)
# Get maximum comment page
def getLastPage(itemId):
return getCommentDetail(itemId, 1)['rateDetail']['paginator']['lastPage']
# Get list of product IDs from a search result
from bs4 import BeautifulSoup
def getProductIdList():
url = 'https://list.tmall.com/search_product.htm?q=内衣'
html = common.getUrlContent(url)
soup = BeautifulSoup(html, 'html.parser')
ids = []
for prod in soup.find_all('div', {'class': 'product'}):
ids.append(prod['data-id'])
return idsThe main script iterates over the first 30 product IDs, retrieves up to 20 pages of comments per product (capped at 400 comments), extracts color, size, source, comment text, and timestamp, and inserts the records into the MySQL table.
if __name__ == '__main__':
productIdList = getProductIdList()
for idx, itemId in enumerate(productIdList[:30]):
try:
maxPage = getLastPage(itemId)
for page in range(1, min(maxPage, 20) + 1):
try:
data = getCommentDetail(itemId, page)
# process and batch‑insert comments
except Exception as e:
continue
except Exception as e:
continueDatabase Setup
CREATE TABLE `bra` (
`bra_id` int(11) NOT NULL AUTO_INCREMENT COMMENT 'id',
`bra_color` varchar(25) NULL COMMENT '颜色',
`bra_size` varchar(25) NULL COMMENT '罩杯',
`resource` varchar(25) NULL COMMENT '数据来源',
`comment` varchar(500) CHARACTER SET utf8mb4 NULL COMMENT '评论',
`comment_time` datetime NULL COMMENT '评论时间',
PRIMARY KEY (`bra_id`)
) CHARACTER SET utf8;After populating the table, clean the bra_color and bra_size fields with simple UPDATE statements, then run aggregation queries to calculate the percentage and count of each cup size and each color.
SELECT 'A罩杯' AS cup, CONCAT(ROUND(COUNT(*)/(SELECT COUNT(*) FROM bra)*100,2), '%') AS ratio, COUNT(*) AS sales
FROM bra WHERE bra_size LIKE '%A'
UNION ALL
SELECT 'B罩杯', CONCAT(ROUND(COUNT(*)/(SELECT COUNT(*) FROM bra)*100,2), '%'), COUNT(*)
FROM bra WHERE bra_size LIKE '%B'
-- (continue for C, D, … H)
ORDER BY sales DESC;Visualization with pyecharts
from pyecharts import Pie
# Cup size pie chart
attr = ['A罩杯','B罩杯','C罩杯','D罩杯','E罩杯','F罩杯','G罩杯','H罩杯']
values = [/* results from SQL */]
pie = Pie('内衣罩杯', width=1300, height=620)
pie.add('', attr, values, is_label_show=True)
pie.render('size.html')
# Color pie chart
color_attr = ['肤色','灰色','黑色','蓝色','粉色','红色','紫色','绿色','白色','褐色','黄色']
color_vals = [/* results from SQL */]
pieColor = Pie('内衣颜色', width=1300, height=620)
pieColor.add('', color_attr, color_vals, is_label_show=True)
pieColor.render('color.html')Running the script produces visualizations of bra cup‑size distribution and color popularity, providing actionable insights into Chinese women’s lingerie preferences.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
