
How to Scrape and Analyze 1,585 Cherry Listings from Taobao with Python

This tutorial shows how to use Python to scrape 1,585 cherry product listings from Taobao, extract key fields, and clean and transform the data. It then visualizes regional sales and price distributions to highlight the most popular provinces, stores, and price ranges. Code is included for the scraping and cleaning steps.

Python Crawling & Data Mining

1. Data Acquisition

Using Python, we collected 1,585 cherry listings from Taobao, capturing the product name, price, number of buyers, shop name, and shipping address for each. The crawler's main function is shown below; the helper functions search_product and get_data are not shown here.

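from selenium import webdriver

def main():
    browser.get('https://www.taobao.com/')
    # search_product() and get_data() are defined elsewhere in the crawler (not shown).
    # search_product() appears to return the total number of result pages.
    page = search_product(key_word)
    print(page)
    get_data()  # scrape the first results page
    page_num = 1  # number of pages scraped so far
    while page_num < int(page):
        print('-' * 100)
        print("Scraping page {}".format(page_num + 1))
        # The s parameter is an item offset; it advances by 44 per results page
        browser.get('https://s.taobao.com/search?q={}&s={}'.format(key_word, page_num*44))
        browser.implicitly_wait(10)
        get_data()
        page_num += 1
    print("Scraping finished")

if __name__ == '__main__':
    key_word = "车厘子"  # search term: cherries
    # Selenium 3 style call; Selenium 4 expects the driver path via a Service object
    browser = webdriver.Chrome("./chromedriver")
    main()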
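
The pagination in the crawler relies on the s query parameter increasing in steps of 44 per page. A minimal sketch of that URL scheme, runnable without a browser, is below. The names build_search_urls and PAGE_SIZE are hypothetical and introduced only for illustration, and percent-encoding the keyword with urllib.parse.quote is an addition that the crawler above does not make.

```python
from urllib.parse import quote

PAGE_SIZE = 44  # offset step taken from the crawler above (page_num * 44)


def build_search_urls(key_word, total_pages, page_size=PAGE_SIZE):
    """Return one search URL per results page, using the `s` offset parameter."""
    base = "https://s.taobao.com/search?q={}&s={}"
    return [
        base.format(quote(key_word), page * page_size)
        for page in range(total_pages)
    ]


urls = build_search_urls("车厘子", 3)
for url in urls:
    print(url)
```

Building the list up front makes it easy to check the offsets (0, 44, 88, ...) before any page is requested.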

2. Data Processing

We read the CSV file with pandas, previewed the data, and observed missing values and data types.

import pandas as pd
import numpy as np
df = pd.read_csv('/菜J学Python/淘宝/车厘子.csv', header=None,
                 names=['商品名称','商品价格','付款人数','店铺名称','发货地址'])
# Preview sample rows
print(df.sample(5))
# Check missing values and data types
df.info()
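
The inspection step can be tried without the scraped file. The sketch below loads three made-up rows (not taken from the real dataset) with the same column names, then counts missing values and prints the inferred types. The row contents and the names sample_csv, sample_df, and missing_counts are assumptions for illustration.

```python
import io

import pandas as pd

# Made-up rows for illustration only; not taken from the scraped dataset.
sample_csv = io.StringIO(
    "样例车厘子A,199.0,1.2万+人付款,示例店铺A,浙江 杭州\n"
    "样例车厘子B,88.5,300+人付款,示例店铺B,上海\n"
    "样例车厘子C,320.0,,示例店铺C,广东 深圳\n"
)
columns = ['商品名称', '商品价格', '付款人数', '店铺名称', '发货地址']
sample_df = pd.read_csv(sample_csv, header=None, names=columns)

# Count missing values per column (the third row has no buyer count)
missing_counts = sample_df.isna().sum()
print(missing_counts)

# Show the data type pandas inferred for each column
print(sample_df.dtypes)
```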

3. Data Cleaning

We removed rows with missing values, split the shipping address into province and city, extracted the numeric buyer count and converted the 万 (ten thousand) unit into a multiplier, then sorted by price in descending order and reset the index.

import re

# Drop missing records
df.dropna(axis=0, how='any', inplace=True)

# Split address into province and city
address_parts = df["发货地址"].str.split(' ', expand=True)
df["省份"] = address_parts[0]
df["城市"] = address_parts[1]
# Addresses without a city part fall back to the province.
# Plain assignment is used because newer pandas versions warn about inplace fillna on a column.
df["城市"] = df["城市"].fillna(df["省份"])

# Extract the numeric part of the buyer count
df["数字"] = [re.findall(r'\d+\.?\d*', i)[0] for i in df["付款人数"]]
df["数字"] = df["数字"].astype('float')
# Map the unit to a multiplier: 万 means 10,000
df["单位"] = [10000 if '万' in i else 1 for i in df["付款人数"]]
# Compute actual buyer numbers
df["付款人数"] = df["数字"] * df["单位"]
# Remove intermediate columns
df.drop(["发货地址", "数字", "单位"], axis=1, inplace=True)

# Sort by price descending and reset index
df = df.sort_values(by="商品价格", ascending=False)
df = df.reset_index(drop=True)
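
The two parsing rules used in the cleaning step can also be written as small standalone functions, which makes them easy to test on individual strings. This is a sketch under assumptions: parse_buyer_count and split_address are hypothetical helper names, and the sample strings are assumed formats rather than values quoted from the dataset.

```python
import re


def parse_buyer_count(text):
    """Turn a buyer-count string into a number, applying the 万 (x10,000) unit."""
    match = re.search(r'\d+\.?\d*', text)
    if match is None:
        return None  # no digits in the string
    multiplier = 10000 if '万' in text else 1
    return float(match.group()) * multiplier


def split_address(address):
    """Split 'province city'; reuse the province when no city part is present."""
    parts = address.split(' ')
    province = parts[0]
    city = parts[1] if len(parts) > 1 else province
    return province, city


print(parse_buyer_count("1.5万+人付款"))  # 15000.0
print(parse_buyer_count("300+人付款"))    # 300.0
print(split_address("浙江 杭州"))          # ('浙江', '杭州')
print(split_address("上海"))               # ('上海', '上海')
```

Unlike the list comprehension in the cleaning code, parse_buyer_count returns None instead of raising an IndexError when a string contains no digits.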

4. Data Visualization

We visualized the cleaned data using Excel to show the provinces with the highest cherry sales, the price distribution, and the top-selling stores.

Shanghai, Zhejiang and Guangdong have the largest sales, while Tibet, Qinghai and Inner Mongolia have lower sales.
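
The article does not state how sales per province were computed. One plausible reading, sketched below, sums the cleaned 付款人数 (buyer count) column for each province. The rows are made up for illustration, and the names cleaned and province_sales are hypothetical.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the scraped dataset.
cleaned = pd.DataFrame({
    "省份": ["上海", "浙江", "上海", "广东", "浙江"],
    "付款人数": [12000.0, 3000.0, 500.0, 800.0, 150.0],
})

# Assumption: "sales" means the total number of buyers per province
province_sales = (
    cleaned.groupby("省份")["付款人数"]
    .sum()
    .sort_values(ascending=False)
)
print(province_sales)
```

The resulting Series can be exported with to_csv and charted in Excel, which matches the tool the article reports using.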

Most listings are priced between 201 and 500 CNY, and fewer than 4% are below 50 CNY.
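
A price distribution like this can be produced by bucketing prices into bands. The sketch below uses pandas.cut. The band edges are an assumption chosen to line up with the 50 and 201 to 500 CNY figures mentioned in the text, and the prices are made up.

```python
import pandas as pd

# Made-up prices in CNY for illustration only.
prices = pd.Series([35.0, 88.0, 150.0, 268.0, 399.0, 620.0])

# Assumed band edges; each band includes its upper edge, e.g. (0, 50]
bins = [0, 50, 100, 200, 500, float("inf")]
labels = ["0-50", "51-100", "101-200", "201-500", "500+"]

price_band = pd.cut(prices, bins=bins, labels=labels)

# Share of listings in each band, in band order
share = price_band.value_counts(normalize=True).sort_index()
print(share)
```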

Flagship stores such as 福瑞达 and 百果园 dominate the market.

Text analysis of product names reveals common keywords like “fresh”, “Chile”, “seasonal”, and “large”.
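
The article does not show how the keyword analysis was done. A simple dependency-free approximation counts how many product names contain each candidate keyword. Everything in the sketch below is an assumption: the product names are made up, and the Chinese terms are guesses at what the translated keywords stand for. A full analysis of Chinese text would normally use a word segmenter rather than substring matching.

```python
from collections import Counter

# Made-up product names for illustration only.
names = [
    "智利车厘子新鲜当季水果",
    "新鲜车厘子大果礼盒",
    "智利进口车厘子大果",
]

# Hypothetical keyword list with English glosses
keywords = {"新鲜": "fresh", "智利": "Chile", "当季": "seasonal", "大果": "large"}

# Count the number of product names that contain each keyword
counts = Counter()
for name in names:
    for word in keywords:
        if word in name:
            counts[word] += 1

for word, n in counts.most_common():
    print(word, keywords[word], n)
```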

Conclusion

This analysis is for learning and research purposes only; readers should draw their own conclusions.
