
How to Clean and Analyze Messy Taobao Data with Python Regex and Pandas

This article walks through cleaning a messy Taobao CSV export with Python regular expressions and pandas. It filters unwanted characters with a stop-word list, segments the text into words, and computes word-frequency statistics in two ways: a classic loop-based approach and a shorter pandas-based one, complete with code snippets.

Python Crawling & Data Mining

Aug 23, 2021

Introduction

Hello, I am a Python enthusiast. Recently a group member shared a Taobao dataset that was too disorganized to work with directly. After a pass with regex and pandas it became clean enough for further processing, such as analyzing the ingredient and shelf-life fields.

1. Raw Data Pre‑processing

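In the original file, each record is crammed into a single cell, which makes it hard to read.

Instead of using Excel's text-to-columns, we apply a Python regular-expression solution. The pattern captures every key：value pair on a line (the separator is the full-width colon), and each line becomes one row of the DataFrame: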

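import re
import pandas as pd

result = []
# Pass encoding=... to open() if the file is not in your platform's default encoding
with open(r"淘宝数据.csv") as f:
    for line in f:
        # Drop the line ending so it does not end up inside the last value
        line = line.rstrip("\r\n")
        # Each match is a (key, value) pair separated by a full-width colon
        row = dict(re.findall("([^：\t]+)：([^：\t]+)", line))
        if row:
            result.append(row)

df = pd.DataFrame(result)
# to_excel no longer accepts an encoding argument in pandas 2.x, so it is omitted
df.to_excel('new_data.xlsx')
print(df)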
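
To see what the pattern does to a single line, here is a minimal sketch run against a made-up row. The sample text, the name PAIR_PATTERN, and the helper parse_line are assumptions for illustration only; the real file's field names may differ. The value class here also excludes the newline, which is a small variation on the pattern above.

```python
import re

# Hypothetical sample row: tab-separated "key：value" pairs that use the
# full-width colon, mimicking the layout the pattern expects.
SAMPLE_LINE = "品牌：某品牌\t保质期：180天\t配料表：小麦粉、白砂糖\n"

# Key: anything except a full-width colon or tab.
# Value: the same, but also stopping at a newline.
PAIR_PATTERN = re.compile("([^：\t]+)：([^：\t\n]+)")


def parse_line(line):
    """Turn one raw line into a {field: value} dict (empty if nothing matches)."""
    return dict(PAIR_PATTERN.findall(line))


row = parse_line(SAMPLE_LINE)
print(row)
```

A line with no key：value pairs yields an empty dict, which is why the loop above skips rows that evaluate as false.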

2. Cleaning Ingredient and Shelf‑Life Columns

Special characters such as %, 、 and spaces appear in the ingredient and shelf-life fields. We segment each sentence with jieba and drop every token that appears in a stop-words list:

import jieba

# Create the stop-words list (one entry per line in the file)
def stopwordslist(filepath):
    with open(filepath, 'r', encoding='gbk') as f:
        return [line.strip() for line in f]

# Segment a sentence and filter out stop-words
def seg_sentence(sentence):
    sentence_seged = jieba.cut(sentence.strip())
    # A set makes the membership test below fast
    stopwords = set(stopwordslist('stop_word.txt'))
    outstr = ''
    for word in sentence_seged:
        # word.strip() is empty for spaces and tabs, so whitespace tokens are dropped too
        if word.strip() and word not in stopwords:
            outstr += word + ' '
    return outstr
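
The filtering step can be tried without jieba or a stop-word file. The sketch below assumes the sentence has already been segmented; the STOPWORDS entries, the token list, and the helper name filter_tokens are all hypothetical stand-ins for what stop_word.txt and jieba.cut would supply.

```python
# Hypothetical stop-word entries; the article loads them from stop_word.txt.
STOPWORDS = frozenset({"%", "、"})


def filter_tokens(tokens, stopwords=STOPWORDS):
    """Drop stop-words and whitespace-only tokens from a segmented sentence."""
    return [tok for tok in tokens if tok.strip() and tok not in stopwords]


# Tokens roughly as a segmenter might produce them (made up for this example).
tokens = ["小麦粉", "、", "白砂糖", " ", "食用盐", "%"]
cleaned = " ".join(filter_tokens(tokens))
print(cleaned)
```

Joining the surviving tokens with a space mirrors the space-separated string that seg_sentence returns.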

3. Word‑Frequency Statistics

Two methods are provided to count word frequencies.

Method 1: Classic Approach

#!/usr/bin/env python3
# -*- coding:utf-8 -*-
import sys
import jieba
import jieba.analyse
import xlwt

if __name__ == "__main__":
    wbk = xlwt.Workbook(encoding='ascii')
    sheet = wbk.add_sheet("wordCount")
    word_lst = []
    key_list = []
    for line in open('1.txt', encoding='utf-8'):
        item = line.strip('\n\r').split('\t')
        tags = jieba.analyse.extract_tags(item[0])
        for t in tags:
            word_lst.append(t)
    word_dict = {}
    with open("wordCount_all_lyrics.txt", 'w', encoding='utf-8') as wf2:
        for item in word_lst:
            if item not in word_dict:
                word_dict[item] = 1
            else:
                word_dict[item] += 1
        orderList = list(word_dict.values())
        orderList.sort(reverse=True)
        for i in range(len(orderList)):
            for key in word_dict:
                if word_dict[key] == orderList[i]:
                    wf2.write(key + ' ' + str(word_dict[key]) + '\n')
                    key_list.append(key)
                    word_dict[key] = 0
    for i in range(len(key_list)):
        sheet.write(i, 1, label=orderList[i])
        sheet.write(i, 0, label=key_list[i])
    wbk.save('wordCount_all_lyrics.xls')
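
The counting and sorting in Method 1 can also be written with the standard library's collections.Counter. This is an alternative sketch, not the script above: the word list and the helper name count_words are made up, and in practice the words would come from jieba.analyse.extract_tags.

```python
from collections import Counter


def count_words(words):
    """Return (word, count) pairs sorted from most to least frequent."""
    return Counter(words).most_common()


# Hypothetical keywords standing in for the extracted tags.
sample_words = ["白砂糖", "小麦粉", "白砂糖", "食用盐", "小麦粉", "白砂糖"]
ranked = count_words(sample_words)
for word, freq in ranked:
    print(word, freq)
```

Counter tallies the list in a single pass and most_common() sorts once, which avoids rescanning the dictionary for every count as the nested loops do.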

Method 2: Pandas Optimized

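import jieba
import pandas as pd

def get_data(df):
    # Fill missing values so every cell holds a string that jieba can segment
    df.loc[:, '食品添加剂'] = df['食品添加剂'].fillna('无')
    df.loc[:, '保质期'] = df['保质期'].fillna('无')
    df.loc[:, '配料表'] = df['配料表'].fillna('无')
    # Segment each ingredient list, then flatten to one token per row
    names = df['配料表'].apply(jieba.lcut).explode()
    # Keep tokens longer than one character and count their occurrences
    df1 = names[names.apply(len) > 1].value_counts()
    with pd.ExcelWriter("taobao.xlsx") as writer:
        df1.to_excel(writer, sheet_name='配料')
    # Read the sheet back with plain column names to check the result
    df2 = pd.read_excel('taobao.xlsx', header=None, skiprows=1, names=['column1', 'column2'])
    print(df2)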
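
The core of Method 2 is the explode-then-count chain. The sketch below shows it on a tiny made-up DataFrame and assumes the ingredient text is already space-separated, so str.split() stands in for jieba.lcut. The sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical ingredient column; None mimics a missing cell.
df = pd.DataFrame({"配料表": ["小麦粉 白砂糖 盐", "白砂糖 食用盐", None]})

# Fill gaps, split each cell into a list of tokens, then flatten the lists
# so every token sits on its own row.
tokens = df["配料表"].fillna("无").str.split().explode()

# Keep tokens longer than one character and count how often each appears.
counts = tokens[tokens.str.len() > 1].value_counts()
print(counts)
```

The length filter drops single-character tokens, which here removes both 盐 and the 无 placeholder inserted by fillna.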

Conclusion

The raw Taobao data was cleaned with regex and pandas, a stop-words list filtered out unwanted characters, and word-frequency statistics were generated with both the classic and the pandas-based method. The next step will involve visualizing the results with Pyecharts (pie charts, bar charts, tables, funnel charts, etc.).
