How to Clean and Analyze Messy Taobao Data with Python Regex and Pandas
This article walks through cleaning chaotic Taobao CSV data with Python's regular expressions and pandas. It covers parsing the raw lines into columns, filtering unwanted characters with a stop-words list, segmenting the text into words, and computing word-frequency statistics with both a classic approach and a pandas-optimized one, with code snippets for each step.
Introduction
Hello, I am a Python enthusiast. Recently a group member shared a Taobao dataset that looked chaotic. After applying regex and pandas it became clean enough for further processing, such as ingredient and shelf-life analysis.
1. Raw Data Pre‑processing
The original data is stored in a single cell, making it hard to read.
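To make that layout concrete, here is a minimal sketch of how one such line can be split into fields. It assumes each record is a run of tab-separated `key:value` pairs, which is what the pattern in this section's script implies. The sample line, its field values, and the helper name `parse_line` are invented for illustration.

```python
import re

# Same pattern as the script in this section: a key and a value,
# neither containing a colon or a tab, joined by a colon.
PATTERN = re.compile(r"([^:\t]+):([^:\t]+)")

def parse_line(line):
    """Turn one raw line into a {field: value} dict (hypothetical helper)."""
    return dict(PATTERN.findall(line.strip()))

# Made-up sample line; real records may contain different fields.
sample = "品牌:某品牌\t保质期:180天\t配料表:小麦粉、白砂糖\n"
row = parse_line(sample)
print(row)
```

Each parsed dict becomes one row, so a field that is missing from a line shows up as NaN once the rows are collected into a DataFrame.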
Instead of using Excel's text‑to‑columns, we apply a Python regular‑expression solution:
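import re
import pandas as pd

result = []
# Pass encoding=... to open() if the file is not in the platform default encoding
with open(r"淘宝数据.csv") as f:
    for line in f:
        # Each line holds tab-separated "key:value" pairs; collect them into a dict
        row = dict(re.findall(r"([^:\t]+):([^:\t]+)", line.strip()))
        if row:
            result.append(row)
df = pd.DataFrame(result)
# to_excel no longer accepts an encoding argument in pandas 2.x
df.to_excel('new_data.xlsx')
print(df)
2. Cleaning Ingredient and Shelf‑Life Columns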
Special characters such as %, 、 and spaces appear in the ingredient and shelf‑life fields. We remove them using a stop‑words list:
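The filtering step on its own can be sketched without jieba or a stop-words file. This is a minimal illustration, assuming the text has already been segmented into tokens; the stop-word set, the token list, and the helper name `filter_tokens` are invented here and are not taken from the dataset.

```python
def filter_tokens(tokens, stopwords):
    """Keep tokens that are neither stop-words nor pure whitespace."""
    return [t for t in tokens if t.strip() and t not in stopwords]

# Hypothetical stop-words and tokens, for illustration only.
stopwords = {"%", "、", ","}
tokens = ["小麦粉", "、", "白砂糖", " ", "%", "食用盐"]

cleaned = " ".join(filter_tokens(tokens, stopwords))
print(cleaned)
```

Holding the stop-words in a set keeps each membership check cheap, which matters once the filter runs over every row of a large file.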
import jieba

# Create the stop-words list
def stopwordslist(filepath):
    with open(filepath, 'r', encoding='gbk') as f:
        return [line.strip() for line in f]

# Segment a sentence and filter out stop-words
def seg_sentence(sentence):
    sentence_seged = jieba.cut(sentence.strip())
    stopwords = set(stopwordslist('stop_word.txt'))  # set for fast lookup
    outstr = ''
    for word in sentence_seged:
        if word not in stopwords and word != '\t':
            outstr += word + ' '
    return outstr
3. Word‑Frequency Statistics
Two methods are provided to count word frequencies.
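For orientation, the counting step that both methods perform can be sketched with `collections.Counter` from the standard library. This is not one of the article's two methods, only a compact reference point. The token list and the helper name `count_words` are invented, and the `min_len` filter mirrors the "longer than one character" rule used in Method 2.

```python
from collections import Counter

def count_words(tokens, min_len=2):
    """Count tokens, skipping those shorter than min_len characters."""
    return Counter(t for t in tokens if len(t) >= min_len)

# Hypothetical segmented ingredients, for illustration only.
tokens = ["白砂糖", "小麦粉", "白砂糖", "水", "食用盐", "白砂糖", "小麦粉"]

freq = count_words(tokens)
top = freq.most_common(2)  # the two most frequent words, in descending order
print(top)
```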
Method 1: Classic Approach
#!/usr/bin/env python3
# -*- coding:utf-8 -*-
import jieba
import jieba.analyse
import xlwt

if __name__ == "__main__":
    wbk = xlwt.Workbook(encoding='utf-8')
    sheet = wbk.add_sheet("wordCount")
    word_lst = []
    key_list = []
    # Extract keywords from the first tab-separated field of each line
    for line in open('1.txt', encoding='utf-8'):
        item = line.strip('\n\r').split('\t')
        tags = jieba.analyse.extract_tags(item[0])
        for t in tags:
            word_lst.append(t)
    word_dict = {}
    with open("wordCount_all_lyrics.txt", 'w', encoding='utf-8') as wf2:
        # Count how often each keyword occurs
        for item in word_lst:
            if item not in word_dict:
                word_dict[item] = 1
            else:
                word_dict[item] += 1
        # Write the words in descending order of frequency
        orderList = list(word_dict.values())
        orderList.sort(reverse=True)
        for i in range(len(orderList)):
            for key in word_dict:
                if word_dict[key] == orderList[i]:
                    wf2.write(key + ' ' + str(word_dict[key]) + '\n')
                    key_list.append(key)
                    word_dict[key] = 0  # mark as already written
    # Save the same ranking to an .xls sheet: word in column 0, count in column 1
    for i in range(len(key_list)):
        sheet.write(i, 1, label=orderList[i])
        sheet.write(i, 0, label=key_list[i])
    wbk.save('wordCount_all_lyrics.xls')
Method 2: Pandas Optimized
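import jieba
import pandas as pd

def get_data(df):
    # Fill missing values so every cell is a string
    df.loc[:, '食品添加剂'] = df['食品添加剂'].fillna('无')
    df.loc[:, '保质期'] = df['保质期'].fillna('无')
    df.loc[:, '配料表'] = df['配料表'].fillna('无')
    # Segment each ingredient list, then put one word per row
    names = df.配料表.apply(jieba.lcut).explode()
    # Keep words longer than one character and count them
    df1 = names[names.apply(len) > 1].value_counts()
    with pd.ExcelWriter("taobao.xlsx") as writer:
        df1.to_excel(writer, sheet_name='配料')
    # Read the counts back with explicit column names
    df2 = pd.read_excel('taobao.xlsx', header=None, skiprows=1, names=['column1', 'column2'])
    print(df2)
Conclusion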
The raw Taobao data was cleaned using regex and pandas, stop‑words removed unwanted characters, and word‑frequency statistics were generated via both classic and pandas‑based methods. The next step will involve visualizing the results with Pyecharts (pie charts, bar charts, tables, funnel charts, etc.).
Python Crawling & Data Mining
Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!
