How to Scrape, Clean, and Visualize Black Cat Rental Complaints with Python

This tutorial walks through collecting complaint data about a long-term rental brand from the Black Cat complaint platform during the pandemic, cleaning and merging the datasets with pandas, analyzing how complaint volume changed over time, and generating word clouds with jieba and WordCloud. Each step is illustrated with Python code.

MaGe Linux Operations

Aug 19, 2021

1. Data Scraping

Negative news about a long-term rental brand multiplied during the pandemic, which prompted the author to crawl the brand's complaint records from the Black Cat platform (tousu.sina.com.cn). Two API endpoints are used: one queried by merchant UID (couid) and one by keyword search. Each request carries a random user-agent from fake_useragent to reduce the chance of being blocked, TLS certificate verification is switched off (verify=False) with the resulting warnings suppressed, and each JSON response is parsed into a pandas DataFrame.

import requests, time
import pandas as pd
import urllib3
from fake_useragent import UserAgent
# verify=False is used below, so silence the InsecureRequestWarning it triggers
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

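def request_data_uid(req_s, couid, page, total_page):
    """Fetch one page of complaints received by the merchant with the given UID."""
    params = {
        'couid': couid,
        'type': '1',
        'page_size': 10,  # fixed page size; the original `page * 10` grew with every page
        'page': page,
    }
    print(f"Fetching page {page}/{total_page}, remaining {total_page-page}")
    url = 'https://tousu.sina.com.cn/api/company/received_complaints'
    header = {'user-agent': UserAgent().random}  # new random user-agent per request
    # verify=False skips TLS certificate verification
    res = req_s.get(url, headers=header, params=params, verify=False)
    res.raise_for_status()  # fail early on HTTP errors instead of on a missing JSON key
    info_list = res.json()['result']['data']['complaints']
    result = []
    for info in info_list:
        _data = info['main']
        # Unix timestamp -> YYYY-MM-DD in the local time zone of the machine
        timestamp = float(_data['timestamp'])
        date = time.strftime("%Y-%m-%d", time.localtime(timestamp))
        data = [date, _data['sn'], _data['title'], _data['appeal'], _data['summary']]
        result.append(data)
    # Columns: complaint date, complaint number, title, appeal, detailed description
    return pd.DataFrame(result, columns=["投诉日期","投诉编号","投诉问题","投诉诉求","详细说明"])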

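def request_data_keywords(req_s, keyword, page, total_page):
    """Fetch one page of complaints matching a search keyword."""
    params = {
        'keywords': keyword,
        'type': '1',
        'page_size': 10,  # fixed page size; the original `page * 10` grew with every page
        'page': page,
    }
    print(f"Fetching page {page}/{total_page}, remaining {total_page-page}")
    url = 'https://tousu.sina.com.cn/api/index/s?'
    header = {'user-agent': UserAgent().random}  # new random user-agent per request
    # verify=False skips TLS certificate verification
    res = req_s.get(url, headers=header, params=params, verify=False)
    res.raise_for_status()  # fail early on HTTP errors instead of on a missing JSON key
    # The search endpoint returns its records under 'lists' rather than 'complaints'
    info_list = res.json()['result']['data']['lists']
    result = []
    for info in info_list:
        _data = info['main']
        # Unix timestamp -> YYYY-MM-DD in the local time zone of the machine
        timestamp = float(_data['timestamp'])
        date = time.strftime("%Y-%m-%d", time.localtime(timestamp))
        data = [date, _data['sn'], _data['title'], _data['appeal'], _data['summary']]
        result.append(data)
    # Columns: complaint date, complaint number, title, appeal, detailed description
    return pd.DataFrame(result, columns=["投诉日期","投诉编号","投诉问题","投诉诉求","详细说明"])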

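req_s = requests.Session()
frames = []
total_page = 2507  # hard-coded number of pages to fetch
for page in range(1, total_page + 1):
    frames.append(request_data_uid(req_s, '5350527288', page, total_page))
# DataFrame.append was removed in pandas 2.0; concatenate once at the end instead
result = pd.concat(frames, ignore_index=True)
result['投诉对象'] = "某壳公寓"  # complaint target (the brand name is anonymized)
result.to_csv("某壳公寓投诉数据.csv", index=False)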
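
Both request functions repeat the same parsing loop, and time.localtime makes the resulting date depend on the time zone of the machine running the crawl. The following is a minimal sketch, not the author's code, of one shared parser that pins the conversion to UTC+8. The helper name parse_complaints, the BEIJING constant, and the sample records are all hypothetical; only the field names (main, timestamp, sn, title, appeal, summary) come from the code above.

```python
from datetime import datetime, timedelta, timezone

# Assumption: complaint dates are meant in China Standard Time (fixed UTC+8).
BEIJING = timezone(timedelta(hours=8))
COLUMNS = ["投诉日期", "投诉编号", "投诉问题", "投诉诉求", "详细说明"]


def parse_complaints(info_list):
    """Turn the list of complaint records from either endpoint into plain rows."""
    rows = []
    for info in info_list:
        main = info["main"]
        ts = float(main["timestamp"])
        # Convert the Unix timestamp to a calendar date in UTC+8,
        # independent of the local time zone of the machine.
        date = datetime.fromtimestamp(ts, tz=BEIJING).strftime("%Y-%m-%d")
        rows.append([date, main["sn"], main["title"], main["appeal"], main["summary"]])
    return rows


# Made-up records that only mimic the shape used in the article's code.
sample = [
    {"main": {"timestamp": "1604642400", "sn": "SN-0001",
              "title": "deposit not returned", "appeal": "refund",
              "summary": "example text"}},
    {"main": {"timestamp": "1604678400", "sn": "SN-0002",
              "title": "cash-back not paid", "appeal": "refund",
              "summary": "example text"}},
]

rows = parse_complaints(sample)
for row in rows:
    print(row[0], row[1])
```

The rows can be handed to pd.DataFrame(rows, columns=COLUMNS) to get the same layout the two request functions return.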

2. Data Cleaning and Visualization

After downloading the CSV files, the script strips everything except Chinese characters and digits from the complaint titles, merges all files, keeps only records dated on or before 2020-11-09, and aggregates the number of complaints per day. The daily counts are then grouped into a few periods and plotted as a bar chart with matplotlib to show how complaint volume changed over time.
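
To make the title-cleaning rule concrete, here is a small sketch of the same regular expression applied to a single string. The helper name clean_title and the sample title are made up for illustration; the pattern is the one used in the script.

```python
import re

# Matches every character that is neither a Chinese character
# in the range U+4E00..U+9FA5 nor a digit.
NOT_CHINESE_OR_DIGIT = re.compile(r'[^\u4e00-\u9fa5\d]')


def clean_title(title):
    """Keep only Chinese characters and digits; str() tolerates missing values."""
    return NOT_CHINESE_OR_DIGIT.sub('', str(title))


print(clean_title('【某壳公寓】退款 2000 元, please!'))  # 某壳公寓退款2000元
print(repr(clean_title(float('nan'))))                 # ''
```

Brackets, spaces, punctuation, and Latin letters are all dropped. A missing title, which pandas represents as NaN, becomes the text "nan" under str() and is therefore reduced to an empty string rather than raising a TypeError.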

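import os, re
import pandas as pd
import matplotlib.pyplot as plt
# Jupyter/IPython only; remove this magic when running as a plain script
%matplotlib inline
# Apply the style first: rcParams set before plt.style.use can be overridden by the style sheet
plt.style.use('ggplot')
plt.rcParams['font.sans-serif'] = ['SimHei']  # a font with Chinese glyphs for the labels
plt.rcParams['font.size'] = 18
plt.rcParams['figure.figsize'] = (12, 8)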

# Clean titles: keep only Chinese characters and digits
data_path = os.path.join('data', '某梧桐投诉数据.csv')
data = pd.read_csv(data_path)
pattern = r'[^\u4e00-\u9fa5\d]'
# str(x) guards against missing titles, which pandas reads as float NaN
data['投诉问题'] = data['投诉问题'].apply(lambda x: re.sub(pattern, '', str(x)))
data.to_csv(data_path, index=False, encoding='utf_8_sig')

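# Merge all CSVs (skip the merged output itself so that reruns do not duplicate rows)
merged_name = '合并后某壳投诉数据.csv'
frames = [pd.read_csv(os.path.join('data', wj))
          for wj in os.listdir('data')
          if wj.endswith('.csv') and wj != merged_name]
# DataFrame.append was removed in pandas 2.0; use pd.concat instead
result = pd.concat(frames, ignore_index=True)
result.to_csv(os.path.join('data', merged_name), index=False, encoding='utf_8_sig')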

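# Load merged data and filter dates
# (dates are YYYY-MM-DD strings, so plain string comparison orders them correctly)
data = pd.read_csv('data/合并后某壳投诉数据.csv')
data = data[data.投诉日期 <= '2020-11-09']
print(f"As of 2020-11-09, Black Cat had received {len(data)} complaints related to 某壳公寓")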

# Daily counts
_daily = data.groupby('投诉日期').count().reset_index()[['投诉日期', '投诉编号']]
_daily.rename(columns={'投诉编号': '投诉数量'}, inplace=True)

# Split periods for analysis; boundaries are chosen so that no day is skipped or counted twice
num1 = _daily[_daily.投诉日期 <= '2020-01-30']['投诉数量'].sum()
num2 = _daily[(_daily.投诉日期 > '2020-02-21') & (_daily.投诉日期 <= '2020-11-05')]['投诉数量'].sum()
print(f"Complaints on 2020-11-06 alone: {_daily[_daily.投诉日期=='2020-11-06'].iloc[0,1]}")

# 2020-11-06 is printed separately above and excluded from the chart
# (presumably because its spike would dwarf the other bars)
new_data = pd.concat([pd.DataFrame([['Up to 2020-01-30', num1]], columns=['投诉日期','投诉数量']),
                      _daily[(_daily.投诉日期 > '2020-01-30') & (_daily.投诉日期 <= '2020-02-21')],
                      pd.DataFrame([['2020-02-22 ~ 2020-11-05', num2]], columns=['投诉日期','投诉数量']),
                      _daily[(_daily.投诉日期 > '2020-11-06') & (_daily.投诉日期 <= '2020-11-09')]])
new_data.set_index('投诉日期').plot(kind='bar')

Findings: complaint volume was low before 2020-01-30, rose sharply in February as pandemic-related disputes emerged, stayed comparatively stable until early November, and then spiked on 2020-11-06 (over 24,000 complaints) after news that an affiliate of the company had been named in a large court enforcement order.
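
Period totals like these are sensitive to how the boundary days are handled. The sketch below shows one way to count complaints per period with datetime-typed comparisons and inclusive, non-overlapping ranges, so that the period totals add up to the number of records. The toy data and the helper count_between are made up for illustration and are not part of the original script.

```python
import pandas as pd

# Toy stand-in for the merged complaint file; every value here is made up.
toy = pd.DataFrame({
    "投诉日期": ["2020-01-15", "2020-02-03", "2020-02-03", "2020-06-01",
                 "2020-11-06", "2020-11-06", "2020-11-06", "2020-11-08"],
    "投诉编号": [f"SN{i}" for i in range(8)],
})
toy["投诉日期"] = pd.to_datetime(toy["投诉日期"])

# One entry per day; size() counts rows regardless of missing values elsewhere.
daily = toy.groupby("投诉日期").size()


def count_between(daily_counts, start, end):
    """Sum the daily counts for start <= date <= end (both ends inclusive)."""
    idx = daily_counts.index
    mask = (idx >= pd.Timestamp(start)) & (idx <= pd.Timestamp(end))
    return int(daily_counts[mask].sum())


# Adjacent periods share no day, so every complaint is counted exactly once.
periods = [
    ("2020-01-01", "2020-01-30"),
    ("2020-01-31", "2020-02-21"),
    ("2020-02-22", "2020-11-05"),
    ("2020-11-06", "2020-11-06"),
    ("2020-11-07", "2020-11-09"),
]
totals = [count_between(daily, start, end) for start, end in periods]
for (start, end), total in zip(periods, totals):
    print(f"{start} ~ {end}: {total}")
```

Checking that sum(totals) equals len(toy) is a cheap way to catch a skipped or double-counted boundary day.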

3. Word Cloud Generation

Using jieba for Chinese word segmentation, the script creates three word clouds: one for the detailed complaint descriptions, one for the complaint titles, and one for the complaint appeals. The resulting images highlight frequent terms such as “提现” (withdrawal), “活动返现” (promotional cash-back), “客服” (customer service), and “保洁” (cleaning).

import jieba
from wordcloud import WordCloud

# Font with Chinese glyphs; this path is Windows-specific, so point it at a local CJK font elsewhere
FONT_PATH = 'C:\\Windows\\Fonts\\msyh.ttc'

def make_wordcloud(texts, out_file):
    """Segment a column of texts with jieba and save the word cloud as an image."""
    # dropna/astype(str) avoids a TypeError when a cell is missing (NaN)
    all_word = ''.join(texts.dropna().astype(str))
    words = jieba.cut(all_word)
    wc = WordCloud(width=800, height=600, background_color='white', font_path=FONT_PATH, max_font_size=500, min_font_size=20)
    # WordCloud splits on whitespace, so join the segmented words with spaces
    wc.generate(' '.join(words)).to_file(out_file)

# Detailed descriptions, titles, and appeals (selected by column name instead of position)
make_wordcloud(data['详细说明'], '某壳公寓投诉详情.png')
make_wordcloud(data['投诉问题'], '某壳公寓投诉问题.png')
make_wordcloud(data['投诉诉求'], '某壳公寓投诉诉求.png')

The word clouds reveal that the main complaint topics are withdrawal issues, delayed cash‑back promotions, unreachable customer service, and cleaning problems, while the dominant user demand is for refunds and compensation.
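
A word cloud sizes terms by frequency, so it can help to look at the counts behind the picture. The following is a minimal sketch using only the standard library; it assumes the tokens come from jieba.cut as in the script, but the token list, the stopword set, and the helper name top_terms are made up for illustration.

```python
from collections import Counter


def top_terms(tokens, stopwords=frozenset(), min_len=2, n=10):
    """Return the n most frequent tokens, skipping short tokens and stopwords."""
    kept = (t.strip() for t in tokens)
    kept = [t for t in kept if len(t) >= min_len and t not in stopwords]
    return Counter(kept).most_common(n)


# Made-up tokens standing in for the output of jieba.cut on complaint text.
tokens = ["提现", "的", "客服", "提现", "，", "保洁", "客服", "提现", "我们"]

for term, count in top_terms(tokens, stopwords={"我们"}):
    print(term, count)
```

Dropping single-character tokens removes most punctuation and particles; a real stopword list would need to be chosen for the data at hand.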
