
Data Analysis Project: Scraping and Visualizing Complaints About a Rental Apartment Company

This article presents a complete data‑analysis workflow that scrapes complaint records from a Chinese platform, cleans and merges the data, performs time‑series analysis, generates visualizations and word‑clouds, and interprets the findings to reveal complaint trends and user demands.

Python Programming Learning Circle

The article describes a practical data‑analysis project that collects complaint data about a rental‑apartment brand from the Black Cat complaint platform, processes the data, and visualizes the results.

1. Data Scraping

The script uses requests with a fake user‑agent to fetch complaint records either by UID or by keyword, parses JSON responses, extracts fields such as date, complaint number, title, appeal, and summary, and stores them in a pandas DataFrame.

import requests
import time

import pandas as pd
import numpy as np

requests.packages.urllib3.disable_warnings()  # suppress HTTPS certificate-verification warnings
from fake_useragent import UserAgent  # generates a random user-agent header

def request_data_uid(req_s, couid, page, total_page):
    params = {
        'couid': couid,          # merchant ID
        'type': '1',
        'page_size': page * 10,  # 10 records per page
        'page': page,            # page number
    }
    print(f"Fetching page {page} of {total_page} ({total_page - page} remaining)")
    url = 'https://tousu.sina.com.cn/api/company/received_complaints'
    header = {'user-agent': UserAgent().random}
    res = req_s.get(url, headers=header, params=params, verify=False)
    info_list = res.json()['result']['data']['complaints']
    result = []
    for info in info_list:
        _data = info['main']
        timestamp = float(_data['timestamp'])
        date = time.strftime("%Y-%m-%d", time.localtime(timestamp))
        result.append([date, _data['sn'], _data['title'], _data['appeal'], _data['summary']])
    return pd.DataFrame(result, columns=["投诉日期", "投诉编号", "投诉问题", "投诉诉求", "详细说明"])

def request_data_keywords(req_s, keyword, page, total_page):
    params = {
        'keywords': keyword,     # search keyword
        'type': '1',
        'page_size': page * 10,  # 10 records per page
        'page': page,            # page number
    }
    print(f"Fetching page {page} of {total_page} ({total_page - page} remaining)")
    url = 'https://tousu.sina.com.cn/api/index/s'
    header = {'user-agent': UserAgent().random}
    res = req_s.get(url, headers=header, params=params, verify=False)
    info_list = res.json()['result']['data']['lists']
    result = []
    for info in info_list:
        _data = info['main']
        timestamp = float(_data['timestamp'])
        date = time.strftime("%Y-%m-%d", time.localtime(timestamp))
        result.append([date, _data['sn'], _data['title'], _data['appeal'], _data['summary']])
    return pd.DataFrame(result, columns=["投诉日期", "投诉编号", "投诉问题", "投诉诉求", "详细说明"])

req_s = requests.Session()
frames = []
total_page = 2507
for page in range(1, total_page + 1):
    frames.append(request_data_uid(req_s, '5350527288', page, total_page))
result = pd.concat(frames, ignore_index=True)  # DataFrame.append is deprecated; use pd.concat
result['投诉对象'] = "某壳公寓"
result.to_csv("某壳公寓投诉数据.csv", index=False)

# Keyword-based scraping for another entity
frames = []
total_page = 56
for page in range(1, total_page + 1):
    frames.append(request_data_keywords(req_s, '某梧桐', page, total_page))
result = pd.concat(frames, ignore_index=True)
result['投诉对象'] = "某梧桐"
result.to_csv("某梧桐投诉数据.csv", index=False)
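Looping over 2,507 pages against a live API will occasionally hit timeouts or transient errors, and the original script has no recovery path. A small retry wrapper is one way to harden the loop; `fetch_with_retry` is a hypothetical helper, not part of the original script, and the delay between attempts doubles as polite throttling:

```python
import time


def fetch_with_retry(fetch, retries=3, delay=1.0):
    """Call fetch(); on exception, sleep and retry up to `retries` attempts."""
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the error
            time.sleep(delay)
```

In the scraping loop this would wrap each page fetch, e.g. `fetch_with_retry(lambda: request_data_uid(req_s, '5350527288', page, total_page))`.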

2. Data Cleaning and Merging

The script removes non‑Chinese characters from complaint titles, saves cleaned files, and merges all CSVs into a single dataset.

import os
import re

import pandas as pd
import numpy as np

data_path = os.path.join('data','某梧桐投诉数据.csv')
data =pd.read_csv(data_path)
pattern=r'[^\u4e00-\u9fa5\d]'
data['投诉问题']=data['投诉问题'].apply(lambda x: re.sub(pattern,'',x))
data.to_csv(data_path,index=False,encoding="utf_8_sig")

frames = []
for wj in os.listdir('data'):
    data_path = os.path.join('data', wj)
    frames.append(pd.read_csv(data_path))
result = pd.concat(frames, ignore_index=True)  # DataFrame.append is deprecated; use pd.concat
result.to_csv("data/合并后某壳投诉数据.csv", index=False, encoding="utf_8_sig")
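Because the UID-based and keyword-based scrapes can return overlapping records, the merged file may contain duplicates. One plausible safeguard, not in the original script, is to drop duplicates on the complaint number (投诉编号), which uniquely identifies a complaint:

```python
import pandas as pd

# Toy frame standing in for the merged data; '1001' appears twice
df = pd.DataFrame({
    "投诉编号": ["1001", "1002", "1001"],
    "投诉问题": ["押金未退", "客服不理", "押金未退"],
})
# Keep the first occurrence of each complaint number
deduped = df.drop_duplicates(subset="投诉编号", keep="first")
```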

3. Data Exploration

After loading the merged data, the analysis filters records up to a specific date, counts total complaints, and examines temporal distribution, highlighting a sharp increase on 2020‑11‑06 linked to news about the company’s legal issues.

data = pd.read_csv("data/合并后某壳投诉数据.csv")
# Keep records up to 2020-11-09
data = data[data.投诉日期 <= '2020-11-09']
print(f"As of 2020-11-09, Black Cat had received a cumulative {len(data)} complaints about 某壳公寓")

_data = data.groupby('投诉日期').count().reset_index()[['投诉日期','投诉编号']]
_data.rename(columns={"投诉编号":"投诉数量"},inplace = True)

num1 = _data[_data.投诉日期<='2020-01-30'].投诉数量.sum()
data0 = pd.DataFrame([['2020-01-30之前',num1]],columns=['投诉日期','投诉数量'])

# Further time‑range aggregations omitted for brevity
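The omitted aggregations presumably build further rows like `data0`, summing daily counts over labelled date ranges into the `new_data` frame plotted below. A sketch of one way to do that, with `bucket_by_period` as a hypothetical helper (the edge dates and labels are illustrative):

```python
import pandas as pd


def bucket_by_period(daily, edges, labels):
    """Sum daily counts into labelled half-open date ranges (prev_edge, edge].

    `daily` has columns 投诉日期 (str, YYYY-MM-DD) and 投诉数量; ISO date
    strings compare correctly as plain strings, so no parsing is needed.
    """
    rows = []
    prev = ''  # sorts before every ISO date, so the first bucket is open-ended
    for edge, label in zip(edges, labels):
        mask = (daily.投诉日期 > prev) & (daily.投诉日期 <= edge)
        rows.append([label, daily.loc[mask, '投诉数量'].sum()])
        prev = edge
    return pd.DataFrame(rows, columns=['投诉日期', '投诉数量'])
```

For example, `bucket_by_period(_data, ['2020-01-30', '2020-11-09'], ['2020-01-30之前', '2020-01-31至2020-11-09'])` reproduces the `data0` row alongside a second period.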

4. Visualization

Matplotlib is configured for Chinese fonts and a bar chart of complaint counts per period is plotted.

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['font.sans-serif'] = ['SimHei']  # a font with Chinese glyphs
plt.rcParams['font.size'] = 18
plt.rcParams['figure.figsize'] = (12, 8)
plt.style.use("ggplot")
# new_data is the per-period aggregation assembled in the previous section
new_data.set_index('投诉日期').plot(kind='bar')

5. Word‑Cloud Generation

Using jieba for Chinese segmentation, three word clouds are created for complaint details, titles, and appeals, and saved as PNG images.

import jieba, re, collections
import PIL.Image as img
from wordcloud import WordCloud

# Concatenate all complaint-detail texts (column index 4)
all_word = ''.join(line[4] for line in data.values)
result = jieba.lcut(all_word)
wordcloud = WordCloud(
    width=800, height=600, background_color='white',
    font_path='C:\\Windows\\Fonts\\msyh.ttc',  # a font with Chinese glyphs
    max_font_size=500, min_font_size=20,
).generate(' '.join(result))
wordcloud.to_file('某壳公寓投诉详情.png')

# Repeat for titles (column index 2) and appeals (column index 3)
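The `collections` import above suggests a plain frequency count alongside the clouds, which makes the "frequent issues" in the findings quantifiable. A minimal sketch with a hypothetical `top_words` helper over the jieba tokens, filtering single-character tokens and an assumed stopword set:

```python
import collections


def top_words(tokens, stopwords=frozenset(), n=10):
    """Return the n most common multi-character tokens, excluding stopwords."""
    counts = collections.Counter(
        t for t in tokens if len(t) > 1 and t not in stopwords)
    return counts.most_common(n)
```

Applied to the segmentation result, e.g. `top_words(result, stopwords={'我们', '没有'}, n=20)`, this lists the terms that dominate each word cloud with their exact counts.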

6. Findings

The analysis shows that before 2020‑01‑30 complaint volume was low, surged in February 2020 likely due to pandemic‑related issues, stabilized thereafter, and experienced an outlier spike on 2020‑11‑06 linked to a court execution notice. Word‑clouds reveal frequent issues such as cash‑withdrawal problems, delayed cash‑back, unresponsive customer service, and cleaning concerns, while user demands focus on refunds and compensation.

Tags: Python, visualization, web scraping, complaints, word cloud
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
