What Wuhan’s Second‑Hand Housing Data Reveals: A Python‑Powered Deep Dive

This article walks through extracting, cleaning, and visualizing Wuhan’s second‑hand housing dataset using Python libraries such as pandas, matplotlib, seaborn, and pyecharts, revealing insights on price distribution, regional trends, house types, decoration levels, and other factors that influence market dynamics.

Python Crawling & Data Mining

Jan 6, 2022

Introduction

The dataset used in this article was previously scraped from Lianjia and contains second‑hand housing information for Wuhan. The following analysis explores the hidden patterns within the data.

Python libraries used

pandas: reads CSV files and performs data manipulation.

matplotlib: a plotting library based on NumPy, used for basic charts.

seaborn: built on matplotlib, provides higher-level statistical visualizations.

pyecharts: generates ECharts visualizations, suitable for interactive maps.

jieba: Chinese word segmentation library.

collections: provides Counter, used to count occurrences.

1. Data Reading

First read the house_info.csv file and inspect its structure.

import pandas as pd

df = pd.read_csv('house_info.csv')
df.info()

The dataset has 27 columns. The house_label column contains many missing values, and floor and house_area are stored as object (string) columns, so they need to be converted to numeric types.
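A quick way to confirm both observations is to count missing values per column and list the non-numeric columns. The following is a minimal sketch on a toy frame: the column names follow the article, but the values are invented for illustration and are not taken from house_info.csv.

```python
import pandas as pd

# Toy stand-in for house_info.csv (values are made up for illustration)
toy = pd.DataFrame({
    "house_label": ["VR看装修,近地铁", None, None],
    "floor": ["共18层", "共6层", "共33层"],
    "house_area": ["85.99㎡", "102.5㎡", "60㎡"],
})

# Missing values per column, largest first
missing = toy.isna().sum().sort_values(ascending=False)
print(missing)

# Columns that are not numeric yet and therefore need conversion
non_numeric = [col for col in toy.columns
               if not pd.api.types.is_numeric_dtype(toy[col])]
print(non_numeric)
```

On the real data, the same two calls applied to df show which columns dropna() will affect most and which ones still need type conversion.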

2. Data Preprocessing

2.1 Missing value handling

Rows containing missing values are dropped, leaving 5,108 rows.

df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)

2.2 Column processing

Because the districts “东湖高新” and “沌口开发区” lack detailed latitude/longitude data, their listings are reassigned to “洪山” and “汉南” respectively. Additional processing includes extracting numeric floor values, stripping the “m²” unit from house_area, and appending “区” to district names.

Extract numeric part of floor.

Convert house_area from strings like “85.99m²” to float 85.99.

Map “东湖高新” → “洪山”, “沌口开发区” → “汉南”.

Add “区” suffix to district names.

# Extract numeric floor
df['floor'] = df['floor'].str.extract(r'(\d+)', expand=False).astype('int')
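# Keep only the numeric part of house_area, whatever the unit suffix looks like
# (slicing off one character would leave a stray "m" if the unit is the two-character "m²")
df['house_area'] = df['house_area'].str.extract(r'(\d+(?:\.\d+)?)', expand=False).astype('float')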
# Reassign districts
df.loc[df['region'] == '东湖高新', 'region'] = '洪山'
df.loc[df['region'] == '沌口开发区', 'region'] = '汉南'
# Append suffix
df['region'] = df['region'] + '区'
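The four steps above can be tried on a small frame before running them on the full dataset. This is a sketch under stated assumptions: the raw string formats ("共18层", "85.99㎡") are guesses at what the scraped values look like, and Series.replace with a mapping dict is used here as a compact alternative to the two df.loc assignments.

```python
import pandas as pd

# Toy rows; the raw formats are assumptions, not copied from the dataset
toy = pd.DataFrame({
    "floor": ["共18层", "共6层"],
    "house_area": ["85.99㎡", "102.5m²"],
    "region": ["东湖高新", "江岸"],
})

# First run of digits becomes the floor number
toy["floor"] = toy["floor"].str.extract(r"(\d+)", expand=False).astype("int")

# Leading decimal number becomes the area, regardless of the unit suffix
toy["house_area"] = (
    toy["house_area"].str.extract(r"(\d+(?:\.\d+)?)", expand=False).astype("float")
)

# Reassign the two development zones, then append the district suffix
toy["region"] = toy["region"].replace({"东湖高新": "洪山", "沌口开发区": "汉南"}) + "区"

print(toy)
```

Districts that are not in the mapping pass through replace unchanged, so only the suffix is added to them.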

The describe() function shows statistical summaries of the numeric columns. By default only numeric columns are included; passing include='all' extends the summary to every column.

df.describe()
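The difference between the two forms of describe() is easiest to see side by side. A minimal sketch on invented values:

```python
import pandas as pd

# Toy data (values are illustrative only)
toy = pd.DataFrame({
    "region": ["洪山区", "洪山区", "江岸区"],
    "unit_price": [15000.0, 17000.0, 22000.0],
})

# Default: numeric columns only (count, mean, std, min, quartiles, max)
numeric_summary = toy.describe()

# include='all': text columns get count, unique, top and freq as well
full_summary = toy.describe(include="all")

print(numeric_summary)
print(full_summary)
```

In the full summary, statistics that do not apply to a column (for example the mean of region) appear as NaN.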

3. Bar chart of house counts per district

Count the number of listings per district and plot a bar chart.

import pyecharts.options as opts
from pyecharts.charts import Bar
from pyecharts.globals import ThemeType

region_list = df['region'].value_counts().index.tolist()
house_count_list = df['region'].value_counts().values.tolist()

c = Bar(init_opts=opts.InitOpts(theme=ThemeType.CHALK))
c.add_xaxis(region_list)
c.add_yaxis("武汉市", house_count_list)
c.set_global_opts(
    title_opts=opts.TitleOpts(title="武汉各区二手房数量柱状图"),
    xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(interval=0))
)
c.render_notebook()

4. 2D map of median unit price per district

Calculate the median unit price for each district and render a 2D map using local Wuhan GeoJSON data.

region_list = df['region'].value_counts().index.tolist()
median_unit_price = []
for region in region_list:
    median_unit_price.append(df.loc[df['region'] == region, 'unit_price'].median())

from pyecharts.charts import Map
import json

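with open('武汉市.json', encoding='utf-8') as f:
    json_data = json.load(f)
data_pair = list(zip(region_list, median_unit_price))

c = Map(init_opts=opts.InitOpts(width='1500px', height='700px', bg_color='#404a58'))
# Serialize with json.dumps so the GeoJSON is embedded as valid JavaScript
# (formatting the Python dict directly would emit Python literals such as None)
c.add_js_funcs("echarts.registerMap('武汉市',{});".format(json.dumps(json_data, ensure_ascii=False)))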
c.add(series_name="武汉市", data_pair=data_pair, maptype="武汉市", label_opts=opts.LabelOpts(color='#fff'))
c.set_global_opts(
    legend_opts=opts.LegendOpts(textstyle_opts=opts.TextStyleOpts(color='#fff')),
    title_opts=opts.TitleOpts(title="武汉", title_textstyle_opts=opts.TextStyleOpts(color='#fff')),
    visualmap_opts=opts.VisualMapOpts(split_number=6, max_=30000, range_text=['高','低'], textstyle_opts=opts.TextStyleOpts(color='#fff'))
)
c.render_notebook()
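The per-district median computed with the loop above can also be expressed with groupby, which scans the frame once instead of filtering it once per district. A sketch on toy values, keeping the same district order as value_counts():

```python
import pandas as pd

# Toy listings (prices are invented)
toy = pd.DataFrame({
    "region": ["洪山区", "洪山区", "洪山区", "江岸区", "江岸区"],
    "unit_price": [15000, 17000, 16000, 22000, 20000],
})

# One pass over the data: median unit price per district
median_by_region = toy.groupby("region")["unit_price"].median()

# Same ordering as the bar chart: districts sorted by listing count
region_order = toy["region"].value_counts().index
data_pair = [(region, float(median_by_region[region])) for region in region_order]

print(data_pair)
```

The resulting data_pair has the same shape as the one passed to Map.add() above.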

5. 3D map of unit price distribution

The same data is used to create a 3D map (code omitted for brevity).

6. Box plot of unit price per district

Collect unit price lists for each district and draw a box plot.

unit_price_list = []
for region in region_list:
    unit_price_list.append(df.loc[df['region'] == region, 'unit_price'].values.tolist())

from pyecharts.charts import Boxplot

c = Boxplot(init_opts=opts.InitOpts(theme=ThemeType.CHALK))
c.add_xaxis(region_list)
c.add_yaxis("武汉市", c.prepare_data(unit_price_list))
c.set_global_opts(
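    # The plotted values are unit prices, so the title says 单价 (unit price) rather than 总价 (total price)
    title_opts=opts.TitleOpts(title="武汉各区二手房单价箱型图"),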
    xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(interval=0))
)
c.render_notebook()

The box plot shows right-skewed distributions in districts such as 洪山区, 江岸区, and 武昌区: a long upper tail of listings is priced well above the district median, likely because of location or decoration.
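Right skew can also be checked numerically: the mean sits above the median, and the gap between the maximum and the upper quartile is much larger than the gap between the lower quartile and the minimum. The sketch below uses invented prices and np.percentile with its default linear interpolation, which may differ slightly from the quartiles that pyecharts' prepare_data computes.

```python
import numpy as np

def five_number_summary(values):
    """Return [min, Q1, median, Q3, max] using linear interpolation."""
    return [float(q) for q in np.percentile(values, [0, 25, 50, 75, 100])]

# Toy unit prices for one district, with a single expensive listing
prices = [12000, 13000, 14000, 15000, 16000, 17000, 40000]

summary = five_number_summary(prices)
mean_price = float(np.mean(prices))

lower_tail = summary[1] - summary[0]   # Q1 - min
upper_tail = summary[4] - summary[3]   # max - Q3

print(summary, mean_price, lower_tail, upper_tail)
```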

7. Relationship between house area and price

Seaborn is used to plot the distribution of house area and a regression line showing the correlation between area and total price.

import matplotlib.pyplot as plt
import seaborn as sns

f, [ax1, ax2] = plt.subplots(1, 2, figsize=(16, 6))

# Area distribution
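# histplot replaces the deprecated distplot; fill replaces the deprecated shade argument
sns.histplot(df['house_area'], stat='density', ax=ax1, color='r')
sns.kdeplot(df['house_area'], fill=True, ax=ax1)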
ax1.set_xlabel('面积')

# Area vs total price
sns.regplot(x='house_area', y='total_price', data=df, ax=ax2)
ax2.set_xlabel('面积')
ax2.set_ylabel('总价')

plt.show()

House areas mainly range from 60 m² to 130 m². One outlier stands out: a listing of about 400 m² with a total price of 20 million RMB.
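The visual impression from the regression plot can be backed with two numbers: the Pearson correlation between area and total price, and an outlier fence based on the interquartile range. This is a sketch on invented values. The 1.5 × IQR rule is a common convention assumed here, not something the article applies, and total_price is assumed to be in units of 10,000 RMB.

```python
import pandas as pd

# Toy listings; total_price assumed to be in 万 (10,000 RMB)
toy = pd.DataFrame({
    "house_area": [60, 80, 90, 100, 120, 130, 400],
    "total_price": [90, 120, 140, 150, 190, 200, 2000],
})

# Pearson correlation between area and total price
r = toy["house_area"].corr(toy["total_price"])

# Flag unusually large areas with the 1.5 * IQR rule
q1, q3 = toy["house_area"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = toy[toy["house_area"] > upper_fence]

print(round(r, 3))
print(outliers)
```

A single extreme listing can dominate the correlation, so it is worth recomputing r with the flagged rows excluded before reading much into it.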

8. 3D bar chart of floor vs price per district

A 3D bar chart visualizes how floor level and district affect unit price. Higher floors in 武昌区 and 江汉区 tend to command higher prices.
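The article does not show the code for this chart. One way to prepare its input is to bin floors into bands and take the median unit price per district and band, which yields the triples a 3D bar chart typically takes. The band edges, labels, and prices below are hypothetical choices for illustration, not the author's.

```python
import pandas as pd

# Toy listings (values are invented)
toy = pd.DataFrame({
    "region": ["武昌区", "武昌区", "武昌区", "江汉区", "江汉区"],
    "floor": [6, 18, 33, 6, 33],
    "unit_price": [18000, 20000, 24000, 17000, 21000],
})

# Hypothetical floor bands: (0, 10], (10, 20], (20, 40]
toy["floor_band"] = pd.cut(
    toy["floor"], bins=[0, 10, 20, 40], labels=["1-10", "11-20", "21-40"]
)

# Median unit price for each (district, floor band) pair that actually occurs
medians = toy.groupby(["region", "floor_band"], observed=True)["unit_price"].median()

# [x, y, z] triples: floor band, district, median unit price
triples = [[band, region, float(price)] for (region, band), price in medians.items()]

print(triples)
```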

9. Horizontal bar chart of house types

Count different house layouts and plot a horizontal bar chart.

series = df['house_type'].value_counts()
series.sort_index(ascending=False, inplace=True)
house_type_list = series.index.tolist()
count_list = series.values.tolist()

c = Bar(init_opts=opts.InitOpts(theme=ThemeType.CHALK))
c.add_xaxis(house_type_list)
c.add_yaxis("武汉市", count_list)
c.reversal_axis()
c.set_series_opts(label_opts=opts.LabelOpts(position="right"))
c.set_global_opts(
    title_opts=opts.TitleOpts(title="武汉二手房各户型横向条形图"),
    datazoom_opts=[opts.DataZoomOpts(yaxis_index=0, type_="slider", orient="vertical")]
)
c.render_notebook()

The most common layout is “两室两厅一厨一卫”.

10. Pie chart of decoration level

Count decoration categories and visualize them with a pie chart.

decoration_list = df['decoration'].value_counts().index.tolist()
count_list = df['decoration'].value_counts().values.tolist()

from pyecharts.charts import Pie

c = Pie(init_opts=opts.InitOpts(theme=ThemeType.CHALK))
c.add(
    series_name="房屋装修",
    data_pair=[list(z) for z in zip(decoration_list, count_list)],
    rosetype="radius",
    radius="55%",
    center=["50%", "50%"],
    label_opts=opts.LabelOpts(is_show=False, position="center")
)
c.set_global_opts(
    title_opts=opts.TitleOpts(title="武汉二手房房屋装修饼状图", pos_left="center", pos_top="20", title_textstyle_opts=opts.TextStyleOpts(color="#fff")),
    legend_opts=opts.LegendOpts(is_show=False)
)
c.set_series_opts(
    tooltip_opts=opts.TooltipOpts(trigger="item", formatter="{a} <br/>{b}: {c} ({d}%)"),
    label_opts=opts.LabelOpts(color="rgba(255, 255, 255, 255)")
)
c.render_notebook()

Most listings are “精装” (fully renovated), about 25% are “简装” (basic renovation), and the remainder are raw shells or other types.
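Because the pie chart hides its labels, the shares are easier to read from the counts directly. A minimal sketch with invented category counts, using value_counts(normalize=True):

```python
import pandas as pd

# Toy decoration column (counts are invented, not the dataset's)
decoration = pd.Series(["精装"] * 6 + ["简装"] * 3 + ["毛坯"] * 2 + ["其他"] * 1)

# Share of each category in percent, largest first
share = decoration.value_counts(normalize=True).mul(100).round(1)

print(share)
```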

11. Elevator presence vs price

Analyze the proportion of listings with elevators and compare unit prices. Districts generally show higher prices for listings with elevators, especially 武昌区.
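The article gives no code for this step. A comparison of this kind might look like the sketch below, which assumes a hypothetical elevator column holding “有” (yes) or “无” (no); the real column name and values in house_info.csv may differ, and the prices are invented.

```python
import pandas as pd

# Toy listings; the 'elevator' column name and its values are assumptions
toy = pd.DataFrame({
    "region": ["武昌区"] * 4 + ["江岸区"] * 4,
    "elevator": ["有", "有", "无", "无"] * 2,
    "unit_price": [26000, 24000, 18000, 16000, 20000, 22000, 19000, 17000],
})

# Proportion of listings with and without an elevator
elevator_share = toy["elevator"].value_counts(normalize=True)

# Median unit price per district, split by elevator presence
median_price = (
    toy.groupby(["region", "elevator"])["unit_price"].median().unstack("elevator")
)

# Price difference between listings with and without an elevator
median_price["premium"] = median_price["有"] - median_price["无"]

print(elevator_share)
print(median_price)
```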

12. Funnel chart of popular tags

Count tags from listings with more than three followers and display a funnel chart.

from collections import Counter

detail_df = df.loc[df['follower_numbers'] > 3]
label_list = []
for house_label in detail_df['house_label']:
    label_list += house_label.split(',')

label_and_count = Counter(label_list).most_common()

from pyecharts.charts import Funnel

c = Funnel(init_opts=opts.InitOpts(theme=ThemeType.CHALK))
c.add("商品", [list(z) for z in label_and_count])
c.set_global_opts(title_opts=opts.TitleOpts(title="武汉热门二手房标签漏斗图"))
c.render_notebook()

The most frequent tag is “VR看装修”.
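The counting step can be checked in isolation on a few comma-separated label strings. In this sketch the tag strings are invented apart from “VR看装修”, and itertools.chain is used as an equivalent of the += loop above.

```python
from collections import Counter
from itertools import chain

# Toy house_label values, comma-separated as in the article's code
house_labels = [
    "VR看装修,近地铁",
    "VR看装修,房本满两年",
    "VR看装修,近地铁,随时看房",
]

# Split every string and count all tags in a single pass
counts = Counter(chain.from_iterable(label.split(",") for label in house_labels))

top_two = counts.most_common(2)
print(top_two)
```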

13. Keyword extraction from popular titles

Load stopwords, use jieba to segment titles, remove stopwords, count frequencies, and generate a word cloud.

def load_stopwords(read_path):
    """Read each line of a file into a list."""
    result = []
    with open(read_path, "r", encoding='utf-8') as f:
        for line in f.readlines():
            line = line.strip('\n')
            result.append(line)
    return result

stopwords = load_stopwords('wordcloud_stopwords.txt')

import jieba
jieba.load_userdict("自定义词典.txt")

token_list = []
for title in detail_df['title']:
    tokens = jieba.lcut(title, cut_all=False)
    token_list += [token for token in tokens if token not in stopwords]

from collections import Counter
token_count_list = Counter(token_list).most_common(100)

from pyecharts.charts import WordCloud

c = WordCloud()
c.add(series_name="热词", data_pair=[(token, str(count)) for token, count in token_count_list], word_size_range=[20, 200])
c.set_global_opts(
    title_opts=opts.TitleOpts(title="武汉热门二手房标题关键词", title_textstyle_opts=opts.TextStyleOpts(font_size=23)),
    tooltip_opts=opts.TooltipOpts(is_show=True)
)
c.render_notebook()

Frequent words include “电梯”, “楼层”, “采光”, “精装修”, “户型”, “满二”, and “交通”.
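The filtering and counting logic can be tested without jieba by starting from titles that are already segmented. In this sketch the token lists stand in for the output of jieba.lcut and are invented; the stopwords are kept in a set, an optional change from the list used above that makes each membership test constant-time.

```python
from collections import Counter

# Stand-ins for jieba.lcut(title) results (invented for illustration)
tokenized_titles = [
    ["电梯", "房", "的", "采光"],
    ["精装修", "，", "采光", "电梯"],
    ["采光", "户型"],
]

# A set gives O(1) lookups; a list is scanned on every check
stopwords = {"的", "，", "和"}

tokens = [
    token
    for title in tokenized_titles
    for token in title
    if token not in stopwords
]

top_words = Counter(tokens).most_common(2)
print(top_words)
```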

Conclusion

The analysis provides a rough overview of Wuhan’s second-hand housing market. Central districts start at around 15,000 RMB/m², while peripheral areas can be as low as 7,800 RMB/m². Higher floors tend to be pricier, especially with good views. A size of about 100 m² is typical, and the most popular layout is “两室两厅一厨一卫”. The data offers useful hints, but it is not sufficient on its own for making purchase decisions.
