How to Scrape and Analyze 46k Rental Listings with Python: From Crawling to Visual Insights

Learn step‑by‑step how to crawl 46,000+ rental listings from Ziroom using Python, extract house details with regex, clean and transform the data with pandas, and visualize distribution, pricing and location insights through pyecharts, matplotlib and seaborn, revealing rental market patterns in Beijing.

Python Crawling & Data Mining

1 Overview

Beijing, Shanghai, Guangzhou and Shenzhen attract a large share of China's migrant workers, and Ziroom is a leading third-party rental platform in these cities. After the Eggshell (Danke Apartment) incident, we collected 46,000+ Ziroom listings from the four cities; the analysis below focuses on Beijing.

2 Data Collection – Crawling

Each search returns at most 50 pages (~1,500 listings), so a single query cannot reach every listing. To collect them all, we iterate over combinations of region and price-range filters, stepping the price range in increments of 500. Two obstacles need special handling: frequent requests trigger IP bans, and rental prices are rendered as sprite images rather than text.
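The filter sweep described above can be sketched as follows. This is a minimal illustration rather than the original crawler: the district names, the price ceiling and the helper names `price_bands` and `iter_filters` are assumptions, and turning each filter into a search URL is left out because the site's URL scheme is not shown here.

```python
from itertools import product

def price_bands(low, high, step=500):
    """Yield (start, end) rent bands of width `step` covering [low, high)."""
    for start in range(low, high, step):
        yield start, min(start + step, high)

def iter_filters(regions, low=0, high=3000, step=500):
    """Pair every region with every rent band so each query stays small."""
    return list(product(regions, price_bands(low, high, step)))

# Hypothetical inputs: two districts and a 0-1,500 rent range.
filters = iter_filters(['Chaoyang', 'Haidian'], low=0, high=1500)
for region, (band_low, band_high) in filters:
    print(region, band_low, band_high)
```

If one region and band still exceeds the 50-page cap, the same idea applies with a smaller `step` for that band.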

2.1 House Information Parsing

Open the Ziroom website, press F12, and locate the JSON‑like data in the page source. We used regular expressions to extract house ID, title, area and floor information; other parsers such as XPath or BeautifulSoup are also possible.

# Get specific house information
# (the patterns assume whitespace has already been stripped from item)
import re

houseId = re.findall(r'x/(.*?)\.html"target="_blank">', item)[0]
title = re.findall(r'target="_blank">(.*?)</a>', item)[0]  # type·community+layout-orientation
large = re.findall(r'<div>(.*?)</div>', item)[0]  # area|floor
location = re.findall(r'<divclass="location">(.*?)</div>', item)[0]  # distance to subway
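The patterns above contain no spaces inside tags (for example `<divclass="location">`), which suggests that whitespace is removed from the HTML before matching. The sketch below shows that step on a made-up fragment; the markup, the `first` helper and the sample values are assumptions for illustration, not real Ziroom output.

```python
import re

# Made-up fragment shaped like the patterns above expect; not real Ziroom markup.
raw_item = '''
<a href="//www.example.com/x/12345.html" target="_blank">整租·示例小区2室1厅-南</a>
<div>60.5㎡ | 3/6层</div>
<div class="location">小区距示例站步行约500米</div>
'''

# Remove all whitespace so tag attributes run together, e.g. <divclass="location">.
item = re.sub(r'\s', '', raw_item)

def first(pattern, text):
    """Return the first captured group, or None when the pattern is absent."""
    found = re.findall(pattern, text)
    return found[0] if found else None

house_id = first(r'x/(.*?)\.html"target="_blank">', item)
title = first(r'target="_blank">(.*?)</a>', item)
area_floor = first(r'<div>(.*?)</div>', item)
location = first(r'<divclass="location">(.*?)</div>', item)
print(house_id, title, area_floor, location)
```

Returning `None` instead of indexing `[0]` directly keeps one malformed listing from stopping the whole crawl with an `IndexError`.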

2.2 House Price Parsing

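The price is not displayed as plain text. Each digit is a slice of a shared sprite image, selected with a CSS background-position offset. We download the sprite PNG, paste it onto a white background, and run OCR (pytesseract) to read the digits in the order they appear in the sprite. Dividing each offset by the digit width then gives the index of the digit to look up, which recovers the numeric price.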

import re
import requests
import pytesseract
from PIL import Image

PRICE_IMAGE = 'price.png'

# Convert the transparent sprite PNG to a white background, then OCR its digits
def get_pricetext():
    im = Image.open(PRICE_IMAGE).convert('RGBA')
    white = Image.new('RGBA', im.size, (255, 255, 255, 255))
    white.paste(im, (0, 0), im)  # use the sprite's own alpha channel as the mask
    white.save(PRICE_IMAGE)
    text = pytesseract.image_to_string(
        Image.open(PRICE_IMAGE),
        config='--psm 10 --oem 3 -c tessedit_char_whitelist=1234567890',
        lang='eng')
    return re.sub(r'\s', '', text)

# Extract the sprite URL and the per-digit background positions
backgroundHtml = re.findall(r'url\((.*?)\)', item)
priceList = re.findall(r'background-position:(.*?)px', item)
image = requests.get('http:' + backgroundHtml[0]).content
with open(PRICE_IMAGE, 'wb') as f:
    f.write(image)

# OCR the sprite once, then look up each digit by its offset
text = get_pricetext()
price = ''
for i in priceList:
    num = int(float(i) / -20)  # digit width: 20 for discounted prices, 21.4 otherwise
    price = price + text[num]
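The offset-to-digit lookup can be tried without any image or OCR step. The sketch below is an illustration under assumptions: `decode_price` is a hypothetical helper, and the sprite digit order and offsets are made up. It uses `round` rather than `int` because dividing by a non-integer width such as 21.4 can land just below a whole number in floating point.

```python
def decode_price(sprite_digits, offsets, digit_width=20.0):
    """Map each CSS background-position offset to one digit of the sprite."""
    digits = []
    for offset in offsets:
        index = round(abs(float(offset)) / digit_width)
        digits.append(sprite_digits[index])
    return ''.join(digits)

# Made-up OCR result: the order in which the digits appear in the sprite image.
sprite_digits = '8670415923'

# Made-up offsets for one listing, as captured by the background-position regex.
print(decode_price(sprite_digits, ['-160', '-20', '-140', '-60']))

# The same listing with the wider 21.4 px digit slices.
print(decode_price(sprite_digits, ['-171.2', '-21.4', '-149.8', '-64.2'], digit_width=21.4))
```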

3 Data Processing – Cleaning

During crawling we encountered inconsistent formats for the same field, so we iteratively refined the parsing logic. After collection we performed a unified cleaning pipeline.

3.1 House Name Cleaning

Example strings like "合租·李村东里3居室-北卧" contain type, community, layout and orientation. We split them with a regular expression and assign each part to a new column.

# Split house name into components
import re

NAME_PATTERN = r'(.*?)·(.*)(\d居*室.*)-(.*)'
s = '整租·牛街182室1厅-西'
parts = re.match(NAME_PATTERN, s).groups()
# parts[0]=type, parts[1]=community, parts[2]=layout, parts[3]=orientation
# Apply to dataframe: one regex pass per row; names that do not match become NaN
name_parts = df['房屋名称'].str.extract(NAME_PATTERN)
df['类型'] = name_parts[0]
df['小区'] = name_parts[1]
df['户型'] = name_parts[2]
df['卧室朝向'] = name_parts[3]
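To see what the name pattern captures, the sketch below applies it to the two example names from this section using only the standard library. The named groups and the `split_name` helper are additions for readability, not part of the original code.

```python
import re

NAME_RE = re.compile(
    r'(?P<type>.*?)·(?P<community>.*)(?P<layout>\d居*室.*)-(?P<orientation>.*)'
)

def split_name(name):
    """Return the four name components, or None if the name does not fit."""
    match = NAME_RE.fullmatch(name)
    return match.groupdict() if match else None

for name in ['合租·李村东里3居室-北卧', '整租·牛街182室1厅-西']:
    print(split_name(name))
```

Because the community group is greedy, it keeps every character up to the last digit that is followed by 室, so in the second name "牛街18" becomes the community and "2室1厅" the layout.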

3.2 Room Information Cleaning

Room size and floor are stored together, e.g. "87.26㎡|11/29层". We split them with a regex, handling abnormal floor values such as "7层" or "-1/5层".

# Parse size and floor
import re

ROOM_PATTERN = r'(.*?)㎡\|(-?\d+)/?(.*?)层'
s = '87.26㎡|11/29层'
size, floor, total_floors = re.match(ROOM_PATTERN, s).groups()
# Apply to dataframe; total floors is an empty string for values like "7层"
room_parts = df['面积/楼层'].str.extract(ROOM_PATTERN)
df['房间大小'] = room_parts[0]
df['房间楼层'] = room_parts[1]
df['房间楼房层数'] = room_parts[2]
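The abnormal floor values mentioned above can be checked in isolation. This is a small sketch: `parse_room` is a hypothetical helper, and the sizes in the second and third samples are made up to accompany the "7层" and "-1/5层" floor formats.

```python
import re

ROOM_RE = re.compile(r'(.*?)㎡\|(-?\d+)/?(.*?)层')

def parse_room(text):
    """Return (size_m2, floor, total_floors); total_floors is None if absent."""
    match = ROOM_RE.match(text)
    if match is None:
        return None
    size, floor, total = match.groups()
    return float(size), int(floor), int(total) if total else None

for sample in ['87.26㎡|11/29层', '10.5㎡|7层', '12㎡|-1/5层']:
    print(sample, '->', parse_room(sample))
```

Converting to numbers at parse time surfaces malformed rows early instead of at the later `astype(float)` step.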

3.3 Location Information Cleaning

The location field records the nearest subway station and walking distance, e.g. "小区距地铁站步行约500米". We extract the station name and distance; rows without such information are left blank.

# Remove HTML tags
import re

df['位置'] = df['位置'].apply(lambda x: re.sub(r'<.*?>', '', x))
METRO_PATTERN = re.compile(r'小区距(.*?)步行约(\d+)米')
# Extract station; rows without subway information are left blank
def getMetro(x):
    m = METRO_PATTERN.search(x)
    return m.group(1) if m else ''
# Extract distance in meters
def getDistance(x):
    m = METRO_PATTERN.search(x)
    return m.group(2) if m else ''
df['附近地铁站'] = df['位置'].apply(getMetro)
df['距离地铁站距离'] = df['位置'].apply(getDistance)

3.4 Selecting Fields for Analysis

After cleaning we keep only the columns needed for analysis and compute a per‑square‑meter monthly rent field "price".

# Select columns (copy so the assignments below do not write to a view of df)
data = df[[
    'id', '房屋名称', '租金', '租金单位', '标签', '地区', '类型', '小区', '户型',
    '卧室朝向', '房间大小', '房间楼层', '房间楼房层数', '附近地铁站', '距离地铁站距离'
]].copy()
# Compute monthly price per sqm; daily rents are scaled by 30 days
rent = data['租金'].astype(float)
size = data['房间大小'].astype(float)
data.loc[data['租金单位']=='月', 'price'] = (rent / size).round(2)
data.loc[data['租金单位']=='天', 'price'] = (30 * rent / size).round(2)
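The unit normalization is simple arithmetic: monthly rents are divided by the area, and daily rents are first multiplied by 30. A plain-Python sketch of the same rule follows; the function name and the sample listings are made up for illustration.

```python
def monthly_price_per_sqm(rent, unit, size_m2):
    """Normalize rent to CNY per square meter per month."""
    if unit == '天':      # daily rent: approximate a month as 30 days
        rent = rent * 30
    elif unit != '月':    # anything other than monthly or daily is unexpected
        raise ValueError(f'unknown rent unit: {unit!r}')
    return round(rent / size_m2, 2)

print(monthly_price_per_sqm(3200, '月', 10))   # made-up monthly listing
print(monthly_price_per_sqm(120, '天', 12))    # made-up daily listing
```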

4 Data Statistics and Visualization

The cleaned dataset contains 23,574 listings. Visualizations are created with pyecharts, matplotlib and seaborn.

4.1 Map of Listings

Listings are concentrated in central and sub‑central districts; suburbs such as Yanqing, Huairou, Miyun and Pinggu have none.

Ziroom Beijing rental distribution map

4.2 Listings per District

Chaoyang district has the most listings (7,925), followed by Haidian, Fengtai and Changping.

Listings count per district

4.3 Rental Type Distribution

Ziroom offers three main rental types: shared (合租), whole‑unit (整租) and luxury (豪宅). Shared rentals dominate in Chaoyang, while luxury units appear only in Dongcheng and Chaoyang.

Rental type distribution

4.4 Proximity to Subway Stations

The chart highlights stations with many nearby listings, such as Shilipu (Chaoyang), Yongtaizhuang (Haidian), Jiaomen East (Fengtai) and Huairou (Changping).

Top subway stations by listing count

4.5 Top 10 Subway Areas by Average Rent

In the ten subway catchment areas with the highest average rent, prices reach up to 320 CNY per square meter per month, so a 10 m² single room there costs about 3,200 CNY a month.

Top 10 subway areas by average rent

4.6 Box Plots of Average Rent

Shared rentals average around 300 CNY/m², while whole‑unit rentals average roughly half of that.

Box plot for shared rentals
Box plot for whole‑unit rentals

5 Room-Level Statistics and Visualizations

Key room attributes such as size, rent, floor, orientation and distance to subway are examined.

5.1 Histogram of Shared‑Room Sizes

Most shared rooms are around 10 m².

Shared room size histogram

5.2 Histogram of Whole‑Unit Sizes

Whole units typically range from 40 to 60 m².

Whole‑unit room size histogram

5.3 Rent Distribution for Shared Rooms

Most shared rooms fall in the 2,000–4,000 CNY monthly range.

Shared room rent histogram

5.4 Rent Distribution for Whole‑Unit Rooms

Whole-unit rents concentrate between 5,000 and 7,500 CNY per month.

Whole‑unit rent histogram

5.5 Distance to Nearest Subway

Most listings are within 1,000 m of a subway station, and almost all lie within 1.5 km.

Distance to subway histogram

5.6 Rent vs. Distance Regression

A simple regression shows that rents tend to be lower the farther a listing is from the subway, though the relationship is modest.

Rent vs. distance scatter plot
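A plotting library such as seaborn can draw the fitted line directly; the calculation behind a simple linear fit is ordinary least squares, sketched here in plain Python. The `fit_line` helper is hypothetical and the numbers are toy values chosen for illustration only; they are not taken from the dataset.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Toy numbers for illustration only; they are not taken from the dataset.
distance_m = [200, 400, 600, 800, 1000]
rent_cny = [3400, 3300, 3200, 3100, 3000]
slope, intercept = fit_line(distance_m, rent_cny)
print(f'rent = {slope:.2f} * distance + {intercept:.0f}')
```

A negative slope corresponds to rent falling with distance; a slope close to zero relative to typical rents corresponds to the modest relationship described above.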

5.7 Heatmap of Orientation vs. Rent

North‑facing and northeast‑facing rooms command higher average rents.

Orientation vs. rent heatmap

5.8 Layout Distribution

Shared rentals are dominated by 3-bedroom layouts, while whole-unit rentals mainly consist of one-bedroom, one-living-room (1室1厅) and two-bedroom, one-living-room (2室1厅) units.

Layout distribution