How to Scrape and Analyze 46k Rental Listings with Python: From Crawling to Visual Insights
Learn step‑by‑step how to crawl 46,000+ rental listings from Ziroom using Python, extract house details with regex, clean and transform the data with pandas, and visualize distribution, pricing and location insights through pyecharts, matplotlib and seaborn, revealing rental market patterns in Beijing.
Overview
Beijing, Shanghai, Guangzhou and Shenzhen host the majority of migrant workers, and Ziroom is a leading third‑party rental platform. After the Eggshell incident, we collected 46,000+ Ziroom listings from the four cities; the following analysis focuses on Beijing.
Data Collection – Crawling
Each search result shows up to 50 pages (~1,500 listings). To obtain all listings we iterate over region and price‑range filters (500‑unit granularity). Frequent requests trigger IP bans, and rental prices are rendered as sprite images, requiring special handling.
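The filter iteration described above can be sketched as follows. This is a minimal illustration, not the original crawler: the function names, the dictionary keys and the sample district names are assumptions, and the actual query parameters the site expects are not shown in the source.

```python
from itertools import product

def price_bands(low, high, step=500):
    """Return (min, max) price bands covering [low, high) in `step`-sized slices."""
    return [(start, min(start + step, high)) for start in range(low, high, step)]

def build_filters(regions, low, high, step=500):
    """Pair every region with every price band so each query returns fewer results."""
    return [
        {"region": region, "price_min": band_min, "price_max": band_max}
        for region, (band_min, band_max) in product(regions, price_bands(low, high, step))
    ]

# Hypothetical inputs: two district names and a 0-2,000 price range.
filters = build_filters(["朝阳", "海淀"], 0, 2000)
print(len(filters))  # 2 regions x 4 bands
print(filters[0])
```

Each filter combination would then be crawled page by page, ideally with a delay between requests to reduce the risk of an IP ban.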
2.1 House Information Parsing
Open the Ziroom website, press F12, and locate the JSON‑like data in the page source. We used regular expressions to extract house ID, title, area and floor information; other parsers such as XPath or BeautifulSoup are also possible.
import re

# Get specific house information.
# The patterns contain no spaces, so `item` appears to be listing HTML with whitespace stripped.
houseId = re.findall(r'x/(.*?)\.html"target="_blank">', item)[0]
title = re.findall(r'target="_blank">(.*?)</a>', item)[0]  # orientation‑community‑layout‑bedrooms
large = re.findall(r'<div>(.*?)</div>', item)[0]  # area‑floor
location = re.findall(r'<divclass="location">(.*?)</div>', item)[0]  # location
2.2 House Price Parsing
The price is not displayed as plain text but as a background image with CSS background-position. By downloading the sprite PNG and applying OCR (pytesseract) after adding a white background, we recover the numeric price.
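The offset-to-digit arithmetic behind this can be illustrated in isolation. The sketch below assumes the sprite is a single row of ten equally wide digits and that OCR has already returned them as a string; `decode_price`, the sample digit string and the sample offsets are hypothetical. It rounds the quotient rather than truncating it, which is more forgiving when the digit width is a non-integer such as 21.4.

```python
def decode_price(sprite_digits, offsets_px, digit_width=21.4):
    """Map each CSS background-position x-offset to the digit at that slot in the sprite."""
    digits = []
    for offset in offsets_px:
        # Offsets are negative multiples of the digit width; rounding absorbs float error.
        index = round(abs(float(offset)) / digit_width)
        digits.append(sprite_digits[index])
    return "".join(digits)

sprite_digits = "8652039147"                   # hypothetical OCR result for the sprite
offsets = ["-21.4", "-64.2", "-0", "-149.8"]   # hypothetical background-position values
price = decode_price(sprite_digits, offsets)
print(price)  # 6281
```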
import re
import requests
import pytesseract
from PIL import Image

# Extract price image URL and positions
backgroundHtml = re.findall(r'url\((.*?)\)', item)
priceList = re.findall(r'background-position:(.*?)px', item)
image = requests.get('http:' + backgroundHtml[0]).content
with open('price.png', 'wb') as f:
    f.write(image)

# Convert transparent PNG to white background for OCR
def get_pricetext():
    im = Image.open('price.png')
    x, y = im.size
    try:
        p = Image.new('RGBA', im.size, (255, 255, 255))
        p.paste(im, (0, 0, x, y), im)
        p.save('price.png')
    except Exception:
        pass  # fall back to the downloaded image if compositing fails
    text = pytesseract.image_to_string(Image.open('price.png'),
                                       config='--psm 10 --oem 3 -c tessedit_char_whitelist=1234567890',
                                       lang='eng')
    return re.sub(r'\s', '', text)

# OCR to get price digits (the function must be defined before this call)
text = get_pricetext()
price = ''
for i in priceList:
    num = int(float(i) / -20)  # 20 for discounted, 21.4 otherwise
    price = price + text[num]
Data Processing – Cleaning
During crawling we encountered inconsistent formats for the same field, so we iteratively refined the parsing logic. After collection we performed a unified cleaning pipeline.
3.1 House Name Cleaning
Example strings like "合租·李村东里3居室-北卧" contain type, community, layout and orientation. We split them with a regular expression and assign each part to a new column.
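As a self-contained illustration of that split, the sketch below uses the same pattern with named groups and `re.fullmatch`, so a name that does not fit the format yields `None` instead of raising an `IndexError`. The group names and the `split_name` helper are assumptions made for this example.

```python
import re

# Same pattern as in the article, with named groups for readability.
NAME_RE = re.compile(
    r'(?P<type>.*?)·(?P<community>.*)(?P<layout>\d居*室.*)-(?P<orientation>.*)'
)

def split_name(name):
    """Return the four components of a listing name, or None if it does not match."""
    match = NAME_RE.fullmatch(name)
    return match.groupdict() if match else None

print(split_name('合租·李村东里3居室-北卧'))
print(split_name('整租·牛街182室1厅-西'))
```

Because the community group is greedy, the layout starts at the last digit that is followed by 室 (optionally via 居), which is why "牛街182室1厅" splits into "牛街18" and "2室1厅".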
# Split house name into components
NAME_PATTERN = r'(.*?)·(.*)(\d居*室.*)-(.*)'
s = '整租·牛街182室1厅-西'
parts = re.split(NAME_PATTERN, s)
# parts[1]=type, parts[2]=community, parts[3]=layout, parts[4]=orientation

# Apply to dataframe
df['类型'] = df['房屋名称'].apply(lambda x: re.split(NAME_PATTERN, x)[1])
df['小区'] = df['房屋名称'].apply(lambda x: re.split(NAME_PATTERN, x)[2])
df['户型'] = df['房屋名称'].apply(lambda x: re.split(NAME_PATTERN, x)[3])
df['卧室朝向'] = df['房屋名称'].apply(lambda x: re.split(NAME_PATTERN, x)[4])
3.2 Room Information Cleaning
Room size and floor are stored together, e.g. "87.26㎡|11/29层". We split them with a regex, handling abnormal floor values such as "7层" or "-1/5层".
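A minimal sketch of this parsing step, using the same pattern but converting the captured strings to numbers. The helper name `parse_size_floor` and the choice to return `None` for a missing total floor count are assumptions for this example, and the second and third sample strings are made up to exercise the abnormal cases.

```python
import re

FLOOR_RE = re.compile(r'(.*?)㎡\|(-?\d+)/?(.*?)层')

def parse_size_floor(text):
    """Return (size in m², floor, total floors or None), or None if the text does not match."""
    match = FLOOR_RE.fullmatch(text)
    if match is None:
        return None
    size, floor, total = match.groups()
    return float(size), int(floor), (int(total) if total else None)

print(parse_size_floor('87.26㎡|11/29层'))  # regular case
print(parse_size_floor('10㎡|7层'))         # total floors missing
print(parse_size_floor('12.5㎡|-1/5层'))    # basement floor
```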
# Parse size and floor
FLOOR_PATTERN = r'(.*?)㎡\|(-?\d+)/?(.*?)层'
s = '87.26㎡|11/29层'
# re.split returns ['', size, floor, total floors, ''] when the pattern matches the whole string
_, size, floor, subfloor, _ = re.split(FLOOR_PATTERN, s)

# Apply to dataframe
df['房间大小'] = df['面积/楼层'].apply(lambda x: re.split(FLOOR_PATTERN, x)[1])
df['房间楼层'] = df['面积/楼层'].apply(lambda x: re.split(FLOOR_PATTERN, x)[2])
df['房间楼房层数'] = df['面积/楼层'].apply(lambda x: re.split(FLOOR_PATTERN, x)[3])
3.3 Location Information Cleaning
The location field records the nearest subway station and walking distance, e.g. "小区距地铁站步行约500米". We extract the station name and distance; rows without such information are left blank.
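The extraction can be sketched with a single `re.search`, which also covers rows that lack the information without relying on a string-length check. The helper name `parse_location` and its return convention (empty station, `None` distance) are assumptions made for this example.

```python
import re

LOCATION_RE = re.compile(r'小区距(.*?)步行约(\d+)米')

def parse_location(text):
    """Return (station, distance in metres); ('', None) when the text has no such information."""
    match = LOCATION_RE.search(text)
    if match is None:
        return '', None
    return match.group(1), int(match.group(2))

print(parse_location('小区距地铁站步行约500米'))
print(parse_location(''))
```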
# Remove HTML tags
df['位置'] = df['位置'].apply(lambda x: re.sub(r'<(.*?)>', '', x))

# Extract station
def getMetro(x):
    if len(x) >= 9:
        return re.split(r'小区距(.*?)步行约(\d+?)米', x)[1]
    return ''

# Extract distance
def getDistance(x):
    if len(x) >= 9:
        return re.split(r'小区距(.*?)步行约(\d+?)米', x)[2]
    return ''

df['附近地铁站'] = df['位置'].apply(getMetro)
df['距离地铁站距离'] = df['位置'].apply(getDistance)
3.4 Selecting Fields for Analysis
After cleaning we keep only the columns needed for analysis and compute a per‑square‑meter monthly rent field "price".
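The per-square-meter calculation normalizes daily rents to a 30-day month before dividing by the room size. A small sketch of that arithmetic for a single listing follows; the function name `monthly_price_per_sqm` and the sample values are hypothetical.

```python
def monthly_price_per_sqm(rent, unit, size_sqm):
    """Return monthly rent per square meter, treating a month as 30 days for daily rents."""
    if unit == '月':      # rent quoted per month
        monthly_rent = rent
    elif unit == '天':    # rent quoted per day
        monthly_rent = 30 * rent
    else:
        raise ValueError(f'unexpected rent unit: {unit!r}')
    return round(monthly_rent / size_sqm, 2)

print(monthly_price_per_sqm(3200, '月', 10))    # 320.0
print(monthly_price_per_sqm(100, '天', 12.5))   # 240.0
```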
# Select columns (copy so the assignments below do not write to a view of df)
data = df[[
    'id', '房屋名称', '租金', '租金单位', '标签', '地区', '类型', '小区', '户型',
    '卧室朝向', '房间大小', '房间楼层', '房间楼房层数', '附近地铁站', '距离地铁站距离'
]].copy()

# Compute price per sqm (daily rents are scaled to a 30-day month)
data.loc[data['租金单位']=='月', 'price'] = round(data['租金']/data['房间大小'].astype(float), 2)
data.loc[data['租金单位']=='天', 'price'] = round(30*data['租金']/data['房间大小'].astype(float), 2)
Data Statistics and Visualization
The cleaned dataset contains 23,574 listings. Visualizations are created with pyecharts, matplotlib and seaborn.
4.1 Map of Listings
Listings are concentrated in central and sub‑central districts; suburbs such as Yanqing, Huairou, Miyun and Pinggu have none.
4.2 Listings per District
Chaoyang district has the most listings (7,925), followed by Haidian, Fengtai and Changping.
4.3 Rental Type Distribution
Ziroom offers three main rental types: shared (合租), whole‑unit (整租) and luxury (豪宅). Shared rentals dominate in Chaoyang, while luxury units appear only in Dongcheng and Chaoyang.
4.4 Proximity to Subway Stations
Listings near popular stations such as Tenri‑bao (Chaoyang), Yongtaizhuang (Haidian), Jiaomen East (Fengtai) and Huairou (Changping) are highlighted.
4.5 Top 10 Subway Areas by Average Rent
The ten subway circles with the highest average rent reach up to 320 CNY per square meter per month, equivalent to a 10 m² single room costing over 3,200 CNY monthly.
4.6 Box Plots of Average Rent
Shared rentals average around 300 CNY/m², while whole‑unit rentals average roughly half of that.
Room‑Level Statistics and Visualizations
Key room attributes such as size, rent, floor, orientation and distance to subway are examined.
5.1 Histogram of Shared‑Room Sizes
Most shared rooms are around 10 m².
5.2 Histogram of Whole‑Unit Sizes
Whole‑unit rooms typically range from 40 to 60 m².
5.3 Rent Distribution for Shared Rooms
Most shared rooms fall in the 2,000–4,000 CNY monthly range.
5.4 Rent Distribution for Whole‑Unit Rooms
Whole‑unit rents concentrate between 5,000 and 7,500 CNY per month.
5.5 Distance to Nearest Subway
Most listings are within 1,000 m of a subway station, and the large majority lie within 1.5 km.
5.6 Rent vs. Distance Regression
A simple regression shows that rents tend to be lower the farther a listing is from the subway, though the relationship is modest.
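For readers who want to reproduce this kind of fit, here is a minimal ordinary least squares sketch in plain Python. It is not the article's plotting code: `fit_line` is a hypothetical helper, and the distances and rents below are made-up numbers chosen only to exercise the function, not values from the dataset.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data (not from the dataset): distance to subway in metres, rent in CNY per m².
distances = [200, 500, 800, 1100]
rents = [116, 110, 104, 98]
slope, intercept = fit_line(distances, rents)
print(slope, intercept)
```

A negative slope means rent falls as distance grows. On real data the scatter around the line indicates how weak or strong that relationship is.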
5.7 Heatmap of Orientation vs. Rent
North‑facing and northeast‑facing rooms command higher average rents.
5.8 Layout Distribution
Shared rentals are dominated by 3‑bedroom layouts, while whole‑unit rentals mainly consist of one‑room‑one‑hall and two‑room‑one‑hall units.
Python Crawling & Data Mining
