How to Scrape and Analyze Beijing Dank Apartment Data with Python

This article demonstrates how to crawl 6,025 Beijing Dank Apartment listings using Python, clean and enrich the data with Pandas, and visualize distribution, price, size, floor, and subway proximity through charts, revealing key market insights and correlation patterns.

Python Crawling & Data Mining

Dec 13, 2020

Introduction

The rapid collapse of Dank Apartment caused widespread tenant and landlord disputes in Beijing. To provide a data‑driven perspective, 6,025 apartment records from the Beijing region were scraped, cleaned, and visualized.

Data Acquisition

The site has a simple structure, so the URL of each results page can be generated programmatically. A small number of pages return 404 and are skipped. The core crawler fetches each page with requests and uses XPath expressions (via lxml) to extract the price, area, ID, layout, floor, location, and subway fields.

import random
import time

import requests
from lxml import etree

# Assumption: `headers` is not shown in the original; a minimal User-Agent stands in
headers = {'User-Agent': 'Mozilla/5.0'}


def get_danke(href):
    """Fetch one listing page; return a list of field dicts (empty if the request fails)."""
    time.sleep(random.uniform(0, 1))  # avoid overloading the server
    response = requests.get(url=href, headers=headers)
    if response.status_code != 200:
        return []  # e.g. the small number of pages that return 404
    res = response.content.decode('utf-8')
    div = etree.HTML(res)
    rows = []
    for item in div.xpath("/html/body/div[3]/div[1]/div[2]/div[2]"):
        rows.append({
            'house_price': item.xpath("./div[3]/div[2]/div/span/div/text()")[0],
            'house_area': item.xpath("./div[4]/div[1]/div[1]/label/text()")[0].replace('建筑面积：约', '').replace('㎡（以现场勘察为准）', ''),
            'house_id': item.xpath("./div[4]/div[1]/div[2]/label/text()")[0].replace('编号：', ''),
            'house_type': item.xpath("./div[4]/div[1]/div[3]/label/text()")[0].replace('\n', '').replace(' ', '').replace('户型：', ''),
            'house_floor': item.xpath("./div[4]/div[2]/div[3]/label/text()")[0].replace('楼层：', ''),
            'house_position_1': item.xpath("./div[4]/div[2]/div[4]/label/div/a[1]/text()")[0],
            'house_position_2': item.xpath("./div[4]/div[2]/div[4]/label/div/a[2]/text()")[0],
            'house_position_3': item.xpath("./div[4]/div[2]/div[4]/label/div/a[3]/text()")[0],
            'house_subway': item.xpath("./div[4]/div[2]/div[5]/label/text()")[0],
        })
    # The original snippet stops after extraction; returning the rows is an assumption
    return rows
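
The article does not show how the page URLs are generated or how the extracted fields are saved. The following is a minimal sketch of one way to do it, not the author's code. `BASE_URL` is a placeholder rather than the site's real pagination pattern, and `build_page_urls`, `crawl`, and `write_csv` are hypothetical helpers. `fetch` stands for any callable that returns a list of row dicts for one URL, or nothing for a page that failed.

```python
import csv
import io

# Hypothetical URL template; the site's real pagination pattern is not shown in the article.
BASE_URL = "https://example.com/rooms?page={page}"


def build_page_urls(first_page, last_page, template=BASE_URL):
    """Return one URL per page number in the inclusive range."""
    return [template.format(page=n) for n in range(first_page, last_page + 1)]


def crawl(urls, fetch):
    """Call fetch(url) for each URL and collect the rows; failed pages are skipped."""
    rows = []
    for url in urls:
        rows.extend(fetch(url) or [])  # None or [] means the page yielded nothing
    return rows


def write_csv(rows, fieldnames, out):
    """Write row dicts to a file-like object as CSV with a single header line."""
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Stand-in fetcher so the sketch runs without network access: page 2 "fails".
def fake_fetch(url):
    if url.endswith("page=2"):
        return None
    return [{"house_id": url[-1], "house_price": "2000"}]


rows = crawl(build_page_urls(1, 3), fake_fetch)
buffer = io.StringIO()
write_csv(rows, ["house_id", "house_price"], buffer)
print(buffer.getvalue())
```

Passing the fetch function in as an argument keeps the loop testable offline. In a real run it would be replaced by the page-fetching function, and `out` by a file opened with `newline=''` and `encoding='utf-8'`.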

Data Processing

All CSV files generated by the crawler are concatenated with pandas.concat, and duplicate rows are removed. The price and area columns, which are read as strings, are cast to float64 after stray header rows are filtered out. Floor information is split into the current floor and the total number of floors. The subway-line count is derived by counting occurrences of “号线” (“line”), and the distance to the subway is taken as the first figure in metres (the digits before “米”) matched by a regular expression.
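
As a rough sketch of the concatenation and deduplication step, assuming each crawler run wrote one CSV with the same columns: the `load_all` helper, the file pattern in the comment, and the `编号` (ID) column name are illustrative and not taken from the original.

```python
import io

import pandas as pd


def load_all(sources):
    """Read each CSV source (path or file-like), stack them, and drop duplicate rows."""
    frames = [pd.read_csv(src, dtype=str) for src in sources]  # keep everything as text
    df = pd.concat(frames, ignore_index=True)
    return df.drop_duplicates().reset_index(drop=True)


# Typical use with files on disk (the pattern is an assumption; needs `import glob`):
# df = load_all(sorted(glob.glob("danke_*.csv")))

# Self-contained demo: two in-memory files that share one listing (A2).
part_1 = io.StringIO("编号,价格,面积\nA1,2000,10\nA2,2500,12\n")
part_2 = io.StringIO("编号,价格,面积\nA2,2500,12\nA3,3100,15\n")
df = load_all([part_1, part_2])
print(len(df))  # 3
```

Reading every column as text means duplicates are compared exactly as scraped, and any header line that was appended into the middle of a file survives as an ordinary row for the `df['价格'] != '价格'` filter to remove.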

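import re

# Drop stray header rows (cells equal to the column name), then convert
# price (价格) and area (面积) to numeric types
is_data_row = df['价格'] != '价格'
df = df.loc[is_data_row, :].copy()  # copy to avoid SettingWithCopyWarning below
df['价格'] = df['价格'].astype('float64')
df['面积'] = df['面积'].astype('float64')

# Split floor (楼层) text of the form "current/total层" into two integer columns
df = df[df['楼层'].notnull()].copy()
df['所在楼层'] = df['楼层'].apply(lambda x: x.split('/')[0]).astype('int32')
df['总楼层'] = df['楼层'].apply(lambda x: x.split('/')[1]).str.replace('层', '').astype('int32')

# Subway utilities
def get_subway_num(text):
    """Count subway lines by counting occurrences of '号线' ('line')."""
    return text.count('号线')

def get_subway_distance(text):
    """Return the first distance in metres (digits before '米'), or -1 if none is found."""
    m = re.search(r'\d+(?=米)', text)
    return int(m.group()) if m else -1

df['地铁数'] = df['地铁'].apply(get_subway_num)  # number of subway lines
df['距离地铁距离'] = df['地铁'].apply(get_subway_distance).astype('int32')  # distance to subway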
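
To make the parsing rules concrete, here is a small standalone check on made-up strings. The text formats are assumptions inferred from the code above (a floor string like `'3/6层'` and subway text with `号线` and `米`). The helper names `split_floor`, `count_lines`, and `nearest_distance` are illustrative. Unlike the `re.search` version, which takes the first distance that appears, `nearest_distance` takes the minimum over all matches and returns `None` instead of `-1`.

```python
import re


def split_floor(text):
    """Split a string like '3/6层' into (current floor, total floors)."""
    current, total = text.split('/')
    return int(current), int(total.replace('层', ''))


def count_lines(text):
    """Count subway lines by counting occurrences of '号线'."""
    return text.count('号线')


def nearest_distance(text):
    """Return the smallest distance in metres mentioned in the text, or None."""
    distances = [int(d) for d in re.findall(r'\d+(?=米)', text)]
    return min(distances) if distances else None


# Made-up sample text; the real site's wording may differ.
sample = '距1号线某站1200米，距2号线某站650米'
print(split_floor('3/6层'))      # (3, 6)
print(count_lines(sample))       # 2
print(nearest_distance(sample))  # 650
```

The lookahead `(?=米)` is what keeps line numbers such as the `1` in `1号线` from being mistaken for distances: only digit runs immediately followed by `米` match.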

Data Visualization

Using matplotlib, seaborn, and pyecharts, several charts were produced:

Bar chart of apartment counts per district (Chaoyang 1,877, Tongzhou 1,027).

Top‑10 residential complexes by apartment count.

Rent distribution, showing that more than 50% of units are priced at 2,000–3,000 CNY/month.

Floor‑level distribution (73.9% below 10 floors).

Area distribution (86.8% below 20 m²).

Word‑cloud of commercial circles highlighting popular neighborhoods.
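
The rent distribution chart comes down to binning prices and counting the share in each bin. The sketch below uses pandas and matplotlib on a handful of made-up prices, not the scraped data. The bin edges, labels, and the function names `rent_shares` and `plot_rent_shares` are assumptions, since the article does not show its plotting code.

```python
import pandas as pd

# Assumed bin edges in CNY/month; the article's exact bins are not shown.
BINS = [0, 1000, 2000, 3000, 4000, 5000, float('inf')]
LABELS = ['<1k', '1k-2k', '2k-3k', '3k-4k', '4k-5k', '5k+']


def rent_shares(prices):
    """Return the percentage of listings that fall in each rent bucket."""
    buckets = pd.cut(pd.Series(prices, dtype='float64'),
                     bins=BINS, labels=LABELS, right=False)
    shares = buckets.astype(str).value_counts(normalize=True)
    # Restore bucket order and show empty buckets as 0
    return shares.reindex(LABELS, fill_value=0.0).mul(100).round(1)


def plot_rent_shares(shares, path='rent_distribution.png'):
    """Draw the shares as a bar chart and save it to disk."""
    import matplotlib
    matplotlib.use('Agg')  # render without a display
    import matplotlib.pyplot as plt

    ax = shares.plot(kind='bar', rot=0)
    ax.set_xlabel('Monthly rent (CNY)')
    ax.set_ylabel('Share of listings (%)')
    plt.tight_layout()
    plt.savefig(path)
    plt.close()


# Made-up prices for illustration only.
shares = rent_shares([1800, 2100, 2500, 2900, 3200, 4100, 2300, 2700])
print(shares)
```

With the real data, `rent_shares(df['价格'])` followed by `plot_rent_shares(shares)` would produce the chart. Keeping the share calculation separate from the plotting makes a headline figure easy to check without rendering anything.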

Correlation Analysis

A correlation matrix shows that apartment area (0.81) and the number of nearby subway lines (0.36) have the strongest positive correlations with price, while floor level is only weakly related to it.

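import seaborn as sns
color_map = sns.light_palette('orange', as_cmap=True)
# Restrict to numeric columns; recent pandas versions raise an error on string columns
df.select_dtypes('number').corr().style.background_gradient(color_map)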
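
To read the price relationships off the matrix directly, one option is to pull out the price column and sort it. The sketch below runs on a tiny made-up table, so the numbers it prints are not the article's results. It also replaces the `-1` distance placeholder with NaN first, on the assumption that a sentinel should not be treated as a real distance. The `price_correlations` helper and the `小区` column in the demo are hypothetical.

```python
import pandas as pd


def price_correlations(df, target='价格', sentinel_cols=('距离地铁距离',)):
    """Return Pearson correlations of each numeric column with the target column."""
    numeric = df.select_dtypes('number').copy()
    for col in sentinel_cols:
        if col in numeric:
            # -1 marks "no distance found"; turn it into NaN so it is ignored
            numeric[col] = numeric[col].where(numeric[col] >= 0)
    return numeric.corr()[target].drop(target).sort_values(ascending=False)


# Made-up rows for illustration only.
demo = pd.DataFrame({
    '价格': [2000.0, 2500.0, 3000.0, 3500.0],
    '面积': [10.0, 12.0, 15.0, 18.0],
    '距离地铁距离': [900, -1, 600, 300],
    '小区': ['a', 'b', 'c', 'd'],
})
print(price_correlations(demo))
```

`DataFrame.corr` computes each pair on the rows where both values are present, so masking the sentinel drops those listings only from the distance correlation and leaves the others unaffected.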

Conclusion

The analysis shows that most Dank Apartments in Beijing are small, low‑rise units concentrated in Chaoyang and Tongzhou. Rent tracks floor area most closely (correlation 0.81) and, more weakly, the number of nearby subway lines (0.36). These insights can help tenants and landlords understand market dynamics during the crisis.
