How to Scrape Thousands of New‑House Listings in Python: A Step‑by‑Step Guide

This article demonstrates how to use Python's requests, fake_useragent, and lxml libraries to batch‑scrape nearly a thousand new‑house listings from the 惠民之家 website, extracting 41 fields such as name, price, layout, opening date, plot ratio and green ratio, while handling pagination and anti‑scraping measures.

Python Crawling & Data Mining
Python Crawling & Data Mining
Python Crawling & Data Mining
How to Scrape Thousands of New‑House Listings in Python: A Step‑by‑Step Guide

Project Background

Hello, I'm J. Real‑estate new‑house data is valuable for buyers, developers and agents. This article uses the "惠民之家" website as an example to demonstrate how to batch‑scrape nearly a thousand new‑house listings with Python, extracting 41 fields such as name, price, layout, opening date, plot ratio, green ratio, etc.

Project Goal

Target website: http://www.fz0752.com/ List page URL:

http://www.fz0752.com/project/list.shtml

Preparation

Software: PyCharm

Third‑party libraries: requests, fake_useragent, lxml

Web Analysis

List Page Analysis

After clicking “Next Page”, the URL contains the pagination parameter pageNO and region parameter qy. By iterating over regions and pages we can collect all listing URLs.

Detail Page Analysis

Each listing URL contains an ID (e.g., 00020170060). The detail page URL is formed as http://newhouse.fz0752.com/project/detail.shtml?num=20170060. This pattern can be derived by inspecting several listings.

Anti‑Scraping Measures

Frequent requests from the same IP may be blocked. The script uses fake_useragent to randomize the User‑Agent header, reducing the risk of IP bans.

Code Implementation

The script imports required libraries, builds a region list, iterates over regions and pages (up to 50), requests each list page, extracts listing URLs, constructs detail URLs, and parses 41 fields using XPath. Extracted data is saved to a CSV file and can be exported to Excel.

# -*- coding = uft-8 -*-
# @Time : 2020/12/21 9:29 下午
# @Author : J哥
# @File : newhouse.py

import csv
import time
import random
import requests
import traceback
from lxml import etree
from fake_useragent import UserAgent

def main():
    #46:惠城区,47:仲恺区,171:惠阳区,172:大亚湾,173:博罗县,174:惠东县,175:龙门县
    qy_list = [46,47,171,172,173,174,175]
    for qy in qy_list:
        for page in range(1,50):
            url = f'http://www.fz0752.com/project/list.shtml?state=&key=&qy={qy}&area=&danjia=&func=&fea=&type=&kp=&mj=&sort=&pageNO={page}'
            response = requests.request("GET", url, headers=headers, timeout=5)
            if response.status_code == 200:
                # parse list page and extract hrefs
                ...

Additional helper functions get_href and get_detail handle URL extraction and field parsing, with error handling and CSV writing.

Summary

This article provides a practical Python web‑scraping solution for acquiring new‑house data.

Do not overload the target server; scrape responsibly.

Reply with “新房” to obtain the full source code.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Pythondata miningCSVWeb ScrapingrequestsReal Estate Datalxml
Python Crawling & Data Mining
Written by

Python Crawling & Data Mining

Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.