How to Scrape Thousands of New‑House Listings in Python: A Step‑by‑Step Guide
This article demonstrates how to use Python's requests, fake_useragent, and lxml libraries to batch‑scrape nearly a thousand new‑house listings from the 惠民之家 website, extracting 41 fields such as name, price, layout, opening date, plot ratio and green ratio, while handling pagination and anti‑scraping measures.
Project Background
Hello, I'm J. Data on newly built homes is valuable for buyers, developers and agents. This article uses the "惠民之家" website as an example to show how to batch‑scrape nearly a thousand new‑house listings with Python, extracting 41 fields such as name, price, layout, opening date, plot ratio and green ratio.
Project Goal
Target website: http://www.fz0752.com/
List page URL: http://www.fz0752.com/project/list.shtml
Preparation
Software: PyCharm
Third‑party libraries: requests, fake_useragent, lxml
Web Analysis
List Page Analysis
After clicking “Next Page”, the URL contains the pagination parameter pageNO and region parameter qy. By iterating over regions and pages we can collect all listing URLs.
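As a rough sketch of that iteration, the list‑page URLs can be generated up front. The query parameters and region IDs are taken from the script later in this article; the helper name iter_list_urls and the max_page argument are illustrative assumptions, and 49 is only an upper bound, not the real page count of any region.

```python
from urllib.parse import urlencode

BASE_LIST_URL = "http://www.fz0752.com/project/list.shtml"
# Region IDs as listed in the article's script.
QY_LIST = [46, 47, 171, 172, 173, 174, 175]


def iter_list_urls(qy_list=QY_LIST, max_page=49):
    """Yield (qy, page, url) for every region/page combination.

    max_page=49 mirrors range(1, 50) in the article's script.
    """
    for qy in qy_list:
        for page in range(1, max_page + 1):
            # Same parameter order as the URL in the article; only qy and
            # pageNO carry values, the other filters stay empty.
            params = {
                "state": "", "key": "", "qy": qy, "area": "", "danjia": "",
                "func": "", "fea": "", "type": "", "kp": "", "mj": "",
                "sort": "", "pageNO": page,
            }
            yield qy, page, f"{BASE_LIST_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Print the first two pages of each region as a quick check.
    for qy, page, url in iter_list_urls(max_page=2):
        print(qy, page, url)
```

Generating the URLs separately from fetching them makes it easy to inspect or resume the crawl without touching the network.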
Detail Page Analysis
Each listing URL contains an ID (e.g., 00020170060). The detail page URL is formed as http://newhouse.fz0752.com/project/detail.shtml?num=20170060. This pattern can be derived by inspecting several listings.
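A minimal sketch of that mapping is shown below. It assumes that num is simply the listing ID with its leading zeros removed, which matches the one example above but has not been checked against every listing; the function name detail_url_from_id is hypothetical.

```python
DETAIL_URL = "http://newhouse.fz0752.com/project/detail.shtml?num={num}"


def detail_url_from_id(listing_id: str) -> str:
    """Build a detail-page URL from a listing ID such as '00020170060'.

    Assumption: `num` is the ID with its leading zeros stripped, which is
    consistent with the single example given in the article.
    """
    num = listing_id.strip().lstrip("0")
    if not num.isdigit():
        raise ValueError(f"unexpected listing ID: {listing_id!r}")
    return DETAIL_URL.format(num=num)


print(detail_url_from_id("00020170060"))
# http://newhouse.fz0752.com/project/detail.shtml?num=20170060
```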
Anti‑Scraping Measures
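Frequent requests from the same IP may be blocked. The script uses fake_useragent to randomize the User‑Agent header so that requests look less uniform, which lowers the chance of being blocked. Note that a different User‑Agent does not change the IP address, so the request rate still matters.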
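One way this could look in practice is sketched below. The fallback User‑Agent pool and the random pause between requests are assumptions added for illustration (the article only confirms the use of fake_useragent), and the names random_headers and polite_sleep are hypothetical.

```python
import random
import time

# Small fallback pool, used only if fake_useragent is unavailable.
FALLBACK_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def random_headers():
    """Return request headers with a randomly chosen User-Agent."""
    try:
        from fake_useragent import UserAgent
        user_agent = UserAgent().random
    except Exception:  # library missing, or its data could not be loaded
        user_agent = random.choice(FALLBACK_USER_AGENTS)
    return {"User-Agent": user_agent}


def polite_sleep(min_s=1.0, max_s=3.0):
    """Pause for a random interval between requests; return the delay used."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay


if __name__ == "__main__":
    print(random_headers())
```

Calling random_headers() before each request and polite_sleep() after it keeps the load on the target server low, which matters more than the header itself.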
Code Implementation
The script imports the required libraries, builds a region list, iterates over regions and pages (range(1, 50), i.e. pages 1 to 49), requests each list page, extracts listing URLs, constructs detail URLs, and parses 41 fields using XPath. The extracted data is saved to a CSV file, which can then be opened in or exported to Excel.
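The CSV step is not shown in the excerpt, so the following is only a sketch of how it might be done with the standard csv module. The column names are a hypothetical subset of the 41 fields, and save_rows is an illustrative name. The utf-8-sig encoding is chosen so that Excel detects UTF‑8 and displays Chinese text correctly.

```python
import csv

# Hypothetical subset of the 41 fields; the real column names are not
# shown in the article.
FIELDNAMES = ["name", "price", "layout", "opening_date", "plot_ratio", "green_ratio"]


def save_rows(path, rows, fieldnames=FIELDNAMES):
    """Write a list of dict rows to a CSV file with a header line.

    Missing fields are written as empty strings and unknown keys are
    ignored, so one malformed listing does not abort the export.
    """
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(
            f, fieldnames=fieldnames, restval="", extrasaction="ignore"
        )
        writer.writeheader()
        writer.writerows(rows)
```

For example, save_rows("newhouse.csv", rows) would write every dict in rows as one line; the file name here is arbitrary.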
# -*- coding: utf-8 -*-
# @Time : 2020/12/21 9:29 PM
# @Author : J哥
# @File : newhouse.py
import csv
import time
import random
import requests
import traceback
from lxml import etree
from fake_useragent import UserAgent


def main():
    # Region IDs: 46 Huicheng (惠城区), 47 Zhongkai (仲恺区), 171 Huiyang (惠阳区),
    # 172 Daya Bay (大亚湾), 173 Boluo (博罗县), 174 Huidong (惠东县), 175 Longmen (龙门县)
    qy_list = [46, 47, 171, 172, 173, 174, 175]
    ua = UserAgent()
    for qy in qy_list:
        for page in range(1, 50):  # pages 1 to 49
            url = f'http://www.fz0752.com/project/list.shtml?state=&key=&qy={qy}&area=&danjia=&func=&fea=&type=&kp=&mj=&sort=&pageNO={page}'
            # `headers` is not defined in the excerpt; a random User-Agent
            # per request is assumed here
            headers = {'User-Agent': ua.random}
            response = requests.get(url, headers=headers, timeout=5)
            if response.status_code == 200:
                # parse list page and extract hrefs
                ...

Additional helper functions get_href and get_detail handle URL extraction and field parsing, with error handling and CSV writing.
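Since get_href and get_detail are not reproduced here, the sketch below only illustrates the general XPath technique with lxml. The sample HTML, the XPath expressions, the field names and the function names extract_hrefs and extract_fields are all made up for illustration; the real pages will need selectors found by inspecting their markup.

```python
from lxml import etree

# Made-up stand-ins for a list page and a detail page; the real markup is
# not shown in the article.
SAMPLE_LIST_HTML = """
<html><body>
  <div class="item"><a href="/project/00020170060.shtml">Example A</a></div>
  <div class="item"><a href="/project/00020180012.shtml">Example B</a></div>
  <div class="nav"><a href="/about.shtml">About</a></div>
</body></html>
"""

SAMPLE_DETAIL_HTML = """
<html><body>
  <h1 class="name"> Example Estate </h1>
  <span id="price">12000</span>
</body></html>
"""


def extract_hrefs(html_text, xpath='//div[@class="item"]/a/@href'):
    """Return the listing hrefs selected by `xpath` (hypothetical selector)."""
    tree = etree.HTML(html_text)
    return [href.strip() for href in tree.xpath(xpath)]


def extract_fields(html_text, field_xpaths):
    """Map each field name to the first text/attribute value its XPath selects.

    Each XPath must select strings (text() or @attr). Fields with no match
    become empty strings so that every row has the same columns.
    """
    tree = etree.HTML(html_text)
    row = {}
    for field, xp in field_xpaths.items():
        values = tree.xpath(xp)
        row[field] = values[0].strip() if values else ""
    return row


if __name__ == "__main__":
    print(extract_hrefs(SAMPLE_LIST_HTML))
    print(extract_fields(SAMPLE_DETAIL_HTML, {
        "name": '//h1[@class="name"]/text()',
        "price": '//span[@id="price"]/text()',
        "layout": '//span[@id="layout"]/text()',  # absent in the sample
    }))
```

Keeping the field-to-XPath mapping in a dictionary means that all 41 fields can be handled by one loop instead of 41 separate statements.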
Summary
This article provides a practical Python web‑scraping solution for acquiring new‑house data.
Do not overload the target server; scrape responsibly.
Reply with “新房” to obtain the full source code.
