How to Scrape Chinese Real‑Estate Listings with Python: A Step‑by‑Step Guide

This article walks you through building a Python web‑scraper that extracts second‑hand housing data from Lianjia, covering target identification, page structure analysis, multi‑level filtering, pagination handling, detailed page parsing, and data storage with practical code examples.

Python Crawling & Data Mining

Aug 12, 2021

1. Determine the Target

The target website is https://www.lianjia.com/ . Each city has its own subdomain: for Shenzhen, the second‑hand housing page is https://sz.lianjia.com/ershoufang/ ("sz" stands for Shenzhen, "gz" for Guangzhou).

Two pages need to be considered: the listing page (shown after opening the URL) and the detail page (opened after clicking a house link).

1.1 Listing Page

The listing page consists of three parts: a top search section, a middle list section, and a bottom pagination section.

The search section looks optional, but in practice it is needed to cap the number of records per query. For example, selecting Shenzhen yields 39,053 records, yet the site exposes only 100 pages of 30 items each, so at most 3,000 items can be extracted without filters.

Therefore, meaningful filters such as region + layout + orientation should be applied so that each result set stays within 3,000 records.

Other filter combinations can achieve the same effect; set them as needed.
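
The arithmetic behind the cap can be checked with a short sketch. The function names are illustrative and not part of the original script; the 30-per-page and 100-page figures are the ones observed above.

```python
def max_reachable(page_size: int = 30, max_pages: int = 100) -> int:
    """Largest number of listings reachable through pagination alone."""
    return page_size * max_pages


def needs_filtering(total: int, page_size: int = 30, max_pages: int = 100) -> bool:
    """True when the result set is larger than pagination can expose."""
    return total > max_reachable(page_size, max_pages)


print(max_reachable())          # 3000
print(needs_filtering(39053))   # True
print(needs_filtering(2500))    # False
```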

The browser's developer tools reveal the URL codes that the site uses for layout and orientation. In code, these mappings are represented as:

# Layout: 1‑room, 2‑room, 3‑room, 4‑room, 5‑room, 5‑room+
self.rooms_number = ['l1', 'l2', 'l3', 'l4', 'l5', 'l6']
# Orientation: East, South, West, North, South‑North
self.orientation = ['f1', 'f2', 'f3', 'f4', 'f5']
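
As a rough illustration of how these codes might be combined, the sketch below enumerates every layout/orientation pair. The dictionaries and the function name are assumptions made for this example; only the codes themselves come from the article.

```python
from itertools import product

# Filter codes from the article; the English labels are descriptive only.
ROOMS = {"l1": "1-room", "l2": "2-room", "l3": "3-room",
         "l4": "4-room", "l5": "5-room", "l6": "5-room+"}
ORIENTATION = {"f1": "East", "f2": "South", "f3": "West",
               "f4": "North", "f5": "South-North"}


def filter_combinations():
    """Yield (room_code, orientation_code) for every layout/orientation pair."""
    yield from product(ROOMS, ORIENTATION)


combos = list(filter_combinations())
print(len(combos))   # 30
print(combos[0])     # ('l1', 'f1')
```

Keeping the labels next to the codes makes it easier to log which filter a request belongs to.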

1.2 Detail Page

The detail page contains three sections: price + location, basic + transaction info, and map data.

From the price section you can obtain total price and unit price; from the location you get community name and region hierarchy.

The basic/transaction section provides fields such as listing time, mortgage status, property rights, etc.

The map section includes nearby subway stations, bus stops, and hidden latitude/longitude coordinates.
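
The article does not show how the hidden coordinates are read. One possible approach is a regular expression over the page source, sketched below. The key name `resblockPosition` and the longitude-first order are assumptions, not confirmed by the article, and the sample HTML is made up; check the real page source before relying on either.

```python
import re
from typing import Optional, Tuple

# ASSUMPTION: the coordinates sit in an inline script as
#   resblockPosition:'<longitude>,<latitude>'
# Verify the real key name and value order in the page source.
POSITION_RE = re.compile(
    r"resblockPosition\s*:\s*'(-?\d+(?:\.\d+)?),(-?\d+(?:\.\d+)?)'"
)


def extract_coordinates(html: str) -> Optional[Tuple[float, float]]:
    """Return (longitude, latitude) if the assumed pattern is found, else None."""
    match = POSITION_RE.search(html)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


# Made-up snippet standing in for a detail page.
sample = "<script>var info = {resblockPosition:'114.0579,22.5431'};</script>"
print(extract_coordinates(sample))  # (114.0579, 22.5431)
```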

2. Process Design

Summarizing the workflow:

Check whether the city's total record count exceeds 3,000. If so, apply filters in the order region → layout → orientation, stopping as soon as a result set fits within 3,000 records.

After filters are set, iterate through each listing page, construct pagination URLs, and collect links to detail pages.

Parse each detail page and save the extracted fields to a local file.

A flowchart (illustrated in the original article) visualizes these steps.
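
The drill-down described above can be sketched as a small recursive planner. This is a simplified illustration, not the original script: the region names, the `plan_queries` function, and the fake counts are all hypothetical, and the real code would fetch each count from the site.

```python
from typing import Callable, List, Sequence, Tuple

LIMIT = 3000  # 100 pages x 30 listings

# Filter levels in the order they are applied; values are illustrative.
LEVELS: List[Sequence[str]] = [
    ["area_a", "area_b"],   # hypothetical region slugs
    ["l1", "l2"],           # layout codes
    ["f1", "f2"],           # orientation codes
]


def plan_queries(count_for: Callable[[Tuple[str, ...]], int],
                 chosen: Tuple[str, ...] = ()) -> List[Tuple[str, ...]]:
    """Return filter tuples to crawl, adding a level only while results exceed LIMIT."""
    # Stop when the set fits, or when no filter level is left to apply.
    if count_for(chosen) <= LIMIT or len(chosen) == len(LEVELS):
        return [chosen]
    plans: List[Tuple[str, ...]] = []
    for value in LEVELS[len(chosen)]:
        plans.extend(plan_queries(count_for, chosen + (value,)))
    return plans


# Fake counts standing in for real requests.
fake_counts = {
    (): 10000,
    ("area_a",): 2000,
    ("area_b",): 5000,
    ("area_b", "l1"): 1000,
    ("area_b", "l2"): 4000,
    ("area_b", "l2", "f1"): 2500,
    ("area_b", "l2", "f2"): 1500,
}
plan = plan_queries(lambda chosen: fake_counts[chosen])
print(plan)
```

In this example only `area_b` with layout `l2` needs all three filter levels; every other branch stops early.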

3. Core Code Implementation

Below are the essential code snippets; the full script is available at the end of the article.

3.1 Get House Count

def get_house_count(self):
    """Get the number of houses for the current filter condition."""
    # Fetch the initial page; the timeout keeps a stalled request from hanging the crawl
    response = requests.get(url=self.current_url, headers=self.headers, timeout=10)
    # Parse with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
    # Extract the total count as a whitespace-trimmed string
    count = soup.find('h2', class_='total fl').find('span').get_text(strip=True)
    return soup, count
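
The count comes back as a string and is converted with `int()` by the caller. A slightly more defensive conversion might look like the sketch below. The helper `parse_count` is hypothetical, and the thousands-separator case is included only as a precaution, not because the site is known to format numbers that way.

```python
import re


def parse_count(text: str) -> int:
    """Extract the first integer from a counter string, ignoring spaces and commas."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return int(match.group().replace(",", ""))


print(parse_count(" 39053 "))   # 39053
print(parse_count("39,053"))    # 39053
```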

3.2 Main Page Logic

def get_main_page(self):
    # Get total count for current filter
    soup, count_main = self.get_house_count()
    if int(count_main) > self.page_size * self.max_pages:
        # Retrieve all districts as the first filter
        soup_uls = soup.find('div', attrs={'data-role': 'ershoufang'}).div.find_all('a')
        self.area = self.get_area_list(soup_uls)
        for area in self.area:
            self.get_area_page(area)
    else:
        # Directly fetch data without extra filtering
        self.get_pages(int(count_main), '', '', '')
    # Save results
    self.data_to_csv()
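
`get_pages` is not shown in this excerpt. As a hedged sketch of the pagination step, the function below computes how many pages a result set needs and builds one URL per page. The URL scheme (page number and filter codes concatenated into one path segment), the function name, and the region slug `luohuqu` are assumptions; confirm the real pattern by paging through the site in a browser.

```python
import math
from typing import List


def build_page_urls(base_url: str, count: int, area: str = "", room: str = "",
                    orientation: str = "", page_size: int = 30,
                    max_pages: int = 100) -> List[str]:
    """Build one URL per result page, capped at max_pages."""
    pages = min(math.ceil(count / page_size), max_pages)
    prefix = base_url + (area + "/" if area else "")
    # ASSUMPTION: page and filter codes share one path segment, e.g. pg3l2f2
    return [f"{prefix}pg{n}{room}{orientation}/" for n in range(1, pages + 1)]


urls = build_page_urls("https://sz.lianjia.com/ershoufang/", 65,
                       area="luohuqu", room="l2", orientation="f2")
print(len(urls))  # 3
print(urls[-1])   # https://sz.lianjia.com/ershoufang/luohuqu/pg3l2f2/
```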

3.3 Data Saving

def data_to_csv(self):
    """Append collected rows to the CSV file, writing the header only once."""
    df_data = pd.DataFrame(self.data_info)
    # Write the header only when the file is missing or empty
    file_has_data = os.path.exists(self.save_file_path) and os.path.getsize(self.save_file_path) > 0
    df_data.to_csv(self.save_file_path, mode='a', encoding='utf-8', header=not file_has_data, index=False)
    # Clear the buffer so the same rows are not written twice
    self.data_info = []
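
The method above relies on pandas. For comparison, the same append-with-header-once pattern can be written with the standard library `csv` module, which makes it easy to try in isolation. The function name and the sample rows are made up for this sketch.

```python
import csv
import os
import tempfile
from typing import Dict, List


def append_rows(path: str, rows: List[Dict[str, str]]) -> None:
    """Append rows to a CSV file, writing the header only when the file is new or empty."""
    if not rows:
        return
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
        if write_header:
            writer.writeheader()
        writer.writerows(rows)


# Two separate "runs" appending to the same temporary file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    append_rows(demo_path, [{"house_id": "1", "total_price": "500"}])
    append_rows(demo_path, [{"house_id": "2", "total_price": "620"}])
    with open(demo_path, encoding="utf-8") as fh:
        lines = fh.read().splitlines()

print(lines)  # ['house_id,total_price', '1,500', '2,620']
```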

3.4 Duplicate Check

# Set during initialization: IDs of already saved houses, used for deduplication
self.house_id = self.get_exists_house_id()

def get_exists_house_id(self):
    """Read the existing CSV and return a list of house IDs as strings."""
    # An empty file would make pd.read_csv raise EmptyDataError, so check the size too
    if os.path.exists(self.save_file_path) and os.path.getsize(self.save_file_path):
        df_data = pd.read_csv(self.save_file_path, encoding='utf-8', dtype={'house_id': str})
        return df_data['house_id'].to_list()
    return []
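
How the saved IDs are used is not shown here. One way to apply them, sketched under assumptions, is to skip detail links whose ID is already known. The sketch assumes that detail links end in `<digits>.html` and that those digits are the house ID; the helper names and sample links are hypothetical. Converting the ID list to a set keeps each membership check fast.

```python
import re
from typing import Iterable, List, Optional, Set

# ASSUMPTION: detail links end in "<digits>.html" and the digits are the house ID.
HOUSE_ID_RE = re.compile(r"/(\d+)\.html")


def house_id_from_url(url: str) -> Optional[str]:
    """Return the numeric ID embedded in a detail URL, or None if absent."""
    match = HOUSE_ID_RE.search(url)
    return match.group(1) if match else None


def filter_new_links(links: Iterable[str], seen: Set[str]) -> List[str]:
    """Keep links whose ID has not been seen; record each new ID in `seen`."""
    fresh = []
    for link in links:
        house_id = house_id_from_url(link)
        if house_id is None or house_id in seen:
            continue
        seen.add(house_id)
        fresh.append(link)
    return fresh


seen_ids = {"1001"}  # e.g. set(self.house_id) in the scraper class
links = [
    "https://sz.lianjia.com/ershoufang/1001.html",
    "https://sz.lianjia.com/ershoufang/1002.html",
    "https://sz.lianjia.com/ershoufang/1002.html",
]
new_links = filter_new_links(links, seen_ids)
print(new_links)  # ['https://sz.lianjia.com/ershoufang/1002.html']
```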

3.5 Entry Point

if __name__ == '__main__':
    city_number = 'sz'  # city subdomain ('sz' = Shenzhen)
    city_name = '深圳'  # Shenzhen
    url = 'https://{0}.lianjia.com/ershoufang/'.format(city_number)
    page_size = 30  # listings per result page
    save_file_path = '二手房数据-sz.csv'  # "second-hand housing data - sz"
    house = House(city_name, url, page_size, save_file_path)
    house.get_main_page()

4. Execution Screenshots

[Screenshots in the original article: the first run (parameters set), a subsequent run (no parameters needed), and a sample of the extracted data.]

5. Final Remarks

Although the workflow is a bit involved, the overall approach remains a solid introductory example for Python web scraping and data mining.

For more details on proxy settings, request headers, and HTML parsing, refer to the author's earlier articles.

Remember to add reasonable sleep intervals during crawling to act responsibly.
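
A minimal sketch of such a pause is shown below. The function name and the 1 to 3 second default range are assumptions for illustration, not values from the original script; choose intervals suited to the site and your request volume.

```python
import random
import time
from typing import Callable


def polite_delay(min_seconds: float = 1.0, max_seconds: float = 3.0,
                 sleep: Callable[[float], None] = time.sleep) -> float:
    """Pause for a random interval between min_seconds and max_seconds; return it."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


# Short interval here so the demo finishes quickly.
waited = polite_delay(0.01, 0.02)
print(f"waited {waited:.3f} s")
```

Calling `polite_delay()` before each request spaces out the traffic, and the random jitter avoids a perfectly regular request rhythm. The `sleep` parameter exists only so the pause can be replaced in tests.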
