How to Scrape Chinese Real‑Estate Listings with Python: A Step‑by‑Step Guide
This article walks you through building a Python web‑scraper that extracts second‑hand housing data from Lianjia, covering target identification, page structure analysis, multi‑level filtering, pagination handling, detailed page parsing, and data storage with practical code examples.
1. Determine the Target
The target website is https://www.lianjia.com/ . For a specific city, e.g., Shenzhen, the second‑hand house page is https://sz.lianjia.com/ershoufang/ ("sz" stands for Shenzhen, "gz" for Guangzhou).
Two pages need to be considered: the listing page (shown after opening the URL) and the detail page (opened after clicking a house link).
1.1 Listing Page
The listing page consists of three parts: a top search section, a middle list section, and a bottom pagination section.
The search section looks optional, but it is needed to shrink each result set. For example, selecting Shenzhen yields 39,053 records, yet the site exposes only 100 pages of 30 items each, so at most 3,000 items can be extracted without filters.
Therefore, meaningful filters such as region + layout + orientation should be applied until each result set is no larger than 3,000.
Other filter combinations can achieve the same effect; set them as needed.
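The arithmetic behind the cap can be sketched in a few lines. This is a minimal illustration, assuming the 30 items per page and 100 visible pages described above; the function names are hypothetical and do not come from the original script.

```python
import math

PAGE_SIZE = 30   # items per listing page (figure quoted in the article)
MAX_PAGES = 100  # pages the site exposes per query (figure quoted in the article)


def needs_filtering(total_count, page_size=PAGE_SIZE, max_pages=MAX_PAGES):
    """Return True when a result set is larger than pagination can reach."""
    return total_count > page_size * max_pages


def pages_to_fetch(total_count, page_size=PAGE_SIZE, max_pages=MAX_PAGES):
    """Number of listing pages to request, capped at max_pages."""
    return min(math.ceil(total_count / page_size), max_pages)


print(needs_filtering(39053))  # True: 39,053 > 30 * 100
print(pages_to_fetch(39053))   # 100: capped by the page limit
print(pages_to_fetch(95))      # 4: three full pages plus one partial page
```

With these helpers, a filter combination is narrow enough as soon as `needs_filtering` returns `False` for its record count.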
The browser's developer tools reveal the URL codes the site uses for layout and orientation. In code, these mappings are represented as:
# Layout: 1-room, 2-room, 3-room, 4-room, 5-room, 5-room+
self.rooms_number = ['l1', 'l2', 'l3', 'l4', 'l5', 'l6']
# Orientation: East, South, West, North, South-North
self.orientation = ['f1', 'f2', 'f3', 'f4', 'f5']
1.2 Detail Page
The detail page contains three sections: price + location, basic + transaction info, and map data.
From the price section you can obtain total price and unit price; from the location you get community name and region hierarchy.
The basic/transaction section provides fields such as listing time, mortgage status, property rights, etc.
The map section includes nearby subway stations, bus stops, and hidden latitude/longitude coordinates.
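Coordinates that are not shown in the rendered page usually sit in an inline script, so a regular expression is one way to pull them out. The sketch below is illustrative only: the `resblockPosition` key, the longitude-then-latitude order, and the sample HTML are assumptions, so inspect the real page source and adjust the pattern to what the page actually contains.

```python
import re

# Hypothetical fragment of a detail page; the real markup may differ.
SAMPLE_HTML = """
<script>
  init({
    resblockName: 'Example Community',
    resblockPosition: '114.057868,22.543099'
  });
</script>
"""

# Matches "resblockPosition: 'x,y'" with either quote style (assumed key name).
POSITION_RE = re.compile(r"resblockPosition\s*:\s*['\"]([\d.]+),([\d.]+)['\"]")


def extract_coordinates(html):
    """Return (longitude, latitude) as floats, or None when no match is found."""
    match = POSITION_RE.search(html)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


print(extract_coordinates(SAMPLE_HTML))
```

Returning `None` instead of raising lets the caller store an empty value for listings whose page lacks the coordinates.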
2. Process Design
Summarizing the workflow:
Check if the city’s total records exceed 3,000. If so, apply filters in the order: region → layout → orientation, stopping when the result set drops below 3,000.
After filters are set, iterate through each listing page, construct pagination URLs, and collect links to detail pages.
Parse each detail page and save the extracted fields to a local file.
A flowchart (illustrated in the original article) visualizes these steps.
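The drill-down in the first step can be sketched as a small recursive function. This is an illustration rather than the article's implementation: `count_for` stands in for a request that returns the record count for a filter combination, the district slugs are made up, and `fake_count` returns invented numbers. Unused filter levels are left as empty strings, mirroring the `get_pages(..., '', '', '')` call in the main page logic.

```python
LAYOUTS = ['l1', 'l2', 'l3', 'l4', 'l5', 'l6']
ORIENTATIONS = ['f1', 'f2', 'f3', 'f4', 'f5']
CAP = 3000  # 30 items per page * 100 pages


def plan_queries(count_for, areas, cap=CAP):
    """Return (area, layout, orientation) tuples, adding one filter
    level at a time until each result set fits under the cap."""
    levels = [areas, LAYOUTS, ORIENTATIONS]
    plans = []

    def drill(chosen):
        # Stop narrowing once the set fits or no filter level is left.
        if count_for(*chosen) <= cap or len(chosen) == len(levels):
            padding = ('',) * (len(levels) - len(chosen))
            plans.append(tuple(chosen) + padding)
            return
        for option in levels[len(chosen)]:
            drill(chosen + [option])

    drill([])
    return plans


def fake_count(area='', layout='', orientation=''):
    """Hypothetical counts used only to exercise plan_queries."""
    if not area:
        return 39053          # whole city: too large
    if area == 'district-a':
        return 800 if layout else 5000
    return 1200               # any other district already fits


for plan in plan_queries(fake_count, ['district-a', 'district-b']):
    print(plan)
```

Here `district-a` is split by layout because 5,000 exceeds the cap, while `district-b` is crawled with the region filter alone.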
3. Core Code Implementation
Below are the essential code snippets; the full script is available at the end of the article.
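The snippets in this section are methods of a `House` class and rely on attributes set in its constructor, which is not reproduced here. A minimal sketch of what that constructor might look like, inferred from the attribute names the methods use and from the arguments passed at the entry point; the header value and the `max_pages` default of 100 are assumptions.

```python
class House:
    """Skeleton holding the state that the scraper methods read and write."""

    def __init__(self, city_name, url, page_size, save_file_path, max_pages=100):
        self.city_name = city_name
        self.current_url = url              # URL for the current filter
        self.page_size = page_size          # items per listing page
        self.max_pages = max_pages          # assumed page limit per query
        self.save_file_path = save_file_path
        # Placeholder header; a real crawl would send a fuller set.
        self.headers = {'User-Agent': 'Mozilla/5.0'}
        # Filter codes for layout and orientation
        self.rooms_number = ['l1', 'l2', 'l3', 'l4', 'l5', 'l6']
        self.orientation = ['f1', 'f2', 'f3', 'f4', 'f5']
        self.area = []       # district filters, filled in on demand
        self.data_info = []  # parsed rows waiting to be written
        self.house_id = []   # the article fills this from the saved CSV


house = House('Shenzhen', 'https://sz.lianjia.com/ershoufang/', 30, 'out.csv')
print(house.page_size * house.max_pages)  # 3000
```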
3.1 Get House Count
import requests
from bs4 import BeautifulSoup

def get_house_count(self):
    """Get the number of houses for the current filter condition."""
    # Fetch the listing page for the current filter URL
    response = requests.get(url=self.current_url, headers=self.headers)
    # Parse with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
    # Extract the total count and trim surrounding whitespace
    count = soup.find('h2', class_='total fl').find('span').string.strip()
    return soup, count
3.2 Main Page Logic
def get_main_page(self):
    # Get total count for current filter
    soup, count_main = self.get_house_count()
    if int(count_main) > self.page_size * self.max_pages:
        # Retrieve all districts as the first filter
        soup_uls = soup.find('div', attrs={'data-role': 'ershoufang'}).div.find_all('a')
        self.area = self.get_area_list(soup_uls)
        for area in self.area:
            self.get_area_page(area)
    else:
        # Directly fetch data without extra filtering
        self.get_pages(int(count_main), '', '', '')
    # Save results
    self.data_to_csv()
3.3 Data Saving
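import os
import pandas as pd

def data_to_csv(self):
    """Append or write data to a CSV file."""
    df_data = pd.DataFrame(self.data_info)
    if os.path.exists(self.save_file_path) and os.path.getsize(self.save_file_path):
        # File already has content: append rows without repeating the header
        df_data.to_csv(self.save_file_path, mode='a', encoding='utf-8', header=False, index=False)
    else:
        # New or empty file: write the header row along with the data
        df_data.to_csv(self.save_file_path, mode='a', encoding='utf-8', index=False)
    # Clear the buffer so the same rows are not written twice
    self.data_info = []
3.4 Duplicate Check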
# In __init__: IDs of already saved houses for deduplication
self.house_id = self.get_exists_house_id()

def get_exists_house_id(self):
    """Read existing CSV and return a list of house IDs."""
    # Skip missing or empty files; pd.read_csv raises on an empty file
    if os.path.exists(self.save_file_path) and os.path.getsize(self.save_file_path):
        df_data = pd.read_csv(self.save_file_path, encoding='utf-8')
        # Compare IDs as strings so they match values parsed from the page
        df_data['house_id'] = df_data['house_id'].astype(str)
        return df_data['house_id'].to_list()
    else:
        return []
3.5 Entry Point
if __name__ == '__main__':
    city_number = 'sz'
    city_name = '深圳'  # Shenzhen
    url = 'https://{0}.lianjia.com/ershoufang/'.format(city_number)
    page_size = 30
    save_file_path = '二手房数据-sz.csv'
    house = House(city_name, url, page_size, save_file_path)
    house.get_main_page()
4. Execution Screenshots
First run (parameters set):
Subsequent run (no parameters needed):
Sample of the extracted data:
5. Final Remarks
Although the workflow is a bit involved, the overall approach remains a solid introductory example for Python web scraping and data mining.
For more details on proxy settings, request headers, and HTML parsing, refer to the author's earlier articles.
Remember to add reasonable sleep intervals during crawling to act responsibly.
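One way to space out requests is to pause for a random interval between fetches. The sketch below is a minimal example under assumed values: the 1 to 3 second range and the helper names `polite_pause` and `crawl` are illustrative and not taken from the original script.

```python
import random
import time


def polite_pause(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Sleep for a random interval and return the chosen duration."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


def crawl(urls, fetch, pause=polite_pause):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            pause()  # no pause before the very first request
        results.append(fetch(url))
    return results
```

In a real crawl, `fetch` could wrap `requests.get` with the scraper's headers and a timeout. Passing `sleep` and `pause` as parameters keeps the helpers easy to exercise without actually waiting.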
Python Crawling & Data Mining
