Scrape Lianjia Real‑Estate Listings with Python and Export to Word

This guide walks through building a Python web‑scraper that fetches house names, prices and popularity from Lianjia, formats the data into a Word template, and generates individual Word documents while demonstrating key libraries, URL handling, HTML parsing, and best‑practice considerations.

Python Crawling & Data Mining
Python Crawling & Data Mining
Python Crawling & Data Mining
Scrape Lianjia Real‑Estate Listings with Python and Export to Word

Introduction

With the growing importance of housing data, the article demonstrates how to crawl second‑hand house listings from Lianjia (bj.lianjia.com) using Python and generate individual Word documents containing the house name, price and popularity.

Project Goal

Extract house name, price, and attention metrics, fill a Word template, and produce separate Word files for each listing.

Target Site and Libraries

Target URL pattern: https://bj.lianjia.com/ershoufang/pg{}/ where the page number is formatted into the URL.

Required libraries: requests , time , lxml .

Implementation Details

1. Class Definition

import requests
from lxml import etree
import time

class LianJia(object):
    def __init__(self):
        self.url = "https://bj.lianjia.com/ershoufang/pg{}/"
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"
        }

    def get_page(self, url):
        html = requests.get(url=url, headers=self.headers).content.decode("utf-8")
        self.page_process(html)

    def page_process(self, html):
        parse_html = etree.HTML(html)
        items = parse_html.xpath('//*[@id="content"]/div[1]/ul/li')
        for li in items:
            house = {}
            house["名称"] = li.xpath('.//div[@class="infoclear"]//div[@class="title"]/a/text()')[0].strip()
            house["价格"] = li.xpath(".//div[@class='priceInfo']/div[@class='totalPrice']/span/text()")[0].strip() + "万"
            house["关注度"] = li.xpath('.//div[@class="info clear"]//div[@class="followInfo"]//text()')[0].strip()
            self.save_to_word(house)

    def save_to_word(self, house_dict):
        with open('房子.doc', 'a', encoding='utf-8') as f:
            f.write(str(house_dict) + "
")
            print(house_dict)

    def main(self):
        for pg in range(1, 101):
            url = self.url.format(pg)
            print("=" * 50)
            self.get_page(url)
            time.sleep(1.4)

if __name__ == "__main__":
    spider = LianJia()
    spider.main()

2. Running the Spider

Execute the script; the console will display progress and the generated Word file (named “房子.doc”) will contain the scraped data.

Result Showcase

Console output
Console output
Word document
Word document

Conclusion

Do not scrape excessively to avoid overloading the server.

The project helps understand recent housing price trends.

Python web crawling combined with Word automation provides a practical way to collect and present real‑estate data.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

PythonData ExtractionWeb ScrapingLianjiaWord Automation
Python Crawling & Data Mining
Written by

Python Crawling & Data Mining

Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.