
Scrapy‑Based Zhihu User Follow/Followers Crawler with MongoDB Storage

This tutorial demonstrates how to build a Scrapy spider that crawls Zhihu user follow and follower data via Zhihu’s public APIs, handles request headers, parses JSON responses, paginates results, and stores the extracted information into MongoDB using a custom item pipeline.

Python Programming Learning Circle

The environment for this project builds on a previous setup, adding MongoDB and the PyMongo driver; it assumes MongoDB is already installed and running.

A simple spider is first shown that fetches a user’s following and follower counts and prints the results; it initially runs into a 500 error, which is resolved by adding appropriate request headers in settings.py.
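The exact headers are not reproduced in this excerpt; a minimal settings.py fragment of the kind the tutorial describes might look like this (the User-Agent string is illustrative, any current browser UA works):

```python
# settings.py (excerpt) -- headers so Zhihu's API stops returning 500;
# the User-Agent value below is illustrative, not from the original post.
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/60.0.3112.113 Safari/537.36'),
    'Accept': 'application/json, text/plain, */*',
    'Referer': 'https://www.zhihu.com/',
}
```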

By inspecting Zhihu’s XHR requests in Firefox, the tutorial identifies the core API endpoints: one for detailed user information (https://www.zhihu.com/api/v4/members/{user}?include={include}) and another for the list of followees (https://www.zhihu.com/api/v4/members/{user}/followees?include={include}&offset={offset}&limit={limit}). The required include parameters and pagination logic are explained.
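Pagination works by stepping offset in increments of limit. A quick sketch of how a followees page URL is built from the template (the helper function here is an illustration, not part of the spider):

```python
# Followees endpoint template, as identified from the XHR requests.
follow_url = ('https://www.zhihu.com/api/v4/members/{user}/followees'
              '?include={include}&offset={offset}&limit={limit}')

def followees_page(user, include, page, limit=20):
    """Build the URL for page `page` of a user's followees (offset = page * limit)."""
    return follow_url.format(user=user, include=include,
                             offset=page * limit, limit=limit)

url = followees_page('satoshi_nakamoto', 'data[*].answer_count', page=2)
# page index 2 -> offset=40
```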

A complete spider implementation is provided, defining URL templates, include strings, and a start user (e.g., satoshi_nakamoto). The start_requests method formats the URLs and yields requests to the parse_user and parse_follow callbacks.

import json

import scrapy

class ZhuHuSpider(scrapy.Spider):
    """Zhihu user spider"""
    name = 'zhuhu'
    allowed_domains = ['zhihu.com']
    user_detail = 'https://www.zhihu.com/api/v4/members/{user}?include={include}'
    user_include = (
        'allow_message,is_followed,'
        'is_following,'
        'is_org,is_blocking,'
        'employments,'
        'answer_count,'
        'follower_count,'
        'articles_count,'
        'gender,'
        'badge[?(type=best_answerer)].topics'
    )
    follow_url = 'https://www.zhihu.com/api/v4/members/{user}/followees?include={include}&offset={offset}&limit={limit}'
    follow_include = (
        'data[*].answer_count,'
        'articles_count,'
        'gender,'
        'follower_count,'
        'is_followed,'
        'is_following,'
        'badge[?(type=best_answerer)].topics'
    )
    start_user = 'satoshi_nakamoto'

    def start_requests(self):
        yield scrapy.Request(
            self.user_detail.format(user=self.start_user, include=self.user_include),
            callback=self.parse_user)
        yield scrapy.Request(
            # Start at offset=0 so the first page of followees is not skipped.
            self.follow_url.format(user=self.start_user, include=self.follow_include, offset=0, limit=20),
            callback=self.parse_follow)

    def parse_user(self, response):
        # processing logic will be added later
        pass

    def parse_follow(self, response):
        # processing logic will be added later
        pass

The parse_user method converts the response to JSON, populates a UserItem with all available fields, yields the item, and schedules a request for the user’s followees using the formatted follow URL.

def parse_user(self, response):
    """Parse detailed user information"""
    results = json.loads(response.text)
    item = UserItem()
    for field in item.fields:
        if field in results:
            item[field] = results.get(field)
    yield item
    yield scrapy.Request(
        self.follow_url.format(user=results.get('url_token'), include=self.follow_include, offset=0, limit=20),
        callback=self.parse_follow)
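The UserItem class itself is not shown in this excerpt. The field-copy loop in parse_user can be illustrated with plain dictionaries (the field names here are assumptions based on the include string):

```python
# Stand-in for UserItem.fields: the fields the item declares.
item_fields = {'name', 'url_token', 'gender', 'answer_count', 'follower_count'}

# Stand-in for the parsed JSON response, with extra keys the item ignores.
results = {'name': 'Satoshi', 'url_token': 'satoshi_nakamoto',
           'follower_count': 42, 'is_org': False}

# Same pattern as parse_user: copy only the fields the item declares.
item = {field: results[field] for field in item_fields if field in results}
# item == {'name': 'Satoshi', 'url_token': 'satoshi_nakamoto', 'follower_count': 42}
```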

The parse_follow method extracts the list of followees, recursively requests each followee’s detail page, and follows pagination until the is_end flag is true.

def parse_follow(self, response):
    """Parse followees list and handle pagination"""
    results = json.loads(response.text)
    if 'data' in results:
        for result in results['data']:
            yield scrapy.Request(
                self.user_detail.format(user=result.get('url_token'), include=self.user_include),
                callback=self.parse_user)
    # Explicit check against False: a missing 'paging' key means stop.
    if results.get('paging', {}).get('is_end') is False:
        next_page = results['paging']['next']
        yield scrapy.Request(next_page, callback=self.parse_follow)
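The stop condition can be isolated into a small pure-Python helper mirroring the check above, which makes the pagination logic easy to test on its own:

```python
def next_follow_page(results):
    """Return the URL of the next followees page, or None when pagination ends."""
    paging = results.get('paging', {})
    if paging.get('is_end') is False:  # missing 'paging' also means stop
        return paging.get('next')
    return None

# Mid-list page: Zhihu's paging block carries the next URL.
page = {'paging': {'is_end': False, 'next': 'https://example.invalid/page2'}}
# next_follow_page(page) -> 'https://example.invalid/page2'
```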

To store the scraped data, a custom item pipeline writes items into MongoDB. The pipeline class ZhiHuspiderPipeline opens a connection when the spider starts, authenticates if the server requires it, and performs an upsert keyed on url_token so that re-crawled users overwrite their existing document instead of creating duplicates.

import pymongo

class ZhiHuspiderPipeline(object):
    """Store Zhihu data into MongoDB"""
    collection_name = 'user'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        # Only needed when the server has access control enabled;
        # credentials can also be embedded directly in MONGO_URI.
        self.db_auth = self.client.admin
        self.db_auth.authenticate('admin', 'password')
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Upsert keyed on url_token: insert new users, update re-crawled ones.
        self.db[self.collection_name].update_one(
            {'url_token': item['url_token']},
            {'$set': dict(item)},
            upsert=True
        )
        return item

The MongoDB update syntax is explained, emphasizing the upsert option that inserts a document when it does not already exist, thereby achieving de‑duplication.
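The de-duplication effect of upsert can be seen with a small in-memory analogue, where a dict keyed by url_token stands in for the MongoDB collection:

```python
collection = {}  # url_token -> document, standing in for db['user']

def upsert(item):
    """Insert if url_token is unseen, otherwise merge the new fields (like $set)."""
    doc = collection.setdefault(item['url_token'], {})
    doc.update(item)

upsert({'url_token': 'satoshi_nakamoto', 'follower_count': 42})
upsert({'url_token': 'satoshi_nakamoto', 'follower_count': 43})  # updates in place
# len(collection) == 1; the second call updated the existing document
```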

Finally, the tutorial shows the required Scrapy settings.py entries for MongoDB connection and demonstrates successful spider runs, with screenshots of console output and MongoDB documents confirming that the data has been collected and stored correctly.
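The exact settings are not reproduced in this excerpt; they typically look like the following (the URI, database name, and pipeline module path are assumptions, adjust them to your project):

```python
# settings.py (excerpt) -- MongoDB connection and pipeline registration;
# the module path 'zhihuspider.pipelines' is illustrative.
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'zhihu'

ITEM_PIPELINES = {
    'zhihuspider.pipelines.ZhiHuspiderPipeline': 300,
}
```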

Tags: data pipeline, API, MongoDB, web scraping, Scrapy, Zhihu
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
