
Build a Fast Sina Weibo Scrapy Spider with MongoDB Storage

This guide details a Python-based Scrapy spider that crawls Sina Weibo user profiles, posts, followers, and followings from the WAP site, and explains the required environment, the configuration steps, and the MongoDB schema used to store the collected data.


Project Overview

The project is similar to a QQ Space crawler but targets Sina Weibo, extracting personal information, tweets, followers and followings.

Key Features

Uses Sina Weibo cookies for login; multiple accounts can be purchased to bypass anti‑scraping measures.

Crawls the simpler WAP version of Weibo (weibo.cn), which is much faster to fetch at the cost of slightly less data per page.

Achieves crawling speeds of over 13 million records per day on a campus network.

Environment & Architecture

Language : Python 2.7

OS : 64‑bit Windows 8, 4 GB RAM, i7‑3612QM CPU

Database : MongoDB 3.2.0

IDE : PyCharm 5.0.4

MongoDB tool : MongoBooster 1.1.1

Built on the Scrapy framework.

Downloader middleware randomly selects a cookie and User‑Agent from pools.
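A minimal sketch of such a downloader middleware, assuming hypothetical `COOKIE_POOL` and `USER_AGENT_POOL` lists (not the project's actual names). Note that no `scrapy` import is needed: a downloader middleware is just a class exposing a `process_request` hook, which Scrapy calls with each outgoing request.

```python
import random

# Hypothetical pools -- in the real project these would be built from
# cookies.py and a list of browser User-Agent strings.
COOKIE_POOL = [
    {"SUB": "cookie-value-1"},
    {"SUB": "cookie-value-2"},
]
USER_AGENT_POOL = [
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]


class RandomCookieUserAgentMiddleware(object):
    """Attach a random cookie and User-Agent to every outgoing request."""

    def process_request(self, request, spider):
        request.cookies = random.choice(COOKIE_POOL)
        request.headers["User-Agent"] = random.choice(USER_AGENT_POOL)
        return None  # None tells Scrapy to continue normal downloading
```

Enable it by adding the class to `DOWNLOADER_MIDDLEWARES` in the project's settings.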

start_requests launches four requests per user ID to fetch profile, tweets, followings and followers.

Newly discovered follower/following IDs are added to the crawl queue after deduplication.
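The fan-out and deduplication steps above can be sketched in plain Python. The weibo.cn URL patterns and helper names here are illustrative assumptions, not necessarily the project's actual ones:

```python
# Fan-out: each user ID spawns four WAP page requests
# (profile, tweets, followings, followers).
WAP_PAGES = ("info", "tweets", "follow", "fans")


def requests_for_user(user_id):
    """Build the four page URLs for one user (illustrative patterns)."""
    return ["https://weibo.cn/%s/%s" % (user_id, page) for page in WAP_PAGES]


# Deduplicated crawl queue: a newly discovered follower/following ID
# is enqueued only the first time it is seen.
seen_ids = set()
queue = []


def enqueue(user_id):
    if user_id not in seen_ids:
        seen_ids.add(user_id)
        queue.append(user_id)
```

In the actual spider, `start_requests` would yield a `scrapy.Request` per URL instead of returning strings, and Scrapy's scheduler (or a Redis set, in distributed setups) would typically handle deduplication.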

Pre‑Launch Configuration

MongoDB only needs to be installed and running; no extra configuration.

Install Scrapy for 64‑bit Python and other dependencies.

Additional Python packages: pymongo and requests ( json and base64 ship with the standard library and need no installation).

Add your Weibo login credentials to cookies.py (template includes two example accounts).
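The credential file might be structured like this minimal sketch; the variable name, field names, and accounts are placeholders, not the template's guaranteed layout:

```python
# cookies.py -- sketch of a two-account credential list.
# "no" = login account, "psw" = password (placeholder values).
myWeiBo = [
    {"no": "account_one@example.com", "psw": "password1"},
    {"no": "account_two@example.com", "psw": "password2"},
]
```

At startup the spider would log in with each pair (e.g. via requests) and cache the resulting cookies for the middleware's cookie pool.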

Adjust Scrapy settings such as download delay, log level, and concurrent requests as needed.
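The settings named above map onto standard Scrapy options; the values below are examples, not the project's shipped defaults:

```python
# settings.py -- example values for the knobs mentioned above.
DOWNLOAD_DELAY = 2        # seconds to wait between requests
LOG_LEVEL = "INFO"        # quieter than Scrapy's default DEBUG
CONCURRENT_REQUESTS = 16  # global cap on in-flight requests
COOKIES_ENABLED = True    # required for cookie-based login
```

Lowering `DOWNLOAD_DELAY` and raising `CONCURRENT_REQUESTS` trades politeness for speed, so tune them against how aggressively Weibo rate-limits your accounts.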


Database Description

SinaSpider stores data in four collections: Information , Tweets , Follows , and Fans . Below are the fields for the first two collections.

<code>Information collection fields:
_id          : User ID (unique)
Birthday     : Date of birth
City         : City
Gender       : Gender
Marriage     : Marital status
NickName     : Weibo nickname
Num_Fans     : Number of fans
Num_Follows  : Number of followings
Num_Tweets   : Number of tweets posted
Province     : Province
Signature    : Personal signature
URL          : Link to personal homepage</code>
<code>Tweets collection fields:
_id          : "UserID‑TweetID" (unique)
Co_oridinates: Geolocation coordinates of the tweet
Comment      : Number of comments
Content      : Tweet text
ID           : User ID
Like         : Number of likes
PubTime      : Publication time
Tools        : Device or platform used to post
Transfer     : Number of retweets</code>
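A sketch of shaping one scraped tweet into the Tweets schema above and upserting it. The helper name is an assumption, and the pymongo call is left as a comment so the snippet runs without a live MongoDB:

```python
def make_tweet_doc(user_id, tweet_id, content, pub_time,
                   comment=0, like=0, transfer=0):
    """Build one document matching the Tweets collection fields."""
    return {
        "_id": "%s-%s" % (user_id, tweet_id),  # unique per user+tweet
        "ID": user_id,
        "Content": content,
        "PubTime": pub_time,
        "Comment": comment,
        "Like": like,
        "Transfer": transfer,
    }


doc = make_tweet_doc("1234567890", "Abc123", "hello weibo",
                     "2016-01-01 12:00")

# With pymongo and a running MongoDB, an idempotent upsert would be:
# from pymongo import MongoClient
# db = MongoClient()["Sina"]
# db.Tweets.replace_one({"_id": doc["_id"]}, doc, upsert=True)
```

Upserting on the composite `_id` is what lets the spider re-crawl a user without creating duplicate tweet records.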
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
