32 Must‑Try Python Web Scraping Projects to Boost Your Data Skills

This article presents a curated list of 32 Python web‑scraping projects, each with a brief description of its purpose, technology stack, and data output format, helping developers quickly find useful open‑source crawlers on GitHub.

MaGe Linux Operations

Today we have compiled 32 Python web-scraping projects. Each of them can be found by searching its name on GitHub.

WechatSogou – a WeChat public‑account crawler based on Sogou search, returning a list of account dictionaries.

DouBanSpider – crawls books under DouBan tags, ranks by rating, stores results in Excel, uses User‑Agent spoofing and random delays.
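
The anti-blocking tricks DouBanSpider relies on, User-Agent spoofing plus random delays between requests, boil down to a few lines. Here is a minimal sketch; the UA strings and delay bounds are illustrative assumptions, not values taken from the project:

```python
import random
import time

# Illustrative browser User-Agent strings (not the project's own list).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
]

def spoofed_headers():
    """Return request headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}

def polite_get(session, url, min_delay=1.0, max_delay=3.0):
    """Sleep a random interval, then fetch the URL with spoofed headers.

    `session` is anything with a requests-style .get() method, e.g. a
    requests.Session, which also keeps cookies across calls.
    """
    time.sleep(random.uniform(min_delay, max_delay))
    return session.get(url, headers=spoofed_headers(), timeout=10)
```

The random pause keeps the request rate irregular, which is harder for rate-limiting heuristics to flag than a fixed interval.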

zhihu_spider – scrapes Zhihu user info and social graph using Scrapy, stores data in MongoDB.
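
Several projects in this list (zhihu_spider, tbcrawler, wooyun_public) persist scraped items to MongoDB. A minimal sketch of the Scrapy item-pipeline pattern this usually involves; the `user_id` upsert key is an assumption, and `collection` is any object exposing a pymongo-style `update_one` (a real pymongo Collection in production):

```python
class MongoPipeline:
    """Sketch of a Scrapy item pipeline that upserts items into MongoDB.

    Upserting on a stable key means re-crawls update existing documents
    instead of inserting duplicates.
    """

    def __init__(self, collection):
        # In a real deployment this would be built from crawler settings,
        # e.g. pymongo.MongoClient(uri)["zhihu"]["users"].
        self.collection = collection

    def process_item(self, item, spider):
        self.collection.update_one(
            {"user_id": item["user_id"]},  # assumed unique key
            {"$set": dict(item)},
            upsert=True,
        )
        return item
```

Scrapy calls `process_item` once per scraped item; returning the item lets later pipelines in the chain see it too.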

bilibili-user – extracts Bilibili user data (ID, nickname, gender, avatar, level, experience, followers, etc.) and generates a user report.

SinaSpider – crawls Sina Weibo user profiles, posts, followers and followings; logs in with cookies and supports multi‑account login to bypass anti‑scraping.

distribute_crawler – a distributed novel‑download crawler built with Scrapy, Redis, MongoDB and Graphite for status monitoring.
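
The distributed part of distribute_crawler typically rests on a shared Redis frontier: a dedup set plus a work queue that every worker pulls from. A rough sketch of that pattern; the key names are illustrative, and `client` is any object with Redis-style `sadd`/`lpush`/`rpop` (a `redis.Redis` instance in production):

```python
class RedisFrontier:
    """Shared URL frontier: dedup set + FIFO work queue."""

    def __init__(self, client, queue="crawler:queue", seen="crawler:seen"):
        self.client = client
        self.queue = queue
        self.seen = seen

    def push(self, url):
        """Enqueue a URL only if it has never been seen before."""
        # SADD returns 1 when the member is newly added, 0 if it existed.
        if self.client.sadd(self.seen, url):
            self.client.lpush(self.queue, url)
            return True
        return False

    def pop(self):
        """Pop the next URL for any worker; None when the queue is empty."""
        return self.client.rpop(self.queue)
```

Because both the set and the list live in Redis, any number of worker processes on any number of machines can push and pop concurrently without re-crawling the same URL.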

CnkiSpider – crawls China National Knowledge Infrastructure (CNKI) after setting search criteria; saves data files with field headers.

LianJiaSpider – scrapes Beijing second‑hand house transaction records from LianJia, includes login simulation code.

scrapy_jingdong – a JD.com crawler based on Scrapy, outputs CSV files.

QQ‑Groups‑Spider – batch extracts QQ group information (name, number, member count, owner, description) and exports it to XLS(X) or CSV files.

wooyun_public – crawls public vulnerability data from Wooyun, stores results in MongoDB; includes Flask web UI for search.

spider – hao123 site crawler that collects outbound links, internal links, and page titles; it processes roughly 100,000 URLs per 24 hours on a Windows 7 machine.

findtrip – flight‑ticket crawler for Qunar and Ctrip built with Scrapy.

163spider – NetEase client content crawler using requests, MySQLdb, torndb.

doubanspiders – collection of crawlers for DouBan movies, books, groups, albums, etc.

QQSpider – QQ space crawler for logs, posts, personal info; can fetch 4 million records per day.

baidu-music-spider – full‑site Baidu MP3 crawler with Redis support for resumable downloads.

tbcrawler – Taobao and Tmall crawler that extracts page information by keyword or item ID, stores data in MongoDB.

stockholm – stock‑data (Shanghai & Shenzhen) crawler and strategy‑testing framework; saves JSON/CSV.

BaiduyunSpider – Baidu Cloud Disk crawler.

Spider – social‑data crawler supporting Weibo, Zhihu, DouBan.

proxy pool – Python proxy‑IP pool for crawlers.

music-163 – crawls comments of all songs on NetEase Cloud Music.

jandan_spider – scrapes pictures from Jandan.

CnblogsSpider – crawls list pages of Cnblogs.

spider_smooc – scrapes video courses from MOOC.

CnkiSpider – another CNKI crawler.

knowsecSpider2 – solution to the Knownsec (知道创宇) crawling challenge.

aiss-spider – Aiss app image crawler.

SinaSpider – uses dynamic IPs to bypass Sina anti‑scraping and quickly fetch content.

csdn-spider – crawls blog articles from CSDN.

ProxySpider – crawls proxy IPs from Xici and validates their usability.
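
Validating scraped proxies, as ProxySpider does, usually means trying a cheap request through each candidate and keeping the ones that answer in time. A stdlib-only sketch of that check; the test URL and timeout are assumptions, not the project's own settings:

```python
import urllib.error
import urllib.request

def build_proxies(ip, port, scheme="http"):
    """Build a scheme-to-proxy mapping for one candidate address."""
    addr = f"{scheme}://{ip}:{port}"
    return {"http": addr, "https": addr}

def is_alive(ip, port, test_url="http://httpbin.org/ip", timeout=5):
    """Return True if the candidate proxy can fetch test_url in time."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler(build_proxies(ip, port))
    )
    try:
        with opener.open(test_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

Running `is_alive` over the scraped list and keeping only the survivors yields exactly the kind of validated pool that the proxy-pool project above maintains.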

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: GitHub, Crawler, web-scraping, data-collection
Written by

MaGe Linux Operations

Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.
