Big Data · 9 min read

23 Python Web Scraping Projects with GitHub Links

This article compiles twenty‑three Python web‑scraping projects, each described with its purpose, key features, and a direct GitHub repository link, offering developers a ready‑made toolbox for data collection across platforms such as WeChat, DouBan, Zhihu, Bilibili, and more.

Python Programming Learning Circle

Today we present 23 Python web‑scraping projects. All are approachable enough for beginners to build confidence with, and every link points to a GitHub repository.

1. WechatSogou – WeChat Official Account Crawler: An interface built on Sogou WeChat Search that returns a list of public‑account dictionaries. GitHub

2. DouBanSpider – DouBan Books Crawler: Crawls all books under DouBan tags, ranks them by rating, and stores results in Excel (optionally one sheet per theme), using User‑Agent spoofing and random delays to avoid blocking. GitHub
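The User‑Agent spoofing and random delays mentioned above can be sketched in a few lines. This is a minimal illustration, not code from the project, and the User‑Agent strings are placeholders:

```python
import random
import time

# A small pool of desktop User-Agent strings; a real crawler would
# maintain a larger, up-to-date list. These values are illustrative.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def build_headers():
    """Return request headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}

def polite_sleep(min_s=1.0, max_s=3.0):
    """Sleep for a random interval to space out requests; returns the delay."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay
```

Pass `build_headers()` as the `headers=` argument of each request and call `polite_sleep()` between pages.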

3. zhihu_spider – Zhihu Crawler: Retrieves Zhihu user info and the social graph between users; built with Scrapy and stores data in MongoDB. GitHub

4. bilibili-user – Bilibili User Crawler: Captures user ID, nickname, gender, avatar, level, experience, followers, birthday, address, registration time, signature, etc., and generates a Bilibili user data report. GitHub

5. SinaSpider – Sina Weibo Crawler: Extracts personal info, posts, followers, and following from Sina Weibo; obtains cookies for login and supports multiple accounts to evade anti‑scraping measures. GitHub
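The multi‑account idea boils down to rotating through a set of logged‑in sessions so no single account hits rate limits. A minimal sketch, with hypothetical cookie values:

```python
import itertools

class AccountPool:
    """Rotate through logged-in accounts round-robin.

    `accounts` is a list of dicts holding whatever cookies a real
    login produces; the cookie names and values here are hypothetical.
    """
    def __init__(self, accounts):
        self._cycle = itertools.cycle(accounts)

    def next_cookies(self):
        """Return the cookie dict for the next account in rotation."""
        return next(self._cycle)

pool = AccountPool([
    {"SUB": "cookie-for-account-1"},
    {"SUB": "cookie-for-account-2"},
])
```

Each request then takes `cookies=pool.next_cookies()`, spreading load evenly across accounts.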

6. distribute_crawler – Distributed Novel‑Download Crawler: Uses Scrapy, Redis, MongoDB, and Graphite to build a distributed crawler for a novel site, storing data in a MongoDB cluster. GitHub
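In setups like this, Redis typically plays the role of a shared URL frontier: a set for de‑duplication and a list as the work queue. The sketch below simulates those semantics with in‑process structures so the logic is runnable without a Redis server:

```python
from collections import deque

class Frontier:
    """Shared URL frontier with de-duplication.

    In a distributed deployment these structures live in Redis
    (a SET for seen URLs, a LIST as the queue) so many workers can
    share them; a plain set and deque stand in here.
    """
    def __init__(self):
        self._seen = set()     # Redis SET in a real deployment
        self._queue = deque()  # Redis LIST in a real deployment

    def push(self, url):
        """Enqueue a URL only if it has never been seen; return whether it was added."""
        if url not in self._seen:
            self._seen.add(url)
            self._queue.append(url)
            return True
        return False

    def pop(self):
        """Take the next URL to crawl, or None if the queue is empty."""
        return self._queue.popleft() if self._queue else None
```

Because the seen‑set persists across pushes, every worker can safely re‑submit discovered links without duplicating work.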

7. CnkiSpider – CNKI (China National Knowledge Infrastructure) Crawler: After setting search criteria, run src/CnkiSpider.py; results are written under /data, with field names on the first line of each file. GitHub

8. LianJiaSpider – Lianjia Crawler: Scrapes Beijing second‑hand housing transaction records; includes login‑simulation code. GitHub

9. scrapy_jingdong – JD.com Crawler: Scrapy‑based crawler for JD.com, saving results as CSV. GitHub

10. QQ‑Groups‑Spider – QQ Groups Crawler: Batch‑extracts QQ group name, number, member count, owner, and description, outputting XLS(X)/CSV files. GitHub
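The CSV side of such an exporter is a few lines with the standard library. The field names below mirror what the project extracts, but the record itself is made up for illustration:

```python
import csv
import io

# Hypothetical group record; field names follow the fields the
# project extracts (name, number, member count, owner, description).
groups = [
    {"name": "Python学习群", "number": "12345678", "members": 1987,
     "owner": "10001", "description": "Python Q&A"},
]

def to_csv(rows):
    """Serialize group records to CSV text, header row first."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["name", "number", "members", "owner", "description"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

csv_text = to_csv(groups)
```

Writing to a `StringIO` first makes the function easy to test; swapping in `open("groups.csv", "w", newline="")` writes the real file.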

11. wooyun_public – Wooyun Crawler: Crawls public vulnerability data from Wooyun, storing text in MongoDB (~2 GB) and, optionally, full site content; provides a Flask web UI. GitHub

12. spider – hao123 Site Crawler: Starts from hao123, collects outbound links, and records each page's title and in‑link/out‑link counts; roughly 100k URLs per 24 h on a 32‑bit Windows 7 machine. GitHub
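Recording titles and out‑links needs only a small HTML parser. A minimal sketch with the standard library's `html.parser` (the sample page is invented for illustration):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the page <title> and all <a href> links, the minimum a
    breadth-first crawler needs to record titles and out-link counts."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

# Toy page standing in for a fetched document.
page = ('<html><head><title>Demo</title></head><body>'
        '<a href="http://a.com">a</a><a href="http://b.com">b</a>'
        '</body></html>')
parser = LinkCollector()
parser.feed(page)
```

`len(parser.links)` gives the out‑link count; in‑link counts fall out of aggregating these lists across all crawled pages.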

13. findtrip – Flight‑Ticket Crawler: Scrapy‑based flight scraper that integrates Qunar and Ctrip. GitHub

14. 163spider – NetEase Client Content Crawler: Uses requests, MySQLdb, and torndb to fetch NetEase client data. GitHub

15. doubanspiders – DouBan Crawler Collection: Crawls movies, books, groups, albums, and more from DouBan. GitHub

16. QQSpider – Qzone (QQ空间) Crawler: Harvests journals, posts, and personal info; up to 4 million records per day. GitHub

17. baidu‑music‑spider – Baidu Music Crawler: Full‑site Baidu MP3 crawler with Redis support for resumable downloads. GitHub

18. tbcrawler – Taobao/Tmall Crawler: Searches by keyword or item ID and stores the data in MongoDB. GitHub

19. stockholm – Stock‑Data Crawler and Screening Framework: Retrieves Shanghai/Shenzhen market data, supports expression‑based screening strategies and multithreading, and outputs JSON/CSV. GitHub
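An expression‑based screen can be as simple as evaluating a boolean expression against each record's fields. This sketch is not stockholm's actual API, and the field names (`pe`, `roe`) and values are invented:

```python
# Hypothetical records; the field names and numbers are illustrative,
# not stockholm's actual schema.
stocks = [
    {"code": "600000", "pe": 8.2,  "roe": 0.15},
    {"code": "000001", "pe": 35.0, "roe": 0.04},
    {"code": "600519", "pe": 28.0, "roe": 0.30},
]

def screen(records, expr):
    """Keep records for which the boolean expression holds.

    The expression sees each record's fields as local names, mirroring
    the idea of expression-based strategies; eval() is acceptable for
    trusted, user-authored strategy strings, not for untrusted input.
    """
    return [r for r in records if eval(expr, {}, r)]

picks = screen(stocks, "pe < 30 and roe > 0.10")
```

A user can then iterate on the strategy string ("pe < 30 and roe > 0.10") without touching the screening code.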

20. BaiduyunSpider – Baidu Cloud Disk Crawler. GitHub

21. Spider – Social‑Media Data Crawler: Supports Weibo, Zhihu, and DouBan. GitHub

22. proxy_pool – Proxy IP Pool for Python Crawlers. GitHub
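The core of any proxy pool is an add/get/remove cycle: collect candidate proxies, hand a working one to the crawler, and evict proxies reported as dead. The real project also validates proxies against live sites and persists them; this sketch shows only the core cycle, with a made‑up proxy address:

```python
import random

class ProxyPool:
    """Minimal proxy IP pool: add candidates, hand out one at random,
    and drop proxies reported as dead. Validation and persistence,
    which a production pool needs, are omitted."""
    def __init__(self):
        self._proxies = set()

    def add(self, proxy):
        self._proxies.add(proxy)

    def get(self):
        """Return a random live proxy; raise if the pool is empty."""
        if not self._proxies:
            raise LookupError("proxy pool is empty")
        return random.choice(sorted(self._proxies))

    def remove(self, proxy):
        """Evict a proxy (e.g. after a request through it failed)."""
        self._proxies.discard(proxy)
```

The crawler calls `get()` before each request and `remove()` whenever a request through that proxy fails, so dead proxies drain out of the pool naturally.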

23. music-163 – NetEase Cloud Music Comment Crawler. GitHub

Tags: data collection, GitHub, Scrapy, Requests, web scraping
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
