Python Web Scraping Tutorial: Crawling QDaily, Storing in SQLite, Analyzing Data and Generating a Word Cloud
This tutorial walks through building a simple Python web crawler for the QDaily website, covering target analysis, environment setup, SQLite database creation, data extraction with requests and BeautifulSoup, storing articles and comments, performing basic analysis, and visualizing results with a word cloud.
Python offers a wealth of libraries for beginners, and this guide presents a complete, beginner‑friendly web‑scraping project that combines data collection, storage, analysis, and visualization.
Target site: The popular Chinese site "好奇心日报" (QDaily) provides rich, high‑resolution articles and images, making it an ideal target for crawling.
Preparation: Analyze the site's structure, check for anti‑scraping measures, and plan the workflow. The project uses the lightweight sqlite3 database for storage.
1) Create the database: A QDaily_DB class handles creation, insertion, and closing of the SQLite file qdaily.db. Two tables are defined: qdality for article metadata (id, title, likes, shares, date, comment count) and comments for storing each comment linked to its article id.
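A minimal sketch of such a class, using only the standard-library sqlite3 module. The table name qdality is kept exactly as the tutorial spells it; the column names are illustrative assumptions, since the original schema is not shown:

```python
import sqlite3

class QDaily_DB:
    """Thin wrapper around the SQLite file used by the crawler."""

    def __init__(self, path="qdaily.db"):
        self.conn = sqlite3.connect(path)
        cur = self.conn.cursor()
        # Article metadata table (name kept as in the tutorial).
        cur.execute(
            "CREATE TABLE IF NOT EXISTS qdality ("
            "id INTEGER PRIMARY KEY, title TEXT, praise INTEGER, "
            "share INTEGER, date TEXT, comment_count INTEGER)"
        )
        # One row per comment, linked back to its article by id.
        cur.execute(
            "CREATE TABLE IF NOT EXISTS comments ("
            "article_id INTEGER, comment TEXT)"
        )
        self.conn.commit()

    def close(self):
        self.conn.close()
```

Passing ":memory:" as the path gives a throwaway in-memory database, which is handy for testing the schema before writing to qdaily.db.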
2) Web crawling: The crawler uses only requests and BeautifulSoup, avoiding complex frameworks or concurrency to keep the load gentle on QDaily's servers. For each article id, a URL is constructed, fetched, and the HTML content is passed to a parsing function.
Construct URLs from article IDs, handle encoding, retrieve html_content, and feed it to a parser. Wrap network calls in try/except/finally for robustness.
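A sketch of that fetch step, assuming an article-id-based URL pattern (the BASE_URL below is a placeholder, not QDaily's actual route):

```python
import requests

# Assumed URL pattern; the real QDaily article route may differ.
BASE_URL = "https://www.qdaily.com/articles/{}.html"

def build_url(article_id):
    """Construct an article URL from its numeric id."""
    return BASE_URL.format(article_id)

def fetch_article(article_id, timeout=10):
    """Fetch one article's HTML, returning None on any network error."""
    url = build_url(article_id)
    resp = None
    try:
        resp = requests.get(url, timeout=timeout)
        resp.encoding = resp.apparent_encoding  # handle declared-vs-actual encoding
        resp.raise_for_status()
        return resp.text
    except requests.RequestException as exc:
        print(f"failed to fetch {url}: {exc}")
        return None
    finally:
        if resp is not None:
            resp.close()
```

Returning None instead of raising lets the main loop skip a bad id and keep crawling.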
3) Parsing and comment extraction: BeautifulSoup parses the article page to extract the required fields. Fetching comments requires a custom request header to bypass restrictions; the responses are then parsed and stored.
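The parsing step might look like the sketch below. Both the CSS class names and the header set are illustrative assumptions; QDaily's real markup and the exact headers its comment endpoint checks are not shown in the tutorial:

```python
from bs4 import BeautifulSoup

# A browser-like User-Agent is often enough to satisfy such restrictions;
# the exact headers QDaily requires are an assumption here.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

def parse_article(html):
    """Extract the fields destined for the qdality table.

    The tag and class names below are illustrative, not QDaily's real markup.
    """
    soup = BeautifulSoup(html, "html.parser")
    return {
        "title": soup.find("h2", class_="title").get_text(strip=True),
        "praise": int(soup.find("span", class_="praise").get_text()),
        "share": int(soup.find("span", class_="share").get_text()),
    }

sample = ('<h2 class="title">Example</h2>'
          '<span class="praise">12</span><span class="share">34</span>')
```

html.parser is the stdlib backend, so no extra parser package is needed beyond bs4 itself.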
4) Save data to SQLite: Using the save_db method of QDaily_DB, the collected article and comment data are inserted with standard SQL INSERT statements, followed by commit().
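A save_db function along these lines, shown here as a standalone sketch against an in-memory database (the column layout is assumed, since the tutorial does not print its schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stands in for qdaily.db
conn.execute("CREATE TABLE qdality (id INTEGER PRIMARY KEY, title TEXT, "
             "praise INTEGER, share INTEGER, date TEXT, comment_count INTEGER)")
conn.execute("CREATE TABLE comments (article_id INTEGER, comment TEXT)")

def save_db(conn, article, comments):
    """Insert one article tuple and its comments, then commit once."""
    # Parameterized INSERTs keep quoting safe for arbitrary comment text.
    conn.execute("INSERT INTO qdality VALUES (?, ?, ?, ?, ?, ?)", article)
    conn.executemany(
        "INSERT INTO comments VALUES (?, ?)",
        [(article[0], c) for c in comments],
    )
    conn.commit()

save_db(conn, (1, "Example", 12, 34, "2019-06-01", 2), ["哈哈哈", "谢谢"])
```

Committing once per article, rather than per row, keeps the write overhead low when storing thousands of comments.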
5) Data presentation: After crawling roughly 50,000 articles, the data is sorted by share count and comment count to identify popular content. Visualizations show that newer articles tend to receive more shares, suggesting a growing user base.
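With the data in SQLite, that ranking is a single ORDER BY query. A self-contained sketch with toy rows (same assumed schema as above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE qdality (id INTEGER, title TEXT, "
             "share INTEGER, comment_count INTEGER)")
conn.executemany("INSERT INTO qdality VALUES (?, ?, ?, ?)",
                 [(1, "A", 10, 3), (2, "B", 50, 7), (3, "C", 20, 1)])

# Most-shared articles first; swap in comment_count to rank by discussion instead.
top_shared = conn.execute(
    "SELECT title, share FROM qdality ORDER BY share DESC LIMIT 2"
).fetchall()
```

Pushing the sort into SQL avoids loading all 50,000 rows into Python just to rank them.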
6) Word cloud generation: Using matplotlib and wordcloud, a word cloud of the most frequent comment words is created, revealing common short expressions such as "哈哈哈" (hahaha), "是的" (yes), "呵呵" (heh), and "谢谢" (thanks).
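The frequency counting itself needs only the standard library; the rendering step is sketched in comments because it needs the third-party wordcloud and matplotlib packages, and the font path below is an assumption (wordcloud needs a CJK-capable font to draw Chinese text at all):

```python
from collections import Counter

# Toy comment list standing in for the crawled comments table.
comments = ["哈哈哈", "是的", "哈哈哈", "谢谢", "哈哈哈", "是的"]
freqs = Counter(comments)  # word -> occurrence count

# Rendering sketch (requires wordcloud and matplotlib; font_path is assumed):
# from wordcloud import WordCloud
# import matplotlib.pyplot as plt
# wc = WordCloud(font_path="simhei.ttf", background_color="white")
# wc.generate_from_frequencies(freqs)
# plt.imshow(wc, interpolation="bilinear")
# plt.axis("off")
# plt.show()
```

generate_from_frequencies takes the counts directly, which sidesteps wordcloud's built-in tokenizer; that matters for Chinese, where whitespace does not separate words.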
Overall, this end‑to‑end example demonstrates how to combine Python’s web‑scraping, database, and data‑visualization capabilities into a cohesive project that is especially useful for beginners looking to strengthen their fundamentals.
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.