Python Web Scraping Tutorial: Crawling QDaily, Storing in SQLite, Analyzing Data and Generating a Word Cloud
This tutorial walks through building a simple Python web crawler for the QDaily website, covering target analysis, environment setup, SQLite database creation, data extraction with requests and BeautifulSoup, storing articles and comments, performing basic analysis, and visualizing results with a word cloud.
Python offers a wealth of libraries for beginners, and this guide presents a complete, beginner‑friendly web‑scraping project that combines data collection, storage, analysis, and visualization.
Target site: The popular Chinese site "好奇心日报" (QDaily) provides rich, high‑resolution articles and images, making it an ideal target for crawling.
Preparation: Analyze the site's structure, check for anti‑scraping measures, and plan the workflow. The project uses the lightweight sqlite3 database for storage.
1) Create the database: A QDaily_DB class handles creation, insertion, and closing of the SQLite file qdaily.db. Two tables are defined: qdality for article metadata (id, title, likes, shares, date, comment count) and comments for storing each comment linked to its article id.
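A minimal sketch of such a class, using only the standard-library sqlite3 module. The table name qdality is kept exactly as the tutorial spells it; the column names are illustrative assumptions, since the original schema is not shown:

```python
import sqlite3

class QDaily_DB:
    """Thin wrapper around the SQLite file used by the crawler."""

    def __init__(self, path="qdaily.db"):
        self.conn = sqlite3.connect(path)
        cur = self.conn.cursor()
        # Article metadata table (name kept as in the tutorial).
        cur.execute(
            "CREATE TABLE IF NOT EXISTS qdality ("
            "id INTEGER PRIMARY KEY, title TEXT, praise INTEGER, "
            "share INTEGER, date TEXT, comment_count INTEGER)"
        )
        # One row per comment, linked back to its article by id.
        cur.execute(
            "CREATE TABLE IF NOT EXISTS comments ("
            "article_id INTEGER, comment TEXT)"
        )
        self.conn.commit()

    def close(self):
        self.conn.close()
```

Passing ":memory:" as the path gives a throwaway in-memory database, which is handy for testing the schema before writing to qdaily.db.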
2) Web crawling: The crawler uses only requests and BeautifulSoup, avoiding complex frameworks or concurrency to keep the load gentle on QDaily's servers. For each article id, a URL is constructed, fetched, and the HTML content is passed to a parsing function.
Construct URLs from article IDs, handle encoding, retrieve html_content, and feed it to a parser. Wrap network calls in try/except/finally for robustness.
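A sketch of that fetch step, assuming an article-id-based URL pattern (the BASE_URL below is a placeholder, not QDaily's actual route):

```python
import requests

# Assumed URL pattern; the real QDaily article route may differ.
BASE_URL = "https://www.qdaily.com/articles/{}.html"

def build_url(article_id):
    """Construct an article URL from its numeric id."""
    return BASE_URL.format(article_id)

def fetch_article(article_id, timeout=10):
    """Fetch one article's HTML, returning None on any network error."""
    url = build_url(article_id)
    resp = None
    try:
        resp = requests.get(url, timeout=timeout)
        resp.encoding = resp.apparent_encoding  # handle declared-vs-actual encoding
        resp.raise_for_status()
        return resp.text
    except requests.RequestException as exc:
        print(f"failed to fetch {url}: {exc}")
        return None
    finally:
        if resp is not None:
            resp.close()
```

Returning None instead of raising lets the main loop skip a bad id and keep crawling.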
3) Parsing and comment extraction: BeautifulSoup parses the article page to extract the required fields. Fetching comments requires a custom request header to bypass restrictions; the responses are then parsed and stored.
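The parsing step might look like the sketch below. Both the CSS class names and the header set are illustrative assumptions; QDaily's real markup and the exact headers its comment endpoint checks are not shown in the tutorial:

```python
from bs4 import BeautifulSoup

# A browser-like User-Agent is often enough to satisfy such restrictions;
# the exact headers QDaily requires are an assumption here.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

def parse_article(html):
    """Extract the fields destined for the qdality table.

    The tag and class names below are illustrative, not QDaily's real markup.
    """
    soup = BeautifulSoup(html, "html.parser")
    return {
        "title": soup.find("h2", class_="title").get_text(strip=True),
        "praise": int(soup.find("span", class_="praise").get_text()),
        "share": int(soup.find("span", class_="share").get_text()),
    }

sample = ('<h2 class="title">Example</h2>'
          '<span class="praise">12</span><span class="share">34</span>')
```

html.parser is the stdlib backend, so no extra parser package is needed beyond bs4 itself.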
4) Save data to SQLite: Using the save_db method of QDaily_DB, the collected article and comment data are inserted with standard SQL INSERT statements, followed by commit().
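A save_db function along these lines, shown here as a standalone sketch against an in-memory database (the column layout is assumed, since the tutorial does not print its schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stands in for qdaily.db
conn.execute("CREATE TABLE qdality (id INTEGER PRIMARY KEY, title TEXT, "
             "praise INTEGER, share INTEGER, date TEXT, comment_count INTEGER)")
conn.execute("CREATE TABLE comments (article_id INTEGER, comment TEXT)")

def save_db(conn, article, comments):
    """Insert one article tuple and its comments, then commit once."""
    # Parameterized INSERTs keep quoting safe for arbitrary comment text.
    conn.execute("INSERT INTO qdality VALUES (?, ?, ?, ?, ?, ?)", article)
    conn.executemany(
        "INSERT INTO comments VALUES (?, ?)",
        [(article[0], c) for c in comments],
    )
    conn.commit()

save_db(conn, (1, "Example", 12, 34, "2019-06-01", 2), ["哈哈哈", "谢谢"])
```

Committing once per article, rather than per row, keeps the write overhead low when storing thousands of comments.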
5) Data presentation: After crawling roughly 50,000 articles, the data is sorted by share count and comment count to identify popular content. Visualizations show that newer articles tend to receive more shares, suggesting a growing user base.
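With the data in SQLite, that ranking is a single ORDER BY query. A self-contained sketch with toy rows (same assumed schema as above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE qdality (id INTEGER, title TEXT, "
             "share INTEGER, comment_count INTEGER)")
conn.executemany("INSERT INTO qdality VALUES (?, ?, ?, ?)",
                 [(1, "A", 10, 3), (2, "B", 50, 7), (3, "C", 20, 1)])

# Most-shared articles first; swap in comment_count to rank by discussion instead.
top_shared = conn.execute(
    "SELECT title, share FROM qdality ORDER BY share DESC LIMIT 2"
).fetchall()
```

Pushing the sort into SQL avoids loading all 50,000 rows into Python just to rank them.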
6) Word cloud generation: Using matplotlib and wordcloud, a word cloud of the most frequent comment words is created, revealing common short expressions such as "哈哈哈" (hahaha), "是的" (yes), "呵呵" (heh), and "谢谢" (thanks).
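The frequency counting itself needs only the standard library; the rendering step is sketched in comments because it needs the third-party wordcloud and matplotlib packages, and the font path below is an assumption (wordcloud needs a CJK-capable font to draw Chinese text at all):

```python
from collections import Counter

# Toy comment list standing in for the crawled comments table.
comments = ["哈哈哈", "是的", "哈哈哈", "谢谢", "哈哈哈", "是的"]
freqs = Counter(comments)  # word -> occurrence count

# Rendering sketch (requires wordcloud and matplotlib; font_path is assumed):
# from wordcloud import WordCloud
# import matplotlib.pyplot as plt
# wc = WordCloud(font_path="simhei.ttf", background_color="white")
# wc.generate_from_frequencies(freqs)
# plt.imshow(wc, interpolation="bilinear")
# plt.axis("off")
# plt.show()
```

generate_from_frequencies takes the counts directly, which sidesteps wordcloud's built-in tokenizer; that matters for Chinese, where whitespace does not separate words.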
Overall, this end‑to‑end example demonstrates how to combine Python’s web‑scraping, database, and data‑visualization capabilities into a cohesive project that is especially useful for beginners looking to strengthen their fundamentals.
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.