How to Build a Robust Python Web Crawler for Forum Comments with Scrapy & Selenium

This article walks through building a Python web crawler that extracts forum post comments into MongoDB, covering project goals, environment setup, site structure analysis, Scrapy and Selenium integration, data storage design, handling anti‑scraping measures, and performance optimization with multithreading.

Python Crawling & Data Mining

Mar 11, 2021

1. Introduction

Web crawlers (also known as spiders or bots) are programs that automatically fetch information from the web according to predefined rules. In practice, a crawler collects data in bulk and structures it for downstream use in fields such as big data analytics, finance, and machine learning.

2. Project Goal

The goal is to crawl every comment on every forum post into a database, support incremental updates, avoid crawling the same post twice, and cope with the site's anti-scraping measures.

3. Project Preparation

Tools: PyCharm. Libraries: Scrapy, Selenium, pymongo, user_agent, and datetime (part of the Python standard library). Target site: http://bbs.foodmate.net. Selenium also needs a ChromeDriver build that matches the installed Chrome version.
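The user_agent library in the list above is commonly used to vary the User-Agent header between requests. The following is a minimal sketch of that idea, not the original project's code: the names pick_user_agent, RandomUserAgentMiddleware, and FALLBACK_USER_AGENTS are hypothetical, and the fallback list is only there so the example still runs when user_agent is not installed.

```python
import random

# Hypothetical fallback values, used only if the user_agent package is missing.
FALLBACK_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:85.0) Gecko/20100101 Firefox/85.0",
]


def pick_user_agent(generate=None):
    """Return a User-Agent string, preferring the user_agent library."""
    if generate is None:
        try:
            from user_agent import generate_user_agent as generate
        except ImportError:
            return random.choice(FALLBACK_USER_AGENTS)
    return generate()


class RandomUserAgentMiddleware:
    """Scrapy-style downloader middleware that sets a fresh User-Agent per request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = pick_user_agent()
        return None  # None tells Scrapy to continue processing the request
```

To take effect in a Scrapy project, a middleware like this has to be registered under DOWNLOADER_MIDDLEWARES in settings.py.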

4. Project Analysis

1) Determine site structure

Identify how the site loads (static vs dynamic) and the hierarchy needed to reach each post.

2) Choose crawling method

There are two common approaches: sending plain HTTP requests yourself (for example, with the requests library) or using the Scrapy framework. Scrapy is preferred here for its asynchronous engine, XPath support, logging, middleware, and pipelines.

5. Implementation

Step 1: Identify website type

Check whether the site is static or dynamic. At this stage the target forum is static: the content still appears when the page is loaded with JavaScript disabled.
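One way to run this check from code is to download the raw HTML, where no JavaScript executes, and look for text that is visible in the browser. This is a rough sketch using only the standard library; the function names and the sample snippets are assumptions made for illustration.

```python
import urllib.request


def fetch_raw_html(url, timeout=10):
    """Download the page exactly as the server sends it; no JavaScript runs."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def content_is_static(raw_html, expected_snippets):
    """Return True if every snippet seen in the browser is already in the raw HTML."""
    return all(snippet in raw_html for snippet in expected_snippets)


if __name__ == "__main__":
    # Inline samples keep the demonstration offline; with a real page,
    # pass the result of fetch_raw_html(url) instead.
    print(content_is_static("<div>First comment</div>", ["First comment"]))
    print(content_is_static('<div id="app"></div>', ["First comment"]))
```

If the snippets are missing from the raw HTML but visible in the browser, the content is most likely being loaded by JavaScript.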

Step 2: Determine hierarchy

The forum has three levels: board → thread list → post detail.
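The three levels translate into three nested fetches. The sketch below shows the traversal in a framework-agnostic way with the standard library; in Scrapy, each level would typically become its own callback that yields requests for the next level. The CSS class names "board" and "thread" are hypothetical placeholders, not the forum's real markup, and fetch is any function that returns HTML for a URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and self.css_class in classes and attrs.get("href"):
            self.links.append(attrs["href"])


def extract_links(html, base_url, css_class):
    """Return absolute URLs of matching links found in html."""
    parser = LinkCollector(css_class)
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def walk_forum(start_url, fetch):
    """Yield (post_url, post_html) by walking board -> thread list -> post detail."""
    home = fetch(start_url)
    for board_url in extract_links(home, start_url, "board"):      # level 1
        listing = fetch(board_url)
        for post_url in extract_links(listing, board_url, "thread"):  # level 2
            yield post_url, fetch(post_url)                         # level 3
```

Passing fetch in as a parameter keeps the traversal logic testable without network access.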

Step 3: Choose crawling method

The first attempt used pure Scrapy, but the site later introduced dynamic loading and rate limiting, and the crawler started getting blocked. The solution was to combine Scrapy with Selenium so that pages are rendered in a real browser before extraction.
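A common way to combine the two is a Scrapy downloader middleware that loads each URL in Selenium and hands the rendered HTML back to the spider. The following is a minimal sketch under several assumptions: the class and helper names are hypothetical, ChromeDriver is assumed to be on the PATH, and the original project's middleware may differ. Imports of selenium and scrapy are done lazily so the helper can be exercised on its own.

```python
def render_page(driver, url):
    """Load url in the browser and return the HTML after JavaScript has run."""
    driver.get(url)
    return driver.page_source


class SeleniumMiddleware:
    """Downloader middleware that answers requests with browser-rendered HTML."""

    def __init__(self, driver=None):
        # A driver can be injected; otherwise one is created on first use.
        self.driver = driver

    def _get_driver(self):
        if self.driver is None:
            from selenium import webdriver

            options = webdriver.ChromeOptions()
            options.add_argument("--headless")  # run Chrome without a window
            self.driver = webdriver.Chrome(options=options)
        return self.driver

    def process_request(self, request, spider):
        from scrapy.http import HtmlResponse

        html = render_page(self._get_driver(), request.url)
        # Returning a response here makes Scrapy skip its own download step.
        return HtmlResponse(
            url=request.url, body=html, encoding="utf-8", request=request
        )
```

Rendering every page in a browser is much slower than a plain download, and the sketch omits calling driver.quit() when the spider closes, which a real project should handle.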

Step 4: Store data

Define an Item class (LunTanItem) with fields such as title, content_info, article_url, scrawl_time, source, type, spider_type.

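import scrapy
from scrapy import Field


class LunTanItem(scrapy.Item):
    """Forum fields"""
    title = Field()          # post title
    content_info = Field()
    article_url = Field()    # post URL; used as the upsert key in the pipeline
    scrawl_time = Field()    # time the post was crawled
    source = Field()
    type = Field()
    spider_type = Field()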
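To illustrate how these fields might be filled, here is a small sketch that builds the values as a plain dict. The helper name build_item_fields, the shape of content_info (a list of comment dicts), the timestamp format, and the literal values for source, type, and spider_type are all assumptions; the original article does not show them.

```python
from datetime import datetime


def build_item_fields(title, comments, article_url):
    """Return a dict with one value per LunTanItem field."""
    return {
        "title": title.strip(),
        "content_info": comments,            # assumed: list of comment dicts
        "article_url": article_url,
        "scrawl_time": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        "source": "foodmate forum",          # hypothetical value
        "type": "forum",                     # hypothetical value
        "spider_type": "forum",              # hypothetical value
    }
```

Inside a spider callback, the dict can be turned into an item with LunTanItem(**build_item_fields(...)), since Scrapy items accept their fields as keyword arguments.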

Step 5: Save to MongoDB

Use a pipeline that upserts documents based on article_url to avoid duplicates.

import pymongo


class FMPipeline:
    def __init__(self):
        # Connect to a local MongoDB instance on the default port
        client = pymongo.MongoClient('localhost')
        db = client.scrapy_FM
        self.collection = db.FM

    def process_item(self, item, spider):
        # Match on article_url: update the existing document, or insert a new one
        query = {'article_url': item['article_url']}
        self.collection.update_one(query, {"$set": dict(item)}, upsert=True)
        return item
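The same article_url key can also be used before a request is made, so that posts already in the database are not fetched again. This is a sketch of that idea rather than the original project's code; the function names are hypothetical, and collection is expected to behave like the pymongo collection used in FMPipeline.

```python
def is_already_stored(collection, article_url):
    """Return True if a document with this article_url exists in the collection."""
    # Project only _id so the lookup does not transfer the whole document.
    return collection.find_one({"article_url": article_url}, {"_id": 1}) is not None


def urls_to_crawl(collection, candidate_urls):
    """Keep only the URLs that have not been stored yet, preserving order."""
    return [url for url in candidate_urls if not is_already_stored(collection, url)]
```

Skipping stored URLs saves requests but also misses new comments on old posts. If those matter, re-crawling and relying on the upsert is the safer choice.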

Step 6: Other settings

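Concurrency, request headers, and pipeline order are configured in settings.py. Although this is often described as multithreading, Scrapy issues its requests concurrently on an asynchronous engine rather than on separate threads.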
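A settings.py fragment covering these three points might look like the sketch below. The setting names are standard Scrapy settings, but the values, the header contents, and the module path myproject.pipelines are assumptions chosen for illustration.

```python
# settings.py (fragment)

# Maximum number of requests Scrapy keeps in flight at once.
CONCURRENT_REQUESTS = 16

# Pause between requests to the same site, in seconds; assumed value.
DOWNLOAD_DELAY = 0.5

# Headers sent with every request unless a request overrides them.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
}

# Pipelines run in ascending order of their number (0-1000).
# "myproject" is a hypothetical package name.
ITEM_PIPELINES = {
    "myproject.pipelines.FMPipeline": 300,
}
```

Lower concurrency and a longer delay reduce the chance of triggering rate limits, at the cost of a slower crawl.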

6. Result Demonstration

Running the spider prints logs in the console and shows crawling progress.

With 16 concurrent requests (described here as threads, although Scrapy schedules them asynchronously), the crawler works on many pages at once.

The stored MongoDB documents contain each post’s comments and user information.

7. Summary

The tutorial demonstrates end‑to‑end data collection from a food forum, including site analysis, crawler design, handling anti‑scraping, storing results in MongoDB, and performance tuning. The approach is straightforward once the data pattern is understood, and the code can be adapted to similar projects.
