How to Build a Robust Python Scrapy + Selenium Web Crawler for Forum Data
This tutorial walks through building a Python web crawler with Scrapy and Selenium that extracts forum comments, stores them in MongoDB, handles anti‑scraping measures, and avoids duplicate data, demonstrating the full end‑to‑end process with code examples and results.
1. Introduction
Web crawlers (also known as spiders or bots) are programs that automatically fetch web information according to rules. They are essential for big data, finance, machine learning, and many other fields.
2. Project Goal
The objective is to crawl every comment of forum posts into a database, support data updates, prevent duplicate crawling, and handle anti‑scraping measures.
3. Project Preparation
Tools: PyCharm; Libraries: Scrapy, Selenium, pymongo, user_agent, datetime; Target site: http://bbs.foodmate.net; ChromeDriver for Selenium.
4. Project Analysis
4.1 Determine site structure
Identify the loading method (static vs. dynamic) and the hierarchical navigation needed to reach post pages.
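A simple way to tell static from dynamic loading is to fetch the page without a browser and check whether the text you see on screen appears in the raw HTML. A minimal sketch, assuming the requests library is installed (it is not part of the project's tool list) and using a placeholder keyword:

    # Fetch the raw HTML without a browser. If the visible text is present,
    # the page is rendered server-side (static); if it is missing, the page
    # is filled in by JavaScript (dynamic) and needs a rendering step.
    import requests

    resp = requests.get("http://bbs.foodmate.net", timeout=10)
    if "foodmate" in resp.text:   # replace with any text visible in the browser
        print("Content appears in raw HTML -> static page")
    else:
        print("Content missing from raw HTML -> dynamically loaded")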
4.2 Choose crawling method
Use Scrapy for static pages; combine Selenium when dynamic loading or anti‑scraping is present.
5. Implementation
5.1 Step 1 – Identify site type
The site is a mostly static forum; this was verified by loading a page and confirming that the content appears directly in the raw HTML.
5.2 Step 2 – Determine hierarchy
Three‑level navigation: board → thread list → post page.
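A minimal sketch of how the three levels can map onto Scrapy callbacks; the spider name, start URL, and every XPath selector below are placeholders for illustration, not the article's actual code.

    # Three-level navigation: board index -> thread list -> post page.
    import scrapy

    class ForumSpider(scrapy.Spider):
        name = "foodmate"                       # hypothetical spider name
        start_urls = ["http://bbs.foodmate.net"]

        def parse(self, response):
            # Level 1: board index -> follow each board link
            for href in response.xpath('//a[@class="board-link"]/@href').getall():
                yield response.follow(href, callback=self.parse_board)

        def parse_board(self, response):
            # Level 2: thread list -> follow each post, then the next page
            for href in response.xpath('//a[@class="thread-link"]/@href').getall():
                yield response.follow(href, callback=self.parse_post)
            next_page = response.xpath('//a[@class="next"]/@href').get()
            if next_page:
                yield response.follow(next_page, callback=self.parse_board)

        def parse_post(self, response):
            # Level 3: post page -> extract the comments here
            pass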
5.3 Step 3 – Crawling method
The project initially used Scrapy alone but ran into rate‑limit restrictions, so it switched to Scrapy + Selenium with headless Chrome to render pages before extraction (see the middleware sketch below).
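One common pattern for combining the two is a Scrapy downloader middleware that lets headless Chrome render each page and hands the rendered HTML back to the spider. The class name here is an assumption for illustration, not the article's exact implementation; it would be enabled via DOWNLOADER_MIDDLEWARES in settings.py.

    # Downloader middleware: render the page with headless Chrome, then return
    # an HtmlResponse so Scrapy skips its own download and the spider parses
    # the browser-rendered HTML instead.
    from scrapy.http import HtmlResponse
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    class SeleniumMiddleware:
        def __init__(self):
            options = Options()
            options.add_argument("--headless")   # render without opening a window
            self.driver = webdriver.Chrome(options=options)

        def process_request(self, request, spider):
            self.driver.get(request.url)
            body = self.driver.page_source
            return HtmlResponse(url=request.url, body=body,
                                encoding="utf-8", request=request)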
5.4 Step 4 – Data storage format
Define item fields in items.py:
    import scrapy
    from scrapy import Field

    class LunTanItem(scrapy.Item):
        title = Field()          # post title
        content_info = Field()   # comment text
        article_url = Field()    # unique post URL, used as the dedup key
        scrawl_time = Field()    # time the record was crawled
        source = Field()         # source site
        type = Field()           # content type
        spider_type = Field()    # which spider produced the record
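For context, a hypothetical post‑page callback showing how these fields might be populated; the selectors and literal values are placeholders rather than the article's code.

    # Hypothetical construction of a LunTanItem inside the spider's
    # post-page callback (a method on the spider class shown earlier).
    # LunTanItem is imported from the project's items.py.
    from datetime import datetime

    def parse_post(self, response):
        item = LunTanItem()
        item["title"] = response.xpath('//h1/text()').get()
        item["content_info"] = response.xpath('//td[@class="t_f"]//text()').getall()
        item["article_url"] = response.url
        item["scrawl_time"] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        item["source"] = "bbs.foodmate.net"
        item["type"] = "forum"
        item["spider_type"] = "scrapy_selenium"
        yield item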
5.5 Step 5 – Database
Store results in MongoDB; use an upsert operation to avoid duplicate entries.
    import pymongo

    class FMPipeline:
        def __init__(self):
            # Connect to local MongoDB and select the database/collection
            client = pymongo.MongoClient('localhost')
            db = client.scrapy_FM
            self.collection = db.FM

        def process_item(self, item, spider):
            # Upsert keyed on article_url: update the record if the post
            # already exists, insert it otherwise, so re-crawling a post
            # never creates a duplicate document
            query = {'article_url': item['article_url']}
            self.collection.update_one(query, {"$set": dict(item)}, upsert=True)
            return item
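Because the update is keyed on article_url with upsert=True, re‑crawling a post overwrites the existing document instead of inserting a second copy. As an optional extra safeguard (an assumption, not something the article describes), a one‑off unique index on that field makes MongoDB itself reject accidental duplicates:

    # Optional one-off setup: unique index on the dedup key.
    import pymongo

    client = pymongo.MongoClient("localhost")
    client.scrapy_FM.FM.create_index("article_url", unique=True)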
5.6 Step 6 – Additional settings
Configure request headers, concurrency, pipelines, and other options in settings.py as needed; an illustrative sketch follows.
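The concurrency value, delay, and module paths below are assumptions chosen to match the rest of this walkthrough, not the article's actual configuration.

    # Illustrative settings.py entries.
    from user_agent import generate_user_agent

    ROBOTSTXT_OBEY = False
    CONCURRENT_REQUESTS = 16                  # matches the 16-way concurrency noted below
    DOWNLOAD_DELAY = 0.5                      # small delay to stay polite

    USER_AGENT = generate_user_agent()        # random UA from the user_agent library

    DOWNLOADER_MIDDLEWARES = {
        "myproject.middlewares.SeleniumMiddleware": 543,   # hypothetical module path
    }

    ITEM_PIPELINES = {
        "myproject.pipelines.FMPipeline": 300,             # hypothetical module path
    }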
6. Result Demonstration
Console output shows crawling progress; the crawler runs with a concurrency of 16 requests, giving good speed. MongoDB stores each comment together with its user information.
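A quick way to sanity‑check the stored data from a Python shell, assuming the database and collection names used in the pipeline above:

    # Count stored documents and show a sample record.
    import pymongo

    client = pymongo.MongoClient("localhost")
    collection = client.scrapy_FM.FM
    print("documents stored:", collection.count_documents({}))
    print(collection.find_one({}, {"title": 1, "article_url": 1}))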
7. Conclusion
The article demonstrates an end‑to‑end workflow for extracting data from a food forum, covering site analysis, crawling strategy, implementation with Scrapy and Selenium, data storage in MongoDB, and practical tips for handling anti‑scraping and duplicate data.