Build a Fast Scrapy Spider to Crawl Forum Posts in Minutes

This tutorial walks beginners through creating a minimal Scrapy project, writing a spider that fetches forum thread titles and content, extracting data with XPath, and extending the crawler with pipelines, middleware, and common settings for robust web scraping.

MaGe Linux Operations

Oct 1, 2017

Introduction

This guide shows how to quickly build a simple Scrapy spider that grabs forum post titles and contents, aimed at newcomers who have never written a crawler before.

Setup

Install Python, Scrapy, and an IDE or any text editor. Create a new Scrapy project named miao (or any name you prefer) with the command:

scrapy startproject miao

The command generates the standard Scrapy directory structure: a scrapy.cfg file plus an inner miao package containing items.py, pipelines.py, settings.py, and a spiders directory.

Spider Code

Create miao/spiders/miao.py with the following content:

import scrapy

class NgaSpider(scrapy.Spider):
    name = "NgaSpider"
    host = "http://bbs.ngacn.cc/"
    start_urls = ["http://bbs.ngacn.cc/thread.php?fid=406"]

    def parse(self, response):
        print(response.body)

Run the spider from the project directory:

cd miao
scrapy crawl NgaSpider

The spider prints the raw HTML of the first forum page.

Parsing with XPath

Replace the parse method to extract titles using XPath:

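from scrapy import Selector

# This method replaces NgaSpider.parse; keep it indented inside the class.
def parse(self, response):
    selector = Selector(response)
    # Each thread link on the listing page carries class="topic".
    content_list = selector.xpath("//*[@class='topic']")
    for content in content_list:
        # string(.) concatenates all text inside the element.
        topic = content.xpath('string(.)').extract_first()
        href = content.xpath('@href').extract_first()
        if href is None:
            continue  # skip elements that carry no link
        # Resolve the link against the page URL instead of concatenating strings.
        url = response.urljoin(href)
        print(topic)
        print(url)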

This prints each post title and its absolute URL.
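How an href becomes an absolute URL depends on its form. The sketch below uses the standard library's urljoin, whose resolution rules Scrapy's response.urljoin also follows, to show the three common cases. The href values are invented for illustration and are not taken from the forum.

```python
from urllib.parse import urljoin

# ASSUMPTION: these href values are made up for illustration.
page_url = "http://bbs.ngacn.cc/thread.php?fid=406"
hrefs = [
    "/read.php?tid=1",                     # root-relative
    "read.php?tid=2",                      # relative to the current page
    "http://bbs.ngacn.cc/read.php?tid=3",  # already absolute
]

# urljoin resolves each href against the URL of the page it came from.
absolute_urls = [urljoin(page_url, href) for href in hrefs]
for url in absolute_urls:
    print(url)

# Plain concatenation with a host that ends in "/" doubles the slash
# for root-relative links.
concatenated = "http://bbs.ngacn.cc/" + hrefs[0]
print(concatenated)
```

All three hrefs resolve to a URL under http://bbs.ngacn.cc/ with a single slash after the host, while the concatenated form contains a double slash.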

Recursive Crawling

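To follow each post link and scrape the thread itself, import Request (from scrapy import Request) and yield one request per link from parse, passing the method that should handle the response as the callback: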

yield Request(url=url, callback=self.parse_topic)

Define parse_topic to extract the post's content, and optionally create Item classes (TopicItem, ContentItem) in items.py to structure the data.

Pipelines

Implement a pipeline class (e.g., FilePipeline) in pipelines.py. Scrapy calls its process_item method for every item the spider yields, which is where you write the data to files or a database. Register the pipeline in settings.py:

ITEM_PIPELINES = {
    'miao.pipelines.FilePipeline': 400,
}
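One possible shape for FilePipeline is sketched below. The article does not show its implementation, so the JSON Lines output format and the items.jl file name are assumptions. The open_spider, close_spider, and process_item hooks are the standard methods Scrapy looks for on a pipeline.

```python
import json


class FilePipeline(object):
    """Append every scraped item to a JSON Lines file, one item per line."""

    # ASSUMPTION: the output path is illustrative.
    path = "items.jl"

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "a", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) works for both scrapy.Item instances and plain dicts.
        line = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(line + "\n")
        # Return the item so later pipelines can process it too.
        return item
```

The number in ITEM_PIPELINES sets the order: pipelines with lower values run first.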

Middleware

Create middleware.py in the miao package (next to settings.py) with two downloader middlewares: a user‑agent middleware that picks a random UA string for each request, and a proxy middleware that routes traffic through a specified proxy.

import random
agents = [
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
    "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
    # ... (further user-agent strings elided)
]

class UserAgentMiddleware(object):
    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(agents)

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        proxy = "http://127.0.0.1:8123"
        request.meta["proxy"] = proxy

Enable them in settings.py via DOWNLOADER_MIDDLEWARES.
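A settings.py entry along these lines would enable both classes. The module path assumes the file is named middleware.py inside the miao package, as above. The priority numbers are illustrative assumptions; lower numbers have their process_request called earlier.

```python
# settings.py (fragment)
# ASSUMPTION: the priorities 401 and 402 are illustrative values.
DOWNLOADER_MIDDLEWARES = {
    "miao.middleware.UserAgentMiddleware": 401,
    "miao.middleware.ProxyMiddleware": 402,
}
```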

Common Settings

Typical Scrapy settings include download delay, retry options, and concurrency limits:

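DOWNLOAD_DELAY = 5  # seconds to wait between consecutive requests to the same site
RETRY_ENABLED = True  # retry failed requests
RETRY_HTTP_CODES = [500, 502, 503, 504, 400, 403, 404, 408]  # response codes that trigger a retry
RETRY_TIMES = 5  # maximum retries per request, in addition to the first attempt
CONCURRENT_ITEMS = 200  # maximum items (per response) processed in parallel by the pipelines
CONCURRENT_REQUESTS = 100  # maximum simultaneous requests overall
CONCURRENT_REQUESTS_PER_DOMAIN = 50  # maximum simultaneous requests per domain
CONCURRENT_REQUESTS_PER_IP = 50  # maximum per IP; when nonzero, the per-domain limit is ignored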

Running from PyCharm

Configure PyCharm to run scrapy/cmdline.py with parameters crawl NgaSpider and set the working directory to the folder containing settings.py. Then start debugging with the green arrow.

References

For a deeper dive, see the official Scrapy documentation and XPath tutorials linked at the end of the original article.

