Build a Fast Scrapy Spider to Crawl Forum Posts in Minutes

This tutorial walks beginners through setting up a Python Scrapy project, writing a spider to fetch forum thread titles and contents, using XPath for parsing, and enhancing the crawler with pipelines, middleware, and common settings for robust web scraping.

MaGe Linux Operations

Jun 9, 2018

Introduction

This guide shows how to quickly create a simple Scrapy spider that captures forum post titles and content, aimed at newcomers who have never written a crawler before.

Prerequisites

Install Python, Scrapy, and an IDE or any text editor.

Create a Scrapy Project

Open a terminal, create a working directory, and run:

scrapy startproject miao

Scrapy generates the project skeleton: a scrapy.cfg file and a miao package containing items.py, pipelines.py, settings.py, and a spiders directory.

Write the Spider

Create miao/spiders/miao.py with the following content:

import scrapy

class NgaSpider(scrapy.Spider):
    name = "NgaSpider"
    host = "http://bbs.ngacn.cc/"
    # start_urls is the initial page to crawl
    start_urls = [
        "http://bbs.ngacn.cc/thread.php?fid=406",
    ]
    def parse(self, response):
        # response.text is the decoded page; response.body holds the raw bytes
        print(response.text)

Run a Test

From the project directory execute:

cd miao
scrapy crawl NgaSpider

The spider prints the raw HTML of the first forum page.

Parsing with XPath

Import Selector and modify parse to extract titles:

from scrapy import Selector

def parse(self, response):
    selector = Selector(response)
    # Extract all elements with class='topic'
    content_list = selector.xpath("//*[@class='topic']")
    for content in content_list:
        topic = content.xpath('string(.)').extract_first()
        print(topic)
        url = self.host + content.xpath('@href').extract_first()
        print(url)

This prints each post title and its URL.
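To experiment with the selection logic without hitting the forum, here is a minimal offline sketch. It makes several assumptions: the HTML fragment is made up, the helper name extract_topics is hypothetical, and it uses the standard library's xml.etree.ElementTree in place of Scrapy's Selector. ElementTree supports only a subset of XPath, so string(.) is replaced by joining itertext(). The sketch resolves each href with urllib.parse.urljoin rather than string concatenation, so links with or without a leading slash both produce a clean absolute URL.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

HOST = "http://bbs.ngacn.cc/"

# Made-up, well-formed stand-in for a fragment of the thread list page.
SAMPLE_HTML = """
<table>
  <tr><td><a class="topic" href="/read.php?tid=1">First <b>post</b></a></td></tr>
  <tr><td><a class="topic" href="/read.php?tid=2">Second post</a></td></tr>
  <tr><td><a class="author" href="/nuke.php?uid=9">someone</a></td></tr>
</table>
"""


def extract_topics(html, host=HOST):
    """Return (title, absolute_url) pairs for every element with class='topic'."""
    root = ET.fromstring(html)
    topics = []
    # Same attribute predicate as the spider: any element whose class is 'topic'.
    for node in root.findall(".//*[@class='topic']"):
        # Concatenate all nested text, similar in spirit to XPath's string(.).
        title = "".join(node.itertext()).strip()
        href = node.get("href")
        if href is None:
            continue  # skip elements that carry the class but no link
        topics.append((title, urljoin(host, href)))
    return topics


for title, url in extract_topics(SAMPLE_HTML):
    print(title, url)
```

Inside a real spider, Scrapy's response.urljoin(href) offers the same kind of resolution against the page URL.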

Recursive Crawling

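To fetch each post's content, import Request (from scrapy import Request) and, inside the loop in parse, yield a Request for each URL with parse_topic as its callback: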

yield Request(url=url, callback=self.parse_topic)

Define parse_topic to extract post bodies:

def parse_topic(self, response):
    selector = Selector(response)
    content_list = selector.xpath("//*[@class='postcontent ubbcode']")
    for content in content_list:
        content = content.xpath('string(.)').extract_first()
        print(content)

Pipelines – Processing Items

Create items.py:

from scrapy import Item, Field

class TopicItem(Item):
    url = Field()
    title = Field()
    author = Field()

class ContentItem(Item):
    url = Field()
    content = Field()
    author = Field()

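In the spider's parse_topic callback, build an item for each post and yield it instead of printing it; Scrapy passes every yielded item to the enabled pipelines: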

item = ContentItem()
item["url"] = response.url
item["content"] = content
item["author"] = ""
yield item

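Enable the pipeline in settings.py. The key is the dotted path to the pipeline class, which must exist in miao/pipelines.py, and the number sets the run order (lower values run first):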

ITEM_PIPELINES = {
    'miao.pipelines.FilePipeline': 400,
}
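The article does not show the body of FilePipeline, so the following is only a sketch of what such a class in miao/pipelines.py might look like. The output path items.jl and the JSON Lines format are assumptions made for illustration. The method names open_spider, close_spider, and process_item are the hooks Scrapy calls on an item pipeline.

```python
import json


class FilePipeline:
    """Append every item to a JSON Lines file (one JSON object per line)."""

    def __init__(self, path="items.jl"):
        # "items.jl" is a hypothetical output path chosen for this sketch.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "a", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        if self.file is not None:
            self.file.close()
            self.file = None

    def process_item(self, item, spider):
        # dict(item) works for both Scrapy Items and plain dicts.
        line = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(line + "\n")
        # Return the item so later pipelines can still process it.
        return item
```

Because process_item returns the item, further pipelines with a higher order number would still receive it.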

Middleware – Custom Requests

Add a middleware file middleware.py to rotate User‑Agents:

import random
agents = [
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
    "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
    # more agents …
]

class UserAgentMiddleware(object):
    def process_request(self, request, spider):
        agent = random.choice(agents)
        request.headers["User-Agent"] = agent

Configure it in settings.py:

DOWNLOADER_MIDDLEWARES = {
    "miao.middleware.UserAgentMiddleware": 401,
    "miao.middleware.ProxyMiddleware": 402,
}

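Similarly, a simple proxy middleware (the ProxyMiddleware entry above) can be added to get around IP bans. If you do not define that class, remove its entry from DOWNLOADER_MIDDLEWARES, because Scrapy fails at startup when a listed middleware cannot be imported.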
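The source does not include the proxy middleware itself, so here is a minimal sketch of one possible ProxyMiddleware for miao/middleware.py. The proxy addresses are placeholders, not working servers, and the random-choice strategy is an assumption that mirrors the User-Agent middleware. It relies on Scrapy's convention that a request is routed through the proxy named in request.meta["proxy"].

```python
import random

# Hypothetical proxy endpoints; replace them with proxies you are allowed to use.
PROXIES = [
    "http://127.0.0.1:8080",
    "http://127.0.0.1:8081",
]


class ProxyMiddleware:
    def process_request(self, request, spider):
        # Pick one proxy per request and record it in the request metadata.
        request.meta["proxy"] = random.choice(PROXIES)
        # Returning None (implicitly) lets Scrapy continue handling the request.
```

A production setup would probably also drop proxies that keep failing, which this sketch does not attempt.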

Common Settings

# Delay between requests (seconds)
DOWNLOAD_DELAY = 5

# Retry on failure
RETRY_ENABLED = True
RETRY_HTTP_CODES = [500, 502, 503, 504, 400, 403, 404, 408]
RETRY_TIMES = 5

# Concurrency limits
CONCURRENT_ITEMS = 200
CONCURRENT_REQUESTS = 100
CONCURRENT_REQUESTS_PER_DOMAIN = 50
CONCURRENT_REQUESTS_PER_IP = 50

Running in PyCharm

Create a Run/Debug configuration whose script path points to Scrapy's cmdline.py (inside the installed scrapy package), set the script parameters to crawl NgaSpider, and set the working directory to the project folder that contains settings.py.

References

Scrapy documentation: http://scrapy-chs.readthedocs.io/zh_CN/0.24/

XPath tutorial: http://www.w3school.com.cn/xpath/xpath_syntax.asp
