Build a Fast Scrapy Spider to Crawl Forum Posts in Minutes
This tutorial walks beginners through creating a minimal Scrapy project, writing a spider that fetches forum thread titles and content, extracting data with XPath, and extending the crawler with pipelines, middleware, and common settings for robust web scraping.
Introduction
This guide shows how to quickly build a simple Scrapy spider that grabs forum post titles and contents, aimed at newcomers who have never written a crawler before.
Setup
Install Python, Scrapy, and an IDE or any text editor. Create a new Scrapy project named miao (or any name you prefer) with the command:
scrapy startproject miao
The command generates the standard Scrapy directory structure: a scrapy.cfg file plus a miao package containing items.py, pipelines.py, settings.py, and a spiders directory.
Spider Code
Create miao/spiders/miao.py with the following content:
import scrapy
class NgaSpider(scrapy.Spider):
    name = "NgaSpider"  # the name passed to `scrapy crawl`
    host = "http://bbs.ngacn.cc/"
    start_urls = ["http://bbs.ngacn.cc/thread.php?fid=406"]  # first page to fetch
    def parse(self, response):
        # Called with the downloaded response for each URL in start_urls.
        print(response.body)
Run the spider from the project directory:
cd miao
scrapy crawl NgaSpider
The spider prints the raw HTML of the first forum page.
Parsing with XPath
Replace the parse method to extract titles using XPath:
from scrapy import Selector  # goes with the other imports at the top of the file
def parse(self, response):
    selector = Selector(response)
    # Select every element whose class attribute is exactly "topic".
    content_list = selector.xpath("//*[@class='topic']")
    for content in content_list:
        topic = content.xpath('string(.)').extract_first()  # text of the link
        url = self.host + content.xpath('@href').extract_first()
        print(topic)
        print(url)
This prints each post title and its absolute URL.
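To try the same selection logic without hitting the network, the sketch below runs an equivalent loop over a small hand-written snippet. It is only an approximation under stated assumptions: the markup and the read.php links are hypothetical stand-ins for the forum's real HTML, the standard library's ElementTree replaces Scrapy's Selector (it accepts the same [@class='topic'] predicate but needs well-formed markup), and urljoin replaces the self.host + href concatenation so that a leading slash in href does not produce a double slash.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

PAGE_URL = "http://bbs.ngacn.cc/thread.php?fid=406"

# Hypothetical, well-formed stand-in for the forum's thread list markup.
SAMPLE = """
<div>
  <a class="topic" href="/read.php?tid=1">First <b>post</b></a>
  <a class="topic" href="/read.php?tid=2">Second post</a>
  <a class="other" href="/help.php">Help</a>
</div>
"""


def extract_topics(markup, page_url):
    """Return (title, absolute_url) pairs for every class="topic" element."""
    root = ET.fromstring(markup.strip())
    topics = []
    for node in root.findall(".//*[@class='topic']"):
        # itertext() gathers nested text, much like XPath's string(.)
        title = "".join(node.itertext()).strip()
        # urljoin resolves href against the page URL instead of concatenating
        url = urljoin(page_url, node.get("href"))
        topics.append((title, url))
    return topics


for title, url in extract_topics(SAMPLE, PAGE_URL):
    print(title, url)
```

Inside a real spider, response.urljoin(href) resolves a link against the page URL in the same way, which may be a safer choice than string concatenation.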
Recursive Crawling
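To follow each post link and scrape its pages, import Request (from scrapy import Request) and yield one request per link, naming the method that should handle the response as the callback:
yield Request(url=url, callback=self.parse_topic)
Scrapy schedules the request and calls parse_topic with the downloaded response. Define parse_topic to extract the post's content, and optionally create Item classes (TopicItem, ContentItem) in items.py to structure the data.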
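One possible shape for those item classes is sketched below. The field names are assumptions, since the original does not list them, and plain dataclasses are used here instead of scrapy.Item subclasses (Scrapy 2.2 and later accept dataclass instances as items), which keeps the sketch runnable without Scrapy installed.

```python
from dataclasses import asdict, dataclass


@dataclass
class TopicItem:
    url: str    # absolute URL of the thread (assumed field)
    title: str  # thread title taken from the list page (assumed field)


@dataclass
class ContentItem:
    url: str      # URL of the thread page the post came from (assumed field)
    content: str  # text of a single post (assumed field)


# Build an item by hand to see what a pipeline would receive.
topic = TopicItem(url="http://bbs.ngacn.cc/read.php?tid=1", title="First post")
print(asdict(topic))
```

In the spider, parse could then yield TopicItem(url=url, title=topic) instead of printing, and parse_topic could yield ContentItem objects in the same way.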
Pipelines
Implement a pipeline (e.g., FilePipeline) in pipelines.py to process items – write them to files or databases. Register the pipeline in settings.py:
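The class has to exist before it can be registered. A minimal sketch of such a FilePipeline follows, assuming the goal is to append each item to a JSON Lines file. The items.jl file name and the constructor argument are hypothetical. The three method names are the hooks Scrapy calls on a pipeline object.

```python
import json
from dataclasses import asdict, is_dataclass


class FilePipeline:
    """Append every item to a JSON Lines file, one item per line."""

    def __init__(self, path="items.jl"):
        self.path = path  # hypothetical output file name
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "a", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # Accept dataclass items as well as dict-like items.
        record = asdict(item) if is_dataclass(item) else dict(item)
        self.file.write(json.dumps(record, ensure_ascii=False) + "\n")
        return item  # hand the item on to the next pipeline
```

Once a class like this lives in pipelines.py, it is registered in settings.py by its dotted path: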
ITEM_PIPELINES = {
    'miao.pipelines.FilePipeline': 400,
}
Middleware
Create middleware.py with a user‑agent middleware that randomly selects a UA string for each request, and a proxy middleware that routes traffic through a specified proxy.
import random
agents = [
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
    "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
    # ...
]
class UserAgentMiddleware(object):
    def process_request(self, request, spider):
        # Pick a different User-Agent string for every outgoing request.
        request.headers["User-Agent"] = random.choice(agents)
class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Route the request through a proxy; replace the address with your own.
        proxy = "http://127.0.0.1:8123"
        request.meta["proxy"] = proxy
Enable them in settings.py via DOWNLOADER_MIDDLEWARES.
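A sketch of that settings entry is shown below. The dotted paths assume the two classes live in miao/middleware.py as described above, and the priority numbers are assumptions: Scrapy calls process_request in increasing order of these numbers, so any two distinct values in a sensible range would do.

```python
# settings.py (fragment)
DOWNLOADER_MIDDLEWARES = {
    # Dotted path of each middleware class mapped to its priority (assumed values).
    "miao.middleware.UserAgentMiddleware": 401,
    "miao.middleware.ProxyMiddleware": 402,
}
```

Mapping a middleware's path to None instead of a number disables it, which is also how one of Scrapy's built-in middlewares can be switched off.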
Common Settings
Typical Scrapy settings include download delay, retry options, and concurrency limits:
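DOWNLOAD_DELAY = 5  # seconds to wait between requests to the same site
RETRY_ENABLED = True
RETRY_HTTP_CODES = [500, 502, 503, 504, 400, 403, 404, 408]  # status codes that trigger a retry
RETRY_TIMES = 5  # retries per request, in addition to the first attempt
CONCURRENT_ITEMS = 200  # items processed in parallel (per response) in the pipelines
CONCURRENT_REQUESTS = 100  # global cap on simultaneous requests
CONCURRENT_REQUESTS_PER_DOMAIN = 50
CONCURRENT_REQUESTS_PER_IP = 50  # when non-zero, used instead of the per-domain limit
Running from PyCharm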
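Create a PyCharm run configuration whose script is scrapy/cmdline.py (inside the installed Scrapy package), with the parameters crawl NgaSpider and the working directory set to the folder containing settings.py. Then start it with the green Run arrow, or with the Debug button to stop at breakpoints.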
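An alternative that avoids pointing PyCharm at a file inside the Scrapy package is a small launcher script in the project itself. The sketch below assumes a hypothetical file such as run.py placed in the project directory. crawl_argv and run_spider are illustrative names, while scrapy.cmdline.execute is the function the scrapy command itself calls. Scrapy is imported inside run_spider so that the helper above it can be used without Scrapy installed.

```python
def crawl_argv(spider_name):
    """Return the argument list equivalent to `scrapy crawl <spider_name>`."""
    return ["scrapy", "crawl", spider_name]


def run_spider(spider_name):
    """Run the spider in the current process, so IDE breakpoints apply."""
    from scrapy.cmdline import execute  # imported here, only when actually running

    execute(crawl_argv(spider_name))


print(crawl_argv("NgaSpider"))
```

Adding a final line run_spider("NgaSpider") to the script and running or debugging it from PyCharm, with the project directory as the working directory, should then behave like the command-line invocation.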
References
For a deeper dive, see the official Scrapy documentation and XPath tutorials linked at the end of the original article.
MaGe Linux Operations
