Master Scrapy: Build a Python Web Crawler to Extract Jokes from Qiushibaike
This tutorial walks you through installing Scrapy on Windows, creating a project and spider, configuring settings, and using XPath to crawl and extract joke titles and contents from the Qiushibaike website, providing a solid foundation for Python web scraping.
Introduction
This article introduces the powerful Python web‑scraping framework Scrapy and walks through installing it on Windows, creating a project, generating a spider, configuring settings, and extracting data from the Qiushibaike jokes site.
Scrapy Overview
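Scrapy is a popular framework for crawling websites and extracting structured data. It provides asynchronous downloading, request queueing and scheduling, response parsing, and persistence through item pipelines; distributed crawling is possible with extensions such as scrapy-redis. Learning Scrapy mostly means learning these features and how to use them.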
Installation on Windows
Install via pip. If the default command fails, use the Tsinghua mirror:
pip install scrapy
pip install scrapy -i https://pypi.tuna.tsinghua.edu.cn/simple
Create a Scrapy Project
Run:
scrapy startproject <project_name>
Example:
scrapy startproject qiushibaike
The project structure includes scrapy.cfg and a folder named after the project that contains items.py, middlewares.py, pipelines.py, settings.py, spiders/, etc.
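To confirm that the project was generated as described above, a small standard-library check can list any expected file that is missing. This is only a sketch: the helper name missing_project_files and the EXPECTED_LAYOUT list are illustrative, and the list covers only the entries named in this article.

```python
from pathlib import Path

# Entries named in the article; "{name}" stands for the project folder.
EXPECTED_LAYOUT = [
    "scrapy.cfg",
    "{name}/items.py",
    "{name}/middlewares.py",
    "{name}/pipelines.py",
    "{name}/settings.py",
    "{name}/spiders",
]


def missing_project_files(root, name):
    """Return the expected entries that do not exist under root."""
    root = Path(root)
    expected = [entry.format(name=name) for entry in EXPECTED_LAYOUT]
    return [entry for entry in expected if not (root / entry).exists()]


if __name__ == "__main__":
    # Run from the directory that contains scrapy.cfg.
    print(missing_project_files(".", "qiushibaike"))
```

An empty list means every entry listed above is present.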
Generate a Spider
Use:
scrapy genspider <spider_name> <start_url>
Example for Qiushibaike jokes:
scrapy genspider duanzi ww.com
Configure Settings
In settings.py, set ROBOTSTXT_OBEY = False to ignore robots.txt, and define a realistic USER_AGENT, e.g.:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
Extract Joke Links
Use XPath to locate each joke block and, inside it, the a tag with class contentHerf:
//div[contains(@class,"article") and contains(@class,"block")]//a[@class="contentHerf"]/@href
In the spider's parse method:
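# Goes inside the generated spider class; the generated file already imports scrapy.
def parse(self, response):
    # Collect the relative link of every joke on the listing page.
    a_href_list = response.xpath('//div[contains(@class,"article") and contains(@class,"block")]//a[@class="contentHerf"]/@href').extract()
    base_url = "https://www.qiushibaike.com"
    for a_href in a_href_list:
        # Build the absolute URL and request the detail page.
        url = f"{base_url}{a_href}"
        yield scrapy.Request(url=url, callback=self.detail)
Parse Detail Page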
Extract the title and content with XPath:
//h1[@class="article-title"]/text()
//div[@class="content"]//text()
Detail method:
def detail(self, response):
    # extract() returns a list of matching text nodes.
    title = response.xpath('//h1[@class="article-title"]/text()').extract()
    content = response.xpath('//div[@class="content"]//text()').extract()
    print("Title:", title)
    print("Content:", content)
Run the Spider
Start crawling with:
scrapy crawl duanzi --nolog
The --nolog option suppresses Scrapy's log output, so only the printed results appear. The spider prints each joke's title and content.
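Because extract() returns a list of text nodes, the printed title and content appear as Python lists, often with stray whitespace and newlines. A minimal sketch of tidying them into single strings follows; the helper name clean_text and the sample lists are hypothetical and only imitate the shape of what extract() returns.

```python
def clean_text(nodes):
    """Join XPath text nodes into one string, dropping empty fragments."""
    parts = (node.strip() for node in nodes)
    return " ".join(part for part in parts if part)


# Hypothetical samples shaped like the lists that extract() returns.
sample_title = ["\n  A sample title \n"]
sample_content = ["\n", "First line", "\n", "  second line  ", "\n"]

print("Title:", clean_text(sample_title))
print("Content:", clean_text(sample_content))
```

Inside the detail method, the same helper could be applied to title and content before printing them.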
Summary of Commands
scrapy startproject <project_name>
scrapy genspider <spider_name> <start_url>
scrapy crawl <spider_name> [--nolog]
Conclusion
After following these steps you should have a basic Scrapy crawler that fetches jokes from Qiushibaike. Continue exploring Scrapy’s features to handle pagination, pipelines, and data storage.
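As one possible next step toward data storage, an item pipeline could write each joke to a file. The sketch below assumes the detail method is changed to yield a dict such as {"title": ..., "content": ...} instead of printing; the class name JsonLinesPipeline and the file name jokes.jl are hypothetical. The method names open_spider, close_spider, and process_item follow Scrapy's item pipeline interface.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline: write each scraped item as one JSON line."""

    def __init__(self, path="jokes.jl"):
        # Output path is an assumption made for this sketch.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # Keep non-ASCII text readable in the output file.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

A pipeline like this would live in pipelines.py and be enabled through the ITEM_PIPELINES setting in settings.py.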
Python Crawling & Data Mining
