Master Scrapy: Build a Python Web Crawler to Extract Jokes from Qiushibaike

This tutorial walks you through installing Scrapy on Windows, creating a project and spider, configuring settings, and using XPath to crawl and extract joke titles and contents from the Qiushibaike website, providing a solid foundation for Python web scraping.

Python Crawling & Data Mining

Jan 8, 2021

Introduction

This article introduces the powerful Python web‑scraping framework Scrapy and walks through installing it on Windows, creating a project, generating a spider, configuring settings, and extracting data from the Qiushibaike jokes site.

Scrapy Overview

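Scrapy is a popular Python framework for crawling websites and extracting structured data. It handles asynchronous downloading, request scheduling and queuing, response parsing with selectors, and persistence through item pipelines. Distributed crawling is not built in but is available through extensions such as scrapy-redis. Learning Scrapy is mostly a matter of learning what these components do and how to use them.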

Installation on Windows

Install Scrapy with pip. If the first command fails or the download times out, rerun it against the Tsinghua PyPI mirror using the -i option, as in the second command:

pip install scrapy

pip install scrapy -i https://pypi.tuna.tsinghua.edu.cn/simple
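
To confirm that the install succeeded, one option is to ask Python for the installed package version. The following is a minimal sketch using only the standard library; the helper name installed_version is made up for this example:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints a version string if Scrapy is installed, otherwise None.
    print("Scrapy version:", installed_version("scrapy"))
```

Running scrapy version on the command line reports the same information.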

Create a Scrapy Project

Run:

scrapy startproject <project_name>

Example:

scrapy startproject qiushibaike

The command creates a project directory containing scrapy.cfg and a Python package named after the project, which holds items.py, middlewares.py, pipelines.py, settings.py, and a spiders/ directory.
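
A quick way to check that the project was generated completely is to test for the expected files. This is a sketch only: the helper names expected_paths and missing_paths are invented for illustration, and the file list reflects the layout that scrapy startproject normally produces.

```python
from pathlib import Path


def expected_paths(project_name):
    """List the files `scrapy startproject` generates, relative to the project root."""
    package = Path(project_name)
    return [
        Path("scrapy.cfg"),
        package / "__init__.py",
        package / "items.py",
        package / "middlewares.py",
        package / "pipelines.py",
        package / "settings.py",
        package / "spiders" / "__init__.py",
    ]


def missing_paths(project_root, project_name):
    """Return the expected paths that do not exist under project_root."""
    root = Path(project_root)
    return [p for p in expected_paths(project_name) if not (root / p).exists()]


if __name__ == "__main__":
    # Assumes the command was run in the current directory, creating ./qiushibaike.
    print(missing_paths("qiushibaike", "qiushibaike"))
```

An empty list means every expected file is present.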

Generate a Spider

From inside the project directory, run:

scrapy genspider <spider_name> <start_url>

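Example for Qiushibaike jokes:

scrapy genspider duanzi www.qiushibaike.com

This creates spiders/duanzi.py with a spider class named duanzi that already defines allowed_domains, start_urls, and an empty parse method.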

Configure Settings

In settings.py, set ROBOTSTXT_OBEY = False so that Scrapy does not filter requests against the site's robots.txt, and define a browser-like USER_AGENT, for example:

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
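
A crawl against a live site usually also benefits from throttling. The following is a sketch of additional settings.py entries; the setting names are standard Scrapy settings, but the values are illustrative assumptions rather than values from the original tutorial:

```python
# settings.py (additional entries; the values are illustrative assumptions)

# Wait about one second between consecutive requests to the same site.
DOWNLOAD_DELAY = 1.0

# Limit parallel requests per domain to keep the load on the site low.
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Let Scrapy adapt the delay to the server's observed response times.
AUTOTHROTTLE_ENABLED = True
```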

Extract Joke Links

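Each joke on the listing page sits in a div whose class includes both article and block, and links to its detail page through an a tag with the class contentHerf (spelled this way in the page markup). This XPath selects the href of each of those links: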

//div[contains(@class,"article") and contains(@class,"block")]//a[@class="contentHerf"]/@href

In the spider’s parse method:

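def parse(self, response):
    # Collect the detail-page link of every joke block on the listing page.
    a_href_list = response.xpath('//div[contains(@class,"article") and contains(@class,"block")]//a[@class="contentHerf"]/@href').getall()
    for a_href in a_href_list:
        # The hrefs are site-relative, so resolve them against the page URL.
        url = response.urljoin(a_href)
        # Request each detail page and hand the response to self.detail.
        yield scrapy.Request(url=url, callback=self.detail)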
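
Relative hrefs have to be turned into absolute URLs before they can be requested. Below is a minimal standard-library sketch of that resolution step, which is the same operation Scrapy's response.urljoin performs. The listing-page URL and the hrefs are made up for illustration:

```python
from urllib.parse import urljoin

# Hypothetical listing-page URL and hrefs, for illustration only.
page_url = "https://www.qiushibaike.com/text/"
hrefs = ["/article/123456", "/article/654321"]

# A leading slash makes each href relative to the site root, not to /text/.
absolute_urls = [urljoin(page_url, href) for href in hrefs]

for url in absolute_urls:
    print(url)
```

Unlike plain string concatenation, this also handles hrefs that are already absolute or that lack a leading slash.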

Parse Detail Page

Extract title and content with XPath:

//h1[@class="article-title"]/text()

//div[@class="content"]//text()

Detail method:

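def detail(self, response):
    # The title is a single text node; .get() returns it as a string (or None).
    title = response.xpath('//h1[@class="article-title"]/text()').get()
    # The body is split across several text nodes; .getall() returns them as a list.
    content = response.xpath('//div[@class="content"]//text()').getall()
    print("Title:", title)
    print("Content:", content)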
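
Because //text() returns every text node under the content div, the content arrives as a list of fragments that often includes whitespace-only entries. Here is a sketch of one way to turn the raw values into a clean record, assuming the title has already been reduced to a single string. The function name build_joke_item and the sample values are hypothetical:

```python
def build_joke_item(title, content_nodes):
    """Combine a title and a list of text fragments into one clean record."""
    # Strip whitespace from each fragment and drop fragments that end up empty.
    parts = [node.strip() for node in content_nodes]
    content = "\n".join(part for part in parts if part)
    # A missing title (None) becomes an empty string.
    return {"title": (title or "").strip(), "content": content}


if __name__ == "__main__":
    # Made-up sample values standing in for what the XPath queries return.
    item = build_joke_item(" A joke ", ["\n\n", "First line. ", "\n", "Second line.\n"])
    print(item)
```

Yielding such a dict from detail instead of printing it hands the record to Scrapy's item pipelines and feed exports.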

Run the Spider

Start crawling with:

scrapy crawl duanzi --nolog

The --nolog flag suppresses Scrapy's log output, so the console shows only the title and content that detail prints for each joke.

Summary of Commands

scrapy startproject <project_name>

scrapy genspider <spider_name> <start_url>

scrapy crawl <spider_name> [--nolog]

Conclusion

After following these steps you should have a basic Scrapy crawler that fetches jokes from Qiushibaike. Continue exploring Scrapy’s features to handle pagination, pipelines, and data storage.
