A Beginner's Guide to Using Scrapy for Web Crawling

This beginner‑friendly guide walks through installing Scrapy, creating a project and spider, running and debugging a crawler, parsing pages with CSS/XPath selectors, and working around common hurdles such as JavaScript rendering, User‑Agent blocking, and proxy rotation with configurable downloader middlewares.

Tencent Cloud Developer

This article introduces the Scrapy framework, a popular Python library for building web crawlers, and provides a step‑by‑step guide for beginners.

1. Install Scrapy

pip install scrapy

2. Create a Scrapy project

scrapy startproject your_project_name

The command creates a directory structure like:

your_project_name
|    scrapy.cfg
|----your_project_name
     |    __init__.py
     |    items.py
     |    middlewares.py
     |    pipelines.py
     |    settings.py
     |----spiders
          |    __init__.py

3. Generate a spider

scrapy genspider example www.qq.com

This creates example.py in the spiders folder. A minimal spider looks like:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['qq.com']
    start_urls = ['http://qq.com/']

    def parse(self, response):
        pass

4. Run the spider

scrapy crawl example

Scrapy requests each URL in start_urls and passes every response to the parse method.

5. Debugging

Use the interactive shell to inspect responses:

scrapy shell www.qq.com

6. Implement the parsing logic

def parse(self, response):
    pass

Inside parse, you can extract data from response using CSS or XPath selectors. To follow additional links, use:

yield scrapy.Request(url_str, callback=self.parse)

Additional data can be attached to a request through its meta argument and read back in the callback via response.meta.

7. Common issues and solutions

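Dynamic pages: Scrapy does not execute JavaScript, so content rendered in the browser is missing from the response. One workaround is a downloader middleware that loads the page with Selenium and a headless Chrome browser: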

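from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from scrapy.http import HtmlResponse

# Placeholder: set this to the location of your chromedriver binary.
chrome_driver_path_str = '/path/to/chromedriver'

class JavaScriptMiddleware:
    def process_request(self, request, spider):
        option = webdriver.ChromeOptions()
        option.add_argument('--headless')
        option.add_argument('--no-sandbox')
        option.add_argument('--disable-gpu')
        # Selenium 4 takes the driver path through a Service object;
        # the older executable_path argument is deprecated.
        driver = webdriver.Chrome(service=Service(chrome_driver_path_str), options=option)
        try:
            driver.get(request.url)
            # Scroll down so that lazily loaded content is triggered.
            driver.execute_script('document.documentElement.scrollTop = 10000')
            body = driver.page_source
            url = driver.current_url
        finally:
            # Always close the browser, otherwise every request leaks a Chrome process.
            driver.quit()
        # Returning a response here makes Scrapy skip its own download.
        return HtmlResponse(url, body=body, encoding='utf-8', request=request)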

Enable the middleware in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'your_project_name.middlewares.JavaScriptMiddleware': 543,
}

Header modification: Set a realistic User‑Agent in settings.py to avoid basic anti‑scraping blocks.

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'

IP pool: Use a proxy service to rotate IPs. The example middleware below fetches proxies from a provider's API; replace the credentials and order ID with your own:

from w3lib.http import basic_auth_header
import requests

class ProxyDownloaderMiddleware:
    # Replace with the credentials and order ID issued by your proxy provider.
    username = 'your_username'
    password = 'your_password'
    api_url = 'https://dps.kdlapi.com/api/getdps/?orderid=your_orderid&num=1&pt=1&dedup=1&sep=1'
    proxy_ip_list = []
    list_max_len = 20

    def update_ip(self):
        # Refill the pool whenever it is not at full size.
        if len(self.proxy_ip_list) != self.list_max_len:
            ip_str = requests.get('https://dps.kdlapi.com/api/getdps/?orderid=your_orderid&num={}&pt=1&dedup=1&sep=3'.format(self.list_max_len)).text
            self.proxy_ip_list = ip_str.split()
        # Test proxies from the front of the pool until one responds.
        while True:
            try:
                proxy_ip = self.proxy_ip_list.pop(0)
                proxies = {
                    'http': 'http://{}:{}@{}'.format(self.username, self.password, proxy_ip),
                    'https': 'http://{}:{}@{}'.format(self.username, self.password, proxy_ip)
                }
                requests.get('http://www.baidu.com', proxies=proxies, timeout=3.05)
                # The working proxy moves to the end, where process_request reads it.
                self.proxy_ip_list.append(proxy_ip)
                return
            except Exception:
                # The proxy failed (or the pool was empty): fetch one replacement.
                self.proxy_ip_list.append(requests.get(self.api_url).text.strip())

    def process_request(self, request, spider):
        # Note: update_ip makes blocking HTTP calls, which pauses the crawl while it runs.
        self.update_ip()
        request.meta['proxy'] = 'http://{}'.format(self.proxy_ip_list[-1])
        request.headers['Proxy-Authorization'] = basic_auth_header(self.username, self.password)
        return None

Activate the proxy middleware:

DOWNLOADER_MIDDLEWARES = {
    'your_project_name.middlewares.ProxyDownloaderMiddleware': 100,
}
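
The middleware above validates each proxy with a blocking request before use. A lighter variant, sketched here under the assumption that you already hold a list of working proxies, simply hands them out in turn. PROXY_LIST is a hypothetical setting name, and the class is not part of Scrapy.

```python
import itertools


class RoundRobinProxyMiddleware:
    """Assign proxies to requests in turn from a fixed list."""

    def __init__(self, proxies):
        self.proxies = list(proxies)
        if not self.proxies:
            raise ValueError('at least one proxy URL is required')
        # cycle() repeats the list endlessly: a, b, a, b, ...
        self._cycle = itertools.cycle(self.proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a hypothetical setting holding proxy URLs.
        return cls(crawler.settings.getlist('PROXY_LIST'))

    def process_request(self, request, spider):
        # Scrapy routes the download through the URL stored in meta['proxy'].
        request.meta['proxy'] = next(self._cycle)
        return None
```

This trades the health check for speed: a dead proxy stays in rotation until the list is updated, so it suits providers that already guarantee availability.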

If both dynamic rendering and proxy rotation are needed, combine the settings:

DOWNLOADER_MIDDLEWARES = {
    'your_project_name.middlewares.JavaScriptMiddleware': 543,
    'your_project_name.middlewares.ProxyDownloaderMiddleware': 100,
}

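Note that lower numeric values are processed first, so requests pass through the proxy middleware before the JavaScript middleware. Because the JavaScript middleware returns its own response, Chrome fetches the page directly; the proxy stored in request.meta only takes effect if it is also passed to Chrome.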
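
To make the ordering rule concrete, the short sketch below sorts a DOWNLOADER_MIDDLEWARES dictionary the way the rule describes. The two helper functions are written for this illustration and are not Scrapy APIs; they also ignore Scrapy's built-in middlewares, which are merged into the same ordering in a real project.

```python
DOWNLOADER_MIDDLEWARES = {
    'your_project_name.middlewares.JavaScriptMiddleware': 543,
    'your_project_name.middlewares.ProxyDownloaderMiddleware': 100,
}


def request_order(middlewares):
    """Return middleware paths in the order process_request is called."""
    # A value of None disables a middleware, so it is left out.
    enabled = {path: prio for path, prio in middlewares.items() if prio is not None}
    return sorted(enabled, key=enabled.get)


def response_order(middlewares):
    """Return middleware paths in the order process_response is called."""
    # Responses travel back through the chain in the opposite direction.
    return list(reversed(request_order(middlewares)))


for path in request_order(DOWNLOADER_MIDDLEWARES):
    print(path)
```

With the settings above, the proxy middleware is printed first and the JavaScript middleware second.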

Conclusion

The article provides a concise overview of Scrapy’s core workflow, common pitfalls, and practical solutions such as headless browsing, header spoofing, and IP rotation, helping readers quickly get started with web crawling projects.
