A Beginner's Guide to Using Scrapy for Web Crawling
This beginner-friendly guide walks through installing Scrapy, creating a project and a spider, running and debugging a crawl, parsing pages with CSS and XPath selectors, and handling common hurdles such as JavaScript rendering, User-Agent spoofing, and proxy rotation through configurable downloader middlewares.
This article introduces the Scrapy framework, a popular Python library for building web crawlers, and provides a step‑by‑step guide for beginners.
1. Install Scrapy
pip install scrapy
2. Create a Scrapy project
scrapy startproject your_project_name
The command creates a directory structure like:
your_project_name
|   scrapy.cfg
|
|----your_project_name
    |   __init__.py
    |   items.py
    |   middlewares.py
    |   pipelines.py
    |   settings.py
    |
    |----spiders
            __init__.py

3. Generate a spider
scrapy genspider example www.qq.com
This creates example.py in the spiders folder. A minimal spider looks like:
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['qq.com']
    start_urls = ['http://qq.com/']

    def parse(self, response):
        pass

4. Run the spider
scrapy crawl example
Scrapy requests each URL in start_urls and passes every downloaded response to the parse method.
5. Debugging
Use the interactive shell to inspect responses:
scrapy shell www.qq.com
6. Implement the parsing logic
def parse(self, response):
    pass

Inside parse, you can extract data from response using CSS or XPath selectors. To follow additional links, use:
yield scrapy.Request(url_str, callback=self.parse)
To pass extra data along with a request, set the meta argument of scrapy.Request; the callback reads it back from response.meta.
7. Common issues and solutions
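Dynamic pages: Scrapy does not execute JavaScript, so content rendered in the browser is missing from the response. One workaround is a downloader middleware that loads the page with Selenium and a headless Chrome browser: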
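from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Placeholder: set this to the location of your chromedriver binary.
chrome_driver_path_str = '/path/to/chromedriver'


class JavaScriptMiddleware:
    def process_request(self, request, spider):
        option = webdriver.ChromeOptions()
        option.add_argument('--headless')
        option.add_argument('--no-sandbox')
        option.add_argument('--disable-gpu')
        # Recent Selenium 4 releases take the driver path through a Service
        # object; the older executable_path keyword is no longer accepted.
        driver = webdriver.Chrome(service=Service(chrome_driver_path_str), options=option)
        try:
            driver.get(request.url)
            # Scroll down so that lazily loaded content is triggered.
            driver.execute_script('document.documentElement.scrollTop = 10000')
            body = driver.page_source
            url = driver.current_url
        finally:
            # Always close the browser, otherwise every request leaks a Chrome process.
            driver.quit()
        # Returning a response here skips Scrapy's own downloader for this request.
        return HtmlResponse(url, body=body, encoding='utf-8', request=request)

Enable the middleware in settings.py: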
DOWNLOADER_MIDDLEWARES = {
    'your_project_name.middlewares.JavaScriptMiddleware': 543,
}

Header modification: Set a realistic User-Agent to avoid basic anti-scraping blocks.
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'

IP pool: Use a proxy service to rotate IPs. Example middleware:
import requests
from w3lib.http import basic_auth_header


class ProxyDownloaderMiddleware:
    username = 'your_username'
    password = 'your_password'
    api_url = 'https://dps.kdlapi.com/api/getdps/?orderid=your_orderid&num=1&pt=1&dedup=1&sep=1'
    proxy_ip_list = []
    list_max_len = 20

    def update_ip(self):
        # Refill the pool whenever it does not hold list_max_len proxies.
        if len(self.proxy_ip_list) != self.list_max_len:
            ip_str = requests.get('https://dps.kdlapi.com/api/getdps/?orderid=your_orderid&num={}&pt=1&dedup=1&sep=3'.format(self.list_max_len)).text
            # split() without arguments also drops stray newlines around entries.
            self.proxy_ip_list = ip_str.split()
        while True:
            try:
                # Take the oldest proxy and check that it still works.
                proxy_ip = self.proxy_ip_list.pop(0)
                proxies = {
                    'http': 'http://{}:{}@{}'.format(self.username, self.password, proxy_ip),
                    'https': 'http://{}:{}@{}'.format(self.username, self.password, proxy_ip)
                }
                requests.get('http://www.baidu.com', proxies=proxies, timeout=3.05)
                # The proxy works: move it to the end, where process_request reads it.
                self.proxy_ip_list.append(proxy_ip)
                return
            except Exception:
                # The proxy failed: fetch one replacement, then test the next candidate.
                self.proxy_ip_list.append(requests.get(self.api_url).text.strip())

    def process_request(self, request, spider):
        self.update_ip()
        # The last list entry is the proxy that update_ip has just validated.
        request.meta['proxy'] = 'http://{}'.format(self.proxy_ip_list[-1])
        request.headers['Proxy-Authorization'] = basic_auth_header(self.username, self.password)
        return None

Activate the proxy middleware:
DOWNLOADER_MIDDLEWARES = {
    'your_project_name.middlewares.ProxyDownloaderMiddleware': 100,
}

If both dynamic rendering and proxy rotation are needed, combine the settings:
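DOWNLOADER_MIDDLEWARES = {
    'your_project_name.middlewares.JavaScriptMiddleware': 543,
    'your_project_name.middlewares.ProxyDownloaderMiddleware': 100,
}

Note that lower numeric values are processed first, so requests pass through the proxy middleware before the JavaScript middleware. Because JavaScriptMiddleware returns a response from process_request, Selenium performs the download itself, so the proxy stored in request.meta reaches Chrome only if the middleware passes it on, for example through Chrome's --proxy-server argument.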
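To make the ordering concrete, the snippet below sorts the combined setting by priority. It is a plain-Python illustration of the rule Scrapy documents (process_request hooks in increasing order, process_response hooks in decreasing order), not Scrapy's internal code; the variable names are made up for the example.

```python
DOWNLOADER_MIDDLEWARES = {
    "your_project_name.middlewares.JavaScriptMiddleware": 543,
    "your_project_name.middlewares.ProxyDownloaderMiddleware": 100,
}

# process_request hooks run from the lowest priority number to the highest.
request_order = sorted(DOWNLOADER_MIDDLEWARES, key=DOWNLOADER_MIDDLEWARES.get)

# process_response hooks run in the opposite direction.
response_order = list(reversed(request_order))

for path in request_order:
    # Show the priority next to the class name, in process_request order.
    print(DOWNLOADER_MIDDLEWARES[path], path.rsplit(".", 1)[-1])
```

In a real project, Scrapy merges this dict with its built-in DOWNLOADER_MIDDLEWARES_BASE before sorting, so the two custom middlewares are interleaved with the default ones according to the same numeric rule.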
Conclusion
The article provides a concise overview of Scrapy’s core workflow, common pitfalls, and practical solutions such as headless browsing, header spoofing, and IP rotation, helping readers quickly get started with web crawling projects.
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.
