Integrating Playwright with Scrapy Using GerapyPlaywright: Installation, Configuration, and Usage
This article introduces the GerapyPlaywright package, explains how to install it, configure Scrapy to use Playwright via middleware and PlaywrightRequest, and provides a complete example spider with code snippets and logging output for JavaScript‑rendered page crawling.
In this technical note, the author presents GerapyPlaywright, a Python package that bridges Scrapy and Playwright, enabling Scrapy projects to render JavaScript‑heavy pages using Playwright.
The package, named GerapyPlaywright, is available on GitHub and PyPI.
Install it with:
pip3 install gerapy-playwright
Add the Playwright downloader middleware to settings.py:
DOWNLOADER_MIDDLEWARES = {
'gerapy_playwright.downloadermiddlewares.PlaywrightMiddleware': 543,
}
Replace a normal scrapy.Request with PlaywrightRequest to let Playwright fetch the page and return the rendered HTML:
yield PlaywrightRequest(url, callback=self.parse_detail)
Several global configuration options can be set in settings.py, for example:
# headless mode
GERAPY_PLAYWRIGHT_HEADLESS = True
# request timeout (seconds)
GERAPY_PLAYWRIGHT_DOWNLOAD_TIMEOUT = 30
# hide WebDriver detection
GERAPY_PLAYWRIGHT_PRETEND = True
# proxy configuration
GERAPY_PLAYWRIGHT_PROXY = 'http://tps254.kdlapi.com:15818'
GERAPY_PLAYWRIGHT_PROXY_CREDENTIAL = {
'username': 'xxx',
'password': 'xxxx'
}
# automatic screenshot
GERAPY_PLAYWRIGHT_SCREENSHOT = {
'type': 'png',
'full_page': True
}
The PlaywrightRequest class also supports per-request parameters that override the global settings, such as url, callback, wait_until, wait_for, script, actions, proxy, proxy_credential, sleep, timeout, and pretend.
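As a sketch of how those per-request overrides might be combined (the URL, selector, and concrete values below are invented for illustration and are not from the original article), a single request could override the global settings like this:

```python
from gerapy_playwright import PlaywrightRequest

# Per-request settings take precedence over the corresponding
# GERAPY_PLAYWRIGHT_* globals in settings.py for this request only.
# All concrete values here are example assumptions.
request = PlaywrightRequest(
    'https://antispider1.scrape.center/page/1',
    wait_for='.item',               # CSS selector to wait for before returning HTML
    script='() => document.title',  # JavaScript evaluated in the page context
    sleep=1,                        # extra seconds to wait after rendering
    timeout=30,                     # per-request download timeout, in seconds
    pretend=True,                   # hide common WebDriver fingerprints
)
```

Because these are ordinary keyword arguments, a spider can tune waiting and timing behavior per page type while leaving the global defaults untouched.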
Example spider:
import logging

import scrapy
from gerapy_playwright import PlaywrightRequest

logger = logging.getLogger(__name__)


class MovieSpider(scrapy.Spider):
    name = 'movie'
    allowed_domains = ['antispider1.scrape.center']
    base_url = 'https://antispider1.scrape.center'
    max_page = 10
    custom_settings = {
        'GERAPY_PLAYWRIGHT_PRETEND': True,
    }

    def start_requests(self):
        for page in range(1, self.max_page + 1):
            url = f'{self.base_url}/page/{page}'
            logger.debug('start url %s', url)
            yield PlaywrightRequest(url, callback=self.parse_index, priority=10, wait_for='.item')

    def parse_index(self, response):
        items = response.css('.item')
        for item in items:
            href = item.css('a::attr(href)').extract_first()
            detail_url = response.urljoin(href)
            logger.info('detail url %s', detail_url)
            yield PlaywrightRequest(detail_url, callback=self.parse_detail, wait_for='.item')

Running the spider produces logs similar to the following, showing the middleware activation, Playwright options, request processing, and screenshots taken:
2021-12-27 16:54:14 [gerapy.playwright] INFO: playwright libraries already installed
2021-12-27 16:54:14 [scrapy.middleware] INFO: Enabled downloader middlewares: ... 'gerapy_playwright.downloadermiddlewares.PlaywrightMiddleware' ...
2021-12-27 16:54:14 [gerapy.playwright] DEBUG: processing request
2021-12-27 16:54:16 [gerapy.playwright] DEBUG: waiting for .item
2021-12-27 16:54:19 [gerapy.playwright] DEBUG: taking screenshot using args {'type': 'png', 'full_page': True}
...
For more examples and test code, refer to the example directory. Users are encouraged to try the package, provide feedback, and star the repository.
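The example spider hands detail pages to a parse_detail callback that the excerpt above does not define. A minimal sketch of such a callback might look like the following; the selectors and field names here are assumptions for illustration, not taken from the original code:

```python
    def parse_detail(self, response):
        # Sketch of a detail-page callback for the spider above.
        # By the time this runs, Playwright has already rendered the page,
        # so the usual Scrapy selectors work on the final HTML.
        yield {
            'url': response.url,
            'name': response.css('h2::text').extract_first(),        # assumed selector
            'categories': response.css('.categories span::text').extract(),  # assumed selector
        }
```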
Sohu Tech Products
A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.