
Integrating Playwright with Scrapy Using GerapyPlaywright: Installation, Configuration, and Usage

This article introduces the GerapyPlaywright package and explains how to install it, how to route Scrapy requests through Playwright with a downloader middleware and PlaywrightRequest, and how to crawl JavaScript-rendered pages, using a complete example spider with code snippets and log output.

Sohu Tech Products

GerapyPlaywright is a Python package that bridges Scrapy and Playwright, so that Scrapy projects can render JavaScript-heavy pages with Playwright.

The package is available on GitHub and PyPI.

Install it with:

pip3 install gerapy-playwright

Add the Playwright downloader middleware to settings.py:

DOWNLOADER_MIDDLEWARES = {
    'gerapy_playwright.downloadermiddlewares.PlaywrightMiddleware': 543,
}
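
The number 543 is the middleware's priority. Scrapy merges this dictionary with its built-in defaults and sorts the result by that integer: process_request hooks run in ascending order and process_response hooks in descending order. The snippet below is a minimal sketch of that ordering in plain Python, not Scrapy's actual loader. The three built-in entries are a small subset of Scrapy's documented defaults, and their values may differ between Scrapy versions, so treat them as assumptions.

```python
# Sketch only: mimics how Scrapy orders downloader middlewares by priority.
# BASE is a small, assumed subset of Scrapy's DOWNLOADER_MIDDLEWARES_BASE.
BASE = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
}
# The project-level entry from settings.py.
PROJECT = {
    'gerapy_playwright.downloadermiddlewares.PlaywrightMiddleware': 543,
}


def request_order(base, project):
    """Return middleware paths in the order their process_request hooks run."""
    merged = {**base, **project}
    # In Scrapy, a value of None disables a middleware, so drop those entries.
    enabled = {path: prio for path, prio in merged.items() if prio is not None}
    return sorted(enabled, key=enabled.get)


for path in request_order(BASE, PROJECT):
    print(path.rsplit('.', 1)[-1])
```

Under these assumed values, the Playwright middleware sees each request after the user-agent middleware and before the retry and proxy middlewares.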

Replace a normal scrapy.Request with PlaywrightRequest to let Playwright fetch the page and return the rendered HTML:

yield PlaywrightRequest(url, callback=self.parse_detail)

Several global configuration options can be set in settings.py, for example:

# headless mode
GERAPY_PLAYWRIGHT_HEADLESS = True
# request timeout (seconds)
GERAPY_PLAYWRIGHT_DOWNLOAD_TIMEOUT = 30
# hide WebDriver detection
GERAPY_PLAYWRIGHT_PRETEND = True
# proxy configuration
GERAPY_PLAYWRIGHT_PROXY = 'http://tps254.kdlapi.com:15818'
GERAPY_PLAYWRIGHT_PROXY_CREDENTIAL = {
    'username': 'xxx',
    'password': 'xxxx'
}
# automatic screenshot
GERAPY_PLAYWRIGHT_SCREENSHOT = {
    'type': 'png',
    'full_page': True
}

The PlaywrightRequest class also accepts per-request parameters, several of which override the corresponding global settings for that request only: url, callback, wait_until, wait_for, script, actions, proxy, proxy_credential, sleep, timeout, and pretend.
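
The override rule can be pictured as "use the per-request value if one was given, otherwise fall back to the global setting". The following is a minimal sketch of that precedence in plain Python. It is not GerapyPlaywright's internal code: the `effective_options` helper and the `ARG_TO_SETTING` mapping are hypothetical names introduced only for illustration, and only three of the parameters are shown.

```python
# Sketch only: per-request value wins, otherwise the global setting applies.
# This is an illustration, not GerapyPlaywright's implementation.
GLOBAL_SETTINGS = {
    'GERAPY_PLAYWRIGHT_DOWNLOAD_TIMEOUT': 30,
    'GERAPY_PLAYWRIGHT_PRETEND': True,
    'GERAPY_PLAYWRIGHT_PROXY': None,
}

# Hypothetical mapping from a request argument to the setting it overrides.
ARG_TO_SETTING = {
    'timeout': 'GERAPY_PLAYWRIGHT_DOWNLOAD_TIMEOUT',
    'pretend': 'GERAPY_PLAYWRIGHT_PRETEND',
    'proxy': 'GERAPY_PLAYWRIGHT_PROXY',
}


def effective_options(settings, **request_args):
    """Resolve each option: explicit request argument first, then the setting."""
    options = {}
    for arg, setting_name in ARG_TO_SETTING.items():
        value = request_args.get(arg)
        # None means "not given on the request", so fall back to the setting.
        options[arg] = settings.get(setting_name) if value is None else value
    return options


print(effective_options(GLOBAL_SETTINGS, timeout=10))
# {'timeout': 10, 'pretend': True, 'proxy': None}
```

In a spider, the equivalent would be passing the argument directly, for example PlaywrightRequest(url, callback=self.parse_detail, timeout=10).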

Example spider:

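import logging

import scrapy
from gerapy_playwright import PlaywrightRequest

logger = logging.getLogger(__name__)


class MovieSpider(scrapy.Spider):
    name = 'movie'
    allowed_domains = ['antispider1.scrape.center']
    base_url = 'https://antispider1.scrape.center'
    max_page = 10
    # Per-spider override of the global setting
    custom_settings = {
        'GERAPY_PLAYWRIGHT_PRETEND': True,
    }

    def start_requests(self):
        # Request each index page and wait until the '.item' elements are rendered
        for page in range(1, self.max_page + 1):
            url = f'{self.base_url}/page/{page}'
            logger.debug('start url %s', url)
            yield PlaywrightRequest(url, callback=self.parse_index, priority=10, wait_for='.item')

    def parse_index(self, response):
        # Follow the link of every item on the rendered index page
        items = response.css('.item')
        for item in items:
            href = item.css('a::attr(href)').get()
            detail_url = response.urljoin(href)
            logger.info('detail url %s', detail_url)
            yield PlaywrightRequest(detail_url, callback=self.parse_detail, wait_for='.item')

    def parse_detail(self, response):
        # Placeholder: the original example does not show this callback
        logger.info('detail page %s', response.url)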
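
To make the request flow concrete, the sketch below reproduces the URL arithmetic of the spider in plain Python, without Scrapy. The `href` value is a hypothetical example, since the real links come from the rendered page, and `urllib.parse.urljoin` stands in for `response.urljoin`, assuming the page has no `<base>` tag.

```python
from urllib.parse import urljoin

base_url = 'https://antispider1.scrape.center'
max_page = 10

# Same expression as in start_requests: one index URL per page number.
index_urls = [f'{base_url}/page/{page}' for page in range(1, max_page + 1)]
print(index_urls[0])    # https://antispider1.scrape.center/page/1
print(index_urls[-1])   # https://antispider1.scrape.center/page/10

# parse_index resolves each href against the URL of the index page.
href = '/detail/1'  # hypothetical href, for illustration only
detail_url = urljoin(index_urls[0], href)
print(detail_url)       # https://antispider1.scrape.center/detail/1
```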

Running the spider produces logs similar to the following, showing the middleware activation, Playwright options, request processing, and screenshots taken:

2021-12-27 16:54:14 [gerapy.playwright] INFO: playwright libraries already installed
2021-12-27 16:54:14 [scrapy.middleware] INFO: Enabled downloader middlewares: ... 'gerapy_playwright.downloadermiddlewares.PlaywrightMiddleware' ...
2021-12-27 16:54:14 [gerapy.playwright] DEBUG: processing request
2021-12-27 16:54:16 [gerapy.playwright] DEBUG: waiting for .item
2021-12-27 16:54:19 [gerapy.playwright] DEBUG: taking screenshot using args {'type': 'png', 'full_page': True}
...
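
When debugging, it can help to keep the Playwright messages and quiet the rest. The snippet below is a small sketch using only the standard logging module. The logger name gerapy.playwright is taken from the log lines above; raising the scrapy logger to WARNING is a choice made here for illustration, not something the package requires.

```python
import logging

logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s [%(name)s] %(levelname)s: %(message)s',
)

# Keep DEBUG output from the Playwright middleware.
logging.getLogger('gerapy.playwright').setLevel(logging.DEBUG)
# Silence Scrapy's own INFO and DEBUG messages (child loggers inherit this level).
logging.getLogger('scrapy').setLevel(logging.WARNING)

logging.getLogger('scrapy.middleware').info('suppressed')   # not emitted
logging.getLogger('gerapy.playwright').debug('still shown')  # emitted
```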

For more examples and test code, refer to the example directory. Users are encouraged to try the package, provide feedback, and star the repository.

