Integrating Playwright with Scrapy Using GerapyPlaywright: Installation, Configuration, and Usage
This article introduces the GerapyPlaywright package, explains how to install it, configure Scrapy to use Playwright via middleware and PlaywrightRequest, and provides a complete example spider with code snippets and logging output for JavaScript‑rendered page crawling.
In this technical note, the author presents GerapyPlaywright, a Python package that bridges Scrapy and Playwright, enabling Scrapy projects to render JavaScript‑heavy pages using Playwright.
The package, named GerapyPlaywright, is available on GitHub and PyPI.
Install it with:
pip3 install gerapy-playwright
Add the Playwright downloader middleware to settings.py:
DOWNLOADER_MIDDLEWARES = {
'gerapy_playwright.downloadermiddlewares.PlaywrightMiddleware': 543,
}
Replace a normal scrapy.Request with PlaywrightRequest to let Playwright fetch the page and return the rendered HTML:
yield PlaywrightRequest(url, callback=self.parse_detail)
Several global configuration options can be set in settings.py, for example:
# headless mode
GERAPY_PLAYWRIGHT_HEADLESS = True
# request timeout (seconds)
GERAPY_PLAYWRIGHT_DOWNLOAD_TIMEOUT = 30
# hide WebDriver detection
GERAPY_PLAYWRIGHT_PRETEND = True
# proxy configuration
GERAPY_PLAYWRIGHT_PROXY = 'http://tps254.kdlapi.com:15818'
GERAPY_PLAYWRIGHT_PROXY_CREDENTIAL = {
'username': 'xxx',
'password': 'xxxx'
}
# automatic screenshot
GERAPY_PLAYWRIGHT_SCREENSHOT = {
'type': 'png',
'full_page': True
}
The PlaywrightRequest class also supports per-request parameters that override the global settings, such as url, callback, wait_until, wait_for, script, actions, proxy, proxy_credential, sleep, timeout, and pretend.
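As a sketch of how those per-request overrides might be combined (the URL, selector, and concrete values below are invented for illustration and are not from the original article), a single request could override the global settings like this:

```python
from gerapy_playwright import PlaywrightRequest

# Per-request settings take precedence over the corresponding
# GERAPY_PLAYWRIGHT_* globals in settings.py for this request only.
# All concrete values here are example assumptions.
request = PlaywrightRequest(
    'https://antispider1.scrape.center/page/1',
    wait_for='.item',               # CSS selector to wait for before returning HTML
    script='() => document.title',  # JavaScript evaluated in the page context
    sleep=1,                        # extra seconds to wait after rendering
    timeout=30,                     # per-request download timeout, in seconds
    pretend=True,                   # hide common WebDriver fingerprints
)
```

Because these are ordinary keyword arguments, a spider can tune waiting and timing behavior per page type while leaving the global defaults untouched.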
Example spider:
import logging

import scrapy
from gerapy_playwright import PlaywrightRequest

logger = logging.getLogger(__name__)


class MovieSpider(scrapy.Spider):
    name = 'movie'
    allowed_domains = ['antispider1.scrape.center']
    base_url = 'https://antispider1.scrape.center'
    max_page = 10
    custom_settings = {
        'GERAPY_PLAYWRIGHT_PRETEND': True,
    }

    def start_requests(self):
        for page in range(1, self.max_page + 1):
            url = f'{self.base_url}/page/{page}'
            logger.debug('start url %s', url)
            yield PlaywrightRequest(url, callback=self.parse_index, priority=10, wait_for='.item')

    def parse_index(self, response):
        items = response.css('.item')
        for item in items:
            href = item.css('a::attr(href)').extract_first()
            detail_url = response.urljoin(href)
            logger.info('detail url %s', detail_url)
            yield PlaywrightRequest(detail_url, callback=self.parse_detail, wait_for='.item')

Running the spider produces logs similar to the following, showing the middleware activation, Playwright options, request processing, and screenshots taken:
2021-12-27 16:54:14 [gerapy.playwright] INFO: playwright libraries already installed
2021-12-27 16:54:14 [scrapy.middleware] INFO: Enabled downloader middlewares: ... 'gerapy_playwright.downloadermiddlewares.PlaywrightMiddleware' ...
2021-12-27 16:54:14 [gerapy.playwright] DEBUG: processing request
2021-12-27 16:54:16 [gerapy.playwright] DEBUG: waiting for .item
2021-12-27 16:54:19 [gerapy.playwright] DEBUG: taking screenshot using args {'type': 'png', 'full_page': True}
...
For more examples and test code, refer to the example directory. Users are encouraged to try the package, provide feedback, and star the repository.
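The example spider hands detail pages to a parse_detail callback that the excerpt above does not define. A minimal sketch of such a callback might look like the following; the selectors and field names here are assumptions for illustration, not taken from the original code:

```python
    def parse_detail(self, response):
        # Sketch of a detail-page callback for the spider above.
        # By the time this runs, Playwright has already rendered the page,
        # so the usual Scrapy selectors work on the final HTML.
        yield {
            'url': response.url,
            'name': response.css('h2::text').extract_first(),        # assumed selector
            'categories': response.css('.categories span::text').extract(),  # assumed selector
        }
```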
Sohu Tech Products
A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.