How to Master Web Scraping with Playwright: A Step‑by‑Step Python Guide
This article walks you through using Playwright in Python to scrape the Xinfadi market website, showing how to capture URLs, handle request payloads, extract JSON data, and process results with a complete, runnable code example.
1. Introduction
In this tutorial I share a practical Playwright‑based web crawler for Python. After a brief introduction I demonstrate the tool on the Xinfadi market website, showing that the approach is easy to pick up and can greatly improve scraping efficiency.
2. Implementation Process
We start with the Xinfadi homepage, which originally used a GET request and later switched to POST. By inspecting the network traffic we locate the data endpoint http://www.xinfadi.com.cn/getPriceData.html. The request URL stays constant while the payload changes from page to page, in particular the "current" field, which controls pagination.
By changing only the target URL and the response.url filter in the Playwright script, we can capture each page's JSON payload as it arrives.
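To see the pagination behavior outside the browser, the POST call can also be sketched directly. The "limit" and "current" field names below are assumptions read off the network panel as described above; check them against what your browser actually sends before relying on them.

```python
import json
import urllib.parse
import urllib.request

URL = 'http://www.xinfadi.com.cn/getPriceData.html'

def build_payload(current, limit=20):
    # Form fields for one page of results; 'current' selects the page,
    # 'limit' the page size. Field names assumed from the network panel.
    return {'limit': limit, 'current': current}

def fetch_page(current, limit=20):
    # POST the form-encoded payload and decode the JSON response.
    data = urllib.parse.urlencode(build_payload(current, limit)).encode()
    req = urllib.request.Request(URL, data=data)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode())
```

Iterating `current` from 1 upward then walks through the pages without driving a browser at all.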
Below is the complete Playwright script. It launches a Chromium browser, intercepts requests and responses, filters the target URL, parses the JSON, and prints selected fields (id, product name, category, place).
```python
from playwright.sync_api import Playwright, sync_playwright
import logging

# pip install playwright && playwright install

"""
First write a normal navigation script, then register the handlers:
page.on("request", lambda request: handle(request=request, response=None))
page.on("response", lambda response: handle(response=response, request=None))
"""

logging.basicConfig(format='%(asctime)s | %(levelname)s : %(message)s',
                    level=logging.INFO)


def handle_json(payload):
    # Process the JSON payload: print selected fields for each record.
    for item in payload['list']:
        print(item['id'], item['prodName'], item['prodCat'], item['place'])


def handle(request, response):
    # Only responses from the data endpoint carry the price records.
    if response is not None and \
            response.url == 'http://www.xinfadi.com.cn/getPriceData.html':
        handle_json(response.json())


def run(playwright: Playwright) -> None:
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context(ignore_https_errors=True)
    page = context.new_page()
    page.on("request", lambda request: handle(request=request, response=None))
    page.on("response", lambda response: handle(response=response, request=None))
    page.goto('http://www.xinfadi.com.cn/index.html')
    # Scroll to trigger additional data requests, then leave time
    # for the responses to arrive before shutting down.
    page.mouse.wheel(0, 300)
    page.wait_for_timeout(50000)
    page.close()
    context.close()
    browser.close()


with sync_playwright() as playwright:
    run(playwright)
```

Running the script prints the extracted records, confirming that the data has been successfully retrieved.
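Printing from inside the handler works, but collecting the records into a list makes later processing easier. A minimal sketch of that extraction step, using the field names shown above ('list', 'id', 'prodName', 'prodCat', 'place'); the sample payload is an illustrative stand-in, not real endpoint output:

```python
def extract_records(payload):
    # Pull the printed fields out of the endpoint's JSON payload.
    # .get() guards against short or missing 'list' entries, unlike
    # iterating a fixed range(20).
    return [
        (item.get('id'), item.get('prodName'),
         item.get('prodCat'), item.get('place'))
        for item in payload.get('list', [])
    ]

# Tiny sample shaped like the endpoint's response (illustrative only):
sample = {'list': [{'id': 1, 'prodName': 'cabbage',
                    'prodCat': 'vegetable', 'place': 'Hebei'}]}
print(extract_records(sample))
```

Swapping this in for handle_json lets the script accumulate rows for export instead of only printing them.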
3. Conclusion
The tutorial demonstrates a complete Playwright web-scraping case, covering request inspection, payload handling, and data extraction. Readers can adapt the script by changing the target URL and the JSON-processing logic to suit other sites.
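As a sketch of that adaptation, the response filter can be factored into a small factory so the same skeleton serves other sites; the endpoint URL and processing function become parameters. The names and URLs below are illustrative, not part of the original script:

```python
def make_handler(endpoint, process):
    # Build a response callback bound to one endpoint and one
    # JSON-processing function, suitable for page.on("response", ...).
    def on_response(response):
        if response.url == endpoint:
            process(response.json())
    return on_response

# Quick check with a stand-in response object (no browser needed):
class FakeResponse:
    def __init__(self, url, data):
        self.url, self._data = url, data
    def json(self):
        return self._data

seen = []
handler = make_handler('http://example.com/data', seen.append)
handler(FakeResponse('http://example.com/data', {'ok': 1}))
handler(FakeResponse('http://example.com/other', {'ok': 2}))
print(seen)  # only the matching endpoint's payload is collected
```

In the real script this would be registered with page.on("response", make_handler(url, handle_json)), keeping site-specific details in one place.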
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Python Crawling & Data Mining
Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!
