Scraping Frontend Job Listings from Boss Zhipin with Puppeteer and Storing in MySQL using NestJS
This tutorial demonstrates how to use Puppeteer to crawl frontend job postings from the Boss Zhipin website, extract details and job descriptions, and persist the data into a MySQL database through a NestJS application with TypeORM, including Docker setup and optional SSE streaming.
We start by creating a new Node.js project, installing puppeteer, and writing a test.js script that launches a visible browser, navigates to the Boss Zhipin search page for "frontend" positions, and waits for the job list to load.
The script then clicks the city selector, enters the keyword, and triggers the search. By navigating directly to a URL with query and city parameters (e.g., https://www.zhipin.com/web/geek/job?query=前端&city=100010000), we can skip these UI interactions entirely.
To determine the total number of pages, we evaluate the second-to-last pagination link (.options-pages a:nth-last-child(2)) and parse its text content. We then loop through each page, loading the list and extracting job information (name, area, salary, link, company) via DOM queries.
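A sketch of that step: the pagination selector is the one named in the article, while parseTotalPages and the per-card selectors (.job-card-wrapper, .job-name, etc.) are illustrative names, not verified against the live site.

```typescript
// Hypothetical helper: the second-to-last pagination link holds the last
// page number; parse it, falling back to 1 if it is not numeric.
function parseTotalPages(text: string | null): number {
  const n = parseInt((text ?? '').trim(), 10);
  return Number.isNaN(n) ? 1 : n;
}

// Sketch: read the page total, then extract one record per job card.
// The card and field selectors below are assumptions for illustration.
async function scrapeListPage(page: any): Promise<object[]> {
  const totalText = await page.$eval(
    '.options-pages a:nth-last-child(2)', // selector from the article
    (a: Element) => a.textContent,
  );
  const totalPages = parseTotalPages(totalText);
  console.log(`total pages: ${totalPages}`);

  return page.evaluate(() =>
    Array.from(document.querySelectorAll('.job-card-wrapper')).map((card) => ({
      name: card.querySelector('.job-name')?.textContent?.trim(),
      area: card.querySelector('.job-area')?.textContent?.trim(),
      salary: card.querySelector('.salary')?.textContent?.trim(),
      link: card.querySelector('a')?.getAttribute('href'),
      company: card.querySelector('.company-name')?.textContent?.trim(),
    })),
  );
}
```

The fallback to 1 keeps the loop working when the pagination bar is missing (e.g., a single page of results).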
After gathering the list, we visit each job's detail page, wait for the description selector (.job-sec-text), and capture the full job description, handling possible timeouts with a try/catch block.
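A sketch of the detail-page step: .job-sec-text is the selector from the article; the 10-second timeout, the relative-link assumption, the cleanText helper, and the empty-string fallback are all choices of this sketch.

```typescript
// Hypothetical helper: normalize whitespace in the scraped description.
function cleanText(s: string | null): string {
  return (s ?? '').replace(/\s+/g, ' ').trim();
}

// Sketch: open one detail page and read the description, tolerating slow
// or blocked pages. Assumes the list yields site-relative links.
async function fetchDescription(page: any, link: string): Promise<string> {
  try {
    await page.goto(`https://www.zhipin.com${link}`, { waitUntil: 'domcontentloaded' });
    await page.waitForSelector('.job-sec-text', { timeout: 10_000 });
    const raw = await page.$eval('.job-sec-text', (el: Element) => el.textContent);
    return cleanText(raw);
  } catch {
    // Timeouts (or captcha interstitials) land here; skip this job
    // instead of aborting the whole crawl.
    return '';
  }
}
```

Catching per-page failures keeps one slow or captcha-gated detail page from killing the entire run.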
Next, we set up a NestJS project (nest new boss-jd-spider) and run MySQL in Docker, creating a database named boss-spider. The TypeOrmModule.forRoot configuration connects Nest to MySQL using the mysql2 driver.
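The module wiring might look like the sketch below. The credentials, port, and synchronize flag are illustrative defaults, not values from the article; adjust them to your Docker setup.

```typescript
// app.module.ts — minimal sketch of the TypeORM connection.
import { Module } from '@nestjs/common';
import { TypeOrmModule } from '@nestjs/typeorm';
import { Job } from './job.entity';

@Module({
  imports: [
    TypeOrmModule.forRoot({
      type: 'mysql',          // uses the mysql2 driver under the hood
      host: 'localhost',
      port: 3306,
      username: 'root',
      password: 'your-password', // placeholder credential
      database: 'boss-spider',
      entities: [Job],
      synchronize: true,      // auto-create tables; development only
    }),
  ],
})
export class AppModule {}
```

synchronize: true is convenient while developing but should be off in production, where schema changes belong in migrations.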
We define a Job entity with columns for name, area, salary, link, company, and description, using appropriate lengths and a text type for the description field.
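The entity might be sketched as follows; the column lengths are illustrative defaults, not values from the article, and the description column is named desc to match the SQL query shown later.

```typescript
// job.entity.ts — sketch of the Job entity described above.
import { Column, Entity, PrimaryGeneratedColumn } from 'typeorm';

@Entity()
export class Job {
  @PrimaryGeneratedColumn()
  id: number;

  @Column({ length: 60 })
  name: string;

  @Column({ length: 60 })
  area: string;

  @Column({ length: 60 })
  salary: string;

  @Column({ length: 600 })
  link: string;

  @Column({ length: 60 })
  company: string;

  @Column({ type: 'text' }) // descriptions exceed varchar limits
  desc: string;
}
```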
In AppService, we implement a startSpider() method that repeats the earlier Puppeteer logic (initially with headless: false for debugging, later switched to true), iterates over all pages, collects the job data, fetches each description, and saves each record via EntityManager.save(Job, job).
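A sketch of the persistence side of the service; the saveJob helper name is illustrative, and the crawling code that would call it is assumed to produce plain objects matching the entity's fields.

```typescript
// app.service.ts — persistence step only; the Puppeteer loop in
// startSpider() would call saveJob() for each scraped record.
import { Injectable } from '@nestjs/common';
import { InjectEntityManager } from '@nestjs/typeorm';
import { EntityManager } from 'typeorm';
import { Job } from './job.entity';

@Injectable()
export class AppService {
  @InjectEntityManager()
  private entityManager: EntityManager;

  // Persist one scraped record, as in the article's
  // EntityManager.save(Job, job) call.
  async saveJob(data: Partial<Job>): Promise<void> {
    const job = new Job();
    Object.assign(job, data);
    await this.entityManager.save(Job, job);
  }
}
```

Saving record by record keeps the write path simple; batching the saves would be an easy optimization if insert volume becomes a problem.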
An HTTP GET endpoint @Get('start-spider') triggers the spider, returning a simple confirmation message.
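The controller might look like this sketch; firing startSpider() without awaiting it is a choice of this sketch so the HTTP response returns immediately while the crawl runs in the background.

```typescript
// app.controller.ts — sketch of the trigger endpoint.
import { Controller, Get } from '@nestjs/common';
import { AppService } from './app.service';

@Controller()
export class AppController {
  constructor(private readonly appService: AppService) {}

  @Get('start-spider')
  startSpider(): string {
    this.appService.startSpider(); // fire and forget; crawl runs async
    return 'spider started';
  }
}
```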
We also note that Boss Zhipin may present a captcha if the request frequency is high; the tutorial suggests manually solving it.
Finally, the stored data can be queried directly, e.g., SELECT * FROM `boss-spider`.job WHERE `desc` LIKE '%React%';. The article also notes that SSE (server-sent events) could be used to stream results to a frontend in real time.
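The SSE idea could be sketched with NestJS's @Sse decorator as below; the route name, the two-second interval, and the placeholder payload are all illustrative, since the article only mentions SSE as an option.

```typescript
// sse.controller.ts — sketch of streaming results over SSE.
import { Controller, MessageEvent, Sse } from '@nestjs/common';
import { interval, map, Observable } from 'rxjs';

@Controller()
export class SseController {
  // Clients subscribe with: new EventSource('/job-stream')
  @Sse('job-stream')
  stream(): Observable<MessageEvent> {
    // Emit a placeholder event every 2s; a real implementation would
    // push newly saved Job rows from the spider instead.
    return interval(2000).pipe(
      map((i) => ({ data: { seq: i, message: 'job batch placeholder' } })),
    );
  }
}
```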
Rare Earth Juejin Tech Community