Scraping Frontend Job Listings from Boss Zhipin with Puppeteer and Storing in MySQL using NestJS
This tutorial demonstrates how to use Puppeteer to crawl frontend job postings from the Boss Zhipin website, extract details and job descriptions, and persist the data into a MySQL database through a NestJS application with TypeORM, including Docker setup and optional SSE streaming.
We start by creating a new Node.js project, installing puppeteer, and writing a test.js script that launches a visible browser, navigates to the Boss Zhipin search page for "frontend" positions, and waits for the job list to load.
The script then clicks the city selector, enters the keyword, and triggers the search. By navigating directly to a URL with query and city parameters (e.g., https://www.zhipin.com/web/geek/job?query=前端&city=100010000), we can skip these UI interactions entirely.
To determine the total number of pages, we evaluate the second-to-last pagination link (.options-pages a:nth-last-child(2)) and parse its text content. We then loop through each page, loading the list and extracting job information (name, area, salary, link, company) via DOM queries.
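A sketch of that step: the pagination selector is the one named in the article, while parseTotalPages and the per-card selectors (.job-card-wrapper, .job-name, etc.) are illustrative names, not verified against the live site.

```typescript
// Hypothetical helper: the second-to-last pagination link holds the last
// page number; parse it, falling back to 1 if it is not numeric.
function parseTotalPages(text: string | null): number {
  const n = parseInt((text ?? '').trim(), 10);
  return Number.isNaN(n) ? 1 : n;
}

// Sketch: read the page total, then extract one record per job card.
// The card and field selectors below are assumptions for illustration.
async function scrapeListPage(page: any): Promise<object[]> {
  const totalText = await page.$eval(
    '.options-pages a:nth-last-child(2)', // selector from the article
    (a: Element) => a.textContent,
  );
  const totalPages = parseTotalPages(totalText);
  console.log(`total pages: ${totalPages}`);

  return page.evaluate(() =>
    Array.from(document.querySelectorAll('.job-card-wrapper')).map((card) => ({
      name: card.querySelector('.job-name')?.textContent?.trim(),
      area: card.querySelector('.job-area')?.textContent?.trim(),
      salary: card.querySelector('.salary')?.textContent?.trim(),
      link: card.querySelector('a')?.getAttribute('href'),
      company: card.querySelector('.company-name')?.textContent?.trim(),
    })),
  );
}
```

The fallback to 1 keeps the loop working when the pagination bar is missing (e.g., a single page of results).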
After gathering the list, we visit each job's detail page, wait for the description selector (.job-sec-text), and capture the full job description, handling possible timeouts with a try/catch block.
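A sketch of the detail-page step: .job-sec-text is the selector from the article; the 10-second timeout, the relative-link assumption, the cleanText helper, and the empty-string fallback are all choices of this sketch.

```typescript
// Hypothetical helper: normalize whitespace in the scraped description.
function cleanText(s: string | null): string {
  return (s ?? '').replace(/\s+/g, ' ').trim();
}

// Sketch: open one detail page and read the description, tolerating slow
// or blocked pages. Assumes the list yields site-relative links.
async function fetchDescription(page: any, link: string): Promise<string> {
  try {
    await page.goto(`https://www.zhipin.com${link}`, { waitUntil: 'domcontentloaded' });
    await page.waitForSelector('.job-sec-text', { timeout: 10_000 });
    const raw = await page.$eval('.job-sec-text', (el: Element) => el.textContent);
    return cleanText(raw);
  } catch {
    // Timeouts (or captcha interstitials) land here; skip this job
    // instead of aborting the whole crawl.
    return '';
  }
}
```

Catching per-page failures keeps one slow or captcha-gated detail page from killing the entire run.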
Next, we set up a NestJS project (nest new boss-jd-spider) and run MySQL in Docker, creating a database named boss-spider. The TypeOrmModule.forRoot configuration connects Nest to MySQL using the mysql2 driver.
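The module wiring might look like the sketch below. The credentials, port, and synchronize flag are illustrative defaults, not values from the article; adjust them to your Docker setup.

```typescript
// app.module.ts — minimal sketch of the TypeORM connection.
import { Module } from '@nestjs/common';
import { TypeOrmModule } from '@nestjs/typeorm';
import { Job } from './job.entity';

@Module({
  imports: [
    TypeOrmModule.forRoot({
      type: 'mysql',          // uses the mysql2 driver under the hood
      host: 'localhost',
      port: 3306,
      username: 'root',
      password: 'your-password', // placeholder credential
      database: 'boss-spider',
      entities: [Job],
      synchronize: true,      // auto-create tables; development only
    }),
  ],
})
export class AppModule {}
```

synchronize: true is convenient while developing but should be off in production, where schema changes belong in migrations.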
We define a Job entity with columns for name, area, salary, link, company, and description, using appropriate lengths and a text type for the description field.
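The entity might be sketched as follows; the column lengths are illustrative defaults, not values from the article, and the description column is named desc to match the SQL query shown later.

```typescript
// job.entity.ts — sketch of the Job entity described above.
import { Column, Entity, PrimaryGeneratedColumn } from 'typeorm';

@Entity()
export class Job {
  @PrimaryGeneratedColumn()
  id: number;

  @Column({ length: 60 })
  name: string;

  @Column({ length: 60 })
  area: string;

  @Column({ length: 60 })
  salary: string;

  @Column({ length: 600 })
  link: string;

  @Column({ length: 60 })
  company: string;

  @Column({ type: 'text' }) // descriptions exceed varchar limits
  desc: string;
}
```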
In AppService, we implement a startSpider() method that repeats the earlier Puppeteer logic (initially with headless: false for debugging, later switched to true), iterates over all pages, collects the job data, fetches each description, and saves each record via EntityManager.save(Job, job).
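A sketch of the persistence side of the service; the saveJob helper name is illustrative, and the crawling code that would call it is assumed to produce plain objects matching the entity's fields.

```typescript
// app.service.ts — persistence step only; the Puppeteer loop in
// startSpider() would call saveJob() for each scraped record.
import { Injectable } from '@nestjs/common';
import { InjectEntityManager } from '@nestjs/typeorm';
import { EntityManager } from 'typeorm';
import { Job } from './job.entity';

@Injectable()
export class AppService {
  @InjectEntityManager()
  private entityManager: EntityManager;

  // Persist one scraped record, as in the article's
  // EntityManager.save(Job, job) call.
  async saveJob(data: Partial<Job>): Promise<void> {
    const job = new Job();
    Object.assign(job, data);
    await this.entityManager.save(Job, job);
  }
}
```

Saving record by record keeps the write path simple; batching the saves would be an easy optimization if insert volume becomes a problem.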
An HTTP GET endpoint @Get('start-spider') triggers the spider, returning a simple confirmation message.
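The controller might look like this sketch; firing startSpider() without awaiting it is a choice of this sketch so the HTTP response returns immediately while the crawl runs in the background.

```typescript
// app.controller.ts — sketch of the trigger endpoint.
import { Controller, Get } from '@nestjs/common';
import { AppService } from './app.service';

@Controller()
export class AppController {
  constructor(private readonly appService: AppService) {}

  @Get('start-spider')
  startSpider(): string {
    this.appService.startSpider(); // fire and forget; crawl runs async
    return 'spider started';
  }
}
```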
We also note that Boss Zhipin may present a captcha if the request frequency is high; the tutorial suggests manually solving it.
Finally, the stored data can be queried directly, e.g., SELECT * FROM `boss-spider`.job WHERE `desc` LIKE '%React%';. The article also notes that SSE (server-sent events) could be used to stream results to a frontend in real time.
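The SSE idea could be sketched with NestJS's @Sse decorator as below; the route name, the two-second interval, and the placeholder payload are all illustrative, since the article only mentions SSE as an option.

```typescript
// sse.controller.ts — sketch of streaming results over SSE.
import { Controller, MessageEvent, Sse } from '@nestjs/common';
import { interval, map, Observable } from 'rxjs';

@Controller()
export class SseController {
  // Clients subscribe with: new EventSource('/job-stream')
  @Sse('job-stream')
  stream(): Observable<MessageEvent> {
    // Emit a placeholder event every 2s; a real implementation would
    // push newly saved Job rows from the spider instead.
    return interval(2000).pipe(
      map((i) => ({ data: { seq: i, message: 'job batch placeholder' } })),
    );
  }
}
```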
Rare Earth Juejin Tech Community