How to Build a Full‑Stack Web Scraper with Puppeteer, Node & MySQL
This tutorial walks through creating a headless‑browser crawler using Puppeteer, extracting data from Juejin, storing it in a MySQL database, and exposing the results through an Express web service with user authentication, subscription, and a simple front‑end UI.
Background
The author wants to automate daily report submission, which normally means opening the weekly report system, logging in, and copying and pasting data. To avoid these repetitive manual steps, this article explores a front-end crawler built on Puppeteer, using Juejin as a practice target.
Goal
Knowledge Points
Front‑end crawling basics: traditional back‑end crawlers rely on APIs, but pages rendered by the browser need a headless browser to fetch data.
Efficiency tools: simulating human actions to improve productivity.
Practical Output
Login authentication
Data persistence: scraped data is saved into a MySQL database via Navicat remote connection.
Subscription updates: when a subscribed tag receives new articles, an email notification is sent.
Project Pre‑setup
Dependencies (package.json)
{
"name": "crawl",
"version": "1.0.0",
"description": "",
"main": "index.js",
"scripts": { "test": "echo \"Error: no test specified\" && exit 1" },
"keywords": [],
"author": "",
"license": "ISC",
"dependencies": {
"axios": "^0.27.2",
"bluebird": "^3.7.2",
"body-parser": "^1.20.0",
"chalk": "^5.0.1",
"cheerio": "^1.0.0-rc.11",
"child_process": "^1.0.2",
"cron": "^2.0.0",
"ejs": "^3.1.8",
"express": "^4.18.1",
"express-session": "^1.17.3",
"iconv-lite": "^0.6.3",
"mysql": "^2.18.1",
"nodemailer": "^6.7.5",
"puppeteer": "^14.1.1",
"request": "^2.88.2",
"request-promise": "^4.2.6",
"urijs": "^1.19.11"
}
}
Directory Structure
.
├── 1. puppertee
│ ├── 1.js
│ ├── 2.js
│ ├── 3.js
│ ├── 4.js
│ ├── 5. 爬取京东.js
│ ├── baidu.png
│ └── items-0.png
├── 2. request
│ ├── 1.request-json.js
│ ├── 2.request-form.js
│ ├── 3.request-file.js
│ └── avatar.jpeg
├── 3. cheerio
│ ├── 1.cheerio.js
│ ├── 2.cheerio-selector.js
│ ├── 3.cheerio-attr.js
│ ├── 4.cheerio-props.js
│ └── 5.cheerio-find.js
├── 4. dependens
│ ├── 1. cron.js
│ ├── 2. error.js
│ ├── 3.debug.js
│ ├── 4. pm2.js
│ ├── 5. iconv-lite.js
│ ├── 6.mail.js
│ ├── 7.read.js
│ └── my-debug.js
├── bdyp.js
├── crawl-server
│ ├── app.js
│ ├── bin
│ │ └── www
│ ├── package-lock.json
│ ├── package.json
│ ├── public
│ │ ├── images
│ │ ├── javascripts
│ │ └── stylesheets
│ ├── routes
│ │ ├── index.js
│ │ └── users.js
│ └── uploads
│ ├── 50d33a30f74fd55ffc0f3c0aaea989b6
│ └── f24715d08bab6243f62bbe9f16a52d05
├── crawl.sql
├── db.js
├── mail.js
├── main.js
├── package-lock.json
├── package.json
├── read
│ ├── article-detail.js
│ ├── articles.js
│ ├── index.js
│ ├── tags.js
│ └── text.html
├── readme.md
├── utils
│ ├── domain-util.js
│ └── puppeteer-utils.js
├── web
│ ├── middleware
│ │ └── auth.js
│ ├── public
│ │ └── css
│ ├── router
│ │ └── bdyp.js
│ ├── server.js
│ ├── update
│ └── views
│ ├── detail.html
│ ├── footer.html
│ ├── header.html
│ ├── index.html
│ ├── login.html
│ └── subscribe.html
└── write
├── articles.js
├── index.js
└── tags.js
Step 1: Crawl Basics
Expected Output
Traditional crawler workflow: use the request package to fetch Juejin front‑end tag page HTML and extract article titles with a regular expression.
Front‑end crawler problem: pages rendered by JavaScript cannot be scraped with simple HTTP requests.
Front‑end crawler solution: use Puppeteer to control a headless Chrome instance.
Traditional Crawler Using Request
Goal
Fetch all article titles from Juejin’s front‑end tag page and save them to titles.txt.
Implementation
let request = require("request");
let url = "https://juejin.cn/tag/%E5%89%8D%E7%AB%AF";
let fs = require("fs");
let regexp = /class="title" data-v-\w+>(.+?)<\/a>/g;
request(url, (err, response, body) => {
if (err) throw err; // bail out on network errors before touching body
let titles = [];
body.replace(regexp, (matched, title) => {
titles.push(title);
});
console.log(titles);
fs.writeFileSync("titles.txt", titles.join("\n")); // join, otherwise the array is coerced to one comma-separated line
});
Result
Running node 1.js prints the list of titles and creates titles.txt with the extracted titles.
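The replace-as-accumulator trick above works, but the same extraction reads more directly with String.prototype.matchAll. The snippet below runs against a made-up inline sample rather than Juejin's live markup (the data-v attribute value is invented):

```javascript
// Collect every capture group of the title regex with matchAll.
// The HTML string here is an illustrative sample, not Juejin's real page.
const html =
  '<a class="title" data-v-abc123>First post</a>' +
  '<a class="title" data-v-abc123>Second post</a>';
const regexp = /class="title" data-v-\w+>(.+?)<\/a>/g;

const titles = [...html.matchAll(regexp)].map((m) => m[1]);
console.log(titles); // ["First post", "Second post"]
```

Either way, regex scraping only survives as long as the page keeps rendering its titles server-side, which is exactly the limitation the next section addresses.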
Front‑end Crawler Using Puppeteer – Screenshot Baidu
Goal
Open Baidu homepage with Puppeteer and capture a screenshot saved as baidu.png.
Implementation
let puppeteer = require("puppeteer");
(async () => {
const browser = await puppeteer.launch();
let page = await browser.newPage();
await page.goto("http://baidu.com");
await page.screenshot({ path: "baidu.png" });
await browser.close();
})();
Front‑end Crawler Using Puppeteer – Crawl JD Mobile List
Goal
Search for “手机” on JD.com and retrieve the list of product links.
Implementation
const puppeteer = require("puppeteer");
(async function () {
const browser = await puppeteer.launch({ headless: false });
let page = await browser.newPage();
await page.setJavaScriptEnabled(true);
await page.goto("https://www.jd.com/");
const searchInput = await page.$("#key");
await searchInput.focus();
await page.keyboard.type("手机");
const searchBtn = await page.$(".button");
await searchBtn.click();
await page.waitForSelector(".gl-item");
const links = await page.$$eval(
".gl-item > .gl-i-wrap > .p-img > a",
(links) => {
return links.map((a) => ({ href: a.href.trim(), title: a.title }));
}
);
console.log(links);
await browser.close();
})();
Step 2: Crawl Juejin Data
Implement a Node service that fetches Juejin tags, articles, and article details, then stores them in MySQL.
Database Service
Use MySQL with Navicat for remote connection. Create a database named crawl-db and import the provided .sql schema.
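The article does not reproduce crawl.sql; judging from the queries in the readers and writers below, the schema covers five tables roughly as follows (the table and column names come from the SQL in this article, but the types and lengths are assumptions):

```sql
-- Hypothetical reconstruction of crawl.sql, inferred from the queries below.
-- Column types and lengths are assumptions, not the article's actual schema.
CREATE TABLE tags (
  id INT PRIMARY KEY AUTO_INCREMENT,
  name VARCHAR(64),
  image VARCHAR(255),
  url VARCHAR(255)
);

CREATE TABLE articles (
  id VARCHAR(32) PRIMARY KEY,  -- Juejin article id taken from the URL
  title VARCHAR(255),
  href VARCHAR(255),
  content LONGTEXT
);

CREATE TABLE article_tag (
  article_id VARCHAR(32),
  tag_id INT
);

CREATE TABLE users (
  id INT PRIMARY KEY AUTO_INCREMENT,
  email VARCHAR(128),
  password VARCHAR(128)
);

CREATE TABLE user_tag (
  user_id INT,
  tag_id INT
);
```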
Node Service (Express)
Provide /tag endpoint to fetch all Juejin tags and write them to the tags table.
Provide /article endpoint to fetch articles for a given tag, deduplicate, and write to articles, article_detail, and article_tag tables.
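Every route and writer below calls query from db.js, which the article never lists. A minimal sketch consistent with the await query(sql, params) usage, built on the mysql package from the dependency list, might look like this (the connection settings are placeholders and it requires a running MySQL server):

```javascript
// db.js – assumed implementation; the article does not show this file.
// Wraps mysql's callback API in a promise so callers can `await query(sql, params)`.
const mysql = require("mysql");

const pool = mysql.createPool({
  host: "localhost",     // placeholder: your MySQL host
  user: "root",          // placeholder
  password: "password",  // placeholder
  database: "crawl-db",  // database name from this article
  connectionLimit: 10,
});

function query(sql, params = []) {
  return new Promise((resolve, reject) => {
    pool.query(sql, params, (err, results) => {
      if (err) return reject(err);
      resolve(results);
    });
  });
}

module.exports = { query };
```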
Server Entry (web/server.js)
let express = require("express");
let bodyParser = require("body-parser");
let session = require("express-session");
let { checkLogin } = require("./middleware/auth");
const path = require("path");
const { query } = require("../db");
const CronJob = require("cron").CronJob;
const debug = require("debug")("crawl:server");
const { spawn } = require("child_process");
let app = express();
app.use(express.static("web/public"));
app.use(bodyParser.urlencoded({ extended: true }));
app.use(bodyParser.json());
app.use(session({ resave: true, saveUninitialized: true, secret: "wzyan" }));
app.use((req, res, next) => { res.locals.user = req.session.user; next(); });
app.set("view engine", "html");
app.set("views", path.resolve("web/views"));
app.engine("html", require("ejs").__express);
app.get("/", async function (req, res) {
let { tagId } = req.query;
let tags = await query(`SELECT * FROM tags`);
tagId = tagId || tags[0].id;
let articles = await query(
`SELECT a.* FROM articles a INNER JOIN article_tag t ON a.id = t.article_id WHERE t.tag_id = ?`,
[tagId]
);
res.render("index", { tags, articles });
});
app.get("/login", async function (req, res) {
res.render("login", { title: "登录" });
});
app.post("/login", async function (req, res) {
let { email, password } = req.body;
let oldUsers = await query(`SELECT * FROM users WHERE email=?`, [email]);
let user;
if (Array.isArray(oldUsers) && oldUsers.length > 0) {
user = oldUsers[0];
} else {
let result = await query(`INSERT INTO users(email,password) VALUES(?,?)`, [email, password]);
user = { id: result.insertId, email, password };
}
req.session.user = user;
res.redirect("/");
});
app.get("/subscribe", checkLogin, async function (req, res) {
let tags = await query(`SELECT * FROM tags`);
let user = req.session.user;
let selectedTags = await query(`SELECT tag_id FROM user_tag WHERE user_id = ?`, [user.id]);
let selectTagIds = selectedTags.map((item) => item["tag_id"]);
tags.forEach((item) => { item.subscribe = selectTagIds.includes(item.id); });
res.render("subscribe", { title: "请订阅你感兴趣的标签", tags });
});
app.post("/subscribe", checkLogin, async function (req, res) {
let { tags } = req.body;
if (!tags) tags = [];
if (typeof tags === "string") tags = [tags];
const getNum = (str) => str.replace(/[^0-9]/gi, "");
tags = tags.map(getNum);
let user = req.session.user;
await query(`DELETE FROM user_tag WHERE user_id=?`, [user.id]);
for (let i = 0; i < tags.length; i++) {
await query(`INSERT INTO user_tag(user_id,tag_id) VALUES(?,?)`, [user.id, parseInt(tags[i])]);
}
res.redirect("/");
});
app.get("/detail/:id", async function (req, res) {
let id = req.params.id;
let articles = await query(`SELECT * FROM articles WHERE id=?`, [id]);
res.render("detail", { article: articles[0] });
});
app.listen(8082);
process.on("uncaughtException", function (err) {
console.error("uncaughtException: %s", err.stack);
});
Utility Modules
domain-util.js extracts hostname from a URL.
const URI = require("urijs");
function getHostName(uri) {
uri = new URI(uri);
return uri.hostname();
}
module.exports = { getHostName };
puppeteer-utils.js provides functions to launch a browser, open pages, scroll to the bottom, retrieve HTML with Cheerio, add cookies, and perform login actions.
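One helper in this module, addCookies, accepts a raw "name=value; name2=value2" cookie string. Its parsing step is pure string handling and can be exercised on its own (parseCookies is an extracted illustration, not a function the project defines):

```javascript
// Parse a "name=value; name2=value2" cookie string into objects,
// mirroring the split/slice logic inside addCookies below.
function parseCookies(cookieString, domain) {
  return cookieString.split(";").map((pair) => {
    const trimmed = pair.trim();
    const eq = trimmed.indexOf("=");
    return {
      name: trimmed.slice(0, eq),
      value: trimmed.slice(eq + 1),
      domain,
    };
  });
}

console.log(parseCookies("sid=abc123; theme=dark", "juejin.cn"));
// [{ name: "sid", value: "abc123", domain: "juejin.cn" },
//  { name: "theme", value: "dark", domain: "juejin.cn" }]
```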
let puppeteer = require("puppeteer");
const cheerio = require("cheerio");
const domainUtil = require("./domain-util");
let browser;
async function getHTML(uri, isAutoScrollToBottom = true) {
let page = await openPage(uri);
if (isAutoScrollToBottom) await autoScrollToBottom(page);
let html = await page.content();
const $ = cheerio.load(html);
return $;
}
async function autoScrollToBottom(page) {
await page.evaluate(async () => {
await new Promise((resolve) => {
let totalHeight = 0;
function exec() {
totalHeight = document.body.scrollHeight;
window.scrollBy(0, totalHeight);
setTimeout(() => {
if (document.body.scrollHeight > totalHeight) exec();
else resolve();
}, 3000);
}
exec();
});
});
}
async function getPage(opts = { headless: false, devtools: true }) {
if (!browser) browser = await puppeteer.launch(opts);
let page = await browser.newPage();
return page;
}
async function openPage(uri, opts = { headless: false, devtools: true }) {
let page = await getPage(opts);
await page.goto(uri, { waitUntil: "networkidle2" });
return page;
}
async function login(url, cookies) {
let page = await getPage({ ignoreHTTPSErrors: true, headless: false, args: ["--no-sandbox", "--disable-setuid-sandbox"] });
await addCookies(page, cookies, domainUtil.getHostName(url));
await page.setViewport({ width: 1920, height: 1080 });
await page.goto(url, { timeout: 600000, waitUntil: "networkidle2" });
return page;
}
async function addCookies(page, cookies, domain) {
if (typeof cookies === "string") {
cookies = cookies.split(";").map((pair) => {
let name = pair.trim().slice(0, pair.trim().indexOf("="));
let value = pair.trim().slice(pair.trim().indexOf("=") + 1);
return { name, value, domain };
});
}
await Promise.all(cookies.map((c) => page.setCookie(c)));
}
module.exports = { getHTML, openPage, login, addCookies };
Data Readers
read/tags.js extracts tag information from Juejin.
const debug = require("debug")("juejin:task:read");
const puppeteerUtils = require("../utils/puppeteer-utils"); // utils/ holds puppeteer-utils.js per the directory structure
function get(owner, props) { return owner ? owner[props] : ""; }
exports.tags = async function (uri) {
debug("读取文章标签列表");
let $ = await puppeteerUtils.getHTML(uri);
let tags = [];
$("li.item").each((i, item) => {
let tag = $(item);
let image = tag.find("img.thumb").first();
let title = tag.find(".title").first();
let subscribe = tag.find(".subscribe").first();
let article = tag.find(".article").first();
let name = title.text().trim();
tags.push({
image: (image.data("src") || "").trim(),
name,
url: `https://juejin.im/tag/${encodeURIComponent(name)}`,
subscribe: Number(get(subscribe.text().match(/(\d+)/), 1)),
article: Number(get(article.text().match(/(\d+)/), 1))
});
debug(`读取文章标签:${name}`);
});
return tags.filter((item) => item.name);
};
read/articles.js fetches article list and details.
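A small detail of the reader below is how it derives the article id: the trailing word characters of the href. Extracted into a standalone helper (articleIdFromHref is illustrative, not part of the project), it behaves like this:

```javascript
// Pull the trailing id segment off an article href, using the same
// /(\w+)$/ pattern as read/articles.js.
function articleIdFromHref(href) {
  const match = href.trim().match(/(\w+)$/);
  return match ? match[1] : null;
}

console.log(articleIdFromHref("/post/7101234567890123456")); // "7101234567890123456"
console.log(articleIdFromHref("/post/"));                    // null
```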
const debug = require("debug")("juejin:task:read");
const puppeteerUtils = require("../utils/puppeteer-utils"); // path per the directory structure
const { readArticle } = require("./article-detail");
function removeEmoji(content) {
return (content || "").replace(/(?:[\u2700-\u27bf]|(?:\ud83c[\udde6-\uddff]){2}|[\ud800-\udbff][\udc00-\udfff]|[\u0023-\u0039]\ufe0f?\u20e3|\u3299|\u3297|\u303d|\u3030|\u24c2|\ud83c[\udd70-\udd71]|\ud83c[\udd7e-\udd7f]|\ud83c\udd8e|\ud83c[\udd91-\udd9a]|\ud83c[\ude01-\ude02]|\ud83c\ude1a|\ud83c\ude2f|\ud83c[\ude32-\ude3a]|\ud83c[\ude50-\ude51]|\u203c|\u2049|[\u25aa-\u25ab]|\u25b6|\u25c0|[\u25fb-\u25fe]|\u00a9|\u00ae|\u2122|\u2139|\ud83c\udc04|[\u2600-\u26FF]|\u2b05|\u2b06|\u2b07|\u2b1b|\u2b1c|\u2b50|\u2b55|\u231a|\u231b|\u2328|\u23cf|[\u23e9-\u23f3]|[\u23f8-\u23fa]|\ud83c\udccf|\u2934|\u2935|[\u2190-\u21ff])/g, "");
}
exports.articles = async function (uri, maxNum = 0) {
let $ = await puppeteerUtils.getHTML(uri, false);
let articleList = [];
let items = $(".item.title");
let articleNum = maxNum ? Math.min(maxNum, items.length) : items.length; // never read past the list
for (let i = 0; i < articleNum; i++) {
let article = $(items[i]);
let href = article.attr("href").trim();
let title = article.text().trim();
let id = href.match(/(\w+)$/)[1];
href = "https://juejin.im" + href;
let articleDetail = await readArticle(id, href);
articleList.push({
href,
title: removeEmoji(title),
id,
content: removeEmoji(articleDetail.content),
tags: articleDetail.tags
});
debug(`读取文章列表:${title}`);
}
return articleList;
};
read/article-detail.js extracts the article page content and tags.
const debug = require("debug")("juejin:task:read-detail");
const puppeteerUtils = require("../utils/puppeteer-utils"); // path per the directory structure
async function readArticle(id, uri) {
debug("读取博文");
let $ = await puppeteerUtils.getHTML(uri, false);
let article = $(".main-container");
let title = article.find("h1").text().trim();
let content = article.find(".article-content").html();
let metas = article.find("meta");
let tags = "";
for (let i = 0; i < metas.length; i++) {
const meta = metas[i];
if (meta.attribs && meta.attribs.itemprop === "keywords") {
tags = meta.attribs.content;
}
}
tags = tags.split(",");
debug(`读取文章详情:${title}`);
return { id, title, content, tags };
}
module.exports = { readArticle };
Data Writers
write/tags.js inserts or updates tag records.
const { query } = require("../db"); // db.js sits at the project root, one level above write/
const debug = require("debug")("juejin:task:write");
exports.tags = async function (tagList) {
debug("保存文章标签列表");
for (let tag of tagList) {
let oldTags = await query(`SELECT * FROM tags WHERE name=? LIMIT 1`, [tag.name]);
if (Array.isArray(oldTags) && oldTags.length > 0) {
let oldTag = oldTags[0];
await query(`UPDATE tags SET name=?,image=?,url=? WHERE id=?`, [tag.name, tag.image, tag.url, oldTag.id]);
} else {
await query(`INSERT INTO tags(name,image,url) VALUES(?,?,?)`, [tag.name, tag.image, tag.url]);
}
}
};
write/articles.js inserts or updates articles and sends email notifications to subscribed users.
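The writer below requires mail.js from the project root, which the article also omits. Based on the sendMail(to, subject, html) call signature and the nodemailer dependency, a plausible sketch is (the transport service, account, and credentials are placeholders, not values from the article):

```javascript
// mail.js – assumed implementation; the article does not show this file.
// Uses nodemailer from the dependency list to send HTML notification mails.
const nodemailer = require("nodemailer");

const transporter = nodemailer.createTransport({
  service: "qq",                 // placeholder: any SMTP provider works
  auth: {
    user: "example@qq.com",      // placeholder sender account
    pass: "your-smtp-auth-code", // placeholder SMTP auth code / password
  },
});

function sendMail(to, subject, html) {
  // Returns a promise; callers in write/articles.js fire and forget it.
  return transporter.sendMail({
    from: "example@qq.com",      // placeholder, usually same as auth.user
    to,
    subject,
    html,
  });
}

module.exports = sendMail;
```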
const { query } = require("../db"); // db.js and mail.js sit at the project root, one level above write/
const debug = require("debug")("juejin:task:write");
const sendMail = require("../mail");
exports.articles = async function (articleList) {
debug("写入博文列表");
for (let article of articleList) {
let oldArticles = await query(`SELECT * FROM articles WHERE id=? LIMIT 1`, [article.id]);
if (Array.isArray(oldArticles) && oldArticles.length > 0) {
let oldArticle = oldArticles[0];
await query(`UPDATE articles SET title=?,content=?,href=? WHERE id=?`, [article.title, article.content, article.href, oldArticle.id]);
} else {
await query(`INSERT INTO articles(id,title,href,content) VALUES(?,?,?,?)`, [article.id, article.title, article.href, article.content]);
}
await query(`DELETE FROM article_tag WHERE article_id=?`, [article.id]);
let tagIds = [];
if (article.tags.length > 0) {
// IN (?) lets the mysql driver expand and escape the array, instead of concatenating names into the SQL string
tagIds = await query(`SELECT id FROM tags WHERE name IN (?)`, [article.tags]);
}
for (let row of tagIds) {
await query(`INSERT INTO article_tag(article_id,tag_id) VALUES(?,?)`, [article.id, row.id]);
}
let emails = [];
if (tagIds.length > 0) {
let tagIDsString = tagIds.map((item) => item.id).join(",");
let emailSQL = `SELECT DISTINCT users.email FROM user_tag INNER JOIN users ON user_tag.user_id = users.id WHERE tag_id IN (${tagIDsString})`;
emails = await query(emailSQL);
}
for (let emailInfo of emails) {
sendMail(
emailInfo.email,
`您订阅的文章更新了`,
`<a href="http://localhost:8082/detail/${article.id}">${article.title}</a>`
);
}
}
};
Step 3: Display Data in a Web Application
Build the front‑end UI with EJS templates and Bootstrap. The views include index.html (home page with tag list and article list), login.html, header.html (navigation bar), detail.html (article detail), and subscribe.html (tag subscription).
Key Templates
index.html
<%- include('header.html') %>
<div class="container">
<div class="row">
<div class="col-md-2">
<ul class="list-group">
<% tags.forEach(tag=>{ %>
<li class="list-group-item text-center">
<a href="/?tagId=<%=tag.id%>">
<img style="width:25px;height:25px" src="<%=tag.image%>" />
<%=tag.name%>
</a>
</li>
<% }) %>
</ul>
</div>
<div class="col-md-10">
<ul class="list-group">
<% articles.forEach(article=>{ %>
<li class="list-group-item">
<a href="/detail/<%=article.id%>"> <%=article.title%> </a>
</li>
<% }) %>
</ul>
</div>
</div>
</div>
login.html
<%- include('header.html') %>
<div class="row">
<div class="col-md-4 col-md-offset-4">
<form method="POST">
<input type="email" name="email" class="form-control" placeholder="请输入邮箱进行登录" />
<input type="password" name="password" class="form-control" placeholder="密码" />
<button type="submit" class="btn btn-default">提交</button>
</form>
</div>
</div>
header.html
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/3.4.1/css/bootstrap.min.css" crossorigin="anonymous">
<title>博客列表</title>
</head>
<body>
<nav class="navbar navbar-default">
... navigation with conditional login/subscribe links ...
</nav>
</body>
detail.html
<%- include('header.html') %>
<div class="container">
<div class="row">
<div class="col-md-12">
<div class="panel">
<div class="panel-heading">
<h1 class="text-center"><%- article.title %></h1>
</div>
<div class="panel-body"><%- article.content %></div>
</div>
</div>
</div>
</div>
subscribe.html
<%- include('header.html') %>
<style>
.tag { display:flex; flex-direction:column; justify-content:center; align-items:center; }
.tag img { width:50px; margin-bottom:20px; }
</style>
<div class="row">
<form method="POST">
<input type="submit" class="btn btn-primary" />
<% for(let i=0;i<tags.length;i++) { let tag=tags[i]; %>
<div class="col-md-3 tag">
<img src="<%=tag.image%>" />
<p><%=tag.name%></p>
<p><%=tag.subscribe%> 关注 <%=tag.article%> 文章</p>
<div class="checkbox">
<label>
<input <%=tag.subscribe? "checked" : "" %> type="checkbox" name="tags" value="<%=tag.id%>"> 关注
</label>
</div>
</div>
<% } %>
</form>
</div>
WeDoctor Frontend Technology
The official WeDoctor Group front-end public account, sharing original tech articles, events, job postings, and day-to-day updates from the tech team.