How to Build a Full‑Stack Web Scraper with Puppeteer, Node & MySQL
This tutorial walks through creating a headless‑browser crawler using Puppeteer, extracting data from Juejin, storing it in a MySQL database, and exposing the results through an Express web service with user authentication, subscription, and a simple front‑end UI.
Background
The author wants to automate daily report submission, which normally means opening the weekly report system, logging in, and copying and pasting data. To avoid these repetitive manual steps, this article explores a front-end crawler built on Puppeteer, using Juejin as a practice target.
Goal
Knowledge Points
Front‑end crawling basics: traditional back‑end crawlers rely on APIs, but pages rendered by the browser need a headless browser to fetch data.
Efficiency tools: simulating human actions to improve productivity.
Practical Output
Login authentication
Data persistence: scraped data is saved into a MySQL database via Navicat remote connection.
Subscription updates: when a subscribed tag receives new articles, an email notification is sent.
Project Pre‑setup
Dependencies (package.json)
{
"name": "crawl",
"version": "1.0.0",
"description": "",
"main": "index.js",
"scripts": { "test": "echo \"Error: no test specified\" && exit 1" },
"keywords": [],
"author": "",
"license": "ISC",
"dependencies": {
"axios": "^0.27.2",
"bluebird": "^3.7.2",
"body-parser": "^1.20.0",
"chalk": "^5.0.1",
"cheerio": "^1.0.0-rc.11",
"child_process": "^1.0.2",
"cron": "^2.0.0",
"ejs": "^3.1.8",
"express": "^4.18.1",
"express-session": "^1.17.3",
"iconv-lite": "^0.6.3",
"mysql": "^2.18.1",
"nodemailer": "^6.7.5",
"puppeteer": "^14.1.1",
"request": "^2.88.2",
"request-promise": "^4.2.6",
"urijs": "^1.19.11"
}
}
Directory Structure
.
├── 1. puppertee
│ ├── 1.js
│ ├── 2.js
│ ├── 3.js
│ ├── 4.js
│ ├── 5. 爬取京东.js
│ ├── baidu.png
│ └── items-0.png
├── 2. request
│ ├── 1.request-json.js
│ ├── 2.request-form.js
│ ├── 3.request-file.js
│ └── avatar.jpeg
├── 3. cheerio
│ ├── 1.cheerio.js
│ ├── 2.cheerio-selector.js
│ ├── 3.cheerio-attr.js
│ ├── 4.cheerio-props.js
│ └── 5.cheerio-find.js
├── 4. dependens
│ ├── 1. cron.js
│ ├── 2. error.js
│ ├── 3.debug.js
│ ├── 4. pm2.js
│ ├── 5. iconv-lite.js
│ ├── 6.mail.js
│ ├── 7.read.js
│ └── my-debug.js
├── bdyp.js
├── crawl-server
│ ├── app.js
│ ├── bin
│ │ └── www
│ ├── package-lock.json
│ ├── package.json
│ ├── public
│ │ ├── images
│ │ ├── javascripts
│ │ └── stylesheets
│ ├── routes
│ │ ├── index.js
│ │ └── users.js
│ └── uploads
│ ├── 50d33a30f74fd55ffc0f3c0aaea989b6
│ └── f24715d08bab6243f62bbe9f16a52d05
├── crawl.sql
├── db.js
├── mail.js
├── main.js
├── package-lock.json
├── package.json
├── read
│ ├── article-detail.js
│ ├── articles.js
│ ├── index.js
│ ├── tags.js
│ └── text.html
├── readme.md
├── utils
│ ├── domain-util.js
│ └── puppeteer-utils.js
├── web
│ ├── middleware
│ │ └── auth.js
│ ├── public
│ │ └── css
│ ├── router
│ │ └── bdyp.js
│ ├── server.js
│ ├── update
│ └── views
│ ├── detail.html
│ ├── footer.html
│ ├── header.html
│ ├── index.html
│ ├── login.html
│ └── subscribe.html
└── write
├── articles.js
├── index.js
└── tags.js
Step 1: Crawl Basics
Expected Output
Traditional crawler workflow: use the request package to fetch Juejin front‑end tag page HTML and extract article titles with a regular expression.
Front‑end crawler problem: pages rendered by JavaScript cannot be scraped with simple HTTP requests.
Front‑end crawler solution: use Puppeteer to control a headless Chrome instance.
Traditional Crawler Using Request
Goal
Fetch all article titles from Juejin’s front‑end tag page and save them to titles.txt.
Implementation
let request = require("request");
let url = "https://juejin.cn/tag/%E5%89%8D%E7%AB%AF";
let fs = require("fs");
let regexp = /class="title" data-v-\w+>(.+?)<\/a>/g;
request(url, (err, response, body) => {
if (err) throw err; // bail out on network errors before touching body
let titles = [];
body.replace(regexp, (matched, title) => {
titles.push(title);
});
console.log(titles);
fs.writeFileSync("titles.txt", titles.join("\n")); // join, otherwise the array is coerced to one comma-separated line
});
Result
Running node 1.js prints the list of titles and creates titles.txt with the extracted titles.
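The replace-as-accumulator trick above works, but the same extraction reads more directly with String.prototype.matchAll. The snippet below runs against a made-up inline sample rather than Juejin's live markup (the data-v attribute value is invented):

```javascript
// Collect every capture group of the title regex with matchAll.
// The HTML string here is an illustrative sample, not Juejin's real page.
const html =
  '<a class="title" data-v-abc123>First post</a>' +
  '<a class="title" data-v-abc123>Second post</a>';
const regexp = /class="title" data-v-\w+>(.+?)<\/a>/g;

const titles = [...html.matchAll(regexp)].map((m) => m[1]);
console.log(titles); // ["First post", "Second post"]
```

Either way, regex scraping only survives as long as the page keeps rendering its titles server-side, which is exactly the limitation the next section addresses.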
Front‑end Crawler Using Puppeteer – Screenshot Baidu
Goal
Open Baidu homepage with Puppeteer and capture a screenshot saved as baidu.png.
Implementation
let puppeteer = require("puppeteer");
(async () => {
const browser = await puppeteer.launch();
let page = await browser.newPage();
await page.goto("http://baidu.com");
await page.screenshot({ path: "baidu.png" });
await browser.close();
})();
Front‑end Crawler Using Puppeteer – Crawl JD Mobile List
Goal
Search for “手机” on JD.com and retrieve the list of product links.
Implementation
const puppeteer = require("puppeteer");
(async function () {
const browser = await puppeteer.launch({ headless: false });
let page = await browser.newPage();
await page.setJavaScriptEnabled(true);
await page.goto("https://www.jd.com/");
const searchInput = await page.$("#key");
await searchInput.focus();
await page.keyboard.type("手机");
const searchBtn = await page.$(".button");
await searchBtn.click();
await page.waitForSelector(".gl-item");
const links = await page.$$eval(
".gl-item > .gl-i-wrap > .p-img > a",
(links) => {
return links.map((a) => ({ href: a.href.trim(), title: a.title }));
}
);
console.log(links);
await browser.close();
})();
Step 2: Crawl Juejin Data
Implement a Node service that fetches Juejin tags, articles, and article details, then stores them in MySQL.
Database Service
Use MySQL with Navicat for remote connection. Create a database named crawl-db and import the provided .sql schema.
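The article does not reproduce crawl.sql; judging from the queries in the readers and writers below, the schema covers five tables roughly as follows (the table and column names come from the SQL in this article, but the types and lengths are assumptions):

```sql
-- Hypothetical reconstruction of crawl.sql, inferred from the queries below.
-- Column types and lengths are assumptions, not the article's actual schema.
CREATE TABLE tags (
  id INT PRIMARY KEY AUTO_INCREMENT,
  name VARCHAR(64),
  image VARCHAR(255),
  url VARCHAR(255)
);

CREATE TABLE articles (
  id VARCHAR(32) PRIMARY KEY,  -- Juejin article id taken from the URL
  title VARCHAR(255),
  href VARCHAR(255),
  content LONGTEXT
);

CREATE TABLE article_tag (
  article_id VARCHAR(32),
  tag_id INT
);

CREATE TABLE users (
  id INT PRIMARY KEY AUTO_INCREMENT,
  email VARCHAR(128),
  password VARCHAR(128)
);

CREATE TABLE user_tag (
  user_id INT,
  tag_id INT
);
```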
Node Service (Express)
Provide /tag endpoint to fetch all Juejin tags and write them to the tags table.
Provide /article endpoint to fetch articles for a given tag, deduplicate, and write to articles, article_detail, and article_tag tables.
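Every route and writer below calls query from db.js, which the article never lists. A minimal sketch consistent with the await query(sql, params) usage, built on the mysql package from the dependency list, might look like this (the connection settings are placeholders and it requires a running MySQL server):

```javascript
// db.js – assumed implementation; the article does not show this file.
// Wraps mysql's callback API in a promise so callers can `await query(sql, params)`.
const mysql = require("mysql");

const pool = mysql.createPool({
  host: "localhost",     // placeholder: your MySQL host
  user: "root",          // placeholder
  password: "password",  // placeholder
  database: "crawl-db",  // database name from this article
  connectionLimit: 10,
});

function query(sql, params = []) {
  return new Promise((resolve, reject) => {
    pool.query(sql, params, (err, results) => {
      if (err) return reject(err);
      resolve(results);
    });
  });
}

module.exports = { query };
```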
Server Entry (web/server.js)
let express = require("express");
let bodyParser = require("body-parser");
let session = require("express-session");
let { checkLogin } = require("./middleware/auth");
const path = require("path");
const { query } = require("../db");
const CronJob = require("cron").CronJob;
const debug = require("debug")("crawl:server");
const { spawn } = require("child_process");
let app = express();
app.use(express.static("web/public"));
app.use(bodyParser.urlencoded({ extended: true }));
app.use(bodyParser.json());
app.use(session({ resave: true, saveUninitialized: true, secret: "wzyan" }));
app.use((req, res, next) => { res.locals.user = req.session.user; next(); });
app.set("view engine", "html");
app.set("views", path.resolve("web/views"));
app.engine("html", require("ejs").__express);
app.get("/", async function (req, res) {
let { tagId } = req.query;
let tags = await query(`SELECT * FROM tags`);
tagId = tagId || tags[0].id;
let articles = await query(
`SELECT a.* FROM articles a INNER JOIN article_tag t ON a.id = t.article_id WHERE t.tag_id = ?`,
[tagId]
);
res.render("index", { tags, articles });
});
app.get("/login", async function (req, res) {
res.render("login", { title: "登录" });
});
app.post("/login", async function (req, res) {
let { email, password } = req.body;
let oldUsers = await query(`SELECT * FROM users WHERE email=?`, [email]);
let user;
if (Array.isArray(oldUsers) && oldUsers.length > 0) {
user = oldUsers[0];
} else {
let result = await query(`INSERT INTO users(email,password) VALUES(?,?)`, [email, password]);
user = { id: result.insertId, email, password };
}
req.session.user = user;
res.redirect("/");
});
app.get("/subscribe", checkLogin, async function (req, res) {
let tags = await query(`SELECT * FROM tags`);
let user = req.session.user;
let selectedTags = await query(`SELECT tag_id FROM user_tag WHERE user_id = ?`, [user.id]);
let selectTagIds = selectedTags.map((item) => item["tag_id"]);
tags.forEach((item) => { item.subscribe = selectTagIds.includes(item.id); });
res.render("subscribe", { title: "请订阅你感兴趣的标签", tags });
});
app.post("/subscribe", checkLogin, async function (req, res) {
let { tags } = req.body;
if (!tags) tags = [];
if (typeof tags === "string") tags = [tags];
const getNum = (str) => str.replace(/[^0-9]/gi, "");
tags = tags.map(getNum);
let user = req.session.user;
await query(`DELETE FROM user_tag WHERE user_id=?`, [user.id]);
for (let i = 0; i < tags.length; i++) {
await query(`INSERT INTO user_tag(user_id,tag_id) VALUES(?,?)`, [user.id, parseInt(tags[i])]);
}
res.redirect("/");
});
app.get("/detail/:id", async function (req, res) {
let id = req.params.id;
let articles = await query(`SELECT * FROM articles WHERE id=?`, [id]);
res.render("detail", { article: articles[0] });
});
app.listen(8082);
process.on("uncaughtException", function (err) {
console.error("uncaughtException: %s", err.stack);
});
Utility Modules
domain-util.js extracts hostname from a URL.
const URI = require("urijs");
function getHostName(uri) {
uri = new URI(uri);
return uri.hostname();
}
module.exports = { getHostName };
puppeteer-utils.js provides functions to launch a browser, open pages, scroll to the bottom, retrieve HTML with Cheerio, add cookies, and perform login actions.
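One helper in this module, addCookies, accepts a raw "name=value; name2=value2" cookie string. Its parsing step is pure string handling and can be exercised on its own (parseCookies is an extracted illustration, not a function the project defines):

```javascript
// Parse a "name=value; name2=value2" cookie string into objects,
// mirroring the split/slice logic inside addCookies below.
function parseCookies(cookieString, domain) {
  return cookieString.split(";").map((pair) => {
    const trimmed = pair.trim();
    const eq = trimmed.indexOf("=");
    return {
      name: trimmed.slice(0, eq),
      value: trimmed.slice(eq + 1),
      domain,
    };
  });
}

console.log(parseCookies("sid=abc123; theme=dark", "juejin.cn"));
// [{ name: "sid", value: "abc123", domain: "juejin.cn" },
//  { name: "theme", value: "dark", domain: "juejin.cn" }]
```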
let puppeteer = require("puppeteer");
const cheerio = require("cheerio");
const domainUtil = require("./domain-util");
let browser;
async function getHTML(uri, isAutoScrollToBottom = true) {
let page = await openPage(uri);
if (isAutoScrollToBottom) await autoScrollToBottom(page);
let html = await page.content();
const $ = cheerio.load(html);
return $;
}
async function autoScrollToBottom(page) {
await page.evaluate(async () => {
await new Promise((resolve) => {
let totalHeight = 0;
function exec() {
totalHeight = document.body.scrollHeight;
window.scrollBy(0, totalHeight);
setTimeout(() => {
if (document.body.scrollHeight > totalHeight) exec();
else resolve();
}, 3000);
}
exec();
});
});
}
async function getPage(opts = { headless: false, devtools: true }) {
if (!browser) browser = await puppeteer.launch(opts);
let page = await browser.newPage();
return page;
}
async function openPage(uri, opts = { headless: false, devtools: true }) {
let page = await getPage(opts);
await page.goto(uri, { waitUntil: "networkidle2" });
return page;
}
async function login(url, cookies) {
let page = await getPage({ ignoreHTTPSErrors: true, headless: false, args: ["--no-sandbox", "--disable-setuid-sandbox"] });
await addCookies(page, cookies, domainUtil.getHostName(url));
await page.setViewport({ width: 1920, height: 1080 });
await page.goto(url, { timeout: 600000, waitUntil: "networkidle2" });
return page;
}
async function addCookies(page, cookies, domain) {
if (typeof cookies === "string") {
cookies = cookies.split(";").map((pair) => {
let name = pair.trim().slice(0, pair.trim().indexOf("="));
let value = pair.trim().slice(pair.trim().indexOf("=") + 1);
return { name, value, domain };
});
}
await Promise.all(cookies.map((c) => page.setCookie(c)));
}
module.exports = { getHTML, openPage, login, addCookies };
Data Readers
read/tags.js extracts tag information from Juejin.
const debug = require("debug")("juejin:task:read");
const puppeteerUtils = require("../utils/puppeteer-utils"); // utils/ holds puppeteer-utils.js per the directory structure
function get(owner, props) { return owner ? owner[props] : ""; }
exports.tags = async function (uri) {
debug("读取文章标签列表");
let $ = await puppeteerUtils.getHTML(uri);
let tags = [];
$("li.item").each((i, item) => {
let tag = $(item);
let image = tag.find("img.thumb").first();
let title = tag.find(".title").first();
let subscribe = tag.find(".subscribe").first();
let article = tag.find(".article").first();
let name = title.text().trim();
tags.push({
image: (image.data("src") || "").trim(),
name,
url: `https://juejin.im/tag/${encodeURIComponent(name)}`,
subscribe: Number(get(subscribe.text().match(/(\d+)/), 1)),
article: Number(get(article.text().match(/(\d+)/), 1))
});
debug(`读取文章标签:${name}`);
});
return tags.filter((item) => item.name);
};
read/articles.js fetches article list and details.
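A small detail of the reader below is how it derives the article id: the trailing word characters of the href. Extracted into a standalone helper (articleIdFromHref is illustrative, not part of the project), it behaves like this:

```javascript
// Pull the trailing id segment off an article href, using the same
// /(\w+)$/ pattern as read/articles.js.
function articleIdFromHref(href) {
  const match = href.trim().match(/(\w+)$/);
  return match ? match[1] : null;
}

console.log(articleIdFromHref("/post/7101234567890123456")); // "7101234567890123456"
console.log(articleIdFromHref("/post/"));                    // null
```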
const debug = require("debug")("juejin:task:read");
const puppeteerUtils = require("../utils/puppeteer-utils"); // path per the directory structure
const { readArticle } = require("./article-detail");
function removeEmoji(content) {
return (content || "").replace(/(?:[\u2700-\u27bf]|(?:\ud83c[\udde6-\uddff]){2}|[\ud800-\udbff][\udc00-\udfff]|[\u0023-\u0039]\ufe0f?\u20e3|\u3299|\u3297|\u303d|\u3030|\u24c2|\ud83c[\udd70-\udd71]|\ud83c[\udd7e-\udd7f]|\ud83c\udd8e|\ud83c[\udd91-\udd9a]|\ud83c[\ude01-\ude02]|\ud83c\ude1a|\ud83c\ude2f|\ud83c[\ude32-\ude3a]|\ud83c[\ude50-\ude51]|\u203c|\u2049|[\u25aa-\u25ab]|\u25b6|\u25c0|[\u25fb-\u25fe]|\u00a9|\u00ae|\u2122|\u2139|\ud83c\udc04|[\u2600-\u26FF]|\u2b05|\u2b06|\u2b07|\u2b1b|\u2b1c|\u2b50|\u2b55|\u231a|\u231b|\u2328|\u23cf|[\u23e9-\u23f3]|[\u23f8-\u23fa]|\ud83c\udccf|\u2934|\u2935|[\u2190-\u21ff])/g, "");
}
exports.articles = async function (uri, maxNum = 0) {
let $ = await puppeteerUtils.getHTML(uri, false);
let articleList = [];
let items = $(".item.title");
let articleNum = maxNum ? Math.min(maxNum, items.length) : items.length; // never read past the list
for (let i = 0; i < articleNum; i++) {
let article = $(items[i]);
let href = article.attr("href").trim();
let title = article.text().trim();
let id = href.match(/(\w+)$/)[1];
href = "https://juejin.im" + href;
let articleDetail = await readArticle(id, href);
articleList.push({
href,
title: removeEmoji(title),
id,
content: removeEmoji(articleDetail.content),
tags: articleDetail.tags
});
debug(`读取文章列表:${title}`);
}
return articleList;
};
read/article-detail.js extracts the article page content and tags.
const debug = require("debug")("juejin:task:read-detail");
const puppeteerUtils = require("../utils/puppeteer-utils"); // path per the directory structure
async function readArticle(id, uri) {
debug("读取博文");
let $ = await puppeteerUtils.getHTML(uri, false);
let article = $(".main-container");
let title = article.find("h1").text().trim();
let content = article.find(".article-content").html();
let metas = article.find("meta");
let tags = "";
for (let i = 0; i < metas.length; i++) {
const meta = metas[i];
if (meta.attribs && meta.attribs.itemprop === "keywords") {
tags = meta.attribs.content;
}
}
tags = tags.split(",");
debug(`读取文章详情:${title}`);
return { id, title, content, tags };
}
module.exports = { readArticle };
Data Writers
write/tags.js inserts or updates tag records.
const { query } = require("../db"); // db.js sits at the project root, one level above write/
const debug = require("debug")("juejin:task:write");
exports.tags = async function (tagList) {
debug("保存文章标签列表");
for (let tag of tagList) {
let oldTags = await query(`SELECT * FROM tags WHERE name=? LIMIT 1`, [tag.name]);
if (Array.isArray(oldTags) && oldTags.length > 0) {
let oldTag = oldTags[0];
await query(`UPDATE tags SET name=?,image=?,url=? WHERE id=?`, [tag.name, tag.image, tag.url, oldTag.id]);
} else {
await query(`INSERT INTO tags(name,image,url) VALUES(?,?,?)`, [tag.name, tag.image, tag.url]);
}
}
};
write/articles.js inserts or updates articles and sends email notifications to subscribed users.
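The writer below requires mail.js from the project root, which the article also omits. Based on the sendMail(to, subject, html) call signature and the nodemailer dependency, a plausible sketch is (the transport service, account, and credentials are placeholders, not values from the article):

```javascript
// mail.js – assumed implementation; the article does not show this file.
// Uses nodemailer from the dependency list to send HTML notification mails.
const nodemailer = require("nodemailer");

const transporter = nodemailer.createTransport({
  service: "qq",                 // placeholder: any SMTP provider works
  auth: {
    user: "example@qq.com",      // placeholder sender account
    pass: "your-smtp-auth-code", // placeholder SMTP auth code / password
  },
});

function sendMail(to, subject, html) {
  // Returns a promise; callers in write/articles.js fire and forget it.
  return transporter.sendMail({
    from: "example@qq.com",      // placeholder, usually same as auth.user
    to,
    subject,
    html,
  });
}

module.exports = sendMail;
```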
const { query } = require("../db"); // db.js and mail.js sit at the project root, one level above write/
const debug = require("debug")("juejin:task:write");
const sendMail = require("../mail");
exports.articles = async function (articleList) {
debug("写入博文列表");
for (let article of articleList) {
let oldArticles = await query(`SELECT * FROM articles WHERE id=? LIMIT 1`, [article.id]);
if (Array.isArray(oldArticles) && oldArticles.length > 0) {
let oldArticle = oldArticles[0];
await query(`UPDATE articles SET title=?,content=?,href=? WHERE id=?`, [article.title, article.content, article.href, oldArticle.id]);
} else {
await query(`INSERT INTO articles(id,title,href,content) VALUES(?,?,?,?)`, [article.id, article.title, article.href, article.content]);
}
await query(`DELETE FROM article_tag WHERE article_id=?`, [article.id]);
let tagIds = [];
if (article.tags.length > 0) {
// IN (?) lets the mysql driver expand and escape the array, instead of concatenating names into the SQL string
tagIds = await query(`SELECT id FROM tags WHERE name IN (?)`, [article.tags]);
}
for (let row of tagIds) {
await query(`INSERT INTO article_tag(article_id,tag_id) VALUES(?,?)`, [article.id, row.id]);
}
let emails = [];
if (tagIds.length > 0) {
let tagIDsString = tagIds.map((item) => item.id).join(",");
let emailSQL = `SELECT DISTINCT users.email FROM user_tag INNER JOIN users ON user_tag.user_id = users.id WHERE tag_id IN (${tagIDsString})`;
emails = await query(emailSQL);
}
for (let emailInfo of emails) {
sendMail(
emailInfo.email,
`您订阅的文章更新了`,
`<a href="http://localhost:8082/detail/${article.id}">${article.title}</a>`
);
}
}
};
Step 3: Display Data in a Web Application
Build the front‑end UI with EJS templates and Bootstrap. The views include index.html (home page with tag list and article list), login.html, header.html (navigation bar), detail.html (article detail), and subscribe.html (tag subscription).
Key Templates
index.html
<%- include('header.html') %>
<div class="container">
<div class="row">
<div class="col-md-2">
<ul class="list-group">
<% tags.forEach(tag=>{ %>
<li class="list-group-item text-center">
<a href="/?tagId=<%=tag.id%>">
<img style="width:25px;height:25px" src="<%=tag.image%>" />
<%=tag.name%>
</a>
</li>
<% }) %>
</ul>
</div>
<div class="col-md-10">
<ul class="list-group">
<% articles.forEach(article=>{ %>
<li class="list-group-item">
<a href="/detail/<%=article.id%>"> <%=article.title%> </a>
</li>
<% }) %>
</ul>
</div>
</div>
</div>
login.html
<%- include('header.html') %>
<div class="row">
<div class="col-md-4 col-md-offset-4">
<form method="POST">
<input type="email" name="email" class="form-control" placeholder="请输入邮箱进行登录" />
<input type="password" name="password" class="form-control" placeholder="密码" />
<button type="submit" class="btn btn-default">提交</button>
</form>
</div>
</div>
header.html
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/3.4.1/css/bootstrap.min.css" crossorigin="anonymous">
<title>博客列表</title>
</head>
<body>
<nav class="navbar navbar-default">
... navigation with conditional login/subscribe links ...
</nav>
</body>
detail.html
<%- include('header.html') %>
<div class="container">
<div class="row">
<div class="col-md-12">
<div class="panel">
<div class="panel-heading">
<h1 class="text-center"><%- article.title %></h1>
</div>
<div class="panel-body"><%- article.content %></div>
</div>
</div>
</div>
</div>
subscribe.html
<%- include('header.html') %>
<style>
.tag { display:flex; flex-direction:column; justify-content:center; align-items:center; }
.tag img { width:50px; margin-bottom:20px; }
</style>
<div class="row">
<form method="POST">
<input type="submit" class="btn btn-primary" />
<% for(let i=0;i<tags.length;i++) { let tag=tags[i]; %>
<div class="col-md-3 tag">
<img src="<%=tag.image%>" />
<p><%=tag.name%></p>
<p><%=tag.subscribe%> 关注 <%=tag.article%> 文章</p>
<div class="checkbox">
<label>
<input <%=tag.subscribe? "checked" : "" %> type="checkbox" name="tags" value="<%=tag.id%>"> 关注
</label>
</div>
</div>
<% } %>
</form>
</div>
WeDoctor Frontend Technology
The official WeDoctor Group front-end public account, sharing original tech articles, events, job postings, and day-to-day updates from the tech team.