
How to Automatically Extract Publication Dates from WeChat Articles with Groovy

The article explains how the author built a Groovy‑based scraper that reads a Markdown list of WeChat links, fetches each article’s HTML, extracts the hidden publication timestamp with a regex, and rewrites the Markdown file to include the dates, using simple HTTP calls and a brief pause to avoid anti‑scraping measures.

FunTester

Feb 8, 2022


The author realized that a collection of FunTester articles listed in a Markdown file lacked publication dates, which made the list less useful to readers. To enrich the list, they created a small Groovy scraper that retrieves the exact publish time for each WeChat Official Account article.

Finding the Date in the Page

By inspecting the HTML of a WeChat article, they discovered that the page contains a single date string formatted like 2021-08-15 12:34. This makes extraction straightforward.
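As a minimal sketch of that idea, the snippet below pulls the first timestamp of this shape out of a string using only the JDK's regex classes. The helper name firstDate and the sample fragment are assumptions made for illustration; real WeChat markup may differ. The character class is written as [12] so that it accepts only the digits 1 and 2.

```groovy
import java.util.regex.Pattern

// Matches timestamps such as "2021-08-15 12:34" for years 2010-2029.
def DATE_PATTERN = Pattern.compile(/20[12]\d-\d{2}-\d{2} \d{2}:\d{2}/)

// Returns the first timestamp found in the text, or null when there is none.
// (firstDate is a hypothetical helper, not part of the original article.)
def firstDate(String html, Pattern pattern) {
    def matcher = pattern.matcher(html)
    matcher.find() ? matcher.group() : null
}

// Made-up page fragment used only for illustration.
def sample = '<span>2021-08-15 12:34</span>'
println firstDate(sample, DATE_PATTERN)   // prints 2021-08-15 12:34
```

Returning null for a page without a match lets the caller decide whether to skip the article or retry it.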

Core Extraction Method

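static def test(String url) {
    // Article identifier: the last path segment of the URL.
    def key = url.substring(url.lastIndexOf('/') + 1)
    def get = getHttpGet(url)
    def response = getHttpResponse(get)
    // The page body is read from the "content" field of the response.
    def res = response.getString("content")
    // [12] matches the decade digit, so years 2010-2029 are covered.
    def all = Regex.regexAll(res, "20[12]\\d-\\d{2}-\\d{2} \\d{2}:\\d{2}")
    // The first match is taken as the publication time.
    def s = all[0]
    output(key + PART + s)
}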

The method extracts the article identifier from the URL, performs an HTTP GET, reads the response body, applies a regular expression to locate the date, and outputs a string that combines the identifier with the found timestamp.
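The same steps can be sketched without the author's helper functions, relying only on the JDK and the Groovy GDK. This is an illustration under stated assumptions rather than the author's implementation: the class name DateScraper, the '|' separator standing in for PART (whose value the article does not show), and the example URL are all hypothetical.

```groovy
import java.util.regex.Pattern

class DateScraper {
    // Hypothetical separator standing in for the PART constant.
    static final String SEPARATOR = '|'
    static final Pattern DATE = Pattern.compile(/20[12]\d-\d{2}-\d{2} \d{2}:\d{2}/)

    // Last path segment of the URL, used as the article identifier.
    static String articleKey(String url) {
        url.substring(url.lastIndexOf('/') + 1)
    }

    // Builds "key|date" from an already downloaded page; null if no date is found.
    static String record(String url, String html) {
        def matcher = DATE.matcher(html)
        matcher.find() ? articleKey(url) + SEPARATOR + matcher.group() : null
    }

    // Downloads the page body using the GDK's URL.getText.
    static String fetch(String url) {
        url.toURL().getText('UTF-8')
    }
}

// Offline usage with a made-up URL and page body:
println DateScraper.record('https://mp.weixin.qq.com/s/abc123', '<span>2021-08-15 12:34</span>')
// prints abc123|2021-08-15 12:34
```

Keeping the download in its own method means the parsing logic can be exercised on saved HTML without touching the network.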

Processing the Markdown List

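static def spider() {
    String path = "/Users/oker/IdeaProjects/funtester/document/directory.markdown"
    def line = RWUtil.readByLine(path)
    line.each {
        // Only list items of the form "- [title](url)" that point at WeChat are processed.
        if (it.startsWith("- [") && it.contains("weixin.qq")) {
            // Take the text between "](" and the closing ")" that ends the line.
            String url = it.substring(it.lastIndexOf("]") + 2, it.length() - 1)
            test(url)
            // Pause three seconds between requests.
            sleep(3.0)
        }
    }
}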

This script reads the Markdown file line by line, identifies lines that contain WeChat article links, calls test(url) for each, and pauses three seconds between requests to avoid triggering anti‑scraping rules.
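The link-detection step can also be expressed with a single regular expression, which avoids index arithmetic. The sketch below assumes each entry has the form "- [title](url)"; the helper name linkUrl and the sample lines are hypothetical.

```groovy
// Extracts the URL from a Markdown list item such as "- [title](url)".
// Returns null for lines that are not WeChat link items.
String linkUrl(String line) {
    def matcher = line =~ /- \[.*\]\((\S*weixin\.qq\S*)\)\s*/
    // matches() requires the whole line to fit the pattern.
    matcher.matches() ? matcher.group(1) : null
}

// Made-up directory content used only for illustration.
def lines = [
    '# FunTester articles',
    '- [Title](https://mp.weixin.qq.com/s/abc123)',
    '- [Other](https://example.com/x)'
]
def urls = lines.collect { linkUrl(it) }.findAll { it != null }
println urls   // prints [https://mp.weixin.qq.com/s/abc123]
```

In a real run, each URL in the resulting list would be handed to the extraction method, with a pause between requests as in the loop above.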

Rewriting the Markdown with Dates

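public static void main(String[] args) {
    // Read the saved scrape results and parse them into a key-to-date lookup.
    def string = RWUtil.readByString(getLongFile("wx"))
    def info = parse(string)
    String path = "/Users/oker/IdeaProjects/funtester/document/directory.markdown"
    def line = RWUtil.readByLine(path)
    line.each {
        if (it.startsWith("- [") && it.contains("weixin.qq")) {
            // Take the text between "](" and the closing ")" that ends the line.
            String url = it.substring(it.lastIndexOf("]") + 2, it.length() - 1)
            def key = url.substring(url.lastIndexOf("/") + 1)
            // "发表于" means "published on".
            output("$it $TAB 发表于${info.get(key)}")
        } else {
            // Lines that are not WeChat links are passed through.
            output(LINE + it + LINE)
        }
    }
}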

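The scraped dates are parsed into a com.alibaba.fastjson.JSONObject keyed by article identifier. The script then reads the original Markdown again and, for each link line, outputs the same line followed by the extracted date (e.g., "发表于2021-08-15 12:34", where 发表于 means "published on"); all other lines are passed through. The code shown emits the result through output() rather than writing to the file directly.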
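The article does not show the parse helper, so the following is only a guess at its role: it turns logged "key|date" lines into a map and then annotates matching link lines. The function names, the '|' separator, and the English "published on" label are assumptions made for this sketch.

```groovy
// Turns logged "key|date" lines into a lookup map
// (a stand-in for the parse helper that the article does not show).
Map<String, String> parseRecords(String log, String separator) {
    def records = log.readLines().findAll { it.contains(separator) }
    records.collectEntries { line ->
        def idx = line.indexOf(separator)
        [(line.substring(0, idx).trim()): line.substring(idx + separator.length()).trim()]
    }
}

// Appends the date to WeChat link lines and returns every other line unchanged.
String annotate(String line, Map<String, String> dates) {
    def isLink = line.startsWith('- [') && line.contains('weixin.qq') && line.endsWith(')')
    if (!isLink) return line
    def url = line.substring(line.lastIndexOf(']') + 2, line.length() - 1)
    def date = dates[url.substring(url.lastIndexOf('/') + 1)]
    // A link without a scraped date is left as it was.
    date ? "$line (published on ${date})" : line
}

// Made-up record and link line used only for illustration.
def dates = parseRecords('abc123|2021-08-15 12:34', '|')
println annotate('- [Title](https://mp.weixin.qq.com/s/abc123)', dates)
// prints - [Title](https://mp.weixin.qq.com/s/abc123) (published on 2021-08-15 12:34)
```

Leaving unmatched links untouched avoids printing "null" next to articles whose date could not be scraped.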

Overall, the solution demonstrates a lightweight, one‑time scraping approach without persisting data in a database; results are logged directly, and a three‑second delay between requests reduces the risk of triggering WeChat’s anti‑scraping mechanisms.
