How to Automatically Extract Publication Dates from WeChat Articles with Groovy
The article explains how the author built a Groovy‑based scraper that reads a Markdown list of WeChat links, fetches each article’s HTML, extracts the hidden publication timestamp with a regex, and rewrites the Markdown file to include the dates, using simple HTTP calls and a brief pause to avoid anti‑scraping measures.
The author realized that a collection of FunTester articles stored in a Markdown file lacked publication dates, which made the reading experience suboptimal. To enrich the list, they created a small Groovy scraper that retrieves the exact publish time for each WeChat public article.
Finding the Date in the Page
By inspecting the HTML of a WeChat article, they discovered that the page contains a single date string formatted like 2021-08-15 12:34. This makes extraction straightforward.
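As a minimal sketch of that idea, the snippet below uses only the JDK regex classes that Groovy exposes. The `findPublishTime` name and the sample HTML fragment are made up for illustration, and real WeChat pages may embed the timestamp differently.

```groovy
import java.util.regex.Matcher

// Hypothetical helper: returns the first "yyyy-MM-dd HH:mm" timestamp in the text, or null.
String findPublishTime(String html) {
    Matcher m = html =~ /20[12]\d-\d{2}-\d{2} \d{2}:\d{2}/
    return m.find() ? m.group() : null
}

// Made-up fragment standing in for a fetched article page
def sample = '<em id="publish_time">2021-08-15 12:34</em>'
println findPublishTime(sample)
```

Returning null when nothing matches lets the caller decide how to handle pages without a visible date.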
Core Extraction Method
static def test(String url) {
    // The article identifier is the last path segment of the URL
    def key = url.substring(url.lastIndexOf('/') + 1)
    def get = getHttpGet(url)
    def response = getHttpResponse(get)
    def res = response.getString("content")
    // Match timestamps such as 2021-08-15 12:34 (years 2010 to 2029)
    def all = Regex.regexAll(res, "20[12]\\d-\\d{2}-\\d{2} \\d{2}:\\d{2}")
    def s = all[0]
    output(key + PART + s)
}

The method extracts the article identifier from the URL, performs an HTTP GET, reads the response body, applies a regular expression to locate the date, and outputs a string that combines the identifier with the found timestamp.
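Separating the parsing from the HTTP call makes the same logic easy to check offline. The sketch below is one way to do that, not the author's code: `buildRecord` is a hypothetical name, and the `|` separator stands in for the `PART` constant, whose value is not shown in the article.

```groovy
// Hypothetical helper: builds the "key|timestamp" record for one article.
// The "|" separator is an assumption standing in for the original PART constant.
String buildRecord(String url, String html) {
    // Article identifier = last path segment of the URL
    def key = url.substring(url.lastIndexOf('/') + 1)
    def m = html =~ /20[12]\d-\d{2}-\d{2} \d{2}:\d{2}/
    // Return null for pages without a timestamp instead of recording "null"
    return m.find() ? key + '|' + m.group() : null
}

// Made-up URL and HTML fragment for illustration
println buildRecord('https://mp.weixin.qq.com/s/abc123', '<em>2021-08-15 12:34</em>')
```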
Processing the Markdown List
static def spider() {
    String path = "/Users/oker/IdeaProjects/funtester/document/directory.markdown"
    def line = RWUtil.readByLine(path)
    line.each {
        // Only list items that link to a WeChat article are processed
        if (it.startsWith("- [") && it.contains("weixin.qq")) {
            // Assumes "- [title](url)" lines: skip "](" and drop the trailing ")"
            String url = it.substring(it.lastIndexOf("]") + 2, it.length() - 1)
            test(url)
            // Pause three seconds between requests
            sleep(3.0)
        }
    }
}

This method reads the Markdown file line by line, identifies lines that contain WeChat article links, calls test(url) for each, and pauses three seconds between requests to avoid triggering anti‑scraping rules.
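Index arithmetic on Markdown lines is easy to get wrong, so a regex-based variant may be easier to verify. The following sketch assumes list items of the form `- [title](url)`; the `extractWeChatUrl` name and the sample lines are hypothetical.

```groovy
// Hypothetical helper: pulls the URL out of a Markdown list item such as "- [title](url)".
// Returns null for lines that are not WeChat link items.
String extractWeChatUrl(String line) {
    def m = line =~ /- \[.*\]\((\S*weixin\.qq\S*)\)\s*/
    // matches() requires the whole line to fit the pattern
    return m.matches() ? m.group(1) : null
}

// Made-up lines standing in for the contents of the Markdown file
def lines = [
    '# FunTester articles',
    '- [Sample post](https://mp.weixin.qq.com/s/abc123)',
    '- [Other site](https://example.com/post)'
]
lines.collect { extractWeChatUrl(it) }.findAll { it != null }.each { println it }
```

Only the second sample line yields a URL; headings and links to other sites are skipped.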
Rewriting the Markdown with Dates
public static void main(String[] args) {
    // Load the previously scraped dates and index them by article identifier
    def string = RWUtil.readByString(getLongFile("wx"))
    def info = parse(string)
    String path = "/Users/oker/IdeaProjects/funtester/document/directory.markdown"
    def line = RWUtil.readByLine(path)
    line.each {
        if (it.startsWith("- [") && it.contains("weixin.qq")) {
            // Assumes "- [title](url)" lines: skip "](" and drop the trailing ")"
            String url = it.substring(it.lastIndexOf("]") + 2, it.length() - 1)
            def key = url.substring(url.lastIndexOf("/") + 1)
            // 发表于 means "published on"
            output("$it $TAB 发表于${info.get(key)}")
        } else {
            output(LINE + it + LINE)
        }
    }
}

The scraped dates are held in a com.alibaba.fastjson.JSONObject (info in the code), keyed by article identifier. The method reads the original Markdown again and, for each link line, outputs the same line followed by the extracted date, e.g. "发表于2021-08-15 12:34" (发表于 means "published on"). All other lines are output as they are, wrapped in the LINE constant.
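The lookup-and-append step can be sketched without the framework helpers. This is an illustration under stated assumptions rather than the author's implementation: `parseLog` and `annotate` are hypothetical names, a plain Map replaces the JSONObject, the log is assumed to hold `key|date` lines, and the suffix is written in English instead of 发表于.

```groovy
// Hypothetical stand-in for parse(): turns "key|date" log lines into a key -> date map.
// The "|" separator is an assumption; the original PART constant is not shown.
Map<String, String> parseLog(String log) {
    log.readLines()
       .findAll { it.contains('|') }
       .collectEntries { line ->
           def (key, date) = line.split(/\|/, 2)
           [(key.trim()): date.trim()]
       }
}

// Appends the date to WeChat link items and returns every other line unchanged.
String annotate(String line, Map<String, String> dates) {
    if (!(line.startsWith('- [') && line.contains('weixin.qq'))) return line
    // Key = last path segment of the URL, without the closing ")" of the Markdown link
    int start = line.lastIndexOf('/') + 1
    int end = line.lastIndexOf(')')
    if (end <= start) return line
    def date = dates[line.substring(start, end)]
    return date ? "$line published on $date" : line
}

// Made-up log entry and Markdown lines for illustration
def dates = parseLog('abc123|2021-08-15 12:34')
println annotate('- [Sample post](https://mp.weixin.qq.com/s/abc123)', dates)
println annotate('Some other line', dates)
```

Leaving a link line untouched when no date is known avoids writing "null" into the list.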
Overall, the solution demonstrates a lightweight, one‑time scraping approach without persisting data in a database; results are logged directly, and a three‑second delay between requests reduces the risk of triggering WeChat’s anti‑scraping mechanisms.