Building a Fast Historical‑Today Crawler with Java and MySQL
This article presents an open-source Java crawler that fetches historical-today events from a public API. It covers three practical problems (GET request length limits, inconsistent JSON value types, and month-day string construction) and includes a full code listing and a link to the GitHub repository.
Overview
An open‑source Java program crawls "historical today" events from a public data source. The crawler iterates over every month‑day combination of a year, builds a request URL, fetches a JavaScript file that contains JSON data, extracts event fields, and stores them in a MySQL table. The complete source code and sample data are hosted on GitHub.
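The URL construction and the unwrapping of the JavaScript file can be sketched as follows. This is a minimal illustration, not the author's code: the class and method names are hypothetical, the sample payload is invented, and the wrapper is removed by searching for the outermost braces, whereas the listing later in this article drops a fixed 8-character prefix.

```java
class TodayUrlSketch {
    // Base path taken from the article's listing.
    static final String BASE = "http://tools.example.com/his/";

    // "01-01" -> "http://tools.example.com/his/0101_c.js"
    static String buildUrl(String date) {
        return BASE + date.replace("-", "") + "_c.js";
    }

    // Keeps only the text between the first '{' and the last '}'.
    // Assumption: the file wraps one top-level JSON object in a JS assignment.
    static String stripJsWrapper(String js) {
        int start = js.indexOf('{');
        int end = js.lastIndexOf('}');
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("no JSON object found");
        }
        return js.substring(start, end + 1);
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("01-01"));
        // Hypothetical payload; the real file's prefix is not shown in the article.
        System.out.println(stripJsWrapper("var d = {\"1900\":[]};"));
    }
}
```

Searching for braces tolerates a prefix of any length, at the cost of assuming the payload is always a JSON object.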
Key Issues and Solutions
GET request length limit: The original implementation sent a full SQL statement in a GET request. When the request URL exceeded the server's maximum length, the request failed. The fix sends the statement in the body of a POST request instead and keeps the previous GET interface for backward compatibility.
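A minimal sketch of moving the statement from the URL into a POST body, using the JDK's java.net.http API (Java 11+). The endpoint URL, the form field name sql, and the class and method names are assumptions made for illustration; the article does not show the author's actual interface.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

class SqlPostSketch {
    // Hypothetical endpoint; replace with the real service address.
    static final String ENDPOINT = "http://localhost:8080/sql";

    // Form-encodes the statement so quotes and spaces survive transport.
    static String formBody(String sql) {
        return "sql=" + URLEncoder.encode(sql, StandardCharsets.UTF_8);
    }

    // The statement travels in the request body, so URL length limits no longer apply.
    static HttpRequest buildRequest(String sql) {
        return HttpRequest.newBuilder(URI.create(ENDPOINT))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(sql)))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildRequest("INSERT INTO t (a) VALUES ('x');");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

The request can then be sent with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()); the sketch stops before that step so it runs without a server.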
Inconsistent JSON value types: For a given date the API may return either a JSON array (multiple events) or a single JSON object (one event). Instead of branching on the container type, the code uses the regular-expression pattern \{"title.+?\} to locate each event block, then parses each block with JSONObject.fromObject. Because the match is lazy, each block ends at the first closing brace after "title.
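The extraction step can be sketched with java.util.regex alone. The pattern is the one quoted above; the class name, method name, and sample strings are hypothetical. Note the limitation that follows from the lazy match: a field value containing a closing brace would cut the block short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EventBlockSketch {
    // Same pattern as in the article: from {"title up to the first closing brace.
    private static final Pattern EVENT = Pattern.compile("\\{\"title.+?\\}");

    // Returns every event block found, whether the input is an array or a single object.
    static List<String> eventBlocks(String value) {
        List<String> blocks = new ArrayList<>();
        Matcher matcher = EVENT.matcher(value);
        while (matcher.find()) {
            blocks.add(matcher.group());
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Invented sample values, shaped like the two cases described in the text.
        String array = "[{\"title\":\"A\",\"keyword\":\"k\"},{\"title\":\"B\",\"keyword\":\"m\"}]";
        String single = "{\"title\":\"A\",\"keyword\":\"k\"}";
        System.out.println(eventBlocks(array).size());
        System.out.println(eventBlocks(single).size());
    }
}
```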
Month-day string formatting: The program needs zero-padded month and day strings (e.g., 01, 09). A conditional expression builds these strings by prefixing "0" to any value below 10.
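As an alternative to the manual check, the padding and the enumeration of valid dates can be sketched with String.format and java.time.Month. This swaps in a different technique from the article's conditional expression, and the class and method names are hypothetical. Month.maxLength() returns 29 for February, so the leap day is included, matching the listing's behavior of skipping only February 30 and 31.

```java
import java.time.Month;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class MonthDaySketch {
    // Zero-pads to two digits: 9 -> "09", 12 -> "12".
    static String pad(int n) {
        return String.format(Locale.ROOT, "%02d", n);
    }

    // Builds every valid "MM-dd" string, including 02-29.
    static List<String> allDates() {
        List<String> dates = new ArrayList<>();
        for (Month month : Month.values()) {
            for (int day = 1; day <= month.maxLength(); day++) {
                dates.add(pad(month.getValue()) + "-" + pad(day));
            }
        }
        return dates;
    }

    public static void main(String[] args) {
        List<String> dates = allDates();
        System.out.println(dates.size() + " dates, from " + dates.get(0)
                + " to " + dates.get(dates.size() - 1));
    }
}
```

Letting Month supply the day count removes the need to hard-code which months have 30 days.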
Implementation Details
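// The listing uses Groovy syntax (def, ranges, closures) and runs on the JVM.
// EMPTY, SPACE_1, GBK, DEFAULT_CHARSET, FanRequest, Regex, MySqlTest and testOver()
// are not defined here; they presumably come from the author's framework in the repository.
static void main(String[] args) {
    DEFAULT_CHARSET = GBK;
    for (int i in 1..12) {
        for (int j in 1..31) {
            // Skip days that do not exist; February 29 is kept so leap-day events are crawled.
            if (i == 2 && (j == 30 || j == 31)) continue;
            if ((i in [4, 6, 9, 11]) && j == 31) continue;
            // Zero-pad single-digit months and days.
            def month = i > 9 ? i + EMPTY : "0" + i;
            def day = j > 9 ? j + EMPTY : "0" + j;
            def date = month + "-" + day;
            getInfo(date);
        }
    }
    testOver();
}

static void getInfo(String date) {
    // "01-01" becomes .../his/0101_c.js
    def url = "http://tools.example.com/his/" + date.replace("-", EMPTY) + "_c.js";
    // Drop the 8-character JavaScript prefix, the semicolon, and whitespace so only JSON remains.
    def all = FanRequest.isGet()
            .setUri(url)
            .getResponse()
            .getString("content")
            .substring(8)
            .replace(";", EMPTY)
            .replaceAll("( )+", EMPTY)
            .replaceAll("\\t", EMPTY)
            .replace("##", EMPTY)
            .replaceAll(SPACE_1, EMPTY);
    def json = JSONObject.fromObject(all);
    def keys = json.keySet();
    keys.each { key ->
        def s = json.get(key).toString();
        // Match each {"title ... } block, whether the value is an array or a single object.
        // The pattern must end with \} (not a quote) so each match is a complete JSON object.
        def all1 = Regex.regexAll(s, "\\{\"title.+?\\}");
        // Iterate the list directly: the range 0..size()-1 becomes 0..-1 when the list is empty.
        for (info in all1) {
            def inf = JSONObject.fromObject(info.toString());
            def title = inf.getString("title");
            def keyword = inf.getString("keyword");
            def content = inf.getString("content");
            def alt = inf.getString("alt");
            // Values are interpolated as-is, so a double quote inside a field would break the statement.
            String sql = "INSERT INTO today_histroy (date,title,keyword,content,alt) VALUES (\"%s\",\"%s\",\"%s\",\"%s\",\"%s\");";
            // The JSON key is prefixed to the month-day string to form the date column.
            sql = String.format(sql, key + "-" + date, title, keyword, content.replace(" ", EMPTY), alt);
            MySqlTest.sendWork(sql);
        }
    }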
}
Source code and related data are available at https://github.com/Fhaohaizi/fan
