Building a Fast Historical‑Today Crawler with Java and MySQL

An open‑source Java crawler that fetches historical‑today events from a public API is presented, detailing three practical challenges—GET request length limits, ambiguous JSON value types, and month string construction—along with a full code example and a GitHub repository link for reference.

FunTester
FunTester
FunTester
Building a Fast Historical‑Today Crawler with Java and MySQL

Overview

An open‑source Java program crawls "historical today" events from a public data source. The crawler iterates over every month‑day combination of a year, builds a request URL, fetches a JavaScript file that contains JSON data, extracts event fields, and stores them in a MySQL table. The complete source code and sample data are hosted on GitHub.

Key Issues and Solutions

GET request length limit : The original implementation sent a full SQL statement in a GET request. When the request URL exceeded the server’s maximum length, the request failed. The fix replaces the GET call with a POST request while keeping the previous GET interface for backward compatibility.

Inconsistent JSON value types : For a given date the API may return either a JSON array (multiple events) or a single JSON object (one event). The code uses a regular‑expression pattern \{"title.+?\} to locate each event block regardless of the container type, then parses each block with JSONObject.fromObject.

Month‑day string formatting : The program needs zero‑padded month and day strings (e.g., 01, 09). A helper expression builds these strings by checking the numeric value and prefixing a "0" when necessary.

Implementation Details

static void main(String[] args) {
    DEFAULT_CHARSET = GBK;
    for (int i in 1..12) {
        for (int j in 1..31) {
            if (i == 2 && (j == 30 || j == 31)) continue;
            if ((i in [4, 6, 9, 11]) && j == 31) continue;
            def month = i > 9 ? i + EMPTY : "0" + i;
            def day   = j > 9 ? j + EMPTY : "0" + j;
            def date  = month + "-" + day;
            getInfo(date);
        }
    }
    testOver();
}

static getInfo(String date) {
    def url = "http://tools.example.com/his/" + date.replace("-", EMPTY) + "_c.js";
    def all = FanRequest.isGet()
        .setUri(url)
        .getResponse()
        .getString("content")
        .substring(8)
        .replace(";", EMPTY)
        .replaceAll("( )+", EMPTY)
        .replaceAll("\\t", EMPTY)
        .replace("##", EMPTY)
        .replaceAll(SPACE_1, EMPTY);
    def json = JSONObject.fromObject(all);
    def keys = json.keySet();
    keys.each { key ->
        def s = json.get(key).toString();
        def all1 = Regex.regexAll(s, "\\{\"title.+?\"");
        for (int i in 0..all1.size() - 1) {
            def info = all1.get(i);
            def inf = JSONObject.fromObject(info.toString());
            def title   = inf.getString("title");
            def keyword = inf.getString("keyword");
            def content = inf.getString("content");
            def alt     = inf.getString("alt");
            String sql = "INSERT INTO today_histroy (date,title,keyword,content,alt) VALUES (\"%s\",\"%s\",\"%s\",\"%s\",\"%s\");";
            sql = String.format(sql, key + "-" + date, title, keyword, content.replace("  ", EMPTY), alt);
            MySqlTest.sendWork(sql);
        }
    }
}

Source code and related data are available at https://github.com/Fhaohaizi/fan

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

mysqlHTTPData ExtractionGitHubWeb Crawling
FunTester
Written by

FunTester

10k followers, 1k articles | completely useless

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.