How to Scrape 7.2 Million Historical Weather Records with Groovy

This article explains how to use a Groovy script to crawl over 7 million historical weather entries for 3,200 cities spanning 2011‑2019, process the JSON responses, and store the cleaned data into a MySQL table, while sharing practical tips and code snippets.

FunTester

The author completed a massive weather‑data crawl that collected 7.2 million records covering 3,200 regions from 2011 to 2019, amounting to roughly 950 MB of raw JSON. The original dataset was lost, but the provided Groovy script can reproduce the entire collection by running overnight.
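As a rough sanity check on run time, the request count can be estimated from the figures above. This is a back-of-the-envelope sketch, not a figure from the author: it assumes one request per city and month, coverage from January 2011 through September 2019, and a pause of at least one second after every request, as in the script's sleep calls.

```groovy
// All inputs are assumptions derived from the numbers quoted in the article.
int cities = 3200
int months = 8 * 12 + 9                 // Jan 2011 .. Sep 2019
long requests = (long) cities * months  // one request per city-month
double minHours = requests / 3600.0d    // at least 1 s of pause per request

println "requests: ${requests}"
println "minimum pause time: ${String.format('%.1f', minHours)} hours"
```

Under these assumptions there are 336,000 requests and more than 93 hours of pauses alone, so a run that finishes overnight would presumably need the city list split across several workers running in parallel.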

Implementation overview: The script is written in Groovy, a JVM-based scripting language that cuts down on Java boilerplate. It defines a Weather class extending ApiLibrary with three static methods:

getCityAll(int cityId) iterates over the years 2011 to 2018, calling getCityYear for each year and pausing 1 to 2 seconds between calls. Because the loop stops at 2018, the range has to be extended to 2019 to collect the 2019 data mentioned above.

getCityYear(int cityId, int year) loops through the 12 months, skipping the months after September when the year is 2019, and invokes getMonth for each one, again pausing 1 to 2 seconds.

getMonth(int cityId, int year, int month) builds the request URL (the path layout differs for years after 2016), fetches the JavaScript file via FanRequest.isGet(), drops the first 16 characters and any semicolons so that only the JSON object remains, and parses it with net.sf.json.JSONObject.
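The URL construction and payload clean-up described above can be sketched in isolation. This is a minimal illustration, not the author's code: the host, the city ID, and the sample payload are hypothetical, and groovy.json.JsonSlurper stands in for net.sf.json.JSONObject so the snippet runs without extra dependencies.

```groovy
import groovy.json.JsonSlurper

// Hypothetical base URL; the article redacts the real host.
final String BASE = "http://example.invalid/t/wea_history/js/"

// Build the monthly data URL using the two path layouts from the script.
def buildUri = { String base, int cityId, int year, int month ->
    if (year > 2016) {
        int stamp = year * 100 + month            // e.g. 201703
        return "${base}${stamp}/${cityId}_${stamp}.js".toString()
    }
    String legacyStamp = "${year}${month}"        // no zero padding, as in the script
    return "${base}${cityId}_${legacyStamp}.js".toString()
}

// Drop a fixed-length JavaScript prefix and any semicolons, leaving JSON text.
def stripWrapper = { String body, int prefixLength ->
    body.substring(prefixLength).replace(";", "")
}

// Hypothetical payload: a 16-character assignment prefix, then a JSON object.
String sample = 'var demo_data = {"city":"Demo","tqInfo":[]};'
def parsed = new JsonSlurper().parseText(stripWrapper(sample, 16))

println buildUri(BASE, 54511, 2017, 3)
println parsed.city
```

Note that replace(";", "") removes every semicolon in the body, not only the trailing one, which matches the behavior of the original script.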

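For each day entry, the script extracts ymd (date), the two temperature fields bWendu and yWendu (stored as low and high, respectively), tianqi (weather description), fengxiang (wind direction), fengli (wind level), and the optional air-quality fields aqi, aqiInfo, and aqiLevel. Temperature strings are cleaned by removing the "℃" symbol. Entries without a ymd key are skipped, and when aqi is absent the air-quality columns fall back to the TEST_ERROR_CODE and EMPTY placeholders.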
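The per-day extraction can be illustrated with a small standalone sketch. The sample entry, the MISSING sentinel (standing in for TEST_ERROR_CODE, whose real value is not shown in the article), and the helper names toInt and toRow are all hypothetical; JsonSlurper is again used in place of net.sf.json.

```groovy
import groovy.json.JsonSlurper

final String DEGREE = "\u2103"   // the ℃ sign
final int MISSING = -1           // hypothetical stand-in for TEST_ERROR_CODE

// Strip the degree sign and parse; fall back to the sentinel on bad input.
def toInt = { Object raw ->
    try {
        return Integer.parseInt(raw.toString().replace(DEGREE, "").trim())
    } catch (NumberFormatException ignored) {
        return MISSING
    }
}

// Map one day entry to plain values, defaulting the optional AQI fields.
def toRow = { Map info ->
    boolean hasAqi = info.containsKey("aqi")
    return [
        date    : info.ymd,
        bWendu  : toInt(info.bWendu),
        yWendu  : toInt(info.yWendu),
        weather : info.tianqi,
        wind    : info.fengxiang,
        windSize: info.fengli,
        aqi     : hasAqi ? toInt(info.aqi) : MISSING,
        aqiInfo : hasAqi ? info.aqiInfo : "",
        aqiLevel: hasAqi ? toInt(info.aqiLevel) : MISSING
    ]
}

// Hypothetical day entry using the field names listed above.
String json = '''{"ymd":"2016-01-01","bWendu":"5\u2103","yWendu":"-3\u2103",
  "tianqi":"sunny","fengxiang":"north","fengli":"level 2",
  "aqi":"168","aqiInfo":"moderate","aqiLevel":"4"}'''

def row = toRow(new JsonSlurper().parseText(json))
println row
```

Returning a sentinel instead of throwing keeps a single malformed temperature from aborting a whole month, at the cost of having to filter sentinel values out later.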

After processing, an INSERT INTO weather SQL statement is constructed with String.format and executed via MySqlTest.sendWork(sql). Helper functions like changeStringToInt convert temperature strings to integers, and placeholders EMPTY and TEST_ERROR_CODE handle missing data.
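Because String.format embeds the values directly in the statement text, a value that contains a double quote would break the generated SQL. One common alternative is a parameterized statement. The sketch below uses plain JDBC rather than the author's MySqlTest.sendWork, whose internals are not shown in the article; the helper names toParams and insertRow and the row map keys are hypothetical.

```groovy
import java.sql.Connection
import java.sql.PreparedStatement

// Same column list as the script, with placeholders instead of inlined values.
final String INSERT_SQL = "INSERT INTO weather " +
        "(city,low,high,date,wind,windsize,weather,aqi,aqilevel,aqiinfo) " +
        "VALUES (?,?,?,?,?,?,?,?,?,?)"

// Order the values to match the column list above.
def toParams = { String city, Map row ->
    return [city, row.low, row.high, row.date, row.wind,
            row.windSize, row.weather, row.aqi, row.aqiLevel, row.aqiInfo]
}

// Bind and execute one insert; the caller supplies an open JDBC connection.
def insertRow = { Connection conn, List params ->
    PreparedStatement ps = conn.prepareStatement(INSERT_SQL)
    try {
        params.eachWithIndex { value, i -> ps.setObject(i + 1, value) }
        return ps.executeUpdate()
    } finally {
        ps.close()
    }
}

// Hypothetical row; the quote in the weather text needs no manual escaping.
def params = toParams("Demo", [low: -3, high: 5, date: "2016-01-01", wind: "north",
        windSize: "level 2", weather: 'light "drizzle"', aqi: 168, aqiLevel: 4, aqiInfo: "moderate"])
println params
```

With a live connection, insertRow(conn, params) would execute the statement; for millions of rows, batching through addBatch and executeBatch on the same PreparedStatement is a typical next step.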

The full source code (with English comments) is shown below:

package com.fan
import com.fission.source.httpclient.ApiLibrary
import com.fission.source.httpclient.FanRequest
import com.fission.source.mysql.MySqlTest
import com.fission.source.source.WriteRead
import com.fission.source.utils.Log
import net.sf.json.JSONException
import net.sf.json.JSONObject

class Weather extends ApiLibrary {
    // Get all data for a city from 2011 to 2018.
    // Note: the 2019 check in getCityYear only takes effect if this range is extended to 2019.
    static getCityAll(int cityId) {
        for (int j in 2011..2018) {
            getCityYear(cityId, j)
            sleep(1000 + getRandomInt(1000))
        }
    }

    // Get data for a specific year
    static getCityYear(int cityId, int year) {
        for (int i in 1..12) {
            if (year == 2019 && i > 9) continue
            getMonth(cityId, year, i)
            sleep(1000 + getRandomInt(1000))
        }
    }

    // Get data for a specific month
    static getMonth(int cityId, int year, int month) {
        def yyyymm
        def uri
        if (year > 2016) {
            yyyymm = year * 100 + month
            uri = "http://tianqi.***.com/t/wea_history/js/" + yyyymm + "/" + cityId + "_" + yyyymm + ".js"
        } else {
            yyyymm = year + EMPTY + month
            uri = "http://tianqi.***.com/t/wea_history/js/" + cityId + "_" + yyyymm + ".js"
        }
        output(uri)
        def response = FanRequest.isGet()
                .setUri(uri)
                .getResponse()
                .getString("content")
                .substring(16)
                .replace(";", EMPTY)
        def weather = JSONObject.fromObject(response)
        def city = weather.getString("city")
        def array = weather.getJSONArray("tqInfo")
        output(array.size())
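        // Exclusive upper bound: 0..array.size() - 1 turns into the reverse range 0..-1 when the array is empty
        for (int i in 0..<array.size()) {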
            JSONObject info = array.get(i)
            if (!info.containsKey("ymd")) continue
            def date = info.getString("ymd")
            def low = info.getString("bWendu").replace("℃", EMPTY)
            def high = info.getString("yWendu").replace("℃", EMPTY)
            def wea = info.getString("tianqi")
            def wind = info.getString("fengxiang")
            def fengli = info.getString("fengli")
            def aqi = TEST_ERROR_CODE, aqiInfo = EMPTY, aqiLevel = TEST_ERROR_CODE
            if (info.containsKey("aqi")) {
                aqi = info.getInt("aqi")
                aqiInfo = info.getString("aqiInfo")
                aqiLevel = info.getInt("aqiLevel")
            }
            String sql = "INSERT INTO weather (city,low,high,date,wind,windsize,weather,aqi,aqilevel,aqiinfo) VALUES (\"%s\",%d,%d,\"%s\",\"%s\",\"%s\",\"%s\",%d,%d,\"%s\");"
            sql = String.format(sql, city, changeStringToInt(low), changeStringToInt(high), date, wind, fengli, wea, aqi, aqiLevel, aqiInfo)
            output(sql)
            MySqlTest.sendWork(sql)
        }
    }
}

A screenshot of the resulting MySQL table is included below to illustrate the final data layout.

Database screenshot

The author notes that many “pitfalls” were encountered during development and promises a future, more detailed blog post describing those challenges.
