How to Scrape 7.2 Million Historical Weather Records with Groovy
This article explains how to use a Groovy script to crawl over 7 million historical weather entries for 3,200 cities spanning 2011‑2019, process the JSON responses, and store the cleaned data into a MySQL table, while sharing practical tips and code snippets.
The author completed a massive weather‑data crawl that collected 7.2 million records covering 3,200 regions from 2011 to 2019, amounting to roughly 950 MB of raw JSON. The original dataset was lost, but the provided Groovy script can reproduce the entire collection by running overnight.
Implementation overview: The script is written in Groovy, a JVM‑based scripting language that cuts down on Java boilerplate. It defines a Weather class that extends ApiLibrary and has three static methods:
getCityAll(int cityId) iterates over the years 2011‑2018, calling getCityYear for each year and pausing 1‑2 seconds between years.
getCityYear(int cityId, int year) loops through the 12 months, skips the months after September when the year is 2019, calls getMonth for each remaining month, and pauses 1‑2 seconds between months.
getMonth(int cityId, int year, int month) builds the request URL (one format for years after 2016, another for 2016 and earlier), fetches the JavaScript file via FanRequest.isGet(), strips the first 16 characters and any semicolons so that only the JSON remains, and parses it with net.sf.json.JSONObject.
Because getCityAll stops at 2018, the 2019 check only takes effect if the year range is extended to 2019 or getCityYear is called directly for that year.
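The URL logic described above can be sketched on its own, without the author's ApiLibrary and FanRequest helpers. This is a minimal illustration under stated assumptions: buildMonthUri and monthsToCrawl are hypothetical names, the city ID 54511 is a made-up example, and the host stays masked exactly as in the source, so the base path is passed in as a parameter.

```groovy
// Build the URL of one monthly file (hypothetical helper mirroring the two layouts).
String buildMonthUri(String base, int cityId, int year, int month) {
    if (year > 2016) {
        // Newer layout: yyyymm is used both as a folder and in the file name.
        int padded = year * 100 + month
        return "${base}${padded}/${cityId}_${padded}.js"
    }
    // Older layout: year and month are concatenated without zero padding (e.g. 20153).
    String unpadded = "${year}${month}"
    return "${base}${cityId}_${unpadded}.js"
}

// List every [year, month] pair visited for one city (hypothetical helper).
List<List<Integer>> monthsToCrawl(int firstYear, int lastYear) {
    List<List<Integer>> pairs = []
    for (int year in firstYear..lastYear) {
        for (int month in 1..12) {
            // The source skips the months after September 2019.
            if (year == 2019 && month > 9) continue
            pairs << [year, month]
        }
    }
    return pairs
}

// The host is masked in the source, so it stays masked here.
String base = 'http://tianqi.***.com/t/wea_history/js/'
println buildMonthUri(base, 54511, 2017, 3)
println buildMonthUri(base, 54511, 2015, 3)
println monthsToCrawl(2011, 2018).size()
```

Keeping URL construction in a pure function makes it possible to check both layouts before any request is sent.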
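For each day entry, the script extracts the date (ymd), the two temperature fields (bWendu and yWendu, which the code stores as low and high respectively), the weather description (tianqi), wind direction (fengxiang), wind level (fengli), and, when present, the air‑quality fields (aqi, aqiInfo, aqiLevel). Entries without a ymd key are skipped. The "℃" symbol is removed from the temperature strings before they are converted to integers.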
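To make the extraction step concrete, here is a small sketch that uses Groovy's built-in groovy.json.JsonSlurper instead of net.sf.json.JSONObject, which the original script relies on. The sample payload, its var data= prefix, and the helper names stripJsWrapper and parseDays are assumptions for illustration; the real files may differ in prefix and quoting. Rather than cutting a fixed number of leading characters, the sketch keeps everything between the first { and the last }.

```groovy
import groovy.json.JsonSlurper

// Hypothetical payload: a JS assignment wrapping a JSON object, ending with an empty entry.
String sample = 'var data={"city":"Beijing","tqInfo":[' +
    '{"ymd":"2018-01-01","bWendu":"3℃","yWendu":"-6℃","tianqi":"Sunny",' +
    '"fengxiang":"North","fengli":"2","aqi":"55","aqiInfo":"Good","aqiLevel":"2"},' +
    '{"ymd":"2018-01-02","bWendu":"1℃","yWendu":"-8℃","tianqi":"Cloudy",' +
    '"fengxiang":"North","fengli":"3"},' +
    '{}]};'

// Keep only the text between the first '{' and the last '}'.
String stripJsWrapper(String js) {
    return js.substring(js.indexOf('{'), js.lastIndexOf('}') + 1)
}

// Turn the payload into one map per day, skipping entries that have no date.
List<Map> parseDays(String js) {
    def weather = new JsonSlurper().parseText(stripJsWrapper(js))
    return weather.tqInfo
        .findAll { it.containsKey('ymd') }
        .collect { info ->
            [
                city    : weather.city,
                date    : info.ymd,
                bWendu  : info.bWendu.replace('℃', ''),
                yWendu  : info.yWendu.replace('℃', ''),
                weather : info.tianqi,
                wind    : info.fengxiang,
                windsize: info.fengli,
                // Air-quality fields are optional, so they stay null when absent.
                aqi     : info.containsKey('aqi') ? (info.aqi as Integer) : null,
                aqiInfo : info.containsKey('aqi') ? info.aqiInfo : null,
                aqiLevel: info.containsKey('aqi') ? (info.aqiLevel as Integer) : null
            ]
        }
}

parseDays(sample).each { println it }
```

The sketch uses null for missing air‑quality values; the original script uses its own EMPTY and TEST_ERROR_CODE constants instead.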
For each entry, the script then builds an INSERT INTO weather statement with String.format and executes it via MySqlTest.sendWork(sql). The helper changeStringToInt converts the cleaned temperature strings to integers, and the constants EMPTY and TEST_ERROR_CODE stand in for missing values: aqi and aqiLevel default to TEST_ERROR_CODE, and aqiInfo defaults to EMPTY, when an entry has no air‑quality data.
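Because String.format splices values directly into the SQL text, a quote character inside any field would break the statement. One common alternative, which is not what the original script does, is to bind values through ? placeholders. The sketch below only builds the statement and its parameter list and does not touch a database; toIntOrDefault and buildInsert are hypothetical helpers, and -1 is an arbitrary stand‑in for the source's TEST_ERROR_CODE, whose real value is not shown.

```groovy
// Convert a cleaned temperature string to an int, falling back when it is blank or malformed.
int toIntOrDefault(String text, int fallback) {
    if (text == null || text.trim().isEmpty()) return fallback
    try {
        return Integer.parseInt(text.trim())
    } catch (NumberFormatException ignored) {
        return fallback
    }
}

// Build the statement text and its bound values without executing anything.
List buildInsert(Map day, int missing) {
    String sql = 'INSERT INTO weather ' +
        '(city,low,high,date,wind,windsize,weather,aqi,aqilevel,aqiinfo) ' +
        'VALUES (?,?,?,?,?,?,?,?,?,?)'
    List params = [
        day.city,
        toIntOrDefault(day.low, missing),
        toIntOrDefault(day.high, missing),
        day.date,
        day.wind,
        day.windsize,
        day.weather,
        day.aqi != null ? day.aqi : missing,
        day.aqilevel != null ? day.aqilevel : missing,
        day.aqiinfo != null ? day.aqiinfo : ''
    ]
    return [sql, params]
}

// Made-up row used only to show the call.
Map day = [city: 'Beijing', low: '-6', high: '3', date: '2018-01-01',
           wind: 'North', windsize: '2', weather: 'Sunny']
def (sql, params) = buildInsert(day, -1)
println sql
println params
// With a live connection, groovy.sql.Sql accepts the pair directly, e.g.:
// groovy.sql.Sql.newInstance(url, user, password).execute(sql, params)
```

The commented‑out last line needs a JDBC driver on the classpath plus real connection details, which is why the sketch stops at building the pair.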
The full source code (with English comments) is shown below:
package com.fan

import com.fission.source.httpclient.ApiLibrary
import com.fission.source.httpclient.FanRequest
import com.fission.source.mysql.MySqlTest
import com.fission.source.source.WriteRead
import com.fission.source.utils.Log
import net.sf.json.JSONException
import net.sf.json.JSONObject

class Weather extends ApiLibrary {

    // Get all data for a city from 2011 to 2018
    static getCityAll(int cityId) {
        for (int j in 2011..2018) {
            getCityYear(cityId, j)
            // Pause 1-2 seconds between years
            sleep(1000 + getRandomInt(1000))
        }
    }

    // Get data for a specific year
    static getCityYear(int cityId, int year) {
        for (int i in 1..12) {
            // Skip the months after September 2019 (only reached if 2019 is requested)
            if (year == 2019 && i > 9) continue
            getMonth(cityId, year, i)
            // Pause 1-2 seconds between months
            sleep(1000 + getRandomInt(1000))
        }
    }

    // Get data for a specific month
    static getMonth(int cityId, int year, int month) {
        def yyyymm
        def uri
        if (year > 2016) {
            // Newer layout: yyyymm appears as a folder and in the file name
            yyyymm = year * 100 + month
            uri = "http://tianqi.***.com/t/wea_history/js/" + yyyymm + "/" + cityId + "_" + yyyymm + ".js"
        } else {
            // Older layout: year and month are concatenated without zero padding
            yyyymm = year + EMPTY + month
            uri = "http://tianqi.***.com/t/wea_history/js/" + cityId + "_" + yyyymm + ".js"
        }
        output(uri)
        // Drop the 16-character JavaScript prefix and any semicolons so only JSON remains
        def response = FanRequest.isGet()
                .setUri(uri)
                .getResponse()
                .getString("content")
                .substring(16)
                .replace(";", EMPTY)
        def weather = JSONObject.fromObject(response)
        def city = weather.getString("city")
        def array = weather.getJSONArray("tqInfo")
        output(array.size())
        // Half-open range: an empty array yields no iterations (0..size - 1 would count 0, -1)
        for (int i in 0..<array.size()) {
            JSONObject info = array.get(i)
            // Skip entries that carry no date
            if (!info.containsKey("ymd")) continue
            def date = info.getString("ymd")
            def low = info.getString("bWendu").replace("℃", EMPTY)
            def high = info.getString("yWendu").replace("℃", EMPTY)
            def wea = info.getString("tianqi")
            def wind = info.getString("fengxiang")
            def fengli = info.getString("fengli")
            // Air-quality fields are optional, so start from placeholder values
            def aqi = TEST_ERROR_CODE, aqiInfo = EMPTY, aqiLevel = TEST_ERROR_CODE
            if (info.containsKey("aqi")) {
                aqi = info.getInt("aqi")
                aqiInfo = info.getString("aqiInfo")
                aqiLevel = info.getInt("aqiLevel")
            }
            String sql = "INSERT INTO weather (city,low,high,date,wind,windsize,weather,aqi,aqilevel,aqiinfo) VALUES (\"%s\",%d,%d,\"%s\",\"%s\",\"%s\",\"%s\",%d,%d,\"%s\");"
            sql = String.format(sql, city, changeStringToInt(low), changeStringToInt(high), date, wind, fengli, wea, aqi, aqiLevel, aqiInfo)
            output(sql)
            MySqlTest.sendWork(sql)
        }
    }
}

A screenshot of the resulting MySQL table is included below to illustrate the final data layout.
The author notes that many “pitfalls” were encountered during development and promises a future, more detailed blog post describing those challenges.
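One Groovy‑specific trap worth knowing when looping over a JSON array by index (not necessarily one of the pitfalls the author has in mind) is that an inclusive range such as 0..size - 1 counts downward when size is 0, so the loop body runs for the indices 0 and -1 instead of not running at all. The half‑open form 0..<size is empty in that case. The helper names below are hypothetical and exist only to show the difference.

```groovy
// Indices visited by an inclusive range 0..size-1.
List<Integer> inclusiveIndices(int size) {
    return (0..size - 1).toList()
}

// Indices visited by a half-open range 0..<size.
List<Integer> exclusiveIndices(int size) {
    return (0..<size).toList()
}

println inclusiveIndices(3)   // [0, 1, 2]
println exclusiveIndices(3)   // [0, 1, 2]
println inclusiveIndices(0)   // [0, -1]  (reverse range, not empty)
println exclusiveIndices(0)   // []
```

For a month file whose tqInfo array is empty, the half‑open form simply skips the loop.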