Java HttpClient-Based Crawler for Nationwide School Names
The author presents a Java HttpClient crawler that efficiently retrieves over 60,000 nationwide school names in about 16 minutes by replacing slow Selenium UI scripts with direct API calls, storing the results in a database, and shares the complete source code for reference.
While working with HttpClient, the author saw an opportunity to scrape data such as the names of middle schools nationwide. An earlier Selenium-based UI script was slow and unstable; switching to direct API calls improved performance dramatically, fetching more than 60,000 records in roughly 16 minutes, database storage included.
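To make "direct API call" concrete, here is a minimal sketch of a single GET request against the province-list endpoint path that appears in the author's crawler. It is not the author's implementation: it assumes Java 11+ and uses the JDK's built-in `java.net.http` client rather than the Apache HttpClient wrapper (`ApiLibrary`) the author relies on. The host `https://example.invalid`, the class name `DirectCallSketch`, and the `Accept` header are placeholders chosen for illustration, since the article redacts the real host.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

final class DirectCallSketch {
    // Hypothetical host; the article redacts the real one.
    static final String HOST = "https://example.invalid";

    /** Builds a GET request for the province-list endpoint path used in the article. */
    static HttpRequest provinceListRequest(String host) {
        return HttpRequest.newBuilder(URI.create(host + "/user/editinfo/getSchollCountryList"))
                .timeout(Duration.ofSeconds(10))
                .header("Accept", "application/json") // assumed; the real API's headers are not shown
                .GET()
                .build();
    }

    /** Sends the request and returns the raw response body, failing on non-200 statuses. */
    static String fetch(HttpClient client, HttpRequest request) throws Exception {
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IllegalStateException("Unexpected status " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Will fail against the placeholder host; substitute a real one to try it.
        System.out.println(fetch(client, provinceListRequest(HOST)));
    }
}
```

Compared with driving a browser through Selenium, a request like this skips page rendering and UI waits entirely, which is the likely source of the speedup the author reports.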
The author's crawler, reproduced below, defines static maps for provinces, cities, counties, and schools, then requests each level of the hierarchy in turn to build a list of full school names.
package practise;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.http.client.methods.HttpGet;

import net.sf.json.JSONObject;
import source.ApiLibrary;
import source.Concurrent;

public class Crawler extends ApiLibrary {

    public static String host = "";
    // Name-to-ID maps, one per level of the hierarchy
    public static Map<String, Integer> countrys = new HashMap<>(); // provinces
    public static Map<String, Integer> citys = new HashMap<>();    // cities
    public static Map<String, Integer> address = new HashMap<>();  // counties
    public static Map<String, Integer> school = new HashMap<>();   // schools
    public static List<String> total = new ArrayList<>();          // full school names

    public static void main(String[] args) {
        Crawler crawler = new Crawler();
        crawler.getCountry1(); // provinces
        Set<String> countryId = countrys.keySet();
        for (String name : countryId) {
            int id = countrys.get(name);
            crawler.getCountry2(id); // cities
            Set<String> cityId = citys.keySet();
            for (String city : cityId) {
                int cid = citys.get(city);
                crawler.getCountry3(cid); // counties
                Set<String> adresss = address.keySet();
                for (String adres : adresss) {
                    int aid = address.get(adres);
                    crawler.getCountry4(aid); // schools
                    Set<String> schol = school.keySet();
                    for (String sch : schol) {
                        // Join province, city, county, and school into one record
                        String line = name + PART + city + PART + adres + PART + sch;
                        total.add(line);
                    }
                }
            }
        }
        Concurrent.saveRequestTimes(total); // persist the collected records
        testOver();
    }

    /**
     * Query provinces
     */
    public void getCountry1() {
        String url = host + "/user/editinfo/getSchollCountryList";
        HttpGet httpGet = getHttpGet(url);
        JSONObject response = getHttpResponseEntityByJson(httpGet);
        // The "content" field holds a run of anchor fragments; split them apart
        String[] country = response.getString("content").split("</a>");
        for (int i = 0; i < country.length; i++) {
            String msg = country[i];
            int code = getCode(msg);
            String name = getName(msg);
            countrys.put(name, code);
        }
    }

    // ... (other methods getCountry2, getCountry3, getCountry4, getCode, getName) ...
}

The article includes a screenshot of the collected data (image omitted here) and notes that sensitive information has been redacted, leaving only the overall approach for readers.
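The listing elides `getCode` and `getName`, so how the ID and name are pulled out of each anchor fragment is not shown. The following is only a sketch of one way to do it with `java.util.regex`, which the author's class imports. The fragment shape `<a data-id="11">Beijing</a>`, the attribute name `data-id`, and the names `AnchorParsingSketch` and `parseLevel` are all assumptions made for illustration; the real markup may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AnchorParsingSketch {
    // Assumed fragment shape: <a ... data-id="11" ...>Beijing</a>.
    // The real markup is not shown in the article.
    private static final Pattern ANCHOR =
            Pattern.compile("<a[^>]*?data-id=\"(\\d+)\"[^>]*>([^<]+)</a>");

    /** Parses one level's "content" string into a fresh name-to-ID map. */
    static Map<String, Integer> parseLevel(String content) {
        Map<String, Integer> result = new LinkedHashMap<>();
        Matcher m = ANCHOR.matcher(content);
        while (m.find()) {
            // group(1) is the numeric ID, group(2) is the display name
            result.put(m.group(2).trim(), Integer.parseInt(m.group(1)));
        }
        return result;
    }

    public static void main(String[] args) {
        String content = "<a data-id=\"11\">Beijing</a><a data-id=\"12\">Tianjin</a>";
        System.out.println(parseLevel(content)); // prints {Beijing=11, Tianjin=12}
    }
}
```

Returning a new map from each call, instead of adding to a shared static map, keeps entries from one province or city from carrying over into the next. Whether the author's elided `getCountry2` to `getCountry4` methods clear their maps before refilling them cannot be determined from the excerpt.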
