Java HttpClient Based Crawler for Nationwide School Names

The author presents a Java HttpClient crawler that retrieves over 60,000 nationwide school names in about 16 minutes. It replaces slow Selenium UI scripts with direct API calls and stores the results in a database. The source code is shared for reference.

FunTester

While working with HttpClient, the author saw a chance to collect data such as the names of middle schools nationwide. An earlier Selenium-based UI script was slow and unstable. Switching to direct API calls improved performance dramatically: more than 60,000 records were fetched in roughly 16 minutes, database storage included.
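To make "direct API call" concrete, here is a minimal sketch of a single GET request. It is not the author's code: the article's listing goes through Apache HttpClient helpers inherited from ApiLibrary, whereas this sketch swaps in the JDK's built-in java.net.http.HttpClient (Java 11+) so that it needs no extra dependency. The HOST value and the class and method names are hypothetical placeholders, since the article leaves the real host blank.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

final class DirectApiCall {
    // Hypothetical placeholder: the article redacts the real host.
    static final String HOST = "https://example.invalid";

    /** Builds a GET request for the given path under HOST. */
    static HttpRequest buildGet(String path) {
        return HttpRequest.newBuilder(URI.create(HOST + path))
                .timeout(Duration.ofSeconds(10))
                .header("Accept", "application/json")
                .GET()
                .build();
    }

    /** Sends the request and returns the raw body; any non-200 status is treated as an error. */
    static String fetch(HttpClient client, String path) throws IOException, InterruptedException {
        HttpResponse<String> response =
                client.send(buildGet(path), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected status " + response.statusCode() + " for " + path);
        }
        return response.body();
    }
}
```

A caller would create one client with HttpClient.newHttpClient() and reuse it for every request, for example DirectApiCall.fetch(client, "/user/editinfo/getSchollCountryList"). Skipping the browser entirely is what removes the rendering and wait time that made the Selenium script slow.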

The following Java code implements the crawler. It keeps static maps for provinces (countrys), cities (citys), counties (address), and schools (school), requests each level of the hierarchy in turn, and joins the names at every level into a list of full school names (total).

package practise;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.http.client.methods.HttpGet;
import net.sf.json.JSONObject;
import source.ApiLibrary;
import source.Concurrent;

public class Crawler extends ApiLibrary {
    public static String host = "";
    public static Map<String, Integer> countrys = new HashMap<>();
    public static Map<String, Integer> citys = new HashMap<>();
    public static Map<String, Integer> address = new HashMap<>();
    public static Map<String, Integer> school = new HashMap<>();
    public static List<String> total = new ArrayList<>();

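    public static void main(String[] args) {
        Crawler crawler = new Crawler();
        crawler.getCountry1(); // provinces
        Set<String> countryId = countrys.keySet();
        for (String name : countryId) {
            int id = countrys.get(name);
            // The maps are static and shared, so reset each one before refilling it;
            // otherwise entries from the previous parent would be walked again.
            // (Redundant if the elided getCountryN methods already clear their map.)
            citys.clear();
            crawler.getCountry2(id); // cities
            Set<String> cityId = citys.keySet();
            for (String city : cityId) {
                int cid = citys.get(city);
                address.clear();
                crawler.getCountry3(cid); // counties
                Set<String> adresss = address.keySet();
                for (String adres : adresss) {
                    int aid = address.get(adres);
                    school.clear();
                    crawler.getCountry4(aid); // schools
                    Set<String> schol = school.keySet();
                    for (String sch : schol) {
                        // Full name: province + city + county + school, joined by PART
                        String line = name + PART + city + PART + adres + PART + sch;
                        total.add(line);
                    }
                }
            }
        }
        Concurrent.saveRequestTimes(total); // persist the collected names
        testOver();
    }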

    /**
     * Query provinces
     */
    public void getCountry1() {
        String url = host + "/user/editinfo/getSchollCountryList";
        HttpGet httpGet = getHttpGet(url);
        JSONObject response = getHttpResponseEntityByJson(httpGet);
        String[] country = response.getString("content").split("</a>");
        for (int i = 0; i < country.length; i++) {
            String msg = country[i];
            int code = getCode(msg);
            String name = getName(msg);
            countrys.put(name, code);
        }
    }

    // ... (other methods getCountry2, getCountry3, getCountry4, getCode, getName) ...
}
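The listing omits getCode and getName, and the response format is not shown, so the following is only a hypothetical sketch of how fragments produced by split("</a>") could be parsed with the Matcher and Pattern classes the listing imports. It assumes each fragment looks like <a href="#" data-id="11">Beijing, with the numeric id in a data-id attribute. That shape, and the names AnchorParser and parse, are illustrative assumptions rather than the author's implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AnchorParser {
    // Assumed fragment shape: <a ... data-id="123">Name
    // (the closing </a> has already been consumed by split).
    private static final Pattern ANCHOR =
            Pattern.compile("<a[^>]*data-id=\"(\\d+)\"[^>]*>([^<]+)");

    /** Returns name -> id for every anchor found in the HTML snippet. */
    static Map<String, Integer> parse(String content) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String fragment : content.split("</a>")) {
            Matcher m = ANCHOR.matcher(fragment);
            if (m.find()) {
                result.put(m.group(2).trim(), Integer.parseInt(m.group(1)));
            }
            // Fragments that do not match (such as trailing whitespace) are skipped.
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "<a href=\"#\" data-id=\"11\">Beijing</a>"
                + "<a href=\"#\" data-id=\"12\">Tianjin</a>\n";
        System.out.println(parse(sample)); // prints {Beijing=11, Tianjin=12}
    }
}
```

Skipping non-matching fragments matters because splitting on "</a>" usually leaves a trailing piece after the last anchor, and calling Integer.parseInt on it would throw.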

The article includes a screenshot of the collected data (image omitted here) and notes that sensitive information has been redacted, leaving only the overall approach for readers.
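The overall approach, a four-level walk from province down to school, can also be sketched with each level's results scoped to its parent instead of held in shared static maps. This is an alternative structure, not the author's code. The fetch function stands in for the four getCountryN calls, and the "|" separator is a hypothetical value, because the PART constant used in the listing is defined elsewhere and not shown.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

final class HierarchyWalk {
    // Hypothetical separator; the listing's PART constant is not shown.
    static final String PART = "|";
    static final int SCHOOL_LEVEL = 4; // 1 = province, 2 = city, 3 = county, 4 = school

    /** fetch.apply(level, parentId) returns name -> id for the children of parentId. */
    static List<String> walk(BiFunction<Integer, Integer, Map<String, Integer>> fetch) {
        List<String> total = new ArrayList<>();
        walk(fetch, 1, 0, "", total);
        return total;
    }

    private static void walk(BiFunction<Integer, Integer, Map<String, Integer>> fetch,
                             int level, int parentId, String prefix, List<String> total) {
        // Each call gets a fresh map, so nothing leaks over from a previous parent.
        for (Map.Entry<String, Integer> e : fetch.apply(level, parentId).entrySet()) {
            String path = prefix.isEmpty() ? e.getKey() : prefix + PART + e.getKey();
            if (level == SCHOOL_LEVEL) {
                total.add(path);
            } else {
                walk(fetch, level + 1, e.getValue(), path, total);
            }
        }
    }

    public static void main(String[] args) {
        // Stub data source standing in for the real HTTP calls.
        BiFunction<Integer, Integer, Map<String, Integer>> stub = (level, parentId) -> {
            Map<String, Integer> m = new LinkedHashMap<>();
            if (level == 1) {
                m.put("P1", 1);
                m.put("P2", 2);
            } else {
                m.put("N" + parentId, parentId * 10);
            }
            return m;
        };
        walk(stub).forEach(System.out::println);
    }
}
```

With the stub above, the program prints P1|N1|N10|N100 and P2|N2|N20|N200, one full path per school.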
