Building a Multithreaded Java Web Scraper to Harvest 100k Records

After finding an unprotected API that placed no limit on resource access, the author wrote a rough Java program that uses a fixed-size thread pool and a CountDownLatch to fetch 100,000 items in parallel, with each of 10 threads retrieving 10,000 records via HTTP GET requests.

FunTester

Sep 17, 2019

Background

The author discovered that a certain website exposed an API without proper request validation, effectively removing the daily limit on resource retrieval. To exploit the large amount of data (estimated 100,000 records), a multithreaded crawler was implemented in Java.

Main Entry Point

The main method instantiates the LoginDz class and calls excuteTreads() to start the concurrent fetch. It then calls testOver(), which is not defined in this class and is presumably an inherited helper that signals completion.

Fetching a Single Page

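The getTi method takes a batch of IDs, sets the request URL (masked in this listing), and assembles the query parameters: the ID_List value generated by getTiId. It sends an HTTP GET request built by the helper getHttpGet and parses the JSON response with getHttpResponseEntityByJson. If the response is anything other than the empty success payload {"success_response":[]}, it logs the response. In every case it then passes the response to output and returns the JSON object.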
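The helpers getHttpGet and getHttpResponseEntityByJson come from the author's ApiLibrary and are not shown, so the following is only a rough sketch of the same two steps using the JDK's java.net.http API (Java 11+). The class name RequestSketch, the endpoint https://example.com/api/items, and the choice to send ID_List as a URL-encoded query parameter are all assumptions made for illustration; the article masks the real URL and does not say how the parameter is serialized.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the request-building and response-checking steps of getTi.
final class RequestSketch {
    // Placeholder endpoint: the real URL is masked in the article.
    static final String BASE_URL = "https://example.com/api/items";

    // Builds (but does not send) a GET request carrying the ID filter as ID_List.
    // URL-encoding the value is an assumption about what getHttpGet does.
    static HttpRequest buildGet(String idList) {
        String encoded = URLEncoder.encode(idList, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(BASE_URL + "?ID_List=" + encoded))
                .GET()
                .build();
    }

    // Mirrors the article's check: this exact body means "no records in this batch".
    static boolean isEmptyPayload(String body) {
        return "{\"success_response\":[]}".equals(body);
    }

    public static void main(String[] args) {
        HttpRequest request = buildGet("filter[where][origDocID][inq]=1&");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(isEmptyPayload("{\"success_response\":[]}"));
    }
}
```

Comparing against the literal empty payload is brittle if the server ever changes whitespace or key order; parsing the JSON and checking the array length would be more robust, at the cost of a JSON dependency.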

Generating Query Parameters

getTiId receives a variable-length list of integer IDs and, for each one, appends a fragment of the form filter[where][origDocID][inq]=ID& to a single string. The concatenated result is used as the ID_List parameter for the API call.
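To make the resulting string concrete, here is a minimal stand-alone sketch of the same concatenation. The class name IdFilter and the method name build are made up for this example; only the parameter prefix is taken from the article.

```java
// Hypothetical stand-alone version of the ID filter builder.
final class IdFilter {
    // Parameter prefix copied from the article's getTiId method.
    static final String PARAM = "filter[where][origDocID][inq]=";

    // Appends one "filter[where][origDocID][inq]=<id>&" fragment per ID,
    // keeping the trailing ampersand that the original format produces.
    static String build(int... ids) {
        StringBuilder result = new StringBuilder();
        for (int id : ids) {
            result.append(PARAM).append(id).append('&');
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // prints filter[where][origDocID][inq]=101&filter[where][origDocID][inq]=102&
        System.out.println(build(101, 102));
    }
}
```

Repeating the same key once per ID is how this API appears to express an "in" query; a batch of 100 IDs therefore yields 100 fragments in one request.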

Multithreaded Execution

The excuteTreads method creates a fixed thread pool with 10 threads and a CountDownLatch initialized to the same count. It records the start time, submits 10 instances of the inner More runnable to the executor, and waits for the latch to reach zero before shutting down the pool and printing the elapsed time.
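The pool-plus-latch pattern can be tried in isolation. The sketch below replaces the real fetch with a counter increment; the names LatchDemo and runAll are invented for the example, and it is a minimal illustration rather than the author's code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo of a fixed thread pool coordinated by a CountDownLatch.
final class LatchDemo {
    // Runs `threads` placeholder tasks and blocks until all have counted down.
    static int runAll(int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch latch = new CountDownLatch(threads);
        AtomicInteger completed = new AtomicInteger();
        try {
            for (int i = 0; i < threads; i++) {
                pool.execute(() -> {
                    try {
                        completed.incrementAndGet(); // stand-in for the real fetch work
                    } finally {
                        latch.countDown(); // release the latch even if the task fails
                    }
                });
            }
            latch.await(); // wait for every worker
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
        } finally {
            pool.shutdown(); // runs whether or not the wait was interrupted
        }
        return completed.get();
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        int done = runAll(10);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(done + " tasks finished in " + elapsedMs + " ms");
    }
}
```

Because the latch count equals the number of submitted tasks, await() returns only after every task has reached its finally block. Calling shutdown() followed by awaitTermination() would give the same guarantee without a latch.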

Worker Runnable

The inner class More implements Runnable. Each instance receives its thread index (num) and computes a starting offset (num * 10000). Inside run(), it walks its range of 10,000 IDs in steps of 100, builds an array of 100 IDs per step, and calls getTi for each batch. When its segment is done, it counts down the latch in a finally block, so the latch is released even if a request throws.
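The range arithmetic is easy to check without any network access. The following sketch computes the batches a given worker would request; BatchPlanner and batchesFor are hypothetical names, while the constants 10,000 and 100 come from the article.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that reproduces the per-thread ID partitioning.
final class BatchPlanner {
    static final int IDS_PER_THREAD = 10_000; // segment size per worker, from the article
    static final int BATCH_SIZE = 100;        // IDs per request, from the article

    // Returns the ID batches for the worker with the given index.
    static List<int[]> batchesFor(int threadIndex) {
        int bound = threadIndex * IDS_PER_THREAD;
        List<int[]> batches = new ArrayList<>();
        for (int i = bound; i < bound + IDS_PER_THREAD; i += BATCH_SIZE) {
            int[] ids = new int[BATCH_SIZE];
            for (int k = 0; k < BATCH_SIZE; k++) {
                ids[k] = i + k; // k is the position in the batch, i the batch's first ID
            }
            batches.add(ids);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<int[]> batches = batchesFor(3);
        int[] first = batches.get(0);
        int[] last = batches.get(batches.size() - 1);
        // prints 100 batches, IDs 30000 to 39999
        System.out.println(batches.size() + " batches, IDs " + first[0] + " to " + last[BATCH_SIZE - 1]);
    }
}
```

With 10 workers indexed 0 to 9, the segments cover IDs 0 to 99,999 with no gaps or overlap, at 100 requests per worker.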

package practise;

import java.util.Date;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.http.client.methods.HttpGet;
import net.sf.json.JSONObject;
import source.ApiLibrary;

public class LoginDz extends ApiLibrary {
    public static void main(String[] args) {
        LoginDz loginDz = new LoginDz();
        loginDz.excuteTreads();
        testOver();
    }

    /**
     * Fetches one batch of records by ID and returns the parsed JSON response.
     */
    public JSONObject getTi(int[] code, String name) {
        String url = "***********";
        JSONObject args = new JSONObject();
        args.put("ID_List", getTiId(code));
        HttpGet httpGet = getHttpGet(url, args);
        JSONObject response = getHttpResponseEntityByJson(httpGet);
        String text = response.toString();
        // Log only responses that contain data, skipping the empty success payload.
        if (!text.equals("{\"success_response\":[]}")) {
            logLog("name", text);
        }
        output(response);
        return response;
    }

    public String getTiId(int... id) {
        // StringBuilder suffices because the buffer never leaves this method.
        StringBuilder result = new StringBuilder();
        for (int value : id) {
            result.append("filter[where][origDocID][inq]=").append(value).append("&");
        }
        return result.toString();
    }

    /**
     * Execute multithreaded tasks
     */
    public void excuteTreads() {
        int threads = 10;
        ExecutorService executorService = Executors.newFixedThreadPool(threads);
        CountDownLatch countDownLatch = new CountDownLatch(threads);
        Date start = new Date();
        for (int i = 0; i < threads; i++) {
            executorService.execute(new More(countDownLatch, i));
        }
        try {
            // Block until every worker has counted down.
            countDownLatch.await();
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can see the interruption.
            Thread.currentThread().interrupt();
            e.printStackTrace();
        } finally {
            // Shut the pool down even if the wait was interrupted.
            executorService.shutdown();
        }
        Date end = new Date();
        outputTimeDiffer(start, end);
    }

    /**
     * Worker thread
     */
    class More implements Runnable {
        public CountDownLatch countDownLatch;
        public int num;

        public More(CountDownLatch countDownLatch, int num) {
            this.countDownLatch = countDownLatch;
            this.num = num;
        }

        @Override
        public void run() {
            int bound = num * 10000;
            try {
                for (int i = bound; i < bound + 10000; i += 100) {
                    int[] ids = new int[100];
                    for (int k = 0; k < 100; k++) {
                        // k is the position within the batch; i is the batch's first ID.
                        ids[k] = i + k;
                    }
                    // Send one request per completed batch of 100 IDs.
                    getTi(ids, bound + "");
                }
            } finally {
                countDownLatch.countDown();
            }
        }
    }
}

The script is intentionally simple and was never meant to retrieve the entire data set; it serves as a reference for building a multithreaded HTTP GET crawler in Java.
