Processing 10GB Age Data on a 4GB Memory Machine Using Java: Single‑Threaded and Multi‑Threaded Solutions
This article demonstrates how to generate, read, and analyze a 10 GB file of age statistics on a machine with 4 GB of RAM and two CPU cores using Java, comparing a single‑threaded counting method with a producer‑consumer multi‑threaded approach that sharply raises CPU utilization and shortens processing time.
Scenario description: A 10 GB file contains integers representing ages between 18 and 70 for a large user base. The task is to find the age that appears most frequently on a computer with only 4 GB of memory and a dual‑core CPU.
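Why is 4 GB enough for a 10 GB input? The idea the whole article builds on is that the answer, a frequency count per age, stays tiny no matter how large the file grows: only 53 distinct ages exist, so streaming the data and bumping a fixed-size counter keeps memory use constant. A minimal sketch of that core idea (the class and method names here are illustrative, not from the article):

```java
import java.util.concurrent.ThreadLocalRandom;

// Core idea: ages span only 18..70, so the frequency table is a tiny
// fixed-size array regardless of input size. Stream the records and
// bump counts[age - 18]; memory use stays constant.
public class AgeHistogram {
    static final int MIN_AGE = 18, MAX_AGE = 70;
    private final long[] counts = new long[MAX_AGE - MIN_AGE + 1];

    public void add(int age) {
        counts[age - MIN_AGE]++;
    }

    public int mostFrequentAge() {
        int best = MIN_AGE;
        for (int age = MIN_AGE; age <= MAX_AGE; age++) {
            if (counts[age - MIN_AGE] > counts[best - MIN_AGE]) best = age;
        }
        return best;
    }

    public static void main(String[] args) {
        AgeHistogram h = new AgeHistogram();
        // Simulate a stream: 30 appears far more often than anything else
        for (int i = 0; i < 1000; i++) h.add(ThreadLocalRandom.current().nextInt(MIN_AGE, MAX_AGE + 1));
        for (int i = 0; i < 2000; i++) h.add(30);
        System.out.println("Most frequent age: " + h.mostFrequentAge());
    }
}
```

The article's code uses a Map keyed by the age string instead of an array; either way the working set for the counts is negligible, so the only real memory concern is how the file is read.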
Data generation: The following Java program writes roughly 3.6 billion random age values (Integer.MAX_VALUE * 1.7 iterations) to D:\User.dat in append mode. Each record is the age as text plus a comma, about 3 bytes, with a line break after every one million records (roughly 3 MB per line), yielding a few thousand lines and approximately 10 GB in total.
package bigdata;

import java.io.*;
import java.util.Random;

public class GenerateData {
    private static Random random = new Random();

    public static int generateRandomData(int start, int end) {
        return random.nextInt(end - start + 1) + start;
    }

    public void generateData() throws IOException {
        File file = new File("D:\\User.dat");
        if (!file.exists()) file.createNewFile();
        int start = 18, end = 70;
        long startTime = System.currentTimeMillis();
        // Append mode plus a buffer keeps the write fast
        BufferedWriter bos = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file, true)));
        // Integer.MAX_VALUE * 1.7 is roughly 3.65 billion records
        for (long i = 1; i < (long) (Integer.MAX_VALUE * 1.7); i++) {
            String data = generateRandomData(start, end) + ",";
            bos.write(data);
            // Break the line after every one million records
            if (i % 1000000 == 0) bos.write("\n");
        }
        System.out.println("Write completed! Time: " + (System.currentTimeMillis() - startTime) / 1000 + " s");
        bos.close();
    }

    public static void main(String[] args) {
        GenerateData gd = new GenerateData();
        try { gd.generateData(); } catch (IOException e) { e.printStackTrace(); }
    }
}

Reading the data: Using BufferedReader.readLine() to stream the file line by line avoids loading the whole file into memory. A test shows that reading the entire 10 GB takes about 20 seconds, roughly 1 second per 100 million records.
private static void readData() throws IOException {
    BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(FILE_NAME), "utf-8"));
    String line;
    long start = System.currentTimeMillis();
    int count = 1;
    while ((line = br.readLine()) != null) {
        // Each line holds one million records, so 100 lines = 100 million records
        if (count % 100 == 0) {
            System.out.println("Read 100 lines, elapsed: " + (System.currentTimeMillis() - start) / 1000 + " s");
            System.gc();
        }
        count++;
    }
    br.close();
}

Single‑threaded processing: Each line is split by commas and the tallies are stored in a Map<String, AtomicInteger> (countMap). After scanning all lines, the map is traversed to locate the age with the highest count. This approach finishes in about 3 minutes and consumes 2–2.5 GB of RAM, but CPU utilization stays low (20–25%).
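The article shows splitLine and findMostAge below but never declares countMap, FILE_NAME, or the driver loop that connects them. A hypothetical end-to-end condensation, self-contained so the pieces can be seen wired together (the ConcurrentHashMap choice and the driver are assumptions, not the article's code):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical plumbing the article leaves implicit: the shared countMap,
// the file path constant, and a main loop that streams each line through
// splitLine() and then reports the most frequent age.
public class SingleThreadDriver {
    static final String FILE_NAME = "D:\\User.dat";
    static final Map<String, AtomicInteger> countMap = new ConcurrentHashMap<>();

    // Same shape as the article's splitLine: split on commas, bump counters
    static void splitLine(String lineData) {
        for (String str : lineData.split(",")) {
            if (str.isEmpty()) continue;
            countMap.computeIfAbsent(str, s -> new AtomicInteger(0)).getAndIncrement();
        }
    }

    // Same shape as the article's findMostAge, returning the winner
    static String findMostAge() {
        String targetKey = null;
        int targetValue = 0;
        for (Map.Entry<String, AtomicInteger> e : countMap.entrySet()) {
            if (e.getValue().get() > targetValue) {
                targetValue = e.getValue().get();
                targetKey = e.getKey();
            }
        }
        return targetKey;
    }

    public static void main(String[] args) throws IOException {
        try (BufferedReader br = new BufferedReader(new FileReader(FILE_NAME))) {
            String line;
            while ((line = br.readLine()) != null) splitLine(line);
        }
        System.out.println("Most frequent age: " + findMostAge());
    }
}
```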
public static void splitLine(String lineData) {
    String[] arr = lineData.split(",");
    for (String str : arr) {
        if (StringUtils.isEmpty(str)) continue;
        countMap.computeIfAbsent(str, s -> new AtomicInteger(0)).getAndIncrement();
    }
}

private static void findMostAge() {
    int targetValue = 0;
    String targetKey = null;
    for (Map.Entry<String, AtomicInteger> entry : countMap.entrySet()) {
        int value = entry.getValue().get();
        if (value > targetValue) {
            targetValue = value;
            targetKey = entry.getKey();
        }
    }
    System.out.println("Most frequent age: " + targetKey + " count: " + targetValue);
}

Multi‑threaded approach (producer‑consumer): To increase CPU usage, the author introduces a pool of LinkedBlockingQueue<String> objects (one per consumer thread). The producer reads lines and distributes them round‑robin across the queues. Each consumer thread takes strings from its dedicated queue, splits them, and updates the shared countMap. This design raises CPU utilization to over 90% and cuts total processing time from 180 seconds to 103 seconds (about 1.75× as fast) while producing the same final result.
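The consumer threads in the code that follows call a countNum helper that the article never shows. A minimal sketch of what it presumably does, assuming each queued string is one comma-separated line and the tallies go into the same shared countMap (the class name here is illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Assumed shape of the unshown countNum helper: split a queued line on
// commas and increment the shared per-age counters.
public class CountNumSketch {
    static final Map<String, AtomicInteger> countMap = new ConcurrentHashMap<>();

    // ConcurrentHashMap + AtomicInteger make concurrent increments from
    // several consumer threads safe without extra locking.
    public static void countNum(String str) {
        for (String age : str.split(",")) {
            if (age.isEmpty()) continue;
            countMap.computeIfAbsent(age, k -> new AtomicInteger(0)).getAndIncrement();
        }
    }
}
```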
private static List<LinkedBlockingQueue<String>> blockQueueLists = new LinkedList<>();
static {
    // One bounded queue per consumer thread
    for (int i = 0; i < threadNums; i++) {
        blockQueueLists.add(new LinkedBlockingQueue<>(256));
    }
}

private static AtomicLong count = new AtomicLong(0);

static class SplitData {
    public static void splitLine(String lineData) {
        // A string returned by readLine() never contains '\n', so this split
        // yields the whole line; the net effect is round-robin distribution
        // of complete lines across the queues.
        String[] arr = lineData.split("\n");
        for (String str : arr) {
            if (StringUtils.isEmpty(str)) continue;
            long index = count.get() % threadNums;
            try {
                blockQueueLists.get((int) index).put(str);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            count.getAndIncrement();
        }
    }
}

private static void startConsumer() throws FileNotFoundException, UnsupportedEncodingException {
    for (int i = 0; i < threadNums; i++) {
        final int index = i;
        new Thread(() -> {
            while (consumerRunning) {
                try {
                    // Block until the producer hands this queue a line
                    String str = blockQueueLists.get(index).take();
                    countNum(str);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
        }).start();
    }
}

Issues and solutions: During execution, GC activity can cause pauses and sudden memory spikes. The author suggests pausing the main thread after each processed batch and invoking System.gc() manually, and recommends using a thread pool instead of hand-created threads in production.
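The thread-pool recommendation can be sketched as follows: a fixed pool runs one consumer task per queue, and a poison-pill string ends each task cleanly instead of the consumerRunning flag (which can leave a thread parked in take() forever). All names below are illustrative, not the article's:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the production-style variant: ExecutorService instead of raw
// threads, and a POISON sentinel per queue for orderly shutdown.
public class PooledConsumers {
    static final String POISON = "__EOF__";
    static final int threadNums = 2;
    static final List<LinkedBlockingQueue<String>> queues = new ArrayList<>();
    static final ConcurrentHashMap<String, AtomicInteger> countMap = new ConcurrentHashMap<>();

    public static void main(String[] args) throws Exception {
        for (int i = 0; i < threadNums; i++) queues.add(new LinkedBlockingQueue<>(256));
        ExecutorService pool = Executors.newFixedThreadPool(threadNums);
        for (int i = 0; i < threadNums; i++) {
            final LinkedBlockingQueue<String> q = queues.get(i);
            pool.submit(() -> {
                try {
                    String s;
                    // Loop until this queue's poison pill arrives
                    while (!(s = q.take()).equals(POISON)) {
                        for (String age : s.split(",")) {
                            if (age.isEmpty()) continue;
                            countMap.computeIfAbsent(age, k -> new AtomicInteger(0)).getAndIncrement();
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        // Producer: round-robin a few sample lines, then poison each queue
        String[] lines = {"30,30,21", "30,45", "21,21,21"};
        for (int i = 0; i < lines.length; i++) queues.get(i % threadNums).put(lines[i]);
        for (LinkedBlockingQueue<String> q : queues) q.put(POISON);
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("count of 21 = " + countMap.get("21").get());
    }
}
```

The poison pill also removes the need for the manual System.gc() workaround in the shutdown path: once every queue has drained past its sentinel, awaitTermination returns and the counts are final.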