Big Data 21 min read

Processing 10 GB Age Data on a 4 GB PC: Java Multithreaded Solution

This article demonstrates how to generate, read, and analyze a 10 GB file containing age statistics on a machine with only 4 GB RAM, comparing single‑threaded and multithreaded Java implementations, measuring performance, memory usage, and addressing GC and concurrency challenges.

Java High-Performance Architecture

May 26, 2022

Processing 10 GB Age Data on a 4 GB PC: Java Multithreaded Solution

Scenario Description

The problem is to process a 10 GB file that stores integers representing ages between 18 and 70, with each number indicating the count of people of that age; the goal is to find the age that appears most frequently using a computer with 4 GB memory and a dual‑core CPU.

Simulated Data

In Java, an integer occupies 4 bytes, so 10 GB corresponds to roughly 3 billion records. The data is written to disk in append mode, with one line per one million records (about 4 MB per line), resulting in approximately 2 500 lines.

package bigdata;
import java.io.*;
import java.util.Random;
/**
 * @Desc: Generate 10 GB of random age data
 */
public class GenerateData {
    private static Random random = new Random();
    public static int generateRandomData(int start, int end) {
        return random.nextInt(end - start + 1) + start;
    }
    public void generateData() throws IOException {
        File file = new File("D:\User.dat");
        if (!file.exists()) file.createNewFile();
        int start = 18, end = 70;
        long startTime = System.currentTimeMillis();
        BufferedWriter bos = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file, true)));
        for (long i = 1; i < Integer.MAX_VALUE * 1.7; i++) {
            String data = generateRandomData(start, end) + ",";
            bos.write(data);
            if (i % 1_000_000 == 0) bos.write("
");
        }
        System.out.println("Write completed! Time: " + (System.currentTimeMillis() - startTime) / 1000 + " s");
        bos.close();
    }
    public static void main(String[] args) throws IOException {
        new GenerateData().generateData();
    }
}

Running the generator twice produces the required 10 GB file.

Scenario Analysis

Because the data size far exceeds available memory, the file must be processed line by line using BufferedReader.readLine() rather than loading it entirely into memory.

Read Data

A single‑thread method reads the file and prints progress every 100 lines:

private static void readData() throws IOException {
    BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(FILE_NAME), "utf-8"));
    String line;
    long start = System.currentTimeMillis();
    int count = 1;
    while ((line = br.readLine()) != null) {
        if (count % 100 == 0) {
            System.out.println("Read 100 lines, elapsed: " + (System.currentTimeMillis() - start) / 1000 + " s");
            System.gc();
        }
        count++;
    }
    br.close();
}

Reading the full 10 GB file takes about 20 seconds; each 100‑line batch (≈1 E) costs roughly 1 second.

Process Data

Approach 1: Single‑Threaded Processing

A map countMap stores age as key and occurrence count as value. Each line is split by commas, and the map is updated accordingly.

for (int i = start; i <= end; i++) {
    countMap.computeIfAbsent(i + "", k -> new AtomicInteger(0));
}

public static void splitLine(String lineData) {
    String[] arr = lineData.split(",");
    for (String str : arr) {
        if (StringUtils.isEmpty(str)) continue;
        countMap.computeIfAbsent(str, k -> new AtomicInteger(0)).getAndIncrement();
    }
}

After processing, the most frequent age is found by iterating over the map:

private static void findMostAge() {
    int targetValue = 0;
    String targetKey = null;
    for (Map.Entry<String, AtomicInteger> e : countMap.entrySet()) {
        int v = e.getValue().get();
        if (v > targetValue) {
            targetValue = v;
            targetKey = e.getKey();
        }
    }
    System.out.println("Most frequent age: " + targetKey + ", count: " + targetValue);
}

Approach 2: Divide‑and‑Conquer Multithreading

To improve CPU utilization, a producer‑consumer model with a list of LinkedBlockingQueue instances is used. The producer reads lines and distributes them across queues based on a round‑robin index; multiple consumer threads each take from a dedicated queue, split the strings in parallel, and update the shared countMap.

private static List<LinkedBlockingQueue<String>> blockQueueLists = new LinkedList<>();
private static AtomicLong count = new AtomicLong(0);
private static final int threadNums = 20;

static {
    for (int i = 0; i < threadNums; i++) {
        blockQueueLists.add(new LinkedBlockingQueue<>(256));
    }
}

public static void splitLine(String lineData) {
    String[] arr = lineData.split("
");
    for (String str : arr) {
        if (StringUtils.isEmpty(str)) continue;
        long index = count.get() % threadNums;
        blockQueueLists.get((int) index).put(str);
        count.getAndIncrement();
    }
}

private static void startConsumer() {
    for (int i = 0; i < threadNums; i++) {
        final int idx = i;
        new Thread(() -> {
            while (consumerRunning) {
                try {
                    String str = blockQueueLists.get(idx).take();
                    countNum(str);
                } catch (InterruptedException e) { e.printStackTrace(); }
            }
        }).start();
    }
}

private static void countNum(String str) {
    int[] arr = new int[]{0, str.length() / 3};
    for (int i = 0; i < 3; i++) {
        String part = splitStr(str, arr);
        new Thread(() -> {
            for (String s : part.split(",")) {
                countMap.computeIfAbsent(s, k -> new AtomicInteger(0)).getAndIncrement();
            }
        }).start();
    }
}

The splitStr method recursively adjusts split boundaries to ensure commas are not broken.

public static String splitStr(String line, int[] arr) {
    int startIdx = arr[0];
    int endIdx = arr[1];
    char start = line.charAt(startIdx);
    char end = line.charAt(endIdx);
    if ((startIdx == 0 || start == ',') && end == ',') {
        arr[0] = endIdx + 1;
        arr[1] = arr[0] + line.length() / 3;
        if (arr[1] >= line.length()) arr[1] = line.length() - 1;
        return line.substring(startIdx, endIdx);
    }
    if (startIdx != 0 && start != ',') startIdx--;
    if (end != ',') endIdx++;
    arr[0] = startIdx;
    arr[1] = endIdx;
    if (arr[1] >= line.length()) arr[1] = line.length() - 1;
    return splitStr(line, arr);
}

Test Results

Single‑thread processing took about 180 seconds, using 2 – 2.5 GB memory and low CPU (20‑25%). The multithreaded version reduced total time to roughly 103 seconds, achieving over 90 % CPU utilization and stable memory consumption around 11.7 GB.

Encountered Issues

During execution, occasional GC pauses caused the program to stall, indicating heap pressure. Inserting manual System.gc() calls after processing a certain number of lines mitigated the problem. In production, a thread pool should replace the manual thread creation.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Java Big Data multithreading File I/O

Written by

Java High-Performance Architecture

Sharing Java development articles and resources, including SSM architecture and the Spring ecosystem (Spring Boot, Spring Cloud, MyBatis, Dubbo, Docker), Zookeeper, Redis, architecture design, microservices, message queues, Git, etc.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.