Processing 10 GB Age Data on a 4 GB Memory Machine Using Java: Single‑Threaded and Multi‑Threaded Approaches
The article presents a complete Java solution for generating, reading, and analyzing a 10 GB file of age values (18‑70) on a 4 GB RAM, 2‑core PC, comparing single‑threaded counting with a producer‑consumer multithreaded design that dramatically improves CPU utilization and reduces total processing time.
This guide describes how to handle a 10 GB file that stores integer ages between 18 and 70, representing population statistics, on a computer with only 4 GB of memory and a dual‑core CPU.
Data generation: A Java program (GenerateData) creates the file by writing random comma-separated ages (a two-digit age plus a comma, about 3 bytes per record) in batches of one million records per line, producing roughly 3,600 lines of about 3 MB each to reach 10 GB.
package bigdata;
import java.io.*;
import java.util.Random;
public class GenerateData {
private static Random random = new Random();
public static int generateRandomData(int start, int end) {
return random.nextInt(end - start + 1) + start;
}
public void generateData() throws IOException {
File file = new File("D:\\User.dat");
if (!file.exists()) file.createNewFile();
BufferedWriter bos = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file, true)));
int start = 18, end = 70;
long startTime = System.currentTimeMillis();
for (long i = 1; i < Integer.MAX_VALUE * 1.7; i++) { // ~3.65 billion records ≈ 10 GB of text
String data = generateRandomData(start, end) + ",";
bos.write(data);
if (i % 1_000_000 == 0) bos.write("\n");
}
System.out.println("Write complete! Total time: " + (System.currentTimeMillis() - startTime) / 1000 + " s");
bos.close();
}
public static void main(String[] args) throws IOException {
new GenerateData().generateData();
}
}

Single-threaded reading: Using BufferedReader.readLine(), the file is read line by line; processing 100 lines takes about 1 second, and the whole read completes in roughly 20 seconds.
private static void readData() throws IOException {
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(FILE_NAME), "utf-8"));
String line;
long start = System.currentTimeMillis();
int count = 1;
while ((line = br.readLine()) != null) {
// Split and count later
if (count % 100 == 0) {
System.out.println("Read 100 lines, total elapsed: " + (System.currentTimeMillis() - start) / 1000 + " s");
System.gc(); // explicit GC to keep the heap small during the long run (see "Issues and fixes" below)
}
count++;
}
br.close();
}

Single-threaded processing: Each line is split by commas, and a concurrent map (countMap) records the frequency of each age. After the file is fully read, the map is scanned to find the age with the highest count.
public static void splitLine(String lineData) {
String[] arr = lineData.split(",");
for (String str : arr) {
if (StringUtils.isEmpty(str)) continue;
countMap.computeIfAbsent(str, s -> new AtomicInteger(0)).getAndIncrement();
}
}
private static void findMostAge() {
int targetValue = 0;
String targetKey = null;
for (Map.Entry<String, AtomicInteger> entry : countMap.entrySet()) {
int value = entry.getValue().get();
if (value > targetValue) {
targetValue = value;
targetKey = entry.getKey();
}
}
System.out.println("The most frequent age is: " + targetKey + ", count: " + targetValue);
}

Performance of the single-threaded version: Total execution time is about 3 minutes, memory consumption stabilises at 2-2.5 GB, and CPU usage stays low (20-25 %).
Multi-threaded approach: To increase CPU utilisation, a producer-consumer pattern is introduced. A list of LinkedBlockingQueue<String> (one per consumer thread) stores lines; the producer reads the file and distributes lines round-robin. Each consumer takes lines from its queue, splits them in parallel, and updates the shared countMap.
private static List<LinkedBlockingQueue<String>> blockQueueLists = new LinkedList<>();
private static final int THREAD_NUMS = 20;
private static AtomicLong count = new AtomicLong(0);
static {
for (int i = 0; i < THREAD_NUMS; i++) {
blockQueueLists.add(new LinkedBlockingQueue<>(256));
}
}
public static void splitLine(String lineData) {
String[] arr = lineData.split("\n"); // readLine() strips the newline, so each call enqueues one whole line
for (String str : arr) {
long index = count.getAndIncrement() % THREAD_NUMS;
try {
blockQueueLists.get((int) index).put(str);
} catch (InterruptedException e) { e.printStackTrace(); }
}
}
private static void startConsumer() {
for (int i = 0; i < THREAD_NUMS; i++) {
final int idx = i;
new Thread(() -> {
while (consumerRunning) {
try {
String str = blockQueueLists.get(idx).take();
countNum(str);
} catch (InterruptedException e) { e.printStackTrace(); }
}
}).start();
}
}
private static void countNum(String str) {
int[] arr = new int[]{0, str.length() / 3}; // {current offset, chunk length} consumed by splitStr
for (int i = 0; i < 3; i++) {
final String part = splitStr(str, arr); // splitStr (helper not shown here) returns the next third of the line
new Thread(() -> {
for (String s : part.split(",")) {
countMap.computeIfAbsent(s, k -> new AtomicInteger(0)).getAndIncrement();
}
}).start();
}
}

Results of the multithreaded version: CPU utilisation rises above 90 %, total processing time drops from ~180 s to ~103 s (a ~1.75× speed-up), and the most frequent age matches the single-threaded result.
Issues and fixes: During long runs the JVM can stall on garbage collection; inserting short sleeps and calling System.gc() explicitly after each processed batch mitigates the problem. The demo also creates threads manually; in production a thread pool should be used instead.
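As a sketch of that production advice, the same one-queue-per-worker, round-robin design can run on a fixed thread pool. The class name, poison-pill sentinel, and sample data below are illustrative, not from the article's code:

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class PooledCounter {
    private static final String POISON = "__EOF__"; // sentinel telling a worker to stop

    // Counts comma-separated ages using one bounded queue per pooled worker,
    // mirroring the article's round-robin producer-consumer layout.
    static Map<String, Integer> countAges(List<String> lines, int workers) throws InterruptedException {
        List<LinkedBlockingQueue<String>> queues = new ArrayList<>();
        for (int i = 0; i < workers; i++) queues.add(new LinkedBlockingQueue<>(256));
        ConcurrentHashMap<String, AtomicInteger> counts = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            final LinkedBlockingQueue<String> queue = queues.get(i);
            pool.submit(() -> {
                try {
                    // take() blocks until a line (or the poison pill) arrives
                    for (String line; !(line = queue.take()).equals(POISON); )
                        for (String age : line.split(","))
                            if (!age.isEmpty())
                                counts.computeIfAbsent(age, k -> new AtomicInteger()).incrementAndGet();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        for (int i = 0; i < lines.size(); i++)          // round-robin distribution, as in the article
            queues.get(i % workers).put(lines.get(i));
        for (LinkedBlockingQueue<String> q : queues) q.put(POISON); // stop each worker cleanly
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        Map<String, Integer> result = new HashMap<>();
        counts.forEach((k, v) -> result.put(k, v.get()));
        return result;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(countAges(Arrays.asList("18,19,18", "70,18"), 2).get("18")); // 3
    }
}
```

The poison pill replaces the demo's `consumerRunning` flag: a shared boolean can leave a consumer blocked forever inside `take()`, whereas a sentinel per queue guarantees every worker wakes up and exits.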
Conclusion : By combining line‑by‑line streaming, a concurrent counting map, and a producer‑consumer multithreaded pipeline, massive 10 GB datasets can be processed efficiently on modest hardware without exceeding memory limits.
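Condensed into a self-contained sketch (class name, temp-file handling, and sample data here are illustrative), the streaming-count idea from the conclusion fits in a few lines:

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class StreamingAgeCount {
    // Streams a comma-separated age file line by line (only one line in memory at a time)
    // and returns the most frequent age, as in the single-threaded pipeline above.
    static String mostFrequentAge(Path file) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        try (BufferedReader br = Files.newBufferedReader(file)) {
            for (String line; (line = br.readLine()) != null; )
                for (String age : line.split(","))
                    if (!age.isEmpty()) counts.merge(age, 1, Integer::sum);
        }
        return Collections.max(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("ages", ".dat");
        Files.write(tmp, Arrays.asList("18,25,25", "25,70"));
        System.out.println(mostFrequentAge(tmp)); // 25
        Files.delete(tmp);
    }
}
```

Because ages span only 18-70, the counting map holds at most 53 entries, so its memory footprint stays flat no matter how large the input file grows.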