How to Process 10 GB of Age Data on a 4 GB Machine Using Java
This article walks through generating a 10 GB file of age values, reading it line by line on a machine with 4 GB of RAM and two cores, and measuring single-threaded performance. It then redesigns the pipeline around a producer-consumer model with blocking queues and multithreaded string splitting, which raises CPU utilization and cuts processing time, and it looks at what that does to memory consumption.
Problem Statement
Given a 10 GB file that stores ages (integers 18‑70) in a comma‑separated format, find the age that occurs most frequently. The target machine has 4 GB RAM and a dual‑core CPU, so the file cannot be loaded entirely into memory.
Data Generation (Java)
    package bigdata;

    import java.io.*;
    import java.util.Random;

    public class GenerateData {
        private static final Random RANDOM = new Random();

        // Returns a uniformly distributed age in [start, end], inclusive.
        private static int randomAge(int start, int end) {
            return RANDOM.nextInt(end - start + 1) + start;
        }

        public void generate() throws IOException {
            File file = new File("D:/User.dat");
            if (!file.exists()) file.createNewFile();
            int start = 18, end = 70;
            long startTime = System.currentTimeMillis();
            // The stream is opened in append mode, so rerunning grows an existing file.
            try (BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file, true)))) {
                // About 3.65 billion records of 3 bytes each ("NN,").
                for (long i = 1; i < Integer.MAX_VALUE * 1.7; i++) {
                    bw.write(randomAge(start, end) + ",");
                    // Start a new line every million records.
                    if (i % 1_000_000 == 0) bw.write("\n");
                }
            }
            System.out.println("Write completed in " + (System.currentTimeMillis() - startTime) / 1000 + " s");
        }

        public static void main(String[] args) throws IOException { new GenerateData().generate(); }
    }

With these loop bounds the program writes roughly 3,650 lines of about 3 MB each (one million three-byte records per line), which comes to a little over 10 GB.
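The file size follows directly from the generator's loop bounds, so it can be checked without writing the file. The sketch below is a hypothetical helper (DataSizeEstimate and its method names are not from the article); it assumes every age is two ASCII digits followed by a comma and that the line terminator is a single byte.

```java
// Hypothetical helper (not from the article) that re-derives the file size
// from the generator's loop bounds.
final class DataSizeEstimate {
    static final int RECORDS_PER_LINE = 1_000_000;
    static final int BYTES_PER_RECORD = 3; // two digits plus a comma, e.g. "42,"

    /** Loop iterations: i starts at 1 and runs while i < Integer.MAX_VALUE * 1.7. */
    static long records() {
        return (long) Math.ceil(Integer.MAX_VALUE * 1.7) - 1;
    }

    /** Newline characters written: one per full million records. */
    static long newlines() {
        return records() / RECORDS_PER_LINE;
    }

    /** Total bytes, assuming ASCII output and a one-byte line terminator. */
    static long totalBytes() {
        return records() * BYTES_PER_RECORD + newlines();
    }

    public static void main(String[] args) {
        System.out.printf("records=%d, full lines=%d, bytes=%d (%.2f GiB)%n",
                records(), newlines(), totalBytes(), totalBytes() / (double) (1L << 30));
    }
}
```

Under these assumptions the generator emits about 3.65 billion records in 3,650 full lines of 3,000,000 characters each, plus one shorter final line.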
Single‑Threaded Reading
    // FILE_NAME is defined elsewhere in the class; the generator above writes D:/User.dat.
    private static void readData() throws IOException {
        long start = System.currentTimeMillis();
        int count = 1;
        // try-with-resources closes the reader even if readLine() throws.
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(FILE_NAME), "utf-8"))) {
            String line;
            while ((line = br.readLine()) != null) {
                if (count % 100 == 0) {
                    System.out.println("Read " + count + " lines, elapsed: "
                            + (System.currentTimeMillis() - start) / 1000 + " s");
                    System.gc(); // only a request; the JVM may ignore it
                }
                count++;
            }
        }
    }

Reading the whole file sequentially takes about 20 seconds; each 100-line batch (about 1 × 10⁸ records) costs roughly 1 second. Memory stays at 2 to 2.5 GB and CPU usage is low (20 to 25 %).
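To reproduce a read-only baseline on another machine, a minimal timing harness might look like the following. It is a sketch, not the article's code: ReadThroughput and scan are hypothetical names, the default path simply reuses the generator's output location, and the periodic System.gc() call is left out so that only I/O and line splitting are measured.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical harness: reads a file line by line and reports how much was read.
final class ReadThroughput {
    /** Returns {line count, total characters excluding line terminators}. */
    static long[] scan(Path file) throws IOException {
        long lines = 0;
        long chars = 0;
        try (BufferedReader br = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = br.readLine()) != null) {
                lines++;
                chars += line.length();
            }
        }
        return new long[] {lines, chars};
    }

    public static void main(String[] args) throws IOException {
        // Assumed default path: the location used by the generator.
        Path file = Paths.get(args.length > 0 ? args[0] : "D:/User.dat");
        long t0 = System.nanoTime();
        long[] result = scan(file);
        double seconds = (System.nanoTime() - t0) / 1e9;
        System.out.printf("%d lines, %d chars in %.1f s%n", result[0], result[1], seconds);
    }
}
```

Running it twice is worthwhile, since the second pass may be served partly from the operating system's file cache.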
Single‑Threaded Counting
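    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicInteger;
    // StringUtils.isEmpty comes from a utility library (presumably Apache Commons Lang).

    // Age (as text) -> number of occurrences.
    private static final Map<String, AtomicInteger> countMap = new ConcurrentHashMap<>();

    // Splits one line on commas and tallies every non-empty token.
    public static void splitLine(String lineData) {
        String[] arr = lineData.split(",");
        for (String s : arr) {
            if (StringUtils.isEmpty(s)) continue;
            countMap.computeIfAbsent(s, k -> new AtomicInteger(0)).getAndIncrement();
        }
    }

    // Scans the map for the entry with the highest count.
    private static void findMostAge() {
        int max = 0;
        String age = null;
        for (Map.Entry<String, AtomicInteger> e : countMap.entrySet()) {
            int v = e.getValue().get();
            if (v > max) { max = v; age = e.getKey(); }
        }
        System.out.println("Most frequent age: " + age + ", count: " + max);
    }

Counting is the bottleneck: the file reader produces lines faster than a single counting thread can split and tally them, so most of the work lands on one core and overall CPU utilization stays low.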
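Since the cost sits in splitting and tallying, one way to make each line cheaper is to avoid split(",") and the map lookup altogether. The sketch below swaps in a different technique from the article's: it scans the characters once and increments a long[] indexed by age. LineCounter and its methods are hypothetical names, and the 18 to 70 range is taken from the problem statement.

```java
// Hypothetical alternative to split(",") plus a map: one pass over the
// characters, counting into a fixed array of 53 slots (ages 18..70).
final class LineCounter {
    static final int MIN_AGE = 18;
    static final int MAX_AGE = 70;

    /** Adds the ages found in one comma-separated line to counts. */
    static void countLine(String line, long[] counts) {
        int value = 0;
        boolean inNumber = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c >= '0' && c <= '9') {
                value = value * 10 + (c - '0');
                inNumber = true;
            } else {
                // Any non-digit (comma, stray whitespace) ends the current number.
                if (inNumber) record(value, counts);
                value = 0;
                inNumber = false;
            }
        }
        if (inNumber) record(value, counts); // last number without a trailing comma
    }

    private static void record(int age, long[] counts) {
        // Values outside the expected range are silently ignored.
        if (age >= MIN_AGE && age <= MAX_AGE) counts[age - MIN_AGE]++;
    }

    /** Returns the age with the highest count (the lowest age wins ties). */
    static int mostFrequentAge(long[] counts) {
        int best = 0;
        for (int i = 1; i < counts.length; i++) {
            if (counts[i] > counts[best]) best = i;
        }
        return best + MIN_AGE;
    }
}
```

A caller would allocate new long[LineCounter.MAX_AGE - LineCounter.MIN_AGE + 1] once and pass it to countLine for every line read. The array is not thread-safe, so with several threads each one would need its own array, merged at the end.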
Multithreaded Producer‑Consumer Design
Use a set of LinkedBlockingQueue<String> instances, one per consumer thread, to balance the workload.
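Before the individual pieces, a compact end-to-end sketch of the pattern may help. It is an illustration under assumptions rather than the article's implementation: RoundRobinPipeline, run, and the POISON marker are hypothetical, counts go into an AtomicLongArray indexed by age instead of a map, and the input is assumed to contain only well-formed ages up to 70. A sentinel value per queue is one common way to let consumers exit cleanly once the producer is done.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical sketch: one bounded queue per consumer, round-robin dispatch,
// and a sentinel ("poison pill") to stop each consumer.
final class RoundRobinPipeline {
    private static final String POISON = new String("<EOF>"); // compared by identity

    static long[] run(List<String> lines, int consumers, int capacity) throws InterruptedException {
        AtomicLongArray counts = new AtomicLongArray(71); // index = age, assumes age <= 70
        List<BlockingQueue<String>> queues = new ArrayList<>();
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < consumers; i++) {
            BlockingQueue<String> q = new LinkedBlockingQueue<>(capacity);
            queues.add(q);
            Thread t = new Thread(() -> {
                try {
                    for (String line = q.take(); line != POISON; line = q.take()) {
                        for (String token : line.split(",")) {
                            String s = token.trim();
                            if (!s.isEmpty()) counts.incrementAndGet(Integer.parseInt(s));
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.start();
            workers.add(t);
        }
        int next = 0;
        for (String line : lines) {
            queues.get(next).put(line); // blocks while that queue is full
            next = (next + 1) % consumers;
        }
        for (BlockingQueue<String> q : queues) q.put(POISON); // one sentinel per consumer
        for (Thread t : workers) t.join();
        long[] result = new long[counts.length()];
        for (int i = 0; i < result.length; i++) result[i] = counts.get(i);
        return result;
    }
}
```

In a real run the producer would call put for each line returned by readLine() rather than iterating over a list, so that only the queued lines are held in memory at any time.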
Queue Initialization
    private static final int THREAD_NUMS = 20; // example
    private static final List<LinkedBlockingQueue<String>> queues = new ArrayList<>();

    static {
        for (int i = 0; i < THREAD_NUMS; i++) {
            queues.add(new LinkedBlockingQueue<>(256)); // capacity 256
        }
    }

Producer (reading thread)
    private static final AtomicLong lineCounter = new AtomicLong(0);

    static void produce(String line) {
        // If line comes from readLine(), the terminator is already gone and this yields one part.
        String[] parts = line.split("\n");
        for (String s : parts) {
            if (StringUtils.isEmpty(s)) continue;
            // Round-robin: consecutive lines go to consecutive queues.
            long idx = lineCounter.getAndIncrement() % THREAD_NUMS;
            try {
                queues.get((int) idx).put(s); // blocks when full
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return; // stop producing once interrupted
            }
        }
    }

Consumer Threads
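    private static volatile boolean running = true;

    private static void startConsumers() {
        System.out.println("Start consuming...");
        for (int i = 0; i < THREAD_NUMS; i++) {
            final int idx = i; // each consumer drains its own queue
            new Thread(() -> {
                while (running) {
                    try {
                        // take() blocks while the queue is empty, so a parked thread only
                        // sees running == false after the next element or an interrupt.
                        String data = queues.get(idx).take();
                        countNum(data);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return; // exit instead of spinning on the interrupted flag
                    }
                }
            }).start();
        }
    }

Parallel String Splitting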
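    private static void countNum(String str) {
        // range[0] = start index, range[1] = tentative end index of the next segment.
        int[] range = new int[2];
        range[1] = str.length() / 3; // initial split size
        for (int i = 0; i < 3; i++) {
            final String segment = SplitData.splitStr(str, range);
            // One short-lived thread per segment. Nothing in this snippet joins them,
            // so the caller must wait for them to finish before reading countMap.
            new Thread(() -> {
                for (String token : segment.split(",")) {
                    // Segments after the first start with ',', which yields an empty first token.
                    if (token.isEmpty()) continue;
                    countMap.computeIfAbsent(token, k -> new AtomicInteger(0)).getAndIncrement();
                }
            }).start();
        }
    }

    // Lives in the SplitData class. Returns the next segment of line, widening the
    // window until both ends sit on a comma, then advances arr to the following one.
    // Assumes a long line that ends with ','; other input can recurse without end
    // or index past the end of the string.
    public static String splitStr(String line, int[] arr) {
        int start = arr[0];
        int end = arr[1];
        char startChar = line.charAt(start);
        char endChar = line.charAt(end);
        if ((start == 0 || startChar == ',') && endChar == ',') {
            arr[0] = end + 1;
            arr[1] = Math.min(arr[0] + line.length() / 3, line.length() - 1);
            return line.substring(start, end);
        }
        if (start != 0 && startChar != ',') start--; // back up to the previous comma
        if (endChar != ',') end++;                   // move forward to the next comma
        arr[0] = start;
        arr[1] = Math.min(end, line.length() - 1);
        return splitStr(line, arr);
    }

Performance Results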
Peak memory usage rises to about 11.7 GB. That still fits the 12 GB test machine, but it is well above the 4 GB budget in the problem statement.
CPU utilization stays above 90 %.
Total processing time drops from about 180 seconds (single-threaded) to about 103 seconds, which is roughly 1.75× the throughput, or a 43 % reduction in elapsed time.
Result correctness matches the single‑thread baseline.
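The queue configuration by itself gives a rough upper bound on how much line data can be buffered, which may help put the peak memory figure in context. The sketch below is an estimate under assumptions the article does not state: lines of about 3,000,000 characters (one million three-character records), and either 1 or 2 bytes per character depending on how the JVM stores the strings. QueueMemoryBound is a hypothetical name.

```java
// Hypothetical estimate of the data that full queues can hold at once.
final class QueueMemoryBound {
    /** queues x slots per queue x characters per line x bytes per character. */
    static long maxBufferedBytes(int queues, int capacityPerQueue, long charsPerLine, int bytesPerChar) {
        return (long) queues * capacityPerQueue * charsPerLine * bytesPerChar;
    }

    public static void main(String[] args) {
        // 20 queues with capacity 256 are the values used in the article's code.
        long oneBytePerChar = maxBufferedBytes(20, 256, 3_000_000L, 1);
        long twoBytesPerChar = maxBufferedBytes(20, 256, 3_000_000L, 2);
        System.out.printf("1 byte/char:  %.2f GB%n", oneBytePerChar / 1e9);
        System.out.printf("2 bytes/char: %.2f GB%n", twoBytesPerChar / 1e9);
    }
}
```

Because the bound is larger than the whole file, the queues as configured do not cap memory in practice. Lowering the queue count or capacity is the obvious lever when the job has to fit a smaller heap.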
Practical Considerations
Long-running jobs may suffer GC pauses. A simple mitigation is to pause the main thread periodically and invoke System.gc(), although that call is only a request to the JVM. A better option is to replace manual thread creation with an ExecutorService, which controls task lifecycles and avoids the overhead of starting a new thread for every segment.
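As an illustration of the ExecutorService suggestion, the sketch below submits one task per line to a fixed thread pool and merges the per-task results at the end. It is a minimal example under assumptions, not the article's code: PooledCounter, countAll, and countLine are hypothetical names, the input is assumed to contain only well-formed ages up to 70, and the lines are passed in as a list for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: a fixed pool reuses its threads, and each task counts
// into its own array, so the workers share no mutable state.
final class PooledCounter {
    static long[] countAll(List<String> lines, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<long[]>> partials = new ArrayList<>();
            for (String line : lines) {
                Callable<long[]> task = () -> countLine(line);
                partials.add(pool.submit(task));
            }
            long[] total = new long[71]; // index = age, assumes age <= 70
            for (Future<long[]> f : partials) {
                long[] part = f.get(); // waits for that task to finish
                for (int i = 0; i < total.length; i++) total[i] += part[i];
            }
            return total;
        } finally {
            pool.shutdown(); // lets the worker threads exit
        }
    }

    private static long[] countLine(String line) {
        long[] counts = new long[71];
        for (String token : line.split(",")) {
            String s = token.trim();
            if (!s.isEmpty()) counts[Integer.parseInt(s)]++;
        }
        return counts;
    }
}
```

For the full 10 GB file the submissions would need to be throttled, for example by waiting on earlier futures before reading further lines, because the pool's default work queue is unbounded and would otherwise hold every pending line in memory.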
Java Architect Essentials