How to Find the Most Frequent Age in a 10 GB File Using Java Multithreading
This article explains how to generate a 10 GB file of age data, read it efficiently on a machine with limited memory, and use both single‑threaded and multithreaded Java techniques—including a producer‑consumer model and divide‑and‑conquer—to identify the age that appears most frequently, while analyzing performance, memory usage, and CPU utilization.
Scenario Description
We have a 10 GB file containing integers between 18 and 70, each representing the count of people of that age. The task is to find the number that appears most frequently using a machine with 4 GB RAM and a dual‑core CPU.
Data Generation
Java code generates roughly 3 billion integers (≈10 GB) by writing 1 million records per line, resulting in about 2500 lines.
package bigdata;
import java.io.*;
import java.util.Random;
/**
* @Desc:
* @Author: bingbing
* @Date: 2022/5/4 0004 19:05
*/
public class GenerateData {
private static Random random = new Random();
public static int generateRandomData(int start, int end) {
return random.nextInt(end - start + 1) + start;
}
/**
* 产生10G的 1-1000的数据在D盘
*/
public void generateData() throws IOException {
File file = new File("D:\\ User.dat");
if (!file.exists()) {
try { file.createNewFile(); } catch (IOException e) { e.printStackTrace(); }
}
int start = 18;
int end = 70;
long startTime = System.currentTimeMillis();
BufferedWriter bos = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file, true)));
for (long i = 1; i < Integer.MAX_VALUE * 1.7; i++) {
String data = generateRandomData(start, end) + ",";
bos.write(data);
// 每100万条记录成一行,100万条数据大概4M
if (i % 1000000 == 0) { bos.write("
"); }
}
System.out.println("写入完成! 共花费时间:" + (System.currentTimeMillis() - startTime) / 1000 + " s");
bos.close();
}
public static void main(String[] args) {
GenerateData generateData = new GenerateData();
try { generateData.generateData(); } catch (IOException e) { e.printStackTrace(); }
}
}Reading the Data
A single‑threaded method reads the file line by line with BufferedReader and prints progress every 100 lines. Reading the whole file takes about 20 seconds.
private static void readData() throws IOException {
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(FILE_NAME), "utf-8"));
String line;
long start = System.currentTimeMillis();
int count = 1;
while ((line = br.readLine()) != null) {
if (count % 100 == 0) {
System.out.println("读取100行,总耗时间: " + (System.currentTimeMillis() - start) / 1000 + " s");
System.gc();
}
count++;
}
running = false;
br.close();
}Single‑Threaded Processing (Idea 1)
Initialize a map countMap where the key is the age and the value is the occurrence count. Split each line by commas and update the map. After processing, iterate the map to locate the age with the highest count.
// Initialize countMap
for (int i = start; i <= end; i++) {
try {
File subFile = new File(dir + "\\" + i + ".dat");
if (!subFile.exists()) { subFile.createNewFile(); }
countMap.computeIfAbsent(i + "", k -> new AtomicInteger(0));
} catch (Exception e) { e.printStackTrace(); }
}
public static void splitLine(String lineData) {
String[] arr = lineData.split(",");
for (String str : arr) {
if (StringUtils.isEmpty(str)) { continue; }
countMap.computeIfAbsent(str, s -> new AtomicInteger(0)).getAndIncrement();
}
}
private static void findMostAge() {
int targetValue = 0;
String targetKey = null;
for (Map.Entry<String, AtomicInteger> entry : countMap.entrySet()) {
int value = entry.getValue().get();
String key = entry.getKey();
if (value > targetValue) { targetValue = value; targetKey = key; }
}
System.out.println("数量最多的年龄为:" + targetKey + "数量为:" + targetValue);
}Result: processing 10 GB takes about 3 minutes, memory usage 2–2.5 GB, CPU utilization only 20‑25 %.
Multithreaded Processing (Idea 2 – Divide and Conquer)
Use a producer‑consumer pattern with a list of LinkedBlockingQueue objects. The producer reads lines, distributes each value to a queue based on a round‑robin index, and the consumer threads take values from their dedicated queues and update the shared countMap.
// Queue list
private static List<LinkedBlockingQueue<String>> blockQueueLists = new LinkedList<>();
static {
for (int i = 0; i < threadNums; i++) {
blockQueueLists.add(new LinkedBlockingQueue<>(256));
}
}
private static AtomicLong count = new AtomicLong(0);
static class SplitData {
public static void splitLine(String lineData) {
String[] arr = lineData.split("
");
for (String str : arr) {
if (StringUtils.isEmpty(str)) { continue; }
long index = count.get() % threadNums;
try { blockQueueLists.get((int) index).put(str); } catch (InterruptedException e) { e.printStackTrace(); }
count.getAndIncrement();
}
}
}
private static void startConsumer() throws Exception {
System.out.println("开始消费...");
for (int i = 0; i < threadNums; i++) {
final int index = i;
new Thread(() -> {
while (consumerRunning) {
try {
String str = blockQueueLists.get(index).take();
countNum(str);
} catch (InterruptedException e) { e.printStackTrace(); }
}
}).start();
}
}
private static void countNum(String str) {
int[] arr = new int[2];
arr[1] = str.length() / 3;
for (int i = 0; i < 3; i++) {
final String innerStr = SplitData.splitStr(str, arr);
new Thread(() -> {
String[] strArray = innerStr.split(",");
for (String s : strArray) {
countMap.computeIfAbsent(s, k -> new AtomicInteger(0)).getAndIncrement();
}
}).start();
}
}
/**
* Split string by approximate coordinates, adjusting to commas.
*/
public static String splitStr(String line, int[] arr) {
int startIndex = arr[0];
int endIndex = arr[1];
char start = line.charAt(startIndex);
char end = line.charAt(endIndex);
if ((startIndex == 0 || start == ',') && end == ',') {
arr[0] = endIndex + 1;
arr[1] = arr[0] + line.length() / 3;
if (arr[1] >= line.length()) { arr[1] = line.length() - 1; }
return line.substring(startIndex, endIndex);
}
if (startIndex != 0 && start != ',') { startIndex--; }
if (end != ',') { endIndex++; }
arr[0] = startIndex;
arr[1] = endIndex;
if (arr[1] >= line.length()) { arr[1] = line.length() - 1; }
return splitStr(line, arr);
}After optimization, memory stabilizes around 11.7 GB, CPU utilization exceeds 90 %, and total execution time drops from 180 seconds to 103 seconds (≈75 % speed‑up) while producing the same result.
Problems Encountered
Frequent full‑GC pauses can cause memory spikes. A simple mitigation is to pause the main thread periodically and invoke System.gc(). In production, a thread pool should replace manual thread creation.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Java Backend Technology
Focus on Java-related technologies: SSM, Spring ecosystem, microservices, MySQL, MyCat, clustering, distributed systems, middleware, Linux, networking, multithreading. Occasionally cover DevOps tools like Jenkins, Nexus, Docker, and ELK. Also share technical insights from time to time, committed to Java full-stack development!
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
