Efficient Large File Reading in Java: Memory‑Friendly Approaches and Concurrency
This article explains how to read large files in Java without running out of memory. It compares loading the whole file at once against three line‑by‑line approaches — BufferedReader, Apache Commons IO, and Java 8 streams — and then shows how to boost throughput with batch processing on a thread pool and with multithreaded file splitting.
When a Java application needs to read data from a file and store it into a database, loading the entire file into memory works for small files but quickly leads to Out‑Of‑Memory (OOM) errors for large files.
Memory Reading
The initial implementation reads all lines into a List&lt;String&gt; using Apache Commons IO's FileUtils.readLines, then processes each line. Because every line becomes a separate String object plus list overhead, this can consume far more memory than the file itself occupies on disk, causing OOM for a 740 MB test file with 2 million lines.
Stopwatch stopwatch = Stopwatch.createStarted();
// read all lines into memory at once
List<String> lines = FileUtils.readLines(new File("temp/test.txt"), Charset.defaultCharset());
for (String line : lines) {
// pass
}
stopwatch.stop();
System.out.println("read all lines spend " + stopwatch.elapsed(TimeUnit.SECONDS) + " s");
logMemory();
The memory‑logging method uses MemoryMXBean to print heap usage:
MemoryMXBean memoryMXBean = ManagementFactory.getMemoryMXBean();
MemoryUsage memoryUsage = memoryMXBean.getHeapMemoryUsage();
long totalMemorySize = memoryUsage.getInit();
long usedMemorySize = memoryUsage.getUsed();
System.out.println("Initial Heap: " + totalMemorySize / (1024 * 1024) + " MB"); // getInit() is the initial heap size, not the ceiling
System.out.println("Used Heap: " + usedMemorySize / (1024 * 1024) + " MB");
Line‑by‑Line Reading
To avoid OOM, the article introduces three line‑by‑line techniques.
BufferedReader
try (BufferedReader fileBufferReader = new BufferedReader(new FileReader("temp/test.txt"))) { // FileReader uses the platform default charset; use Files.newBufferedReader for explicit UTF-8
String fileLineContent;
while ((fileLineContent = fileBufferReader.readLine()) != null) {
// process the line.
}
} catch (IOException e) { // FileNotFoundException is a subclass of IOException, so one catch suffices
e.printStackTrace();
}
Apache Commons IO
Stopwatch stopwatch = Stopwatch.createStarted();
LineIterator fileContents = FileUtils.lineIterator(new File("temp/test.txt"), StandardCharsets.UTF_8.name());
while (fileContents.hasNext()) {
fileContents.nextLine();
// pass
}
logMemory();
fileContents.close();
stopwatch.stop();
System.out.println("read all lines spend " + stopwatch.elapsed(TimeUnit.SECONDS) + " s");
Java 8 Stream
Stopwatch stopwatch = Stopwatch.createStarted();
try (Stream<String> inputStream = Files.lines(Paths.get("temp/test.txt"), StandardCharsets.UTF_8)) {
inputStream
.filter(str -> str.length() > 5) // filter data
.forEach(o -> {
// pass do sample logic
});
}
logMemory();
stopwatch.stop();
System.out.println("read all lines spend " + stopwatch.elapsed(TimeUnit.SECONDS) + " s");
Concurrent Reading
Processing lines sequentially can be slow for massive files, so two parallel strategies are presented.
Batch Packaging with ThreadPool
@SneakyThrows
public static void readInApacheIOWithThreadPool() {
ThreadPoolExecutor threadPoolExecutor = new ThreadPoolExecutor(10, 10, 60L, TimeUnit.SECONDS, new LinkedBlockingDeque<>(100));
LineIterator fileContents = FileUtils.lineIterator(new File("temp/test.txt"), StandardCharsets.UTF_8.name());
List<String> lines = Lists.newArrayList();
while (fileContents.hasNext()) {
String nextLine = fileContents.nextLine();
lines.add(nextLine);
if (lines.size() == 100000) {
List<List<String>> partition = Lists.partition(lines, 50000);
List<Future<?>> futureList = Lists.newArrayList();
for (List<String> strings : partition) {
Future<?> future = threadPoolExecutor.submit(() -> {
processTask(strings);
});
futureList.add(future);
}
for (Future<?> future : futureList) {
future.get();
}
lines.clear();
}
}
if (!lines.isEmpty()) {
processTask(lines);
}
fileContents.close(); // release the underlying file handle
threadPoolExecutor.shutdown();
}
private static void processTask(List<String> strings) {
for (String line : strings) {
try { TimeUnit.MILLISECONDS.sleep(10L); } catch (InterruptedException e) { e.printStackTrace(); } // simulate 10 ms of work per line
}
}
Splitting Large File into Small Files
public static void splitFileAndRead() throws Exception {
List<File> fileList = splitLargeFile("temp/test.txt");
ThreadPoolExecutor threadPoolExecutor = new ThreadPoolExecutor(10, 10, 60L, TimeUnit.SECONDS, new LinkedBlockingDeque<>(100));
List<Future<?>> futureList = Lists.newArrayList();
for (File file : fileList) {
Future<?> future = threadPoolExecutor.submit(() -> {
try (Stream<String> inputStream = Files.lines(file.toPath(), StandardCharsets.UTF_8)) {
inputStream.forEach(o -> {
try { TimeUnit.MILLISECONDS.sleep(10L); } catch (InterruptedException e) { e.printStackTrace(); } // simulate 10 ms of work per line
});
} catch (IOException e) { e.printStackTrace(); }
});
futureList.add(future);
}
for (Future<?> future : futureList) { future.get(); }
threadPoolExecutor.shutdown();
}
private static List<File> splitLargeFile(String largeFileName) throws IOException {
LineIterator fileContents = FileUtils.lineIterator(new File(largeFileName), StandardCharsets.UTF_8.name());
List<String> lines = Lists.newArrayList();
int num = 1;
List<File> files = Lists.newArrayList();
while (fileContents.hasNext()) {
String nextLine = fileContents.nextLine();
lines.add(nextLine);
if (lines.size() == 100000) {
createSmallFile(lines, num, files);
lines.clear(); // reset the buffer; without this every part file would keep growing
num++;
}
}
if (!lines.isEmpty()) { createSmallFile(lines, num, files); }
fileContents.close(); // release the underlying file handle
return files;
}
Alternatively, a simple shell command can split the file: split -l 100000 test.txt.
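The createSmallFile helper is referenced above but never shown. A minimal sketch might look like the following — the part‑file naming scheme under temp/ is an assumption, not something the article specifies:

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.List;

public class SmallFileWriter {

    // Writes the current buffer of lines to a numbered part file and records
    // it in the result list; the caller clears the buffer before refilling it.
    static void createSmallFile(List<String> lines, int num, List<File> files) throws IOException {
        File smallFile = new File("temp/test-part-" + num + ".txt"); // assumed naming scheme
        smallFile.getParentFile().mkdirs();                          // make sure temp/ exists
        Files.write(smallFile.toPath(), lines, StandardCharsets.UTF_8);
        files.add(smallFile);
    }
}
```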
Conclusion
For modest‑size files, loading the entire file into memory is acceptable and fast. For large files, line‑by‑line reading prevents OOM, and combining it with multithreading—either by batching lines or by splitting the file—significantly improves processing speed.
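As a closing illustration, the two ideas above can be combined in one compact sketch: stream the file line by line, hand fixed‑size batches to a thread pool, and cap the number of in‑flight batches so memory stays bounded. The class and constant names here are illustrative, not from the article:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

public class BatchedFileProcessor {

    static final int BATCH_SIZE = 100_000; // lines per batch; tune to heap size and per-line cost
    static final int POOL_SIZE = 10;

    static void process(Path file, Consumer<List<String>> task)
            throws IOException, InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(POOL_SIZE);
        List<Future<?>> inFlight = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            List<String> batch = new ArrayList<>(BATCH_SIZE);
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                if (batch.size() == BATCH_SIZE) {
                    List<String> handoff = batch;             // ownership moves to the task
                    inFlight.add(pool.submit(() -> task.accept(handoff)));
                    batch = new ArrayList<>(BATCH_SIZE);      // fresh buffer for the reader
                    if (inFlight.size() == POOL_SIZE) {       // cap in-flight batches
                        for (Future<?> f : inFlight) f.get(); // wait, so memory stays bounded
                        inFlight.clear();
                    }
                }
            }
            if (!batch.isEmpty()) task.accept(batch);         // process the remainder inline
            for (Future<?> f : inFlight) f.get();
        } finally {
            pool.shutdown();
        }
    }
}
```

Waiting on the in‑flight futures before reading further mirrors the article's per‑100k‑line synchronization point, which is what prevents batches from piling up in the executor's queue.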