Root Cause Analysis of a Backend Out‑Of‑Memory Incident and Proper Use of ExecutorCompletionService
The article analyzes a production outage caused by misuse of ExecutorCompletionService, explains why failing to call take() or poll() leads to memory leaks, contrasts correct and incorrect Java code, compares ExecutorService with ExecutorCompletionService, and offers practical guidelines for avoiding similar OOM problems in backend services.
The incident started at 06:32 when a small number of users experienced homepage access errors, escalating to a full outage by 07:20 and resolved at 07:36 after a code rollback.
Root cause analysis revealed that the code submitted tasks through ExecutorCompletionService but never called take() or poll(), so completed futures accumulated in the internal completion queue, gradually exhausting the heap and causing an Out‑Of‑Memory (OOM) error.
Faulty code example:
public static void test() throws InterruptedException, ExecutionException {
    Executor executor = Executors.newFixedThreadPool(3);
    CompletionService<String> service = new ExecutorCompletionService<>(executor);
    service.submit(new Callable<String>() {
        @Override
        public String call() throws Exception {
            return "HelloWorld--" + Thread.currentThread().getName();
        }
    });
    // missing service.take() or service.poll(): the completed future is never
    // removed from the internal completion queue
}

The correct usage must retrieve the completed future, e.g.:
public static void test() throws InterruptedException, ExecutionException {
    ExecutorService executor = Executors.newFixedThreadPool(3);
    CompletionService<String> service = new ExecutorCompletionService<>(executor);
    service.submit(new Callable<String>() {
        @Override
        public String call() throws Exception {
            return "HelloWorld--" + Thread.currentThread().getName();
        }
    });
    service.take().get(); // retrieve and remove the completed task from the queue
    executor.shutdown();  // release the pool's threads once the work is done
}

To illustrate the difference, the article provides two sets of examples. The first uses ExecutorService with Future.get(), which blocks on each task in submission order, so the longest-running task delays all the others:
public static void test1() throws Exception {
    ExecutorService executorService = Executors.newCachedThreadPool();
    List<Future<String>> futureList = new ArrayList<>();
    // three tasks with different sleep times (10s, 3s, 6s)
    Future<String> f1 = executorService.submit(() -> { TimeUnit.SECONDS.sleep(10); return "president"; });
    Future<String> f2 = executorService.submit(() -> { TimeUnit.SECONDS.sleep(3); return "dev"; });
    Future<String> f3 = executorService.submit(() -> { TimeUnit.SECONDS.sleep(6); return "manager"; });
    futureList.add(f1); futureList.add(f2); futureList.add(f3);
    System.out.println("All notified, waiting for results");
    for (Future<String> f : futureList) {
        System.out.println(f.get() + ", go pick them up"); // blocks on each future in submission order
    }
    executorService.shutdown(); // joining the current thread would block forever; shut the pool down instead
}

The second example replaces ExecutorService with ExecutorCompletionService, allowing results to be taken as soon as any task finishes and thus avoiding the long-task bottleneck:
public static void test2() throws Exception {
    ExecutorService executorService = Executors.newCachedThreadPool();
    ExecutorCompletionService<String> completionService = new ExecutorCompletionService<>(executorService);
    System.out.println("All notified, waiting for results");
    completionService.submit(() -> { TimeUnit.SECONDS.sleep(10); return "president"; });
    completionService.submit(() -> { TimeUnit.SECONDS.sleep(3); return "dev"; });
    completionService.submit(() -> { TimeUnit.SECONDS.sleep(6); return "manager"; });
    for (int i = 0; i < 3; i++) {
        String result = completionService.take().get(); // returns as soon as any task completes
        System.out.println(result + ", go pick them up");
    }
    executorService.shutdown(); // joining the current thread would block forever; shut the pool down instead
}

The article then explains the internal mechanism: ExecutorCompletionService wraps each task in a QueueingFuture (a subclass of FutureTask) whose done() method enqueues the completed future into a completionQueue. Calls to take() block until a completed future is available, guaranteeing that only finished tasks are retrieved.
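That mechanism can be sketched in a few lines. MiniCompletionService below is our own illustrative name, not the JDK class, but its structure mirrors the description: each submitted task is wrapped in a FutureTask whose done() hook pushes the finished future onto a blocking queue, and take() simply blocks on that queue.

```java
import java.util.concurrent.*;

// Simplified illustration of how ExecutorCompletionService works internally.
// Names here (MiniCompletionService) are ours, not the JDK's.
public class MiniCompletionService<V> {
    private final Executor executor;
    private final BlockingQueue<Future<V>> completionQueue = new LinkedBlockingQueue<>();

    public MiniCompletionService(Executor executor) {
        this.executor = executor;
    }

    public Future<V> submit(Callable<V> task) {
        FutureTask<V> f = new FutureTask<V>(task) {
            @Override
            protected void done() {
                // invoked when the task finishes: enqueue the completed future
                completionQueue.add(this);
            }
        };
        executor.execute(f);
        return f;
    }

    // blocks until some submitted task has completed
    public Future<V> take() throws InterruptedException {
        return completionQueue.take();
    }
}
```

If the caller never invokes take(), the queue keeps a strong reference to every completed future, which is exactly the leak the incident exposed.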
Key take‑aways for backend developers:
Always retrieve completed futures from ExecutorCompletionService (using take() or poll()) to prevent the internal queue from retaining references and causing OOM.
Prefer ExecutorCompletionService for scenarios where multiple downstream RPC calls have varying latencies, so the fastest responses can be processed without waiting for the slowest.
Maintain strict code‑review practices, record rollback versions, and monitor memory, CPU, GC, and latency metrics after deployment.
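As a hedged sketch of the first guideline, the helper below (drainAll is our own hypothetical name, not a JDK API) retrieves exactly as many results as tasks were submitted, using poll with a timeout so a hung task cannot block the caller indefinitely:

```java
import java.util.concurrent.*;

// Sketch of a defensive drain loop: consume one result per submitted task,
// bounded by a timeout. drainAll is an illustrative name, not a library API.
public class CompletionDrain {
    public static int drainAll(CompletionService<String> cs, int submitted,
                               long timeout, TimeUnit unit)
            throws InterruptedException, ExecutionException {
        int drained = 0;
        for (int i = 0; i < submitted; i++) {
            Future<String> f = cs.poll(timeout, unit); // null if nothing completed in time
            if (f == null) break;                      // timed out: log, cancel, or alert in real code
            f.get();                                   // consume the result (or rethrow its exception)
            drained++;
        }
        return drained;
    }
}
```

Pairing every submit with a matching poll/take in this way keeps the completion queue empty regardless of how callers use the service.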
By following these guidelines, teams can avoid similar memory‑leak incidents and improve the reliability of their backend services.