Root Cause Analysis of a Backend Out‑Of‑Memory Incident and Proper Use of ExecutorCompletionService
The article analyzes a production outage caused by misuse of ExecutorCompletionService, explains why failing to call take() or poll() leads to memory leaks, contrasts correct and incorrect Java code, compares ExecutorService with ExecutorCompletionService, and offers practical guidelines for avoiding similar OOM problems in backend services.
The incident started at 06:32 when a small number of users experienced homepage access errors, escalating to a full outage by 07:20 and resolved at 07:36 after a code rollback.
Root cause analysis revealed that the code submitted tasks through ExecutorCompletionService but never called take() or poll(), so completed futures accumulated in the internal completion queue, gradually exhausting the heap and causing an Out‑Of‑Memory (OOM) error.
Faulty code example:
public static void test() throws InterruptedException, ExecutionException {
    Executor executor = Executors.newFixedThreadPool(3);
    CompletionService<String> service = new ExecutorCompletionService<>(executor);
    service.submit(new Callable<String>() {
        @Override
        public String call() throws Exception {
            return "HelloWorld--" + Thread.currentThread().getName();
        }
    });
    // missing service.take() or service.poll(): the completed future is never
    // removed from the internal completion queue
}

The correct usage must retrieve the completed future, e.g.:
public static void test() throws InterruptedException, ExecutionException {
    ExecutorService executor = Executors.newFixedThreadPool(3);
    CompletionService<String> service = new ExecutorCompletionService<>(executor);
    service.submit(new Callable<String>() {
        @Override
        public String call() throws Exception {
            return "HelloWorld--" + Thread.currentThread().getName();
        }
    });
    service.take().get(); // retrieve and remove the completed task from the queue
    executor.shutdown();  // release the pool's threads once the work is done
}

To illustrate the difference, the article provides two sets of examples. The first uses ExecutorService with Future.get(), which blocks on each task in submission order, so the longest-running task delays all the others:
public static void test1() throws Exception {
    ExecutorService executorService = Executors.newCachedThreadPool();
    List<Future<String>> futureList = new ArrayList<>();
    // three tasks with different sleep times (10s, 3s, 6s)
    Future<String> f1 = executorService.submit(() -> { TimeUnit.SECONDS.sleep(10); return "president"; });
    Future<String> f2 = executorService.submit(() -> { TimeUnit.SECONDS.sleep(3); return "dev"; });
    Future<String> f3 = executorService.submit(() -> { TimeUnit.SECONDS.sleep(6); return "manager"; });
    futureList.add(f1); futureList.add(f2); futureList.add(f3);
    System.out.println("All notified, waiting for results");
    for (Future<String> f : futureList) {
        System.out.println(f.get() + ", go pick them up"); // blocks on each future in submission order
    }
    executorService.shutdown(); // joining the current thread would block forever; shut the pool down instead
}

The second example replaces ExecutorService with ExecutorCompletionService, allowing results to be taken as soon as any task finishes and thus avoiding the long-task bottleneck:
public static void test2() throws Exception {
    ExecutorService executorService = Executors.newCachedThreadPool();
    ExecutorCompletionService<String> completionService = new ExecutorCompletionService<>(executorService);
    System.out.println("All notified, waiting for results");
    completionService.submit(() -> { TimeUnit.SECONDS.sleep(10); return "president"; });
    completionService.submit(() -> { TimeUnit.SECONDS.sleep(3); return "dev"; });
    completionService.submit(() -> { TimeUnit.SECONDS.sleep(6); return "manager"; });
    for (int i = 0; i < 3; i++) {
        String result = completionService.take().get(); // returns as soon as any task completes
        System.out.println(result + ", go pick them up");
    }
    executorService.shutdown(); // joining the current thread would block forever; shut the pool down instead
}

The article then explains the internal mechanism: ExecutorCompletionService wraps each task in a QueueingFuture (a subclass of FutureTask) whose done() method enqueues the completed future into a completionQueue. Calls to take() block until a completed future is available, guaranteeing that only finished tasks are retrieved.
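That mechanism can be sketched in a few lines. MiniCompletionService below is our own illustrative name, not the JDK class, but its structure mirrors the description: each submitted task is wrapped in a FutureTask whose done() hook pushes the finished future onto a blocking queue, and take() simply blocks on that queue.

```java
import java.util.concurrent.*;

// Simplified illustration of how ExecutorCompletionService works internally.
// Names here (MiniCompletionService) are ours, not the JDK's.
public class MiniCompletionService<V> {
    private final Executor executor;
    private final BlockingQueue<Future<V>> completionQueue = new LinkedBlockingQueue<>();

    public MiniCompletionService(Executor executor) {
        this.executor = executor;
    }

    public Future<V> submit(Callable<V> task) {
        FutureTask<V> f = new FutureTask<V>(task) {
            @Override
            protected void done() {
                // invoked when the task finishes: enqueue the completed future
                completionQueue.add(this);
            }
        };
        executor.execute(f);
        return f;
    }

    // blocks until some submitted task has completed
    public Future<V> take() throws InterruptedException {
        return completionQueue.take();
    }
}
```

If the caller never invokes take(), the queue keeps a strong reference to every completed future, which is exactly the leak the incident exposed.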
Key take‑aways for backend developers:
Always retrieve completed futures from ExecutorCompletionService (using take() or poll()) to prevent the internal queue from retaining references and causing OOM.
Prefer ExecutorCompletionService for scenarios where multiple downstream RPC calls have varying latencies, so the fastest responses can be processed without waiting for the slowest.
Maintain strict code‑review practices, record rollback versions, and monitor memory, CPU, GC, and latency metrics after deployment.
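As a hedged sketch of the first guideline, the helper below (drainAll is our own hypothetical name, not a JDK API) retrieves exactly as many results as tasks were submitted, using poll with a timeout so a hung task cannot block the caller indefinitely:

```java
import java.util.concurrent.*;

// Sketch of a defensive drain loop: consume one result per submitted task,
// bounded by a timeout. drainAll is an illustrative name, not a library API.
public class CompletionDrain {
    public static int drainAll(CompletionService<String> cs, int submitted,
                               long timeout, TimeUnit unit)
            throws InterruptedException, ExecutionException {
        int drained = 0;
        for (int i = 0; i < submitted; i++) {
            Future<String> f = cs.poll(timeout, unit); // null if nothing completed in time
            if (f == null) break;                      // timed out: log, cancel, or alert in real code
            f.get();                                   // consume the result (or rethrow its exception)
            drained++;
        }
        return drained;
    }
}
```

Pairing every submit with a matching poll/take in this way keeps the completion queue empty regardless of how callers use the service.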
By following these guidelines, teams can avoid similar memory‑leak incidents and improve the reliability of their backend services.