How a Misconfigured ThreadPool Caused Data Loss in Production – A Step‑by‑Step Troubleshooting Guide
The article recounts a real production incident where a recent deployment triggered loss of third‑party order data, traces the root cause to an improperly configured ThreadPoolExecutor that used a local memory queue, and walks through the systematic investigation and remediation steps.
Problem
The service had been stable for six months, but after a redeployment one night, more than 100 order records received from a third-party API disappeared from the database. The third party confirmed it had received a successful response from the service, and the record timestamps matched the deployment window, indicating the release caused the loss.
Initial Investigation
Team members first checked the code version and the new feature, but found no direct link to the missing data. The developer who owned the feature was puzzled because the functionality had passed multiple tests and the system had run without issues for a long period.
Further Investigation
DBAs, operations engineers, and the team lead performed a joint review of the interface code. The processing flow is:
Receive third‑party data → validate → business processing → persist to database.
The implementation validates the payload, immediately returns success to the caller, and then performs business logic and persistence asynchronously using a thread pool.
The use of a thread pool introduces a blocking queue that resides in local memory. If the application restarts before the queued tasks finish, any data held in that memory is lost.
Because the thread-pool parameters had been chosen arbitrarily, a burst of more than 100 records piled up in the queue faster than the workers could drain it, and the restart during deployment discarded everything still waiting in memory.
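The failure mode described above can be reproduced in miniature. The sketch below is hypothetical (the class and method names are not from the article's codebase): a single busy worker, an unbounded in-memory queue, and an abrupt shutdown standing in for the redeploy. Every task still sitting in the queue when the process dies is simply gone.

```java
import java.util.List;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class InMemoryQueueLossDemo {
    // Simulates the risky pattern: acknowledge the caller first, persist later
    // via a thread pool whose queue lives only in process memory.
    static int simulateRestart(int orders) throws InterruptedException {
        AtomicInteger persisted = new AtomicInteger();
        // A single slow worker and an unbounded in-memory queue (hypothetical sizing).
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        CountDownLatch hold = new CountDownLatch(1);
        pool.execute(() -> { // the lone worker is busy, so everything else queues up
            try { hold.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        for (int i = 0; i < orders; i++) {
            // "return success to the third party" would happen here, before persistence
            pool.execute(persisted::incrementAndGet);
        }
        // A redeploy kills the JVM; shutdownNow() stands in for that here:
        // it drains the queue, and the pending tasks are never executed.
        List<Runnable> lost = pool.shutdownNow();
        return lost.size();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("orders lost on restart: " + simulateRestart(100));
    }
}
```

The third party sees HTTP success for all 100 orders, yet none of the queued persistence tasks ever run.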
In many projects, thread‑pool settings are loosely tuned and rarely cause problems—until a scenario like this occurs.
Key ThreadPoolExecutor constructor parameters (with English comments):
public ThreadPoolExecutor(
    int corePoolSize,                  // number of core threads
    int maximumPoolSize,               // maximum number of threads
    long keepAliveTime,                // idle time before a non-core thread is reclaimed
    TimeUnit unit,                     // time unit for keepAliveTime
    BlockingQueue<Runnable> workQueue, // task waiting queue
    ThreadFactory threadFactory,       // thread factory (shorter overloads omit it)
    RejectedExecutionHandler handler   // rejection policy (shorter overloads omit it)
)

Thread-pool behavior recap:
Task submission: via execute(Runnable) or submit(Callable).
Core thread check: if fewer than corePoolSize threads are running, a new core thread is created to run the task.
Task queuing: if all core threads are busy, the task is placed into the work queue.
Temporary thread creation: when the queue is full and the current thread count is below maximumPoolSize, a non-core thread is created.
Rejection policy: if the queue is full and the pool has reached maximumPoolSize, the task is rejected and the configured RejectedExecutionHandler is invoked.
A core pool size that is too small makes tasks wait in the queue; if the service restarts, those queued tasks are lost.
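The dispatch order above can be verified with a tiny pool. This is an illustrative sketch (class and method names are my own, not from the article): one core thread, a queue of capacity one, and a maximum of two threads can absorb exactly three concurrent tasks, so a fourth triggers the rejection handler.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolDispatchDemo {
    // Returns how many tasks are rejected when more work is submitted than
    // core threads + queue capacity + temporary threads can absorb.
    static int countRejections(int tasks) throws InterruptedException {
        AtomicInteger rejected = new AtomicInteger();
        CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1,                              // corePoolSize
                2,                              // maximumPoolSize
                30, TimeUnit.SECONDS,           // keep-alive for the temporary thread
                new ArrayBlockingQueue<>(1),    // bounded in-memory queue, capacity 1
                (r, executor) -> rejected.incrementAndGet()); // count rejections
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> { // every accepted task blocks until released
                try { release.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
        }
        release.countDown();
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return rejected.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // 1 core thread + 1 queued task + 1 temporary thread = 3 absorbed; the 4th is rejected.
        System.out.println("rejected: " + countRejections(4));
    }
}
```

Note that the default handler is AbortPolicy, which throws RejectedExecutionException; here a counting handler is substituted so the rejection is observable.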
Conclusion
The incident illustrates how an ill‑configured thread pool can become a hidden source of data loss in production. Properly sizing core and maximum threads, monitoring queue depth, and ensuring graceful shutdown of pending tasks are essential safeguards for reliable backend services.
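One of the safeguards mentioned above, graceful shutdown of pending tasks, can be sketched as follows. This is a common pattern rather than the article's actual fix: stop accepting new work with shutdown(), let queued tasks finish within a deadline, and only interrupt stragglers as a last resort.

```java
import java.util.concurrent.*;

public class GracefulShutdownDemo {
    // Drains pending work before the process exits: stop accepting new tasks,
    // wait for the queue to empty, and force-interrupt only on timeout.
    static void shutdownGracefully(ExecutorService pool, long timeoutSeconds) {
        pool.shutdown(); // no new tasks accepted; queued tasks still run
        try {
            if (!pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS)) {
                pool.shutdownNow(); // deadline passed: interrupt remaining workers
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        // In a real service, a JVM shutdown hook would drain the queue on redeploy:
        Runtime.getRuntime().addShutdownHook(
                new Thread(() -> shutdownGracefully(pool, 30)));
        pool.execute(() -> System.out.println("pending work completes before exit"));
        shutdownGracefully(pool, 30); // shut down directly so this demo terminates
    }
}
```

Graceful shutdown narrows the loss window but does not close it (a kill -9 or crash still drops the queue), which is why bursty, must-not-lose data usually belongs in a durable queue rather than process memory.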
