Common Intermittent Bugs in Production: Scenarios, Cases, and Prevention
Intermittent bugs often slip through local and test environments and surface only in production. Typical causes are concurrency issues, cache inconsistencies, mutable shared templates, missing thread‑local cleanup, unsynchronized async tasks, race conditions, and resource failures. Prevention comes down to writing thread‑safe code, simulating real traffic, logging clearly, and shutting down gracefully.
In daily development, many teams encounter intermittent bugs that only appear under specific conditions in production, despite passing local, test, and pre‑release environments.
This article categorises common scenarios that lead to such issues and provides concrete code examples.
Scenario categories include:
Concurrent access, asynchronous programming, resource contention
Cache consistency problems
Dirty data and data skew
Boundary values, time‑outs, rate limiting
Server and hardware failures
Incompatible code changes
Network and other external factors
Case 1 – Non‑thread‑safe collection with parallelStream
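List<XXXDO> dataList = /* fetch from DB */;
List<XXXDO> successList = new ArrayList<>();
List<XXXDO> failList = new ArrayList<>();
// forEach runs on several worker threads, all calling add() on the same unsynchronized lists
dataList.parallelStream().forEach(vo -> {
    // ...
    if (/* success */) {
        successList.add(vo);
    } else {
        failList.add(vo);
    }
});
This usually appears to work with small data sets but can produce wrong results as the volume grows: ArrayList is not thread-safe, so concurrent add() calls can silently drop elements, leave null slots, or throw ArrayIndexOutOfBoundsException.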
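One way to avoid the shared lists altogether is to let the stream build the results. The following is a minimal sketch, assuming the success check is a pure function; `PartitionExample`, `isSuccess`, and the use of `Integer` in place of `XXXDO` are placeholders for illustration, not part of the original code.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class PartitionExample {
    // Hypothetical success check standing in for the elided business logic.
    static boolean isSuccess(int value) {
        return value % 2 == 0;
    }

    // The collector builds per-thread partial results and merges them,
    // so no list is ever written by two threads at once.
    static Map<Boolean, List<Integer>> split(List<Integer> dataList) {
        return dataList.parallelStream()
                .collect(Collectors.partitioningBy(PartitionExample::isSuccess));
    }

    public static void main(String[] args) {
        List<Integer> dataList = IntStream.range(0, 100_000).boxed().collect(Collectors.toList());
        Map<Boolean, List<Integer>> result = split(dataList);
        List<Integer> successList = result.get(true);
        List<Integer> failList = result.get(false);
        // Every input element ends up in exactly one of the two lists.
        System.out.println(successList.size() + failList.size());
    }
}
```

If the per-element work has side effects that must stay inside `forEach`, a thread-safe collection such as `Collections.synchronizedList(new ArrayList<>())` is another option, at the cost of contention on the list.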
Case 2 – ThreadLocal not removed
// Correct usage
try {
    // business logic
} finally {
    threadLocalUser.remove();
}
// Incorrect usage (remove may be skipped on exception)
try {
    // business logic
    threadLocalUser.remove(); // may never run
} catch (Exception e) {
    // handle
}
Failing to clean up a ThreadLocal leads to stale data when pooled threads are reused: the next request handled by the same thread sees the previous request's value.
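The stale-data effect is easy to reproduce with a single-thread pool, which guarantees that two consecutive tasks run on the same thread. This is a small sketch for illustration only; `ThreadLocalLeakDemo`, `secondRequestSees`, and the value `"alice"` are made-up names, and the two submitted tasks stand in for two unrelated requests.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ThreadLocalLeakDemo {
    static final ThreadLocal<String> threadLocalUser = new ThreadLocal<>();

    // Runs two "requests" on the same pooled thread and returns
    // what the second request finds in the ThreadLocal.
    static String secondRequestSees(boolean cleanUp) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Runnable firstRequest = () -> {
                threadLocalUser.set("alice");
                try {
                    // business logic would run here
                } finally {
                    if (cleanUp) {
                        threadLocalUser.remove();
                    }
                }
            };
            pool.submit(firstRequest).get();

            // The second request never sets a user; it only reads.
            Callable<String> secondRequest = threadLocalUser::get;
            return pool.submit(secondRequest).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(secondRequestSees(false)); // prints "alice": leaked from the first request
        System.out.println(secondRequestSees(true));  // prints "null": cleaned up in finally
    }
}
```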
Case 3 – Modifying member variables of a configuration template
Map cardParamVO = new HashMap<>();
// ... load template from Nacos
String contentDescStr = Optional.ofNullable(stable.getContentDesc())
    .map(contentDesc -> contentDesc.replace("$userName$", params.get("userName")))
    .orElse(stable.getContentDesc());
stable.setContentDesc(contentDescStr); // mutates shared template
If multiple requests share the same template instance, the first request overwrites the $userName$ placeholder with its own value. Later requests find no placeholder left to replace and return the first user's name, and concurrent requests can overwrite each other's results.
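A safer pattern is to treat the template as read-only and return the rendered text as a new value. The sketch below assumes a template object with a single `contentDesc` field; `TemplateRenderExample`, `CardTemplate`, and `render` are hypothetical names, and the real template class loaded from Nacos may look different.

```java
import java.util.Map;

class TemplateRenderExample {
    // Hypothetical stand-in for the shared template; immutable by construction.
    static final class CardTemplate {
        private final String contentDesc;

        CardTemplate(String contentDesc) {
            this.contentDesc = contentDesc;
        }

        String getContentDesc() {
            return contentDesc;
        }
    }

    // Returns the rendered text for one request and never writes back to the template.
    static String render(CardTemplate template, Map<String, String> params) {
        String contentDesc = template.getContentDesc();
        if (contentDesc == null) {
            return null;
        }
        // Fall back to an empty string so a missing parameter does not throw.
        return contentDesc.replace("$userName$", params.getOrDefault("userName", ""));
    }

    public static void main(String[] args) {
        CardTemplate shared = new CardTemplate("Hello $userName$");
        System.out.println(render(shared, Map.of("userName", "alice"))); // prints "Hello alice"
        System.out.println(render(shared, Map.of("userName", "bob")));   // prints "Hello bob"
        System.out.println(shared.getContentDesc());                     // prints "Hello $userName$"
    }
}
```

If the template class cannot be made immutable, copying it per request before calling any setter achieves the same isolation.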
Case 4 – Asynchronous dependency with thread pool
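List<XXXDO> successList = new ArrayList<>();
List<XXXDO> failList = new ArrayList<>();
for (XXXDO vo : dataList) {
    ThreadUtil.execute(() -> {
        // call external API
        if (/* ok */) {
            successList.add(vo);
        } else {
            failList.add(vo);
        }
    });
}
// Method may return before the lists are populated
Nothing waits for the submitted tasks (for example via a CountDownLatch or futures), so under load the caller can read empty or partial lists. The shared ArrayList instances are also written from several pool threads at once, which adds the same thread-safety problem as Case 1.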
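Both problems go away if each task returns its outcome and the caller collects the outcomes after waiting. The following is a minimal sketch using `CompletableFuture` from the JDK rather than `ThreadUtil`; `AsyncCollectExample`, `Result`, `callExternalApi`, the pool size of 4, and the use of `Integer` in place of `XXXDO` are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

class AsyncCollectExample {
    // Simple holder for the two result lists.
    static final class Result {
        final List<Integer> successList = new ArrayList<>();
        final List<Integer> failList = new ArrayList<>();
    }

    // Hypothetical stand-in for the external API call.
    static boolean callExternalApi(int vo) {
        return vo % 3 != 0;
    }

    static Result process(List<Integer> dataList) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            // Start every call; each future carries the outcome for one element.
            List<CompletableFuture<Boolean>> futures = dataList.stream()
                    .map(vo -> CompletableFuture.supplyAsync(() -> callExternalApi(vo), pool))
                    .collect(Collectors.toList());

            Result result = new Result();
            for (int i = 0; i < dataList.size(); i++) {
                // join() blocks until task i is done, so only this thread touches the lists.
                if (futures.get(i).join()) {
                    result.successList.add(dataList.get(i));
                } else {
                    result.failList.add(dataList.get(i));
                }
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Result result = process(List.of(1, 2, 3, 4, 5, 6));
        System.out.println(result.successList); // prints "[1, 2, 4, 5]"
        System.out.println(result.failList);    // prints "[3, 6]"
    }
}
```

In a real service the pool would normally be shared and long-lived rather than created per call, and `join()` would be paired with a timeout so that one slow external call cannot block the caller indefinitely.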
Case 5 – Counter++ race condition
public class UnsafeConcurrencyExample {
    private static int counter = 0;
    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> {
            for (int i = 0; i < 1000; i++) counter++;
        });
        Thread t2 = new Thread(() -> {
            for (int i = 0; i < 1000; i++) counter++;
        });
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println("Counter: " + counter);
    }
}
counter++ is a read-modify-write sequence, not a single atomic step. When two threads interleave, one update overwrites the other, so the final count is often less than the expected 2000.
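Replacing the plain `int` with an `AtomicInteger` makes each increment atomic. This is a sketch of one possible fix; `SafeCounterExample` and `countWithTwoThreads` are illustrative names, and a `synchronized` block or `LongAdder` would work as well.

```java
import java.util.concurrent.atomic.AtomicInteger;

class SafeCounterExample {
    // Runs two threads that each increment a shared counter and returns the final value.
    static int countWithTwoThreads(int incrementsPerThread) {
        AtomicInteger counter = new AtomicInteger();
        Runnable work = () -> {
            for (int i = 0; i < incrementsPerThread; i++) {
                counter.incrementAndGet(); // atomic read-modify-write
            }
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println("Counter: " + countWithTwoThreads(1000)); // prints "Counter: 2000"
    }
}
```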
Case 6 – Cache inconsistency
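Cache cache = CacheBuilder.newBuilder()
    .expireAfterWrite(10, TimeUnit.MINUTES)
    .build();
Object data = cache.get(key, () -> fetchDataFromDB());
// Each node keeps its own in-memory copy, and each copy expires on its own schedule.
This cache is local to each node, so entries expire at different times on different nodes. After the underlying data changes, some nodes keep serving the old value until their own entry expires. A long TTL (10 minutes here) widens the window in which requests see a mix of fresh and stale responses.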
Case 7 – Thread pool without graceful shutdown
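public class SimpleThreadPool {
    private final ExecutorService executor;
    public SimpleThreadPool(int threads) {
        executor = Executors.newFixedThreadPool(threads);
    }
    public void execute(Runnable task) {
        executor.execute(task);
    }
    public void shutdown() {
        // Stops accepting new tasks but does not wait for running ones to finish
        executor.shutdown();
    }
}
shutdown() only stops the pool from accepting new tasks; it does not wait for tasks already running. If the service is redeployed while tasks are in flight and nothing waits for them, abrupt termination may leave dirty data.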
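A graceful variant waits for in-flight tasks for a bounded time and only then forces termination. The sketch below uses the standard `shutdown()` / `awaitTermination()` / `shutdownNow()` sequence from `ExecutorService`; the class name `GracefulThreadPool`, the method `shutdownGracefully`, and any timeout passed to it are assumptions for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class GracefulThreadPool {
    private final ExecutorService executor;

    GracefulThreadPool(int threads) {
        executor = Executors.newFixedThreadPool(threads);
    }

    void execute(Runnable task) {
        executor.execute(task);
    }

    // Returns true if all submitted tasks finished within the timeout.
    boolean shutdownGracefully(long timeout, TimeUnit unit) {
        executor.shutdown(); // stop accepting new tasks
        try {
            if (executor.awaitTermination(timeout, unit)) {
                return true;
            }
            executor.shutdownNow(); // interrupt tasks that are still running
            return false;
        } catch (InterruptedException e) {
            executor.shutdownNow();
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

One place to call this is a JVM shutdown hook, for example `Runtime.getRuntime().addShutdownHook(new Thread(() -> pool.shutdownGracefully(30, TimeUnit.SECONDS)))`, where the 30-second budget is an arbitrary value that should stay below whatever grace period the deployment platform allows before it kills the process.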
Other examples include dirty data causing multiple rows from a selectOne query, rate‑limiting triggered by large batch operations, disk‑full crashes, memory leaks due to uncontrolled object creation, and RPC time‑outs.
Takeaways
Write rigorous, thread‑safe code; many bugs stem from careless coding.
Consider boundary conditions and simulate real‑world traffic in tests.
Maintain clear logs and avoid swallowing exceptions.
Implement graceful shutdown and health‑checks for each node.
Monitor resources (CPU, memory, disk, network) and set proper limits.
Java Tech Enthusiast