
Common Intermittent Bugs in Production: Scenarios, Cases, and Prevention

Production teams often face intermittent bugs that slip through local and test environments. Typical causes are concurrency issues, cache inconsistencies, mutable shared templates, missing thread‑local cleanup, unsynchronized async tasks, race conditions, and resource failures. Prevention comes down to writing thread‑safe code, simulating real traffic, logging clearly, and shutting down gracefully.

Java Tech Enthusiast

Jan 30, 2024


In daily development, many teams encounter intermittent bugs that only appear under specific conditions in production, despite passing local, test, and pre‑release environments.

This article categorises common scenarios that lead to such issues and provides concrete code examples.

Scenario categories include:

Concurrent access, asynchronous programming, resource contention

Cache consistency problems

Dirty data and data skew

Boundary values, time‑outs, rate limiting

Server and hardware failures

Incompatible code changes

Network and other external factors

Case 1 – Non‑thread‑safe collection with parallelStream

List<XXXDO> dataList = /* fetch from DB */;
// ArrayList is not thread-safe, yet both lists are mutated from parallel worker threads below
List<XXXDO> successList = new ArrayList<>();
List<XXXDO> failList = new ArrayList<>();

dataList.parallelStream().forEach(vo -> {
    // ...
    if (/* success */) {
        successList.add(vo);
    } else {
        failList.add(vo);
    }
});

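ArrayList is not thread‑safe, so concurrent add calls from the parallel stream's worker threads can drop elements, leave null entries, or throw ArrayIndexOutOfBoundsException. With small data sets the problem rarely shows up, which is why it tends to pass testing and then produce wrong results once the volume grows.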
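One way to avoid the shared lists is to let the stream build the result itself. The sketch below is a minimal illustration, not the original project's code: `PartitionExample`, `isSuccess`, and the use of `Integer` in place of `XXXDO` are assumptions made for the example.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class PartitionExample {
    // Hypothetical success check standing in for the real business rule.
    static boolean isSuccess(int value) {
        return value % 2 == 0;
    }

    // The collector merges per-thread partial results safely,
    // so no shared list is mutated from worker threads.
    static Map<Boolean, List<Integer>> split(List<Integer> dataList) {
        return dataList.parallelStream()
                .collect(Collectors.partitioningBy(PartitionExample::isSuccess));
    }

    public static void main(String[] args) {
        List<Integer> dataList = IntStream.range(0, 100_000)
                .boxed()
                .collect(Collectors.toList());
        Map<Boolean, List<Integer>> result = split(dataList);
        List<Integer> successList = result.get(true);
        List<Integer> failList = result.get(false);
        System.out.println(successList.size() + " succeeded, " + failList.size() + " failed");
    }
}
```

If the loop body has side effects that make a collector awkward, wrapping the lists with `Collections.synchronizedList` is a simpler alternative, at the cost of lock contention.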

Case 2 – ThreadLocal not removed

// Correct usage
try {
    // business logic
} finally {
    threadLocalUser.remove();
}

// Incorrect usage (remove may be skipped on exception)
try {
    // business logic
    threadLocalUser.remove(); // may never run
} catch (Exception e) {
    // handle
}

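Thread pools reuse their threads, so a ThreadLocal value that is not removed stays attached to the thread. The next request handled by that thread can then read the previous request's data, for example another user's identity. Because this only happens on the error path that skips remove(), it shows up intermittently.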
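The leak is easy to reproduce with a single pooled thread. The following is a minimal sketch under stated assumptions: `ThreadLocalCleanupExample`, `CURRENT_USER`, `handle`, and the `cleanUp` flag are hypothetical names used only to contrast the two behaviours.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ThreadLocalCleanupExample {
    private static final ThreadLocal<String> CURRENT_USER = new ThreadLocal<>();

    // Simulates one request; returns what the ThreadLocal already held on entry.
    static String handle(String user, boolean cleanUp) {
        String seenOnEntry = CURRENT_USER.get();
        try {
            CURRENT_USER.set(user);
            // business logic would run here
        } finally {
            if (cleanUp) {
                CURRENT_USER.remove(); // always runs, even if the logic throws
            }
        }
        return seenOnEntry;
    }

    // Runs two requests on the same pooled thread and reports what the second one saw on entry.
    static String secondRequestSees(boolean cleanUp) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            pool.submit(() -> handle("alice", cleanUp)).get();
            return pool.submit(() -> handle("bob", cleanUp)).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("without remove(): " + secondRequestSees(false));
        System.out.println("with remove():    " + secondRequestSees(true));
    }
}
```

Without cleanup, the second request starts with the first request's user still set; with `remove()` in `finally`, it starts with a clean slate.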

Case 3 – Modifying member variables of a configuration template

Map<String, AuthorizedCardParamVO> cardParamVO = new HashMap<>();
// ... load template from Nacos
String contentDescStr = Optional.ofNullable(stable.getContentDesc())
    .map(contentDesc -> contentDesc.replace("$userName$", params.get("userName")))
    .orElse(stable.getContentDesc());

stable.setContentDesc(contentDescStr); // mutates shared template

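If multiple requests share the same template instance, the first request writes its substituted text back into the template and the $userName$ placeholder is gone. Later requests then render the first user's name, and concurrent requests can overwrite each other's values. The bug stays hidden whenever the template is reloaded for each request, which is common in local tests.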
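A common remedy is to treat the shared template as read‑only and return the rendered text instead of storing it. The sketch below assumes a simplified, hypothetical `CardTemplate` class in place of the real template type; only the `$userName$` placeholder and the `getContentDesc` accessor come from the snippet above. It uses `Map.of`, so it needs Java 9 or later.

```java
import java.util.Map;

final class TemplateRenderExample {
    // Hypothetical immutable template; stands in for the shared object loaded from configuration.
    static final class CardTemplate {
        private final String contentDesc;

        CardTemplate(String contentDesc) {
            this.contentDesc = contentDesc;
        }

        String getContentDesc() {
            return contentDesc;
        }
    }

    // Produces a per-request string and never writes back to the template.
    static String render(CardTemplate template, Map<String, String> params) {
        String desc = template.getContentDesc();
        if (desc == null) {
            return null;
        }
        return desc.replace("$userName$", params.getOrDefault("userName", ""));
    }

    public static void main(String[] args) {
        CardTemplate shared = new CardTemplate("Hello, $userName$");
        System.out.println(render(shared, Map.of("userName", "Alice")));
        System.out.println(render(shared, Map.of("userName", "Bob")));
        System.out.println(shared.getContentDesc()); // placeholder is still intact
    }
}
```

If the template type cannot be made immutable, copying it per request before calling any setter achieves the same isolation.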

Case 4 – Asynchronous dependency with thread pool

// Plain ArrayLists shared with pool threads: not thread-safe, and nothing waits for the tasks
List<XXXDO> successList = new ArrayList<>();
List<XXXDO> failList = new ArrayList<>();

for (XXXDO vo : dataList) {
    ThreadUtil.execute(() -> {
        // call external API
        if (/* ok */) {
            successList.add(vo);
        } else {
            failList.add(vo);
        }
    });
}
// Method may return before the lists are populated

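Two problems combine here. Nothing waits for the submitted tasks, so the caller can read the lists while they are still empty or only partly filled. In addition, the pool threads add to plain ArrayLists concurrently, which can lose entries. Under light load the tasks often finish in time, so the bug only surfaces when the external API slows down or the pool is busy.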
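One possible fix is to keep a future per task and collect the results on the calling thread after each task completes. This is a sketch rather than the original code: it uses a plain `ExecutorService` instead of `ThreadUtil`, and `AwaitAsyncExample`, `Result`, and `callExternalApi` are hypothetical names, with `Integer` standing in for `XXXDO`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class AwaitAsyncExample {
    static final class Result {
        final List<Integer> successList = new ArrayList<>();
        final List<Integer> failList = new ArrayList<>();
    }

    // Hypothetical stand-in for the external API call.
    static boolean callExternalApi(int id) {
        return id % 3 != 0;
    }

    static Result process(List<Integer> dataList) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            // Submit every call up front so they run concurrently.
            List<CompletableFuture<Boolean>> futures = new ArrayList<>();
            for (Integer id : dataList) {
                futures.add(CompletableFuture.supplyAsync(() -> callExternalApi(id), pool));
            }
            // join() blocks until the task is done; only this thread touches the lists.
            Result result = new Result();
            for (int i = 0; i < dataList.size(); i++) {
                boolean ok = futures.get(i).join();
                if (ok) {
                    result.successList.add(dataList.get(i));
                } else {
                    result.failList.add(dataList.get(i));
                }
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Result result = process(List.of(1, 2, 3, 4, 5, 6, 7, 8, 9));
        System.out.println("success=" + result.successList + " fail=" + result.failList);
    }
}
```

A `CountDownLatch` combined with synchronized lists would also work; the future‑based version is shown because it removes shared mutable state from the worker threads altogether.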

Case 5 – Counter++ race condition

public class UnsafeConcurrencyExample {
    private static int counter = 0;
    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> {
            for (int i = 0; i < 1000; i++) counter++;
        });
        Thread t2 = new Thread(() -> {
            for (int i = 0; i < 1000; i++) counter++;
        });
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println("Counter: " + counter);
    }
}

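counter++ is a read‑modify‑write sequence rather than a single atomic step, so two threads can read the same value and one increment is lost. The final count is then less than the expected 2000. With loops this short the program may still print 2000 on many runs, which is exactly why such bugs slip through testing.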
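Replacing the `int` with an `AtomicInteger` makes each increment atomic. The sketch below mirrors the example above; the class and method names (`SafeCounterExample`, `countWithTwoThreads`) are illustrative.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class SafeCounterExample {
    // Two threads increment the same counter; incrementAndGet() is atomic.
    static int countWithTwoThreads(int incrementsPerThread) {
        AtomicInteger counter = new AtomicInteger();
        Runnable work = () -> {
            for (int i = 0; i < incrementsPerThread; i++) {
                counter.incrementAndGet();
            }
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println("Counter: " + countWithTwoThreads(1000));
    }
}
```

For counters that are written heavily and read rarely, `LongAdder` is usually a better fit; a `synchronized` block is the alternative when several fields must change together.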

Case 6 – Cache inconsistency

Cache<String, Object> cache = CacheBuilder.newBuilder()
    .expireAfterWrite(10, TimeUnit.MINUTES)
    .build();

Object data = cache.get(key, () -> fetchDataFromDB());
// If cache expires at different times on different nodes, some requests see stale data.

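Each node keeps its own in‑process cache, and each entry's 10‑minute timer starts when that node loaded it. After the underlying data changes, some nodes keep serving the old value until their copy expires while others already return the new one, so the response depends on which node handles the request. The longer the TTL, the longer this mixed window lasts.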
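The effect, and one possible mitigation, can be simulated without a real cache library. The sketch below is purely illustrative: plain maps stand in for the database and for each node's local cache, and `LocalCacheInvalidationExample`, `Node`, `DB`, and `update` are hypothetical names. In a real deployment the invalidation would typically travel over a message channel or a shared cache, which is assumed here rather than shown.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class LocalCacheInvalidationExample {
    // Stand-in for the database.
    static final Map<String, String> DB = new ConcurrentHashMap<>();

    // Stand-in for one service node with its own in-process cache.
    static final class Node {
        private final Map<String, String> localCache = new ConcurrentHashMap<>();

        String read(String key) {
            // Loads from the database only when this node has no cached copy.
            return localCache.computeIfAbsent(key, DB::get);
        }

        void invalidate(String key) {
            localCache.remove(key);
        }
    }

    // Writes to the database, then tells every node to drop its cached copy.
    static void update(String key, String value, Node... nodes) {
        DB.put(key, value);
        for (Node node : nodes) {
            node.invalidate(key);
        }
    }

    public static void main(String[] args) {
        Node nodeA = new Node();
        Node nodeB = new Node();
        DB.put("price", "10");
        nodeA.read("price");        // node A caches "10"
        DB.put("price", "12");      // database changes without any invalidation
        System.out.println("A=" + nodeA.read("price") + " B=" + nodeB.read("price"));
        update("price", "12", nodeA, nodeB); // write plus invalidation
        System.out.println("A=" + nodeA.read("price") + " B=" + nodeB.read("price"));
    }
}
```

The first line printed shows the two nodes disagreeing; after the invalidating update they agree again. A shorter TTL narrows the window but does not remove it.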

Case 7 – Thread pool without graceful shutdown

public class SimpleThreadPool {
    private ExecutorService executor;
    public SimpleThreadPool(int threads) {
        executor = Executors.newFixedThreadPool(threads);
    }
    public void execute(Runnable task) {
        executor.execute(task);
    }
    public void shutdown() {
        // Stops accepting new tasks but returns immediately; it does not wait for running tasks
        executor.shutdown();
    }
}

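shutdown() only stops the pool from accepting new tasks and returns immediately. If the service is redeployed and the process is stopped while tasks are still running, those tasks are cut off midway and may leave partially written (dirty) data.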
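A graceful variant waits for in‑flight tasks with a timeout and only then interrupts whatever is left. The sketch below follows the two‑phase pattern described in the `ExecutorService` documentation; `GracefulShutdownExample`, `shutdownGracefully`, and the timeout values are assumptions chosen for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class GracefulShutdownExample {
    // Returns true if every task finished within the allowed time.
    static boolean shutdownGracefully(ExecutorService executor, long timeout, TimeUnit unit) {
        executor.shutdown(); // phase 1: reject new tasks, let queued and running ones finish
        try {
            if (executor.awaitTermination(timeout, unit)) {
                return true;
            }
            executor.shutdownNow(); // phase 2: interrupt tasks that are still running
            return executor.awaitTermination(timeout, unit);
        } catch (InterruptedException e) {
            executor.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < 5; i++) {
            executor.execute(completed::incrementAndGet);
        }
        boolean clean = shutdownGracefully(executor, 5, TimeUnit.SECONDS);
        System.out.println("clean=" + clean + ", completed=" + completed.get());
    }
}
```

Such a method would typically be called from the application's shutdown path, for example a JVM shutdown hook registered with `Runtime.getRuntime().addShutdownHook`, so that a redeploy gives running tasks a chance to finish.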

Other examples include dirty data causing multiple rows from a selectOne query, rate‑limiting triggered by large batch operations, disk‑full crashes, memory leaks due to uncontrolled object creation, and RPC time‑outs.

Takeaways

Write rigorous, thread‑safe code; many bugs stem from careless coding.

Consider boundary conditions and simulate real‑world traffic in tests.

Maintain clear logs and avoid swallowing exceptions.

Implement graceful shutdown and health‑checks for each node.

Monitor resources (CPU, memory, disk, network) and set proper limits.
