When a Promotion Launch Crashed the System: A Deep Dive into Backend Failure and Lessons Learned

A senior engineer recounts a midnight e‑commerce promotion disaster caused by a mis‑designed cache‑update loop, detailing the alert storm, step‑by‑step investigation, JVM heap and GC analysis, the offending FastJSON serialization code, the rapid rollback, and three hard‑won operational rules.

dbaplus Community

Act 1: The Avalanche

At midnight, the instant a high‑profile S‑level promotion went live, the promotion‑marketing cluster spiked: availability dropped below 10%, active HSF thread‑pool threads exceeded 95%, and CPU load surged past 8.0. Alerts flooded the monitoring channel, the service powering the promotion became effectively unavailable, and the activity vanished from the page the moment it launched.

Act 2: Investigation Steps

Step 1 – Check logs. Numerous NullPointerExceptions appeared, but they originated from a peripheral client JAR unrelated to the core flow, so they were dismissed.

Step 2 – Suspect deadlock. The HSF thread pool was exhausted, a classic symptom of thread starvation. A jstack snapshot showed no deadlocks, so the hypothesis was ruled out.
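For reference, the check jstack performs here can also be run in‑process via ThreadMXBean. This is a minimal illustrative sketch, not the incident tooling:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public final class DeadlockProbe {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // findDeadlockedThreads() returns null when no lock cycle exists -
        // the same verdict the team read off the jstack output.
        long[] ids = mx.findDeadlockedThreads();
        if (ids == null) {
            System.out.println("no deadlock - hypothesis ruled out");
            return;
        }
        for (ThreadInfo info : mx.getThreadInfo(ids, Integer.MAX_VALUE)) {
            System.out.printf("deadlocked: %s blocked on %s held by %s%n",
                    info.getThreadName(), info.getLockName(), info.getLockOwnerName());
        }
    }
}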

Step 3 – Restart machines. Restarting the most heavily loaded nodes temporarily lowered CPU usage and load, but the metrics spiked again as soon as traffic returned.

Step 4 – Scale out. Adding 20 new machines only delayed the problem; the new instances quickly suffered the same high load and aggressive GC.

After 18 minutes of chaos, the team turned to the JVM internals for deeper insight.

Act 3: Root Cause

Heap dump analysis revealed persistently high Old Gen usage and ineffective CMS collections, leading to frequent full GCs that explained the CPU surge. A massive char[] holding a huge activity‑configuration string pointed to a large object lingering in memory.
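The analysis above was done offline on a dump; for illustration, the Old Gen occupancy that stayed pinned high can also be read live through MemoryPoolMXBean. A minimal probe, not part of the original investigation (under CMS the tenured pool is named "CMS Old Gen"):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public final class OldGenProbe {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (!pool.getName().contains("Old Gen")) {
                continue; // skip eden, survivor, metaspace, code cache pools
            }
            MemoryUsage u = pool.getUsage();
            long max = u.getMax() > 0 ? u.getMax() : u.getCommitted(); // max may be undefined (-1)
            System.out.printf("%s: %d / %d bytes (%.1f%% used)%n",
                    pool.getName(), u.getUsed(), max, 100.0 * u.getUsed() / max);
        }
    }
}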

Thread‑stack analysis showed over 300 threads in TIMED_WAITING and 246 in RUNNABLE. The runnable threads were all burning CPU in FastJSON serialization, with stack frames at com.alibaba.fastjson.JSON.toJSONString(...). The culprit was a cache‑update method that serialized a 1‑2 MB activity object inside a loop over 20 cache partitions, i.e., 20 serializations per request:

// ... omitted imports (com.alibaba.fastjson.JSON, CollectionUtils, java.util.*, etc.)
public void updateActivityXxxCache(Long sellerId, List<XxxDO> xxxDOList) {
    try {
        if (CollectionUtils.isEmpty(xxxDOList)) {
            xxxDOList = new ArrayList<>();
        }
        // 20 partition keys to spread read pressure
        for (int index = 0; index < XXX_CACHE_PARTITION_NUMBER; index++) {
            // Fatal: serialization inside the loop!
            tairCache.put(String.format(ACTIVITY_PLAY_KEY, sellerId, index), // key per seller and partition
                         JSON.toJSONString(xxxDOList), // the same 1-2 MB payload, serialized 20 times!
                         EXPIRE_TIME);
        }
    } catch (Exception e) {
        log.warn("update cache exception occurred", e);
    }
}

This loop meant every cache‑miss recovery serialized the large object 20 times, turning the service into a "CPU meat grinder". The Tair LDB middleware, already fragile, was overwhelmed by the 20 × 1‑2 MB of write traffic per update, triggering rate limiting and further inflating latency.

Consequently, the HSF thread pool filled with these slow, CPU‑bound tasks, leading to a full‑cluster avalanche.
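The remediation that night was a rollback (Act 4), but the forward fix is a one‑line hoist: serialize once, then write the same payload to every partition. A minimal sketch, reusing the hypothetical names from the snippet above:

public void updateActivityXxxCache(Long sellerId, List<XxxDO> xxxDOList) {
    try {
        if (CollectionUtils.isEmpty(xxxDOList)) {
            xxxDOList = new ArrayList<>();
        }
        // Serialize exactly once: the payload is identical for every partition key.
        final String payload = JSON.toJSONString(xxxDOList);
        for (int index = 0; index < XXX_CACHE_PARTITION_NUMBER; index++) {
            tairCache.put(String.format(ACTIVITY_PLAY_KEY, sellerId, index),
                          payload,
                          EXPIRE_TIME);
        }
    } catch (Exception e) {
        log.warn("update cache exception occurred", e);
    }
}

Note that this removes only the CPU burn; the 20 × payload write volume to Tair is unchanged, so compressing the payload or rethinking the partitioning would be the natural follow‑up.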

Act 4: Fix and Reflections

The offending loop was rolled back at around 00:30, restoring stability within 30 minutes. The post‑mortem yielded three rules:

Rule 1: Any optimization performed without capacity assessment is reckless.

Rule 2: Monitoring must drill down to the code‑block level; an APM that pinpointed the hot method would have halved the investigation time (see the sketch after the rules).

Rule 3: Technical debt inevitably explodes at the worst moment.
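In the spirit of Rule 2, here is a minimal sketch of code‑block‑level timing; BlockTimer and its metric names are hypothetical stand‑ins for whatever APM agent is in place:

import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public final class BlockTimer {
    // Wraps a code block and reports its wall time at block granularity.
    public static <T> T timed(String blockName, Supplier<T> block) {
        long start = System.nanoTime();
        try {
            return block.get();
        } finally {
            long micros = TimeUnit.NANOSECONDS.toMicros(System.nanoTime() - start);
            // A real setup would emit this to an APM/metrics pipeline, not stdout.
            System.out.printf("block=%s elapsed=%dus%n", blockName, micros);
        }
    }
}

// Usage - had this wrapped the hot block, the dashboard would have named the culprit directly:
// String payload = BlockTimer.timed("fastjson.toJSONString",
//         () -> JSON.toJSONString(xxxDOList));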

The incident underscored that even a single misplaced for loop can cause a P3 outage, and that respecting the code and maintaining observability are essential to reliable backend systems.
