When a Promotion Launch Crashed the System: A Deep Dive into Backend Failure and Lessons Learned
A senior engineer recounts a midnight e‑commerce promotion disaster caused by a mis‑designed cache‑update loop, detailing the alert storm, step‑by‑step investigation, JVM heap and GC analysis, the offending FastJSON serialization code, the rapid rollback, and three hard‑won operational rules.
Act 1: The Avalanche
At midnight on a high‑profile S‑level promotion, the promotion‑marketing cluster suddenly spiked: availability dropped below 10%, HSF thread‑pool active threads exceeded 95%, and CPU load surged past 8.0. Alerts flooded the monitoring channel, the service powering the promotion became effectively unavailable, and the activity disappeared from the page the moment it went live.
Act 2: Investigation Steps
Step 1 – Check logs. Numerous NullPointerExceptions appeared, but they originated from a peripheral client JAR unrelated to the core flow, so they were dismissed.
Step 2 – Suspect deadlock. The HSF thread pool was exhausted, a classic sign of thread‑starvation. A jstack snapshot showed no deadlocks, so the hypothesis was ruled out.
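For completeness, the deadlock check that jstack performs can also be run in-process via the JDK's standard `java.lang.management.ThreadMXBean` API; a minimal sketch (the class name `DeadlockCheck` is invented for illustration):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class DeadlockCheck {
    // Returns true if the JVM currently has threads deadlocked on
    // object monitors or ownable synchronizers (what jstack reports).
    static boolean hasDeadlock() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long[] ids = bean.findDeadlockedThreads(); // null when there is no deadlock
        return ids != null;
    }

    public static void main(String[] args) {
        System.out.println(hasDeadlock() ? "deadlock detected" : "no deadlock");
    }
}
```

A periodic call to such a check can feed an alert long before anyone has to log in and run jstack by hand.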
Step 3 – Restart machines. Restarting the most loaded nodes temporarily lowered CPU and load, but the metrics spiked again as soon as traffic returned.
Step 4 – Scale out. Adding 20 new machines only delayed the problem; the new instances quickly suffered the same high load and aggressive GC.
After 18 minutes of chaos, the team turned to the JVM internals for deeper insight.
Act 3: Root Cause
Heap dump analysis revealed a constantly high Old Gen usage and ineffective CMS collections, leading to frequent full GCs that explained the CPU surge. A massive char[] array held a huge activity‑configuration string, indicating a large object lingering in memory.
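The "constantly high Old Gen" symptom can also be watched programmatically through the JDK's `MemoryPoolMXBean`; a minimal sketch, assuming a standard collector whose old-generation pool name contains "Old Gen" or "Tenured" (the class and method names are invented for illustration):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class OldGenCheck {
    // Returns Old Gen usage as a fraction of its max, or -1 if the pool
    // could not be identified (names vary by collector: "CMS Old Gen",
    // "G1 Old Gen", "PS Old Gen", "Tenured Gen").
    static double oldGenUsageFraction() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            if (name.contains("Old Gen") || name.contains("Tenured")) {
                MemoryUsage usage = pool.getUsage();
                if (usage.getMax() > 0) {
                    return (double) usage.getUsed() / usage.getMax();
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.printf("Old Gen usage: %.1f%%%n", 100 * oldGenUsageFraction());
    }
}
```

An Old Gen fraction that stays near 1.0 across CMS cycles is exactly the "ineffective collection" pattern the heap dump confirmed here.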
Thread‑stack analysis showed over 300 threads in TIMED_WAITING and 246 in RUNNABLE. The runnable threads were all busy inside FastJSON serialization, parked at com.alibaba.fastjson.JSON.toJSONString(...). The culprit was a cache‑update method that serialized a 1‑2 MB activity object inside a loop over 20 partitions, causing 20 serializations per request:
// ... omitted imports
public void updateActivityXxxCache(Long sellerId, List<XxxDO> xxxDOList) {
    try {
        if (CollectionUtils.isEmpty(xxxDOList)) {
            xxxDOList = new ArrayList<>();
        }
        // 20 partition keys to spread read pressure
        for (int index = 0; index < XXX_CACHE_PARTITION_NUMBER; index++) {
            // Fatal: serialization inside the loop!
            tairCache.put(String.format(ACTIVITY_PLAY_KEY, xxxId, index),
                    JSON.toJSONString(xxxDOList), // serialized 20 times!
                    EXPIRE_TIME);
        }
    } catch (Exception e) {
        log.warn("update cache exception occur", e);
    }
}

This loop caused each cache‑miss recovery to serialize the large object 20 times, turning the service into a "CPU meat grinder". The Tair LDB middleware, already fragile, was overwhelmed by the 20 × 1 MB write traffic, triggering rate‑limiting and further inflating latency.
Consequently, the HSF thread pool filled with these slow, CPU‑bound tasks, leading to a full‑cluster avalanche.
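The pool-exhaustion mechanics can be reproduced in miniature with a plain `ThreadPoolExecutor`: once every worker is occupied by a slow, CPU-bound task and the queue is full, new requests are turned away outright. A small sketch (HSF's actual pool sizing and rejection behavior differ; this only illustrates the saturation pattern):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolSaturation {
    // Submit `tasks` slow jobs to a pool of 2 threads with a queue of 2;
    // returns how many were rejected outright (default AbortPolicy).
    static int submitSlowTasks(int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 2, 0L, TimeUnit.SECONDS, new ArrayBlockingQueue<>(2));
        int rejected = 0;
        for (int i = 0; i < tasks; i++) {
            try {
                pool.execute(() -> {
                    try { Thread.sleep(500); } catch (InterruptedException ignored) {}
                });
            } catch (RejectedExecutionException e) {
                rejected++; // pool and queue full: request turned away
            }
        }
        pool.shutdownNow();
        return rejected;
    }

    public static void main(String[] args) {
        // 2 run + 2 queue; the remaining 6 of 10 are rejected immediately
        System.out.println("rejected: " + submitSlowTasks(10));
    }
}
```

In the incident the "slow job" was the 20-fold serialization itself, so every thread the pool freed up was immediately re-captured.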
Act 4: Fix and Reflections
The offending loop was rolled back at around 00:30, restoring stability within 30 minutes. The post‑mortem yielded three rules:
Rule 1: Any optimization performed without capacity assessment is reckless.
Rule 2: Monitoring must drill down to the code‑block level; an APM that pinpoints the hot method would halve investigation time.
Rule 3: Technical debt inevitably explodes at the worst moment.
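As an illustration of Rule 2, when a full APM is not available, code-block-level visibility can be approximated with a small timing wrapper that logs any block exceeding a threshold. A hypothetical sketch (the helper name and threshold are invented, not an existing library API):

```java
import java.util.function.Supplier;

public class BlockTimer {
    // Runs `block`, returns its result, and logs a warning if it took
    // longer than `warnMillis`. Wrap suspect hot spots like serialization.
    static <T> T timed(String label, long warnMillis, Supplier<T> block) {
        long t0 = System.nanoTime();
        try {
            return block.get();
        } finally {
            long ms = (System.nanoTime() - t0) / 1_000_000;
            if (ms >= warnMillis) {
                System.err.println("[SLOW] " + label + " took " + ms + " ms");
            }
        }
    }

    public static void main(String[] args) {
        int result = timed("serializeActivity", 100, () -> 40 + 2);
        System.out.println(result);
    }
}
```

Had the serialization loop been wrapped this way, the hot method would have named itself in the logs within the first minute of the incident.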
The incident underscored that even a single misplaced for loop can cause a P3 outage, and that respecting code and maintaining observability are essential for reliable backend systems.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
