How a Single Looped Serialization Turned a Major Promotion into a System Avalanche
A 2021 midnight promotion in Hangzhou crashed when a poorly placed loop serialized a massive object twenty times per request, overwhelming CPU, thread pools, and the Tair cache, leading to a full‑stack service avalanche that was only resolved after a half‑hour emergency rollback.
In the sweltering night of 2021, a new e‑commerce engineer in Hangzhou launched a high‑stakes S‑level "member flash promotion" that immediately triggered a system avalanche. Within seconds, alerts flooded the monitoring channel, showing severe degradation in the promotion-marketing cluster: application availability dropped below 10%, HSF thread‑pool active threads exceeded 95%, and CPU load spiked above 8.0.
[CRITICAL] promotion-marketing cluster - application availability < 10%
[CRITICAL] promotion-marketing cluster - HSF thread pool active threads > 95%
[URGENT] promotion-marketing cluster - CPU Load > 8.0

The service behind the promotion was effectively dead, and the activity entry disappeared the moment it went live.
Act One: Ineffective Struggles
Step One – Check logs. An NPE appeared frequently but originated from a peripheral client JAR unrelated to the core flow, so it was dismissed.
Step Two – Suspect deadlock. All HSF threads were exhausted, a classic symptom of thread starvation. A jstack dump, however, showed no deadlocks, so this hypothesis was ruled out as well.
Step Three – Restart machines. Restarting the most loaded nodes helped for a couple of minutes, but traffic instantly drove CPU and load back to the ceiling.
Step Four – Scale out. Adding twenty new machines briefly alleviated pressure, yet they quickly fell into the same high‑load, massive‑GC trap.
“A junior engineer watching the red curves whispered, ‘It feels like we’re being lifted away…’”
Act Two: Digging Into the JVM
All conventional mitigations failed, so the team inspected the JVM internals. A retained faulty machine was dumped for heap and thread‑stack analysis.
The heap showed the Old Generation constantly full, with CMS failing to reclaim memory, causing frequent full GCs that explained the CPU surge. Moreover, large char[] arrays referenced a massive "activity configuration" string, indicating a huge object lingering in memory.
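To put rough numbers on that heap pressure, here is a back-of-envelope sketch using the article's own estimates (a 1-2 MB serialized payload and 20 partition keys); the class and method names are illustrative, not from the incident codebase:

```java
// Back-of-envelope allocation math (sizes are the article's estimates, not
// measured values): serializing the same object once per partition key turns
// one payload into twenty copies of short-lived garbage per request.
public class GarbageAmplification {
    // Bytes of short-lived garbage per request, given payload size and key count.
    static long garbagePerRequest(long payloadBytes, int partitionKeys) {
        return payloadBytes * partitionKeys;
    }

    public static void main(String[] args) {
        long payload = 2L * 1024 * 1024; // ~2 MB serialized string
        long perRequest = garbagePerRequest(payload, 20);
        System.out.println(perRequest / (1024 * 1024) + " MB per request"); // 40 MB
        // At even 100 requests/s that is ~4 GB/s of allocations -- more than
        // enough to keep the Old Generation full and CMS in back-to-back GCs.
    }
}
```

At that allocation rate, frequent full GCs stop being surprising; they become arithmetic.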
Thread-stack snapshots revealed over 300 threads in TIMED_WAITING and 246 in RUNNABLE. The runnable threads were almost all stuck in FastJSON serialization, their stacks topped by:

at com.alibaba.fastjson.JSON.toJSONString(...)

This pointed to a single massive object being serialized over and over, devouring both CPU and HSF threads.
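A dump like this can also be triaged programmatically. The sketch below is a hypothetical helper (not the team's tooling): it uses ThreadMXBean to group RUNNABLE threads by their top stack frame, the in-process equivalent of scanning a jstack dump for one dominant frame. In the incident, 246 threads piling onto a single FastJSON frame would dominate such a summary.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.HashMap;
import java.util.Map;

public class HotFrameTriage {
    // Count RUNNABLE threads per top stack frame. A healthy service spreads
    // across many frames; a hot spot shows up as one frame with a huge count.
    static Map<String, Integer> runnableTopFrames() {
        Map<String, Integer> counts = new HashMap<>();
        for (ThreadInfo info : ManagementFactory.getThreadMXBean().dumpAllThreads(false, false)) {
            StackTraceElement[] stack = info.getStackTrace();
            if (info.getThreadState() == Thread.State.RUNNABLE && stack.length > 0) {
                String frame = stack[0].getClassName() + "." + stack[0].getMethodName();
                counts.merge(frame, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        runnableTopFrames().forEach((frame, n) -> System.out.println(n + "\tat " + frame));
    }
}
```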
Act Three: The Bad Code
The culprit was identified in XxxxxCacheManager.java, marked months earlier with a TODO about performance risk:
// TODO: Performance risk here; optimize before the big promotion.

The method updateActivityXxxCache wrote activity play data into the Tair cache, fanning it out across twenty partitioned keys to spread read pressure. The serialization, however, sat inside the loop, so the 1-2 MB object was serialized twenty times per request:
// ... some code omitted
// Utility that writes activity play data to the cache (Tair)
public void updateActivityXxxCache(Long sellerId, List<XxxDO> xxxDOList) {
    try {
        if (CollectionUtils.isEmpty(xxxDOList)) {
            xxxDOList = new ArrayList<>();
        }
        // 20 hashed keys, designed to keep read pressure off any single key
        for (int index = 0; index < XXX_CACHE_PARTITION_NUMBER; index++) {
            // Fatal flaw: the serialization sits inside the loop body!
            tairCache.put(String.format(ACTIVITY_PLAY_KEY, sellerId, index),
                    JSON.toJSONString(xxxDOList), // this one line serialized the object 20 times!
                    EXPIRE_TIME);
        }
    } catch (Exception e) {
        log.warn("update cache exception occur", e);
    }
}

This turned the method into a "CPU meat grinder". The already fragile Tair LDB middleware could not absorb the amplified write traffic, which led to throttling, rising latency, and eventually the exhaustion of the HSF thread pool.
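The fix is to hoist the serialization out of the loop so it runs once per request. The sketch below uses a plain Map as a stand-in for Tair and a counting serializer as a stand-in for JSON.toJSONString, purely to make the serialize-once behavior observable; the names are illustrative, not from the incident codebase.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class CacheUpdateFix {
    static final int PARTITIONS = 20;

    // Serialize once, then write the same string to all partition keys.
    static int updateCache(Map<String, String> cache, String payload, AtomicInteger serializations) {
        String json = serialize(payload, serializations); // hoisted out of the loop
        for (int index = 0; index < PARTITIONS; index++) {
            cache.put("activity_play_key_" + index, json);
        }
        return serializations.get();
    }

    // Stand-in for the expensive JSON.toJSONString call, with an invocation counter.
    static String serialize(String payload, AtomicInteger counter) {
        counter.incrementAndGet();
        return "\"" + payload + "\"";
    }

    public static void main(String[] args) {
        Map<String, String> cache = new HashMap<>();
        int calls = updateCache(cache, "big-activity-config", new AtomicInteger());
        System.out.println(calls + " serialization(s), " + cache.size() + " keys"); // 1 serialization(s), 20 keys
    }
}
```

One line moved, and the per-request serialization cost drops by a factor of twenty.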
Act Four: Truth and Reflection
The root cause was now clear. An emergency rollback of the offending change restored the cluster at around 00:30, ending a roughly 30-minute crisis.
Law One: Any optimization without capacity assessment is reckless.
Good optimizations add value; bad ones are like drawing legs on a snake, a superfluous addition that ruins the whole. Respect for system limits outweighs clever tricks.
Law Two: Monitoring should target code‑block latency.
Existing metrics covered machines, interfaces, and middleware, but missed the time spent inside XxxxxCacheManager.update. An APM that highlighted this would have cut diagnosis time in half.
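Code-block latency monitoring can start very small. The wrapper below is a hypothetical sketch (not the team's APM): it times a named block with System.nanoTime and logs a warning whenever the block exceeds a threshold.

```java
import java.util.function.Supplier;

public class BlockTimer {
    // Run a block, returning its result; warn if it took longer than warnMillis.
    static <T> T timed(String name, long warnMillis, Supplier<T> block) {
        long start = System.nanoTime();
        try {
            return block.get();
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            if (elapsedMs > warnMillis) {
                System.err.println("[SLOW] " + name + " took " + elapsedMs + " ms");
            }
        }
    }

    public static void main(String[] args) {
        // Simulate a slow cache update; 80 ms of work against a 50 ms budget.
        int result = timed("XxxxxCacheManager.update", 50, () -> {
            try { Thread.sleep(80); } catch (InterruptedException ignored) { }
            return 42;
        });
        System.out.println(result); // prints a [SLOW] warning, then 42
    }
}
```

Wrapping the cache-update call in something like this would have pointed straight at the hot block instead of leaving the team to infer it from machine-level metrics.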
Law Three: Technical debt explodes when you least expect it.
The legacy Tair LDB, no longer maintained, became the hidden bomb that detonated under load.
老A said: many a P3 incident stems not from the architecture but from a single misplaced for loop. Reverence for code is a basic engineering virtue.
