
Bilibili SLB Outage Postmortem (July 13, 2021): Timeline, Root Cause, and Improvements

On July 13, 2021, Bilibili's L7 SLB crashed after a recent Lua deployment set a balancer weight to the string "0", producing a NaN value that triggered an infinite loop and 100% CPU. The team responded with emergency restarts and a fresh cluster rollout, followed by long-term safeguards such as automated provisioning, stricter Lua validation, and enhanced multi-active disaster-recovery processes.

Bilibili Tech

On July 13, 2021 at 22:52, the SRE team received a flood of alerts indicating that the access layer (SLB/LB) for Bilibili services was unavailable. Users could not reach the website, the mobile app, or internal systems. Initial suspicion fell on the data center, the network, the L4 LB, or the L7 SLB infrastructure.

Incident Timeline

22:55 After connecting via VPN, remote engineers could not log into the internal authentication system, preventing them from viewing logs.

22:57 An on-call SRE discovered that CPU on the L7 SLB (built on OpenResty) was at 100%, confirming the fault lay in the SLB layer.

23:07-23:17 Engineers gained access via an emergency "green channel" and assembled the core team (SLB, L4 LB, CDN).

23:20‑23:55 Attempts to reload or cold‑restart the SLB failed; CPU remained at 100%.

23:23 Multi‑active SLB services began to recover as traffic pressure eased.

00:00‑01:50 A brand‑new SLB cluster was built, configured, and traffic was gradually shifted to it, restoring all core services.

Root Cause Investigation

Perf analysis showed the CPU hotspots were in Lua functions. A recent deployment had introduced Lua code interacting with the lua-resty-balancer module, and a special release mode occasionally set a container instance's weight to the string "0". The balancer's _gcd function received this string; because the string "0" does not equal the number 0 in Lua, the termination check failed, and the subsequent modulo operation produced nan, trapping the function in an infinite loop that drove CPU to 100%.

Even after disabling JIT compilation, the issue persisted until the weight‑zero condition disappeared.
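The failure mode can be reproduced outside OpenResty. The sketch below models it in Python, assuming two Lua semantics relevant here: == never coerces types (so "0" == 0 is false), while % coerces its operands and yields nan for x % 0 on floats. The names lua_mod, gcd, and max_iters are illustrative; the real lua-resty-balancer code has no iteration cap.

```python
import math

def lua_mod(a, b):
    # Approximates Lua's float %: operands are coerced to numbers,
    # and x % 0 (or any nan operand) yields nan instead of raising.
    a, b = float(a), float(b)
    if b == 0.0 or math.isnan(a) or math.isnan(b):
        return float("nan")
    return a - math.floor(a / b) * b

def gcd(a, b, max_iters=100):
    # Mirrors the shape of lua-resty-balancer's _gcd loop. In Lua, the
    # string "0" fails the b == 0 check (== does not coerce), so the
    # loop computes a % "0" -> nan, and nan != 0 keeps it spinning.
    # max_iters exists only so this demo terminates.
    iters = 0
    while b != 0:               # both "0" != 0 and nan != 0 are true
        if iters >= max_iters:
            return None         # the real code had no cap: 100% CPU
        a, b = b, lua_mod(a, b)
        iters += 1
    return a

print(gcd(6, 4))    # healthy weights: prints 2.0
print(gcd(6, "0"))  # the outage case: prints None (capped loop)
```

Because _gcd is tail-recursive in Lua, the runtime never even overflows the stack; the loop simply spins at full CPU, which matches the observed 100% utilization.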

Mitigation Steps

Reloaded and cold‑restarted the faulty SLB.

Rolled back recent Lua code and disabled the custom WAF.

Rebuilt a fresh SLB cluster and shifted traffic via CDN.

Temporarily disabled JIT compilation on all SLB nodes.

Collected core dumps for further analysis.

Long‑Term Improvements

Enhanced multi‑active architecture and clarified data‑center routing.

Automated SLB cluster provisioning, L4 LB IP allocation, and CDN IP updates (now under 5 minutes).

Implemented versioned Lua code management with quick rollback capability.

Added comprehensive Lua parameter validation and type checks.

Established regular fault‑injection drills and multi‑data‑center disaster‑recovery exercises.

Improved incident response process: defined incident commander roles, backup commanders, and a unified incident notification platform.
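As a concrete illustration of the parameter-validation point above, a guard along the following lines would reject a string or non-positive weight before it can reach the balancer's arithmetic. This is a hypothetical Python sketch; the production checks live in Lua, and validate_weight is an invented name.

```python
def validate_weight(raw):
    # Hypothetical guard: coerce, type-check, and range-check a balancer
    # weight before handing it to gcd/weighted round-robin logic.
    try:
        weight = int(raw)
    except (TypeError, ValueError):
        raise ValueError(f"weight must be an integer, got {raw!r}")
    if weight <= 0:
        # The July 13 trigger was exactly this case: a weight of "0".
        raise ValueError(f"weight must be positive, got {weight}")
    return weight

print(validate_weight("3"))   # prints 3 -- string input coerced safely
# validate_weight("0")        # raises ValueError instead of looping
```

Failing fast with an explicit error converts a silent infinite loop into a visible, attributable deployment failure.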

Conclusion

The outage highlighted the importance of robust multi‑active designs, automated provisioning, and thorough testing of core load‑balancing components. Subsequent releases will enforce stricter Lua code reviews, better weight handling, and tighter integration between SLB, L4 LB, and CDN teams.

High Availability, SRE, incident response, root cause analysis, load balancer, SLB