Bilibili SLB Outage Postmortem (July 13, 2021): Timeline, Root Cause, and Improvements
On July 13, 2021, Bilibili’s L7 SLB layer crashed after a recent Lua deployment set a balancer weight to the string “0”. The mismatched type produced a NaN value inside the balancer, triggering an infinite loop and 100% CPU. The team responded with emergency restarts and a fresh cluster rollout, then invested in long‑term safeguards such as automated provisioning, stricter Lua validation, and enhanced multi‑active disaster‑recovery processes.
On July 13, 2021 at 22:52, the SRE team was flooded with alerts indicating that the access layer (SLB/LB) for Bilibili services was unavailable. Users could not reach the website, mobile app, or internal systems. Initial suspicion fell on the data center, the network, the L4 LB, or the L7 SLB infrastructure.
Incident Timeline
22:55 Remote engineers could not log into the internal authentication system after VPN login, preventing them from viewing logs.
22:57 An on‑call SRE discovered that the L7 SLB (built on OpenResty) CPU was at 100%, confirming the fault was in the SLB layer.
23:07‑23:17 Engineers gained access via a green channel and gathered the core team (SLB, L4 LB, CDN).
23:20‑23:55 Attempts to reload or cold‑restart the SLB failed; CPU remained at 100%.
23:23 Multi‑active SLB services began to recover as traffic pressure eased.
00:00‑01:50 (July 14) A brand‑new SLB cluster was built and configured, and traffic was gradually shifted onto it, restoring all core services.
Root Cause Investigation
Perf analysis showed CPU hotspots in Lua functions. The recent deployment introduced Lua code that interacted with the lua‑resty‑balancer module. A special release mode occasionally set a container instance weight to the string "0" rather than the number 0. The balancer’s _gcd function received this string: in Lua a string never compares equal to a number, so the `b == 0` termination check failed, while the modulo operation coerced the string and produced nan. Because nan also never equals 0, the recursion could never terminate, driving CPU to 100%.
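The failure mode can be illustrated outside Lua as well. The sketch below models _gcd in Python (the real function lives in lua‑resty‑balancer and is written in Lua); instead of Lua's string coercion we inject float('nan') directly, since the essential property is the same in both languages: NaN never compares equal to 0, and any modulo involving NaN stays NaN, so the loop's exit condition is unreachable. The iteration guard exists only so the demo terminates.

```python
def gcd(a, b):
    """Euclidean GCD, as used by weighted round-robin balancers.
    Mirrors lua-resty-balancer's _gcd, which recurses until b == 0."""
    iterations = 0
    while b != 0:            # NaN != 0 is always true, so NaN never exits
        a, b = b, a % b      # any modulo involving NaN yields NaN again
        iterations += 1
        if iterations > 1000:  # demo-only guard; the real loop spins forever
            return None
    return a

print(gcd(5, 3))             # normal case: terminates, prints 1
print(gcd(5, float('nan')))  # NaN case: guard fires, prints None
```

With a healthy numeric weight the loop converges in a handful of steps; with NaN it would spin at 100% CPU indefinitely, which matches the perf hotspot observed during the incident.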
Even after disabling JIT compilation, the issue persisted until the weight‑zero condition disappeared.
Mitigation Steps
Reloaded and cold‑restarted the faulty SLB.
Rolled back recent Lua code and disabled the custom WAF.
Rebuilt a fresh SLB cluster and shifted traffic via CDN.
Temporarily disabled JIT compilation on all SLB nodes.
Collected core dumps for further analysis.
Long‑Term Improvements
Enhanced multi‑active architecture and clarified data‑center routing.
Automated SLB cluster provisioning, L4 LB IP allocation, and CDN IP updates (now under 5 minutes).
Implemented versioned Lua code management with quick rollback capability.
Added comprehensive Lua parameter validation and type checks.
Established regular fault‑injection drills and multi‑data‑center disaster‑recovery exercises.
Improved incident response process: defined incident commander roles, backup commanders, and a unified incident notification platform.
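The parameter‑validation improvement above can be sketched as follows. This is a hypothetical helper in Python for illustration only (the actual fix belongs in the Lua configuration layer); it shows the core idea: coerce a weight once at the boundary, and reject anything non‑numeric, non‑finite, or non‑positive before it can reach the balancer's GCD loop.

```python
import math

def validate_weight(weight):
    """Coerce an instance weight to a safe positive integer.
    Rejects unparseable strings, non-finite values (NaN, inf),
    and non-positive weights instead of passing them downstream."""
    try:
        value = float(weight)  # accept "3" or 3, like Lua's coercion
    except (TypeError, ValueError):
        raise ValueError(f"weight {weight!r} is not numeric")
    if not math.isfinite(value) or value <= 0:
        raise ValueError(f"weight {weight!r} must be a positive finite number")
    return int(value)

print(validate_weight("3"))  # prints 3
# validate_weight("0") raises ValueError instead of poisoning the balancer
```

Failing fast at the boundary turns a silent infinite loop into an explicit, attributable configuration error.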
Conclusion
The outage highlighted the importance of robust multi‑active designs, automated provisioning, and thorough testing of core load‑balancing components. Subsequent releases will enforce stricter Lua code reviews, better weight handling, and tighter integration between SLB, L4 LB, and CDN teams.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.