Bilibili SLB Outage Postmortem (July 13, 2021): Timeline, Root Cause, and Improvements
On July 13, 2021, Bilibili’s L7 SLB layer crashed after a recent Lua deployment set a balancer weight to the string “0”. The mismatched type produced a NaN value inside the balancer, triggering an infinite loop and 100% CPU. The team responded with emergency restarts and a fresh cluster rollout, then invested in long‑term safeguards such as automated provisioning, stricter Lua validation, and enhanced multi‑active disaster‑recovery processes.
On July 13, 2021 at 22:52, the SRE team was flooded with alerts indicating that the access layer (SLB/LB) for Bilibili services was unavailable. Users could not reach the website, mobile app, or internal systems. Initial suspicion fell on the data center, the network, the L4 LB, or the L7 SLB infrastructure.
Incident Timeline
22:55 Remote engineers could not log into the internal authentication system after VPN login, preventing them from viewing logs.
22:57 An on‑call SRE discovered that the L7 SLB (built on OpenResty) CPU was at 100%, confirming the fault was in the SLB layer.
23:07‑23:17 Engineers gained access via a green channel and gathered the core team (SLB, L4 LB, CDN).
23:20‑23:55 Attempts to reload or cold‑restart the SLB failed; CPU remained at 100%.
23:23 Multi‑active SLB services began to recover as traffic pressure eased.
00:00‑01:50 (July 14) A brand‑new SLB cluster was built and configured, and traffic was gradually shifted onto it, restoring all core services.
Root Cause Investigation
Perf analysis showed CPU hotspots in Lua functions. The recent deployment introduced Lua code that interacted with the lua‑resty‑balancer module. A special release mode occasionally set a container instance weight to the string "0" rather than the number 0. The balancer’s _gcd function received this string: in Lua a string never compares equal to a number, so the `b == 0` termination check failed, while the modulo operation coerced the string and produced nan. Because nan also never equals 0, the recursion could never terminate, driving CPU to 100%.
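The failure mode can be illustrated outside Lua as well. The sketch below models _gcd in Python (the real function lives in lua‑resty‑balancer and is written in Lua); instead of Lua's string coercion we inject float('nan') directly, since the essential property is the same in both languages: NaN never compares equal to 0, and any modulo involving NaN stays NaN, so the loop's exit condition is unreachable. The iteration guard exists only so the demo terminates.

```python
def gcd(a, b):
    """Euclidean GCD, as used by weighted round-robin balancers.
    Mirrors lua-resty-balancer's _gcd, which recurses until b == 0."""
    iterations = 0
    while b != 0:            # NaN != 0 is always true, so NaN never exits
        a, b = b, a % b      # any modulo involving NaN yields NaN again
        iterations += 1
        if iterations > 1000:  # demo-only guard; the real loop spins forever
            return None
    return a

print(gcd(5, 3))             # normal case: terminates, prints 1
print(gcd(5, float('nan')))  # NaN case: guard fires, prints None
```

With a healthy numeric weight the loop converges in a handful of steps; with NaN it would spin at 100% CPU indefinitely, which matches the perf hotspot observed during the incident.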
Even after disabling JIT compilation, the issue persisted until the weight‑zero condition disappeared.
Mitigation Steps
Reloaded and cold‑restarted the faulty SLB.
Rolled back recent Lua code and disabled the custom WAF.
Rebuilt a fresh SLB cluster and shifted traffic via CDN.
Temporarily disabled JIT compilation on all SLB nodes.
Collected core dumps for further analysis.
Long‑Term Improvements
Enhanced multi‑active architecture and clarified data‑center routing.
Automated SLB cluster provisioning, L4 LB IP allocation, and CDN IP updates (now under 5 minutes).
Implemented versioned Lua code management with quick rollback capability.
Added comprehensive Lua parameter validation and type checks.
Established regular fault‑injection drills and multi‑data‑center disaster‑recovery exercises.
Improved incident response process: defined incident commander roles, backup commanders, and a unified incident notification platform.
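The parameter‑validation improvement above can be sketched as follows. This is a hypothetical helper in Python for illustration only (the actual fix belongs in the Lua configuration layer); it shows the core idea: coerce a weight once at the boundary, and reject anything non‑numeric, non‑finite, or non‑positive before it can reach the balancer's GCD loop.

```python
import math

def validate_weight(weight):
    """Coerce an instance weight to a safe positive integer.
    Rejects unparseable strings, non-finite values (NaN, inf),
    and non-positive weights instead of passing them downstream."""
    try:
        value = float(weight)  # accept "3" or 3, like Lua's coercion
    except (TypeError, ValueError):
        raise ValueError(f"weight {weight!r} is not numeric")
    if not math.isfinite(value) or value <= 0:
        raise ValueError(f"weight {weight!r} must be a positive finite number")
    return int(value)

print(validate_weight("3"))  # prints 3
# validate_weight("0") raises ValueError instead of poisoning the balancer
```

Failing fast at the boundary turns a silent infinite loop into an explicit, attributable configuration error.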
Conclusion
The outage highlighted the importance of robust multi‑active designs, automated provisioning, and thorough testing of core load‑balancing components. Subsequent releases will enforce stricter Lua code reviews, better weight handling, and tighter integration between SLB, L4 LB, and CDN teams.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.