Postmortem of the July 13, 2021 Bilibili SLB Outage: Timeline, Root Cause, and Improvement Measures
This article details the July 13, 2021 Bilibili service outage caused by a CPU spike in the Lua-based SLB, covering the incident timeline, the root-cause analysis of a weight-zero bug, the mitigation steps including deployment of a new SLB, and the subsequent operational and architectural improvements.
Incident Overview
At 22:52 on July 13, 2021, the SRE team received a flood of alerts indicating that the access layer (Layer 4 LB and Layer 7 SLB) of Bilibili's services was unavailable; users could not open the website or the app homepage.
Initial Investigation
Remote colleagues could not log in to the internal authentication system after connecting to the VPN, which blocked access to monitoring and logs. At 22:57, an on-call SRE found that the Layer 7 SLB, built on OpenResty, was running at 100% CPU, confirming that the fault lay in the access layer.
Fault Mitigation Attempts
The team first tried reloading the SLB, then performed a cold restart; neither reduced CPU usage. Subsequent attempts to roll back recent Lua code, remove a custom WAF, revert recent retry-logic changes, and disable HTTP/2 also failed to restore service.
New Source SLB Creation
At 00:00 the team began building a new SLB cluster, configuring Layer 4 LB and public IPs; traffic was gradually shifted to the new cluster, and by 01:50 online business was fully restored.
Root Cause Identification
Profiling showed CPU hotspots in Lua functions. The _gcd function in the lua-resty-balancer module had received the string "0" as a weight; because a string never compares equal to the number 0 in Lua, the recursion's base case was never reached, the modulo operation produced nan, and with JIT compilation enabled the function spun in an infinite loop. The underlying trigger was a special release mode that temporarily set a container's weight to "0".
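The failure mode can be sketched outside Lua. The following Python sketch is hypothetical: it mirrors the shape of lua-resty-balancer's recursive _gcd while simulating the relevant Lua semantics explicitly (a string "0" never equals the number 0, and modulo by zero yields nan under LuaJIT), with a depth guard standing in for the real infinite loop:

```python
import math

def lua_equals_zero(b):
    # In Lua, "0" == 0 is false (different types), and nan == 0 is false.
    return isinstance(b, (int, float)) and b == 0

def gcd(a, b, depth=0, max_depth=50):
    """Mirrors the recursive _gcd; the depth guard stands in for the infinite loop."""
    if depth >= max_depth:
        return None  # in production this recursion never returned, pinning the CPU
    if lua_equals_zero(b):
        return a
    # Lua coerces "0" to 0 for arithmetic, and x % 0 yields nan under LuaJIT.
    a_num, b_num = float(a), float(b)
    r = math.nan if b_num == 0 else a_num % b_num
    return gcd(b, r, depth + 1, max_depth)

print(gcd(12, 8))    # normal numeric weights terminate: 4.0
print(gcd(5, "0"))   # string weight "0" never hits the base case: None
```

Once b becomes nan, every subsequent comparison against 0 is false and every modulo yields nan again, so the recursion can never bottom out.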
Background
Bilibili migrated its SLB from Tengine to OpenResty in September 2019, using a custom service-discovery module that stores service information in Nginx shared memory. A later feature allowed dynamic weight adjustment via the registration center, and it was this path that introduced the weight-zero bug.
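One plausible way a string weight slips through such a path is a registry payload whose values arrive as strings and are stored without type validation. The sketch below is purely illustrative (the payload shape and field names are assumptions, not Bilibili's actual schema) and shows the kind of defensive coercion that would have stopped "0" at the boundary:

```python
import json

# Hypothetical registration-center payload; field names are illustrative only.
# A release mode writes the weight as the string "0" instead of the number 0.
payload = json.loads('{"instance": "10.0.0.1:8080", "weight": "0"}')

def load_weight(raw):
    # Defensive parsing at the ingestion boundary: coerce to int and validate,
    # so downstream balancing code never sees a string weight.
    w = int(raw)
    if w < 0:
        raise ValueError("weight must be non-negative")
    return w

weight = load_weight(payload["weight"])
print(weight, type(weight).__name__)  # 0 int
```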
Problem Analysis
The authentication system relied on cookies from domains served behind the faulty SLB, so engineers could not log in to internal systems. Multi-active SLBs in other data centers also became overloaded by traffic spikes, amplified by CDN retries.
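The overload of the surviving data centers can be understood as retry amplification: each layer that retries failed requests multiplies the load hitting whatever capacity remains. A back-of-the-envelope sketch, with illustrative numbers that are not from the incident report:

```python
def retry_amplification(base_qps, retries_per_layer):
    # Worst case: every layer retries each failed request, so the load
    # multiplier is the product of (1 + retries) across layers.
    multiplier = 1
    for r in retries_per_layer:
        multiplier *= 1 + r
    return base_qps * multiplier

# e.g. clients retry 3x and the CDN retries 2x against the remaining SLBs:
print(retry_amplification(100_000, [3, 2]))  # 1200000
```

Even modest per-layer retry counts can push an order-of-magnitude traffic spike onto the healthy clusters, which is why retry budgets and backoff matter at every tier.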
Optimization and Improvement
The team outlined technical improvements: strengthening the multi-active infrastructure, establishing unified multi-active metadata management, improving cut-over processes, automating SLB provisioning, tightening testing and code review, and expanding monitoring and event-analysis capabilities.
Fault Drills and Emergency Response
Future plans include full-stack fault drills, better coordination among the SLB, Layer 4 LB, and CDN teams, and the establishment of clear on-call leadership and dedicated incident-communication platforms.
Summary
The incident validated the value of the multi-active high-availability architecture while exposing critical gaps in operational processes, and it drove a series of concrete improvements to prevent similar outages.
High Availability Architecture
Official account for High Availability Architecture.