
How a Lua Bug Crashed Bilibili’s Load Balancer and the Lessons Learned

A detailed post‑mortem of the July 2021 Bilibili outage reveals how a Lua‑induced CPU spike in the OpenResty‑based SLB caused widespread service disruption, the step‑by‑step emergency response, root‑cause analysis, and the subsequent architectural and operational improvements to prevent recurrence.

Programmer DD

Darkest Moment

On July 13, 2021 at 22:52, the SRE team received a flood of alerts that the access layer was unavailable for both services and domains. Users reported that Bilibili could not be reached at all, including the app homepage. Initial suspicion fell on the data center, the network, the L4 LB, or the L7 SLB, and an emergency voice conference was convened.

Initial Diagnosis

22:55 Remote colleagues logged into the VPN but could not access the internal authentication system, preventing them from viewing monitoring and logs.

22:57 An on‑call SRE, who did not need VPN, discovered that the L7 SLB (built on OpenResty) CPU was at 100%, confirming the fault lay in the access‑layer SLB.

23:07‑23:17 Remote staff used an emergency ("green") channel to log into the internal systems, which brought the core SLB, L4 LB, and CDN engineers into the response.

Fault Mitigation

23:20‑23:55 The SLB team attempted a reload and cold restart, but CPU remained at 100%. Multi‑active data‑center SLBs also showed timeouts, though their CPU was not overloaded.

Profiling with perf showed the CPU hotspots were in Lua functions, so recent Lua code changes were rolled back.

A custom Lua WAF was removed and the SLB restarted, but service did not recover.

Recent retry‑logic changes in balancer_by_lua were reverted without success.

HTTP/2 support was disabled, again with no recovery.

New Origin SLB

00:00‑01:50 A brand‑new SLB cluster was built, configured with L4 LB and public IPs, tested, and traffic was gradually shifted. By 01:50, online services were largely restored.

Root‑Cause Analysis

11:40‑12:30 (July 14) Lab reproduction confirmed that the bug persisted even with JIT compilation disabled. The trigger was a special release mode that temporarily set a container instance's weight to "0".

13:24‑14:30 The platform banned the release mode, SLB code was modified to ignore the weight from the registry, and the fix was rolled out across environments.
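The SLB‑side change can be pictured as input validation at the boundary between the registry and the balancer. The following is a minimal Python sketch, not Bilibili's code (which is Lua and, per the timeline, simply ignores the registry weight). The names `normalize_weight` and `build_upstream` are hypothetical, and the policy shown (drop nodes with a non‑positive weight, fall back to 1 for malformed values) is an assumption chosen for illustration.

```python
def normalize_weight(raw, default=1):
    """Return a positive int weight, or None if the node should get no traffic.

    Hypothetical policy: malformed values fall back to `default`,
    zero or negative weights exclude the node.
    """
    try:
        weight = int(str(raw).strip())
    except ValueError:
        return default
    return weight if weight > 0 else None


def build_upstream(nodes):
    """Keep only nodes with a usable weight; every stored weight is an int."""
    upstream = {}
    for addr, raw in nodes.items():
        weight = normalize_weight(raw)
        if weight is not None:
            upstream[addr] = weight
    return upstream


if __name__ == "__main__":
    registry = {"10.0.0.1:80": "0", "10.0.0.2:80": "3", "10.0.0.3:80": 2}
    print(build_upstream(registry))
```

With this shape, a weight of "0" never reaches the balancing arithmetic, and nothing downstream has to reason about strings.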

Underlying Cause

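Bilibili migrated from Tengine to OpenResty in September 2019, using Lua for service discovery and the lua‑resty‑balancer module to select upstream nodes. In the special release mode, the registry delivered an instance weight of "0" as a string rather than a number. The module's _gcd function performed no type check, so its stop condition, which compares against the number 0, never matched the string "0". In the arithmetic that followed, Lua coerced the string to the number 0, and n % 0 evaluated to nan. Because nan never equals 0 either, _gcd kept calling itself without end and drove the CPU to 100%.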
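The failure mode can be reproduced in a small model. The sketch below is Python, not the module's Lua source. `lua_eq` and `lua_mod` are hypothetical helpers that imitate the relevant Lua 5.1/LuaJIT behaviors (equality is false across types, strings are coerced to numbers in arithmetic, modulo by zero yields nan), and `gcd_steps` assumes a Euclid‑style gcd whose only stop test is `b == 0`. A step cap stands in for the endless loop.

```python
import math


def lua_eq(a, b):
    """Model Lua `==`: a string never equals a number; nan equals nothing."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return a == b
    return type(a) is type(b) and a == b


def lua_mod(a, b):
    """Model LuaJIT / Lua 5.1 `a % b` on doubles, with string coercion."""
    a = float(a) if isinstance(a, str) else a
    b = float(b) if isinstance(b, str) else b
    if b == 0 or math.isnan(a) or math.isnan(b):
        return math.nan  # modulo by zero (or with nan) gives nan
    return a - math.floor(a / b) * b


def gcd_steps(a, b, max_steps=1000):
    """Loop form of a tail-recursive gcd.

    Returns (gcd, steps), or (None, max_steps) if the stop test never fires.
    """
    for step in range(max_steps):
        if lua_eq(b, 0):
            return a, step
        a, b = b, lua_mod(a, b)
    return None, max_steps


if __name__ == "__main__":
    print(gcd_steps(12, 18))   # numeric weights terminate
    print(gcd_steps(10, "0"))  # string "0" never reaches the stop test
```

In the model, a numeric 0 stops immediately, while the string "0" slips past the stop test once and turns every later value into nan, which is why the loop never exits.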

Root cause diagram

Problem Analysis

1. Users could not log into the internal authentication system because one of the domains used for cookie authentication was behind the failing SLB.

2. Multi‑active SLBs were overloaded due to a 4‑fold traffic surge and a massive increase in connections, exposing insufficient capacity.

Login flow diagram

3. Building a new origin SLB took a long time because the process required coordination among three teams (SLB, L4 LB, CDN) and had never been rehearsed end to end.

Multi‑active overload diagram

Improvement Directions

1. Multi‑Active Architecture

Clarify data‑center and business mapping.

Enable user‑attribute based routing in CDN.

Support write operations in multi‑active setups.

Enhance storage component synchronization.

2. SLB Governance

Separate SLB clusters per business unit and assign dedicated public IPs.

Move non‑SLB capabilities to API Gateway.

Version‑control Lua code with rapid rollback.

Automate node provisioning and L4 LB IP allocation (now under 5 minutes).

Increase SLB capacity to lower CPU usage to ~15%.
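As a rough illustration of why a low baseline matters, the arithmetic below relates baseline utilization to a traffic surge. It is a sketch under a strong assumption: that load scales linearly with traffic, which the article does not state. The 4‑fold factor is the surge reported for the multi‑active SLBs during the incident; the function names are hypothetical.

```python
def peak_utilization(baseline, surge_factor):
    """Utilization after a surge, assuming load scales linearly with traffic."""
    return baseline * surge_factor


def max_safe_baseline(surge_factor, ceiling=1.0):
    """Highest baseline that stays at or below `ceiling` after the surge."""
    return ceiling / surge_factor


if __name__ == "__main__":
    print(peak_utilization(0.15, 4))  # baseline of 15% under a 4x surge
    print(max_safe_baseline(4))       # baseline limit for a 4x surge
```

Under that assumption, a 15% baseline leaves room for a 4‑fold surge (about 60% utilization), while any baseline above 25% would saturate.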

3. Fault Drills

Simulate CDN source‑site failures and validate multi‑active disaster recovery.

Shift traffic gradually (gray release) to test failover paths.

Practice single‑data‑center outages with the multi‑active control platform.

4. Emergency Response

Define clear roles for incident commander and responders.

Build an easy‑to‑use incident announcement platform.

Enhance event‑analysis capabilities across applications, users, platforms, and components.

Conclusion

The outage quickly became a nationwide trending topic, which added to the pressure on the technical staff. By reflecting on the incident, Bilibili aims to strengthen its multi‑active infrastructure, SLB reliability, and incident response processes.
