How a Lua Bug Crashed Bilibili’s Load Balancer and What We Learned
In July 2021, a sudden spike to 100% CPU in Bilibili's OpenResty‑based SLB caused widespread service outages. The emergency response rebuilt the load‑balancer clusters, traced the fault to a bug in a Lua _gcd function triggered by a weight passed as the string "0", and led to extensive operational and architectural improvements.
Background
Bilibili migrated its service‑discovery module from Tengine to OpenResty in September 2019, using Lua and the lua‑resty‑balancer module to select upstream nodes from shared memory. The system ran stably for nearly two years before the incident.
Incident Timeline
22:55 Remote engineers could not log into the internal authentication system after VPN login.
22:57 The on‑call SRE found the Layer 7 SLB (built on OpenResty) at 100% CPU and isolated the fault to the access‑layer SLB.
23:07‑23:55 Attempts to reload, cold‑restart, and roll back SLB configurations failed; traffic overload was suspected.
00:00‑01:50 A brand‑new SLB cluster was provisioned, its Layer 4 load balancer and public IPs were configured, and traffic was gradually shifted to the new cluster, restoring core services.
01:10‑01:58 Flame‑graph analysis pinpointed hot spots in the lua‑resty‑balancer module, especially the _gcd function.
01:59‑02:07 Disabling JIT compilation temporarily stopped the CPU spike and a core dump was saved for later analysis.
11:40‑14:30 Offline reproduction confirmed that a container with weight="0" caused the _gcd function to receive a string "0", leading to a nan result and an infinite loop.
Root Cause
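The _gcd function lacked type checking. During a special release mode, a service instance's weight was temporarily set to the string "0". Lua treats values of different types as unequal, so the function's termination check against the number 0 never succeeded, while the modulo operation coerced "0" to a number and produced nan. Every later iteration then operated on nan, creating an infinite loop that drove the SLB worker CPU to 100%.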
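The failure mode can be modeled outside OpenResty. The following Python sketch is an emulation under stated assumptions, not the lua‑resty‑balancer source: `lua_eq`, `lua_tonumber`, `lua_mod`, and `gcd_steps` are hypothetical helpers that mimic how Lua compares and coerces values, and the loop is capped at `max_steps` so the demonstration terminates.

```python
import math

def lua_eq(a, b):
    """Model Lua's ==: a string and a number are never equal (no coercion)."""
    if isinstance(a, str) != isinstance(b, str):
        return False
    return a == b

def lua_tonumber(x):
    """Model Lua's arithmetic coercion: numeric strings become numbers."""
    return float(x)

def lua_mod(a, b):
    """Model a % b as a - floor(a / b) * b.

    Assumption: a zero divisor or any non-finite operand yields nan.
    """
    a, b = lua_tonumber(a), lua_tonumber(b)
    if b == 0 or not (math.isfinite(a) and math.isfinite(b)):
        return math.nan
    return a - math.floor(a / b) * b

def gcd_steps(a, b, max_steps=1000):
    """Euclid's algorithm with no type check on its arguments.

    Returns (result, steps), or (None, max_steps) if the termination
    check never succeeds within max_steps iterations.
    """
    for step in range(max_steps):
        if lua_eq(b, 0):  # the only exit condition
            return a, step
        a, b = b, lua_mod(a, b)
    return None, max_steps
```

With numeric weights, `gcd_steps(4, 6)` finishes after a few iterations. With `gcd_steps(4, "0")`, the exit check never succeeds: the string `"0"` is not equal to the number `0`, and every remainder after that is nan, which is not equal to `0` either.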
Mitigation During the Incident
Rebuilt a fresh SLB cluster and migrated traffic via CDN.
Temporarily disabled LuaJIT's JIT compilation across all SLB nodes.
Collected a core dump for post‑mortem analysis.
Post‑Incident Improvements
Prohibited the release mode that could set weight to zero.
Modified the balancer code to ignore zero‑weight entries from the registry.
Separated SLB clusters per business unit and isolated public IPs.
Automated SLB cluster provisioning, reducing full‑stack initialization time to under five minutes.
Enhanced monitoring of connection counts and performed stress testing of CDN back‑origin timeouts.
Introduced a version‑controlled Lua code repository and a rapid rollback mechanism.
Established a multi‑active (multi‑zone) architecture governance platform to centralize routing rules and metadata.
Implemented regular fault‑response drills, including simulated CDN and single‑zone failures.
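The balancer‑side fix listed above, ignoring zero‑weight entries from the registry, amounts to sanitizing input before any weight arithmetic runs. The sketch below is a minimal illustration in Python, not Bilibili's actual patch: the `sanitize_weights` and `weights_gcd` names, the dictionary shape of the registry data, and the policy of also dropping non‑integer or unparseable weights are all assumptions.

```python
import math
from functools import reduce

def sanitize_weights(registry):
    """Return {node: int weight}, keeping only positive integer weights."""
    clean = {}
    for node, raw in registry.items():
        try:
            weight = float(raw)  # accepts 3 and "3"; rejects None and "abc"
        except (TypeError, ValueError):
            continue
        # Drop nan/inf, zero or negative weights, and fractional weights.
        if not math.isfinite(weight) or weight <= 0 or weight != int(weight):
            continue
        clean[node] = int(weight)
    return clean

def weights_gcd(weights):
    """Greatest common divisor of the sanitized weights, or 0 if none remain."""
    return reduce(math.gcd, weights.values(), 0)
```

Because every value that reaches `weights_gcd` is already a positive integer, a string such as `"0"` can no longer reach the GCD computation. For example, a registry of `{"a": 4, "b": "0", "c": "6"}` is reduced to the nodes `a` and `c` before the divisor is computed.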
Key Takeaways
The incident demonstrated the importance of type safety in dynamically typed languages, the need for robust multi‑active traffic routing, and the value of automated, end‑to‑end incident response workflows. By addressing both technical and organizational gaps, Bilibili improved the resilience of its core services.
dbaplus Community