How a Lua Bug Crashed Bilibili’s Load Balancer and What We Learned
On July 13, 2021, Bilibili’s OpenResty-based SLB pinned its CPUs at 100 % and stopped serving traffic. The cause was a bug in the Lua _gcd function, triggered when a service’s weight was set to the string “0”. The multi-hour incident was resolved by building new SLB clusters and disabling JIT compilation.
Incident Timeline
At 22:52 on July 13, 2021, the SRE team received alerts that the access layer for many services and domains was unavailable, causing Bilibili’s website and app to be inaccessible. Initial suspicion fell on the data center, network, L4 LB, and L7 SLB infrastructure.
By 22:57, an on‑call SRE discovered that the L7 SLB (built on OpenResty) was at 100 % CPU and unable to process requests, confirming the fault lay in the SLB itself.
Remote teammates logged in via a VPN green-channel to assist, and by 01:50 the next morning the majority of online services had been restored by building a new SLB cluster.
Mitigation Steps
Attempted a reload of the failing SLB – no improvement.
Cold-restarted the SLB with user traffic rejected – CPU remained at 100 %.
Rolled back recent Lua code changes, removed a custom WAF, and disabled new HTTP/2 support – none restored service.
Created a brand‑new SLB cluster, configured L4 LB and public IPs, and gradually shifted traffic from the broken cluster.
During the rebuild, the team added debug logs to suspect Lua functions, captured flame‑graph data, and identified the _gcd function in lua‑resty‑balancer as the hotspot.
Root Cause Analysis
The _gcd function received the string weight "0" from the service registry during a special deployment mode. Lua coerces numeric strings to numbers in arithmetic, so the modulo operation treated the string as the number 0 and produced nan. Because nan never compares equal to 0, the function's termination condition could not be met, which led to an infinite loop and 100 % CPU usage.
Disabling JIT compilation temporarily stopped the loop, but the underlying bug persisted until the offending deployment completed and the zero‑weight instance disappeared.
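The failure mode can be reproduced in a small model. The sketch below is Python standing in for Lua 5.1 / LuaJIT semantics, and it assumes that _gcd has the usual Euclid shape (return a when b == 0, otherwise recurse on b and a % b); the summary above does not show the function's source. The helpers lua_eq, lua_mod, and gcd_model are hypothetical names introduced only for this illustration, and the step limit exists so the example terminates.

```python
import math

def lua_eq(a, b):
    # Model of Lua's ==: it never coerces, so a string and a number
    # are always unequal, even "0" and 0.
    if isinstance(a, str) != isinstance(b, str):
        return False
    return a == b

def lua_mod(a, b):
    # Model of Lua's %: numeric strings are coerced to numbers first.
    a, b = float(a), float(b)
    # x % 0 yields nan; nan and inf operands are also modeled as nan
    # here, which is a simplification.
    if b == 0 or not (math.isfinite(a) and math.isfinite(b)):
        return math.nan
    return a - math.floor(a / b) * b

def gcd_model(a, b, max_steps=50):
    # Loop form of the assumed recursion:
    #   if b == 0 then return a end; return gcd(b, a % b)
    for step in range(max_steps):
        if lua_eq(b, 0):
            return a, step
        a, b = b, lua_mod(a, b)
    # Step budget exhausted: the unbounded version would keep going.
    return None, max_steps

print(gcd_model(12, 18))   # ordinary numeric weights terminate
print(gcd_model(12, 0))    # numeric zero hits the zero check at once
print(gcd_model(12, "0"))  # string zero never satisfies the zero check
```

In the third call the string "0" fails the equality check, the modulo turns it into nan, and every later step compares nan with 0, so the model runs out its step budget. In Lua a call in return position is a proper tail call, so the equivalent recursion would not overflow the stack; it would simply keep consuming CPU.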
Technical Findings
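Lua is dynamically typed: arithmetic operators coerce numeric strings to numbers, but equality comparison does not, so "0" == 0 is false while n % "0" is evaluated as n % 0 and yields nan.
The _gcd function did not validate the type of its arguments, so a weight of "0" slipped past its zero check and the recursion never reached its termination condition.
With the string weight "0", the modulo result was nan, and because nan never compares equal to 0, it propagated through every subsequent step and left the worker process spinning.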
Both the primary and multi‑active SLB clusters were overloaded by a traffic surge (connection count spiking to millions) during the incident, highlighting the importance of sufficient capacity and proper multi‑active routing.
Post‑mortem Improvements
Key actions include:
Separating SLB clusters by business domain to isolate failures.
Putting Lua code under version control with rapid rollback capability.
Automating SLB node provisioning, L4 LB IP allocation, and CDN integration to reduce new‑cluster creation time to under five minutes.
Enhancing multi‑active architecture: clearer business‑level metadata, unified routing rules, and automated traffic shifting.
Introducing rigorous testing for SLB input parameters and reviewing upstream open‑source libraries.
Establishing clearer incident response roles, a dedicated NOC‑like communication platform, and richer event‑query capabilities across applications, users, and platform components.
These measures aim to improve both technical resilience and operational efficiency for future incidents.
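As one illustration of testing SLB input parameters, weights can be normalized and checked at the boundary where registry data enters the balancer, before any arithmetic touches them. The sketch below is a minimal Python example under stated assumptions: parse_weight and weights_gcd are hypothetical helpers, not part of OpenResty or lua-resty-balancer, and the policy of excluding zero-weight instances from the gcd is an assumption rather than something the post-mortem specifies.

```python
import math
from functools import reduce

def parse_weight(raw):
    """Return a non-negative integer weight or raise ValueError."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(raw, bool):
        raise ValueError(f"weight must be a number, got {raw!r}")
    if isinstance(raw, str):
        text = raw.strip()
        # Accept plain ASCII digit strings only, such as "0" or "10".
        if not (text.isascii() and text.isdigit()):
            raise ValueError(f"weight is not a non-negative integer: {raw!r}")
        return int(text)
    if isinstance(raw, int):
        if raw < 0:
            raise ValueError(f"weight must not be negative: {raw!r}")
        return raw
    if isinstance(raw, float):
        # Rejects nan, inf, negatives, and fractional values.
        if math.isfinite(raw) and raw >= 0 and raw.is_integer():
            return int(raw)
    raise ValueError(f"unsupported weight value: {raw!r}")

def weights_gcd(raw_weights):
    """gcd over validated weights, ignoring zero-weight instances."""
    weights = [parse_weight(w) for w in raw_weights]
    positive = [w for w in weights if w > 0]
    if not positive:
        raise ValueError("no instance with a positive weight")
    return reduce(math.gcd, positive)

print(weights_gcd([4, "6", "0"]))
```

Converting every weight to an integer up front means a string "0" and a numeric 0 are handled identically, and values such as nan are rejected with an explicit error instead of reaching the balancing arithmetic.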
Conclusion
The incident demonstrated that a subtle Lua bug could cripple a critical load‑balancing layer, but systematic debugging, rapid cluster rebuild, and subsequent architectural refinements restored service and strengthened Bilibili’s high‑availability posture.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
