
How a Lua Bug Crashed Bilibili’s Load Balancer and What We Learned

On July 13, 2021, Bilibili’s OpenResty‑based SLB was pinned at 100 % CPU by a bug in the Lua _gcd function, triggered when a service’s weight was set to the string “0”. The outage lasted several hours and was mitigated by building a new SLB cluster and disabling JIT compilation.

ITPUB

Incident Timeline

At 22:52 on July 13, 2021, the SRE team received alerts that the access layer for many services and domains was unavailable, causing Bilibili’s website and app to be inaccessible. Initial suspicion fell on the data center, network, L4 LB, and L7 SLB infrastructure.

By 22:57, an on‑call SRE discovered that the L7 SLB (built on OpenResty) was at 100 % CPU and unable to process requests, confirming the fault lay in the SLB itself.

Remote teammates logged in through a VPN green channel to assist, and by 01:50 on July 14 most online services had been restored by building a new SLB cluster.

Mitigation Steps

Attempted a reload of the failing SLB – no improvement.

Cold‑restarted the SLB while rejecting user traffic – CPU remained at 100 %.

Rolled back recent Lua code changes, removed a custom WAF, and disabled new HTTP/2 support – none restored service.

Created a brand‑new SLB cluster, configured L4 LB and public IPs, and gradually shifted traffic from the broken cluster.

During the rebuild, the team added debug logs to the suspected Lua functions, captured flame‑graph data, and identified the _gcd function in lua‑resty‑balancer as the hotspot.

Root Cause Analysis

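The _gcd function received a string weight "0" from the service registry during a special deployment mode. Lua’s arithmetic operators coerce a numeric string to a number, but an equality comparison does not, so a check against the number 0 does not recognize the string "0" as zero. The function therefore went on to perform a modulo by zero, which produced nan. Because nan never equals 0, the termination condition was never met, which caused an infinite loop and 100 % CPU usage.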

Disabling JIT compilation temporarily stopped the loop, but the underlying bug persisted until the offending deployment completed and the zero‑weight instance disappeared.
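
The failure mode described in this section can be reproduced in a small model. The sketch below is written in Python rather than Lua, so lua_mod is a hypothetical helper that imitates how the % operator coerces numeric strings and yields nan for a zero divisor. gcd_unchecked mirrors the general shape of a Euclidean GCD whose only exit is a b == 0 check. It is an assumed shape for illustration, not code copied from lua‑resty‑balancer. A step cap stands in for the infinite loop so that the example terminates.

```python
import math


def lua_mod(a, b):
    """Hypothetical model of Lua's `a % b` on floats, for this example only."""
    a, b = float(a), float(b)  # numeric strings are coerced, as in Lua arithmetic
    if b == 0.0 or not (math.isfinite(a) and math.isfinite(b)):
        return math.nan  # modulo by zero (or by nan) yields nan instead of an error
    return a - math.floor(a / b) * b


def gcd_unchecked(a, b, max_steps=50):
    """Euclidean GCD whose only exit is `b == 0`.

    Returns (result, steps). result is None when the step cap is reached,
    which stands in for the loop that never ends.
    """
    steps = 0
    while steps < max_steps:
        if b == 0:  # "0" == 0 is False in Python, just as it is in Lua
            return a, steps
        a, b = b, lua_mod(a, b)
        steps += 1
    return None, steps


print(gcd_unchecked(12, 18))   # numeric weights: terminates normally
print(gcd_unchecked(10, 0))    # numeric zero: exits on the first check
print(gcd_unchecked(10, "0"))  # string zero: b becomes nan and never equals 0
```

With the string "0", the first pass computes a modulo by zero and every later pass compares nan against 0, so the exit condition is unreachable regardless of how many steps are allowed.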

Technical Findings

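Lua is dynamically typed and coerces numeric strings in arithmetic, so the string "0" silently behaves as the number 0 there; a modulo by zero then yields nan instead of raising an error.

The _gcd function did not validate its inputs, and its termination check did not treat the string "0" as zero, so the loop could enter a state it never left.

When the weight was the string "0", the modulo produced nan, which propagated through every subsequent step and caused the worker process to spin.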

Both the primary and multi‑active SLB clusters were overloaded by a traffic surge (connection count spiking to millions) during the incident, highlighting the importance of sufficient capacity and proper multi‑active routing.

Post‑mortem Improvements

Key actions include:

Separating SLB clusters by business domain to isolate failures.

Keeping Lua code under version control with rapid rollback capability.

Automating SLB node provisioning, L4 LB IP allocation, and CDN integration to reduce new‑cluster creation time to under five minutes.

Enhancing multi‑active architecture: clearer business‑level metadata, unified routing rules, and automated traffic shifting.

Introducing rigorous testing for SLB input parameters and reviewing upstream open‑source libraries.

Establishing clearer incident response roles, a dedicated NOC‑like communication platform, and richer event‑query capabilities across applications, users, and platform components.

These measures aim to improve both technical resilience and operational efficiency for future incidents.
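
The input‑parameter testing item above could start from a guard at the boundary where registry data enters the balancer. The following is a minimal sketch in Python, not Bilibili’s actual fix: the names normalize_weight and gcd_of_weights, and the max_weight bound, are assumptions made for illustration.

```python
import math
from functools import reduce


def normalize_weight(raw, max_weight=10_000):
    """Convert a registry-supplied weight to a non-negative int, or raise ValueError.

    The function name and the max_weight bound are hypothetical.
    """
    if isinstance(raw, bool):  # bool is an int subclass in Python; reject it explicitly
        raise ValueError(f"weight must be numeric, got {raw!r}")
    try:
        value = float(raw)  # accepts ints, floats, and numeric strings such as "0"
    except (TypeError, ValueError):
        raise ValueError(f"weight must be numeric, got {raw!r}") from None
    if not math.isfinite(value) or value != int(value):
        raise ValueError(f"weight must be a finite integer, got {raw!r}")
    if not 0 <= value <= max_weight:
        raise ValueError(f"weight out of range [0, {max_weight}], got {raw!r}")
    return int(value)


def gcd_of_weights(raw_weights):
    """GCD over validated weights; zero-weight instances are excluded first."""
    weights = [normalize_weight(w) for w in raw_weights]
    positive = [w for w in weights if w > 0]
    return reduce(math.gcd, positive, 0)  # 0 when no instance carries weight


print(normalize_weight("0"))
print(gcd_of_weights(["0", 4, "6"]))
```

Normalizing once at the boundary means downstream arithmetic only ever sees integers, so a string "0" or a nan is rejected or converted before it can reach a loop condition.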

Conclusion

The incident demonstrated that a subtle Lua bug could cripple a critical load‑balancing layer, but systematic debugging, rapid cluster rebuild, and subsequent architectural refinements restored service and strengthened Bilibili’s high‑availability posture.

[Figure: SLB architecture diagram]
[Figure: Incident flow diagram]