How a Lua Bug Crashed Bilibili’s Load Balancer and What We Learned

This post‑mortem details the July 2021 Bilibili outage caused by a Lua bug in the OpenResty‑based SLB, describing the timeline, root‑cause analysis, mitigation steps, and the technical and organizational improvements implemented to prevent similar incidents.

Su San Talks Tech

Jul 13, 2022

Initial Identification

On July 13, 2021 at 22:52, the SRE team received alerts that the access layer was unavailable for many services and domains. Users reported that Bilibili (B站) was inaccessible on both web and app. The team initially suspected a problem in the data center, the network, the L4 LB, or the L7 SLB, and convened an emergency voice conference.

Early Timeline

22:55 Remote colleagues connected to the VPN but could not log in to the internal authentication system, which blocked their access to monitoring and logs.

22:57 An on‑call SRE discovered that the L7 SLB (built on OpenResty) was at 100% CPU, confirming the fault lay in the SLB.

23:07 A green‑channel login method was arranged for internal access.

23:17‑23:20 Core personnel for SLB, L4 LB, and CDN arrived and began troubleshooting.

Damage Control

SLB operators observed a traffic spike and attempted a reload, which failed. They then performed a cold restart, but CPU remained at 100%.

Multi‑active SLB also became unavailable due to a massive surge in retry traffic, reducing business success rates to about 50%.

Perf showed CPU hotspots in Lua functions, prompting a rollback of recent Lua code.

A newly deployed Lua‑based WAF was suspected and removed.

Recent retry‑logic changes in the balance_by_lua phase were rolled back.

HTTP/2 support added a week earlier was disabled.

New SLB Cluster Creation

At 00:00 a new SLB cluster was built, the L4 LB and public IPs were configured for it, and traffic was gradually shifted over. By 01:50 online services were fully restored.

SLB Recovery and Root‑Cause Analysis

After rebuilding, operators used a Lua profiling tool to generate flame graphs, pinpointing the lua‑resty‑balancer module as the hotspot. Debug logs revealed that the internal _gcd function returned nan when a container IP weight of "0" was processed, triggering a JIT compiler bug and an infinite loop.

Temporarily disabling JIT compilation restored CPU usage to normal, and core dumps were saved for further analysis.

Root Cause

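The bug was reproduced in a test environment. In one special deployment mode, the weight of a container instance was occasionally set to the string "0" rather than the number 0. The balancer module passed this string to _gcd, which performs arithmetic without type checking. In Lua the string "0" does not compare equal to the number 0, so a check such as b == 0 does not catch it, while the modulo operation coerces the string to a number and yields nan. The nan then fed back into _gcd and the worker spun in a CPU-burning loop.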
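The missing type check can be illustrated with a small defensive sketch. This is a Python analogue written for illustration only, not the lua-resty-balancer code or the fix Bilibili shipped. The names normalize_weight and weights_gcd are hypothetical, and dropping zero-weight instances before computing the divisor is an assumed policy.

```python
from functools import reduce
from math import gcd


def normalize_weight(raw):
    """Turn a raw weight from service discovery into a non-negative int.

    Hypothetical helper: anything that is not a whole, non-negative
    number raises ValueError, so a value such as the string "0" is
    converted explicitly instead of reaching the arithmetic unchecked.
    """
    if isinstance(raw, bool):
        raise ValueError(f"invalid weight: {raw!r}")
    if isinstance(raw, int):
        if raw < 0:
            raise ValueError(f"negative weight: {raw!r}")
        return raw
    if isinstance(raw, str):
        text = raw.strip()
        if text.isascii() and text.isdigit():
            return int(text)
    raise ValueError(f"invalid weight: {raw!r}")


def weights_gcd(nodes):
    """Validate all weights, then compute their greatest common divisor.

    `nodes` maps an instance id to its raw weight. Instances whose
    weight is 0 are dropped first (an assumed policy: they should get
    no traffic), so the divisor is only computed over positive ints.
    """
    clean = {node: normalize_weight(raw) for node, raw in nodes.items()}
    active = {node: w for node, w in clean.items() if w > 0}
    if not active:
        raise ValueError("no instance with a positive weight")
    return active, reduce(gcd, active.values())


if __name__ == "__main__":
    raw_nodes = {"10.0.0.1:80": 4, "10.0.0.2:80": "0", "10.0.0.3:80": "6"}
    active, divisor = weights_gcd(raw_nodes)
    print(active, divisor)
```

With the sample input, the instance whose weight is the string "0" is filtered out and the divisor of the remaining weights 4 and 6 is 2. The point of the sketch is that validation happens once, at the boundary where weights enter the balancer, rather than inside the arithmetic.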

Problem Analysis

1. Internal authentication failed because the login flow relied on a domain behind the faulty SLB.

2. Multi‑active SLB was overloaded by a 4‑fold traffic surge and massive connection spikes.

3. Limited SLB team size prevented parallel changes during the incident.

4. Standing up a new origin site required coordination among the SLB, L4 LB, and CDN teams, which made the process slow.

Optimization & Improvements

Multi‑Active Architecture

Clarify data‑center relationships and improve CDN routing rules.

Provide unified metadata management for multi‑active services.

Automate and visualize traffic shifting.

SLB Governance

Split SLB clusters per business unit and isolate public IPs.

Version‑control Lua code with quick rollback capability.

Automate cluster initialization so that it completes in under 5 minutes.

Increase CPU headroom and add elastic scaling.

Testing & Reliability

Introduce dedicated testing for core components.

Review upstream open‑source libraries for vulnerabilities.

Incident Drills

Simulate single‑data‑center failures and validate multi‑active failover.

Route a small, gray-scale share of traffic to CDN nodes with simulated faults to test resilience.

Emergency Response

Define clear roles for incident commander and responders.

Build a lightweight incident announcement platform.

Enhance event‑query capabilities across applications, users, and platforms.

Summary

The outage quickly became a trending topic on national hot-search lists, which added to the pressure on the engineers. By dissecting the incident, identifying the Lua bug, and implementing extensive technical and organizational improvements, the team reinforced the high-availability and multi-active capabilities of Bilibili's services.


Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.

