
Postmortem of Bilibili SLB Outage on July 13, 2021

This postmortem details the July 13, 2021 Bilibili outage, in which a Lua bug drove CPU usage to 100% on the OpenResty-based SLB. It covers the incident timeline, root-cause analysis, mitigation steps, and the subsequent technical and process improvements made to enhance reliability and multi-active deployment.


On July 13, 2021 at 22:52, Bilibili's SRE team received alerts indicating that the access layer (layer-4 LB and layer-7 SLB) was unavailable, leaving users unable to reach the website and app. Initial suspicion fell on the network or load-balancer infrastructure, prompting an emergency voice conference.

Investigation revealed that the layer-7 SLB, built on OpenResty, was pegged at 100% CPU due to a Lua bug. The trigger was a special deployment mode that temporarily set a container instance's weight to the string "0" rather than the number 0. When the `_gcd` function in the lua-resty-balancer module received this string, the `b == 0` base case failed (a string never compares equal to a number in Lua), while the modulo operation coerced "0" to 0 and produced nan. Since nan compares unequal to everything, including 0, the recursion could never reach its base case, and the resulting infinite loop exhausted the CPU.
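The failure mode can be reproduced outside OpenResty. The sketch below models LuaJIT's arithmetic semantics in Python (the helper names `lua_arith_mod` and `gcd_steps` are illustrative, not from the real lua-resty-balancer code): the string weight "0" fails the `b == 0` base case, the coerced modulo yields nan, and nan never compares equal to 0, so the loop can only be stopped by an external cap.

```python
import math

def lua_arith_mod(a, b):
    """Mimic LuaJIT's `%` on doubles: numeric strings are coerced to
    numbers, and x % 0 yields nan instead of raising an exception."""
    a, b = float(a), float(b)
    if b == 0 or math.isnan(a) or math.isnan(b):
        return float("nan")
    return math.fmod(a, b)

def gcd_steps(a, b, limit=1000):
    """Run the _gcd recursion iteratively and count the steps taken.
    Returns `limit` when the recursion fails to converge (runaway)."""
    steps = 0
    # Lua's base case is `b == 0`. The string "0" is not == the number 0,
    # so it skips the base case; once `%` produces nan, `nan == 0` stays
    # false forever and the loop can never exit on its own.
    while b != 0 and steps < limit:
        a, b = b, lua_arith_mod(a, b)
        steps += 1
    return steps

# A normal numeric weight converges in a handful of steps:
print(gcd_steps(12, 8))          # 2
# The string weight "0" spins until the safety cap:
print(gcd_steps(12, "0", 100))   # 100
```

Note that the loop is tail-recursive in the original Lua, so it never overflows the stack; it simply burns CPU forever, which matches the observed 100% CPU symptom.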

Mitigation steps included attempting to reload the SLB, cold‑restarting it, and finally rebuilding a brand‑new SLB cluster. The new cluster was initialized, four‑layer LB and public IPs were configured, and traffic was gradually shifted back, restoring core services (live streaming, recommendation, comments, etc.) by 01:50.

Root‑cause analysis confirmed the Lua type‑conversion issue and the lack of weight validation. Temporary fixes such as disabling JIT compilation helped but did not fully resolve the problem; the underlying weight‑zero condition had to disappear for the service to recover.
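One of the cheapest preventive measures implied by the missing weight validation is coercing and checking weights at the point of ingestion, before they reach the balancer's arithmetic. A minimal sketch of that idea (the `normalize_weight` helper is hypothetical, not the actual fix Bilibili deployed):

```python
def normalize_weight(raw):
    """Coerce an upstream weight to a non-negative int, rejecting
    anything else (non-numeric strings, None, negatives) up front.
    Hypothetical validation helper for illustration only."""
    try:
        w = int(raw)
    except (TypeError, ValueError):
        raise ValueError(f"invalid weight: {raw!r}")
    if w < 0:
        raise ValueError(f"negative weight: {raw!r}")
    return w

# The string that triggered the outage becomes a harmless number:
print(normalize_weight("0"))   # 0
print(normalize_weight(5))     # 5
```

Validating at the boundary keeps the type error from propagating into code, like `_gcd`, that silently assumes numeric inputs.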

Following the incident, several improvements were planned: strengthening the multi-active architecture, isolating SLB clusters per business unit, version-controlled Lua code deployment, automated SLB provisioning within five minutes, enhanced monitoring and event-query capabilities, and formalized incident-response procedures.

The postmortem concludes that while the outage highlighted serious gaps in load‑balancer reliability and operational processes, the lessons learned led to concrete technical and organizational actions to prevent similar failures in the future.

Tags: Operations, SRE, Lua, load balancer, postmortem, incident, SLB
Written by

Big Data Technology Architecture

Exploring Open Source Big Data and AI Technologies
