What Really Caused Bilibili’s Sudden Outage? A Deep Dive into the Technical Failure

The article analyzes Bilibili's recent half‑hour service disruption, explores technical rumors such as an etcd crash, examines Kubernetes‑based cloud‑native infrastructure, reviews similar historic outages, and offers expert recommendations for improving high‑availability and disaster‑recovery in large‑scale internet services.

Code Ape Tech Column

Jul 15, 2021

On a recent night, Bilibili's mobile app became inaccessible for about thirty minutes. The outage became a hot-search trend, and the platform published an official notice stating that a server-room fault had caused the disruption and that services had since been restored.

Social media users quickly spread a flood of speculative explanations, including a fire, data deletion, criminal incidents, hardware failures, and even alien attacks. None of these were substantiated. The most technically plausible guesses focused on core infrastructure components.

Two knowledgeable contributors on Zhihu offered concrete hypotheses. The first suggested that etcd, the distributed key-value store in which Kubernetes keeps its cluster state, had failed; without it, the reverse proxy could not look up pod IPs, and network communication would break. The second pointed to a generic site-wide fault, likely triggered by a buggy new version that crashed a core service and required an emergency rollback. Both explanations match typical failure modes of large-scale Kubernetes deployments, where network plugins, pod orchestration, and cloud-native services operate semi-independently but still rely on shared components.
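The first hypothesis can be made concrete with a small sketch. It assumes a proxy that asks a registry for pod addresses on every lookup and keeps the last successful answer as a fallback. `EndpointResolver`, `RegistryUnavailable`, and the `fetch` callback are hypothetical names chosen for illustration; they are not the etcd client API and say nothing about how Bilibili's proxy actually works.

```python
import time
from typing import Callable, Dict, List, Tuple


class RegistryUnavailable(Exception):
    """Raised by the (hypothetical) registry client when the store cannot be reached."""


class EndpointResolver:
    """Resolves a service name to pod addresses, keeping a last-known-good cache."""

    def __init__(
        self,
        fetch: Callable[[str], List[str]],
        max_stale_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch                  # assumed to query the key-value store
        self._max_stale = max_stale_seconds  # how long a cached answer may be reused
        self._clock = clock
        self._cache: Dict[str, Tuple[float, List[str]]] = {}

    def resolve(self, service: str) -> List[str]:
        try:
            endpoints = self._fetch(service)
        except RegistryUnavailable:
            cached = self._cache.get(service)
            if cached is None:
                raise  # never resolved before: the proxy has nowhere to send traffic
            fetched_at, endpoints = cached
            if self._clock() - fetched_at > self._max_stale:
                raise  # cached addresses are too old to trust
            return list(endpoints)
        # Successful lookup: remember it in case the store fails later.
        self._cache[service] = (self._clock(), list(endpoints))
        return list(endpoints)
```

Without the cache, every lookup fails the moment the store does, which is the behaviour the etcd hypothesis describes. With it, the proxy keeps routing to the last known pods for a bounded time, at the cost of possibly sending traffic to pods that have since moved.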

Additional analysis highlighted the role of cloud service providers: a CDN outage can force traffic directly to gateways, overwhelming them, activating disaster‑recovery mechanisms, and causing a cascading service degradation (snowball effect) that brings the entire environment down.
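One common defence against this kind of snowball effect is load shedding: the gateway rejects excess requests quickly instead of queueing them until it collapses. The following is a minimal sketch under that assumption. `LoadShedder` and `handle` are hypothetical names, and the limit of in-flight requests is an illustrative parameter, not a value taken from the incident.

```python
import threading
from typing import Callable, Tuple


class LoadShedder:
    """Caps the number of requests a gateway works on at the same time."""

    def __init__(self, max_in_flight: int) -> None:
        if max_in_flight < 1:
            raise ValueError("max_in_flight must be at least 1")
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Admit the request if there is capacity; otherwise refuse immediately."""
        with self._lock:
            if self._in_flight >= self._max:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Free one slot after a request finishes."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1


def handle(shedder: LoadShedder, work: Callable[[], str]) -> Tuple[int, str]:
    """Run `work` if admitted; otherwise answer 503 without touching the backend."""
    if not shedder.try_acquire():
        return 503, "overloaded"
    try:
        return 200, work()
    finally:
        shedder.release()
```

Rejecting early keeps the gateway responsive for the requests it does admit, so a traffic surge degrades service for some users rather than taking down everything behind the gateway.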

To contextualize the incident, the article lists several historic large-scale outages: the 2013 WeChat fiber-cut failure, the 2015 Alipay optical-cable break, the 2015 Amazon AWS overload caused by a new DynamoDB feature, the 2013 Nasdaq bug that halted trading for hours, and the 2016 DDoS attack on the DNS provider Dyn, which made major sites such as Twitter and Spotify unreachable for many U.S. users.

Expert commentary from Zilliz’s quality‑assurance lead classifies outage causes into software faults (code bugs introduced by new features) and hardware faults (physical damage like fiber cuts). The recommended mitigation strategies include adopting cloud‑native architectures with automatic fault isolation, implementing active‑active or multi‑region disaster‑recovery setups, and ensuring robust backup and failover mechanisms.
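The active-active recommendation can be sketched as a routing rule that spreads users across all healthy regions and moves only the affected users when one region fails. The example below uses rendezvous (highest-random-weight) hashing, which is one possible technique and not something the source attributes to any particular company. The function name `route`, the region names, and the health set are all hypothetical.

```python
import hashlib
from typing import Iterable, Sequence


class NoHealthyRegion(Exception):
    """Raised when every configured region is marked unhealthy."""


def _score(region: str, user_id: str) -> bytes:
    # A stable hash; Python's built-in hash() for strings varies between runs.
    return hashlib.sha256(f"{region}:{user_id}".encode("utf-8")).digest()


def route(user_id: str, regions: Sequence[str], healthy: Iterable[str]) -> str:
    """Pick the healthy region with the highest hash score for this user."""
    healthy_set = set(healthy)
    candidates = [r for r in regions if r in healthy_set]
    if not candidates:
        raise NoHealthyRegion("no region is available to serve traffic")
    return max(candidates, key=lambda r: _score(r, user_id))


if __name__ == "__main__":
    regions = ["region-a", "region-b", "region-c"]  # illustrative names
    home = route("user-42", regions, regions)
    fallback = route("user-42", regions, [r for r in regions if r != home])
    print(home, "->", fallback)
```

Because each user's choice depends only on the scores of the regions still healthy, losing one region reroutes that region's users and leaves everyone else where they were, which limits the extra load a failover puts on the surviving sites.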

In summary, the hypotheses reviewed here suggest that the Bilibili outage stemmed from a critical component failure within its Kubernetes-based infrastructure, potentially an etcd crash or a network-plugin issue, exacerbated by insufficient redundancy. Whatever the exact root cause, the incident underscores the importance of high-availability design for modern internet services.
