Tag

outage


Efficient Ops
Dec 22, 2024 · Operations

What Caused OpenAI’s Massive Outage? Inside the Kubernetes Failure and Recovery

On December 11, 2024, OpenAI suffered a severe outage across ChatGPT, its API, and Sora. A misconfigured telemetry service overloaded Kubernetes control planes worldwide, setting off a cascade of failures that required a coordinated recovery effort.

Kubernetes · OpenAI · cloud operations
0 likes · 8 min read
DevOps Operations Practice
Dec 16, 2024 · Cloud Native

Analysis of OpenAI's December 2024 Outage: Kubernetes Control Plane Overload and Mitigation

The December 11, 2024 OpenAI outage was caused by a misconfigured monitoring service that overloaded the Kubernetes control plane. The four‑hour disruption was resolved through cluster scaling, API blocking, and resource expansion, and it highlights critical infrastructure risks for large‑scale cloud‑native operations.

Cloud Native · Control Plane · Kubernetes
0 likes · 7 min read
IT Services Circle
Aug 21, 2024 · Operations

Analysis of NetEase Cloud Music Outage on August 19: Infrastructure Failure and Operational Lessons

On August 19, NetEase Cloud Music suffered a severe infrastructure‑related outage that prevented users from logging in, loading playlists, and searching for songs. The two‑hour recovery effort was followed by a brief free‑membership compensation. The incident highlights the critical role of proper change management, gray releases, disaster recovery, and cross‑functional coordination in large‑scale services.

NetEase Cloud Music · Operations · disaster recovery
0 likes · 6 min read
Top Architecture Tech Stack
Aug 19, 2024 · Operations

Analysis of NetEase Cloud Music Outage: Causes and Data‑Center Migration Challenges

On August 19 the NetEase Cloud Music service suffered a major outage that was traced to a complex migration of its Hangzhou data center to Guizhou, highlighting large‑scale operational risks, technical debt, and strict continuity constraints for high‑traffic internet platforms.

NetEase Cloud Music · Operations · cloud computing
0 likes · 6 min read
Cognitive Technology Team
Aug 17, 2024 · Operations

GitHub Outage on August 14, 2024: Causes, Impact, and Recovery

On August 14, 2024, GitHub experienced a massive site-wide outage caused by a database infrastructure configuration change that disrupted traffic routing, leading to loss of database connections and affecting core services such as Pull Requests, Pages, Copilot, and the API, with full restoration confirmed later that evening.

Database · GitHub · Operations
0 likes · 2 min read
Architecture Digest
Nov 14, 2023 · Cloud Computing

Alibaba Cloud Outage in November 2023 Affects Numerous Services and Regions

In November 2023, Alibaba Cloud suffered a massive outage that disrupted client, web, and mobile access for nearly eight hours, affecting a wide range of Alibaba products, over 300,000 enterprise customers, and numerous global data centers, highlighting the fragility of large-scale cloud services.

2023 incident · Alibaba · Alibaba Cloud
0 likes · 9 min read
IT Services Circle
Sep 12, 2023 · Operations

Microsoft Azure Sydney Data Center Outage: Causes, Impact, and Operational Lessons

A prolonged Azure outage in Sydney caused by a sudden power drop that disabled cooling systems, compounded by insufficient on‑site staff, led to service disruptions for over 24 hours and highlighted critical operational lessons for cloud data‑center management.

Azure · Microsoft · Operations
0 likes · 9 min read
Wukong Talks Architecture
Dec 26, 2022 · Operations

Alibaba Cloud Hong Kong Region Outage Postmortem – December 18, 2022

On December 18, 2022, Alibaba Cloud's Hong Kong Region experienced a severe cooling‑system failure that caused a 14‑hour outage of ECS, OSS, EBS, RDS and other services, prompting extensive emergency procedures, service impact analysis, and a detailed post‑mortem with improvement actions.

Alibaba Cloud · Operations · databases
0 likes · 14 min read
macrozheng
Jul 18, 2021 · Operations

Why Did Bilibili Crash? A Developer’s Deep Dive into High‑Availability Failures

In this article, a programmer recounts the recent Bilibili outage, analyzes its timeline, proposes technical root‑cause hypotheses such as CDN failure and service‑chain avalanche, shares insights from the platform’s high‑availability architecture, and outlines preventive techniques for building more resilient backend systems.

Bilibili · High Availability · Operations
0 likes · 10 min read
Efficient Ops
Apr 16, 2020 · Operations

What Caused Cloudflare’s 4‑Hour Outage? Lessons on Cable Management and Process Clarity

A four‑hour Cloudflare outage was triggered by an unauthorized cable removal during a planned maintenance, compounded by unclear instructions and unlabeled wiring, highlighting the need for better cable management, clear operational procedures, and robust single‑point‑of‑failure mitigation.

Cloudflare · Data Center Operations · Process Improvement
0 likes · 3 min read