How QQ Music Achieves High Availability: Architecture, Toolchain, and Observability
This article explains how QQ Music builds a high‑availability system by combining redundant architecture, a comprehensive toolchain—including chaos engineering and full‑link pressure testing—and deep observability to gracefully handle failures in a large‑scale microservice environment.
1. QQ Music High‑Availability Architecture Overview
Failures are inevitable in distributed systems, so the focus is on embracing them: building a high‑availability system from three subsystems — redundant architecture, a toolchain, and observability.
Architecture
Redundant architecture eliminates single points of failure through cluster, multi‑datacenter, and multi‑region deployments, supporting horizontal scaling, load balancing, and automatic failover. Stability strategies such as distributed rate limiting, circuit breaking, and dynamic timeouts further improve availability.
Toolchain
The toolchain integrates experiments and tests to enhance architecture reliability, including chaos engineering and full‑link pressure testing. Chaos engineering injects faults to discover weak points, while full‑link pressure testing applies realistic traffic to identify performance bottlenecks.
Observability
Observability improves fault detection and resolution by collecting logs, metrics, tracing, profiling, and dumps, enabling end‑to‑end visibility of service health.
2. Disaster‑Recovery Architecture
Common DR solutions include remote cold standby, same‑city active‑active, two‑region three‑center, and remote active‑active/multi‑active. QQ Music adopts a dual‑center active‑active model with a write‑to‑one, read‑from‑both approach to balance cost and risk.
1) Dual‑Center Deployment
Two centers (Shenzhen and Shanghai) host identical STGW and API gateways. Global Server Load Balancing (GSLB) directs traffic based on proximity, ensuring isolation between centers.
The logical layer separates reads from writes: Shenzhen handles both reads and writes, Shanghai serves reads only, and write requests arriving in Shanghai are routed to Shenzhen.
Storage is duplicated in both centers; synchronization components keep data consistent across regions, using native cross‑region sync where available.
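The write‑to‑one, read‑from‑both policy can be sketched as a small routing rule. This is a minimal illustration, not QQ Music's actual implementation; the center names follow the article, everything else is assumed.

```python
PRIMARY = "shenzhen"   # handles both reads and writes
REPLICA = "shanghai"   # read-only; its writes are forwarded to the primary

def route(local_center, is_write):
    """Return the data center that should serve this request."""
    if is_write:
        return PRIMARY        # all writes converge on Shenzhen
    return local_center       # reads are served by the nearest center
```

The asymmetry keeps write conflicts impossible (a single writer) while reads stay local, which is the cost/risk balance the dual‑center model aims for.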
2) Automatic Failover
Initial client‑side dynamic IP scoring proved unstable, so the solution shifted to API‑gateway‑side failover, reducing client involvement.
Two failover mechanisms:
API‑gateway failover: When a local API fails (including circuit break or rate limit), the gateway routes the request to the remote center.
Client failover: If the gateway times out or returns a 5xx response, the client retries against the remote center; otherwise it does not retry.
The gateway‑side retry is more controllable and, combined with adaptive rate‑limit and circuit‑break strategies, prevents traffic amplification.
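The gateway‑side mechanism reduces to "try local, retry remote on failure". The sketch below is illustrative (function names are assumptions, not the gateway's API); in production the remote retry must itself be bounded, which is what the adaptive retry algorithm below provides.

```python
def call_with_failover(request, call_local, call_remote):
    """Serve from the local center; on any local failure
    (timeout, 5xx, circuit break, rate limit) retry remotely once."""
    try:
        return call_local(request)
    except Exception:          # local center failed in any way
        return call_remote(request)

def local_down(request):
    raise RuntimeError("local center unavailable")

# A failing local call falls through to the remote center.
result = call_with_failover("req", local_down, lambda r: "served-remotely")
```

An unconditional remote retry would double traffic during a full local outage, which is exactly the amplification the adaptive rate‑limit and circuit‑break strategies guard against.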
3) Adaptive Retry Algorithm
```python
# Adaptive retry window. f: current probe window size, g: probes
# actually sent in the window, s: probe success rate (0..1),
# t: total local requests in the window. Initial window f(0) = 1.
def next_window(f, g, s, t):
    if s >= 0.98:              # probes succeeding: remote looks healthy
        if g >= f:             # window fully used: grow it
            return f + max(min(0.01 * t, f), 1)
        return f               # window under-used: hold steady
    return max(1, f // 2)      # probes failing: back off by half
```
The algorithm adjusts the retry window based on probe success, alternating between a growth (detection) phase and a back‑off phase.
3. Stability Strategies
Distributed Rate Limiting
QQ Music uses a sliding‑window counter for distributed rate limiting, discarding excess requests at the service level without introducing global dependencies.
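A per‑instance window limiter avoids any global dependency. The sketch below uses the simple sliding‑log variant of window limiting; it is an illustration of the idea, not QQ Music's counter implementation.

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Admit at most `limit` requests per `window` seconds."""
    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.hits = deque()    # timestamps of admitted requests

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Evict timestamps that fell out of the window.
        while self.hits and self.hits[0] <= now - self.window:
            self.hits.popleft()
        if len(self.hits) < self.limit:
            self.hits.append(now)
            return True        # admit
        return False           # discard excess request
```

Because each instance enforces its own share of the quota, excess requests are dropped at the service level without a central counter on the hot path.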
Adaptive Rate Limiting
Server‑side adaptive limiting balances in‑flight requests using Little's Law (inflight = latency × QPS) and triggers limiting when CPU utilization exceeds 80% (800 on a 0–1000 scale) and the in‑flight count exceeds the optimal threshold.
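The trigger condition can be written directly from Little's Law. The thresholds below follow the BBR‑style design the article describes; the exact parameter names are assumptions.

```python
def should_limit(cpu_permille, inflight, min_rtt_s, max_qps):
    """Little's Law: optimal inflight L = W (min latency) x lambda (max QPS).
    Limit only when the machine is hot AND concurrency exceeds optimum."""
    optimal_inflight = min_rtt_s * max_qps
    return cpu_permille > 800 and inflight > optimal_inflight
```

Requiring both conditions prevents false positives: a hot CPU with healthy concurrency, or a queue spike on an idle machine, does not trip the limiter.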
Circuit Breaking
QQ Music adopts an SRE‑style circuit breaker with only Closed and Half‑Open states: once total requests exceed K × accepts, excess requests are dropped probabilistically rather than cut off outright.
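This follows the adaptive‑throttling pattern from the Google SRE book, where the drop probability rises smoothly as the backend's accept rate falls. A minimal sketch:

```python
import random

def drop_probability(requests, accepts, k=2.0):
    """SRE-style client throttling: probability of rejecting a new
    request, rising once requests exceed k x accepts."""
    return max(0.0, (requests - k * accepts) / (requests + 1))

def should_drop(requests, accepts, k=2.0, rng=random.random):
    return rng() < drop_probability(requests, accepts, k)
```

With K = 2, no requests are dropped while at least half succeed; as the backend degrades, the breaker sheds more load but always lets a trickle through, which replaces a separate Open state with continuous probing.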
Dynamic Timeout
Uses an EMA‑based algorithm to adjust timeout thresholds dynamically, expanding timeout when average latency is low and shrinking it when latency spikes.
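A minimal sketch of the idea, with illustrative parameters (the smoothing factor and the 80%/150%/50% thresholds are assumptions, not QQ Music's values):

```python
class DynamicTimeout:
    """Track an EMA of latency; expand the timeout while latency is
    healthy, shrink it toward a floor when latency spikes."""
    def __init__(self, nominal_ms, alpha=0.2):
        self.nominal = nominal_ms
        self.alpha = alpha            # EMA smoothing factor
        self.ema = float(nominal_ms)

    def observe(self, latency_ms):
        self.ema = self.alpha * latency_ms + (1 - self.alpha) * self.ema

    def timeout_ms(self):
        if self.ema < 0.8 * self.nominal:   # healthy: tolerate slow outliers
            return 1.5 * self.nominal
        return 0.5 * self.nominal           # latency spike: fail fast
```

The effect is the one the article describes: a generous timeout absorbs occasional slow requests when the service is healthy, while a tightened timeout sheds load quickly during degradation.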
Service Grading
Services are classified into four grades (1‑critical, 2‑important, 3‑minor, 4‑trivial) to prioritize traffic and SLA commitments.
API‑Gateway Graded Rate Limiting
The gateway applies graded rate limiting, ensuring that during high load only grade‑1 services remain available.
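Graded shedding reduces to an admission rule keyed on service grade and current load. The load cutoffs below are illustrative, not the gateway's actual thresholds:

```python
def admitted(grade, load):
    """grade: 1 (critical) .. 4 (trivial); load: 0.0 .. 1.0.
    Shed lower-grade traffic first as load rises."""
    if load > 0.9:
        return grade == 1     # extreme load: only critical services
    if load > 0.7:
        return grade <= 2
    if load > 0.5:
        return grade <= 3
    return True               # normal load: admit everything
```

Degradation is therefore ordered and predictable: trivial traffic disappears first, and grade‑1 services are the last thing standing.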
4. Toolchain
Chaos Engineering
TMEChaos, built on ChaosMesh, provides a cloud‑native chaos platform with experiment orchestration, dashboards, and integration with TME microservice architecture.
Full‑Link Pressure Testing
Generates realistic traffic by sampling production API calls, applies traffic coloring to isolate test traffic, and uses a pressure engine to drive requests while smart monitoring detects and aborts unhealthy experiments.
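Traffic coloring amounts to tagging test requests with a dye marker and propagating it on every downstream call, so storage and middleware can divert colored traffic to shadow resources. The header name below is an assumption for illustration:

```python
DYE_HEADER = "x-tme-stress-test"   # hypothetical dye marker

def is_stress_traffic(headers):
    return headers.get(DYE_HEADER) == "1"

def propagate(inbound_headers, outbound_headers):
    """Copy the dye marker onto a downstream request's headers."""
    if is_stress_traffic(inbound_headers):
        outbound_headers[DYE_HEADER] = "1"
    return outbound_headers
```

Because the marker rides the full call chain, every hop can distinguish test from real traffic, which is what makes aborting an unhealthy experiment safe.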
5. Observability
Metrics
Prometheus federation collects millions of metrics with 3‑second scrape intervals, providing real‑time monitoring of QPS, latency, error rate, and saturation.
Logging
ELK stack (Filebeat → Kafka → Logstash → Elasticsearch → Kibana) centralizes log collection and enables fast query and analysis.
Tracing
Jaeger captures distributed traces, linking spans across services to reconstruct call chains for fault isolation.
Profiles
Conprof continuously collects CPU/heap profiles in production, storing them for later analysis via a unified UI.
Dumps
Panic dumps are captured via RPC interceptors and reported to Sentry for post‑mortem analysis.
6. Summary
The article presents QQ Music’s high‑availability practice across architecture, toolchain, and observability. Redundant dual‑center design, adaptive failover, and comprehensive stability strategies form the backbone, while chaos engineering, full‑link testing, and deep observability continuously improve resilience.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.