How QQ Music Achieves High Availability: Architecture, Tools, and Observability
This article explains how QQ Music embraces inevitable faults by building a high‑availability architecture that combines redundant infrastructure, automated failover, stability strategies, a robust toolchain for chaos engineering and full‑link load testing, and comprehensive observability to ensure graceful fault handling at scale.
Introduction
Faults are unavoidable in distributed systems, so QQ Music focuses on embracing them rather than preventing them. By constructing a high‑availability (HA) architecture, the service can respond to failures gracefully and maintain a reliable user experience.
HA Architecture Overview
The HA system consists of three subsystems: Architecture, Toolchain, and Observability. The architectural layer provides redundancy, automatic failover, and stability policies such as distributed rate limiting, circuit breaking, and dynamic timeout.
Disaster‑Recovery Architecture
QQ Music evaluates common DR patterns (cold standby, active‑active, multi‑center) and selects a dual‑center model in which both centers actively serve reads: a primary data center in Shenzhen and a secondary one in Shanghai. Both centers deploy identical STGW and API gateways, and GSLB routes traffic to the nearer center.
Read‑write separation is applied: Shenzhen hosts read/write services, while Shanghai hosts read‑only services. Writes are processed in Shenzhen and synchronized to Shanghai via storage sync components (e.g., Cmongo, CKV+). This design avoids the high cost of write‑side DR while ensuring data consistency for read‑heavy workloads.
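The routing rule implied by this split can be sketched in a few lines. This is only an illustration: the `Request` type, the `nearest_center` field (standing in for whatever GSLB decides), and the fallback to the primary for unknown locations are assumptions, not details from the source.

```python
from dataclasses import dataclass

PRIMARY = "shenzhen"    # read/write center
SECONDARY = "shanghai"  # read-only center


@dataclass(frozen=True)
class Request:
    api: str
    is_write: bool
    nearest_center: str  # hypothetical: the center GSLB would pick by proximity


def route(req: Request) -> str:
    """Return the data center that should serve the request."""
    if req.is_write:
        # Writes are always processed in the primary; storage sync
        # replicates the result to the secondary afterwards.
        return PRIMARY
    # Reads can be served by either center, so follow proximity.
    # Assumption: unknown locations fall back to the primary.
    if req.nearest_center in (PRIMARY, SECONDARY):
        return req.nearest_center
    return PRIMARY
```

For example, `route(Request("AddFavorite", True, "shanghai"))` returns `"shenzhen"`, while a read from the same user stays in Shanghai.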
Automatic Failover
When an API call fails in the local center (including failures caused by circuit breaking or rate limiting), the API gateway automatically retries the request in the remote center. A client‑side dynamic scoring algorithm was replaced by this gateway‑centric adaptive retry mechanism to reduce client complexity.
Adaptive Retry Algorithm
// f(i): retry window for the i‑th probe
// g(i): actual number of probes sent in that window
// s(i): success rate of those probes
// t:    total number of local requests
// The window starts at 1 and is adjusted based on probe success.
if s(i) >= 98% {
    if g(i) >= f(i) {
        f(i+1) = f(i) + max(min(1% * t, f(i)), 1)
    } else {
        f(i+1) = f(i)
    }
} else {
    f(i+1) = max(1, f(i)/2)
}
The algorithm combines a probing strategy (expanding the window when the success rate is at least 98% and the current window has been fully used) with a back‑off strategy (halving the window when the success rate drops below 98%), and includes a global retry switch to control traffic.
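The update rule above can be written as a small runnable function. This is a sketch of the pseudocode only, not the gateway's actual code: the function name, the `threshold` parameter, and the use of integer division for `1% * t` and `f(i)/2` are assumptions.

```python
def next_window(f: int, g: int, s: float, t: int, threshold: float = 0.98) -> int:
    """Compute f(i+1) from f(i)=f, g(i)=g, s(i)=s and total local requests t."""
    if s >= threshold:
        if g >= f:
            # Healthy and fully used: grow by at most 1% of t,
            # at most doubling the window, and by at least 1.
            return f + max(min(t // 100, f), 1)
        # Healthy but under-used: keep the window unchanged.
        return f
    # Unhealthy: back off by halving, never below 1.
    return max(1, f // 2)


def simulate(success_rates, t: int = 1000):
    """Replay a sequence of probe success rates, assuming each window is fully used."""
    f, history = 1, [1]
    for s in success_rates:
        f = next_window(f, g=f, s=s, t=t)
        history.append(f)
    return history
```

With `t = 1000`, six healthy rounds followed by one failing round give `simulate([1.0] * 6 + [0.5])` equal to `[1, 2, 4, 8, 16, 26, 36, 18]`: the window doubles until the 1% cap (10 requests) takes over, then halves on failure.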
Stability Strategies
QQ Music employs several stability mechanisms:
Distributed Rate Limiting: a sliding‑window counter is used to drop excess requests at the microservice framework level.
Circuit Breaking: an SRE‑style breaker with only Closed and Half‑Open states drops requests based on success rate, providing more elastic behavior than traditional three‑state breakers.
Dynamic Timeout: an EMA‑based algorithm adjusts timeout thresholds according to observed latency, preventing unnecessary request failures during short network spikes.
Service Tiering: services are classified into four levels (1‑4) based on business impact, guiding rate‑limit priority and disaster‑recovery planning.
API‑Gateway Tiered Limiting: the gateway enforces tier‑aware rate limiting, ensuring critical (Level‑1) services remain available under high load.
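The first three mechanisms can be illustrated with minimal single-process sketches. None of this is QQ Music's code. The class names and default parameters are hypothetical, the limiter keeps its counters in memory (a distributed version would presumably hold them in shared storage), and the breaker uses the client-side throttling formula from Google's SRE book, which is an assumption about what "SRE‑style" means here.

```python
from typing import Optional


class SlidingWindowLimiter:
    """Sliding-window counter: weights the previous window by its remaining overlap."""

    def __init__(self, limit: int, window: float = 1.0):
        self.limit, self.window = limit, window
        self.start = 0.0  # start time of the current fixed window
        self.curr = 0     # requests admitted in the current window
        self.prev = 0     # requests admitted in the previous window

    def allow(self, now: float) -> bool:
        elapsed = now - self.start
        if elapsed >= self.window:
            steps = int(elapsed // self.window)
            # After more than one idle window the previous count no longer overlaps.
            self.prev = self.curr if steps == 1 else 0
            self.curr = 0
            self.start += steps * self.window
            elapsed = now - self.start
        overlap = 1.0 - elapsed / self.window
        if self.prev * overlap + self.curr >= self.limit:
            return False  # drop the excess request
        self.curr += 1
        return True


class AdaptiveBreaker:
    """Probabilistic breaker: never fully open, it only sheds a share of traffic."""

    def __init__(self, k: float = 2.0):
        self.k = k         # tolerance multiplier (assumed default)
        self.requests = 0  # all attempts, including locally dropped ones
        self.accepts = 0   # attempts the backend accepted

    def reject_probability(self) -> float:
        return max(0.0, (self.requests - self.k * self.accepts) / (self.requests + 1))

    def allow(self, rand: float) -> bool:
        """rand is a uniform sample in [0, 1), e.g. random.random()."""
        return rand >= self.reject_probability()

    def record(self, accepted: bool) -> None:
        self.requests += 1
        self.accepts += int(accepted)


class EmaTimeout:
    """Timeout derived from an exponential moving average of observed latency."""

    def __init__(self, alpha=0.2, multiplier=2.0, floor=0.05, ceiling=2.0):
        self.alpha, self.multiplier = alpha, multiplier
        self.floor, self.ceiling = floor, ceiling
        self.ema: Optional[float] = None

    def observe(self, latency: float) -> None:
        if self.ema is None:
            self.ema = latency
        else:
            self.ema = self.alpha * latency + (1 - self.alpha) * self.ema

    def timeout(self) -> float:
        if self.ema is None:
            return self.ceiling  # no data yet: be permissive
        return min(self.ceiling, max(self.floor, self.multiplier * self.ema))
```

Because the EMA moves only a fraction `alpha` toward each new sample, a single slow response raises the timeout gradually instead of abruptly, which matches the goal of tolerating short spikes. Time and randomness are passed in explicitly so the behavior is easy to test.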
Toolchain
To proactively discover and mitigate risks, QQ Music uses two main tools:
Chaos Engineering
Based on ChaosMesh, the TMEChaos platform injects failures (e.g., network timeouts) into production to expose fragile components. It includes a web dashboard, backend services, and a controller manager that orchestrates experiments across multiple clusters.
Full‑Link Load Testing
Full‑link testing generates realistic traffic by sampling production API calls and optionally crafting custom flows. Traffic is “colored” in RPC contexts so that it can be isolated, stored in shadow databases, and monitored without affecting real users.
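Traffic coloring can be sketched with a context flag that travels with the request and decides which store is used. The metadata key `x-load-test`, the in-memory dictionaries standing in for the production and shadow databases, and the function names are all hypothetical; a real RPC framework would carry the flag in its own request context.

```python
import contextvars

# Hypothetical flag carried through the call chain for the current request.
is_load_test = contextvars.ContextVar("is_load_test", default=False)

PROD_DB = {}    # stand-in for the production database
SHADOW_DB = {}  # stand-in for the shadow database


def current_db() -> dict:
    """Pick the store based on the color of the current request."""
    return SHADOW_DB if is_load_test.get() else PROD_DB


def save_play_record(user: str, song: str) -> None:
    # Business code stays unaware of load testing; only the store lookup differs.
    current_db()[user] = song


def handle(metadata: dict, user: str, song: str) -> None:
    """Entry point: read the color from request metadata, then run the handler."""
    token = is_load_test.set(metadata.get("x-load-test") == "1")
    try:
        save_play_record(user, song)
    finally:
        # Restore the previous value so the flag never leaks into other requests.
        is_load_test.reset(token)
```

Calling `handle({"x-load-test": "1"}, "u1", "song-a")` writes only to `SHADOW_DB`, while an uncolored call writes only to `PROD_DB`.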
Observability
Observability is built on three pillars: Metrics, Logging, and Tracing, supplemented by Profiles and Dumps.
Metrics
Prometheus federated clusters collect millions of metrics with a 3‑second scrape interval, visualized in Grafana. Four core metrics—Traffic (QPS), Latency, Error, and Saturation—are tracked for each service.
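As a rough illustration of what the four core metrics measure, the sketch below derives them from raw request samples for one service over one interval. The `Sample` type, the choice of p99 as the latency figure, and in-flight requests over capacity as the saturation figure are assumptions; in the setup described above these values would come from Prometheus queries rather than application code.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    latency_ms: float
    ok: bool


def golden_signals(samples, interval_s: float, in_flight: int, capacity: int) -> dict:
    """Summarize one interval of requests into the four core metrics."""
    n = len(samples)
    latencies = sorted(s.latency_ms for s in samples)
    # Nearest-rank style p99 using integer math to avoid float rounding.
    p99 = latencies[min(n - 1, (99 * n) // 100)] if n else 0.0
    return {
        "qps": n / interval_s,                                   # Traffic
        "p99_latency_ms": p99,                                   # Latency
        "error_rate": sum(not s.ok for s in samples) / n if n else 0.0,  # Error
        "saturation": in_flight / capacity,                      # Saturation
    }
```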
Logging
Logs are collected via Filebeat → Kafka → Logstash → Elasticsearch → Kibana, providing a non‑intrusive, centralized log search platform.
Tracing
Jaeger captures distributed traces using Trace IDs and spans, enabling end‑to‑end request flow visualization and root‑cause analysis.
Profiles
Conprof continuously collects CPU/heap profiles based on load‑aware sampling, storing them for later analysis via a unified UI.
Dumps
Instead of traditional core dumps, panics are intercepted in the RPC framework and reported to Sentry for rapid post‑mortem analysis.
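The interception pattern can be sketched as a wrapper around every handler. Here Python exceptions stand in for panics, and a plain callable stands in for the Sentry client. The decorator name, the event fields, and the error response shape are hypothetical.

```python
import functools
import traceback
from typing import Callable


def with_crash_report(report: Callable[[dict], None]):
    """Wrap an RPC handler so crashes are reported instead of killing the process."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(request):
            try:
                return handler(request)
            except Exception as exc:
                # Capture what a post-mortem needs: where, what, and with which input.
                report({
                    "handler": handler.__name__,
                    "error": repr(exc),
                    "request": request,
                    "stack": traceback.format_exc(),
                })
                return {"code": 500, "msg": "internal error"}
        return wrapper
    return decorator


events = []  # stand-in for the crash-reporting backend


@with_crash_report(events.append)
def get_song(request):
    titles = {"1": "demo"}
    return {"code": 0, "title": titles[request["id"]]}
```

A call such as `get_song({"id": "2"})` returns the error response and leaves one event in `events` containing the `KeyError` and its stack trace, while the process keeps serving other requests.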
Conclusion
The three subsystems—architecture, toolchain, and observability—are tightly coupled. The architecture’s fragility drives the need for robust tooling and observability, while improvements in tooling and observability feed back into higher architectural availability. Ultimately, business‑level design decisions (idempotency, dependency minimization, graceful degradation) must align with these technical foundations to sustain long‑term growth.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
