QQ Music High-Availability Architecture Overview
QQ Music achieves high availability by layering redundant multi‑datacenter architecture, proactive chaos‑engineering toolchains, and comprehensive observability—including metrics, logging, tracing and profiling—while employing service grading, adaptive retry windows and EMA‑based dynamic timeouts to gracefully handle faults across its massive micro‑service ecosystem.
Faults are inevitable in distributed systems, so QQ Music embraces them by building a high‑availability architecture that handles failures gracefully rather than assuming they can be avoided.
The system is organized into three subsystems—architecture, toolchain, and observability. The architecture layer provides redundancy (cluster deployment, multi‑datacenter, multi‑center), automatic failover, and stability strategies such as distributed rate limiting, circuit breaking, and dynamic timeout.
The toolchain layer uses chaos engineering (TMEChaos, built on Chaos Mesh) and full‑link stress testing to proactively discover risk points. TMEChaos consists of a dashboard, backend, API server, status manager, steady‑state monitor, Chaos Mesh controller manager, and daemon, which together enable fault injection and experiment orchestration.
The observability layer supplies metrics (Prometheus federation with golden metrics like traffic, latency, error, saturation), logging (ELK stack with Filebeat → Kafka → Logstash → Elasticsearch → Kibana), tracing (Jaeger pipeline: agent → collector → ingester → Elasticsearch → query), profiling (conprof for continuous performance analysis), and dumps (panic interception reported to Sentry).
Service grading defines four importance levels (1‑critical, 2‑high, 3‑medium, 4‑low) that guide traffic routing, SLA definition, and disaster‑recovery priorities.
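One way to picture how grades can drive disaster-recovery priority is the minimal sketch below. It models the four levels as an enum and orders services so that lower-numbered (more important) grades are recovered first. The service names and the `recovery_order` helper are hypothetical and are not taken from QQ Music's implementation.

```python
from dataclasses import dataclass
from enum import IntEnum


class Grade(IntEnum):
    """The four importance levels; a smaller value means more important."""
    CRITICAL = 1
    HIGH = 2
    MEDIUM = 3
    LOW = 4


@dataclass(frozen=True)
class Service:
    name: str
    grade: Grade


def recovery_order(services: list[Service]) -> list[Service]:
    """Sort services so the most important grades are recovered first.

    Ties within a grade are broken by name to keep the order deterministic.
    """
    return sorted(services, key=lambda s: (s.grade, s.name))


# Hypothetical service names, for illustration only.
services = [
    Service("comment-feed", Grade.MEDIUM),
    Service("playback", Grade.CRITICAL),
    Service("skin-store", Grade.LOW),
    Service("search", Grade.HIGH),
]
ordered = [s.name for s in recovery_order(services)]
print(ordered)
```

The same grade value could also key SLA targets or routing rules, but those mappings are not specified in the text and are left out here.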
Adaptive retry and API‑gateway failover are realized through a retry‑window algorithm with detection and back‑off strategies. The algorithm is:
// f(i): detection window; g(i): actual number of detections; s(i): success rate; t: total requests.
if s(i) in [98%, 100%] { // detection normal
    if g(i) >= f(i) {
        f(i+1) = f(i) + max(min(1% * t, f(i)), 1) // window fully used: grow it
    } else {
        f(i+1) = f(i) // keep window unchanged
    }
} else {
    f(i+1) = max(1, f(i)/2) // back‑off on abnormal detection
}
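The window update above can be sketched as a small runnable function. This is an illustrative translation only: the function name `next_window`, the use of integer arithmetic (flooring `1% * t` and `f(i)/2`), and the `healthy_threshold` parameter are assumptions, not details confirmed by the source.

```python
def next_window(f: int, g: int, s: float, t: int,
                healthy_threshold: float = 0.98) -> int:
    """Return f(i+1) for one round of the retry-window update.

    f: current detection window f(i)
    g: detections actually sent in this round, g(i)
    s: success rate of those detections, s(i), in [0.0, 1.0]
    t: total number of requests
    """
    if s >= healthy_threshold:           # detection normal: s(i) in [98%, 100%]
        if g >= f:                       # the window was fully used
            # Grow by at most f (doubling), capped at 1% of traffic, at least 1.
            step = max(min(t // 100, f), 1)   # assumption: floor of 1% * t
            return f + step
        return f                         # window not filled: keep it unchanged
    return max(1, f // 2)                # abnormal detection: back off, floor at 1


# Growth from the initial window of 1 with t = 1000 and healthy detections:
window = 1
trace = [window]
for _ in range(5):
    window = next_window(window, g=window, s=1.0, t=1000)
    trace.append(window)
print(trace)  # [1, 2, 4, 8, 16, 26]
```

In this trace the window doubles until the 1% cap (10 requests) becomes the smaller term, after which growth turns linear. A single unhealthy round halves the window again.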
The initial window size is 1, and the parameters are tuned from test results.
Dynamic timeout uses an exponential moving average (EMA) of observed latency to adjust timeout thresholds, which gives requests headroom during short‑term latency spikes while still protecting resources.
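The text does not give the exact dynamic-timeout formula, so the following is a generic EMA sketch rather than QQ Music's algorithm. It assumes the timeout is a multiple of the smoothed latency, clamped between a floor and a ceiling. The class name `EmaTimeout` and every parameter value (`alpha`, `multiplier`, `floor_ms`, `ceiling_ms`) are hypothetical.

```python
class EmaTimeout:
    """Track an exponential moving average of latency and derive a timeout."""

    def __init__(self, initial_ms: float, alpha: float = 0.2,
                 multiplier: float = 2.0,
                 floor_ms: float = 50.0, ceiling_ms: float = 2000.0) -> None:
        self.ema_ms = initial_ms      # smoothed latency estimate
        self.alpha = alpha            # weight given to the newest sample
        self.multiplier = multiplier  # headroom above the average latency
        self.floor_ms = floor_ms      # never time out faster than this
        self.ceiling_ms = ceiling_ms  # never wait longer than this

    def observe(self, latency_ms: float) -> None:
        """Fold one observed latency into the moving average."""
        self.ema_ms = self.alpha * latency_ms + (1 - self.alpha) * self.ema_ms

    def timeout_ms(self) -> float:
        """Current timeout: a multiple of the EMA, clamped to [floor, ceiling]."""
        raw = self.multiplier * self.ema_ms
        return min(self.ceiling_ms, max(self.floor_ms, raw))


timer = EmaTimeout(initial_ms=100.0)
print(timer.timeout_ms())   # starts at 2x the initial estimate
timer.observe(200.0)        # a slower response nudges the average upward
print(timer.timeout_ms())
```

Because each sample moves the average by only `alpha`, a brief spike raises the timeout a little instead of causing immediate failures, while the ceiling bounds how long a caller can hold resources if latency keeps degrading.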
In summary, QQ Music combines robust architectural design, proactive tooling, and comprehensive observability to achieve high availability, continuously iterating on reliability practices across its massive micro‑service ecosystem.
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.
