
QQ Music High-Availability Architecture Overview

QQ Music achieves high availability by layering redundant multi‑datacenter architecture, proactive chaos‑engineering toolchains, and comprehensive observability—including metrics, logging, tracing and profiling—while employing service grading, adaptive retry windows and EMA‑based dynamic timeouts to gracefully handle faults across its massive micro‑service ecosystem.

Tencent Cloud Developer

Faults are inevitable in distributed systems; QQ Music focuses on embracing faults by building a high‑availability architecture that can gracefully handle failures.

The system is organized into three subsystems—architecture, toolchain, and observability. The architecture layer provides redundancy (cluster deployment, multi‑datacenter, multi‑center), automatic failover, and stability strategies such as distributed rate limiting, circuit breaking, and dynamic timeout.

The toolchain layer adopts chaos engineering (TMEChaos built on ChaosMesh) and full‑link pressure testing to proactively discover risk points. TMEChaos includes a dashboard, backend, API server, status manager, steady‑state monitor, ChaosMesh controller manager, and daemon, enabling fault injection and experiment orchestration.

The observability layer supplies metrics (Prometheus federation with golden metrics like traffic, latency, error, saturation), logging (ELK stack with Filebeat → Kafka → Logstash → Elasticsearch → Kibana), tracing (Jaeger pipeline: agent → collector → ingester → Elasticsearch → query), profiling (conprof for continuous performance analysis), and dumps (panic interception reported to Sentry).

Service grading defines four importance levels (1‑critical, 2‑high, 3‑medium, 4‑low) that guide traffic routing, SLA definition, and disaster‑recovery priorities.
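As a rough illustration of how grading can drive disaster-recovery priority, the sketch below orders services so that more important levels are restored first. The four levels come from the text above; the `recovery_order` helper and the service names are hypothetical and are not taken from QQ Music's implementation.

```python
from enum import IntEnum


class ServiceLevel(IntEnum):
    """The four importance levels; a lower number means more important."""
    CRITICAL = 1
    HIGH = 2
    MEDIUM = 3
    LOW = 4


def recovery_order(services: dict) -> list:
    """Return service names sorted by level, then by name for a stable order.

    `services` maps a service name to its ServiceLevel (hypothetical shape).
    """
    return sorted(services, key=lambda name: (services[name], name))


if __name__ == "__main__":
    # Hypothetical services, for illustration only.
    demo = {
        "comments": ServiceLevel.LOW,
        "playback": ServiceLevel.CRITICAL,
        "search": ServiceLevel.HIGH,
    }
    print(recovery_order(demo))
```

The same ordering could plausibly feed traffic routing or load shedding, but how QQ Music applies the levels in those paths is not described here.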

Adaptive retry and API‑gateway failover are realized through a retry‑window algorithm with detection and back‑off strategies. The algorithm is:

// f(i): detection window in round i; g(i): actual detections in round i;
// s(i): detection success rate in round i; t: total requests.
if s(i) in [98%, 100%] { // detection normal
    if g(i) >= f(i) {
        f(i+1) = f(i) + max(min(1% * t, f(i)), 1)
    } else {
        f(i+1) = f(i) // keep window unchanged
    }
} else {
    f(i+1) = max(1, f(i)/2) // back‑off on abnormal detection
}
// Initial window size is 1; parameters are tuned from tests.
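The update rule above can be sketched as a small Python function. This is a minimal sketch, not the production code: the name `next_window`, the `healthy_threshold` parameter, and the choice to round `1% * t` and `f(i)/2` down to integers are assumptions made for illustration.

```python
def next_window(f_i: int, g_i: int, s_i: float, t: int,
                healthy_threshold: float = 0.98) -> int:
    """Return f(i+1), the detection window for the next round.

    f_i: current detection window, g_i: detections actually performed,
    s_i: detection success rate in [0, 1], t: total requests.
    """
    if healthy_threshold <= s_i <= 1.0:
        if g_i >= f_i:
            # Window fully used and healthy: grow by 1% of traffic,
            # but never more than doubling and never less than 1.
            return f_i + max(min(int(0.01 * t), f_i), 1)
        return f_i  # window not exhausted: keep it unchanged
    # Abnormal detection: halve the window, with a floor of 1.
    return max(1, f_i // 2)


if __name__ == "__main__":
    window = 1  # initial window size is 1
    history = [window]
    for _ in range(6):
        # Assume every round uses the whole window and stays healthy.
        window = next_window(window, window, 1.0, t=1000)
        history.append(window)
    print(history)
```

With `t = 1000`, the window doubles while it is smaller than 1% of traffic and then grows linearly by 10 per round, so recovery ramps quickly at first without flooding a backend that has only just become healthy again.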

Dynamic timeout leverages an EMA‑based approach to adjust timeout thresholds according to average latency, providing elasticity for short‑term spikes while protecting resources.
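One way to picture an EMA-based timeout is the sketch below. It is an illustrative simplification under stated assumptions, not QQ Music's actual algorithm: the class name `EmaTimeout`, the smoothing factor `alpha`, the `multiplier` headroom, and the minimum and maximum bounds are all hypothetical parameters.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EmaTimeout:
    alpha: float = 0.2              # EMA smoothing factor (assumed value)
    multiplier: float = 2.0         # headroom over average latency (assumed)
    min_timeout_ms: float = 50.0    # lower bound on the timeout (assumed)
    max_timeout_ms: float = 1000.0  # hard cap that protects resources (assumed)
    ema_ms: Optional[float] = None  # current average latency, None until first sample

    def observe(self, latency_ms: float) -> None:
        """Fold one observed latency into the exponential moving average."""
        if self.ema_ms is None:
            self.ema_ms = latency_ms
        else:
            self.ema_ms = self.alpha * latency_ms + (1 - self.alpha) * self.ema_ms

    def timeout_ms(self) -> float:
        """Timeout = multiplier * EMA, clamped to [min_timeout_ms, max_timeout_ms]."""
        if self.ema_ms is None:
            return self.max_timeout_ms  # no data yet: fall back to the cap
        scaled = self.multiplier * self.ema_ms
        return min(self.max_timeout_ms, max(self.min_timeout_ms, scaled))
```

In this sketch the multiplier leaves room for short latency spikes, while the cap keeps a persistently slow dependency from tying up callers indefinitely.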

In summary, QQ Music combines robust architectural design, proactive tooling, and comprehensive observability to achieve high availability, continuously iterating on reliability practices across its massive micro‑service ecosystem.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
