
How NetEase Cloud Music Built a Resilient RPC Framework for Microservices

This article details the practical steps and architectural choices NetEase Cloud Music took to improve RPC stability in a micro‑service environment, covering service discovery, connection management, cloud‑native challenges, SLO design, log governance, degradation, rate limiting, outlier detection, thread‑pool isolation, fast‑failure handling, registry optimizations, multi‑registry support, and post‑incident knowledge‑base building.

dbaplus Community

Jan 22, 2024


Overall Architecture

After the migration from a monolith, the RPC framework became the backbone of Cloud Music’s micro‑service ecosystem. It connects services such as user, membership, advertising, and data platforms that run on independent nodes and clusters.

Key RPC Challenges

Service discovery – quickly locate all provider nodes and propagate offline events to consumers.

Connection management – decide between single connections or pools and handle reconnection after network jitter.

Cloud‑native readiness – detect abnormal nodes in containerized deployments and trigger circuit‑breakers.

Retry strategies – choose appropriate retry policies for timeouts and errors.

Stability Stages

Stability is divided into three phases: pre‑failure (prevention), during‑failure (detection, recovery, root‑cause analysis), and post‑failure (knowledge accumulation).

Pre‑failure – SLO Construction

You can't manage what you can't measure.

Metrics are collected from exception logs, core indicators (thread‑pool, CPU, memory, GC), and automated test results. The focus is on user‑impact metrics such as interface success rate and response time (RT) rather than raw resource usage.

Example SLOs:

90% of RPC requests complete within 200 ms.

99% of requests return HTTP 200.
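As a rough illustration, the two example SLOs can be checked against a window of recorded calls. This is a minimal sketch rather than the team's tooling: the `Call` record, the function names, and the default thresholds are assumptions that mirror the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    latency_ms: float
    ok: bool  # True when the request returned a success status


def fraction(calls, predicate):
    """Share of calls in the window that satisfy predicate."""
    if not calls:
        raise ValueError("no calls in the evaluation window")
    return sum(1 for c in calls if predicate(c)) / len(calls)


def slo_report(calls, latency_limit_ms=200.0, latency_target=0.90,
               success_target=0.99):
    """Compare one window of calls against a latency SLO and a success SLO."""
    within = fraction(calls, lambda c: c.latency_ms <= latency_limit_ms)
    success = fraction(calls, lambda c: c.ok)
    return {
        "within_limit": within,
        "success_rate": success,
        "latency_met": within >= latency_target,
        "success_met": success >= success_target,
    }
```

A window can meet the latency objective while missing the success objective, so the two are reported separately instead of being folded into one score.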

Pre‑failure – Log Governance

Trace linking – attach a traceId to each log entry so that end‑to‑end request flow can be visualized in APM.

Asynchronous adaptation – preserve traceId across thread‑pool switches to avoid loss during async processing.

Core log standardization – unify log prefixes, module names, and error codes for fast filtering.
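The asynchronous adaptation point can be sketched in Python with `contextvars`: a task handed to a thread pool does not reliably see the caller's context, so the context is captured at submit time and the task runs inside that copy. The names `trace_id`, `log_line`, and `submit_with_trace` are illustrative assumptions, not the framework's API.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Holds the traceId of the request currently being handled.
trace_id = contextvars.ContextVar("trace_id", default="-")


def log_line(message):
    """Format a log entry with the traceId of the current context."""
    return f"[traceId={trace_id.get()}] {message}"


def submit_with_trace(pool, fn, *args):
    """Submit fn so that it runs inside a copy of the caller's context."""
    ctx = contextvars.copy_context()
    return pool.submit(ctx.run, fn, *args)


def handle_request(pool, request_trace_id):
    """Set the traceId for this request, then log from a worker thread."""
    token = trace_id.set(request_trace_id)
    try:
        return submit_with_trace(pool, log_line, "handled").result()
    finally:
        trace_id.reset(token)  # do not leak the traceId into the next request
```

Wrapping the submit call keeps the propagation in one place, so business code does not have to pass the traceId around by hand.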

During Failure – Degradation Platform

The internal degradation platform provides zero‑code integration and dynamic rule updates:

Template rules – automatically trigger degradation when error rates exceed thresholds.

Fallback strategies – method fallback, fixed values, or cache‑backed responses.

Monitoring & alerts – granular metrics routed to responsible owners.

Rule updates within seconds – changes take effect without redeployment.
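To make the template rule and fallback ideas concrete, here is a minimal sketch of an error-rate rule guarding a call. The class name, the sliding-window design, and the default values are assumptions for illustration. Recovery probing after degradation is deliberately left out.

```python
from collections import deque


class DegradationRule:
    """Degrade once the error rate over the last `window` calls hits `threshold`."""

    def __init__(self, threshold=0.5, window=10, min_calls=5):
        self.threshold = threshold   # can be changed at runtime
        self.min_calls = min_calls   # avoid degrading on too few samples
        self.results = deque(maxlen=window)  # True marks a failed call

    def record(self, failed):
        self.results.append(failed)

    def degraded(self):
        if len(self.results) < self.min_calls:
            return False
        return sum(self.results) / len(self.results) >= self.threshold


def call_with_fallback(rule, primary, fallback):
    """Run primary unless the rule is degraded; use fallback on failure."""
    if rule.degraded():
        return fallback()  # skip the failing dependency entirely
    try:
        result = primary()
    except Exception:
        rule.record(True)
        return fallback()
    rule.record(False)
    return result
```

The `fallback` callable can return a fixed value, read from a cache, or call an alternative method, which corresponds to the three fallback strategies listed above. Because `threshold` is a plain attribute, a control plane could update it without a redeploy.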

During Failure – Rate Limiting

Integration with the internal rate‑limiting platform supports:

Single‑machine and concurrent limiting.

Distributed limiting.

Parameter‑based limiting.

High‑frequency limiting.
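The article does not say which algorithms the platform uses. As one common option, single-machine limiting can be sketched with a token bucket, and parameter-based limiting with one bucket per parameter value. Both classes below are illustrative assumptions and are not thread-safe as written.

```python
import time


class TokenBucket:
    """Allow `rate_per_sec` requests on average, with bursts up to `burst`."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def try_acquire(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class ParameterLimiter:
    """Keep one bucket per parameter value, for example per user ID."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate, self.burst, self.clock = rate_per_sec, burst, clock
        self.buckets = {}

    def try_acquire(self, key):
        if key not in self.buckets:
            self.buckets[key] = TokenBucket(self.rate, self.burst, self.clock)
        return self.buckets[key].try_acquire()
```

Passing the clock in as a parameter keeps the limiter deterministic under test. A distributed limiter would need shared state outside the process, which this sketch does not attempt.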

During Failure – Outlier Node Removal

When a provider node shows an abnormal error rate, it is temporarily removed from routing tables. After successful health checks, the node is reintegrated.

During Failure – Thread‑Pool Isolation

Product isolation – different products (e.g., Cloud Music, Live) route to separate clusters and pools.

Granular isolation – isolation by application, interface, or method.

Retry‑request separation – retry traffic uses dedicated pools to avoid blocking normal traffic.
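The isolation ideas above can be sketched as a set of named pools plus a dedicated retry pool. The group names and pool sizes are placeholders, and `IsolatedPools` is a hypothetical helper, not part of the framework described here.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class IsolatedPools:
    """Route work to a pool per group, with retries in their own pool."""

    def __init__(self, pool_sizes, retry_size=2):
        self.pools = {
            name: ThreadPoolExecutor(max_workers=size, thread_name_prefix=name)
            for name, size in pool_sizes.items()
        }
        self.retry_pool = ThreadPoolExecutor(
            max_workers=retry_size, thread_name_prefix="retry")

    def submit(self, group, fn, *args, is_retry=False):
        # Retries never compete with first attempts for the same threads.
        pool = self.retry_pool if is_retry else self.pools[group]
        return pool.submit(fn, *args)

    def shutdown(self):
        for pool in list(self.pools.values()) + [self.retry_pool]:
            pool.shutdown(wait=True)


def worker_name():
    """Return the name of the thread running the task (for demonstration)."""
    return threading.current_thread().name
```

The group key can be a product, an application, an interface, or a method name, depending on how fine the isolation needs to be. Finer keys cost more threads, so the granularity is a trade-off.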

During Failure – Fast Failure

Multi‑layer timeout checks abort requests early (e.g., if pre‑processing already exceeds 100 ms). Async processing further reduces thread consumption.
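One way to implement layered timeout checks is to carry a deadline through each stage and abort as soon as the budget is spent. In this sketch the `Deadline` class, the stage names, and the `handle` function are assumptions; the 100 ms figure in the usage comes from the example above.

```python
import time


class DeadlineExceeded(Exception):
    pass


class Deadline:
    """A request-scoped time budget that each processing layer can check."""

    def __init__(self, timeout_ms, clock=time.monotonic):
        self.clock = clock
        self.expires_at = clock() + timeout_ms / 1000.0

    def remaining_ms(self):
        return (self.expires_at - self.clock()) * 1000.0

    def check(self, stage):
        if self.remaining_ms() <= 0:
            raise DeadlineExceeded(f"deadline exceeded before {stage}")


def handle(deadline, preprocess, invoke):
    """Check the budget before every stage instead of only at the end."""
    deadline.check("preprocess")
    payload = preprocess()
    # If pre-processing already used the budget, skip the remote call.
    deadline.check("invoke")
    return invoke(payload, deadline.remaining_ms())
```

Passing the remaining budget to `invoke` lets the downstream call use a shorter timeout than the original one, so a late request does not hold a thread for its full nominal timeout.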

During Failure – Weak Dependency on Registry

The framework treats Zookeeper as a weak dependency: connection failures trigger asynchronous retries without blocking application startup.
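The weak-dependency behaviour can be sketched as a client that connects on a background thread and retries on failure, so the startup path never waits on the registry. `RegistryClient` and its `connect` callable are hypothetical stand-ins for a real registry client.

```python
import threading


class RegistryClient:
    """Connect to the registry in the background and retry on failure."""

    def __init__(self, connect, retry_interval_s=1.0):
        self._connect = connect            # callable that may raise ConnectionError
        self._interval = retry_interval_s
        self.connected = threading.Event()
        self._stop = threading.Event()

    def start(self):
        """Return immediately; application startup is not blocked."""
        thread = threading.Thread(target=self._loop, daemon=True)
        thread.start()
        return thread

    def _loop(self):
        while not self._stop.is_set():
            try:
                self._connect()
            except ConnectionError:
                # Sleep between attempts, but wake up early on stop().
                self._stop.wait(self._interval)
            else:
                self.connected.set()
                return

    def stop(self):
        self._stop.set()
```

Code that needs the registry can check `connected` and fall back to a cached provider list in the meantime. That caching is a design assumption and is not shown here.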

Registry Optimizations (Zookeeper)

Config & retry tuning – extend sessionTimeout to 30 s to tolerate short network glitches.

Event‑listener improvement – ignore unchanged session IDs to avoid unnecessary re‑registrations.

Dynamic configuration – parameters such as retries and timeouts can be changed at runtime.
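The listener improvement relies on a ZooKeeper property: ephemeral nodes belong to a session, so a reconnect that keeps the same session does not need a new registration. A minimal sketch follows, with `SessionListener` and its callback as hypothetical names rather than a real client API.

```python
class SessionListener:
    """Re-register providers only when the registry session actually changed."""

    def __init__(self, register):
        self._register = register       # callable that re-creates registrations
        self._last_session_id = None

    def on_connected(self, session_id):
        """Handle a (re)connect event; return True if a registration ran."""
        if session_id == self._last_session_id:
            # Same session: existing ephemeral nodes are still valid.
            return False
        self._last_session_id = session_id
        self._register()
        return True
```

With a longer session timeout, a brief network glitch usually ends in a reconnect on the same session, and this check turns that event into a no-op.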

Multi‑Registry Support

The RPC framework now registers with both Zookeeper and Nacos at the same time. The routing preference between the two is configurable, and the framework falls back automatically to the alternate registry when the preferred one is unavailable.
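Dual registration with a preferred lookup order can be sketched as follows. `InMemoryRegistry` is a stand-in that simulates an outage with a flag, and `MultiRegistry` is a hypothetical wrapper. Neither reflects the real Zookeeper or Nacos client APIs.

```python
class InMemoryRegistry:
    """Stand-in for a registry client; `available` simulates an outage."""

    def __init__(self):
        self.available = True
        self.services = {}

    def register(self, service, address):
        if not self.available:
            raise ConnectionError("registry unavailable")
        self.services.setdefault(service, []).append(address)

    def lookup(self, service):
        if not self.available:
            raise ConnectionError("registry unavailable")
        return list(self.services.get(service, []))


class MultiRegistry:
    """Register with every registry; look up in preference order."""

    def __init__(self, registries, preferred):
        self.registries = registries  # name -> registry client
        self.order = [preferred] + [n for n in registries if n != preferred]

    def register(self, service, address):
        registered = []
        for name, registry in self.registries.items():
            try:
                registry.register(service, address)
                registered.append(name)
            except ConnectionError:
                pass  # a real client would retry in the background
        return registered

    def lookup(self, service):
        for name in self.order:
            try:
                nodes = self.registries[name].lookup(service)
            except ConnectionError:
                continue  # fall back to the next registry
            if nodes:
                return name, nodes
        return None, []
```

Returning the registry name along with the nodes makes it visible in logs and metrics when traffic is being served from the fallback registry.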

Post‑failure – Knowledge‑Base Accumulation

Incidents are recorded in a knowledge base that matches exception types and stack traces to remediation steps, enabling developers to resolve recurring issues without manual search.
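The matching step could look roughly like this: each entry pairs an exception type with a stack-trace pattern and a remediation note. The entry fields and the sample entry in the usage are invented for illustration and do not come from the actual knowledge base.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownIssue:
    exception_type: str
    stack_pattern: str  # regular expression searched in the stack trace
    remediation: str


def find_remediations(issues, exception_type, stack_trace):
    """Return remediation notes for entries matching both type and stack."""
    return [
        issue.remediation
        for issue in issues
        if issue.exception_type == exception_type
        and re.search(issue.stack_pattern, stack_trace)
    ]
```

A hypothetical entry and lookup:

```python
issues = [
    KnownIssue("TimeoutException", r"ConnectionPool\.borrow",
               "Check provider thread-pool saturation before raising timeouts."),
]
notes = find_remediations(
    issues, "TimeoutException",
    "TimeoutException\n  at ConnectionPool.borrow(ConnectionPool.java)")
```

Matching on both the exception type and a stack pattern keeps generic exceptions from pulling in unrelated remediation steps.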

Conclusion

RPC stability and performance are fundamental for reliable micro‑services. Continuous governance, cloud‑native adaptations, and robust observability are essential to maintain high availability while supporting cost‑effective operations.


