How NetEase Cloud Music Built a Resilient RPC Framework for Microservices
This article details the practical steps and architectural choices NetEase Cloud Music took to improve RPC stability in a micro‑service environment, covering service discovery, connection management, cloud‑native challenges, SLO design, log governance, degradation, rate limiting, outlier detection, thread‑pool isolation, fast‑failure handling, registry optimizations, multi‑registry support, and post‑incident knowledge‑base building.
Overall Architecture
The RPC framework is the backbone of Cloud Music’s micro‑service ecosystem, connecting services such as user, membership, advertising, and data platforms across independent nodes and clusters after the migration from a monolith.
Key RPC Challenges
Service discovery – quickly locate all provider nodes and propagate offline events to consumers.
Connection management – choose between a single connection and a connection pool, and handle reconnection after network jitter.
Cloud‑native readiness – detect abnormal nodes in containerized deployments and trigger circuit‑breakers.
Retry strategies – choose appropriate retry policies for timeouts and errors.
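The retry challenge can be made concrete with a small sketch. The article does not say which policy Cloud Music settled on, so everything here is an assumption: the exception names `RpcTimeout` and `RpcBusinessError`, the helper `call_with_retry`, and the choice to retry only timeouts with exponential backoff and jitter.

```python
import random
import time


class RpcTimeout(Exception):
    """Hypothetical error: the call did not finish within its deadline."""


class RpcBusinessError(Exception):
    """Hypothetical error: the provider rejected the request; retrying will not help."""


def call_with_retry(invoke, max_attempts=3, base_delay=0.05, sleep=time.sleep):
    """Retry timeouts only, waiting a random, exponentially growing delay between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return invoke()
        except RpcTimeout:
            if attempt == max_attempts:
                raise  # attempts exhausted: surface the timeout to the caller
            # Jitter spreads retries out so that clients do not retry in lockstep.
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
```

Any other exception, including `RpcBusinessError`, propagates on the first attempt, which keeps retries from amplifying load for failures that a second try cannot fix.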
Stability Stages
Stability is divided into three phases: pre‑failure (prevention), during‑failure (detection, recovery, root‑cause analysis), and post‑failure (knowledge accumulation).
Pre‑failure – SLO Construction
You can't manage what you can't measure.
Metrics are collected from exception logs, core indicators (thread‑pool, CPU, memory, GC), and automated test results. The focus is on user‑impact metrics such as interface success rate and response time (RT) rather than raw resource usage.
Example SLOs:
90% of RPC requests complete within 200 ms.
99% of requests return HTTP 200.
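As an illustration, the two example SLOs could be evaluated over a window of request samples roughly as follows. This is a sketch under assumptions: the function names, the `(latency_ms, succeeded)` sample shape, and the nearest-rank percentile method are not taken from the article.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_slos(samples, latency_ms=200, latency_pct=90, success_target=0.99):
    """samples: list of (latency_ms, succeeded) tuples from one evaluation window."""
    latencies = [latency for latency, _ in samples]
    success_rate = sum(1 for _, ok in samples if ok) / len(samples)
    observed = percentile(latencies, latency_pct)
    return {
        "latency_percentile_ms": observed,
        "latency_ok": observed <= latency_ms,
        "success_rate": success_rate,
        "success_ok": success_rate >= success_target,
    }
```

For a window of 100 requests in which 90 finish in 100 ms and 2 fail, the latency objective is met while the success objective is not, which is exactly the kind of user-facing signal that raw CPU or memory figures would miss.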
Pre‑failure – Log Governance
Trace linking – attach a traceId to each log entry so that end‑to‑end request flow can be visualized in APM.
Asynchronous adaptation – preserve traceId across thread‑pool switches to avoid loss during async processing.
Core log standardization – unify log prefixes, module names, and error codes for fast filtering.
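The asynchronous adaptation point is easy to get wrong, so here is a minimal sketch of the idea using Python's `contextvars`. The names `trace_id`, `submit_with_trace`, and `log` are hypothetical; the article does not describe how the framework itself carries the traceId.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Request-scoped trace ID; "-" marks a log line with no trace attached.
trace_id = contextvars.ContextVar("trace_id", default="-")


def log(message):
    """Prefix every log line with the current traceId."""
    return f"[traceId={trace_id.get()}] {message}"


def submit_with_trace(executor, fn, *args):
    """Copy the caller's context so the pooled worker sees the same traceId."""
    ctx = contextvars.copy_context()
    return executor.submit(ctx.run, fn, *args)


if __name__ == "__main__":
    trace_id.set("abc123")
    with ThreadPoolExecutor(max_workers=1) as pool:
        print(submit_with_trace(pool, log, "query user service").result())
```

Submitting `log` to the pool directly would not carry the caller's context into the worker, which is how a trace gets cut at a thread-pool boundary.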
During Failure – Degradation Platform
The internal degradation platform provides zero‑code integration and dynamic rule updates:
Template rules – automatically trigger degradation when error rates exceed thresholds.
Fallback strategies – method fallback, fixed values, or cache‑backed responses.
Monitoring & alerts – granular metrics routed to responsible owners.
Rule updates within seconds – changes take effect without redeployment.
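A template rule with a fallback can be sketched as below. This is not the degradation platform's API: `DegradationRule`, `call_with_fallback`, the sliding window of recent outcomes, and the default thresholds are all assumptions chosen for illustration.

```python
from collections import deque


class DegradationRule:
    """Degrade once the error rate over the most recent calls exceeds a threshold."""

    def __init__(self, error_threshold=0.5, window=20, min_calls=10):
        self.error_threshold = error_threshold  # may be changed at runtime
        self.min_calls = min_calls              # avoid tripping on a tiny sample
        self.outcomes = deque(maxlen=window)    # True means the call failed

    def degraded(self):
        if len(self.outcomes) < self.min_calls:
            return False
        return sum(self.outcomes) / len(self.outcomes) > self.error_threshold


def call_with_fallback(rule, primary, fallback):
    """Serve the fallback when the rule has tripped or the primary call fails."""
    if rule.degraded():
        return fallback()
    try:
        result = primary()
    except Exception:
        rule.outcomes.append(True)
        return fallback()
    rule.outcomes.append(False)
    return result
```

The fallback callable can return a fixed value, call another method, or read from a cache, which covers the three strategies listed above. The sketch leaves out recovery: a real rule would also need a way to probe the primary again and lift the degradation.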
During Failure – Rate Limiting
Integration with the internal rate‑limiting platform supports:
Single‑machine and concurrency limiting.
Distributed limiting.
Parameter‑based limiting.
High‑frequency limiting.
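The article does not say which algorithms the rate-limiting platform uses. As one common way to implement the single-machine case, here is a token-bucket sketch; the class name, parameters, and injectable clock are assumptions, and the distributed and parameter-based modes are not shown.

```python
import threading
import time


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()
        self.lock = threading.Lock()

    def try_acquire(self):
        """Return True if the request may proceed, False if it should be rejected."""
        with self.lock:
            now = self.clock()
            # Refill in proportion to elapsed time, never beyond capacity.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False
```

Passing the clock in as a parameter keeps the limiter deterministic under test, since time can be advanced by hand instead of sleeping.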
During Failure – Outlier Node Removal
When a provider node shows an abnormal error rate, it is temporarily removed from routing tables. After successful health checks, the node is reintegrated.
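The eject-then-reintegrate cycle can be sketched as follows. The class `OutlierDetector`, its thresholds, and the per-node counters are assumptions for illustration; the article does not give the framework's actual criteria.

```python
class OutlierDetector:
    """Track per-node error rates and keep abnormal nodes out of routing."""

    def __init__(self, nodes, error_threshold=0.5, min_calls=10):
        self.error_threshold = error_threshold
        self.min_calls = min_calls
        self.stats = {node: [0, 0] for node in nodes}  # node -> [calls, errors]
        self.ejected = set()

    def record(self, node, ok):
        """Record one call result and eject the node if its error rate is abnormal."""
        counters = self.stats[node]
        counters[0] += 1
        if not ok:
            counters[1] += 1
        calls, errors = counters
        if calls >= self.min_calls and errors / calls >= self.error_threshold:
            self.ejected.add(node)

    def routable(self):
        """Nodes that consumers may currently route to."""
        return [node for node in self.stats if node not in self.ejected]

    def health_check_passed(self, node):
        """Reintegrate a node and reset its counters after a successful health check."""
        self.ejected.discard(node)
        self.stats[node] = [0, 0]
```

A production version would normally also cap how many nodes can be ejected at once, so that a widespread fault does not empty the routing table.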
During Failure – Thread‑Pool Isolation
Product isolation – different products (e.g., Cloud Music, Live) route to separate clusters and pools.
Granular isolation – isolation by application, interface, or method.
Retry‑request separation – retry traffic uses dedicated pools to avoid blocking normal traffic.
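Retry-request separation can be illustrated with two executors. This is a sketch only: the class name, the pool sizes, and the `is_retry` flag are hypothetical, and product-level or method-level isolation would add more pools keyed the same way.

```python
from concurrent.futures import ThreadPoolExecutor


class IsolatedExecutors:
    """Run retry traffic in its own small pool so it cannot starve normal traffic."""

    def __init__(self, normal_workers=8, retry_workers=2):
        self.pools = {
            "normal": ThreadPoolExecutor(normal_workers, thread_name_prefix="rpc-normal"),
            "retry": ThreadPoolExecutor(retry_workers, thread_name_prefix="rpc-retry"),
        }

    def submit(self, fn, *args, is_retry=False):
        """Route the task to the pool that matches its traffic class."""
        pool = self.pools["retry" if is_retry else "normal"]
        return pool.submit(fn, *args)

    def shutdown(self):
        for pool in self.pools.values():
            pool.shutdown()
```

Because the retry pool is bounded separately, a burst of retries against a slow provider queues behind its own workers and leaves the normal pool's capacity untouched.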
During Failure – Fast Failure
Multi‑layer timeout checks abort requests early (e.g., if pre‑processing already exceeds 100 ms). Async processing further reduces thread consumption.
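One way to picture multi-layer timeout checks is a deadline object that every stage consults before doing work. The names `Deadline`, `DeadlineExceeded`, and `handle` are assumptions, and the two-stage pipeline is a simplification of whatever layers the framework actually has.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when the request's time budget is already spent."""


class Deadline:
    def __init__(self, budget_ms, clock=time.monotonic):
        self.clock = clock
        self.expires_at = clock() + budget_ms / 1000

    def remaining_ms(self):
        return (self.expires_at - self.clock()) * 1000

    def check(self, stage):
        """Abort before starting a stage if no budget is left."""
        if self.remaining_ms() <= 0:
            raise DeadlineExceeded(f"budget exhausted before {stage}")


def handle(deadline, preprocess, invoke):
    """Check the deadline at each layer instead of only at the end."""
    deadline.check("preprocess")
    request = preprocess()
    deadline.check("invoke")  # skip the remote call if pre-processing used up the budget
    return invoke(request)
```

With a 100 ms budget, a request whose pre-processing takes 150 ms fails at the second check and never reaches the provider, so no worker thread is spent on a response the caller has already given up on.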
During Failure – Weak Dependency on Registry
The framework treats ZooKeeper as a weak dependency: if the registry cannot be reached, the connection is retried asynchronously and application startup is not blocked.
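The weak-dependency behavior can be sketched with a background retry loop. Nothing here is a ZooKeeper client API: `WeakRegistryClient` and the `connect` callable are hypothetical stand-ins for the real connection logic.

```python
import threading


class WeakRegistryClient:
    """Connect to the registry in the background so startup never waits on it."""

    def __init__(self, connect, retry_interval=1.0):
        self._connect = connect              # hypothetical callable; raises ConnectionError on failure
        self._retry_interval = retry_interval
        self.connected = threading.Event()
        self._stop = threading.Event()

    def start(self):
        """Return immediately; connection attempts continue on a daemon thread."""
        threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self):
        while not self._stop.is_set():
            try:
                self._connect()
            except ConnectionError:
                # Wait before the next attempt; wakes early if stop() is called.
                self._stop.wait(self._retry_interval)
            else:
                self.connected.set()
                return

    def stop(self):
        self._stop.set()
```

The application calls `start()` during boot and carries on; code that needs the registry can check `connected.is_set()` and, until it is set, rely on whatever provider list it already has.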
Registry Optimizations (ZooKeeper)
Config & retry tuning – extend sessionTimeout to 30 s to tolerate short network glitches.
Event‑listener improvement – ignore unchanged session IDs to avoid unnecessary re‑registrations.
Dynamic configuration – parameters such as retries and timeouts can be changed at runtime.
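The event-listener improvement relies on the fact that ZooKeeper ephemeral nodes live as long as their session does, so a reconnect that keeps the same session ID leaves the existing registrations in place. A sketch of the check, with the class `SessionListener` and the `register` callback as hypothetical names:

```python
class SessionListener:
    """Re-register providers only when the registry session has actually changed."""

    def __init__(self, register):
        self._register = register        # hypothetical callback that re-creates registrations
        self._last_session_id = None

    def on_connected(self, session_id):
        """Handle a (re)connect event; return True if a re-registration was performed."""
        if session_id == self._last_session_id:
            # Same session: the ephemeral nodes survived the disconnect.
            return False
        self._last_session_id = session_id
        self._register()
        return True
```

Together with a longer sessionTimeout, this means a short network glitch results in a reconnect on the same session and no registration churn for consumers to process.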
Multi‑Registry Support
The RPC framework now registers with both ZooKeeper and Nacos at the same time. The routing preference is configurable, and consumers fall back automatically to the other registry when the preferred one is unavailable.
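The preference-plus-fallback lookup can be sketched as below. The function `discover` and the shape of `registries` (a mapping from registry name to a lookup callable) are assumptions, not the framework's interface or either registry's client API.

```python
def discover(service, registries, preferred):
    """Look up providers in the preferred registry first, then in the others.

    registries maps a registry name to a callable that returns a list of
    provider addresses or raises ConnectionError when the registry is down.
    """
    order = [preferred] + [name for name in registries if name != preferred]
    for name in order:
        try:
            nodes = registries[name](service)
        except ConnectionError:
            continue  # registry unreachable: try the next one
        if nodes:
            return name, nodes
    return None, []
```

Returning the registry name alongside the addresses makes it visible in logs and metrics when traffic is being served from the fallback.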
Post‑failure – Knowledge‑Base Accumulation
Incidents are recorded in a knowledge base that matches exception types and stack traces to remediation steps, enabling developers to resolve recurring issues without manual search.
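The matching step can be pictured as a lookup keyed by exception type and a stack-frame pattern. The two entries below are invented placeholders, not real incidents, and the exception names, patterns, and the `lookup` function are all hypothetical.

```python
import re

# Hypothetical entries; a real knowledge base would be filled from recorded incidents.
KNOWLEDGE_BASE = [
    {
        "exception": "RpcTimeoutException",
        "frame": re.compile(r"ConnectionPool\.acquire"),
        "remediation": "Check connection-pool size and provider latency.",
    },
    {
        "exception": "NoProviderException",
        "frame": re.compile(r"RegistryDirectory\.list"),
        "remediation": "Verify that the provider is registered and the registry is reachable.",
    },
]


def lookup(exception_type, stack_trace):
    """Return the remediation for the first entry matching both type and stack frame."""
    for entry in KNOWLEDGE_BASE:
        if entry["exception"] == exception_type and entry["frame"].search(stack_trace):
            return entry["remediation"]
    return None
```

Matching on a stack frame as well as the exception type matters because the same exception can have different causes depending on where it was thrown.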
Conclusion
RPC stability and performance are fundamental for reliable micro‑services. Continuous governance, cloud‑native adaptations, and robust observability are essential to maintain high availability while supporting cost‑effective operations.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
