How We Built a Rock‑Solid RPC Framework for Cloud‑Native Microservices

This article details the challenges of RPC stability in a large‑scale microservice environment and explains the architectural redesign, SLO implementation, logging governance, exception dashboards, degradation, rate‑limiting, outlier removal, thread‑pool isolation, weak registry dependencies, and post‑incident knowledge‑base practices that together ensure reliable, high‑performance service communication.

NetEase Cloud Music Tech Team

Background

In a typical microservice architecture, RPC frameworks connect services and components, acting as a critical backbone for user, membership, advertising, and data platforms. With the rapid adoption of cloud‑native principles and cost‑reduction pressures, RPC reliability became a major concern because failures can cause playback interruptions, latency spikes, and user churn.

Overall Architecture

Overall RPC architecture diagram

After moving from a monolith to microservices, function calls inside a process no longer suffice, prompting the adoption of RPC. This introduced new challenges:

Service discovery – how consumers quickly find all provider nodes and receive timely updates when providers go offline.

Connection management – deciding between single connections or pools and handling reconnection after network jitter.

Cloud‑native considerations – detecting abnormal nodes in containerized deployments and performing rapid circuit‑breaking.

Retry strategies – selecting appropriate retry policies for timeouts and errors.
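Retry policy is the easiest of these to get wrong: blind retries amplify outages. As a concrete illustration, here is a minimal Python sketch of timeout-only retries with exponential backoff; the helper name and defaults are illustrative, not part of the framework described here.

```python
import time

def call_with_retry(call, max_attempts=3, base_delay=0.01, retry_on=(TimeoutError,)):
    """Retry a remote call only on designated errors (e.g., timeouts),
    backing off exponentially so retries don't pile onto a struggling provider."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts:
                raise  # budget exhausted: surface the failure to the caller
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Note that only errors in `retry_on` are retried; business exceptions propagate immediately, since retrying a deterministic failure just wastes provider capacity.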

Stability is broken into three phases:

Pre‑failure – prevention through automated tests, process controls, monitoring, and alerts.

During failure – rapid detection, recovery, and root‑cause isolation.

Post‑failure – analysis, standardization, and automation to avoid recurrence.

Pre‑failure

SLO System Construction

You can't manage what you can't measure.
SLO illustration

We selected a small, balanced set of metrics—success rate and response time (RT)—avoiding both alert fatigue from too many signals and blind spots from too few. By focusing on user‑visible impact (e.g., increased latency or reduced availability), the SLO platform triggers alerts that prompt developers to investigate promptly.
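To make the two metrics concrete, here is a simplified Python sketch of an SLO check over a window of request samples. The thresholds, the function name, and the upper-index percentile pick are all illustrative simplifications, not the platform's actual implementation.

```python
def slo_breached(samples, min_success_rate=0.999, max_p99_ms=200):
    """samples: list of (ok: bool, rt_ms: float) for one evaluation window.
    Returns True if either the success-rate or latency SLO is violated."""
    if not samples:
        return False
    success_rate = sum(ok for ok, _ in samples) / len(samples)
    rts = sorted(rt for _, rt in samples)
    # Simple upper-index percentile pick; a real platform would use a
    # proper quantile estimator over a sliding window.
    p99 = rts[min(len(rts) - 1, int(len(rts) * 0.99))]
    return success_rate < min_success_rate or p99 > max_p99_ms
```

An alert fires only when user-visible impact crosses a threshold, which is exactly what keeps the signal-to-noise ratio manageable.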

Log Governance

Early logs were noisy and lacked context. We introduced:

Trace linkage – attaching a traceId to logs enables end‑to‑end correlation across services.

Asynchronous adaptation – ensuring traceId propagation across thread‑pool switches to avoid loss during async processing.

Core link‑log refinement – completing and standardizing essential request logs.

Log standardization – unified prefixes, modules, and error codes for quick filtering by interface or IP.
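The async-adaptation point above is the subtle one: a traceId stored in thread-local state silently disappears when work hops to a pool thread. As a sketch of the idea in Python terms (the framework itself is not Python; `submit_with_trace` is a hypothetical helper), the caller's context can be captured at submit time and replayed in the worker:

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Context-local trace identifier, set once at request entry.
trace_id = contextvars.ContextVar("trace_id", default=None)

def submit_with_trace(executor, fn, *args):
    """Capture the submitting thread's context so trace_id survives
    the switch onto a pool thread instead of being lost."""
    ctx = contextvars.copy_context()
    return executor.submit(ctx.run, fn, *args)
```

Java frameworks achieve the same effect by wrapping `Runnable`s to copy MDC/ThreadLocal state; the principle is identical: snapshot at submission, restore at execution.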

Exception Dashboard

Exception dashboard

We aggregated all RPC‑related loggers and built a centralized exception dashboard that provides top‑N application statistics, outlier detection, and log sampling, helping operators quickly pinpoint problematic services and drill down to root causes.
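The top-N view is conceptually just a grouped count over standardized log records. A minimal Python sketch, assuming records already carry the unified `app` and error-`code` fields from the log-governance work (field names here are illustrative):

```python
from collections import Counter

def top_n_exceptions(log_records, n=3):
    """Group standardized log records by (app, error_code) and return
    the n noisiest offenders, the starting point for drill-down."""
    return Counter((r["app"], r["code"]) for r in log_records).most_common(n)
```

In practice the dashboard layers sampling and outlier detection on top, but the grouped count is what lets an operator see at a glance which application is misbehaving.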

During Failure

Degradation

Degradation flow

The internal degradation platform, already used by thousands of applications, offers:

Template rules – automatic degradation when error rates exceed thresholds.

Rich fallback strategies – invoking fallback methods, returning fixed values, or serving cached data.

Monitoring & alerts – detailed metrics and routing alerts to responsible owners.

Dynamic adjustments – rule changes propagate within seconds without redeployments.

Zero‑code integration – developers enable degradation via configuration only.
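The template-rule idea—trip to a fallback once the error rate over a window crosses a threshold—can be sketched in a few lines of Python. This is a toy model under stated assumptions (fixed window, no half-open recovery probing), not the internal platform's implementation:

```python
class Degrader:
    """Toy template rule: once the error rate over the last `window`
    calls exceeds `threshold`, route straight to the fallback."""
    def __init__(self, threshold=0.5, window=20):
        self.threshold, self.window, self.results = threshold, window, []

    def record(self, ok):
        self.results = (self.results + [ok])[-self.window:]

    @property
    def degraded(self):
        if len(self.results) < self.window:
            return False  # not enough data to judge
        return 1 - sum(self.results) / len(self.results) > self.threshold

    def call(self, fn, fallback):
        if self.degraded:
            return fallback()  # fixed value, cached data, or fallback method
        try:
            out = fn()
            self.record(True)
            return out
        except Exception:
            self.record(False)
            return fallback()
```

A production rule engine also needs recovery probing and per-rule dynamic configuration, which is what the seconds-level rule propagation above provides.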

Rate Limiting

Rate limiting diagram

Integration with the internal rate‑limiting platform provides single‑machine, concurrent, distributed, parameter‑based, and high‑frequency limiting strategies.
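Of those strategies, single-machine limiting is the simplest to illustrate; a token bucket is one common way to implement it (shown here as a generic sketch, not the internal platform's algorithm—the distributed and parameter-based variants build the same idea on shared state):

```python
import time

class TokenBucket:
    """Single-machine limiter: tokens refill at `rate` per second,
    up to `burst`; each allowed request spends one token."""
    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = float(burst), time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

The bucket tolerates short bursts up to `burst` while enforcing the long-run rate, which suits spiky RPC traffic better than a hard per-interval counter.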

Outlier Node Removal

Outlier detection

When a provider node exhibits abnormal error rates, it is temporarily excluded from routing for a configurable window, and re‑added after successful health checks, preserving overall SLO.
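A minimal Python sketch of the ejection logic follows. It models only the error-rate trigger and the time window (with an injected clock for clarity); the real mechanism also gates re-admission on active health checks, and all names and thresholds here are hypothetical.

```python
class OutlierEjector:
    """Eject a provider node from routing when its error rate spikes,
    for a fixed window; it becomes routable again once the window passes."""
    def __init__(self, error_threshold=0.5, min_requests=10, eject_seconds=30):
        self.error_threshold = error_threshold
        self.min_requests = min_requests      # avoid ejecting on tiny samples
        self.eject_seconds = eject_seconds
        self.stats = {}                       # node -> (ok_count, err_count)
        self.ejected = {}                     # node -> routable-again deadline

    def record(self, node, ok, now=0.0):      # `now`: injected monotonic clock
        okc, err = self.stats.get(node, (0, 0))
        okc, err = okc + ok, err + (not ok)
        self.stats[node] = (okc, err)
        total = okc + err
        if total >= self.min_requests and err / total > self.error_threshold:
            self.ejected[node] = now + self.eject_seconds
            self.stats[node] = (0, 0)         # fresh slate after re-admission

    def routable(self, nodes, now=0.0):
        return [n for n in nodes if self.ejected.get(n, float("-inf")) <= now]
```

The `min_requests` floor matters: without it, a single failed call on a quiet node would eject it and erode capacity for no reason.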

Thread‑Pool Isolation

To avoid contention, we isolate thread pools by product, application, interface, and method, and separate retry requests into dedicated pools.
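The bulkhead idea above can be sketched as one executor per routing key plus a dedicated retry pool, so a slow interface or a retry storm cannot starve unrelated traffic. This Python sketch keys pools by (interface, method) only; the described framework also partitions by product and application, and the class name is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

class IsolatedPools:
    """One thread pool per (interface, method), plus a separate pool for
    retries so retry bursts can't exhaust first-attempt capacity."""
    def __init__(self, pool_size=4):
        self.pool_size = pool_size
        self.pools = {}
        self.retry_pool = ThreadPoolExecutor(
            max_workers=pool_size, thread_name_prefix="rpc-retry")

    def submit(self, interface, method, fn, *args, is_retry=False):
        if is_retry:
            return self.retry_pool.submit(fn, *args)
        key = (interface, method)
        if key not in self.pools:
            self.pools[key] = ThreadPoolExecutor(
                max_workers=self.pool_size,
                thread_name_prefix=f"rpc-{interface}-{method}")
        return self.pools[key].submit(fn, *args)
```

The named thread prefixes also pay off during incidents: a thread dump immediately shows which interface's pool is saturated.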

Fast Failure

Pre‑ and post‑business‑logic timeout checks abort requests early (e.g., if a 100 ms deadline is already exceeded) and leverage async capabilities to reduce thread usage.
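The pre/post check pattern is worth spelling out: if the caller's deadline has already passed, running the handler (or serializing a response) only burns threads for a result nobody will read. A minimal sketch, with hypothetical names and an injectable start time for testing:

```python
import time

class DeadlineExceeded(Exception):
    pass

def with_deadline(handler, deadline_ms):
    """Wrap a request handler with pre- and post-business-logic
    deadline checks so doomed work is aborted as early as possible."""
    def wrapped(request, start=None):
        start = time.monotonic() if start is None else start
        def check(stage):
            if (time.monotonic() - start) * 1000 > deadline_ms:
                raise DeadlineExceeded(f"{deadline_ms}ms budget spent before {stage}")
        check("business logic")    # pre-check: don't start work already past deadline
        result = handler(request)
        check("response write")    # post-check: don't serialize a reply nobody awaits
        return result
    return wrapped
```

Propagating the remaining budget downstream (rather than a fixed per-hop timeout) is the natural next step, so each hop knows how much of, say, a 100 ms end-to-end deadline is left.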

Weak Dependency on Registry

Registry dependency diagram

Our RPC originally depended strongly on Zookeeper, causing massive provider offline events during Zookeeper glitches. By treating the registry as a weak dependency, we introduced:

Zookeeper tuning – longer session timeout (e.g., 30 s) and session‑aware event handling to reduce unnecessary writes.

Recycle bin – offline node metadata is cached and can be restored after large‑scale Zookeeper failures.

Automated degradation – when Zookeeper health degrades, the system automatically activates the recycle‑bin mechanism.

Multi‑registry support – optional Nacos registration alongside Zookeeper, with dynamic routing preferences and fallback strategies.

These changes allow the RPC framework to continue operating even when a registry experiences partial outages.
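The recycle-bin mechanism is the heart of the weak-dependency design, and its essence fits in a short sketch: keep metadata for nodes the registry reports offline, and restore them wholesale when mass offline events look like a registry glitch rather than real provider deaths. This Python model is illustrative only; the trigger in production is the automated Zookeeper health check described above.

```python
class RecycleBin:
    """Cache metadata of nodes the registry claims are offline, so a
    registry-wide glitch doesn't wipe out the consumer's routing table."""
    def __init__(self):
        self.live = {}   # addr -> metadata, currently routable
        self.bin = {}    # addr -> metadata of recently offlined nodes

    def on_online(self, addr, meta):
        self.live[addr] = meta
        self.bin.pop(addr, None)     # a genuine comeback clears the bin entry

    def on_offline(self, addr):
        if addr in self.live:
            self.bin[addr] = self.live.pop(addr)

    def restore_all(self):
        """Invoked when registry health degrades and a burst of offline
        events is judged bogus: put every binned node back into routing."""
        self.live.update(self.bin)
        self.bin.clear()
```

Restored nodes still go through the outlier-removal path, so a node that was genuinely dead gets ejected again quickly instead of poisoning traffic.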

Post‑failure

Experience Knowledge Base

Experience repository

Because RPC component code changes infrequently, we accumulated a repository of incident patterns and resolutions. When a known exception occurs, the platform automatically matches it to the knowledge base, guiding developers toward self‑service remediation.
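At its simplest, the automatic matching step is pattern lookup over the exception text. A Python sketch under that assumption—the patterns, remedies, and function name below are invented examples, not entries from the actual knowledge base:

```python
import re

# Hypothetical knowledge-base entries: (symptom pattern, suggested remedy).
KNOWN_ISSUES = [
    (re.compile(r"connection refused", re.I),
     "Provider down or port blocked; check deployment and firewall."),
    (re.compile(r"session expired", re.I),
     "Zookeeper session lost; client re-registers after reconnect."),
]

def match_known_issue(exception_message):
    """Return the remedy for the first known pattern the message matches,
    or None if this looks like a new, un-catalogued failure."""
    for pattern, remedy in KNOWN_ISSUES:
        if pattern.search(exception_message):
            return remedy
    return None
```

Unmatched exceptions are the interesting ones: they feed back into the post-incident process and become tomorrow's knowledge-base entries.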

Conclusion

Stability and performance are foundational for any RPC framework in a microservice ecosystem. Continuous governance—spanning SLOs, logging, exception dashboards, degradation, rate limiting, outlier handling, thread‑pool isolation, fast‑fail mechanisms, and weak registry dependencies—ensures the framework can meet cloud‑native, cost‑effective demands while delivering a smooth developer experience.
