How Vivo Supercharged Dubbo Routing with Async Caching and Load‑Balancing Optimizations
This article analyzes the CPU‑heavy routing and load‑balancing modules of Dubbo in a large‑scale microservice cluster and identifies O(n) traversal of the provider list as the root cause. It then presents a step‑by‑step redesign (disabling unused routers, caching route results with epoch validation, bitmap‑based intersection, and grouped routing) that reduces average CPU usage by about 27% and doubles TPS when provider instances exceed 2,000.
Overview
Vivo’s Java stack uses Apache Dubbo 2.7.x. In clusters with more than 100 providers, routing and load‑balancing consume up to 30% of CPU, throttling business logic. Flame‑graph profiling identified the hot paths and motivated a redesign of the routing chain and weight calculation.
Dubbo Call Flow
A Dubbo client invokes ClusterInvoker, which obtains the provider list from Directory, passes it through a RouterChain, and finally selects an Invoker via a LoadBalance implementation (random by default). The routing chain follows the chain‑of‑responsibility pattern: each Router filters the list produced by the previous one, and routers are ordered by priority. The core classes are RouterChain, RouterFactory and Router.
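The chained filtering can be sketched in a few lines. This is a minimal illustration, not Dubbo's code: SimpleRouter and the string provider addresses are hypothetical stand‑ins, and the sketch assumes routers run in ascending priority order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.UnaryOperator;

final class RouterChainSketch {

    /** Hypothetical stand-in for a Router: a priority plus a filter over provider addresses. */
    static final class SimpleRouter {
        final int priority;
        final UnaryOperator<List<String>> filter;

        SimpleRouter(int priority, UnaryOperator<List<String>> filter) {
            this.priority = priority;
            this.filter = filter;
        }
    }

    /** Applies every router in ascending priority order (an assumption of this sketch). */
    static List<String> route(List<String> providers, List<SimpleRouter> routers) {
        List<SimpleRouter> ordered = new ArrayList<>(routers);
        ordered.sort(Comparator.comparingInt(r -> r.priority));
        List<String> candidates = providers;
        for (SimpleRouter router : ordered) {
            // Each router only sees what the previous router let through.
            candidates = router.filter.apply(candidates);
        }
        return candidates;
    }
}
```

Because every router receives the output of the previous one, the order in which routers run can change the final candidate list.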
Problem Analysis
Profiling showed that getWeight (used by the random load‑balancer) and the route method of each router dominate CPU. Both traverse the full provider list on every invocation, so each call costs O(n) in the number of providers. When the list grows to thousands of instances, the cumulative cost becomes prohibitive.
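To see where the O(n) cost comes from, consider a simplified weighted‑random selection. This is a hedged sketch rather than Dubbo's RandomLoadBalance: the class, method names, and the weightOf callback are assumptions of this example. The point is that the weight function runs once per provider on every call.

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.ToIntFunction;

final class WeightedRandomSketch {

    /** Selects one provider at random, proportionally to its weight. */
    static <T> T select(List<T> providers, ToIntFunction<T> weightOf) {
        if (providers.isEmpty()) {
            throw new IllegalArgumentException("no providers");
        }
        int[] weights = new int[providers.size()];
        int total = 0;
        // O(n) pass: weightOf is evaluated for every provider on every invocation.
        for (int i = 0; i < weights.length; i++) {
            weights[i] = Math.max(0, weightOf.applyAsInt(providers.get(i)));
            total += weights[i];
        }
        if (total == 0) {
            // All weights are zero: fall back to a uniform choice.
            return providers.get(ThreadLocalRandom.current().nextInt(weights.length));
        }
        return providers.get(pick(weights, ThreadLocalRandom.current().nextInt(total)));
    }

    /** Maps an offset in [0, sum of weights) to the index whose weight range contains it. */
    static int pick(int[] weights, int offset) {
        for (int i = 0; i < weights.length; i++) {
            offset -= weights[i];
            if (offset < 0) {
                return i;
            }
        }
        throw new IllegalArgumentException("offset must be less than the total weight");
    }
}
```

With 5,000 providers, a single call evaluates the weight function 5,000 times, which is why shrinking the candidate list before load balancing pays off.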
Optimization Strategies
Disable Unused Routers – For services that never use application‑level tag routing, turn off the native TagRouter via configuration: dubbo.consumer.router=-tag or the equivalent annotation/XML forms.
Cache Routing Results – Provider lists are stable between deployment windows. Cache the result of each router keyed by the routing criterion (e.g., data‑center or tag). An epoch value attached to the cached list guarantees consistency; if the epoch mismatches, the cache is bypassed and recomputed.
Bitmap Intersection – Cached results are stored as a BitList (a bitmap of provider indices). Intersections between successive routers become simple bitwise AND operations, dramatically speeding up the combination step.
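The epoch check and the bitwise intersection can be illustrated together. The sketch below swaps in java.util.BitSet for Dubbo's BitList, and the Snapshot class and method names are hypothetical; it only shows the idea of reusing a cached bitmap when it was built from the same provider‑list epoch.

```java
import java.util.BitSet;
import java.util.Optional;

final class EpochBitmapSketch {

    /** A bitmap of provider indices, tagged with the epoch of the list it was built from. */
    static final class Snapshot {
        final long epoch;
        final BitSet bits;

        Snapshot(long epoch, BitSet bits) {
            this.epoch = epoch;
            this.bits = bits;
        }
    }

    /**
     * Returns current AND cached when the cached bitmap has the same epoch;
     * returns empty when the cache is missing or stale, so the caller recomputes.
     */
    static Optional<BitSet> intersectIfFresh(Snapshot current, Snapshot cached) {
        if (cached == null || cached.epoch != current.epoch) {
            return Optional.empty();
        }
        BitSet result = (BitSet) current.bits.clone(); // leave the caller's bitmap untouched
        result.and(cached.bits);                       // word-wise AND, no per-provider checks
        return Optional.of(result);
    }

    /** Convenience helper: builds a bitmap with the given provider indices set. */
    static BitSet bitsOf(int... indices) {
        BitSet bits = new BitSet();
        for (int index : indices) {
            bits.set(index);
        }
        return bits;
    }
}
```

The indices are only meaningful for one specific provider list, which is why a mismatched epoch has to bypass the cache instead of intersecting.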
Grouped Routing – After all routers, if the remaining provider count exceeds a threshold, split the list into groupNum virtual groups and randomly select one group, reducing the number of candidates entering the load‑balancer.
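The threshold guard described here could look like the following sketch. The names groupIfLarge, groupAt, and threshold are assumptions of this example. Groups are formed by striding through the list, so each provider belongs to exactly one virtual group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

final class GroupedRoutingSketch {

    /** Returns the virtual group made of indices start, start + groupNum, start + 2 * groupNum, ... */
    static <T> List<T> groupAt(List<T> candidates, int groupNum, int start) {
        List<T> group = new ArrayList<>(candidates.size() / groupNum + 1);
        for (int i = start; i < candidates.size(); i += groupNum) {
            group.add(candidates.get(i));
        }
        return group;
    }

    /** Applies grouping only when the candidate list is larger than the threshold. */
    static <T> List<T> groupIfLarge(List<T> candidates, int threshold, int groupNum) {
        // Small lists (or a meaningless group count) go to the load balancer unchanged.
        if (groupNum <= 1 || candidates.size() <= Math.max(threshold, groupNum)) {
            return candidates;
        }
        int start = ThreadLocalRandom.current().nextInt(groupNum);
        return groupAt(candidates, groupNum, start);
    }
}
```

With 10 candidates and groupNum = 3, the groups are indices {0, 3, 6, 9}, {1, 4, 7}, and {2, 5, 8}, so the load balancer sees at most 4 candidates instead of 10.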
Load‑Balance Weight Optimization – The original getWeight evaluated the registry‑service URL check for every invoker before computing weights and warm‑up. The revised version first checks whether the invoker is a ClusterInvoker and only then inspects the registry URL, so ordinary invokers skip that check.
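For context, a warm‑up calculation typically scales a provider's weight with its uptime so that freshly started instances receive less traffic. The sketch below illustrates that idea with linear scaling; the class, method name, and parameters are assumptions of this example, and it is not presented as Dubbo's exact code.

```java
final class WarmupSketch {

    /**
     * Scales the weight linearly from 1 up to the configured weight over the warm-up window.
     * Times are in milliseconds; warmupMs and weight are assumed to be positive.
     */
    static int warmupWeight(long uptimeMs, long warmupMs, int weight) {
        if (uptimeMs >= warmupMs) {
            return weight; // warm-up finished: use the full weight
        }
        int scaled = (int) (uptimeMs * weight / warmupMs);
        // Never drop below 1, so a new instance still receives some traffic.
        return Math.max(1, Math.min(scaled, weight));
    }
}
```

Halfway through a 10‑minute warm‑up, a provider configured with weight 100 would be treated as weight 50. Running this per provider on every call is part of the per‑invocation cost that the routing optimizations reduce by shrinking the candidate list.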
Implementation Highlights
RouterChain core logic:
public class RouterChain<T> {
    // Full provider list for this service.
    private List<Invoker<T>> invokers = Collections.emptyList();
    // Active routers, applied in order.
    private volatile List<Router> routers = Collections.emptyList();
    private List<Router> builtinRouters = Collections.emptyList();
    public List<Invoker<T>> route(URL url, Invocation invocation) {
        List<Invoker<T>> finalInvokers = invokers;
        // Each router narrows the list produced by the previous one.
        for (Router router : routers) {
            finalInvokers = router.route(finalInvokers, url, invocation);
        }
        return finalInvokers;
    }
}
Cache‑aware routing (simplified):
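@SuppressWarnings("unchecked")
private <T> BitList<Invoker<T>> getNearestInvokersWithCache(BitList<Invoker<T>> invokers) {
    // Look up the cached routing result keyed by the LOC system property.
    ValueWrapper vw = getCache(getSystemProperty(LOC));
    // Reuse the cache only if it was built from the same provider-list epoch.
    if (vw != null && invokers.isSameEpoch((BitList<Invoker<T>>) vw.get())) {
        BitList<Invoker<T>> tmp = invokers.clone();
        return tmp.and((BitList<Invoker<T>>) vw.get());
    }
    // Cache miss or stale epoch: recompute from the current list.
    return getNearestInvokers(invokers);
}
Grouped routing logic: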
public <T> List<Invoker<T>> doGroup(List<Invoker<T>> invokers, int groupNum) {
    // Approximate group size, used only as the initial capacity.
    int listLength = invokers.size() / groupNum;
    List<Invoker<T>> result = new ArrayList<>(listLength);
    // Pick one of the groupNum virtual groups at random.
    int random = ThreadLocalRandom.current().nextInt(groupNum);
    // Group k holds the invokers at indices k, k + groupNum, k + 2 * groupNum, ...
    for (int i = random; i < invokers.size(); i += groupNum) {
        result.add(invokers.get(i));
    }
    return result;
}
Weight‑calculation optimization (original vs. revised):
// Original: the registry-service URL check runs for every invoker.
if (UrlUtils.isRegistryService(url)) {
    weight = url.getParameter(REGISTRY_KEY + "." + WEIGHT_KEY, DEFAULT_WEIGHT);
}
// Revised: the instanceof test short-circuits the URL check for ordinary invokers.
if (invoker instanceof ClusterInvoker && UrlUtils.isRegistryService(url)) {
    weight = url.getParameter(REGISTRY_KEY + "." + WEIGHT_KEY, DEFAULT_WEIGHT);
}
Performance Results
Two demo projects were benchmarked: (1) no routing configuration, and (2) routing with nearest‑plus‑tag enabled. Provider counts of 100, 500, 1,000, 2,000 and 5,000 were tested at ~1,000 TPS. After applying the optimizations, TPS increased by more than 100% when providers exceeded 2,000, while average CPU dropped by ~27%. The CPU share of routing and load‑balancing fell from >30% to <10%.
Conclusion & Future Work
Disabling unnecessary routers, introducing epoch‑validated bitmap caching, and adding grouped routing reduced Dubbo’s routing‑related CPU consumption dramatically and doubled throughput in large clusters. Remaining issues include the inherent cost of the random load‑balancer and VM/container over‑commitment. Future plans involve adopting Dubbo 3.2’s adaptive load‑balancer and a CPU‑aware custom balancer to further smooth resource utilization.
References: Dubbo load‑balancing documentation, Dubbo traffic control, and the “Dubbo 3 StateRouter” article.