
Optimizing Dubbo Routing and Load Balancing at Scale: Vivo's Practice

Vivo tackled high CPU overhead in large-scale Dubbo deployments by disabling unused routers, caching routing results with BitMap intersections and epoch validation, optimizing weight calculations, and adding a grouping router. Together these changes delivered TPS gains of over 100% with more than 20,000 providers and cut average CPU usage by roughly 27%.

vivo Internet Technology

This article presents vivo's optimization practices for Apache Dubbo's routing module and load balancing in large-scale microservice deployments. It covers the technical challenges encountered when running Dubbo (version 2.7.x) in production with hundreds of service providers, where flame graph analysis showed that CPU consumption in routing and load balancing reached 30%.

Background and Problem Analysis:

The article explains Dubbo's client invocation flow: the client calls ClusterInvoker through a local proxy, ClusterInvoker retrieves the service list from Directory, the router chain filters that list, and load balancing selects one invoker for the RPC call. The routing mechanism uses the chain-of-responsibility pattern and supports multiple routing strategies (nearest-router, tag-router, conditional router). Load balancing defaults to random selection but includes a weight calculation for warm-up purposes.
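The flow described above can be modeled in a few lines of plain Java. This is a simplified sketch, not Dubbo's actual code: `Provider`, `SimpleRouter`, `SimpleRouterChain`, and `InvocationFlowSketch` are hypothetical names introduced only for illustration, and the two filters stand in for the nearest-router and tag-router.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;

/** Hypothetical stand-in for a Dubbo invoker; not the real Invoker type. */
final class Provider {
    final String address;
    final String datacenter;
    final String tag;

    Provider(String address, String datacenter, String tag) {
        this.address = address;
        this.datacenter = datacenter;
        this.tag = tag;
    }
}

/** Hypothetical router: each implementation filters the list in O(n). */
interface SimpleRouter {
    List<Provider> route(List<Provider> providers);
}

final class SimpleRouterChain {
    private final List<SimpleRouter> routers = new ArrayList<>();

    SimpleRouterChain add(SimpleRouter router) {
        routers.add(router);
        return this;
    }

    /** Passes the list through every router in order (chain of responsibility). */
    List<Provider> route(List<Provider> providers) {
        List<Provider> current = providers;
        for (SimpleRouter router : routers) {
            current = router.route(current);
        }
        return current;
    }
}

final class InvocationFlowSketch {
    /** Directory list -> router chain -> random load balancing. */
    static Provider select(List<Provider> directory, SimpleRouterChain chain, Random random) {
        List<Provider> routed = chain.route(directory);
        if (routed.isEmpty()) {
            throw new IllegalStateException("no provider available after routing");
        }
        return routed.get(random.nextInt(routed.size()));
    }

    static SimpleRouter byDatacenter(String datacenter) {
        return list -> list.stream()
                .filter(p -> p.datacenter.equals(datacenter))
                .collect(Collectors.toList());
    }

    static SimpleRouter byTag(String tag) {
        return list -> list.stream()
                .filter(p -> p.tag.equals(tag))
                .collect(Collectors.toList());
    }
}
```

Because every router scans the whole list on every call, the cost per invocation grows with both the number of routers and the number of providers.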

Performance analysis revealed O(n) time complexity in both the getWeight() method used by load balancing and the route() method of each router, which causes significant CPU overhead once the provider count exceeds 100.
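To make the O(n) cost concrete, here is a minimal sketch of warm-up weighting followed by weighted random selection. The warm-up formula (weight scaled linearly with uptime and clamped to the range 1 to weight) follows my understanding of Dubbo 2.7.x; the class and method names are hypothetical and the code is not copied from Dubbo.

```java
import java.util.Random;

final class WarmupWeightSketch {
    /** Scales the configured weight linearly with uptime during warm-up. */
    static int warmupWeight(long uptimeMs, long warmupMs, int weight) {
        if (weight <= 0) {
            return 0;
        }
        if (warmupMs <= 0 || uptimeMs >= warmupMs) {
            return weight; // warm-up finished or disabled
        }
        int scaled = (int) (uptimeMs / ((float) warmupMs / weight));
        return scaled < 1 ? 1 : Math.min(scaled, weight);
    }

    /** Weighted random pick; both loops touch every provider, hence O(n) per call. */
    static int selectIndex(int[] weights, Random random) {
        int total = 0;
        for (int w : weights) {
            total += w;
        }
        if (total <= 0) {
            return random.nextInt(weights.length);
        }
        int offset = random.nextInt(total);
        for (int i = 0; i < weights.length; i++) {
            offset -= weights[i];
            if (offset < 0) {
                return i;
            }
        }
        return weights.length - 1; // not reached when total > 0
    }
}
```

A provider halfway through a 10-minute warm-up with weight 100 gets an effective weight of 50 in this sketch. The weight of every candidate has to be computed before a single one is picked, which is why shrinking the candidate list pays off.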

Optimization Solutions:

1. Router Optimization:

Disable unused routers (e.g., -tag to disable native application-level tag router)

Pre-calculate and cache routing results - cache full provider lists by datacenter for nearest-router, cache tag-based results for tag-router

Use BitMap for efficient intersection operations between routing results

Implement epoch-based cache invalidation to ensure consistency
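The caching, BitMap intersection, and epoch check listed above can be sketched with `java.util.BitSet`. This is an illustration under stated assumptions, not vivo's BitList: `SimpleBitList` and `CachedRouterSketch` are hypothetical, providers are plain strings, and the epoch is assumed to be a number that changes whenever the provider list is refreshed.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical, simplified BitList: one provider snapshot plus a BitSet of selected indexes. */
final class SimpleBitList {
    private final List<String> snapshot; // full provider list of one epoch
    private final long epoch;            // changes whenever the provider list is refreshed
    private final BitSet bits;           // bit i set => snapshot.get(i) is selected

    private SimpleBitList(List<String> snapshot, long epoch, BitSet bits) {
        this.snapshot = snapshot;
        this.epoch = epoch;
        this.bits = bits;
    }

    /** Selects every provider of the snapshot. */
    static SimpleBitList full(List<String> snapshot, long epoch) {
        BitSet all = new BitSet(snapshot.size());
        all.set(0, snapshot.size());
        return new SimpleBitList(snapshot, epoch, all);
    }

    boolean isSameEpoch(SimpleBitList other) {
        return other != null && epoch == other.epoch;
    }

    /** Intersection as a new list; neither operand is modified. */
    SimpleBitList and(SimpleBitList other) {
        BitSet copy = (BitSet) bits.clone();
        copy.and(other.bits);
        return new SimpleBitList(snapshot, epoch, copy);
    }

    /** O(n) scan over the whole snapshot, ignoring the current selection. */
    SimpleBitList filterSnapshot(Predicate<String> keep) {
        BitSet matched = new BitSet(snapshot.size());
        for (int i = 0; i < snapshot.size(); i++) {
            if (keep.test(snapshot.get(i))) {
                matched.set(i);
            }
        }
        return new SimpleBitList(snapshot, epoch, matched);
    }

    List<String> toList() {
        List<String> out = new ArrayList<>();
        for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
            out.add(snapshot.get(i));
        }
        return out;
    }
}

/** Hypothetical router that computes its result once per epoch and reuses it. */
final class CachedRouterSketch {
    private final Predicate<String> rule;
    private volatile SimpleBitList cached;
    int fullScans; // counts cache misses; for illustration only, not thread-safe

    CachedRouterSketch(Predicate<String> rule) {
        this.rule = rule;
    }

    SimpleBitList route(SimpleBitList invokers) {
        SimpleBitList cache = cached;
        if (cache == null || !invokers.isSameEpoch(cache)) {
            cache = invokers.filterSnapshot(rule); // miss: rescan the full snapshot
            cached = cache;
            fullScans++;
        }
        return invokers.and(cache); // hit: word-wise AND instead of a per-provider scan
    }
}
```

The cached result is computed over the full snapshot rather than over whatever an upstream router passed in, so it stays valid for any input of the same epoch. A stale cache is never intersected with a newer list because the epoch comparison fails first.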

2. Load Balancing Optimization:

Optimize getWeight() method by adding type checking before registry service weight lookup

Add grouping router as the final step in routing chain to reduce nodes entering load balancing - randomly select one group to proceed
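One way to picture the grouping router is to split the routed list into fixed-size groups and hand a single, randomly chosen group to load balancing. The sketch below assumes contiguous groups and a caller-supplied `groupSize`; both are assumptions, since the summary does not say how vivo forms its groups, and `GroupingRouterSketch` is a hypothetical name.

```java
import java.util.List;
import java.util.Random;

final class GroupingRouterSketch {
    /** Returns one randomly chosen group of at most groupSize invokers. */
    static <T> List<T> pickGroup(List<T> invokers, int groupSize, Random random) {
        if (groupSize <= 0) {
            throw new IllegalArgumentException("groupSize must be positive");
        }
        if (invokers.size() <= groupSize) {
            return invokers; // small lists are passed through unchanged
        }
        int groupCount = (invokers.size() + groupSize - 1) / groupSize; // ceiling division
        int group = random.nextInt(groupCount);
        int from = group * groupSize;
        int to = Math.min(from + groupSize, invokers.size());
        return invokers.subList(from, to); // a view, no copying
    }
}
```

With this step in place, load balancing computes weights for at most `groupSize` invokers instead of the whole list. In this simple version the last group can be smaller than the others, so its members receive slightly more traffic; a production version would need to account for that.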

Key Code Implementations:

The article provides source code for RouterChain, RouterFactory, Router interface, and concrete implementations including nearest-router with caching logic using BitList and epoch validation.

public <T> List<Invoker<T>> route(List<Invoker<T>> invokers, URL consumerUrl, Invocation invocation) throws RpcException {
    // The router chain passes invokers as a BitList so results can be intersected cheaply
    BitList<Invoker<T>> bitList = (BitList<Invoker<T>>) invokers;
    // Try the cached nearest-router result first
    BitList<Invoker<T>> result = getNearestInvokersWithCache(bitList);
    // ... fallback logic
}

private <T> BitList<Invoker<T>> getNearestInvokersWithCache(BitList<Invoker<T>> invokers) {
    // Cached result for the current location (LOC system property)
    ValueWrapper valueWrapper = getCache(getSystemProperty(LOC));
    if (valueWrapper != null) {
        BitList<Invoker<T>> invokerBitList = (BitList<Invoker<T>>) valueWrapper.get();
        // Reuse the cache only if it was built from the same provider list epoch
        if (invokers.isSameEpoch(invokerBitList)) {
            BitList<Invoker<T>> tmp = invokers.clone();
            return tmp.and(invokerBitList); // Intersection using BitMap
        }
    }
    // Cache miss or stale epoch: compute the nearest invokers directly
    return getNearestInvokers(invokers);
}

Performance Results:

Testing with 100 to 50,000 provider nodes and ~1000 TPS showed significant improvements: when provider count exceeds 20,000, TPS improvement reached over 100%, average CPU usage decreased by approximately 27%, and routing/load balancing CPU proportion was significantly reduced. The optimization effect becomes more pronounced as provider count increases.
