How ICBC Scaled Dubbo Service Discovery for 20,000+ Services
This article details ICBC's migration to Dubbo microservices, the performance and high-availability challenges of managing more than 20,000 services with Zookeeper, and the concrete optimizations (delayed subscription updates, a multiple-registry mode, and per-node registration) that enabled stable, large-scale service discovery.
Background and Overview
The Industrial and Commercial Bank of China (ICBC) began moving from a monolithic JEE architecture to a Dubbo-based microservice platform in 2014. After years of rollout, the platform now hosts more than 20,000 Dubbo service interfaces and over 700,000 provider entries, and it has become a core component of the bank's open-platform banking system.
Key Challenges
Performance & Capacity – Online services exceed 20,000, with each registry holding over 700,000 provider nodes. Future growth targets 100,000 services and 5 million provider entries per registry.
High Availability – No single node failure may affect 24/7 transaction processing. Version upgrades and registry updates must be transparent to the business.
Service Discovery Basics in Dubbo
Dubbo follows a standard pattern: providers register themselves, consumers subscribe and obtain a full provider list, and RPC calls are made point‑to‑point without passing through the registry. ICBC chose Zookeeper as the registry in 2014 because of its proven scalability and CP‑style strong consistency.
Within Zookeeper, each Dubbo service has its own node with four category children: providers (ephemeral entries, one per provider), consumers (ephemeral entries, one per consumer), configurators (persistent entries holding service parameters), and routers (persistent entries holding routing rules).
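For orientation, the layout can be sketched as path construction. This is a minimal Python model, not Dubbo code: it assumes Dubbo's default ZooKeeper root /dubbo (Dubbo names the parameter category configurators) and approximates Java's URLEncoder with Python's urllib.parse.quote, which differ on a few characters such as spaces. The service name and provider URL are hypothetical.

```python
from urllib.parse import quote

ROOT = "/dubbo"  # Dubbo's default ZooKeeper root (registry group "dubbo")
CATEGORIES = ("providers", "consumers", "configurators", "routers")


def category_path(interface: str, category: str) -> str:
    """Path of one category node under a service, e.g. /dubbo/<interface>/providers."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    return f"{ROOT}/{interface}/{category}"


def entry_path(interface: str, category: str, url: str) -> str:
    """Path of a single provider/consumer/rule entry.

    The full URL is percent-encoded so that it forms one znode name
    (approximating Java's URLEncoder with urllib.parse.quote).
    """
    return f"{category_path(interface, category)}/{quote(url, safe='')}"


if __name__ == "__main__":
    provider_url = "dubbo://10.0.0.1:20880/com.example.DemoService?side=provider"
    print(entry_path("com.example.DemoService", "providers", provider_url))
```

Encoding the whole URL into one path segment is what lets a consumer rebuild each provider's address and parameters from a single getChildren call on the providers node.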
Problems Observed with Zookeeper at Scale
Massive data push: each provider registration fires a watch event, and every subscribed consumer then re-reads the full provider list. When a service with 100 providers starts, each consumer therefore reads the list 100 times, fetching 1 + 2 + ... + 100 = 5,050 provider entries in total. During peak deployment windows this saturates network bandwidth and degrades registration performance.
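The 5,050 figure is the triangular number N(N + 1) / 2 for N = 100. A minimal Python model, assuming providers register strictly one at a time and the consumer finishes each re-read before the next event arrives:

```python
def entries_read_without_batching(num_providers: int) -> int:
    """Provider entries one consumer reads when providers register one by one.

    Each registration fires a child-change watch; the consumer then re-reads
    the whole providers list, which at that moment holds k entries.
    """
    total = 0
    for k in range(1, num_providers + 1):
        total += k  # the k-th re-read returns k entries
    return total


if __name__ == "__main__":
    n = 100
    print(entries_read_without_batching(n), n * (n + 1) // 2)
```

The cost grows quadratically with the provider count, and every consumer subscribed to the service pays it.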
Large snapshot files: the growing number of znodes inflates Zookeeper's snapshot size, leading to high disk I/O and longer recovery times after failures.
Observer-node resync delays: after a leader election, Observer nodes must sync the full transaction log from the new leader. If the sync is slow, client sessions connected to those Observers may time out; their ephemeral provider nodes then disappear temporarily, and the subsequent re-registrations arrive as a storm.
Optimization Measures
1) Delayed Subscription Updates
ICBC modified the zkclient library to introduce a short delay after a child-change (NodeChildrenChanged) event before fetching the provider list. This batches rapid changes and reduces the number of read operations.
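The source does not show ICBC's zkclient patch, so the following is only a sketch of the batching idea, simulated in Python with integer millisecond timestamps. It assumes the first child-change event schedules a single re-read delay_ms later and that events arriving before that re-read fires are absorbed by it; simulate_fetches and entries_read are hypothetical helpers.

```python
from typing import List, Optional


def simulate_fetches(event_times_ms: List[int], delay_ms: int) -> List[int]:
    """Times (ms) at which a consumer re-reads the provider list.

    A child-change event schedules one fetch delay_ms later; any further
    events that arrive before that fetch fires are coalesced into it.
    """
    fetches: List[int] = []
    pending: Optional[int] = None  # time of the scheduled, not-yet-run fetch
    for t in sorted(event_times_ms):
        if pending is not None and t < pending:
            continue  # absorbed: the pending fetch will see this change too
        if pending is not None:
            fetches.append(pending)  # the earlier fetch has already run
        pending = t + delay_ms
    if pending is not None:
        fetches.append(pending)
    return fetches


def entries_read(event_times_ms: List[int], fetches: List[int]) -> int:
    """Total provider entries read, assuming one provider registers per event."""
    return sum(sum(1 for e in event_times_ms if e <= f) for f in fetches)


if __name__ == "__main__":
    # Hypothetical rollout: 100 providers register 10 ms apart.
    events = list(range(0, 1000, 10))
    for delay in (0, 1000):
        fetches = simulate_fetches(events, delay)
        print(delay, len(fetches), entries_read(events, fetches))
```

With no delay, the model reproduces the 100 re-reads and 5,050 entries; with a 1,000 ms delay, the whole burst collapses into one read of 100 entries. The trade-off is latency: a larger delay coalesces more events but postpones the moment consumers see new or removed providers.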
Benchmark results: before the change, each consumer received about 4.22 million provider entries during a large rollout; with a 1-second delay, the volume dropped to about 260,000, roughly 6% of the original traffic.
2) Multiple-Registry Mode (the "multiple" registry)
ICBC adopted and enhanced Dubbo's multiple-registry SPI extension. Instead of handling each registry independently, the client merges provider data from all registries before updating its cache. As a result, a failure in one registry no longer leaves consumers with missing providers, and load is balanced across the registries.
Additional benefit: because each service needs one merged reference instead of one per registry, a two-registry deployment halves the number of Reference objects in the JVM, saving memory.
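The merge step can be sketched as a consumer-side cache that keeps the latest provider list reported by each registry and serves their union. This is a hypothetical Python model; MergedProviderView, on_notify, and the registry IDs are illustrative names, not Dubbo's actual multiple-registry API.

```python
from typing import Dict, Set


class MergedProviderView:
    """Consumer-side provider cache merged across several registries."""

    def __init__(self) -> None:
        self._latest: Dict[str, Set[str]] = {}  # registry id -> last reported list
        self.providers: Set[str] = set()         # merged list used for routing/LB

    def on_notify(self, registry_id: str, providers: Set[str]) -> None:
        """Record one registry's latest provider list, then rebuild the union."""
        self._latest[registry_id] = set(providers)
        merged: Set[str] = set()
        for reported in self._latest.values():
            merged |= reported
        self.providers = merged


if __name__ == "__main__":
    view = MergedProviderView()
    both = {"10.0.0.1:20880", "10.0.0.2:20880"}
    view.on_notify("zk-a", both)
    view.on_notify("zk-b", both)
    # zk-a loses its sessions and reports an empty list; zk-b still sees both.
    view.on_notify("zk-a", set())
    print(sorted(view.providers))
```

A provider leaves the merged list only when no registry reports it any longer, which is the property that keeps one failed registry from emptying the consumer's view.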
3) Per‑Node Registration Model
ICBC back-ported the service-discovery design of Dubbo 2.7 and Dubbo 3.0 to build a “per-node registration” model that separates configuration, metadata, and registration concerns:
Configuration Center: stores dynamic node-level parameters plus the persistent configurators and routers data.
Metadata Center: holds service-to-node mappings and method signatures.
Registration Center: stores only the mapping from each node name to its IP/port.
This redesign does not change the consumer call flow: a consumer resolves the service to its nodes via the metadata center, then looks up each node's actual endpoint in the registration center.
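The two-step lookup can be sketched with plain dictionaries standing in for the two centers. The service names, node names, and addresses are hypothetical, and the sketch assumes that a node missing from the registration center (for example, one that is down) is simply skipped.

```python
from typing import Dict, List


def resolve(
    service: str,
    metadata: Dict[str, List[str]],  # metadata center: service -> node names
    registry: Dict[str, str],        # registration center: node name -> "ip:port"
) -> List[str]:
    """Return the endpoints currently serving `service`."""
    # Step 1: which nodes expose this service?
    nodes = metadata.get(service, [])
    # Step 2: only nodes that are registered (alive) yield an endpoint.
    return [registry[node] for node in nodes if node in registry]


if __name__ == "__main__":
    metadata = {
        "com.example.OrderService": ["order-app-1", "order-app-2"],
        "com.example.UserService": ["user-app-1"],
    }
    registry = {
        "order-app-1": "10.0.1.11:20880",
        "user-app-1": "10.0.2.21:20880",
        # order-app-2 is down, so it has no entry here
    }
    print(resolve("com.example.OrderService", metadata, registry))
```

Because the registration center holds only liveness and addresses, a node coming up or going down changes one entry, regardless of how many services that node exposes.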
Load tests showed that the data stored in the registry shrank to 1.68% of its original size, comfortably supporting 100,000 services and 100,000 nodes.
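One reason the registry shrinks can be shown by counting entries. Under per-service registration, a node exposing S services writes S provider entries, each a full URL; under per-node registration it writes a single name-to-address entry, with the service list moved to the metadata center. The deployment below is a hypothetical toy, not ICBC's data, and entry counts likely understate the byte savings because per-service URLs also carry parameters.

```python
from typing import Dict, List


def registry_entries(deployments: Dict[str, List[str]]) -> Dict[str, int]:
    """Count registry entries for a node -> exposed-services map under both models."""
    return {
        # one provider URL per (service, node) pair
        "per_service": sum(len(services) for services in deployments.values()),
        # one name -> ip:port mapping per node
        "per_node": len(deployments),
    }


if __name__ == "__main__":
    # Hypothetical: 3 nodes, each exposing the same 4 services.
    services = [f"com.example.Service{i}" for i in range(4)]
    deployments = {f"app-node-{n}": services for n in range(3)}
    print(registry_entries(deployments))
```

Under the per-node model, registry size grows with the number of nodes rather than with services multiplied by nodes.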
Future Plans
ICBC aims to contribute its enhancements back to the open-source community, including refined RPC result handling, multi-protocol support, and registration circuit-breaker mechanisms. The bank is also exploring a migration path from Dubbo to service-mesh solutions (e.g., Istio, MCP) to ease SDK version upgrades and other challenges of the mesh era.
ICBC invites other Dubbo practitioners to share large‑scale experiences and jointly improve enterprise adoption.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Alibaba Cloud Native
We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.
