Bilibili's Exploration and Practice of Microservice Governance

This article presents Bilibili's work on microservice governance: the challenges of splitting services and managing them at scale, the design and evolution of Discovery (its Go-based service discovery framework), its load-balancing algorithms, adaptive rate limiting, circuit-breaking strategies, and future directions for resilient backend systems.

Qunar Tech Salon

Microservice architectures face two major pain points: how to split services and define boundaries, and how to manage them at scale, where small issues can be amplified into cascading failures.

Bilibili initially used ZooKeeper, a CP system, for service discovery, but ran into cross-datacenter registration failures and performance bottlenecks. In 2018 the team built Discovery, an AP-oriented framework written in Go. Providers register and renew their leases over HTTP, consumers learn of changes through HTTP long-polling, health checks run asynchronously, and a self-protection mechanism keeps the registry stable during network partitions.

The load-balancing strategy evolved in three steps. It started as a simple Weighted Round-Robin (WRR 1.0), moved to a dynamic WRR that adjusts weights based on real-time CPU usage, and arrived at a 3.0 algorithm that combines exponentially weighted moving averages (EWMA), a "best of two random choices" pick, and inflight-based throttling to quickly discard unhealthy nodes and reduce latency.

For traffic protection, Bilibili first used token-bucket rate limiting and a Hystrix-style circuit breaker, then replaced the breaker with an elastic one, inspired by Google's SRE practice, that adapts to the observed success rate. The team also introduced a BBR-based adaptive limiter that uses CPU and IOPS as signals, and a CoDel-style LIFO queue that drops requests that have waited too long under high load.
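The elastic breaker can be sketched from the client-side throttling formula in Google's SRE book, where the probability of rejecting a request locally is max(0, (requests - K * accepts) / (requests + 1)). The `Breaker` type and its method names below are hypothetical, and the sketch uses lifetime counters where a real implementation would use a sliding time window.

```go
package main

import "math"

// Breaker is a hypothetical adaptive breaker based on the SRE throttling formula.
type Breaker struct {
	K        float64 // multiplier; the SRE book suggests 2 as a starting point
	requests float64 // all attempts, including locally rejected ones
	accepts  float64 // attempts the backend accepted
}

// RejectProbability returns max(0, (requests - K*accepts) / (requests + 1)).
func (b *Breaker) RejectProbability() float64 {
	return math.Max(0, (b.requests-b.K*b.accepts)/(b.requests+1))
}

// Allow decides whether to send a request. r must be uniform in [0, 1);
// it is passed in so the decision is easy to test.
func (b *Breaker) Allow(r float64) bool {
	if r < b.RejectProbability() {
		b.requests++ // a local rejection still counts as an attempt
		return false
	}
	return true
}

// Report records the outcome of a request that was actually sent.
func (b *Breaker) Report(success bool) {
	b.requests++
	if success {
		b.accepts++
	}
}
```

Unlike a Hystrix-style breaker that flips between open and closed, this one sheds a growing fraction of traffic as the success rate falls and recovers gradually as accepted requests return. In practice a caller would pass `rand.Float64()` to `Allow`.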

Looking ahead, the team plans to automate multi‑datacenter traffic scheduling, add Merkle‑tree and gossip support to Discovery, pre‑warm RPC load balancers, develop a globally‑aware distributed rate‑limiting solution, and implement priority queues for RPC requests.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
