How Bilibili Scaled Live Streaming from LAMP to a Multi‑Million‑User Microservice Platform
This article chronicles Bilibili Live's eight‑year journey from a simple LAMP monolith to a sophisticated microservice architecture, detailing the migration to Swoole‑based services, Docker containerization, Golang adoption, hot‑key handling, request amplification mitigation, and the creation of a high‑performance Envoy‑based gateway.
Bilibili Live launched in 2014 on a modest LAMP stack (Linux, Apache, MySQL, PHP) and grew into a complex microservice ecosystem serving millions of concurrent users. The early "live‑app‑web" project combined front‑end pages, JS, and a PHP message‑queue worker in a single codebase. It was initially deployed on physical servers and grew from 50k to 130k lines of code within two years.
From Monolith to Microservices
Rapid growth exposed problems such as merge conflicts, deployment bottlenecks, and single points of failure. A critical incident in 2016 prompted a shift to microservices. The team chose Swoole, a high‑performance PHP coroutine framework, as the foundation and defined four guiding principles:
Split services by business domain.
Give each service its own database and cache.
Enforce RPC‑only communication between services.
Assign service stability responsibility to the service owner.
The resulting framework provided process management, graceful restarts, ORM, caching, and logging, while a custom TCP‑based RPC protocol called liverpc handled inter‑service calls.
Service Discovery and Configuration
Zookeeper was adopted for service discovery, and a companion program named Apollo managed configuration pulling, registration, and health checks, enabling hot‑reloading of service settings.
Unified Gateway
A dedicated gateway service live‑api (built on Swoole) introduced traffic forwarding, URL rewriting, timeout control, rate limiting, caching, and degradation capabilities, centralizing external access.
Containerization
To overcome physical‑machine limitations, all services were Dockerized. The team evaluated CPU scheduling options, ultimately selecting CPUSET (CPU pinning) for PHP services after observing severe timeouts with the default CFS scheduler. They introduced two resource pools—fixed (CPUSET) and elastic (shared)—and used the gateway to route burst traffic to the elastic pool.
Transition to Golang
By 2018, PHP's process model and scaling limits had become a bottleneck. Golang's goroutine model addressed blocked workers, limited RPC concurrency, and connection‑pool pressure. New services were classified as business gateways, business services, and background jobs. A Golang‑based gateway aggregated multiple downstream calls, halving latency compared to the PHP version.
New Envoy‑Based Gateway (Ekango)
After evaluating Kong, Tyk, and Envoy, the team built a custom control plane in Golang with Envoy as the data plane and named the result Ekango. Ekango added distributed rate limiting, request rewriting, degradation, and unified authentication, and achieved more than 150k QPS per instance.
Hot‑Key Management
Hot keys (e.g., popular rooms, articles, comments) caused cache pressure and potential snowball failures. Solutions evolved from PHP‑side monitor services pushing hot room IDs to Redis, to a Golang SDK that lets services query hot‑room status, to a generic SDK using sliding‑window LFU and HeavyKeeper algorithms for top‑K detection. Additional strategies included proxy‑layer client‑side caching, Redis client‑side hot‑key caches, and write‑aggregation SDKs to batch increment operations.
Request Amplification Mitigation
Analysis revealed excessive data requests and duplicate calls across services when users entered a room. The team introduced field‑mask‑style API segmentation, TAG‑based request gating, and adjusted call flows to reduce unnecessary traffic, cutting request amplification by an order of magnitude.
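Field‑mask‑style segmentation means the caller names the sections it needs and the server skips the rest, so no downstream call is made for data nobody asked for. The sketch below assumes a bitmask encoding. The Field constants and BuildRoomView are hypothetical, not the actual API.

```go
package main

// Field identifies one independently loadable section of a room response.
type Field uint32

const (
	FieldRoomInfo Field = 1 << iota
	FieldAnchorInfo
	FieldGiftList
)

// BuildRoomView invokes only the loaders whose field is set in mask,
// so unrequested sections cost no downstream call.
func BuildRoomView(roomID int64, mask Field, loaders map[Field]func(int64) string) map[Field]string {
	out := make(map[Field]string)
	for field, load := range loaders {
		if mask&field != 0 {
			out[field] = load(roomID)
		}
	}
	return out
}
```

A client rendering only the room header would pass FieldRoomInfo|FieldAnchorInfo, and the gift‑list loader would never run.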
Activity Assurance Framework
For large‑scale events, a systematic process covering scenario mapping, capacity planning, load testing, degradation plans, and on‑site incident response was established. A real‑time assurance platform aggregates alerts, links them to SOPs, and generates post‑event reports.
Future Outlook
The architecture continues to evolve toward higher availability, multi‑active deployments, and further service mesh integration, aiming to support ever‑larger live‑stream audiences.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
