How Bilibili Scaled Live Streaming from LAMP to Microservices and Beyond
This article chronicles Bilibili Live's eight‑year journey from a simple LAMP monolith to a sophisticated micro‑service ecosystem, detailing the motivations, architectural decisions, containerization, Golang migration, gateway redesign, hot‑key handling, request amplification mitigation, and operational practices that enabled millions of concurrent viewers.
Architecture Evolution of Bilibili Live
Bilibili Live started in 2014 with a classic LAMP stack (Linux, Apache, MySQL, PHP) in which all front‑end code, back‑end code and scheduled tasks lived in a single project, live‑app‑web. The monolith was organized into three logical layers: business logic, push/pull streaming, and long‑connection services.
Micro‑service Transformation (2016‑2017)
After a high‑profile outage, the team refactored the system into micro‑services using the high‑performance PHP framework Swoole. The core principles were:
Domain‑driven service decomposition.
Each service owns its own database and cache.
Inter‑service communication limited to RPC.
Service owners are accountable for stability.
A custom micro‑service framework provided process management, graceful restarts, ORM, caching and logging. Services communicated over liverpc, a simple TCP‑based RPC protocol (fixed‑length header + variable‑length JSON body). Service discovery and configuration were handled by ZooKeeper together with an internal tool, Apollo. Kafka acted as the message backbone, with dedicated publisher and notify services. A dedicated gateway, live‑api (built on Swoole), performed traffic forwarding, URL rewriting, timeout control, rate‑limiting, caching and degradation.
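The framing idea behind a "fixed‑length header + variable‑length JSON body" protocol can be sketched as follows. This is a minimal illustration only: the article does not describe the real liverpc header, so the 4‑byte big‑endian length header, the Request type, the method name and the 1 MiB size cap are all assumptions made for the example.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"encoding/json"
	"fmt"
	"io"
)

// Assumed values: the real liverpc header layout is not given in the article.
const (
	headerSize = 4       // hypothetical header: body length only
	maxBody    = 1 << 20 // hypothetical cap to reject oversized frames
)

// Request is a hypothetical JSON body used only for this sketch.
type Request struct {
	Method string            `json:"method"`
	Params map[string]string `json:"params"`
}

// WriteFrame marshals v to JSON and writes the length header followed by the body.
func WriteFrame(w io.Writer, v interface{}) error {
	body, err := json.Marshal(v)
	if err != nil {
		return err
	}
	var hdr [headerSize]byte
	binary.BigEndian.PutUint32(hdr[:], uint32(len(body)))
	if _, err := w.Write(hdr[:]); err != nil {
		return err
	}
	_, err = w.Write(body)
	return err
}

// ReadFrame reads exactly one frame and unmarshals its JSON body into v.
func ReadFrame(r io.Reader, v interface{}) error {
	var hdr [headerSize]byte
	if _, err := io.ReadFull(r, hdr[:]); err != nil {
		return err
	}
	n := binary.BigEndian.Uint32(hdr[:])
	if n > maxBody {
		return fmt.Errorf("frame too large: %d bytes", n)
	}
	body := make([]byte, n)
	if _, err := io.ReadFull(r, body); err != nil {
		return err
	}
	return json.Unmarshal(body, v)
}

// roundTrip encodes req into an in-memory buffer and decodes it again.
func roundTrip(req Request) (Request, error) {
	var buf bytes.Buffer
	var got Request
	if err := WriteFrame(&buf, req); err != nil {
		return got, err
	}
	err := ReadFrame(&buf, &got)
	return got, err
}

func main() {
	got, err := roundTrip(Request{Method: "Room.GetInfo", Params: map[string]string{"room_id": "42"}})
	if err != nil {
		panic(err)
	}
	fmt.Println(got.Method, got.Params["room_id"])
}
```

Because the header has a fixed size, the reader always knows how many bytes to consume before parsing JSON, which is what makes such a protocol easy to implement over a raw TCP stream.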
Containerization
All services were Dockerized to eliminate port conflicts, resource contention and scaling uncertainty. Two CPU scheduling modes were evaluated:
CFS – fair sharing, but caused severe time‑outs for PHP workers.
CPUSET – CPU pinning; selected for PHP services. Load tests showed the optimal worker count to be 3–4× the allocated CPU cores.
To handle traffic bursts, resources were split into a fixed pool (CPUSET) and an elastic pool. The gateway routes QPS beyond the fixed pool's capacity to the elastic pool, and requests are tagged so that calls made from the elastic pool prefer service instances in that same pool.
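The overflow rule can be illustrated with a small sketch. It assumes a per‑second request budget for the fixed pool and a plain counter; the Router type, the pool names and the budget value are hypothetical, and the article does not say how the real gateway measures excess QPS.

```go
package main

import (
	"fmt"
	"sync"
)

// Router sends requests to the fixed pool until its per-second budget is
// used up, then overflows the rest to the elastic pool.
type Router struct {
	mu       sync.Mutex
	fixedQPS int   // assumed capacity of the fixed pool, in requests per second
	second   int64 // the second the counter currently belongs to
	count    int   // requests seen so far in that second
}

// NewRouter returns a Router with the given fixed-pool budget.
func NewRouter(fixedQPS int) *Router {
	return &Router{fixedQPS: fixedQPS}
}

// Pick returns the pool for a request arriving at nowUnix (Unix seconds).
// The timestamp is passed in so the behavior is deterministic in tests.
func (r *Router) Pick(nowUnix int64) string {
	r.mu.Lock()
	defer r.mu.Unlock()
	if nowUnix != r.second {
		// A new second has started: reset the counter.
		r.second = nowUnix
		r.count = 0
	}
	r.count++
	if r.count <= r.fixedQPS {
		return "fixed"
	}
	return "elastic"
}

func main() {
	r := NewRouter(2)
	for i := 0; i < 4; i++ {
		// The first two requests in this second fit the fixed pool,
		// the remaining ones overflow.
		fmt.Println(r.Pick(100))
	}
}
```

A production gateway would more likely use a smoothed rate estimate than a hard per‑second counter, but the routing decision has the same shape.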
Adoption of Golang (2018‑present)
Golang replaced PHP for new services because its lightweight goroutine model solved PHP’s process‑blocking, RPC concurrency and connection‑explosion problems. Services were classified into:
Business gateways (interface) – aggregate APIs per scenario, implement proactive caching and automatic degradation.
Business services (service) – contain domain‑specific logic.
Background jobs (job) – scheduled and asynchronous tasks.
The Golang gateway consolidated dozens of downstream calls into a single request, cutting average latency by more than 50 % compared with the PHP gateway.
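A rough sketch of how a Go gateway can fan out to several downstream services concurrently and degrade gracefully is shown below. The Fetcher type, the service names, the timeout and the fallback value are all hypothetical; the article does not show the actual gateway code.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// Fetcher stands in for one downstream RPC. It is expected to honor ctx.
type Fetcher func(ctx context.Context) (string, error)

// Aggregate runs all calls concurrently under one shared deadline.
// A call that fails or times out contributes the fallback value, so one
// slow dependency degrades part of the response instead of failing it.
func Aggregate(ctx context.Context, timeout time.Duration, calls map[string]Fetcher, fallback string) map[string]string {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		out = make(map[string]string, len(calls))
	)
	for name, call := range calls {
		wg.Add(1)
		go func(name string, call Fetcher) {
			defer wg.Done()
			v, err := call(ctx)
			if err != nil {
				v = fallback
			}
			mu.Lock()
			out[name] = v
			mu.Unlock()
		}(name, call)
	}
	wg.Wait()
	return out
}

// demo wires three fake downstream calls: one healthy, one failing, one slow.
func demo() map[string]string {
	calls := map[string]Fetcher{
		"room": func(ctx context.Context) (string, error) { return "room-42", nil },
		"gift": func(ctx context.Context) (string, error) { return "", errors.New("downstream unavailable") },
		"rank": func(ctx context.Context) (string, error) {
			select {
			case <-time.After(time.Second):
				return "rank-data", nil
			case <-ctx.Done():
				return "", ctx.Err()
			}
		},
	}
	return Aggregate(context.Background(), 50*time.Millisecond, calls, "degraded")
}

func main() {
	for name, v := range demo() {
		fmt.Println(name, "=", v)
	}
}
```

With goroutines the total latency is bounded by the slowest call (or the deadline) rather than the sum of all calls, which is the kind of effect the latency figure above points to.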
New Gateway – Ekango
To replace the aging live‑api, the team evaluated open‑source gateways and selected Envoy as the data plane. A custom Golang control plane named Ekango was built on top of Envoy. Ekango adds distributed rate‑limiting, request rewriting, degradation, unified authentication, risk control and multi‑zone failover, and supports more than 150k QPS per instance.
Ekango also upgraded the internal liverpc protocol to HTTP, simplifying debugging and integration.
Hot‑Key Management
Hot data (e.g., high‑traffic rooms) caused node overloads. The solution evolved through three stages:
PHP‑side monitoring and cache pre‑warming.
A Golang SDK that lets services query hot‑room status.
A generic hot‑key detection SDK using sliding windows, LFU and priority queues (Top‑K). The SDK provides a HeavyKeeper‑based algorithm for accurate hot‑key estimation with low memory overhead.
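The sliding‑window Top‑K idea from the third stage can be sketched with exact counters. Note that this is deliberately not HeavyKeeper: it keeps one exact map per one‑second bucket, which is simpler but uses more memory than a sketch‑based estimator. The Detector type, the bucket size and the key names are assumptions made for illustration, and the code is not safe for concurrent use.

```go
package main

import (
	"fmt"
	"sort"
)

// Detector counts key accesses in a sliding window of one-second buckets.
// Timestamps are non-negative Unix seconds supplied by the caller.
type Detector struct {
	buckets []map[string]int // ring of per-second counters
	stamps  []int64          // the second each ring slot currently holds
}

// KeyCount pairs a key with its access count inside the window.
type KeyCount struct {
	Key   string
	Count int
}

// NewDetector creates a detector covering the last windowSeconds seconds.
func NewDetector(windowSeconds int) *Detector {
	d := &Detector{
		buckets: make([]map[string]int, windowSeconds),
		stamps:  make([]int64, windowSeconds),
	}
	for i := range d.buckets {
		d.buckets[i] = map[string]int{}
	}
	return d
}

// Record notes one access to key at nowUnix.
func (d *Detector) Record(key string, nowUnix int64) {
	i := int(nowUnix % int64(len(d.buckets)))
	if d.stamps[i] != nowUnix {
		// The slot holds an expired second: recycle it.
		d.buckets[i] = map[string]int{}
		d.stamps[i] = nowUnix
	}
	d.buckets[i][key]++
}

// TopK returns up to k keys with the highest counts in the window ending at nowUnix.
func (d *Detector) TopK(k int, nowUnix int64) []KeyCount {
	oldest := nowUnix - int64(len(d.buckets)) + 1
	total := map[string]int{}
	for i, b := range d.buckets {
		if d.stamps[i] < oldest || d.stamps[i] > nowUnix {
			continue // bucket is outside the window
		}
		for key, c := range b {
			total[key] += c
		}
	}
	out := make([]KeyCount, 0, len(total))
	for key, c := range total {
		out = append(out, KeyCount{Key: key, Count: c})
	}
	// Highest count first; ties broken by key for a stable order.
	sort.Slice(out, func(a, b int) bool {
		if out[a].Count != out[b].Count {
			return out[a].Count > out[b].Count
		}
		return out[a].Key < out[b].Key
	})
	if len(out) > k {
		out = out[:k]
	}
	return out
}

func main() {
	d := NewDetector(3)
	for i := 0; i < 3; i++ {
		d.Record("room:1", 10)
	}
	d.Record("room:2", 11)
	fmt.Println(d.TopK(2, 11))
}
```

A caller could treat any key whose windowed count exceeds a threshold as hot and serve it from a local cache. Replacing the exact maps with an approximate structure is what keeps memory bounded when the key space is large.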
Redis 6.0 client‑side caching and a custom proxy‑less Redis client were introduced to transparently cache hot keys and aggregate writes.
Request Amplification Mitigation
Three amplification patterns were identified and addressed:
Clients requesting full room data when only a flag was needed – solved by modular APIs inspired by FieldMask, allowing callers to request only required fields.
Duplicate room requests from multiple downstream services – eliminated by passing room data directly between services instead of re‑fetching.
Unnecessary QPS on low‑traffic services – mitigated with a TAG mechanism; the gateway skips calls to services whose TAG indicates irrelevance for the current request.
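The first pattern, letting callers name the fields they need, might look roughly like the sketch below. The RoomAPI type, the module names and the loaders are hypothetical; the code only shows how a mask can prevent downstream calls for modules the caller did not ask for, not how the production API is defined.

```go
package main

import "fmt"

// Loader fetches one module of room data and stands in for a downstream call.
type Loader func(roomID int64) (interface{}, error)

// RoomAPI maps module names to the loader that produces them.
type RoomAPI struct {
	loaders map[string]Loader
}

// Get returns only the modules named in mask. Loaders for modules that
// were not requested are never invoked.
func (a *RoomAPI) Get(roomID int64, mask []string) (map[string]interface{}, error) {
	out := make(map[string]interface{}, len(mask))
	for _, name := range mask {
		load, ok := a.loaders[name]
		if !ok {
			return nil, fmt.Errorf("unknown module %q", name)
		}
		v, err := load(roomID)
		if err != nil {
			return nil, fmt.Errorf("load %s: %w", name, err)
		}
		out[name] = v
	}
	return out, nil
}

// demo requests a single module and reports how many loaders actually ran.
func demo() (result map[string]interface{}, calls int, err error) {
	api := &RoomAPI{loaders: map[string]Loader{
		"status": func(id int64) (interface{}, error) { calls++; return "live", nil },
		"anchor": func(id int64) (interface{}, error) { calls++; return "anchor-profile", nil },
		"gifts":  func(id int64) (interface{}, error) { calls++; return "gift-list", nil },
	}}
	result, err = api.Get(42, []string{"status"})
	return result, calls, err
}

func main() {
	result, calls, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println(result, "downstream calls:", calls)
}
```

The TAG mechanism in the third pattern follows a similar principle from the gateway side: a per‑request attribute decides whether a downstream call is made at all.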
Activity Assurance Practices
Large‑scale events require systematic preparation:
Scenario mapping and capacity forecasting.
Full‑stack load testing (including write‑path isolation).
Degradation SOPs and pre‑run rehearsals.
A real‑time assurance platform that aggregates alerts, links them to SOPs and records post‑event reports for continuous improvement.
ITPUB
