From LAMP to Microservices: Bilibili Live’s 8‑Year Architecture Evolution
This article chronicles Bilibili Live’s eight‑year journey from a simple LAMP monolith to a highly available microservice ecosystem, detailing the technical motivations, design principles, Swoole‑based services, containerization, Golang migration, custom gateways, hot‑key handling, and operational safeguards that enabled millions of concurrent viewers.
Introduction
Bilibili Live, launched in 2014, grew from a modest trial project into a core business unit with a complex microservice system serving more than ten million concurrent users at peak. This article reviews the architectural evolution over eight years, highlighting key decisions and lessons learned.
0‑to‑1: Early LAMP Architecture
Initially the live platform ran on a classic LAMP stack (Linux, Apache, MySQL, PHP) within a single repository called live‑app‑web. The project combined front‑end pages rendered by Smarty, JavaScript UI, and a PHP‑based message queue built on Redis List.
Microservice Transition with Swoole
Rapid growth exposed monolith limitations: deployment bottlenecks, release conflicts, and single‑point failures. The team adopted Swoole, a high‑performance PHP coroutine framework, to build a microservice platform based on four principles:
Split services by business domain.
Give each service its own database and cache.
Enforce RPC‑only inter‑service communication.
Assign service owners responsible for stability.
The custom microservice framework provided process management, graceful restarts, ORM, caching, and logging. Communication used a simple TCP-based RPC protocol called liverpc (fixed-length header + variable-length JSON body). Service discovery and configuration were handled by ZooKeeper and an internal tool named Apollo.
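The "fixed-length header + variable-length JSON body" framing can be sketched in Go as follows. This is a minimal illustration, not the actual liverpc wire format: the 4-byte big-endian length header, the 1 MiB body limit, the Request fields, and the method name "Room.GetInfo" are all assumptions made for the example.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"encoding/json"
	"fmt"
	"io"
)

// headerSize is a hypothetical fixed header: a 4-byte big-endian body length.
// The real liverpc header layout is not described in this article.
const headerSize = 4

// maxBody is an assumed upper bound that rejects absurd length prefixes.
const maxBody = 1 << 20

// Request is an illustrative payload, not the actual liverpc schema.
type Request struct {
	Method string         `json:"method"`
	Params map[string]any `json:"params"`
}

// WriteFrame marshals v to JSON and writes header + body to w.
func WriteFrame(w io.Writer, v any) error {
	body, err := json.Marshal(v)
	if err != nil {
		return err
	}
	if len(body) > maxBody {
		return fmt.Errorf("body too large: %d bytes", len(body))
	}
	var hdr [headerSize]byte
	binary.BigEndian.PutUint32(hdr[:], uint32(len(body)))
	if _, err := w.Write(hdr[:]); err != nil {
		return err
	}
	_, err = w.Write(body)
	return err
}

// ReadFrame reads exactly one frame from r and unmarshals its body into v.
func ReadFrame(r io.Reader, v any) error {
	var hdr [headerSize]byte
	if _, err := io.ReadFull(r, hdr[:]); err != nil {
		return err
	}
	n := binary.BigEndian.Uint32(hdr[:])
	if n > maxBody {
		return fmt.Errorf("body too large: %d bytes", n)
	}
	body := make([]byte, n)
	if _, err := io.ReadFull(r, body); err != nil {
		return err
	}
	return json.Unmarshal(body, v)
}

// roundTrip encodes and decodes req through an in-memory buffer that
// stands in for a TCP connection.
func roundTrip(req Request) (Request, error) {
	var buf bytes.Buffer
	var got Request
	if err := WriteFrame(&buf, req); err != nil {
		return got, err
	}
	err := ReadFrame(&buf, &got)
	return got, err
}

func main() {
	got, err := roundTrip(Request{Method: "Room.GetInfo", Params: map[string]any{"room_id": 1}})
	if err != nil {
		panic(err)
	}
	fmt.Println(got.Method, got.Params["room_id"])
}
```

A length prefix lets the receiver know exactly how many bytes to read before parsing JSON, which is what makes a stream-oriented TCP connection usable for discrete RPC messages.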
Containerization
Physical‑machine deployments caused port conflicts, resource contention, and scaling challenges. After evaluating the internal container platform, the team Dockerized all services. They discovered that the default CFS CPU scheduler caused severe timeouts for PHP services, so they switched to CPUSET (CPU pinning) and tuned worker counts to 3‑4× the allocated CPU cores.
To handle traffic spikes, resources were split into a fixed pool (CPUSET) and an elastic pool (shared resources). The gateway live‑api directed requests below a QPS threshold to the fixed pool and excess traffic to the elastic pool, enabling graceful handling of bursty loads.
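The threshold-based split between the two pools might look roughly like the sketch below. It assumes a simple one-second counting window and the pool names "fixed" and "elastic"; the article does not say how live-api actually measured QPS, so PoolRouter and simulate are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// PoolRouter sends up to Threshold requests per one-second window to the
// fixed pool and any overflow to the elastic pool. The window logic is an
// assumption; the real gateway may have used a different QPS estimate.
type PoolRouter struct {
	Threshold int

	mu          sync.Mutex
	windowStart time.Time
	count       int
}

// Pick returns the pool that should serve a request arriving at time now.
func (r *PoolRouter) Pick(now time.Time) string {
	r.mu.Lock()
	defer r.mu.Unlock()
	// Start a fresh window once a full second has passed.
	if now.Sub(r.windowStart) >= time.Second {
		r.windowStart, r.count = now, 0
	}
	r.count++
	if r.count <= r.Threshold {
		return "fixed"
	}
	return "elastic"
}

// simulate issues perSecond[i] requests during second i and records the
// pool chosen for each one.
func simulate(threshold int, perSecond []int) []string {
	r := &PoolRouter{Threshold: threshold}
	t0 := time.Unix(0, 0)
	var out []string
	for sec, n := range perSecond {
		now := t0.Add(time.Duration(sec) * time.Second)
		for i := 0; i < n; i++ {
			out = append(out, r.Pick(now))
		}
	}
	return out
}

func main() {
	// Three requests in the first second, one in the next, threshold 2.
	fmt.Println(simulate(2, []int{3, 1})) // prints [fixed fixed elastic fixed]
}
```

Keeping baseline traffic on pinned CPUs preserves predictable latency, while only the overflow pays the cost of shared, less predictable resources.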
Golang Migration ("Golang Turns Out to Be Great")
By 2018, PHP’s multi‑process model could not meet scaling demands: single‑process failures caused cascade outages, RPC concurrency was limited, and database connection explosion hindered horizontal scaling. Golang’s goroutine model solved these problems. The migration introduced three service types:
Business gateway (interface) – aggregates APIs per scenario (App, Web).
Business service (service) – domain‑specific logic such as room or gift services.
Business job (job) – scheduled or asynchronous tasks.
The new Golang gateway reduced client‑side request counts from dozens to one or two per page, added proactive caching, and automatic degradation for downstream failures. Performance tests showed roughly 50% lower latency compared with the PHP implementation.
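The aggregation and automatic degradation behavior can be illustrated with a small fan-out helper. This is a sketch under stated assumptions rather than the team's gateway code: Module, Aggregate, demoPage, the module names, and the fallback strings are all hypothetical, and the timeout only works if each downstream call honors context cancellation.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// Module pairs one downstream call with the fallback served when it fails.
type Module struct {
	Name     string
	Fetch    func(ctx context.Context) (string, error)
	Fallback string
}

// Aggregate calls all modules concurrently and returns one combined result.
// A module that errors or exceeds the timeout is degraded to its fallback,
// so one failing dependency does not fail the whole page.
func Aggregate(ctx context.Context, timeout time.Duration, mods []Module) map[string]string {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	out := make(map[string]string, len(mods))
	var mu sync.Mutex
	var wg sync.WaitGroup
	for _, m := range mods {
		wg.Add(1)
		go func(m Module) {
			defer wg.Done()
			v, err := m.Fetch(ctx)
			if err != nil {
				v = m.Fallback // degrade instead of propagating the error
			}
			mu.Lock()
			out[m.Name] = v
			mu.Unlock()
		}(m)
	}
	wg.Wait()
	return out
}

// demoPage builds one page from a healthy, a failing, and a slow module.
func demoPage() map[string]string {
	ok := func(ctx context.Context) (string, error) { return "room-info", nil }
	down := func(ctx context.Context) (string, error) { return "", errors.New("gift service down") }
	slow := func(ctx context.Context) (string, error) {
		select {
		case <-time.After(time.Second):
			return "rank-fresh", nil
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
	return Aggregate(context.Background(), 50*time.Millisecond, []Module{
		{Name: "room", Fetch: ok, Fallback: "room-default"},
		{Name: "gift", Fetch: down, Fallback: "gift-default"},
		{Name: "rank", Fetch: slow, Fallback: "rank-cached"},
	})
}

func main() {
	fmt.Println(demoPage())
}
```

Because the client receives a single combined response, it needs only one request per page, and the gateway decides per module whether to serve fresh or fallback data.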
New Gateway – Ekango (Envoy + Custom Control Plane)
To replace the aging live‑api, the team evaluated Kong, Tyk, and Envoy, ultimately selecting Envoy as the data plane and building a Golang control plane named Ekango. Ekango provides distributed rate limiting, request rewriting, degradation, unified authentication, risk control, and multi‑zone failover, handling >150k QPS per instance.
Ekango also enabled a service‑mesh solution called Yuumi, which lets PHP/JS services call Golang‑implemented gRPC services via sidecar proxies, abstracting service discovery, retries, and load balancing.
Hot‑Key Management
Hot keys arise from popular rooms, articles, or comments, causing single‑node overloads. The team built a multi‑level caching strategy:
PHP era: a monitor service collected CDN and long‑connection metrics, pushed hot‑room IDs to a queue, and pre‑loaded them into in‑memory caches.
Golang era: an SDK exposed hot-room checks; services could query the SDK for hot status and cache data locally.
General SDK: used sliding‑window + LFU + priority queue to compute Top‑K hot IDs, then pushed them to business services for proactive memory caching.
Proxy‑layer client‑side caching (Redis 6.0) allowed regex‑based key caching at the edge.
HeavyKeeper‑based SDK embedded in the Redis client provided transparent hot‑key detection and caching.
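As a rough illustration of the Top-K idea in the list above, the sketch below combines a sliding window of per-slot counters with a min-heap (priority queue) to pick the hottest IDs. It deliberately omits the LFU component and is not HeavyKeeper; HotCounter and its methods are hypothetical names, and exact per-ID counting like this would only suit a modest key space.

```go
package main

import (
	"container/heap"
	"fmt"
)

// HotCounter counts hits per ID over the most recent len(buckets) time slots.
type HotCounter struct {
	buckets []map[int64]int // ring buffer of per-slot counts
	cur     int
}

func NewHotCounter(slots int) *HotCounter {
	b := make([]map[int64]int, slots)
	for i := range b {
		b[i] = map[int64]int{}
	}
	return &HotCounter{buckets: b}
}

// Hit records one access to id in the current slot.
func (h *HotCounter) Hit(id int64) { h.buckets[h.cur][id]++ }

// Advance moves to the next slot and discards the oldest slot's counts.
func (h *HotCounter) Advance() {
	h.cur = (h.cur + 1) % len(h.buckets)
	h.buckets[h.cur] = map[int64]int{}
}

type entry struct {
	id    int64
	count int
}

// minHeap keeps the coldest entry on top so it is evicted first.
// Ties are broken by ID so results are deterministic.
type minHeap []entry

func (m minHeap) Len() int { return len(m) }
func (m minHeap) Less(i, j int) bool {
	if m[i].count != m[j].count {
		return m[i].count < m[j].count
	}
	return m[i].id > m[j].id
}
func (m minHeap) Swap(i, j int) { m[i], m[j] = m[j], m[i] }
func (m *minHeap) Push(x any)   { *m = append(*m, x.(entry)) }
func (m *minHeap) Pop() any {
	old := *m
	e := old[len(old)-1]
	*m = old[:len(old)-1]
	return e
}

// TopK returns up to k IDs with the highest window counts, hottest first.
func (h *HotCounter) TopK(k int) []int64 {
	total := map[int64]int{}
	for _, b := range h.buckets {
		for id, c := range b {
			total[id] += c
		}
	}
	mh := &minHeap{}
	for id, c := range total {
		heap.Push(mh, entry{id, c})
		if mh.Len() > k {
			heap.Pop(mh) // drop the coldest candidate
		}
	}
	out := make([]int64, mh.Len())
	for i := len(out) - 1; i >= 0; i-- {
		out[i] = heap.Pop(mh).(entry).id
	}
	return out
}

func main() {
	h := NewHotCounter(2)
	for i := 0; i < 5; i++ {
		h.Hit(1001)
	}
	h.Hit(1002)
	fmt.Println(h.TopK(1))
}
```

A service could call Advance on a timer and push the TopK result to business services, which then keep those rooms in local memory.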
Request Amplification Mitigation
The room service (200,000+ QPS) suffered from three amplification patterns:
Over‑fetching: clients requested full room objects when only a flag was needed. The team introduced FieldMask‑style modular APIs to let callers request only required fields.
Duplicate requests: multiple downstream services (gift‑panel, dm‑service) each fetched room data, inflating traffic tenfold. The solution was to pass room data downstream instead of re‑fetching.
Tag‑based gating: downstream services were only invoked when a room’s TAG indicated relevance, reducing unnecessary QPS.
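The FieldMask-style approach to over-fetching can be sketched with a bit mask that tells the room service which parts of a room to load. The Room fields, the Field constants, and RoomStore are assumptions for illustration; the article does not describe the actual API shape.

```go
package main

import "fmt"

// Room is an illustrative room object; the field names are assumptions.
type Room struct {
	ID     int64
	Title  string
	IsLive bool
	AreaID int64
}

// Field is a bit flag naming one group of room data a caller may request.
type Field uint8

const (
	FieldTitle Field = 1 << iota
	FieldIsLive
	FieldArea
)

// RoomStore stands in for the room service. Loads records which backing
// loaders ran, to make the skipped work visible in the example.
type RoomStore struct {
	Loads []string
}

// GetRoom fills only the fields selected by mask and skips the loaders
// (cache or database reads in a real service) for everything else.
func (s *RoomStore) GetRoom(id int64, mask Field) Room {
	r := Room{ID: id}
	if mask&FieldTitle != 0 {
		s.Loads = append(s.Loads, "title")
		r.Title = "demo room" // placeholder value
	}
	if mask&FieldIsLive != 0 {
		s.Loads = append(s.Loads, "live_status")
		r.IsLive = true // placeholder value
	}
	if mask&FieldArea != 0 {
		s.Loads = append(s.Loads, "area")
		r.AreaID = 42 // placeholder value
	}
	return r
}

func main() {
	s := &RoomStore{}
	// A caller that only needs the live flag asks for just that field.
	r := s.GetRoom(7, FieldIsLive)
	fmt.Println(r.IsLive, s.Loads)
}
```

The same Room value returned here could then be passed along to downstream calls instead of each downstream service fetching it again, which is the second mitigation described above.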
Activity Assurance
Large‑scale events require systematic safeguards. The team established a workflow covering scenario mapping, capacity estimation, full‑stack load testing, degradation SOPs, and real‑time on‑site monitoring via a custom activity‑assurance platform. Alerts are linked to SOP manuals, and post‑event reports are generated automatically.
Highlights and Future Outlook
In 2021, Bilibili Live streamed the League of Legends World Championship with over ten million concurrent viewers, marking a technical high point. The roadmap ahead focuses on further stability, multi‑active deployments, and unit‑level isolation, aiming to break new records in upcoming esports seasons.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.
