
From LAMP to Microservices: Bilibili Live’s 8‑Year Architecture Evolution

This article chronicles Bilibili Live’s eight‑year journey from a simple LAMP monolith to a highly available microservice ecosystem, detailing the technical motivations, design principles, Swoole‑based services, containerization, Golang migration, custom gateways, hot‑key handling, and operational safeguards that enabled millions of concurrent viewers.

Bilibili Tech

Introduction

Bilibili Live launched in 2014 and grew from a modest trial project into a core business unit backed by a complex microservice system that has served more than ten million concurrent viewers. This article reviews the architectural evolution over those eight years, highlighting key decisions and lessons learned.

0‑to‑1: Early LAMP Architecture

Initially the live platform ran on a classic LAMP stack (Linux, Apache, MySQL, PHP) within a single repository called live‑app‑web. The project combined front‑end pages rendered by Smarty, JavaScript UI, and a PHP‑based message queue built on Redis List.

Early live system architecture

Microservice Transition with Swoole

Rapid growth exposed monolith limitations: deployment bottlenecks, release conflicts, and single‑point failures. The team adopted Swoole, a high‑performance PHP coroutine framework, to build a microservice platform based on four principles:

Split services by business domain.

Give each service its own database and cache.

Enforce RPC‑only inter‑service communication.

Assign service owners responsible for stability.

The custom microservice framework provided process management, graceful restarts, ORM, caching, and logging. Services communicated over liverpc, a simple TCP‑based RPC protocol whose messages consist of a fixed‑length header followed by a variable‑length JSON body. Service discovery and configuration were handled by ZooKeeper and an internal tool named Apollo.

Microservice framework overview

Containerization

Physical‑machine deployments caused port conflicts, resource contention, and scaling challenges. After evaluating the internal container platform, the team Dockerized all services. They discovered that the default CFS CPU scheduler caused severe timeouts for PHP services, so they switched to CPUSET (CPU pinning) and tuned worker counts to 3‑4× the allocated CPU cores.

To handle traffic spikes, resources were split into a fixed pool (CPUSET) and an elastic pool (shared resources). The gateway live‑api directed requests below a QPS threshold to the fixed pool and excess traffic to the elastic pool, enabling graceful handling of bursty loads.

Fixed vs. elastic resource pools

Golang Migration ("Golang Is Great After All")

By 2018, PHP’s multi‑process model could no longer meet scaling demands: a failure in a single process could cascade into an outage, RPC concurrency was limited, and an explosion in database connections hindered horizontal scaling. Golang’s goroutine model solved these problems. The migration introduced three service types:

Business gateway (interface) – aggregates APIs per scenario (App, Web).

Business service (service) – domain‑specific logic such as room or gift services.

Business job (job) – scheduled or asynchronous tasks.

The new Golang gateway reduced client‑side request counts from dozens to one or two per page and added proactive caching and automatic degradation for downstream failures. Performance tests showed roughly 50% lower latency compared with the PHP implementation.

Golang service diagram

New Gateway – Ekango (Envoy + Custom Control Plane)

To replace the aging live‑api, the team evaluated Kong, Tyk, and Envoy, ultimately selecting Envoy as the data plane and building a Golang control plane named Ekango. Ekango provides distributed rate limiting, request rewriting, degradation, unified authentication, risk control, and multi‑zone failover, and handles more than 150k QPS per instance.

Ekango also enabled a service‑mesh solution called Yuumi, which lets PHP/JS services call Golang‑implemented gRPC services via sidecar proxies, abstracting service discovery, retries, and load balancing.

Ekango architecture

Hot‑Key Management

Hot keys arise from popular rooms, articles, or comments, causing single‑node overloads. The team built a multi‑level caching strategy:

PHP era: a monitor service collected CDN and long‑connection metrics, pushed hot‑room IDs to a queue, and pre‑loaded them into in‑memory caches.

Golang era: an SDK exposed hot‑room checks; services could query the SDK for hot status and cache data locally.

General SDK: combined a sliding window, LFU, and a priority queue to compute the Top‑K hot IDs, then pushed them to business services for proactive in‑memory caching.

Proxy‑layer client‑side caching (Redis 6.0) allowed regex‑based key caching at the edge.

HeavyKeeper‑based SDK embedded in the Redis client provided transparent hot‑key detection and caching.
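The sliding-window Top‑K idea from the general SDK can be illustrated with a small Go sketch. It is a simplified stand-in, not the team's SDK: it keeps exact per-ID counts in a ring of time buckets and selects the Top‑K with a min-heap from `container/heap`, while the LFU layer and the HeavyKeeper approximation mentioned above are left out. The `HotCounter` type and its methods are hypothetical.

```go
package main

import (
	"container/heap"
	"fmt"
	"sort"
)

// HotCounter keeps per-ID hit counts in a ring of time buckets.
type HotCounter struct {
	buckets []map[int64]int // one map per time slot in the window
	cur     int             // index of the slot currently receiving hits
}

func NewHotCounter(slots int) *HotCounter {
	b := make([]map[int64]int, slots)
	for i := range b {
		b[i] = make(map[int64]int)
	}
	return &HotCounter{buckets: b}
}

// Hit records one access to id in the current slot.
func (c *HotCounter) Hit(id int64) { c.buckets[c.cur][id]++ }

// Advance slides the window forward by one slot, discarding the oldest counts.
func (c *HotCounter) Advance() {
	c.cur = (c.cur + 1) % len(c.buckets)
	c.buckets[c.cur] = make(map[int64]int)
}

type entry struct {
	id    int64
	count int
}

// minHeap is a priority queue ordered by ascending count.
type minHeap []entry

func (h minHeap) Len() int           { return len(h) }
func (h minHeap) Less(i, j int) bool { return h[i].count < h[j].count }
func (h minHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x any)        { *h = append(*h, x.(entry)) }
func (h *minHeap) Pop() any {
	old := *h
	x := old[len(old)-1]
	*h = old[:len(old)-1]
	return x
}

// TopK returns up to k IDs with the highest counts in the window, hottest first.
func (c *HotCounter) TopK(k int) []int64 {
	totals := make(map[int64]int)
	for _, b := range c.buckets {
		for id, n := range b {
			totals[id] += n
		}
	}
	h := &minHeap{}
	for id, n := range totals {
		if h.Len() < k {
			heap.Push(h, entry{id, n})
		} else if k > 0 && n > (*h)[0].count {
			(*h)[0] = entry{id, n} // replace the coldest of the current Top-K
			heap.Fix(h, 0)
		}
	}
	res := append([]entry(nil), (*h)...)
	sort.Slice(res, func(i, j int) bool {
		if res[i].count != res[j].count {
			return res[i].count > res[j].count
		}
		return res[i].id < res[j].id
	})
	ids := make([]int64, len(res))
	for i, e := range res {
		ids[i] = e.id
	}
	return ids
}

func main() {
	c := NewHotCounter(3)
	for i := 0; i < 5; i++ {
		c.Hit(1001)
	}
	c.Hit(1002)
	fmt.Println(c.TopK(1))
}
```

A caller would invoke `Advance` on a timer (for example once per second) so that rooms which cool down fall out of the Top‑K on their own.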

Hot‑key detection flow

Request Amplification Mitigation

The room service (200k+ QPS) suffered from three amplification patterns:

Over‑fetching: clients requested full room objects when only a flag was needed. The team introduced FieldMask‑style modular APIs to let callers request only required fields.

Duplicate requests: multiple downstream services (gift‑panel, dm‑service) each fetched room data, inflating traffic tenfold. The solution was to pass room data downstream instead of re‑fetching.

Tag‑based gating: downstream services were only invoked when a room’s TAG indicated relevance, reducing unnecessary QPS.
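The field-selection idea behind the first fix can be sketched in Go as follows. This is a hypothetical illustration of a FieldMask-style API, not the room service's actual interface: `RoomService`, `RoomView`, and the field names `is_live`, `title`, and `anchor` are assumptions, and each loader stands in for a cache or database read.

```go
package main

import "fmt"

// RoomView holds only the fields the caller asked for.
type RoomView map[string]any

// RoomService maps each field name to the loader that produces it.
type RoomService struct {
	loaders map[string]func(roomID int64) any
	Loads   map[string]int // how many times each field was loaded
}

// NewRoomService registers a few illustrative fields.
func NewRoomService() *RoomService {
	return &RoomService{
		loaders: map[string]func(roomID int64) any{
			"is_live": func(int64) any { return true },
			"title":   func(id int64) any { return fmt.Sprintf("room %d", id) },
			"anchor":  func(int64) any { return "anchor-profile" },
		},
		Loads: map[string]int{},
	}
}

// Get loads only the requested fields, so a caller that needs a single flag
// does not trigger the reads behind the rest of the room object.
func (s *RoomService) Get(roomID int64, fields ...string) (RoomView, error) {
	view := RoomView{}
	for _, f := range fields {
		load, ok := s.loaders[f]
		if !ok {
			return nil, fmt.Errorf("unknown field %q", f)
		}
		s.Loads[f]++
		view[f] = load(roomID)
	}
	return view, nil
}

func main() {
	svc := NewRoomService()
	view, err := svc.Get(42, "is_live")
	if err != nil {
		panic(err)
	}
	fmt.Println(view, svc.Loads)
}
```

The same `RoomView` could then be handed to downstream callers instead of each of them calling `Get` again, which is the spirit of the second fix in the list.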

Request amplification diagram

Activity Assurance

Large‑scale events require systematic safeguards. The team established a workflow covering scenario mapping, capacity estimation, full‑stack load testing, degradation SOPs, and real‑time on‑site monitoring via a custom activity‑assurance platform. Alerts are linked to SOP manuals, and post‑event reports are generated automatically.
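The capacity-estimation step usually reduces to simple arithmetic, sketched below in Go. The formula and every number here are illustrative assumptions rather than figures from the team: `InstancesNeeded`, the per-instance throughput, and the headroom factor are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// InstancesNeeded estimates how many instances a service needs for an event:
// ceil(peakQPS * headroom / perInstanceQPS).
// headroom is a safety multiplier (for example 1.5 for 50% spare capacity).
func InstancesNeeded(peakQPS, perInstanceQPS, headroom float64) (int, error) {
	if peakQPS < 0 || perInstanceQPS <= 0 || headroom <= 0 {
		return 0, fmt.Errorf("invalid input: peak=%v perInstance=%v headroom=%v",
			peakQPS, perInstanceQPS, headroom)
	}
	return int(math.Ceil(peakQPS * headroom / perInstanceQPS)), nil
}

func main() {
	// Hypothetical inputs: a 200k QPS peak, 1,500 QPS per instance
	// measured in load testing, and 50% headroom.
	n, err := InstancesNeeded(200_000, 1_500, 1.5)
	if err != nil {
		panic(err)
	}
	fmt.Println(n)
}
```

In practice the per-instance figure would come from the full-stack load tests mentioned above rather than from a guess.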

Activity assurance platform

Highlights and Future Outlook

In 2021, Bilibili Live streamed the League of Legends World Championship with over ten million concurrent viewers, marking a technical high point. The roadmap ahead focuses on further stability, multi‑active deployments, and unit‑level isolation, aiming to break new records in upcoming esports seasons.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
