How Bilibili Scales Capacity: VPA, HPA, and Cost‑Saving Strategies
This article summarizes Zhang He’s Bilibili SRE talk on building a capacity‑management system that visualizes resource usage, reduces costs, improves stability, and leverages Kubernetes VPA, HPA, pooling, and quota management to support massive live‑stream events and rapid feature releases.
Design Philosophy and Motivation
Capacity management at Bilibili targets three fundamental problems:
Visibility: cluster, resource‑pool and node watermarks (utilization levels) are not exposed, making stability hard to guarantee.
Root‑cause tracing: frequent code, configuration and traffic‑shift changes obscure when and why capacity variations occur.
Autoscaling coverage: many bursty activities exceed the limits of existing Horizontal Pod Autoscaler (HPA) configurations.
Key Challenges
Internal dynamics – code releases, config updates, load‑testing, multi‑active traffic routing and cache expiration constantly reshape capacity models.
External spikes – live events, promotions and viral topics generate unpredictable traffic bursts.
Multiple bottlenecks – long service chains (upstream, downstream, middleware) make early detection difficult.
Manual emergency response – reliance on human intervention leads to long recovery times and risk of cascading failures.
Architecture Overview
The platform is built from the bottom up and consists of four layers:
Basic Capacity: collects metrics for clusters, resource pools, nodes and application profiles.
Elastic Resources: implements Vertical Pod Autoscaler (VPA) and HPA, adds pooling and quota controls, and provides visual dashboards.
PaaS Pooling: merges physical pools (comics, live, OGV) into a logical pool, shifting focus to logical quota management.
Quota Management: issues quota policies, integrates with the internal CMDB and ties over‑usage to the release platform for automatic throttling.
VPA‑Based Elastic Scaling
Each service defines a soft limit (recommended) and a hard limit (maximum). VPA uses real‑time CPU usage and service‑level profiles to compute request values, reducing over‑provisioning.
VPA Pipeline Components
Generator – expands high‑level rules (e.g., L0 tier) into individual VPA objects per service.
Recommender – pulls metrics (CPU usage, P99 latency, etc.) from the monitoring system and calculates optimal request values.
Updater – patches the Pod spec with the new request values.
Webhook – listens to deployment events and triggers a resource adjustment if needed.
During large‑scale events, non‑critical services (e.g., L2/L3 back‑ends) can have their soft limits lowered to free resources for core services.
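The Recommender step can be illustrated with a minimal sketch. It is not Bilibili's implementation: the names `Limits`, `percentile` and `recommend_cpu_request`, the 1.2 headroom factor, and the choice to clamp the recommendation to the soft limit are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    """Hypothetical per-service limits, in CPU cores."""
    soft_cores: float  # recommended ceiling; may be lowered during big events
    hard_cores: float  # absolute maximum for the service

    def __post_init__(self):
        if not 0 < self.soft_cores <= self.hard_cores:
            raise ValueError("expected 0 < soft_cores <= hard_cores")


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def recommend_cpu_request(samples, limits, headroom=1.2, floor=0.1):
    """Recommend a CPU request from usage samples.

    Assumption: the request is P99 usage times a headroom factor,
    clamped to the range [floor, soft limit].
    """
    target = percentile(samples, 99) * headroom
    return round(min(max(target, floor), limits.soft_cores), 3)
```

Under these assumptions, lowering `soft_cores` for an L2 or L3 service before a large event shrinks its recommended request right away, which is one way the freed capacity could be handed to core services.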
Strategy Management
Metric management – configure which metric (CPU max, CPU P99, memory, QPS) drives the recommendation.
Template management – maintain per‑tier templates (L0, L1, L2…) that encode service‑type characteristics.
Pre‑estimation & A/B testing – simulate strategy impact before rollout.
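A rough sketch of how per‑tier templates and pre‑estimation might fit together is shown here. The `Template` fields, the headroom values and the service dictionary keys are hypothetical; the talk does not specify the actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    """Hypothetical per-tier strategy template."""
    metric: str      # which usage metric drives the recommendation
    headroom: float  # multiplier applied on top of the metric


# Illustrative values only: core (L0) services get more headroom.
TEMPLATES = {
    "L0": Template(metric="cpu_max", headroom=1.5),
    "L2": Template(metric="cpu_p99", headroom=1.25),
}


def estimate(services, templates):
    """Dry run: return (current_total, proposed_total) CPU cores.

    Nothing is applied; the result only shows what the strategy would change.
    """
    current = sum(s["request"] for s in services)
    proposed = sum(s[templates[s["tier"]].metric] * templates[s["tier"]].headroom
                   for s in services)
    return current, proposed


services = [
    {"tier": "L0", "request": 8.0, "cpu_max": 4.0, "cpu_p99": 3.0},
    {"tier": "L2", "request": 4.0, "cpu_max": 3.0, "cpu_p99": 2.0},
]
current, proposed = estimate(services, TEMPLATES)
print(f"current={current} proposed={proposed} saved={current - proposed}")
```

Comparing the two totals before rollout gives a coarse view of how many cores a strategy would release. An A/B test would then apply the template to only a subset of services.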
Data Operations
Coverage dashboards – show pool‑wide VPA adoption rate and per‑service adjustment magnitude.
Execution logs – record each recommendation and its applied result for audit.
Strategy analysis – compare pre‑estimation with actual outcomes to refine templates.
Blacklist & Alerting
A blacklist excludes high‑risk services (e.g., those under heavy load tests or newly released features) from VPA adjustments during unexpected spikes.
Alerting monitors failure rate, coverage ratio and redundancy; alerts are routed to SRE and platform owners when VPA actions deviate from expectations.
PaaS Pooling Implementation
Physical pools for comics, live streaming and OGV are unified into a logical pool. The rollout follows three concrete steps:
Standardized governance: remove special constraints, unify kernel versions, disable nolimit bindings, normalize logs and cpuset settings.
Platform support: introduce logical quota objects per organization, enforce quota limits on the merged pool, and extend VPA coverage to the pooled resources.
Executive endorsement: secure top‑down commitment to coordinate cross‑department resource sharing.
Quota Management Integration
The capacity platform publishes quota policies to an internal CMDB‑backed business tree. Each organization receives a quota allocation; excess usage triggers the release platform to throttle or reject further scaling attempts.
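The quota check that a release platform performs before a scale‑up could look roughly like the sketch below. The `Quota` class and `admit_scale_up` function are hypothetical names, and measuring quota in CPU cores is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Quota:
    """Hypothetical logical quota for one organization, in CPU cores."""
    org: str
    cpu_cores: float
    used_cores: float = 0.0


def admit_scale_up(quota, requested_cores):
    """Admit a scale-up only if it fits within the remaining quota.

    Returns True and records the usage on success; returns False,
    leaving the quota untouched, when the request would exceed it.
    """
    if requested_cores < 0:
        raise ValueError("requested_cores must be non-negative")
    if quota.used_cores + requested_cores > quota.cpu_cores:
        return False
    quota.used_cores += requested_cores
    return True


live = Quota(org="live", cpu_cores=100.0, used_cores=90.0)
print(admit_scale_up(live, 8.0))  # fits within the remaining 10 cores
print(admit_scale_up(live, 4.0))  # would exceed the quota, so it is rejected
```

A real system would also need to release usage on scale‑down and handle concurrent requests. Both are omitted here.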
HPA Design and Observability
The HPA design mirrors the VPA design (policy management, observability and alerting) and applies it to horizontal scaling.
Policy management: define per‑tier thresholds (e.g., L0 services scale out when CPU utilization exceeds 30%). Metrics include CPU, memory and QPS.
Elastic pre‑check: before scaling, verify downstream capacity (DB connection pools, TiDB, caches, message queues) to avoid overload.
Observability: track coverage rate, scaling quality and instance count; dashboards display bulk enable/disable, coverage percentages and current replica numbers.
Alerting: generate alerts for scaling failures, abnormal HPA behavior, or downstream bottlenecks.
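The threshold policy and the elastic pre‑check can be sketched together. `desired_replicas` uses the standard Kubernetes HPA ratio formula, without its tolerance band. The pre‑check, which caps replicas by a database connection budget, is an assumed example of a downstream constraint, and `precheck` and its parameters are hypothetical.

```python
import math


def desired_replicas(current_replicas, cpu_util_pct, target_util_pct):
    """Kubernetes-style HPA formula: ceil(current * observed / target)."""
    if target_util_pct <= 0:
        raise ValueError("target_util_pct must be positive")
    return math.ceil(current_replicas * cpu_util_pct / target_util_pct)


def precheck(desired, conns_per_replica, db_max_conns):
    """Cap the replica count so the DB connection pool is not exhausted.

    Returns (allowed_replicas, was_capped). A capped result is the kind of
    downstream bottleneck that would be worth alerting on.
    """
    allowed = db_max_conns // conns_per_replica
    return min(desired, allowed), desired > allowed


# An L0 service targeting 30% CPU, currently at 60% on 10 replicas.
want = desired_replicas(10, 60, 30)
replicas, capped = precheck(want, conns_per_replica=50, db_max_conns=800)
print(want, replicas, capped)
```

In this example the formula asks for 20 replicas, but the connection budget allows only 16, so the scale‑out is capped and flagged instead of overloading the database.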
Capacity Inspection and Protection
Regular inspections visualize risk‑prone services, usage rates and quota health for developers, platform teams and SRE. An event‑driven pipeline aggregates changes from the release platform, HPA, and node management, enabling rapid root‑cause analysis of capacity variations.
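One simple way to use such an event stream for root‑cause analysis is to list the platform events that immediately precede a capacity change. The sketch below assumes a hypothetical `Event` record and a fixed look‑back window; the real pipeline's schema and matching logic are not described in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical normalized platform event."""
    ts: int        # epoch seconds
    source: str    # e.g. "release", "hpa", "node"
    service: str
    detail: str


def candidate_causes(events, service, change_ts, window_s=600):
    """Return events for `service` in the window before `change_ts`, newest first."""
    hits = [e for e in events
            if e.service == service and change_ts - window_s <= e.ts <= change_ts]
    return sorted(hits, key=lambda e: e.ts, reverse=True)


events = [
    Event(100, "release", "feed", "deploy v2"),
    Event(500, "hpa", "feed", "scale 8 -> 12"),
    Event(550, "hpa", "chat", "scale 2 -> 3"),
    Event(900, "node", "feed", "node drained"),
]
for e in candidate_causes(events, "feed", change_ts=600, window_s=300):
    print(e.ts, e.source, e.detail)
```

Listing the most recent events first reflects the assumption that the latest change before a capacity shift is the most likely cause.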
Operational Dashboards
Basic capacity charts – cluster, pool, node and application metrics.
Business‑level views – usage trends, hot services and pain points.
Capacity event streams – link platform actions (e.g., releases, scaling events) to resource changes.
Weekly reports – department‑specific and internal summaries of usage, efficiency gains and stability risks.
Achieved Benefits
No new physical machines were added for online PaaS workloads in the first half of 2022.
Zero additional procurement for large‑scale events (S12) thanks to pooled resources and VPA/HPA elasticity.
Event support capacity grew >10× while provisioning time dropped from weeks to hours.
Smaller services experienced reduced outage risk due to larger, more distributed pools.
Urgent scaling needs (blue‑green releases, HPA oversell) are satisfied within minutes.
dbaplus Community
