How Bilibili Scaled Monitoring: From Prometheus to a 2.0 VM‑Flink Architecture
Bilibili rebuilt its monitoring platform to handle explosive metric growth by separating collection, storage, and compute, adopting VictoriaMetrics, zone‑based scheduling, and Flink‑driven pre‑aggregation, which together improved stability, query performance, cloud data quality, and overall observability.
Background
Metrics are a cornerstone of observability. By the end of 2021 Bilibili had deployed a unified monitoring platform based on Prometheus + Thanos, but rapid business growth caused metric volume to explode, straining system stability, query performance, and cloud‑side data quality.
Stability issues : Alert calculation relied on local Prometheus; when large applications changed pod names, rebuilding the timeseries index consumed massive memory, leading to OOM crashes.
Poor user query experience : Frequent OOM crashes caused data gaps, and Prometheus + Thanos queries were slow or timed out, resulting in delayed panels, failed open‑API calls, and false alarms.
Cloud monitoring data quality : Multiple cloud providers, accounts, and regions created a complex network topology, often causing missing data and making it hard for users to select the correct data source.
2.0 Architecture Design
Design goals : Separate collection from storage, separate compute from storage, and adopt zone‑based unitized disaster recovery.
Collection‑storage separation : Decouple target scheduling from data ingestion so collectors can scale elastically.
Compute‑storage separation : Allow independent scaling of storage and query resources, avoiding linear resource waste.
Time‑series database selection : VictoriaMetrics (VM) was chosen for its high write/read performance, distributed architecture, and operational efficiency.
Unitized disaster recovery : All targets are scheduled by zone, ensuring that a full‑stack unit can fail over within the same zone.
Functional Overview
[Figure: high‑level functional diagram of the 2.0 architecture]
The architecture supports a wide range of scenarios, from PaaS application monitoring to IaaS host, container, and network monitoring.
Data Sources
Metrics are collected from two main layers: PaaS (application and middleware) and IaaS (servers, containers, network). The original pull‑based discovery caused two problems: delayed target detection (up to 30 s) and high operational cost when pull endpoints failed.
To solve this, Bilibili switched to a push model where services actively push target definitions, enabling real‑time discovery and better visibility of collection status.
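The push model can be pictured as a small registry that services write their targets into. The sketch below is only an illustration: the class name, the method names, and the TTL-based expiry are assumptions, not details from Bilibili's implementation.

```python
import time
from dataclasses import dataclass, field


@dataclass
class TargetRegistry:
    """Hypothetical in-memory registry that services push scrape targets into."""
    ttl_seconds: float = 60.0  # assumed expiry for targets that stop pushing
    _targets: dict = field(default_factory=dict, repr=False)  # (zone, address) -> (labels, last_seen)

    def push(self, zone, address, labels, now=None):
        """Called by a service to register or refresh one target."""
        now = time.time() if now is None else now
        self._targets[(zone, address)] = (dict(labels), now)

    def active(self, zone, now=None):
        """Addresses in a zone that pushed within the TTL, sorted for stable output."""
        now = time.time() if now is None else now
        return sorted(
            address
            for (z, address), (_, seen) in self._targets.items()
            if z == zone and now - seen <= self.ttl_seconds
        )
```

Because every push carries a timestamp, the same structure also shows which targets have gone silent, which is one way to get the collection-status visibility mentioned above.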
Data Collection
Scheduling layer : Generates collection jobs and distributes target instances. It consists of a primary (Master) scheduler and a secondary (Contractor) scheduler per zone.
Master scheduler : Retrieves full job configurations from the database, builds per‑zone configurations in memory, and protects against large‑scale accidental deletions. Optimizations (asynchronous processing, memory cache, multi‑goroutine) reduced full‑schedule time from ~50 s to <10 s.
Contractor scheduler : Pulls zone‑specific configurations from Master, assigns jobs to healthy collectors based on capacity, and ensures deterministic placement to avoid random target migration.
Collector (vmagent wrapper) : Sends heartbeats to Contractor and reloads vmagent with updated configs. Enabling promscrape.streamParse=true reduced vmagent memory usage by ~20 %.
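Two of the scheduler behaviours above, the deletion guard and deterministic capacity-aware placement, can be sketched in a few lines. The article does not say how either is implemented; the threshold value and the use of rendezvous hashing below are assumptions chosen for illustration, and all function names are hypothetical.

```python
import hashlib


def safe_to_apply(previous_targets, new_targets, max_drop_ratio=0.3):
    """Refuse a new config that removes too many targets at once.

    max_drop_ratio is an assumed threshold, not a value from the article.
    """
    previous = set(previous_targets)
    if not previous:
        return True
    removed = len(previous - set(new_targets))
    return removed / len(previous) <= max_drop_ratio


def _score(job, collector):
    # Stable hash of the (job, collector) pair; identical on every run.
    digest = hashlib.sha256(f"{job}|{collector}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def assign_jobs(jobs, capacity):
    """Deterministically place jobs on healthy collectors, respecting capacity.

    capacity maps collector name -> max number of jobs. Each job prefers the
    collector with the highest hash score (rendezvous hashing) and falls back
    to the next-best collector when the preferred one is full.
    """
    load = {c: 0 for c in capacity}
    placement = {}
    for job in sorted(jobs):  # sorted so input order never changes the result
        ranked = sorted(capacity, key=lambda c: _score(job, c), reverse=True)
        for collector in ranked:
            if load[collector] < capacity[collector]:
                placement[job] = collector
                load[collector] += 1
                break
        else:
            raise RuntimeError(f"no capacity left for job {job!r}")
    return placement
```

With hash-ranked preferences, removing one collector mainly moves the jobs that were placed on it, plus any displaced by capacity limits, which matches the goal of avoiding random target migration.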
Data Storage
Metrics are stored in vmstorage. Core types include:
MetricName – serialized label KV pairs (e.g., 4 byte accountId | 4 byte projectId | metricName | …).
MetricId – unique ID of a time series, generated from a nanosecond timestamp.
TSID – composite key of tenant, metric name ID, job ID, instance ID, and MetricId.
Index structures map MetricName → TSID, MetricId → MetricName, and MetricId → TSID, with inverted indexes enabling fast lookup of MetricIds by label filters. Compared with Prometheus, VM reduces disk usage by ~40 %.
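The relationship between these mappings can be shown with a toy in-memory index. This is a simplified sketch, not VictoriaMetrics code: it uses a sequential counter for IDs, keeps everything in Python dicts, and leaves out TSID and the on-disk format entirely. The class and method names are hypothetical.

```python
from collections import defaultdict


class TinyIndex:
    """Toy label index: forward maps plus an inverted index of label pairs."""

    def __init__(self):
        self._next_id = 1
        self.name_to_id = {}              # canonical MetricName -> MetricId
        self.id_to_name = {}              # MetricId -> canonical MetricName
        self.postings = defaultdict(set)  # (label, value) -> {MetricId, ...}

    @staticmethod
    def canonical(labels):
        # Sorted label pairs, so the same series always yields the same key.
        return tuple(sorted(labels.items()))

    def get_or_create(self, labels):
        """Return the MetricId for a label set, registering it if new."""
        key = self.canonical(labels)
        if key not in self.name_to_id:
            metric_id = self._next_id
            self._next_id += 1
            self.name_to_id[key] = metric_id
            self.id_to_name[metric_id] = key
            for pair in key:
                self.postings[pair].add(metric_id)
        return self.name_to_id[key]

    def search(self, **filters):
        """Return sorted MetricIds whose labels match every equality filter."""
        sets = [self.postings.get(pair, set()) for pair in filters.items()]
        if not sets:
            return sorted(self.id_to_name)
        return sorted(set.intersection(*sets))
```

A label filter such as `search(app="A", env="prod")` intersects two posting sets, which is the core idea behind resolving label filters to MetricIds quickly.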
In production, a 48‑core, 256 GB vmstorage node can handle 400 k writes per second and 20 k queries per second, with Go's GOGC setting tuned to balance memory footprint against garbage‑collection CPU overhead.
Data Query
PromQL auto‑replace enhancement : Complex queries such as the one below often cause slow panels or failures because of the massive data volume they touch.
histogram_quantile(0.99, sum(rate(grpc_server_requests_duration_ms_bucket{app="$app",env=~"$env"}[2m])) by (le, method))
The execution tree is parsed by vmselect, then processed depth‑first: the leaf node fetches raw data, rate computes a per‑series rate, sum aggregates by le and method, and finally histogram_quantile calculates the percentile.
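The bottom-up evaluation order can be imitated with three small functions applied from the leaf outward. This is a minimal sketch, not vmselect's implementation: the rate here ignores counter resets and extrapolation, the quantile assumes 0 < q < 1, and the data layout (label dicts paired with sample lists) is an assumption made for the example.

```python
import math
from collections import defaultdict


def rate(series):
    """Simplified rate: (last - first) / elapsed, no resets or extrapolation."""
    out = []
    for labels, samples in series:
        (t0, v0), (t1, v1) = samples[0], samples[-1]
        out.append((labels, (v1 - v0) / (t1 - t0)))
    return out


def sum_by(vector, keys):
    """Sum values that share the same values for the given label keys."""
    groups = defaultdict(float)
    for labels, value in vector:
        groups[tuple((k, labels[k]) for k in keys)] += value
    return [(dict(group), value) for group, value in groups.items()]


def histogram_quantile(q, vector):
    """Linear interpolation inside the bucket that contains rank q * total."""
    by_rest = defaultdict(list)
    for labels, value in vector:
        rest = tuple(sorted((k, v) for k, v in labels.items() if k != "le"))
        by_rest[rest].append((float(labels["le"]), value))
    result = {}
    for rest, buckets in by_rest.items():
        buckets.sort()
        total = buckets[-1][1]  # the +Inf bucket holds the total count
        if total == 0:
            result[rest] = math.nan
            continue
        rank = q * total
        lower_bound, lower_count = 0.0, 0.0
        for upper, count in buckets:
            if count >= rank:
                if math.isinf(upper):
                    # Rank falls in the +Inf bucket: report the last finite bound.
                    result[rest] = lower_bound
                else:
                    fraction = (rank - lower_count) / (count - lower_count)
                    result[rest] = lower_bound + (upper - lower_bound) * fraction
                break
            lower_bound, lower_count = upper, count
    return result
```

Evaluating `histogram_quantile(0.99, sum_by(rate(raw), ["le", "method"]))` follows the same leaf-to-root order as the query. The expensive part is the inner `sum_by(rate(...))`, since it touches every raw series.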
To optimize, Bilibili pre‑aggregates the heavy sub‑tree into a new metric test_metric_app_A. A mapping replaces the original sub‑tree with the pre‑aggregated metric, adding an extra aggregation layer when necessary to preserve query semantics.
Because the pre‑aggregated metric may drop the env label, the mapping was adjusted to include code in the group‑by, ensuring no loss of filtering capability.
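One way to think about the mapping is as a rule that only fires when every label the query filters or groups on survives in the pre-aggregated metric. The sketch below works on an already-parsed description of the sub-tree rather than on a real PromQL AST. The rule structure, the function name, and the choice of labels kept by the pre-aggregation are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreAggRule:
    source_metric: str      # raw metric the rule covers
    app: str                # the rule is scoped to one app
    window: str             # rate window baked into the pre-aggregated metric
    kept_labels: frozenset  # labels the pre-aggregation groups by
    target_metric: str      # name of the pre-aggregated metric


def rewrite(metric, filters, window, group_by, rules):
    """Return a rewritten sum(rate(...)) sub-tree, or None if no rule is safe.

    filters maps label -> (operator, value), e.g. {"env": ("=~", "prod|uat")}.
    """
    for rule in rules:
        # Labels the query still needs after the app filter is absorbed.
        needed = (set(filters) - {"app"}) | set(group_by)
        if (metric == rule.source_metric
                and filters.get("app") == ("=", rule.app)
                and window == rule.window
                and needed <= rule.kept_labels):
            rest = ",".join(
                f'{k}{op}"{v}"'
                for k, (op, v) in sorted(filters.items())
                if k != "app"
            )
            selector = f"{rule.target_metric}{{{rest}}}" if rest else rule.target_metric
            # Extra aggregation layer: the stored metric may carry more labels
            # than the query groups by, so sum again over the requested keys.
            return f"sum({selector}) by ({', '.join(group_by)})"
    return None
```

Summing an already-summed rate over a subset of its labels gives the same result as the original sum, which is why the extra aggregation layer preserves the query's semantics. When a needed label is missing from the rule, the function returns None and the original query runs unchanged.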
Flink‑Based PromQL Pre‑Aggregation
Instead of periodic batch pre‑aggregation, Bilibili uses Flink to perform real‑time aggregation:
Data filtering : Leaf‑node metric names and labels are used to filter the incoming Kafka stream.
Data partitioning : The sum(rate(...)) group‑by keys become Flink partition keys, drastically reducing memory pressure.
Window execution : Flink processes the execution tree in a depth‑first manner, handling functions like increase by reusing Prometheus logic.
Data sink : Aggregated results are written back to vminsert via the remote‑write protocol.
A 100‑core, 400 GB Flink job can cache and compute 300 M data points per 2‑minute window, cutting resource usage compared with batch pre‑aggregation.
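The four stages can be imitated in plain Python to make the data flow concrete. This is a sketch only, not Flink code: it uses tumbling windows instead of PromQL's sliding range, a simplified rate without counter-reset handling, and returns records instead of sending them over remote write. The function name and data layout are assumptions.

```python
from collections import defaultdict

WINDOW_SECONDS = 120  # matches the [2m] range in the example query


def pre_aggregate(points, metric, app, group_keys, out_metric):
    """points: iterable of (labels, timestamp_seconds, counter_value)."""
    # 1. Filter: keep only samples that the leaf selector would match.
    matched = (
        p for p in points
        if p[0].get("__name__") == metric and p[0].get("app") == app
    )

    # 2. Partition: bucket samples by (window, group-by key), then by raw series.
    windows = defaultdict(lambda: defaultdict(list))
    for labels, ts, value in matched:
        window_start = int(ts // WINDOW_SECONDS) * WINDOW_SECONDS
        group = tuple(labels[k] for k in group_keys)
        series = tuple(sorted(labels.items()))
        windows[(window_start, group)][series].append((ts, value))

    # 3. Window execution: rate per raw series, then sum within the group.
    results = []
    for (window_start, group), series_map in sorted(windows.items()):
        total = 0.0
        for samples in series_map.values():
            samples.sort()
            (t0, v0), (t1, v1) = samples[0], samples[-1]
            if t1 > t0:
                total += (v1 - v0) / (t1 - t0)
        # 4. Sink: emit one sample per group, stamped at the window end.
        out_labels = dict(zip(group_keys, group), __name__=out_metric)
        results.append((out_labels, window_start + WINDOW_SECONDS, total))
    return results
```

Keying by the group-by labels means each partition only holds the raw series for one output series, which is why partitioning on those keys keeps per-worker memory bounded.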
Query Optimization Benefits
Automatic query rewriting combined with Flink pre‑aggregation eliminates more than 90 % of the queries that previously took longer than 20 s, cuts data source consumption by ~50 %, and brings p90 query latency down to the millisecond range.
Grafana Upgrade
The monitoring UI was upgraded from Grafana 6.7.x to 9.2.x. Challenges included breaking changes in plugins and data sources. Solutions involved extensive testing, script‑based compatibility fixes, and containerizing the entire stack (nginx + auth + Grafana) for easier deployment.
Post‑upgrade, panel loading speed improved tenfold, and query performance increased dramatically.
Cloud Monitoring Solution
All cloud‑side Prometheus instances remote‑write their data to the IDC storage cluster, with a dedicated vm‑auth component providing tenant authentication and traffic scheduling for security.
Zone‑based scheduling for cloud targets unified data sources, reducing the number of cloud data sources from >20 to a single source and cutting on‑call incidents caused by missing data by >90 %.
Future Plans
Extend metric retention beyond the current 15 days for deeper analysis.
Support finer‑grained metric collection intervals (e.g., 5 s) for high‑resolution observability.
Enhance self‑monitoring capabilities to cover all operational SOPs.
Iterate the metric platform with write/query bans, whitelist features, and LLM‑driven text‑to‑PromQL translation.
dbaplus Community