Scaling Bilibili’s Metrics Platform with VictoriaMetrics and Flink Pre‑aggregation
This article details how Bilibili redesigned its monitoring system to cope with explosive metric growth. The redesign separates collection from storage, adopts VictoriaMetrics, schedules targets by zone, rewrites PromQL queries automatically, and uses Flink for pre‑aggregation, which dramatically lowered query latency and improved stability.
Background
By the end of 2021 Bilibili had deployed a unified monitoring platform based on Prometheus + Thanos, but rapid business growth caused metric data to explode, leading to stability issues, OOM alerts, poor query performance, and fragmented cloud data sources.
Stability problems: Prometheus performed local alert calculations; when large applications changed pod labels, rebuilding the time‑series index consumed massive memory and caused OOM.
Poor user experience: Frequent OOM crashes broke data continuity, and Prometheus + Thanos queries were slow or timed out.
Cloud data quality: Multiple cloud providers, accounts, and regions produced isolated Prometheus instances, making it hard for users to select the correct data source.
2.0 Architecture Design
Design Principles
Separate collection from storage: Decouple collectors from storage so target instances can be dynamically scheduled and collectors can scale elastically.
Separate compute from storage: Allow independent scaling of write, storage, and query resources to avoid waste.
Time‑series database selection: Choose VictoriaMetrics (VM) for its write/read performance, distributed architecture, and operational efficiency.
Unit‑level disaster recovery: Schedule all targets by zone dimension to keep the entire collection‑transfer‑storage‑query chain within a single zone, achieving zone‑level fault isolation.
The overall functional architecture is illustrated below:
Data Source
Metrics are collected from two layers:
PaaS layer – application, middleware, and offline component monitoring.
IaaS layer – self‑hosted servers, cloud VMs, containers, and network devices.
The original pull‑based target discovery introduced a 30 s delay and carried a high operational cost. Switching to a push model lets services actively register their targets, which gives real‑time visibility into collection status.
Data Collection
Scheduling Layer
Two‑level scheduling ensures zone‑aware distribution:
Master (first‑level): Loads all job configs from the database, builds per‑zone configurations in memory, and protects against massive deletions or updates by intercepting jobs affecting >5k targets.
Contractor (second‑level): Periodically fetches zone‑specific configs from Master, assigns them to healthy collectors based on capacity, and maintains state during restarts to avoid data gaps.
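The two scheduling levels can be sketched roughly as follows. This is a minimal illustration, not Bilibili's implementation: the Collector class, its capacity field, and the "most free capacity first" placement policy are assumptions made for the example. Only the 5k‑target guard and the zone‑ and health‑aware assignment come from the description above.

```python
from dataclasses import dataclass, field

# Threshold taken from the text: changes touching more than 5k targets are intercepted.
MAX_AFFECTED_TARGETS = 5000


@dataclass
class Collector:
    """Hypothetical collector record as the Contractor might see it."""
    name: str
    zone: str
    capacity: int              # assumed: max number of targets this collector can scrape
    healthy: bool = True
    targets: list = field(default_factory=list)


def guard_change(affected_targets: int) -> bool:
    """Master-side check: True if a config change may proceed."""
    return affected_targets <= MAX_AFFECTED_TARGETS


def assign_zone_targets(targets, collectors, zone):
    """Contractor-side assignment for one zone.

    Each target goes to the healthy collector in `zone` that currently has
    the most free capacity (an assumed policy, chosen for simplicity).
    """
    pool = [c for c in collectors if c.healthy and c.zone == zone]
    for target in targets:
        candidates = [c for c in pool if len(c.targets) < c.capacity]
        if not candidates:
            raise RuntimeError(f"no free collector capacity in zone {zone}")
        best = max(candidates, key=lambda c: c.capacity - len(c.targets))
        best.targets.append(target)
    return {c.name: list(c.targets) for c in pool}
```

Because collectors outside the zone and unhealthy collectors never enter the pool, a failure in one zone cannot pull targets into another, which is the isolation property the design aims for.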
Collector
Implemented as a wrapper around vmagent, reporting heartbeats and reloading configs.
Enabled streaming collection (promscrape.streamParse=true) to cut memory usage by ~20%.
Added graceful shutdown: on exit signal, the collector stops heartbeats but continues one more collection cycle before terminating, preventing metric gaps.
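The graceful‑shutdown behaviour can be illustrated with a small loop. This is a sketch under assumptions, not the wrapper's actual code: CollectorLoop, the scrape and heartbeat callbacks, and the 15 s default interval are all hypothetical names and values. The idea it shows is the one described above: once an exit signal arrives, heartbeats stop, but one more collection cycle runs before the process ends, presumably so the scheduler has time to reassign the targets.

```python
import signal
import time


class CollectorLoop:
    """Hypothetical collector main loop (not vmagent's real code)."""

    def __init__(self, scrape, heartbeat, interval_s=15.0):
        self.scrape = scrape            # callable: run one collection cycle
        self.heartbeat = heartbeat      # callable: report liveness to the scheduler
        self.interval_s = interval_s    # assumed scrape interval
        self.stopping = False

    def request_stop(self, signum=None, frame=None):
        """Signal handler: only flips a flag; the loop handles the drain."""
        self.stopping = True

    def run(self):
        while not self.stopping:
            self.heartbeat()
            self.scrape()
            time.sleep(self.interval_s)
        # Exit requested: no further heartbeats, but one last collection cycle.
        self.scrape()


def install_signal_handlers(loop):
    """Route SIGTERM and SIGINT to the loop's stop request."""
    signal.signal(signal.SIGTERM, loop.request_stop)
    signal.signal(signal.SIGINT, loop.request_stop)
```

Keeping the handler down to a flag flip avoids doing I/O inside a signal handler; the final scrape runs on the normal code path.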
Data Storage
Metrics are stored in vmstorage. Key internal types:
MetricName: Serialises label key‑value pairs; on write, vminsert converts labels to a MetricName record.
MetricId: 8‑byte identifier derived from a nanosecond timestamp, serving as the unique key of a time series within the cluster.
TSID: Composite key consisting of tenant, metric name ID, job ID, instance ID, and MetricId. It enables lexicographic ordering so that same‑tenant, same‑metric series are stored contiguously.
MetricName layout:
4 byte account id | 4 byte projectid | metricname(__name__) | 1 | tag1 k | 1 | tag1 v | … | tagn k | 1 | tagn v | 1 | 2
TSID layout:
4 byte accountid | 4 byte projectid | 8 byte metricname(__name__) id | 4 byte job id | 4 byte instance id | 8 byte metricid
Abstract index structures:
MetricName -> TSID: 0 | MetricName | TSID
MetricId -> MetricName: 3 | 4 byte accountid | 4 byte projectid | MetricId | MetricName
MetricId -> TSID: 2 | 4 byte accountid | 4 byte projectid | MetricId | TSID
Inverted index example:
1 | 4 byte accountid | 4 byte projectid | metricname(__name__) | 1 | MetricId
1 | 4 byte accountid | 4 byte projectid | tag k | 1 | tag v | 1 | MetricId
Using prefix‑based compression reduces disk usage by ~40% compared with raw Prometheus.
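To make the key layouts concrete, the sketch below packs a TSID and a MetricName into bytes following the field widths listed above. It is an illustration only and not VictoriaMetrics' actual encoder: the function names are hypothetical, the separator bytes 1 and 2 are read directly off the layout, labels are sorted by key as an assumption, and byte escaping inside names and values is ignored. It shows why fixed‑width big‑endian fields give the lexicographic ordering mentioned for TSID: comparing the encoded bytes is the same as comparing the fields left to right.

```python
import struct

# Big-endian, no padding: account (4), project (4), metric-name id (8),
# job id (4), instance id (4), metric id (8) = 32 bytes in total.
TSID_FORMAT = ">IIQIIQ"

SEP = b"\x01"   # separator byte shown as "1" in the layout
END = b"\x02"   # terminator byte shown as "2" in the layout


def encode_tsid(account_id, project_id, metric_name_id, job_id, instance_id, metric_id):
    """Pack the six TSID fields into a fixed-width, byte-comparable key."""
    return struct.pack(
        TSID_FORMAT,
        account_id, project_id, metric_name_id, job_id, instance_id, metric_id,
    )


def encode_metric_name(account_id, project_id, name, labels):
    """Serialise tenant, metric name, and label pairs in the listed order.

    Simplification: real encoders must escape SEP/END bytes inside values.
    """
    out = struct.pack(">II", account_id, project_id) + name.encode() + SEP
    for key in sorted(labels):              # assumed: labels sorted by key
        out += key.encode() + SEP + labels[key].encode() + SEP
    return out + END
```

Since tenant and metric‑name ID come first in the key, sorting encoded TSIDs places series of the same tenant and metric next to each other, which is what makes range scans and prefix compression effective.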
Data Query
Typical heavy query:
histogram_quantile(0.99, sum(rate(grpc_server_requests_duration_ms_bucket{app="$app",env=~"$env"}[2m])) by (le, method))
The execution tree consists of four stages: raw data fetch, rate calculation, sum aggregation, and histogram_quantile. The first stage dominates latency because it must read tens of millions of points.
Optimization idea: pre‑aggregate the sum result into a new metric (e.g., test_metric_app_A) and replace the subtree during query planning.
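sum(rate(grpc_server_requests_duration_ms_bucket{app="A"}[2m])) by (le, method, code)
histogram_quantile(0.99, test_metric_app_A)
The first expression is pre‑aggregated into test_metric_app_A, and the second is the query after replacement. When the original query’s grouping keys match the pre‑aggregated metric’s keys, the subtree is replaced directly. When they are a strict subset, for example by (le, method) against keys (le, method, code), the replacement is still safe because summed rates can be summed again, but an extra sum aggregation layer is added on top of the pre‑aggregated metric. If the query groups by a label that the pre‑aggregated metric no longer carries, the replacement cannot be applied.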
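One way to read the subset rule is sketched below. This is an assumption‑laden illustration rather than the platform's query planner: PreAggRule and rewrite_sum are hypothetical names, and plain string equality on the inner expression stands in for matching parsed PromQL subtrees. Equal grouping keys give a direct replacement, a strict subset adds one more sum layer, and anything else is left untouched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreAggRule:
    """Hypothetical record describing one pre-aggregated metric."""
    expr: str          # inner expression, without the sum(...) by (...) wrapper
    by: frozenset      # grouping keys kept in the pre-aggregated metric
    metric: str        # name of the pre-aggregated metric


def rewrite_sum(expr, by, rules):
    """Return replacement PromQL for `sum(expr) by (by)`, or None.

    String equality on `expr` is a stand-in for comparing parsed subtrees.
    """
    for rule in rules:
        if rule.expr != expr:
            continue
        if by == rule.by:
            # Same grouping keys: the pre-aggregated metric is the answer.
            return rule.metric
        if by < rule.by:
            # Strict subset: aggregate the pre-aggregated metric once more.
            keys = ", ".join(sorted(by))
            return f"sum({rule.metric}) by ({keys})"
    # No rule applies; keep the original subtree.
    return None
```

The extra layer is valid for sum because addition is associative; the same shortcut would not hold for aggregations such as avg without carrying extra state.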
Flink‑Based PromQL Pre‑aggregation
Batch pre‑aggregation stresses vmselect memory. Using Flink, the same PromQL expression is parsed into a JSON execution tree, then processed in a streaming job:
Filtering stage: Match leaf‑node metric names and labels against incoming Kafka metric streams.
Partitioning stage: Use the sum node’s by keys as Flink partition keys and discard all other labels to minimise memory; only a UUID is retained to identify each original label set.
Windowed aggregation stage: Apply the same functions (e.g., rate, sum) within Flink’s time windows, mimicking Prometheus semantics where possible.
Sink stage: Write the pre‑aggregated metric back to vminsert via the remote‑write protocol.
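The four stages can be mimicked in plain Python to show the data flow. This sketch does not use the Flink API and is not the production job: preaggregate and its parameters are hypothetical, it uses tumbling windows instead of Prometheus's sliding lookback, and its rate is a simple (last - first) / elapsed per series with no counter‑reset handling or extrapolation. A deterministic UUID stands in for the per‑label‑set identifier mentioned in the partitioning stage.

```python
import uuid
from collections import defaultdict


def preaggregate(samples, metric, matchers, by, window_s, out_metric):
    """Simulate filter -> partition -> windowed sum(rate) -> sink.

    samples: iterable of (labels: dict, timestamp_s, value).
    Returns a list of (labels, timestamp_s, value) for the new metric.
    """
    # window start -> group key -> series id -> [first sample, last sample]
    windows = defaultdict(lambda: defaultdict(dict))
    for labels, ts, value in samples:
        # Filtering stage: metric name and exact-match label matchers.
        if labels.get("__name__") != metric:
            continue
        if any(labels.get(k) != v for k, v in matchers.items()):
            continue
        # Partitioning stage: keep only the `by` keys plus a series UUID.
        group = tuple(labels.get(k, "") for k in by)
        series_id = uuid.uuid5(uuid.NAMESPACE_OID, repr(sorted(labels.items())))
        window = int(ts // window_s) * window_s
        bounds = windows[window][group].setdefault(series_id, [(ts, value), (ts, value)])
        if ts < bounds[0][0]:
            bounds[0] = (ts, value)
        if ts >= bounds[1][0]:
            bounds[1] = (ts, value)
    # Windowed aggregation stage: simplified rate per series, summed per group.
    out = []
    for window in sorted(windows):
        for group, series in sorted(windows[window].items(), key=lambda kv: kv[0]):
            total = 0.0
            for (t0, v0), (t1, v1) in series.values():
                if t1 > t0:
                    total += (v1 - v0) / (t1 - t0)
            out_labels = dict(zip(by, group))
            out_labels["__name__"] = out_metric
            # Sink stage: a real job would remote-write this sample to vminsert.
            out.append((out_labels, window + window_s, total))
    return out
```

Only two samples per series are held per window, which is in the spirit of the small per‑point footprint described in the text; a real streaming job would also need watermarks and late‑data handling.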
Memory footprint per point is reduced to ~20 bytes, and a Flink job with 100 CPU cores and 400 GB of memory can handle 300 M points per 2‑minute window.
Query Optimization Benefits
Automatic query rewriting combined with Flink pre‑aggregation eliminates more than 90% of slow queries (those taking over 20 s), halves overall query‑engine load, and brings p90 latency down to the millisecond range.
Data Visualization
Grafana was upgraded from v6.7.x to v9.2.x, containerised with nginx auth proxy, and configured to use the newer Prometheus label‑values API, achieving ~10× faster variable resolution.
Cloud Monitoring Solution
Cloud‑side metrics are collected via Prometheus agents, remote‑written to the IDC VM cluster, and protected by a vm‑auth component that authenticates tenants and routes traffic. Zone‑aware scheduling unifies cloud data sources, reducing on‑call incidents by >90%.
Future Plans
Extend metric retention beyond 15 days for analysis and capacity planning.
Support finer‑grained scrapes (e.g., 5 s intervals) for more detailed observability.
Enhance self‑monitoring pipelines and SOP integration.
Add write/query bans, whitelist controls, and explore LLM‑driven text2PromQL generation.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
