Operations 31 min read

Scaling Bilibili’s Metrics Platform with VictoriaMetrics and Flink Pre‑aggregation

This article details how Bilibili redesigned its monitoring system to overcome explosive metric growth by separating collection and storage, adopting VictoriaMetrics, implementing zone‑based scheduling, automating PromQL query replacement, and using Flink for efficient pre‑aggregation, resulting in dramatically lower latency and higher stability.

ITPUB
ITPUB
ITPUB
Scaling Bilibili’s Metrics Platform with VictoriaMetrics and Flink Pre‑aggregation

Background

By the end of 2021 Bilibili had deployed a unified monitoring platform based on Prometheus + Thanos, but rapid business growth caused metric data to explode, leading to stability issues, OOM alerts, poor query performance, and fragmented cloud data sources.

Stability problems: Prometheus performed local alert calculations; when large applications changed pod labels, rebuilding the time‑series index consumed massive memory and caused OOM.

Poor user experience: Frequent OOM alerts broke data continuity, and Prometheus + Thanos queries were slow or timed out.

Cloud data quality: Multiple cloud providers, accounts, and regions produced isolated Prometheus instances, making it hard for users to select the correct data source.

2.0 Architecture Design

Design Principles

Separate collection from storage: Decouple collectors from storage so target instances can be dynamically scheduled and collectors can scale elastically.

Separate compute from storage: Allow independent scaling of write, storage, and query resources to avoid waste.

Time‑series database selection: Choose VictoriaMetrics (VM) for its write/read performance, distributed architecture, and operational efficiency.

Unit‑level disaster recovery: Schedule all targets by zone dimension to keep the entire collection‑transfer‑storage‑query chain within a single zone, achieving zone‑level fault isolation.

The overall functional architecture is illustrated below:

Data Source

Metrics are collected from two layers:

PAAS layer – application, middleware, and offline component monitoring.

IAAS layer – self‑hosted servers, cloud VMs, containers, and network devices.

Original pull‑based discovery introduced a 30 s delay and high operational cost. Switching to a push model lets services actively register targets, providing real‑time visibility of collection status.

Data Collection

Scheduling Layer

Two‑level scheduling ensures zone‑aware distribution:

Master (first‑level): Loads all job configs from the database, builds per‑zone configurations in memory, and protects against massive deletions or updates by intercepting jobs affecting >5k targets.

Contractor (second‑level): Periodically fetches zone‑specific configs from Master, assigns them to healthy collectors based on capacity, and maintains state during restarts to avoid data gaps.

Collector

Implemented as a wrapper around vmagent, reporting heartbeats and reloading configs.

Enabled streaming collection ( promscrape.streamParse=true) to cut memory usage by ~20%.

Added graceful shutdown: on exit signal, the collector stops heartbeats but continues one more collection cycle before terminating, preventing metric gaps.

Data Storage

Metrics are stored in vmstorage. Key internal types:

MetricName: Serialises label key‑value pairs; on write, vminsert converts labels to a MetricName record.

MetricId: 8‑byte nanosecond timestamp serving as a unique key for a time‑series cluster.

TSID: Composite key consisting of tenant, metric name ID, job ID, instance ID, and MetricId. It enables lexicographic ordering so that same‑tenant, same‑metric series are stored contiguously.

4 byte account id | 4 byte projectid | metricname(__name__) | 1 | tag1 k | 1 | tag1 v | … | tagn k | 1 | tagn v | 1 | 2
4 byte accountid | 4 byte projectid | 8 byte metricname(__name__) id | 4 byte job id | 4 byte instance id | 8 byte metricid

Abstract index structures:

MetricName -> TSID: 0 | MetricName | TSID</code><code>MetricId -> MetricName: 3 | 4 byte accountid | 4 byte projectid | MetricId | MetricName</code><code>MetricId -> TSID: 2 | 4 byte accountid | 4 byte projectid | MetricId | TSID

Inverted index example:

1 | 4 byte accountid | 4 byte projectid | metricname(__name__) | 1 | MetricId</code><code>1 | 4 byte accountid | 4 byte projectid | tag k | 1 | tag v | 1 | MetricId

Using prefix‑based compression reduces disk usage by ~40% compared with raw Prometheus.

Data Query

Typical heavy query:

histogram_quantile(0.99, sum(rate(grpc_server_requests_duration_ms_bucket{app="$app",env=~"$env"}[2m])) by (le, method))

The execution tree consists of four stages: raw data fetch, rate calculation, sum aggregation, and histogram_quantile. The first stage dominates latency because it must read tens of millions of points.

Optimization idea: pre‑aggregate the sum result into a new metric (e.g., test_metric_app_A) and replace the subtree during query planning.

sum(rate(grpc_server_requests_duration_ms_bucket{app="A"}[2m])) by (le, method, code)
histogram_quantile(0.99, test_metric_app_A)

When the original query’s grouping keys are a subset of the pre‑aggregated metric’s keys, the replacement is safe; otherwise an extra aggregation layer is added.

Flink‑Based PromQL Pre‑aggregation

Batch pre‑aggregation stresses vmselect memory. Using Flink, the same PromQL expression is parsed into a JSON execution tree, then processed in a streaming job:

Filtering stage: Match leaf‑node metric names and labels against incoming Kafka metric streams.

Partitioning stage: Use the sum node’s by keys as Flink partition keys, discarding unnecessary labels to minimise memory (only retain a UUID for label set).

Windowed aggregation stage: Apply the same functions (e.g., rate, sum) within Flink’s time windows, mimicking Prometheus semantics where possible.

Sink stage: Write the pre‑aggregated metric back to vminsert via the remote‑write protocol.

Memory footprint per point is reduced to ~20 bytes, and a 100 c / 400 g Flink job can handle 300 M points per 2‑minute window.

Query Optimization Benefits

Automatic query rewriting combined with Flink pre‑aggregation cuts >90% of >20 s slow queries, halves overall query‑engine load, and brings p90 latency down to the millisecond range.

Data Visualization

Grafana was upgraded from v6.7.x to v9.2.x, containerised with nginx auth proxy, and configured to use the newer Prometheus label‑values API, achieving ~10× faster variable resolution.

Cloud Monitoring Solution

Cloud‑side metrics are collected via Prometheus agents, remote‑written to the IDC VM cluster, and protected by a vm‑auth component that authenticates tenants and routes traffic. Zone‑aware scheduling unifies cloud data sources, reducing on‑call incidents by >90%.

Future Plans

Extend metric retention beyond 15 days for analysis and capacity planning.

Support finer‑grained scrapes (e.g., 5 s intervals) for more detailed observability.

Enhance self‑monitoring pipelines and SOP integration.

Add write/query bans, whitelist controls, and explore LLM‑driven text2PromQL generation.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

monitoringarchitectureFlinkObservabilityPromQLVictoriaMetrics
ITPUB
Written by

ITPUB

Official ITPUB account sharing technical insights, community news, and exciting events.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.