Design and Optimization of Monitoring 2.0 Architecture with VictoriaMetrics and Flink
The new Monitoring 2.0 architecture separates collection, compute, and storage; adopts VictoriaMetrics for compact time-series storage with a zone-based scheduler; introduces push-based ingestion; and uses Flink for real-time pre-aggregation with automatic PromQL rewriting. The result is ten-fold query speedups, sub-300 ms p90 latency, and dramatically higher write and query throughput.
Background – Metrics data is a cornerstone of observability. By the end of 2021, Bilibili had built a unified monitoring platform on Prometheus + Thanos. Rapid business growth caused explosive growth in metric volume, leading to problems with stability, query performance, and cloud-monitoring data quality.
2.0 Architecture Design – The new design follows four principles: (1) separate collection from storage to allow elastic scaling of collectors, (2) separate compute from storage to avoid resource waste, (3) adopt VictoriaMetrics (VM) as the time‑series database, and (4) unit‑level disaster recovery by scheduling targets per zone dimension.
Data Sources – Monitoring covers both PaaS (application, middleware) and IaaS (servers, containers, network). The previous pull‑based discovery caused latency and high operational cost, so a push‑based model was introduced, allowing services to actively push target instances.
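The push model described above can be sketched as a simple registration payload that a service sends to the scheduler. The field names (job, zone, targets) and the helper below are hypothetical illustrations, not Bilibili's actual API:

```python
# Hypothetical sketch of push-based target registration: a service announces
# its scrape targets to the collection scheduler instead of waiting to be
# discovered by a pull-based mechanism. Field names are assumptions.
import json

def build_registration(job: str, zone: str, instances: list) -> str:
    """Serialize a target-registration payload for the collection scheduler."""
    payload = {
        "job": job,
        "zone": zone,
        "targets": [
            {"address": addr, "labels": {"job": job, "zone": zone}}
            for addr in instances
        ],
    }
    return json.dumps(payload)

doc = json.loads(build_registration("api-gateway", "zone-a", ["10.0.0.1:9100"]))
```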
Data Collection – The collection layer consists of a global master scheduler and per-zone contractor schedulers. The master builds the full job configuration, protects against large batch deletions, and reduces full-dispatch time from ~50 s to <10 s. Contractors fetch zone-specific configs, assign them to healthy collectors, and shut down gracefully to avoid metric gaps.
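One way a contractor could map targets onto healthy collectors with minimal reshuffling is rendezvous (highest-random-weight) hashing. The article does not specify the assignment algorithm, so this is an illustrative sketch only:

```python
# Illustrative sketch of a per-zone contractor assigning targets to healthy
# collectors via rendezvous hashing: each (collector, target) pair gets a
# deterministic score, and the highest-scoring collector owns the target.
# Losing a collector only reshuffles that collector's targets.
import hashlib

def assign(target: str, collectors: list) -> str:
    """Pick the collector with the highest hash score for this target."""
    def score(collector: str) -> int:
        digest = hashlib.sha256(f"{collector}|{target}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(collectors, key=score)

collectors = ["collector-1", "collector-2", "collector-3"]
owner = assign("10.0.0.1:9100", collectors)
```

Because scores are computed per (collector, target) pair, removing a collector that does not own a target never changes that target's owner, which keeps reassignment churn small during collector restarts.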
Data Storage – VM storage is handled by vmstorage. Core key types include the MetricName layout:
4 byte account id | 4 byte project id | metric name (__name__) | 1 | tag1 key | 1 | tag1 value | … | tagn key | 1 | tagn value | 1 | 2
MetricId is an 8-byte timestamp. TSID combines the tenant, metric name id, job id, instance id, and MetricId:
4 byte account id | 4 byte project id | 8 byte metric name id | 4 byte job id | 4 byte instance id | 8 byte metric id
Index structures (MetricName → TSID, MetricId → MetricName, etc.) and inverted indexes enable fast binary-search look-ups, reducing disk usage by ~40 % compared with Prometheus.
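The fixed-width TSID layout above can be made concrete with Python's struct module; packing big-endian keeps entries byte-comparable, which is what enables the binary-search look-ups. The field values below are arbitrary examples:

```python
# Sketch of the TSID layout described in the text, packed big-endian:
# 4-byte account id | 4-byte project id | 8-byte metric-name id |
# 4-byte job id | 4-byte instance id | 8-byte metric id = 32 bytes total.
import struct

TSID_FORMAT = ">IIQIIQ"  # big-endian: u32, u32, u64, u32, u32, u64

def pack_tsid(account_id, project_id, metric_name_id,
              job_id, instance_id, metric_id):
    """Encode one TSID as a fixed-width, byte-comparable key."""
    return struct.pack(TSID_FORMAT, account_id, project_id, metric_name_id,
                       job_id, instance_id, metric_id)

tsid = pack_tsid(1, 42, 7_000, 3, 9, 1_650_000_000_000)
```

Because the tenant fields (account id, project id) sit first, all keys for one tenant are contiguous on disk, so a tenant's series can be scanned as a single range.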
Data Query – A typical heavy PromQL query:
histogram_quantile(0.99, sum(rate(grpc_server_requests_duration_ms_bucket{app="$app",env=~"$env"}[2m])) by (le, method))
Execution proceeds in four steps: metric fetch → rate calculation → sum aggregation → histogram_quantile. The first step is the bottleneck, as it fetches millions of raw points. By pre-aggregating the sum stage and storing it as a new metric (e.g., test_metric_app_A), subsequent queries can replace that sub-tree, dramatically reducing the data processed.
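Assuming the pre-aggregated metric test_metric_app_A stores sum(rate(...[2m])) keyed by (le, method, env), the rewritten query might look like the following sketch; the exact rewritten form is an illustration, not taken from the article:

```promql
# Hypothetical rewritten query: the expensive fetch + rate + sum sub-tree is
# replaced by the pre-aggregated series, with an outer sum collapsing env.
histogram_quantile(0.99,
  sum(test_metric_app_A{env=~"$env"}) by (le, method))
```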
PromQL Auto‑Replacement – A mapping from original sub‑trees to pre‑aggregated metrics is built. During query parsing, the engine matches sub‑trees and substitutes them, adding an extra aggregation layer when necessary to preserve grouping keys (e.g., adding env back).
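The substitution step can be sketched as a mapping from a canonical sub-tree to its pre-aggregated replacement. A real engine matches on the parsed AST; the string-based toy below is illustrative only, and the metric name is the hypothetical one from the text:

```python
# Toy sketch of PromQL sub-tree substitution: known sub-expressions are
# replaced by their pre-aggregated equivalents. Real engines canonicalize
# and match on the parsed AST rather than raw strings.
REWRITES = {
    # canonical sub-tree -> pre-aggregated replacement (names hypothetical)
    'sum(rate(grpc_server_requests_duration_ms_bucket{app="A"}[2m])) by (le, method)':
        'sum(test_metric_app_A) by (le, method)',
}

def rewrite(query: str) -> str:
    """Replace any known sub-tree with its pre-aggregated equivalent."""
    for subtree, replacement in REWRITES.items():
        if subtree in query:
            query = query.replace(subtree, replacement)
    return query

q = ('histogram_quantile(0.99, '
     'sum(rate(grpc_server_requests_duration_ms_bucket{app="A"}[2m])) '
     'by (le, method))')
rewritten = rewrite(q)
```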
Flink‑Based Pre‑Aggregation – Instead of periodic batch jobs, Flink streams ingest raw metrics, filter by leaf nodes, partition by the aggregation keys (le, method, code, etc.), and perform windowed aggregations (2 min windows). Memory usage is minimized by discarding unnecessary labels and storing only a compact key ( promqlid + le + method + code ) and value ( uuid + timestamp + value ) – about 20 bytes per point. The resulting aggregated metrics are written back via remote‑write to vminsert .
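The windowed aggregation Flink performs can be sketched by hand: group samples by the compact aggregation key and sum values inside 2-minute tumbling windows. Flink's keyBy/window operators replace all of this in the real pipeline; this is a minimal single-process sketch:

```python
# Hand-rolled sketch of streaming pre-aggregation: bucket raw samples into
# 2-minute tumbling windows keyed by (promql_id, le, method, code) and sum
# values per (key, window). Flink does this with keyBy + windowed reduce.
from collections import defaultdict

WINDOW_MS = 120_000  # 2-minute tumbling windows

def aggregate(samples):
    """samples: iterable of (promql_id, le, method, code, timestamp_ms, value)."""
    windows = defaultdict(float)
    for promql_id, le, method, code, ts, value in samples:
        window_start = ts - ts % WINDOW_MS  # align to window boundary
        windows[(promql_id, le, method, code, window_start)] += value
    return dict(windows)

out = aggregate([
    (1, "100", "GET", "200", 10_000, 3.0),
    (1, "100", "GET", "200", 50_000, 2.0),   # same window as above -> summed
    (1, "100", "GET", "200", 130_000, 1.0),  # falls into the next window
])
```

Keeping only the aggregation-key labels, as the text describes, is what bounds per-point state to roughly 20 bytes regardless of how many labels the raw series carried.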
Query Optimization Benefits – Automatic query rewriting plus Flink pre-aggregation reduces the number of slow queries (those taking >20 s) by 90 %, cuts query-engine load by 50 %, and brings p90 latency down to a few hundred milliseconds.
Data Visualization – Grafana was upgraded from v6.7.x to v9.2.x, containerized with Nginx auth proxy, and configured to use the newer Prometheus API, achieving ~10× faster variable queries.
Overall Gains – After migration to VM: p90 query time >10× faster, 1.7 M+ targets scheduled by zone, collection interval reduced from 60 s to 30 s, OOM alerts down >90 %, write throughput 44 M/s, query throughput 48 k/s, and p90 query latency ~300 ms.
Cloud Monitoring Solution – Cloud‑side metrics are collected via Prometheus agents, remote‑written to the IDC VM cluster, and protected by vm-auth for tenant authentication. All cloud sources are unified under a single zone‑based scheduler, improving data availability and reducing on‑call incidents by >90 %.
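The cloud-to-IDC path described above would look roughly like the following Prometheus agent remote_write fragment; the URL and tenant credentials are placeholders, with vm-auth performing tenant authentication on the receiving side:

```yaml
# Illustrative remote_write config for a cloud-side Prometheus agent;
# endpoint and credentials are placeholders, not Bilibili's actual values.
remote_write:
  - url: https://vm-auth.example.internal/api/v1/write
    basic_auth:
      username: tenant-a
      password: <tenant-token>
```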
Future Plans – Extend metric retention beyond 15 days, support finer-grained (5 s) ingestion, enhance self-monitoring, add write/query ban and whitelist controls, and explore LLM-driven text2promql generation.
Bilibili Tech