Design and Optimization of Monitoring 2.0 Architecture with VictoriaMetrics and Flink
The new Monitoring 2.0 architecture separates collection, compute, and storage; adopts VictoriaMetrics for compact time-series storage with a zone-based scheduler; introduces push-based ingestion; and uses Flink for real-time pre-aggregation with automatic PromQL rewriting. The result is ten-fold query speedups, sub-300 ms p90 latency, and dramatically higher write and query throughput.
Background – Metrics data is a cornerstone of observability. By the end of 2021, Bilibili had built a unified monitoring platform on Prometheus + Thanos. Rapid business growth caused explosive growth in metric volume, leading to problems with stability, query performance, and cloud-monitoring data quality.
2.0 Architecture Design – The new design follows four principles: (1) separate collection from storage to allow elastic scaling of collectors, (2) separate compute from storage to avoid resource waste, (3) adopt VictoriaMetrics (VM) as the time‑series database, and (4) unit‑level disaster recovery by scheduling targets per zone dimension.
Data Sources – Monitoring covers both PaaS (application, middleware) and IaaS (servers, containers, network). The previous pull‑based discovery caused latency and high operational cost, so a push‑based model was introduced, allowing services to actively push target instances.
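The push model described above can be sketched as a simple registration payload that a service sends to the scheduler. The field names (job, zone, targets) and the helper below are hypothetical illustrations, not Bilibili's actual API:

```python
# Hypothetical sketch of push-based target registration: a service announces
# its scrape targets to the collection scheduler instead of waiting to be
# discovered by a pull-based mechanism. Field names are assumptions.
import json

def build_registration(job: str, zone: str, instances: list) -> str:
    """Serialize a target-registration payload for the collection scheduler."""
    payload = {
        "job": job,
        "zone": zone,
        "targets": [
            {"address": addr, "labels": {"job": job, "zone": zone}}
            for addr in instances
        ],
    }
    return json.dumps(payload)

doc = json.loads(build_registration("api-gateway", "zone-a", ["10.0.0.1:9100"]))
```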
Data Collection – The collection layer consists of a global master scheduler and per-zone contractor schedulers. The master builds the full job configuration, protects against large batch deletions, and reduces full-dispatch time from ~50 s to <10 s. Contractors fetch zone-specific configs, assign them to healthy collectors, and shut down gracefully to avoid metric gaps.
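One way a contractor could map targets onto healthy collectors with minimal reshuffling is rendezvous (highest-random-weight) hashing. The article does not specify the assignment algorithm, so this is an illustrative sketch only:

```python
# Illustrative sketch of a per-zone contractor assigning targets to healthy
# collectors via rendezvous hashing: each (collector, target) pair gets a
# deterministic score, and the highest-scoring collector owns the target.
# Losing a collector only reshuffles that collector's targets.
import hashlib

def assign(target: str, collectors: list) -> str:
    """Pick the collector with the highest hash score for this target."""
    def score(collector: str) -> int:
        digest = hashlib.sha256(f"{collector}|{target}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(collectors, key=score)

collectors = ["collector-1", "collector-2", "collector-3"]
owner = assign("10.0.0.1:9100", collectors)
```

Because scores are computed per (collector, target) pair, removing a collector that does not own a target never changes that target's owner, which keeps reassignment churn small during collector restarts.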
Data Storage – VM storage is handled by vmstorage. Core key types include the MetricName layout:
4 byte account id | 4 byte project id | metric name (__name__) | 1 | tag1 key | 1 | tag1 value | … | tagn key | 1 | tagn value | 1 | 2
MetricId is an 8-byte timestamp. TSID combines the tenant, metric name id, job id, instance id, and MetricId:
4 byte account id | 4 byte project id | 8 byte metric name id | 4 byte job id | 4 byte instance id | 8 byte metric id
Index structures (MetricName → TSID, MetricId → MetricName, etc.) and inverted indexes enable fast binary-search look-ups, reducing disk usage by ~40 % compared with Prometheus.
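The fixed-width TSID layout above can be made concrete with Python's struct module; packing big-endian keeps entries byte-comparable, which is what enables the binary-search look-ups. The field values below are arbitrary examples:

```python
# Sketch of the TSID layout described in the text, packed big-endian:
# 4-byte account id | 4-byte project id | 8-byte metric-name id |
# 4-byte job id | 4-byte instance id | 8-byte metric id = 32 bytes total.
import struct

TSID_FORMAT = ">IIQIIQ"  # big-endian: u32, u32, u64, u32, u32, u64

def pack_tsid(account_id, project_id, metric_name_id,
              job_id, instance_id, metric_id):
    """Encode one TSID as a fixed-width, byte-comparable key."""
    return struct.pack(TSID_FORMAT, account_id, project_id, metric_name_id,
                       job_id, instance_id, metric_id)

tsid = pack_tsid(1, 42, 7_000, 3, 9, 1_650_000_000_000)
```

Because the tenant fields (account id, project id) sit first, all keys for one tenant are contiguous on disk, so a tenant's series can be scanned as a single range.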
Data Query – A typical heavy PromQL query:
histogram_quantile(0.99, sum(rate(grpc_server_requests_duration_ms_bucket{app="$app",env=~"$env"}[2m])) by (le, method))
Execution proceeds in four steps: metric fetch → rate calculation → sum aggregation → histogram_quantile. The first step is the bottleneck, as it fetches millions of raw points. By pre-aggregating the sum stage and storing it as a new metric (e.g., test_metric_app_A), subsequent queries can replace that sub-tree, dramatically reducing the data processed.
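Assuming the pre-aggregated metric test_metric_app_A stores sum(rate(...[2m])) keyed by (le, method, env), the rewritten query might look like the following sketch; the exact rewritten form is an illustration, not taken from the article:

```promql
# Hypothetical rewritten query: the expensive fetch + rate + sum sub-tree is
# replaced by the pre-aggregated series, with an outer sum collapsing env.
histogram_quantile(0.99,
  sum(test_metric_app_A{env=~"$env"}) by (le, method))
```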
PromQL Auto‑Replacement – A mapping from original sub‑trees to pre‑aggregated metrics is built. During query parsing, the engine matches sub‑trees and substitutes them, adding an extra aggregation layer when necessary to preserve grouping keys (e.g., adding env back).
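The substitution step can be sketched as a mapping from a canonical sub-tree to its pre-aggregated replacement. A real engine matches on the parsed AST; the string-based toy below is illustrative only, and the metric name is the hypothetical one from the text:

```python
# Toy sketch of PromQL sub-tree substitution: known sub-expressions are
# replaced by their pre-aggregated equivalents. Real engines canonicalize
# and match on the parsed AST rather than raw strings.
REWRITES = {
    # canonical sub-tree -> pre-aggregated replacement (names hypothetical)
    'sum(rate(grpc_server_requests_duration_ms_bucket{app="A"}[2m])) by (le, method)':
        'sum(test_metric_app_A) by (le, method)',
}

def rewrite(query: str) -> str:
    """Replace any known sub-tree with its pre-aggregated equivalent."""
    for subtree, replacement in REWRITES.items():
        if subtree in query:
            query = query.replace(subtree, replacement)
    return query

q = ('histogram_quantile(0.99, '
     'sum(rate(grpc_server_requests_duration_ms_bucket{app="A"}[2m])) '
     'by (le, method))')
rewritten = rewrite(q)
```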
Flink‑Based Pre‑Aggregation – Instead of periodic batch jobs, Flink streams ingest raw metrics, filter by leaf nodes, partition by the aggregation keys (le, method, code, etc.), and perform windowed aggregations (2 min windows). Memory usage is minimized by discarding unnecessary labels and storing only a compact key ( promqlid + le + method + code ) and value ( uuid + timestamp + value ) – about 20 bytes per point. The resulting aggregated metrics are written back via remote‑write to vminsert .
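The windowed aggregation Flink performs can be sketched by hand: group samples by the compact aggregation key and sum values inside 2-minute tumbling windows. Flink's keyBy/window operators replace all of this in the real pipeline; this is a minimal single-process sketch:

```python
# Hand-rolled sketch of streaming pre-aggregation: bucket raw samples into
# 2-minute tumbling windows keyed by (promql_id, le, method, code) and sum
# values per (key, window). Flink does this with keyBy + windowed reduce.
from collections import defaultdict

WINDOW_MS = 120_000  # 2-minute tumbling windows

def aggregate(samples):
    """samples: iterable of (promql_id, le, method, code, timestamp_ms, value)."""
    windows = defaultdict(float)
    for promql_id, le, method, code, ts, value in samples:
        window_start = ts - ts % WINDOW_MS  # align to window boundary
        windows[(promql_id, le, method, code, window_start)] += value
    return dict(windows)

out = aggregate([
    (1, "100", "GET", "200", 10_000, 3.0),
    (1, "100", "GET", "200", 50_000, 2.0),   # same window as above -> summed
    (1, "100", "GET", "200", 130_000, 1.0),  # falls into the next window
])
```

Keeping only the aggregation-key labels, as the text describes, is what bounds per-point state to roughly 20 bytes regardless of how many labels the raw series carried.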
Query Optimization Benefits – Automatic query rewriting plus Flink pre-aggregation reduces the number of slow queries (those taking >20 s) by 90 %, cuts query-engine load by 50 %, and brings p90 latency down to a few hundred milliseconds.
Data Visualization – Grafana was upgraded from v6.7.x to v9.2.x, containerized with Nginx auth proxy, and configured to use the newer Prometheus API, achieving ~10× faster variable queries.
Overall Gains – After migration to VM: p90 query time >10× faster, 1.7 M+ targets scheduled by zone, collection interval reduced from 60 s to 30 s, OOM alerts down >90 %, write throughput 44 M/s, query throughput 48 k/s, and p90 query latency ~300 ms.
Cloud Monitoring Solution – Cloud‑side metrics are collected via Prometheus agents, remote‑written to the IDC VM cluster, and protected by vm-auth for tenant authentication. All cloud sources are unified under a single zone‑based scheduler, improving data availability and reducing on‑call incidents by >90 %.
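The cloud-to-IDC path described above would look roughly like the following Prometheus agent remote_write fragment; the URL and tenant credentials are placeholders, with vm-auth performing tenant authentication on the receiving side:

```yaml
# Illustrative remote_write config for a cloud-side Prometheus agent;
# endpoint and credentials are placeholders, not Bilibili's actual values.
remote_write:
  - url: https://vm-auth.example.internal/api/v1/write
    basic_auth:
      username: tenant-a
      password: <tenant-token>
```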
Future Plans – Extend metric retention beyond 15 days, support finer-grained (5 s) ingestion, enhance self-monitoring, add write/query ban and whitelist controls, and explore LLM-driven text2promql generation.
Bilibili Tech