How Bilibili Scaled Its Monitoring: From Prometheus OOMs to VictoriaMetrics & Flink Pre‑Aggregation

The article details Bilibili's evolution of its monitoring platform, describing the stability and performance challenges of a Prometheus‑Thanos stack, the redesign using VictoriaMetrics, collection‑storage separation, unit‑level disaster recovery, query‑tree auto‑replacement, Flink‑based pre‑aggregation, Grafana upgrades, and future roadmap for observability.

Architect

Sep 12, 2024

Background and Pain Points

By the end of 2021, Bilibili had built a unified monitoring platform on Prometheus + Thanos, but rapid business growth caused an explosion in metric data. The result was stability problems (OOMs, slow queries, poor data quality for cloud monitoring) and missed observability goals, notably the 1‑5‑10 target (detect an incident within 1 minute, locate it within 5, recover within 10).

Design Principles for the 2.0 Architecture

Collect‑Storage Separation – Decouple collectors from storage so targets can be dynamically scheduled and collectors can scale elastically.

Compute‑Storage Separation – Allow independent scaling of write, storage, and query resources to avoid waste when compute demand grows faster than storage.

Time‑Series DB Selection – Adopt VictoriaMetrics (VM) for its high write/query performance, distributed design, and operational efficiency.

Unit‑Level Disaster Recovery – Schedule all targets by zone dimension, ensuring a closed‑loop chain from collection to storage to query within each unit.

Functional Architecture Overview

The new architecture consists of data sources (PaaS and IaaS), push‑based target discovery, a two‑level scheduler (Master and Contractor), collectors built on vmagent, VM storage, and VM query components.

Data Collection Details

Scheduler Layer

Master Scheduler reads all job configs from the database, builds per‑zone configurations in memory, and protects against massive deletions by intercepting updates that affect >5k targets.
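
The deletion guard described above can be sketched as follows. This is a minimal illustration, not Bilibili's code: the Target type, the function names, and the reading of "affect" as "remove" are assumptions; only the 5k threshold comes from the text.

```python
from dataclasses import dataclass

MAX_REMOVED_TARGETS = 5000  # threshold from the text; everything else is illustrative


@dataclass(frozen=True)
class Target:
    zone: str
    job: str
    address: str


def build_zone_configs(targets):
    """Group targets by zone, mirroring the per-zone configs held in memory."""
    zones = {}
    for target in targets:
        zones.setdefault(target.zone, set()).add(target)
    return zones


def apply_update(current, proposed, limit=MAX_REMOVED_TARGETS):
    """Return (targets_to_serve, accepted).

    The update is rejected when it would remove more than `limit` targets,
    in which case the last known-good target set keeps being served.
    """
    removed = set(current) - set(proposed)
    if len(removed) > limit:
        return current, False
    return proposed, True
```

A rejected update would presumably be surfaced to an operator for confirmation; that step is outside this sketch.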

Contractor Scheduler pulls zone‑specific configs from Master, assigns them to healthy collectors based on capacity, and ensures deterministic placement to avoid random re‑scheduling.
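
One way to get deterministic, capacity-aware placement is rendezvous (highest-random-weight) hashing. The sketch below uses that technique as a stand-in; the article does not say which algorithm the Contractor uses, and the function names and capacity model are assumptions.

```python
import hashlib


def _score(target, collector):
    """Stable pseudo-random weight for a (target, collector) pair."""
    digest = hashlib.sha256(f"{collector}/{target}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def assign(targets, capacities):
    """Place each target on a healthy collector.

    capacities: dict mapping healthy collector name -> max number of targets.
    The same inputs always yield the same placement, regardless of input order.
    """
    remaining = dict(capacities)
    placement = {}
    for target in sorted(targets):
        ranked = sorted(remaining, key=lambda c: _score(target, c), reverse=True)
        for collector in ranked:
            if remaining[collector] > 0:
                remaining[collector] -= 1
                placement[target] = collector
                break
        else:
            raise RuntimeError(f"no capacity left for {target}")
    return placement
```

As long as no collector is saturated, removing one collector only moves the targets that were placed on it, which is the property that avoids random re-scheduling.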

Collector

Implemented as a wrapper around vmagent, the collector reports heartbeats, reloads configs via API, and scrapes in stream‑parsing mode (promscrape.streamParse=true), which reduces memory usage by ~20%.

On shutdown, the collector delays exit until the next collection cycle to avoid metric gaps.

Data Storage with vmstorage

VM stores metrics using three core types:

MetricName – Serialized label KV pairs (example layout shown below).

4 byte account id | 4 byte projectid | metricname(__name__) | 1 | tag1 k | 1 | tag1 v | … | tagn k | 1 | tagn v | 1 | 2

MetricId – 8‑byte nanosecond timestamp serving as a unique key for a time series.

8 byte metricid

TSID – Composite key (tenant, metric name ID, job ID, instance ID, MetricId) enabling efficient prefix scans.

4 byte accountid | 4 byte projectid | 8 byte metricname id | 4 byte job id | 4 byte instance id | 8 byte metricid

These structures allow prefix‑based binary search and inverted indexes that reduce disk usage by ~40% compared with raw Prometheus.
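
To make the prefix-scan idea concrete, here is a small sketch that packs the TSID layout above into bytes and finds all keys sharing a prefix with a binary search. It only illustrates the layout; big-endian encoding is an assumption chosen so that byte order matches numeric order, and the function names are hypothetical.

```python
import struct
from bisect import bisect_left

# accountid (4) | projectid (4) | metricname id (8) | job id (4) | instance id (4) | metricid (8)
TSID_FORMAT = ">IIQIIQ"


def pack_tsid(account_id, project_id, metric_name_id, job_id, instance_id, metric_id):
    """Serialize one TSID into a 32-byte, lexicographically sortable key."""
    return struct.pack(
        TSID_FORMAT, account_id, project_id, metric_name_id, job_id, instance_id, metric_id
    )


def prefix_scan(sorted_keys, prefix):
    """Return every key that starts with `prefix`; binary search finds the start."""
    start = bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches
```

For example, packing only the first three fields with `struct.pack(">IIQ", account, project, metric_name_id)` gives a prefix that selects every series of one metric within one tenant.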

Query Layer and PromQL Auto‑Replacement

Complex PromQL queries (e.g., p99 latency) often become bottlenecks because the first execution step fetches massive raw data. The solution builds a mapping from original sub‑trees to pre‑aggregated metrics, automatically replacing the heavy sub‑tree during query planning while preserving semantics.

Example mapping replacement flow (images omitted for brevity) shows how a histogram_quantile query can be rewritten to use a pre‑aggregated metric test_metric_app_A with an additional aggregation layer to retain missing label dimensions.
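
The replacement step can be sketched on a toy expression tree. This is not a real PromQL parser: the Node type, the raw metric name http_duration_bucket, and the label set are hypothetical, and only test_metric_app_A comes from the text. The extra sum ... by (le) layer stands in for the additional aggregation that keeps the rewritten query equivalent when the pre-aggregated metric carries more labels than the original grouping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    op: str            # "call", "agg", "selector", or "number"
    name: str
    args: tuple = ()
    by: tuple = ()


def rewrite(node, mapping):
    """Replace any sub-tree found in `mapping`; otherwise recurse into children."""
    if node in mapping:
        return mapping[node]
    return Node(node.op, node.name, tuple(rewrite(a, mapping) for a in node.args), node.by)


def render(node):
    """Render the toy tree back to a PromQL-like string."""
    if node.op in ("selector", "number"):
        return node.name
    inner = ", ".join(render(a) for a in node.args)
    suffix = f" by ({', '.join(node.by)})" if node.by else ""
    return f"{node.name}({inner}){suffix}"


raw = Node("selector", 'http_duration_bucket{app="A"}[2m]')
heavy = Node("agg", "sum", (Node("call", "rate", (raw,)),), by=("le",))
query = Node("call", "histogram_quantile", (Node("number", "0.99"), heavy))

pre_aggregated = Node("selector", "test_metric_app_A")
mapping = {heavy: Node("agg", "sum", (pre_aggregated,), by=("le",))}

print(render(rewrite(query, mapping)))
```

Because nodes are frozen dataclasses, whole sub-trees hash and compare by value, so the mapping lookup is a plain dictionary access.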

Flink‑Based Pre‑Aggregation

Instead of running periodic batch jobs, a Flink streaming job ingests raw Prometheus metrics, applies the same execution‑tree parsing, and performs windowed aggregation for the sum(rate(...)) pattern. By discarding unnecessary labels and keeping only a compact key (promqlid + le + method + code) and value (uuid + timestamp + value), memory per point drops to ~20 bytes.

A typical Flink job (100 CPU cores, 400 GB RAM) can handle 300 M metric points per 2‑minute window, with more than 10× lower resource consumption than batch pre‑aggregation.
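
The windowed sum(rate(...)) aggregation can be illustrated without Flink. The sketch below is plain Python, not the production job: the rate is simplified to (last - first) / elapsed per series within a window, counter resets and extrapolation are ignored, and the function and parameter names are assumptions.

```python
from collections import defaultdict

WINDOW_SECONDS = 120  # the 2-minute window from the text


def window_sum_rate(points, promql_id, keep=("le", "method", "code")):
    """Aggregate raw counter samples into sum(rate(...)) per window and compact key.

    points: iterable of (series_uuid, labels_dict, ts_seconds, counter_value).
    Labels not listed in `keep` are discarded.
    """
    first_last = {}
    for uuid, labels, ts, value in points:
        key = (promql_id,) + tuple(labels.get(k, "") for k in keep)
        slot = (int(ts // WINDOW_SECONDS), key, uuid)
        sample = (ts, value)
        if slot not in first_last:
            first_last[slot] = (sample, sample)
        else:
            first, last = first_last[slot]
            first_last[slot] = (min(first, sample), max(last, sample))

    totals = defaultdict(float)
    for (window, key, _uuid), ((t0, v0), (t1, v1)) in first_last.items():
        if t1 > t0:  # a single sample cannot produce a rate
            totals[(window, key)] += (v1 - v0) / (t1 - t0)
    return dict(totals)
```

As a rough size check, 300 M points at ~20 bytes each come to about 6 GB of point payload per window, before any framework overhead.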

Query Optimization Benefits

Automatic query rewriting reduces >20‑second slow queries to sub‑second latency.

Overall query‑engine resource usage drops ~50%.

p90 latency improves >10×, with many queries now completing in ~300 ms.

Grafana Upgrade and Visualization

Upgrading Grafana from 6.7.x to 9.2.x, consolidating plugins, containerizing the deployment, and setting the Prometheus data source version to >2.37.x (which uses the label_values API) yields a 10× query‑performance boost.

Overall Gains

p90 query time reduced >10× after moving to VM.

1.7 million+ targets scheduled by zone, achieving zone‑level disaster tolerance.

Collection interval halved (60 s → 30 s) without extra hardware.

OOM alerts and metric gaps cut >90%.

Write throughput 44 M points/s, query throughput 48 k queries/s; post‑optimization p90 query <300 ms.

Cloud Monitoring Solution

To unify cloud and IDC monitoring, cloud‑side Prometheus instances remote‑write to the central VM cluster via a vm‑auth gateway that provides tenant authentication and traffic scheduling. This consolidates >20 cloud data sources into a single source and improves data availability.
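
For orientation, the helper below builds the tenant-scoped remote-write URL that a VictoriaMetrics cluster's vminsert accepts, which is where a gateway would ultimately route each cloud source's writes. It is a sketch under assumptions: the host name is hypothetical, and how Bilibili's gateway maps credentials to tenants is not described in the article.

```python
def remote_write_url(vminsert_addr, account_id, project_id=0):
    """Build a vminsert remote-write URL for one tenant.

    Tenant ids are 4-byte unsigned integers, matching the accountid and
    projectid fields in the storage key layout.
    """
    for value in (account_id, project_id):
        if not 0 <= value < 2**32:
            raise ValueError("tenant ids must fit in 4 bytes")
    tenant = f"{account_id}:{project_id}" if project_id else str(account_id)
    return f"http://{vminsert_addr}/insert/{tenant}/prometheus/api/v1/write"


# Hypothetical address; each cloud-side Prometheus would get its own tenant.
print(remote_write_url("vminsert.example:8480", 12))
```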

Future Roadmap

Extend metric retention beyond the default 15 days for long‑term analysis.

Support finer‑grained scrape intervals (5 s) for latency‑critical services.

Enhance self‑monitoring pipelines and SOP integration.

Add write/query throttling, whitelist controls, and explore LLM‑driven text2promql generation.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
