Didi's Time Series Storage Evolution: From InfluxDB to VictoriaMetrics
Facing exponential growth of time‑series data from 2017 to 2023, Didi moved from InfluxDB to RRDtool, then added an in‑memory cache layer, and finally adopted VictoriaMetrics. Its low cost on commodity hardware, high write throughput, strong compression, and easy horizontal scaling solved the earlier storage‑cost, OOM, and scalability problems.
Didi's time‑series curve volume grew dozens of times from 2017 to 2023. To cope with this explosive growth, the team continuously adjusted and upgraded the storage stack, moving from InfluxDB to RRDtool, then to an in‑memory TSDB, and finally to VictoriaMetrics. The following is a chronological overview of the decisions and trade‑offs.
2017 InfluxDB Era
InfluxDB was the team's first choice of time‑series database. As the number of curves grew, its limitations became evident: community issue trackers recorded more than 400 OOM (out‑of‑memory) reports, caused by hotspot writes and by queries that tried to read millions of series from a single instance.
InfluxDB OOM – impact on the charting service
2017‑2018 Open‑Falcon Era
InfluxDB's single‑node performance was limited and its clustering was not open. Even after sharding by business line, a single node still faced massive series volumes, leading to OOM. The team switched to the RRDtool storage used by Open‑Falcon, applying a consistent‑hash algorithm to disperse series across multiple instances, thus eliminating the hotspot‑induced OOM.
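The sharding step above can be sketched as consistent hashing over a ring of virtual nodes. This is a minimal illustration, not Didi's actual implementation; the instance names and replica count are assumptions.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps each series key to a storage instance on a hash ring.

    Virtual nodes (replicas) smooth out the distribution so that
    no single instance becomes a write hotspot.
    """

    def __init__(self, instances, replicas=100):
        self._ring = []  # sorted list of (hash, instance)
        for inst in instances:
            for i in range(replicas):
                self._ring.append((self._hash(f"{inst}#{i}"), inst))
        self._ring.sort()

    @staticmethod
    def _hash(key):
        # Stable 64-bit hash derived from MD5 (choice is illustrative).
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def locate(self, series_key):
        """Return the instance owning this series."""
        h = self._hash(series_key)
        idx = bisect.bisect(self._ring, (h,)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["tsdb-01", "tsdb-02", "tsdb-03"])
owner = ring.locate("cpu.idle{host=web-42}")
```

A useful property of this scheme is that adding or removing an instance only remaps the series that hashed to the affected ring segments, rather than reshuffling everything.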
2018‑2020 Post Open‑Falcon Era
By April 2018, RRDtool was still in production, but cost pressure had risen sharply: the architecture required high‑end machines with NVMe disks, and memory usage often exceeded 90 %. The team needed a solution that could absorb the growing data volume without such expensive hardware.
Analysis showed that over 80 % of queries targeted the latest two hours of data. Inspired by Facebook’s Gorilla paper, a new Cacheserver service was built to handle hot data, offloading the RRDtool cluster.
Cacheserver architecture (hot‑data layer)
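The hot/cold split can be sketched as a query router: reads that fall entirely inside the hot window go to the in‑memory layer, everything else to the RRDtool cluster. The class and method names here are illustrative assumptions, as is the exact two‑hour boundary check.

```python
import time

HOT_WINDOW_SECONDS = 2 * 3600  # ~80% of queries hit the last two hours

class QueryRouter:
    def __init__(self, cache_backend, rrd_backend, now=time.time):
        self.cache = cache_backend  # in-memory hot-data layer (Cacheserver)
        self.rrd = rrd_backend      # full-retention RRDtool cluster
        self.now = now

    def query(self, series, start, end):
        hot_boundary = self.now() - HOT_WINDOW_SECONDS
        if start >= hot_boundary:
            # Entire range is hot: serve from memory, offloading RRDtool.
            return self.cache.read(series, start, end)
        # Cold or mixed range: fall back to the durable store.
        return self.rrd.read(series, start, end)
```

With most traffic satisfied by the first branch, the RRDtool cluster only sees the minority of queries that reach back further than the hot window.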
2020‑Present VictoriaMetrics Era
With the rise of containers, the volume of time‑series data exploded again, and the cost of the RRDtool architecture became unsustainable. The team evaluated several alternatives:
Alternative Solutions
Druid – dynamic schema not supported, 10× cost increase, write performance worse than RRDtool.
Prometheus – short‑term storage only, no built‑in clustering, requires Thanos/Cortex for long‑term storage.
Thanos / Cortex – provide global view and long‑term storage but still experimental, need object storage, high memory overhead.
Uber M3 – high management cost (etcd) and requires high‑end SSDs.
VictoriaMetrics – low resource consumption, LSM‑based columnar storage, high write throughput (760 K points/s per instance), excellent compression (1.2‑1.5 bytes/point ≈ 13 % of raw), easy horizontal scaling, no down‑sampling, Prometheus‑compatible API with enhanced MetricsQL.
VictoriaMetrics Advantages
Runs on commodity hardware; no SSD/NVMe required.
LSM core model (similar to ClickHouse) with in‑memory buffering and background compaction.
Columnar storage enables ~30 M points/s per CPU core.
Write speed of 760 K points/s per instance.
Compression using an improved Gorilla algorithm + ZSTD (≈1.2‑1.5 bytes/point).
Share‑nothing clustering – easy scaling, automatic rerouting on node failure.
Retains raw data (no automatic down‑sampling).
Full Prometheus compatibility; extended query language (MetricsQL).
One trade‑off: weak support for out‑of‑order timestamps.
Predictable capacity planning.
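The Gorilla‑style timestamp compression behind the 1.2‑1.5 bytes/point figure exploits the fact that scrape intervals are nearly constant, so the delta‑of‑delta of consecutive timestamps is almost always zero. A minimal sketch of the idea follows (integer deltas only; the real encoder packs variable‑width bit fields and XOR‑compresses the values as well):

```python
def encode_timestamps(timestamps):
    """Delta-of-delta encoding: store the first timestamp, the first
    delta, then only the change in delta for each later point."""
    if not timestamps:
        return []
    out = [timestamps[0]]
    prev, prev_delta = timestamps[0], 0
    for t in timestamps[1:]:
        delta = t - prev
        out.append(delta - prev_delta)  # usually 0 for regular scrapes
        prev, prev_delta = t, delta
    return out

def decode_timestamps(encoded):
    """Inverse of encode_timestamps."""
    if not encoded:
        return []
    ts = [encoded[0]]
    prev_delta = 0
    for dod in encoded[1:]:
        prev_delta += dod
        ts.append(ts[-1] + prev_delta)
    return ts

# Regular 10 s scrape interval with one 1 s jitter at the end:
ts = [1000, 1010, 1020, 1030, 1041]
enc = encode_timestamps(ts)  # [1000, 10, 0, 0, 1]
```

The long runs of zeros are what the bit‑level encoder then squeezes down to a fraction of a byte per point.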
Issues and Mitigations
Disk usage proportional to points – set different retention periods per business line to limit storage.
No down‑sampling – high query speed and strong compression mitigate storage cost.
Limited community activity – deep code review and direct communication with the authors built confidence in stability.
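Because disk usage is roughly linear in ingested points, per‑business‑line retention can be budgeted with simple arithmetic. The sketch below just plugs the article's quoted numbers into an illustrative formula; the 1.5 bytes/point default is the upper end of the compression figure above, not a measured value.

```python
def disk_bytes(points_per_second, retention_days, bytes_per_point=1.5):
    """Rough disk footprint: ingest rate x retention x on-disk point size."""
    return points_per_second * retention_days * 86_400 * bytes_per_point

# One instance at the quoted 760K points/s with 30-day retention:
gib = disk_bytes(760_000, 30) / 2**30  # roughly 2.7 TiB
```

This kind of back‑of‑the‑envelope calculation is what makes capacity planning predictable: to cap a business line's footprint, shorten its retention period proportionally.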
Multi‑Cluster Design
To improve scalability and availability, multiple VictoriaMetrics clusters are deployed per region, isolating resource competition between business lines. A “mixer” component can read and merge data across regions.
VictoriaMetrics multi‑cluster architecture
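The mixer's cross‑region read can be sketched as a merge that unions per‑cluster results for one series and deduplicates samples by timestamp. The function name and tuple layout are illustrative assumptions, not the component's real API.

```python
import heapq

def mixer_merge(*cluster_results):
    """Merge time-sorted (timestamp, value) samples for one series
    from several regional clusters, keeping one sample per timestamp."""
    merged = []
    for ts, value in heapq.merge(*cluster_results):
        if merged and merged[-1][0] == ts:
            continue  # duplicate timestamp from an overlapping cluster
        merged.append((ts, value))
    return merged

region_a = [(100, 1.0), (110, 1.2)]
region_b = [(110, 1.2), (120, 1.4)]
series = mixer_merge(region_a, region_b)
```

Since each cluster returns its samples in time order, a k‑way streaming merge keeps memory bounded even when many regions respond.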
Conclusion
VictoriaMetrics was selected as the most suitable time‑series storage solution for Didi’s observability platform, meeting performance, cost, and scalability requirements. Future articles will cover containerized deployment, fault management, replication, and data migration.
Didi Tech
Official Didi technology account