Didi's Time Series Storage Evolution: From InfluxDB to VictoriaMetrics
Facing exponential growth of time‑series data from 2017 to 2023, Didi moved from InfluxDB to RRDtool, then added an in‑memory cache layer, and finally adopted VictoriaMetrics. Its low cost on commodity hardware, high write throughput, strong compression, and easy horizontal scaling solved the earlier storage‑cost, OOM, and scalability problems.
Didi's time‑series curve volume grew dozens of times from 2017 to 2023. To cope with this explosive growth, the team continuously adjusted and upgraded the storage stack, moving from InfluxDB to RRDtool, then to an in‑memory TSDB, and finally to VictoriaMetrics. The following is a chronological overview of the decisions and trade‑offs.
2017 InfluxDB Era
InfluxDB was the team's first choice of time‑series database. As the number of curves grew, its limitations became evident: community issue trackers recorded more than 400 OOM (out‑of‑memory) reports, caused by hotspot writes and by queries that tried to read millions of series from a single instance.
InfluxDB OOM – impact on the charting service
2017‑2018 Open‑Falcon Era
InfluxDB's single‑node performance was limited and its clustering was not open. Even after sharding by business line, a single node still faced massive series volumes, leading to OOM. The team switched to the RRDtool storage used by Open‑Falcon, applying a consistent‑hash algorithm to disperse series across multiple instances, thus eliminating the hotspot‑induced OOM.
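The sharding step above can be sketched as consistent hashing over a ring of virtual nodes. This is a minimal illustration, not Didi's actual implementation; the instance names and replica count are assumptions.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps each series key to a storage instance on a hash ring.

    Virtual nodes (replicas) smooth out the distribution so that
    no single instance becomes a write hotspot.
    """

    def __init__(self, instances, replicas=100):
        self._ring = []  # sorted list of (hash, instance)
        for inst in instances:
            for i in range(replicas):
                self._ring.append((self._hash(f"{inst}#{i}"), inst))
        self._ring.sort()

    @staticmethod
    def _hash(key):
        # Stable 64-bit hash derived from MD5 (choice is illustrative).
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def locate(self, series_key):
        """Return the instance owning this series."""
        h = self._hash(series_key)
        idx = bisect.bisect(self._ring, (h,)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["tsdb-01", "tsdb-02", "tsdb-03"])
owner = ring.locate("cpu.idle{host=web-42}")
```

A useful property of this scheme is that adding or removing an instance only remaps the series that hashed to the affected ring segments, rather than reshuffling everything.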
2018‑2020 Post Open‑Falcon Era
By April 2018, RRDtool was still in production, but cost pressure had risen sharply: the architecture required high‑end machines with NVMe disks, and memory usage often exceeded 90 %. The team needed a solution that could absorb the growing data volume without such expensive hardware.
Analysis showed that over 80 % of queries targeted the latest two hours of data. Inspired by Facebook’s Gorilla paper, a new Cacheserver service was built to handle hot data, offloading the RRDtool cluster.
Cacheserver architecture (hot‑data layer)
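The hot/cold split can be sketched as a query router: reads that fall entirely inside the hot window go to the in‑memory layer, everything else to the RRDtool cluster. The class and method names here are illustrative assumptions, as is the exact two‑hour boundary check.

```python
import time

HOT_WINDOW_SECONDS = 2 * 3600  # ~80% of queries hit the last two hours

class QueryRouter:
    def __init__(self, cache_backend, rrd_backend, now=time.time):
        self.cache = cache_backend  # in-memory hot-data layer (Cacheserver)
        self.rrd = rrd_backend      # full-retention RRDtool cluster
        self.now = now

    def query(self, series, start, end):
        hot_boundary = self.now() - HOT_WINDOW_SECONDS
        if start >= hot_boundary:
            # Entire range is hot: serve from memory, offloading RRDtool.
            return self.cache.read(series, start, end)
        # Cold or mixed range: fall back to the durable store.
        return self.rrd.read(series, start, end)
```

With most traffic satisfied by the first branch, the RRDtool cluster only sees the minority of queries that reach back further than the hot window.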
2020‑Present VictoriaMetrics Era
With the rise of containers, the volume of time‑series data exploded again, and the cost of the RRDtool architecture became unsustainable. The team evaluated several alternatives:
Alternative Solutions
Druid – dynamic schema not supported, 10× cost increase, write performance worse than RRDtool.
Prometheus – short‑term storage only, no built‑in clustering, requires Thanos/Cortex for long‑term storage.
Thanos / Cortex – provide global view and long‑term storage but still experimental, need object storage, high memory overhead.
Uber M3 – high management cost (etcd) and requires high‑end SSDs.
VictoriaMetrics – low resource consumption, LSM‑based columnar storage, high write throughput (760 K points/s per instance), excellent compression (1.2‑1.5 bytes/point ≈ 13 % of raw), easy horizontal scaling, no down‑sampling, Prometheus‑compatible API with enhanced MetricsQL.
VictoriaMetrics Advantages
Runs on commodity hardware; no SSD/NVMe required.
LSM core model (similar to ClickHouse) with in‑memory buffering and background compaction.
Columnar storage enables ~30 M points/s per CPU core.
Write speed of 760 K points/s per instance.
Compression using an improved Gorilla algorithm + ZSTD (≈1.2‑1.5 bytes/point).
Share‑nothing clustering – easy scaling, automatic rerouting on node failure.
Retains raw data (no automatic down‑sampling).
Full Prometheus compatibility; extended query language (MetricsQL).
One trade‑off: weak support for out‑of‑order timestamps.
Predictable capacity planning.
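The Gorilla‑style timestamp compression behind the 1.2‑1.5 bytes/point figure exploits the fact that scrape intervals are nearly constant, so the delta‑of‑delta of consecutive timestamps is almost always zero. A minimal sketch of the idea follows (integer deltas only; the real encoder packs variable‑width bit fields and XOR‑compresses the values as well):

```python
def encode_timestamps(timestamps):
    """Delta-of-delta encoding: store the first timestamp, the first
    delta, then only the change in delta for each later point."""
    if not timestamps:
        return []
    out = [timestamps[0]]
    prev, prev_delta = timestamps[0], 0
    for t in timestamps[1:]:
        delta = t - prev
        out.append(delta - prev_delta)  # usually 0 for regular scrapes
        prev, prev_delta = t, delta
    return out

def decode_timestamps(encoded):
    """Inverse of encode_timestamps."""
    if not encoded:
        return []
    ts = [encoded[0]]
    prev_delta = 0
    for dod in encoded[1:]:
        prev_delta += dod
        ts.append(ts[-1] + prev_delta)
    return ts

# Regular 10 s scrape interval with one 1 s jitter at the end:
ts = [1000, 1010, 1020, 1030, 1041]
enc = encode_timestamps(ts)  # [1000, 10, 0, 0, 1]
```

The long runs of zeros are what the bit‑level encoder then squeezes down to a fraction of a byte per point.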
Issues and Mitigations
Disk usage proportional to points – set different retention periods per business line to limit storage.
No down‑sampling – high query speed and strong compression mitigate storage cost.
Limited community activity – deep code review and direct communication with the authors built confidence in stability.
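Because disk usage is roughly linear in ingested points, per‑business‑line retention can be budgeted with simple arithmetic. The sketch below just plugs the article's quoted numbers into an illustrative formula; the 1.5 bytes/point default is the upper end of the compression figure above, not a measured value.

```python
def disk_bytes(points_per_second, retention_days, bytes_per_point=1.5):
    """Rough disk footprint: ingest rate x retention x on-disk point size."""
    return points_per_second * retention_days * 86_400 * bytes_per_point

# One instance at the quoted 760K points/s with 30-day retention:
gib = disk_bytes(760_000, 30) / 2**30  # roughly 2.7 TiB
```

This kind of back‑of‑the‑envelope calculation is what makes capacity planning predictable: to cap a business line's footprint, shorten its retention period proportionally.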
Multi‑Cluster Design
To improve scalability and availability, multiple VictoriaMetrics clusters are deployed per region, isolating resource competition between business lines. A “mixer” component can read and merge data across regions.
VictoriaMetrics multi‑cluster architecture
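The mixer's cross‑region read can be sketched as a merge that unions per‑cluster results for one series and deduplicates samples by timestamp. The function name and tuple layout are illustrative assumptions, not the component's real API.

```python
import heapq

def mixer_merge(*cluster_results):
    """Merge time-sorted (timestamp, value) samples for one series
    from several regional clusters, keeping one sample per timestamp."""
    merged = []
    for ts, value in heapq.merge(*cluster_results):
        if merged and merged[-1][0] == ts:
            continue  # duplicate timestamp from an overlapping cluster
        merged.append((ts, value))
    return merged

region_a = [(100, 1.0), (110, 1.2)]
region_b = [(110, 1.2), (120, 1.4)]
series = mixer_merge(region_a, region_b)
```

Since each cluster returns its samples in time order, a k‑way streaming merge keeps memory bounded even when many regions respond.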
Conclusion
VictoriaMetrics was selected as the most suitable time‑series storage solution for Didi’s observability platform, meeting performance, cost, and scalability requirements. Future articles will cover containerized deployment, fault management, replication, and data migration.
Didi Tech
Official Didi technology account