How to Build a High‑Performance InfluxDB Cluster for Massive Time‑Series Data
This article explores InfluxDB’s time‑series strengths, compares TSDB with traditional databases, explains its TSM storage engine and shard concepts, and details the design, architecture, performance benchmarks, integration steps, and future enhancements of a high‑availability InfluxDB‑HA solution used at 360.
Basic Concepts
TSDB vs Traditional DB
Traditional databases store only the current state of each value.
Time‑series databases record how a value changes over time, keeping every timestamped sample.
TSDB Application Scenarios
Time‑series data that requires historical trends, periodic patterns, anomaly detection, and future prediction, such as device monitoring, medical vitals, and financial transaction logs.
Why Choose InfluxDB
Active community and proven performance.
SQL‑like query language reduces learning cost.
Native HTTP API supports multiple languages.
Pluggable storage solution.
InfluxDB TSM Storage Engine Overview
Components of the TSM engine:
Cache – an in‑memory map of recently written points (memory limit defaults to 1 GB).
WAL – write‑ahead log that persists incoming writes for crash recovery.
TSM files – the on‑disk data files.
Compactor – snapshots the cache into TSM files and merges small TSM files into larger ones.
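The cache → WAL → TSM flow above can be sketched as a toy model (illustrative only, not the real engine; the point-count threshold stands in for the real byte-based cache limit):

```python
# Toy sketch of the TSM write path: every write is appended to the WAL for
# durability and added to the in-memory cache; once the cache exceeds its
# limit, the compactor snapshots it into an immutable "TSM" segment.
class TinyTSM:
    def __init__(self, cache_limit=3):
        self.wal = []          # write-ahead log entries
        self.cache = {}        # series key -> list of (timestamp, value)
        self.tsm_files = []    # snapshotted, immutable segments
        self.cache_limit = cache_limit  # points before snapshot (real limit is bytes)

    def write(self, series, ts, value):
        self.wal.append((series, ts, value))            # durability first
        self.cache.setdefault(series, []).append((ts, value))
        if sum(len(v) for v in self.cache.values()) >= self.cache_limit:
            self._snapshot()

    def _snapshot(self):
        # compactor: freeze the cache into a sorted, immutable segment
        segment = {k: sorted(v) for k, v in self.cache.items()}
        self.tsm_files.append(segment)
        self.cache.clear()
        self.wal.clear()       # these writes now live in the segment

db = TinyTSM()
for i in range(3):
    db.write("cpu,host=h1", i, float(i))
print(len(db.tsm_files))  # 1 segment after the cache limit is reached
```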
Shard – Concept Above TSM Engine
Shards are created for different timestamp ranges, enabling fast time‑based queries and efficient batch deletions.
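Shard selection by timestamp can be sketched as simple time bucketing (the 7‑day shard duration assumed here is InfluxDB's default for an infinite retention policy): a point's timestamp alone determines its shard, so time‑bounded queries and bulk deletions touch only the relevant shards.

```python
from datetime import datetime, timedelta, timezone

# Bucket timestamps into fixed-width shard groups (assumed 7-day duration).
SHARD_DURATION = timedelta(days=7)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def shard_id(ts: datetime) -> int:
    """Return the shard-group index a point with timestamp ts falls into."""
    return int((ts - EPOCH) / SHARD_DURATION)

a = shard_id(datetime(2023, 1, 1, tzinfo=timezone.utc))
b = shard_id(datetime(2023, 1, 9, tzinfo=timezone.utc))
print(a != b)  # points more than a shard-width apart land in different shards: True
```

Dropping an expired week of data then means deleting one shard's files rather than scanning rows, which is why time-based retention is cheap.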
Project Origin
InfluxDB community edition lacks clustering.
Official influxdb‑relay only supports dual‑write, no load balancing.
Eleme’s influx‑proxy solution is complex to deploy and maintain.
360 needed real‑time monitoring for roughly 100,000 hosts, each reporting about 200 metrics.
Thus the InfluxDB‑HA project was created.
Architecture
Official InfluxDB‑Relay Solution
Unresolved issues:
Dual‑write only produces a backup copy; it does not improve read or write throughput.
The relay does not proxy queries, so clients must be pointed at InfluxDB nodes directly, adding configuration complexity.
No retry mechanism for failed writes.
Eleme InfluxDB High‑Availability Solution
Advantages:
Influx‑proxy rebuilt to meet performance and maintenance needs.
Dynamic scaling of InfluxDB nodes.
Robust retry mechanism for failed requests.
Disadvantages:
Many components increase learning and maintenance cost.
Retry can add load when machines are at capacity.
Not aligned with simple monitoring storage needs.
360 Internal InfluxDB‑HA Solution
Advantages:
Uses measurement as the smallest split unit, ensuring efficient time‑series queries.
Supports dynamic sharding and table splitting at the business layer.
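With measurement as the smallest split unit, a proxy can deterministically route each measurement to one backend group, so all points of a series stay together and time‑series queries never fan out. A hypothetical sketch (backend names and the MD5 choice are illustrative, not the documented 360 implementation):

```python
import hashlib

# Route a measurement name to a fixed backend group by hashing it.
BACKENDS = ["influx-group-0", "influx-group-1", "influx-group-2"]

def route(measurement: str) -> str:
    """Pick a backend group for a measurement; the same name always maps
    to the same group, keeping each series on a single group."""
    digest = hashlib.md5(measurement.encode()).hexdigest()
    return BACKENDS[int(digest, 16) % len(BACKENDS)]

print(route("cpu_usage") == route("cpu_usage"))  # deterministic routing: True
```

Adding a backend group changes the modulus, so dynamic scaling in practice needs a remapping step (e.g. consistent hashing) rather than this naive modulo.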
Performance Comparison
Disk I/O comparison with a single‑node InfluxDB.
CPU usage comparison with a single‑node InfluxDB.
Business Integration Guide
InfluxDB‑HA manages InfluxDB instance configurations.
Grafana integration instructions.
Third‑party programs write data through the standard /write HTTP API, so any language with an HTTP client can integrate without a dedicated SDK.
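A write through the /write endpoint is just a POST of line protocol. The formatting below follows InfluxDB's documented line-protocol syntax (`measurement,tag=val field=val timestamp`); the host and database names are placeholders:

```python
# Build an InfluxDB line-protocol string for one point.
def to_line(measurement, tags, fields, ts_ns):
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {ts_ns}"

line = to_line("cpu", {"host": "server01"}, {"value": 0.64}, 1434055562000000000)
print(line)  # cpu,host=server01 value=0.64 1434055562000000000

# POSTing it (commented out so the sketch runs without a server):
# import urllib.request
# req = urllib.request.Request("http://influxdb-ha:8086/write?db=metrics",
#                              data=line.encode(), method="POST")
# urllib.request.urlopen(req)
```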
Future Iteration Plan
Integrate Kafka or RabbitMQ as a buffer before writes to reduce data loss.
Hot‑load configuration files (currently implemented with Go's fsnotify; etcd is planned for centralized configuration).
Support business‑side partitioning to handle larger data scales while keeping measurement as the minimal split unit.
InfluxDB Usage Tips
Continuous Queries/SELECT: for queries scanning more than ~100 k points, filter on tags (tags are indexed, fields are not); this greatly reduces memory consumption and avoids OOM.
Prefer tags in queries.
Test continuous queries in a simulated production environment before deployment.
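A typical continuous query downsamples raw points into aggregates, grouping by an indexed tag so it stays cheap. The statement below uses InfluxQL's documented CQ syntax; the database, measurement, and interval are example choices:

```python
# Continuous query that rolls raw "cpu" points into 5-minute means per host.
cq = (
    'CREATE CONTINUOUS QUERY "cq_cpu_5m" ON "metrics" BEGIN '
    'SELECT mean("value") INTO "cpu_5m" FROM "cpu" '
    'GROUP BY time(5m), "host" END'
)
print(cq)
```

Running the same SELECT manually against a production-sized dataset first, as the tip above suggests, shows whether the interval and grouping are affordable before the CQ runs on a schedule.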
Retention Policy: setting a retention policy keeps only the data you need, but applying or altering one on large volumes can increase CPU usage; make RP changes during low read/write periods and test on a slave instance first.
Operate RP during low concurrency.
Iterate on RP settings on a slave before production rollout.
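The RP operations in question are plain InfluxQL statements; the 30‑day and 60‑day durations and the database name below are illustrative:

```python
# Create a default 30-day retention policy, then widen it to 60 days.
create_rp = ('CREATE RETENTION POLICY "thirty_days" ON "metrics" '
             'DURATION 30d REPLICATION 1 DEFAULT')
alter_rp = 'ALTER RETENTION POLICY "thirty_days" ON "metrics" DURATION 60d'
print(create_rp)
```

Issuing these on a slave first, as recommended above, surfaces the expiry-driven shard deletions and CPU cost before they hit production.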
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.