Design and Implementation of a High‑Availability InfluxDB Cluster at 360
This article introduces the fundamentals of time‑series databases, explains why InfluxDB was chosen, describes the TSM storage engine and shard concepts, outlines the internal 360 InfluxDB‑HA architecture, compares its performance with a single node, and provides integration and future‑development guidelines.
Introduction
InfluxDB is widely used for storing and querying time‑series data such as resource‑monitoring metrics, thanks to its built‑in aggregation functions (average, standard deviation, random sampling, etc.). Drawing on the massive volume of time‑series data generated inside 360, the author designed a high‑performance, highly available InfluxDB cluster.
Basic Concepts
TSDB vs. Traditional DB: a traditional database typically stores only the current value of a record, while a time‑series database keeps every value together with its timestamp, preserving the full history of how the data changed over time.
Typical TSDB application scenarios include device monitoring data, medical metrics (blood glucose, heart rate), and financial transaction data that require trend analysis, anomaly detection, and forecasting.
Why Choose InfluxDB
Active open‑source community, long‑standing support, and proven performance.
SQL‑like insertion and query language reduces learning cost.
Native HTTP API enables calls from any language.
Pluggable design that lets InfluxDB serve purely as a storage backend for other systems.
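Because the native HTTP API accepts the line protocol as plain text, any language can write points. A minimal sketch in Python of a line‑protocol serializer (a hypothetical helper for illustration; real line protocol also has escaping rules and type suffixes not handled here):

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Serialize one point into InfluxDB line protocol:
    measurement,tag1=v1 field1=v1 timestamp"""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {timestamp_ns}"

line = to_line_protocol("cpu", {"host": "server01"},
                        {"usage": 63.5}, 1620000000000000000)
# -> cpu,host=server01 usage=63.5 1620000000000000000
```

A POST of this string to the `/write` endpoint is all that is needed to ingest the point, which is why no language‑specific driver is strictly required.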
TSM Storage Engine Overview
Cache – an in‑memory map (default 1 GB) that holds recent writes.
WAL – write‑ahead log that persists cache data; on startup the WAL is replayed into memory.
TSM file – the on‑disk storage format for time‑series data.
Compactor – merges small TSM files into larger ones (cache → snapshot → TSM) and performs other maintenance tasks.
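The write path above (WAL append for durability, then the in‑memory cache, then a snapshot into a TSM file) can be sketched as a toy model; this is a simplified illustration, not the actual Go implementation:

```python
class MiniTSM:
    """Toy model of the TSM write path: each write is appended to the
    WAL for durability, then kept in an in-memory cache; once the cache
    exceeds a size threshold it is snapshotted into an immutable 'TSM file'."""

    def __init__(self, cache_limit=4):
        self.wal = []          # write-ahead log (on disk in real InfluxDB)
        self.cache = {}        # series key -> list of (timestamp, value)
        self.tsm_files = []    # each snapshot becomes one "TSM file"
        self.cache_limit = cache_limit

    def write(self, series, ts, value):
        self.wal.append((series, ts, value))                   # 1. durability
        self.cache.setdefault(series, []).append((ts, value))  # 2. fast reads
        if sum(len(v) for v in self.cache.values()) >= self.cache_limit:
            self.snapshot()                                    # 3. flush

    def snapshot(self):
        self.tsm_files.append(self.cache)  # freeze cache as a new TSM file
        self.cache = {}
        self.wal = []  # WAL can be truncated once data is safely on disk

engine = MiniTSM(cache_limit=2)
engine.write("cpu,host=a", 1, 10.0)
engine.write("cpu,host=a", 2, 11.0)  # second write triggers a snapshot
```

On restart, a real engine replays the WAL into the cache, which is why a crash between steps 1 and 3 loses no data.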
Shard – Concept Above the TSM Engine
Create different shards based on timestamp ranges.
Enable fast location of data for queries, improving query speed.
Make bulk deletion by time simple and efficient.
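Because shards are cut by time range, a point's shard can be computed directly from its timestamp. A sketch assuming a 7‑day shard duration (the duration is an assumption for illustration; InfluxDB picks it based on the retention policy):

```python
SHARD_DURATION_NS = 7 * 24 * 3600 * 10**9  # assumed 7-day shard groups

def shard_id(timestamp_ns):
    """Points in the same time window land in the same shard, so a
    time-bounded query touches only a few shards, and deleting an old
    time range is just dropping whole shard files."""
    return timestamp_ns // SHARD_DURATION_NS

# Two points an hour apart share a shard; points a month apart do not.
t0 = 1_600_000_000 * 10**9
assert shard_id(t0) == shard_id(t0 + 3600 * 10**9)
assert shard_id(t0) != shard_id(t0 + 30 * 24 * 3600 * 10**9)
```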
Project Origin
The community edition of InfluxDB does not provide a clustering solution.
InfluxDB‑relay only supports dual‑write, lacking load‑balancing.
Ele.me’s open‑source influx‑proxy solution is complex and costly to maintain.
360 needed real‑time dashboards, alerting, and fault prediction covering roughly 100,000 hosts, each reporting about 200 monitored metrics.
Thus the InfluxDB‑HA project was created.
Program Architecture
Official InfluxDB‑relay solution – suffers from unresolved issues: it only provides data backup via dual‑write, does not improve read/write throughput, adds configuration complexity, and has no retry mechanism for failed writes.
Ele.me high‑availability solution – advantages: rebuilt after relay’s limitations, supports dynamic scaling of InfluxDB nodes, and adds robust retry mechanisms. Disadvantages: many components increase learning and maintenance cost; retries can overload saturated machines; the architecture is over‑engineered for simple monitoring storage.
360 internal InfluxDB‑HA solution – advantages: uses measurement as the smallest split unit for efficient time‑series queries and supports dynamic sharding/splitting at the business layer.
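With measurement as the smallest split unit, routing can be as simple as hashing the measurement name to pick a backend node. A hypothetical sketch (the article does not specify the exact hash or node names 360 uses):

```python
import hashlib

BACKENDS = ["influxdb-node-1", "influxdb-node-2", "influxdb-node-3"]  # placeholders

def route(measurement, backends=BACKENDS):
    """All points of one measurement go to the same backend, so any
    time-series query over that measurement hits exactly one node."""
    h = int(hashlib.md5(measurement.encode()).hexdigest(), 16)
    return backends[h % len(backends)]

# The same measurement always routes to the same node:
assert route("cpu") == route("cpu")
```

Keeping a whole measurement on one node is what makes time‑range queries efficient here: no scatter‑gather across the cluster is needed for a single series.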
Performance Comparison
Disk I/O and CPU usage of the HA cluster are compared against a single‑node InfluxDB (illustrated by the accompanying charts).
Business Integration
InfluxDB‑HA manages InfluxDB instance configurations, integrates with Grafana for visualization, and fully supports the native /write API as well as any language‑specific InfluxDB SDKs.
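Since the proxy speaks the native /write API, existing clients only need to point at a different endpoint. A sketch using Python's standard library (the hostname and database name are placeholders):

```python
import urllib.request

def build_write_request(base_url, db, line):
    """Build (but do not send) a native /write request; any InfluxDB
    SDK produces an equivalent call, so switching from a single node
    to the HA proxy is just a URL change."""
    url = f"{base_url}/write?db={db}"
    return urllib.request.Request(url, data=line.encode(), method="POST")

req = build_write_request("http://influxdb-ha.example:8086", "monitor",
                          "cpu,host=server01 usage=63.5")
# urllib.request.urlopen(req) would perform the actual write
```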
Future Iteration Plan
Introduce Kafka or RabbitMQ buffers before write requests to reduce data loss risk.
Implement hot‑loading of configuration files (currently using Go’s fsnotify; future versions may use etcd for centralized config).
Support business‑level sharding to handle larger data scales, keeping measurement as the minimal split unit.
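The message‑queue buffering idea in the first item above can be illustrated with an in‑memory queue standing in for Kafka or RabbitMQ; this is purely a sketch of the decoupling, not a durable broker:

```python
import queue

write_buffer = queue.Queue(maxsize=10000)  # Kafka/RabbitMQ stand-in

def enqueue_write(line):
    """Producer side: accept the write immediately; a consumer drains
    the buffer to InfluxDB, so a short backend outage loses no data."""
    write_buffer.put(line)

def drain(batch_size=500):
    """Consumer side: pull up to batch_size points to forward in one
    /write call (the actual HTTP forwarding is omitted here)."""
    batch = []
    while len(batch) < batch_size and not write_buffer.empty():
        batch.append(write_buffer.get())
    return batch

enqueue_write("cpu,host=a usage=1")
enqueue_write("cpu,host=b usage=2")
batch = drain()
```

Batching on the consumer side also smooths write bursts, which is the main motivation for placing a buffer in front of InfluxDB.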
Usage Recommendations
Continuous Queries (CQs) – use tags for indexing, simulate production load before deployment, and avoid OOM by limiting query size.
Retention Policies – schedule changes during low concurrency, test on a slave instance before applying to production.
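For reference, a continuous query and a retention policy in InfluxQL look like the following (database, measurement, and policy names are placeholders; test such statements on a non‑production instance first, as recommended above):

```sql
-- Downsample raw CPU data into 5-minute averages
CREATE CONTINUOUS QUERY "cq_cpu_5m" ON "monitor"
BEGIN
  SELECT mean("usage") INTO "cpu_5m" FROM "cpu" GROUP BY time(5m), *
END

-- Keep data for 30 days on a single replica
CREATE RETENTION POLICY "one_month" ON "monitor"
  DURATION 30d REPLICATION 1 DEFAULT
```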
360 Tech Engineering
Official tech channel of 360, building the most professional technology aggregation platform for the brand.