How to Build a High‑Performance InfluxDB Cluster for Massive Time‑Series Data
This article explores InfluxDB’s time‑series strengths, compares TSDB with traditional databases, explains its TSM storage engine and shard concepts, and details the design, architecture, performance benchmarks, integration steps, and future enhancements of a high‑availability InfluxDB‑HA solution used at 360.
Basic Concepts
TSDB vs Traditional DB
Traditional databases store only the current state of each value.
Time‑series databases record how a value changes over time, keeping every timestamped sample.
TSDB Application Scenarios
Time‑series data that requires historical trends, periodic patterns, anomaly detection, and future prediction, such as device monitoring, medical vitals, and financial transaction logs.
Why Choose InfluxDB
Active community and proven performance.
SQL‑like query language reduces learning cost.
Native HTTP API supports multiple languages.
Pluggable storage solution.
InfluxDB TSM Storage Engine Overview
Components of the TSM engine:
Cache – an in‑memory map of recently written points (memory limit defaults to 1 GB).
WAL – write‑ahead log that persists incoming writes for crash recovery.
TSM files – the on‑disk data files.
Compactor – snapshots the cache into TSM files and merges small TSM files into larger ones.
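The cache → WAL → TSM flow above can be sketched as a toy model (illustrative only, not the real engine; the point-count threshold stands in for the real byte-based cache limit):

```python
# Toy sketch of the TSM write path: every write is appended to the WAL for
# durability and added to the in-memory cache; once the cache exceeds its
# limit, the compactor snapshots it into an immutable "TSM" segment.
class TinyTSM:
    def __init__(self, cache_limit=3):
        self.wal = []          # write-ahead log entries
        self.cache = {}        # series key -> list of (timestamp, value)
        self.tsm_files = []    # snapshotted, immutable segments
        self.cache_limit = cache_limit  # points before snapshot (real limit is bytes)

    def write(self, series, ts, value):
        self.wal.append((series, ts, value))            # durability first
        self.cache.setdefault(series, []).append((ts, value))
        if sum(len(v) for v in self.cache.values()) >= self.cache_limit:
            self._snapshot()

    def _snapshot(self):
        # compactor: freeze the cache into a sorted, immutable segment
        segment = {k: sorted(v) for k, v in self.cache.items()}
        self.tsm_files.append(segment)
        self.cache.clear()
        self.wal.clear()       # these writes now live in the segment

db = TinyTSM()
for i in range(3):
    db.write("cpu,host=h1", i, float(i))
print(len(db.tsm_files))  # 1 segment after the cache limit is reached
```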
Shard – Concept Above TSM Engine
Shards are created for different timestamp ranges, enabling fast time‑based queries and efficient batch deletions.
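Shard selection by timestamp can be sketched as simple time bucketing (the 7‑day shard duration assumed here is InfluxDB's default for an infinite retention policy): a point's timestamp alone determines its shard, so time‑bounded queries and bulk deletions touch only the relevant shards.

```python
from datetime import datetime, timedelta, timezone

# Bucket timestamps into fixed-width shard groups (assumed 7-day duration).
SHARD_DURATION = timedelta(days=7)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def shard_id(ts: datetime) -> int:
    """Return the shard-group index a point with timestamp ts falls into."""
    return int((ts - EPOCH) / SHARD_DURATION)

a = shard_id(datetime(2023, 1, 1, tzinfo=timezone.utc))
b = shard_id(datetime(2023, 1, 9, tzinfo=timezone.utc))
print(a != b)  # points more than a shard-width apart land in different shards: True
```

Dropping an expired week of data then means deleting one shard's files rather than scanning rows, which is why time-based retention is cheap.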
Project Origin
InfluxDB community edition lacks clustering.
Official influxdb‑relay only supports dual‑write, no load balancing.
Eleme’s influx‑proxy solution is complex to deploy and maintain.
360 needed real‑time monitoring for roughly 100,000 hosts, each reporting about 200 metrics.
Thus the InfluxDB‑HA project was created.
Architecture
Official InfluxDB‑Relay Solution
Unresolved issues:
Dual‑write only produces a backup copy; it does not improve read or write throughput.
The relay does not proxy queries, so clients must be pointed at InfluxDB nodes directly, adding configuration complexity.
No retry mechanism for failed writes.
Eleme InfluxDB High‑Availability Solution
Advantages:
Influx‑proxy rebuilt to meet performance and maintenance needs.
Dynamic scaling of InfluxDB nodes.
Robust retry mechanism for failed requests.
Disadvantages:
Many components increase learning and maintenance cost.
Retry can add load when machines are at capacity.
Not aligned with simple monitoring storage needs.
360 Internal InfluxDB‑HA Solution
Advantages:
Uses measurement as the smallest split unit, ensuring efficient time‑series queries.
Supports dynamic sharding and table splitting at the business layer.
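With measurement as the smallest split unit, a proxy can deterministically route each measurement to one backend group, so all points of a series stay together and time‑series queries never fan out. A hypothetical sketch (backend names and the MD5 choice are illustrative, not the documented 360 implementation):

```python
import hashlib

# Route a measurement name to a fixed backend group by hashing it.
BACKENDS = ["influx-group-0", "influx-group-1", "influx-group-2"]

def route(measurement: str) -> str:
    """Pick a backend group for a measurement; the same name always maps
    to the same group, keeping each series on a single group."""
    digest = hashlib.md5(measurement.encode()).hexdigest()
    return BACKENDS[int(digest, 16) % len(BACKENDS)]

print(route("cpu_usage") == route("cpu_usage"))  # deterministic routing: True
```

Adding a backend group changes the modulus, so dynamic scaling in practice needs a remapping step (e.g. consistent hashing) rather than this naive modulo.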
Performance Comparison
Disk I/O comparison with a single‑node InfluxDB.
CPU usage comparison with a single‑node InfluxDB.
Business Integration Guide
InfluxDB‑HA manages InfluxDB instance configurations.
Grafana integration instructions.
Third‑party programs write data through the standard /write HTTP API, so any language with an HTTP client can integrate without a dedicated SDK.
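A write through the /write endpoint is just a POST of line protocol. The formatting below follows InfluxDB's documented line-protocol syntax (`measurement,tag=val field=val timestamp`); the host and database names are placeholders:

```python
# Build an InfluxDB line-protocol string for one point.
def to_line(measurement, tags, fields, ts_ns):
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {ts_ns}"

line = to_line("cpu", {"host": "server01"}, {"value": 0.64}, 1434055562000000000)
print(line)  # cpu,host=server01 value=0.64 1434055562000000000

# POSTing it (commented out so the sketch runs without a server):
# import urllib.request
# req = urllib.request.Request("http://influxdb-ha:8086/write?db=metrics",
#                              data=line.encode(), method="POST")
# urllib.request.urlopen(req)
```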
Future Iteration Plan
Integrate Kafka or RabbitMQ as a buffer before writes to reduce data loss.
Hot‑load configuration files (currently implemented with Go's fsnotify; etcd is planned for centralized configuration).
Support business‑side partitioning to handle larger data scales while keeping measurement as the minimal split unit.
InfluxDB Usage Tips
Continuous Queries/SELECT: for queries scanning more than ~100 k points, filter on tags (tags are indexed, fields are not); this greatly reduces memory consumption and avoids OOM.
Prefer tags in queries.
Test continuous queries in a simulated production environment before deployment.
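A typical continuous query downsamples raw points into aggregates, grouping by an indexed tag so it stays cheap. The statement below uses InfluxQL's documented CQ syntax; the database, measurement, and interval are example choices:

```python
# Continuous query that rolls raw "cpu" points into 5-minute means per host.
cq = (
    'CREATE CONTINUOUS QUERY "cq_cpu_5m" ON "metrics" BEGIN '
    'SELECT mean("value") INTO "cpu_5m" FROM "cpu" '
    'GROUP BY time(5m), "host" END'
)
print(cq)
```

Running the same SELECT manually against a production-sized dataset first, as the tip above suggests, shows whether the interval and grouping are affordable before the CQ runs on a schedule.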
Retention Policy: setting a retention policy keeps only the data you need, but applying or altering one on large volumes can increase CPU usage; make RP changes during low read/write periods and test on a slave instance first.
Operate RP during low concurrency.
Iterate on RP settings on a slave before production rollout.
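The RP operations in question are plain InfluxQL statements; the 30‑day and 60‑day durations and the database name below are illustrative:

```python
# Create a default 30-day retention policy, then widen it to 60 days.
create_rp = ('CREATE RETENTION POLICY "thirty_days" ON "metrics" '
             'DURATION 30d REPLICATION 1 DEFAULT')
alter_rp = 'ALTER RETENTION POLICY "thirty_days" ON "metrics" DURATION 60d'
print(create_rp)
```

Issuing these on a slave first, as recommended above, surfaces the expiry-driven shard deletions and CPU cost before they hit production.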
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.