Design and Implementation of a High‑Availability InfluxDB Cluster at 360
This article introduces the fundamentals of time‑series databases, explains why InfluxDB was chosen, describes the TSM storage engine and shard concepts, outlines the internal 360 InfluxDB‑HA architecture, compares its performance with a single node, and provides integration and future‑development guidelines.
Introduction
InfluxDB is widely used for storing and querying time‑series data such as resource‑monitoring metrics, thanks to its built‑in aggregation functions (average, standard deviation, random sampling, etc.). Drawing on the massive volume of time‑series data generated inside 360, the author designed a high‑performance, highly available InfluxDB cluster.
Basic Concepts
TSDB vs. Traditional DB: a traditional database typically stores only the current value of a record, while a time‑series database keeps every value together with its timestamp, preserving the full history of how the data changed over time.
Typical TSDB application scenarios include device monitoring data, medical metrics (blood glucose, heart rate), and financial transaction data that require trend analysis, anomaly detection, and forecasting.
Why Choose InfluxDB
Active open‑source community, long‑standing support, and proven performance.
SQL‑like insertion and query language reduces learning cost.
Native HTTP API enables calls from any language.
Pluggable design that lets InfluxDB serve purely as a storage backend for other systems.
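Because the native HTTP API accepts the line protocol as plain text, any language can write points. A minimal sketch in Python of a line‑protocol serializer (a hypothetical helper for illustration; real line protocol also has escaping rules and type suffixes not handled here):

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Serialize one point into InfluxDB line protocol:
    measurement,tag1=v1 field1=v1 timestamp"""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {timestamp_ns}"

line = to_line_protocol("cpu", {"host": "server01"},
                        {"usage": 63.5}, 1620000000000000000)
# -> cpu,host=server01 usage=63.5 1620000000000000000
```

A POST of this string to the `/write` endpoint is all that is needed to ingest the point, which is why no language‑specific driver is strictly required.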
TSM Storage Engine Overview
Cache – an in‑memory map (default 1 GB) that holds recent writes.
WAL – write‑ahead log that persists cache data; on startup the WAL is replayed into memory.
TSM file – the on‑disk storage format for time‑series data.
Compactor – merges small TSM files into larger ones (cache → snapshot → TSM) and performs other maintenance tasks.
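The write path above (WAL append for durability, then the in‑memory cache, then a snapshot into a TSM file) can be sketched as a toy model; this is a simplified illustration, not the actual Go implementation:

```python
class MiniTSM:
    """Toy model of the TSM write path: each write is appended to the
    WAL for durability, then kept in an in-memory cache; once the cache
    exceeds a size threshold it is snapshotted into an immutable 'TSM file'."""

    def __init__(self, cache_limit=4):
        self.wal = []          # write-ahead log (on disk in real InfluxDB)
        self.cache = {}        # series key -> list of (timestamp, value)
        self.tsm_files = []    # each snapshot becomes one "TSM file"
        self.cache_limit = cache_limit

    def write(self, series, ts, value):
        self.wal.append((series, ts, value))                   # 1. durability
        self.cache.setdefault(series, []).append((ts, value))  # 2. fast reads
        if sum(len(v) for v in self.cache.values()) >= self.cache_limit:
            self.snapshot()                                    # 3. flush

    def snapshot(self):
        self.tsm_files.append(self.cache)  # freeze cache as a new TSM file
        self.cache = {}
        self.wal = []  # WAL can be truncated once data is safely on disk

engine = MiniTSM(cache_limit=2)
engine.write("cpu,host=a", 1, 10.0)
engine.write("cpu,host=a", 2, 11.0)  # second write triggers a snapshot
```

On restart, a real engine replays the WAL into the cache, which is why a crash between steps 1 and 3 loses no data.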
Shard – Concept Above the TSM Engine
Create different shards based on timestamp ranges.
Enable fast location of data for queries, improving query speed.
Make bulk deletion by time simple and efficient.
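Because shards are cut by time range, a point's shard can be computed directly from its timestamp. A sketch assuming a 7‑day shard duration (the duration is an assumption for illustration; InfluxDB picks it based on the retention policy):

```python
SHARD_DURATION_NS = 7 * 24 * 3600 * 10**9  # assumed 7-day shard groups

def shard_id(timestamp_ns):
    """Points in the same time window land in the same shard, so a
    time-bounded query touches only a few shards, and deleting an old
    time range is just dropping whole shard files."""
    return timestamp_ns // SHARD_DURATION_NS

# Two points an hour apart share a shard; points a month apart do not.
t0 = 1_600_000_000 * 10**9
assert shard_id(t0) == shard_id(t0 + 3600 * 10**9)
assert shard_id(t0) != shard_id(t0 + 30 * 24 * 3600 * 10**9)
```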
Project Origin
The community edition of InfluxDB does not provide a clustering solution.
InfluxDB‑relay only supports dual‑write, lacking load‑balancing.
Ele.me’s open‑source influx‑proxy solution is complex and costly to maintain.
360 needed real‑time dashboards, alerting, and fault prediction covering roughly 100,000 hosts, each reporting about 200 monitored metrics.
Thus the InfluxDB‑HA project was created.
Program Architecture
Official InfluxDB‑relay solution – suffers from unresolved issues: it only provides data backup via dual‑write, does not improve read/write throughput, adds configuration complexity, and has no retry mechanism for failed writes.
Ele.me high‑availability solution – advantages: rebuilt after relay’s limitations, supports dynamic scaling of InfluxDB nodes, and adds robust retry mechanisms. Disadvantages: many components increase learning and maintenance cost; retries can overload saturated machines; the architecture is over‑engineered for simple monitoring storage.
360 internal InfluxDB‑HA solution – advantages: uses measurement as the smallest split unit for efficient time‑series queries and supports dynamic sharding/splitting at the business layer.
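With measurement as the smallest split unit, routing can be as simple as hashing the measurement name to pick a backend node. A hypothetical sketch (the article does not specify the exact hash or node names 360 uses):

```python
import hashlib

BACKENDS = ["influxdb-node-1", "influxdb-node-2", "influxdb-node-3"]  # placeholders

def route(measurement, backends=BACKENDS):
    """All points of one measurement go to the same backend, so any
    time-series query over that measurement hits exactly one node."""
    h = int(hashlib.md5(measurement.encode()).hexdigest(), 16)
    return backends[h % len(backends)]

# The same measurement always routes to the same node:
assert route("cpu") == route("cpu")
```

Keeping a whole measurement on one node is what makes time‑range queries efficient here: no scatter‑gather across the cluster is needed for a single series.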
Performance Comparison
Disk I/O and CPU usage of the HA cluster are compared against a single‑node InfluxDB (illustrated by the accompanying charts).
Business Integration
InfluxDB‑HA manages InfluxDB instance configurations, integrates with Grafana for visualization, and fully supports the native /write API as well as any language‑specific InfluxDB SDKs.
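Since the proxy speaks the native /write API, existing clients only need to point at a different endpoint. A sketch using Python's standard library (the hostname and database name are placeholders):

```python
import urllib.request

def build_write_request(base_url, db, line):
    """Build (but do not send) a native /write request; any InfluxDB
    SDK produces an equivalent call, so switching from a single node
    to the HA proxy is just a URL change."""
    url = f"{base_url}/write?db={db}"
    return urllib.request.Request(url, data=line.encode(), method="POST")

req = build_write_request("http://influxdb-ha.example:8086", "monitor",
                          "cpu,host=server01 usage=63.5")
# urllib.request.urlopen(req) would perform the actual write
```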
Future Iteration Plan
Introduce Kafka or RabbitMQ buffers before write requests to reduce data loss risk.
Implement hot‑loading of configuration files (currently using Go’s fsnotify; future versions may use etcd for centralized config).
Support business‑level sharding to handle larger data scales, keeping measurement as the minimal split unit.
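The message‑queue buffering idea in the first item above can be illustrated with an in‑memory queue standing in for Kafka or RabbitMQ; this is purely a sketch of the decoupling, not a durable broker:

```python
import queue

write_buffer = queue.Queue(maxsize=10000)  # Kafka/RabbitMQ stand-in

def enqueue_write(line):
    """Producer side: accept the write immediately; a consumer drains
    the buffer to InfluxDB, so a short backend outage loses no data."""
    write_buffer.put(line)

def drain(batch_size=500):
    """Consumer side: pull up to batch_size points to forward in one
    /write call (the actual HTTP forwarding is omitted here)."""
    batch = []
    while len(batch) < batch_size and not write_buffer.empty():
        batch.append(write_buffer.get())
    return batch

enqueue_write("cpu,host=a usage=1")
enqueue_write("cpu,host=b usage=2")
batch = drain()
```

Batching on the consumer side also smooths write bursts, which is the main motivation for placing a buffer in front of InfluxDB.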
Usage Recommendations
Continuous Queries (CQs) – use tags for indexing, simulate production load before deployment, and avoid OOM by limiting query size.
Retention Policies – schedule changes during low concurrency, test on a slave instance before applying to production.
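For reference, a continuous query and a retention policy in InfluxQL look like the following (database, measurement, and policy names are placeholders; test such statements on a non‑production instance first, as recommended above):

```sql
-- Downsample raw CPU data into 5-minute averages
CREATE CONTINUOUS QUERY "cq_cpu_5m" ON "monitor"
BEGIN
  SELECT mean("usage") INTO "cpu_5m" FROM "cpu" GROUP BY time(5m), *
END

-- Keep data for 30 days on a single replica
CREATE RETENTION POLICY "one_month" ON "monitor"
  DURATION 30d REPLICATION 1 DEFAULT
```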
360 Tech Engineering
Official tech channel of 360, building the most professional technology aggregation platform for the brand.