
Root Cause Analysis and Performance Optimization of InfluxDB 1.8 Memory and Disk I/O on a Production Server

The article investigates why an InfluxDB 1.8 instance on a 32‑core, 64 GB production server consumes over 95% memory and generates heavy disk I/O, analyzes runtime statistics, pprof data, and Go memory‑release behavior, and presents configuration and runtime tweaks that reduce memory usage to ~55% and I/O load to acceptable levels.

360 Smart Cloud

After a week of continuous writes (~100 GB per day) to an InfluxDB 1.8 instance on a 32‑core, 64 GB server, the process showed >95% memory consumption and frequent swap alerts. The top output revealed the influxd process using 58 GB RES, with a high I/O wait (wa ≈ 43%).

Memory consumption analysis showed that the Go runtime reported HeapIdle ≈ 51 GB, HeapInuse ≈ 16 GB and HeapReleased ≈ 44 GB. The heap actually backed by physical memory is HeapInuse plus the idle-but-not-yet-released portion, i.e. 16 GB + (51 GB − 44 GB) ≈ 23 GB, far lower than the 58 GB RES reported by the OS. Detailed inspection with pmap, /proc/…/smaps and gdb confirmed no obvious memory leak.

go tool pprof -alloc_space showed that the DropSeriesGlobal routine in the InfluxDB index had accumulated >42 TB of allocations during series deletion, explaining the memory spikes. On Linux, the Go runtime in versions 1.12 through 1.15 uses MADV_FREE to return memory to the kernel, so RSS only drops once the system comes under memory pressure; setting GODEBUG=madvdontneed=1 restores the older MADV_DONTNEED behavior, which makes RSS fall immediately.

Disk I/O analysis with iostat showed the device dm-4 handling ~5,361 IOPS, ~27 MiB/s of reads and ~33 MiB/s of writes, with an average queue length of 3.48 and %util near 98%, indicating the storage subsystem was saturated.

To mitigate the issues, influxdb.conf was adjusted:

[data]
wal-fsync-delay = "1s"      # batch WAL fsyncs instead of syncing on every write
index-version = "tsi1"      # switch from the in-memory index to the disk-based TSI index
compact-throughput = "64m"  # raise the TSM compaction write rate (default 48m)

Note that index-version = "tsi1" only applies to newly created shards; existing shards can be converted offline with influx_inspect buildtsi.

The server was then started with GODEBUG=madvdontneed=1 in the environment:

env GODEBUG=madvdontneed=1 /usr/bin/influxd -config /usr/bin/influxdb.conf

After a week of operation, monitoring showed memory usage stabilised around 55% of total RAM and disk I/O dropped to ~200 IOPS with %util ≈ 6%, confirming that the optimisations resolved the high‑consumption symptoms.

References include Go runtime memory‑release documentation, InfluxDB index tuning guides, and Linux performance analysis resources.

Tags: Performance Tuning, memory-leak, Database Optimization, InfluxDB, Disk I/O, Go Runtime
Written by 360 Smart Cloud, the official service account of 360 Smart Cloud, dedicated to building a high-quality, secure, highly available, convenient, and stable one-stop cloud service platform.