Analyzing and Optimizing High Memory and Disk I/O Consumption of InfluxDB 1.8 on a Production Server
This article investigates why an InfluxDB 1.8 instance on a 32‑core, 64 GB production server consumes over 58 GB of resident memory and generates heavy disk I/O. It examines Go runtime memory accounting, walks through diagnosis with system tools such as top, pmap, pprof, and iostat, and presents configuration and runtime tweaks that reduce both memory pressure and I/O load.
Background
The production server (32 cores, 64 GB RAM) runs InfluxDB 1.8, which is written in Go, and ingests about 100 GB of data per day. After roughly a week the process's memory usage exceeds 95 % of RAM and occasional swap alerts appear.
Problem Symptoms
top shows the InfluxDB process (PID 32309) at 58 GB RES, 95.3 % memory usage, with high I/O wait (wa 43.1 %). Two questions follow: why is the process memory so high, and why is disk I/O saturated?
Memory High‑Consumption Analysis
Querying show stats from the InfluxDB client reveals the runtime metrics: HeapIdle ≈ 51 GB, HeapReleased ≈ 44 GB, HeapInUse ≈ 16 GB. The effective heap is therefore HeapInUse + (HeapIdle − HeapReleased) ≈ 23 GB. The system, however, reports 58 GB RES, suggesting ~35 GB is marked as unused by the runtime but has not actually been reclaimed by the kernel.
Further inspection with pmap -x 32309 and cat /proc/32309/smaps shows a huge anonymous heap region (≈ 66 GB) with a large amount of Private_Dirty memory. A gdb backtrace (bt) does not point to a leak.
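The figure pmap reports in its Dirty column is the sum of the Private_Dirty fields across smaps entries. A small Go sketch of that accounting (sumPrivateDirty is a hypothetical helper; in practice you would feed it the contents of /proc/&lt;pid&gt;/smaps):

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sumPrivateDirty totals the Private_Dirty fields (in kB) of smaps-format
// text — the same figure pmap -x shows in its Dirty column.
func sumPrivateDirty(smaps string) int {
	total := 0
	sc := bufio.NewScanner(strings.NewReader(smaps))
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "Private_Dirty:") {
			// e.g. "Private_Dirty:      1024 kB"
			fields := strings.Fields(line)
			if len(fields) >= 2 {
				if kb, err := strconv.Atoi(fields[1]); err == nil {
					total += kb
				}
			}
		}
	}
	return total
}

func main() {
	sample := "Private_Dirty:      1024 kB\n" +
		"Shared_Clean:        512 kB\n" +
		"Private_Dirty:      2048 kB\n"
	fmt.Println(sumPrivateDirty(sample), "kB") // prints "3072 kB"
}
```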
Running go tool pprof with -alloc_space shows that the function index/inmem.(*Index).DropSeriesGlobal accounts for ~42 TB of cumulative allocations during series deletion, confirming that the in‑memory (inmem) index causes massive temporary allocations when series are dropped.
The Go runtime historically released memory to the kernel with MADV_DONTNEED; since Go 1.12 it uses the cheaper MADV_FREE. With MADV_FREE the kernel reclaims the advised pages lazily, so RSS does not drop immediately — it only shrinks under memory pressure. Setting GODEBUG=madvdontneed=1 forces the older, eager MADV_DONTNEED behavior.
Disk I/O Consumption Analysis
iostat -x 1 3 shows that InfluxDB's writes land on device dm‑4 with IOPS ≈ 5,361/s, reads ≈ 27 MB/s, writes ≈ 33 MB/s, avgqu‑sz ≈ 3.48, await ≈ 0.47 ms, and %util ≈ 97.5 %, indicating the I/O subsystem is saturated.
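iostat derives these figures from deltas between two /proc/diskstats samples. A small Go sketch of the same arithmetic (field numbers follow the Linux diskstats layout; DiskSample and Rates are illustrative names, not a real API):

```go
package main

import "fmt"

// DiskSample holds the per-device counters iostat reads from /proc/diskstats.
type DiskSample struct {
	ReadsCompleted  uint64 // diskstats field 4
	WritesCompleted uint64 // diskstats field 8
	IOTicksMs       uint64 // diskstats field 13: ms spent doing I/O
}

// Rates turns two samples taken intervalMs apart into total IOPS and %util —
// the same quantities iostat -x reports as r/s + w/s and %util.
func Rates(a, b DiskSample, intervalMs uint64) (iops, util float64) {
	ios := (b.ReadsCompleted - a.ReadsCompleted) +
		(b.WritesCompleted - a.WritesCompleted)
	iops = float64(ios) * 1000 / float64(intervalMs)
	util = 100 * float64(b.IOTicksMs-a.IOTicksMs) / float64(intervalMs)
	return iops, util
}

func main() {
	a := DiskSample{ReadsCompleted: 1000, WritesCompleted: 2000, IOTicksMs: 5000}
	b := DiskSample{ReadsCompleted: 1500, WritesCompleted: 3000, IOTicksMs: 5900}
	iops, util := Rates(a, b, 1000) // one-second interval
	fmt.Printf("IOPS=%.0f util=%.1f%%\n", iops, util) // prints "IOPS=1500 util=90.0%"
}
```

A %util near 100 % means the device was busy for almost the entire sampling interval, which matches the saturation observed here.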
Performance Optimizations
1. Set GODEBUG=madvdontneed=1 to force the runtime to release memory more aggressively.
2. Change InfluxDB configuration ( influxdb.conf ) to reduce I/O pressure and memory usage:
[data]
# Reduce WAL sync frequency to 1s (asynchronous flush)
wal-fsync-delay = "1s"
# Switch index from in‑memory to TSI1 to avoid high memory during retention deletions
index-version = "tsi1"
# Increase compaction throughput to 64 MB
compact-throughput = "64m"

Restart the service with the environment variable set:

env GODEBUG=madvdontneed=1 /usr/bin/influxd -config /usr/bin/influxdb.conf

Online Verification
After a week of running the tuned instance, memory usage dropped to ~55 % of RAM and disk I/O fell to ~200 IOPS with only 6 % utilization, confirming the problem was mitigated.
360 Tech Engineering