MongoDB High‑Throughput Cluster Optimization: Software, Configuration, and Storage Engine Tuning
This article details how a high‑traffic MongoDB sharded cluster exceeding one million TPS was optimized through software‑level tweaks, configuration changes, WiredTiger storage‑engine tuning, and hardware upgrades, resulting in latency reductions from hundreds of milliseconds to a few milliseconds and significantly improved stability.
1. Background
Online cluster peak TPS exceeds 1 million writes per second, average latency over 100 ms, and severe jitter as traffic grows. The cluster uses MongoDB native sharding with balanced data distribution. Node traffic monitoring is shown in the following images.
The cluster reaches over 1.2 M TPS (excluding delete traffic). Including primary‑node delete operations, total TPS exceeds 1.5 M.
2. Software Optimizations
Without adding hardware, the following software‑level optimizations achieved several‑fold performance gains:
Business‑level optimization
MongoDB configuration optimization
Storage‑engine optimization
2.1 Business‑level Optimization
The cluster stores nearly 10 billion documents, each retained for three days. Expired documents generate massive delete load during peak hours. The solution moves delete‑by‑expire to nighttime using an expireAt index:
db.collection.createIndex({ "expireAt": 1 }, { expireAfterSeconds: 0 })
With expireAfterSeconds set to 0, the TTL monitor deletes each document at (or shortly after, since the monitor runs periodically) the time stored in its expireAt field, e.g.:
db.collection.insert({
  // document will be deleted at 2019-07-22 01:00:00
  "expireAt": new Date('July 22, 2019 01:00:00'),
  "logEvent": 2,
  "logMessage": "Success!"
})
Scheduling expiration for the nighttime off-peak window removes the delete load from peak hours, lowering average latency and jitter.
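The nighttime scheduling described above can be sketched as a small helper that, given a document's ingest time and the three-day retention window, pins expireAt to the next 01:00 at least three days later. This is a hypothetical helper (the function name and the 01:00 window parameterization are not from the article), shown only to make the idea concrete:

```javascript
// Sketch: compute an expireAt that falls in the nightly off-peak window.
// nightExpireAt is a hypothetical helper; the 3-day retention and the
// nighttime deletion window come from the article's description.
function nightExpireAt(createdAt, retentionDays = 3, deleteHour = 1) {
  const expire = new Date(createdAt.getTime());
  expire.setDate(expire.getDate() + retentionDays);
  // If we are already past the off-peak hour, push to the next night.
  if (expire.getHours() >= deleteHour) {
    expire.setDate(expire.getDate() + 1);
  }
  expire.setHours(deleteHour, 0, 0, 0);
  return expire;
}

// A document written at 2019-07-19 14:30 local time expires at the
// first 01:00 that is at least 3 days later: 2019-07-23 01:00.
const expireAt = nightExpireAt(new Date(2019, 6, 19, 14, 30));
// Used together with the TTL index shown above, e.g.:
// db.collection.insert({ expireAt: expireAt, logEvent: 2, logMessage: "Success!" })
```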
Tip 1: the meaning of expireAfterSeconds
When expireAfterSeconds is 0, the indexed field (expireAt) specifies the absolute expiration time, e.g., 2019-12-22 01:00:
db.log_events.createIndex({ "expireAt": 1 }, { expireAfterSeconds: 0 })
db.log_events.insert({ "expireAt": new Date('Dec 22, 2019 01:00:00'), "logEvent": 2, "logMessage": "Success!" })
When expireAfterSeconds is greater than 0, it adds that many seconds of delay after the indexed timestamp; e.g., with a 60 s delay the document expires 60 seconds after createdAt:
db.log_events.createIndex({ "createdAt": 1 }, { expireAfterSeconds: 60 })
db.log_events.insert({ "createdAt": new Date(), "logEvent": 2, "logMessage": "Success!" })
Tip 2: why does mongostat show delete operations only on secondaries? The TTL monitor runs on the primary and deletes expired documents directly through the WiredTiger engine, bypassing the normal client request path, so those deletes are not counted in the primary's mongostat output; secondaries apply the replicated deletes through the ordinary oplog path, where they are counted.
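The relationship between the two parameters can be stated in one line: the earliest possible deletion time is the indexed timestamp plus expireAfterSeconds (actual deletion may lag further, because the TTL monitor only wakes periodically). A minimal illustration, with the helper name being an assumption for this sketch:

```javascript
// Earliest time the TTL monitor may delete a document:
// indexed timestamp + expireAfterSeconds. Actual deletion can lag,
// since the TTL monitor runs on a periodic cycle.
function earliestDeletion(indexedTime, expireAfterSeconds) {
  return new Date(indexedTime.getTime() + expireAfterSeconds * 1000);
}

const expireAt = new Date('2019-12-22T01:00:00Z');
// expireAfterSeconds: 0  -> eligible exactly at expireAt
const immediate = earliestDeletion(expireAt, 0);
// expireAfterSeconds: 60 -> eligible 60 s after the indexed field
const delayed = earliestDeletion(expireAt, 60);
```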
2.2 MongoDB Configuration Optimization (Network I/O Reuse & Separation)
High TPS and burst traffic cause the default one-thread-per-connection model to overload the system. MongoDB 3.6+ introduces serviceExecutor: adaptive, which dynamically sizes the network thread pool and multiplexes network I/O across it.
2.2.1 Internal Network Thread Model
By default, MongoDB creates a dedicated thread per client connection that handles both network I/O and disk I/O. Under traffic spikes this leads to massive thread counts (up to ~10 k threads) and high system load, for three reasons:
Rapid thread creation under high concurrency adds overhead.
Thread destruction during low-traffic periods adds further load.
A single thread handling both network and storage I/O couples fast network operations to slow disk operations, a design weakness.
2.2.2 Optimization Method
Enable serviceExecutor: adaptive to multiplex network I/O across a shared thread pool and separate it from disk I/O, reducing thread-creation overhead and improving throughput.
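In mongod.conf, the change is a single setting under net. The fragment below is a sketch for MongoDB 3.6–4.x (the adaptive service executor was later removed upstream, so verify it against your version's documentation before enabling it):

```yaml
# mongod.conf fragment (MongoDB 3.6–4.x): switch from the default
# one-thread-per-connection ("synchronous") model to the adaptive
# executor, which multiplexes network I/O over a dynamic thread pool.
net:
  serviceExecutor: adaptive
```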
2.2.3 Performance Comparison Before/After
After enabling the adaptive service executor, average latency dropped to roughly one half to one third of its previous value and system load decreased.
2.2.3.1 Load Comparison
2.2.3.2 Slow‑log Comparison
2.2.3.3 Average Latency Comparison
2.3 WiredTiger Storage‑Engine Optimization
Even after service executor tuning, latency remained ~80 ms. Analysis revealed I/O spikes caused by large cacheSize and dirty‑page flushing.
2.3.1 CacheSize Adjustment
A cacheSize of 110 GB allowed massive amounts of dirty data to accumulate between flushes, and flushing it saturated disk I/O. Reducing cacheSize to 50 GB limited how much dirty data could be written out in a single burst.
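The corresponding mongod.conf fragment looks like this (the 50 GB value is the one used in the article; size it for your own RAM and workload):

```yaml
# mongod.conf fragment: cap the WiredTiger cache at 50 GB (down from
# 110 GB) so less dirty data can accumulate between flushes.
storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 50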
2.3.2 Dirty‑Page Eviction Tuning
Adjusted eviction thresholds (target 75 %, trigger 97 %, dirty_target 3 %, dirty_trigger 25 %) and increased eviction threads to mitigate write spikes.
eviction_target: 75%
eviction_trigger: 97%
eviction_dirty_target: 3%
eviction_dirty_trigger: 25%
eviction.threads_min: 8
eviction.threads_max: 12
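These thresholds can be applied to a running mongod through the wiredTigerEngineRuntimeConfig server parameter. The sketch below is a mongo-shell configuration fragment (it needs a live mongod to run against; the parameter name is real, the values are the ones quoted above):

```javascript
// Mongo shell sketch: apply the article's eviction thresholds at runtime
// via the wiredTigerEngineRuntimeConfig server parameter.
db.adminCommand({
  setParameter: 1,
  wiredTigerEngineRuntimeConfig:
    "eviction_target=75,eviction_trigger=97," +
    "eviction_dirty_target=3,eviction_dirty_trigger=25," +
    "eviction=(threads_min=8,threads_max=12)"
})
```

Note that runtime changes do not persist across restarts; for a permanent setting, put the same string in the storage engine's configString in mongod.conf.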
2.3.3 Checkpoint Optimization
Reduced checkpoint interval to 25 s (log size 1 GB) to limit dirty data accumulation.
checkpoint=(wait=25,log_size=1GB)
2.3.4 IO Comparison Before/After
2.3.5 Latency Comparison Before/After
Latency dropped from ~200 ms to ~20 ms with reduced jitter.
3. Server System Disk‑IO Issue Resolution
3.1 Hardware Background
When WiredTiger flushed large amounts of dirty data, writes exceeding 500 MB/s drove disk util to 100 % and write throughput periodically dropped to zero, pointing to a hardware-level bottleneck.
The NVMe SSD is rated for roughly 2 GB/s of write throughput, but the server only achieved 500 MB/s.
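One way to compare rated versus achieved throughput outside of MongoDB is a direct sequential-write benchmark. The fio invocation below is a sketch, assuming fio is installed and /data is the NVMe mount point (both the tool choice and the path are assumptions, not from the article):

```shell
# Sequential-write benchmark with direct I/O, bypassing the page cache,
# to measure what the NVMe device actually sustains.
fio --name=seqwrite --rw=write --bs=1M --size=4G \
    --direct=1 --numjobs=1 --filename=/data/fio.test
```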
3.2 Post‑Upgrade Performance
The NVMe underperformance was traced to a kernel version mismatch; upgrading to kernel 3.10.0-957.27.2.el7.x86_64 restored full SSD throughput.
Migrating primary nodes to servers delivering the full 2 GB/s further reduced latency to 2-4 ms across the three business workloads.
4. Summary and Open Issues
Combined service-executor tuning, storage-engine adjustments, and hardware upgrades lowered average latency from hundreds of milliseconds to 2-4 ms, a gain of roughly two orders of magnitude. Remaining jitter will be addressed in a future article.
OPPO Internet Technology https://segmentfault.com/a/1190000021268588