Why Adding Background Indexes Can Crash a Latency‑Sensitive MongoDB Cluster—and How to Prevent It
A latency‑sensitive MongoDB cluster experienced severe jitter and connection exhaustion when multiple background indexes were added sequentially, revealing how index builds overload replica nodes, trigger alerts, and can be mitigated with proper index‑addition strategies and the noIndexBuildRetry option.
Business Background
The company stores core revenue‑critical data in a MongoDB cluster. Any latency spike can cause client‑side timeouts, directly impacting revenue. The workload is read‑heavy and uses read/write splitting, with peak traffic of 80‑100k operations per second and a data set of about 1 billion documents.
Cluster Architecture
The deployment uses a single sharded cluster with one shard consisting of a replica set of five nodes (one primary and four secondaries). This design tolerates two node failures and improves read throughput by directing reads to secondaries.
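The two-failure figure follows from the majority rule for replica set elections. A minimal sketch of the arithmetic, assuming all members are voting members (the function name is illustrative, not part of MongoDB):

```javascript
// Number of members that can fail while a majority can still elect a primary.
// Assumes every member of the replica set has a vote.
function faultTolerance(votingMembers) {
  var majority = Math.floor(votingMembers / 2) + 1; // votes needed to elect a primary
  return votingMembers - majority;
}
```

For the five-node set described here, `faultTolerance(5)` gives 2, since a majority of 3 must remain reachable.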
Problem Discovery
After adding three background indexes sequentially via the management platform, the monitoring system raised latency alerts (>20 ms) on all secondary nodes while the primary remained normal. Connection‑count alerts also appeared, eventually exhausting the max connections and preventing mongo shell access.
Investigation Process
Attempted to connect via mongo shell and received a network error:
MongoDB shell version v3.6.13
connecting to: mongodb://x.x.x.x:20001/test?gssapiServiceName=mongodb
2021-04-29T11:09:15.049+0800 E QUERY [thread1] Error: network error while attempting to run command 'isMaster' on host x.x.x.x:20001' :
connect@src/mongo/shell/mongo.js:263:13
@(connect):1:6
exception: connect failed
Since the shell could not connect, the team inspected the underlying server logs, which showed that all connections were exhausted.
System monitoring revealed extremely high disk I/O on the secondary nodes.
Further analysis of mongod logs confirmed that index builds were consuming the I/O.
Root Cause Confirmation
Adding indexes in background caused each secondary to read the collection data and build the index, generating heavy disk I/O. Because three indexes were being built concurrently on the secondaries, the I/O load spiked, leading to latency spikes and connection‑count exhaustion.
Resolution Steps
Attempted to kill the index‑build operations via killOp, but no session could be opened because the connection limit was already exhausted.
Killed the mongod process and restarted it; however, the index build resumed automatically.
MongoDB provides the --noIndexBuildRetry flag, which stops mongod from rebuilding incomplete indexes on its next startup.
mongod -f /home/service/mongodb/conf/mongod_20001.conf --noIndexBuildRetry
Using this flag allowed the secondary to start without re‑executing the interrupted index builds, and the service recovered quickly.
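When a session can still be opened, the in-progress index builds can be picked out of the `inprog` array returned by `db.currentOp()`. The sketch below is modeled on the filter the MongoDB manual suggests for active indexing operations. The field names it checks (`command.createIndexes`, and `msg` starting with `Index Build`) vary between server versions, so treat them as assumptions to verify against your own output. The helper names are illustrative.

```javascript
// Returns true when a currentOp entry looks like an index build.
// Assumption: builds show up either as a createIndexes command or as an
// op of type "none" whose progress message starts with "Index Build".
function isIndexBuildOp(op) {
  var isCreateIndexesCommand =
    op.op === "command" && !!op.command && op.command.createIndexes !== undefined;
  var isReplicatedBuild =
    op.op === "none" && typeof op.msg === "string" && /^Index Build/.test(op.msg);
  return isCreateIndexesCommand || isReplicatedBuild;
}

// Collects the opid of every index build found in a currentOp "inprog" array.
function indexBuildOpIds(inprog) {
  return inprog.filter(isIndexBuildOp).map(function (op) { return op.opid; });
}
```

In the mongo shell this could be called as `indexBuildOpIds(db.currentOp().inprog)` to see how many builds a node is running at once.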
createIndex Core Workflow
When a client issues db.collection.createIndex(..., {background:true}), the primary builds the index, returns OK to the client, writes an oplog entry, and the secondaries replay the oplog to build the index locally.
Primary queries the collection and builds the index.
After completion, the primary sends an OK response.
An oplog entry for the index build is created.
Secondaries fetch the oplog and replay the index build.
Why the Issue Appeared
Because the primary returns OK as soon as its own build finishes, each subsequent index was started while the secondaries were still replaying the previous build. By the time the third index finished on the primary, all three indexes were building simultaneously on the secondaries, overwhelming disk I/O and triggering latency alerts.
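The overlap can be illustrated with a small timeline model. This is a sketch under simplifying assumptions: every index takes the same time on the primary, the next build starts on the primary as soon as the previous one is acknowledged, and a secondary starts replaying a build the moment the primary finishes it. The durations are hypothetical inputs, not measurements from the incident.

```javascript
// Peak number of index builds running at the same time on one secondary.
// primaryMinutes and secondaryMinutes are hypothetical per-index build times.
function peakSecondaryBuilds(indexCount, primaryMinutes, secondaryMinutes) {
  var events = [];
  for (var i = 1; i <= indexCount; i++) {
    // The primary finishes index i at i * primaryMinutes; the secondary
    // starts replaying that build at the same moment.
    var start = i * primaryMinutes;
    events.push({ t: start, delta: 1 });
    events.push({ t: start + secondaryMinutes, delta: -1 });
  }
  // Sort by time; at equal timestamps, count finishes before starts.
  events.sort(function (a, b) { return a.t - b.t || a.delta - b.delta; });
  var running = 0;
  var peak = 0;
  events.forEach(function (e) {
    running += e.delta;
    if (running > peak) peak = running;
  });
  return peak;
}
```

With an assumed 10 minutes per index on the primary and 35 minutes on a loaded secondary, `peakSecondaryBuilds(3, 10, 35)` returns 3. If the secondary needed only 5 minutes per index, the peak would be 1.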
Mitigation Strategies for Latency‑Sensitive Workloads
Sequential Index Completion: Ensure that an index is fully built on all secondaries before starting the next one. Newer MongoDB versions serialize background index builds on secondaries, reducing contention.
Isolated Index Build: Remove a secondary from the replica set, start it as a standalone node, build the index without the background flag (faster), then re‑add it to the replica set.
Both methods allow index addition without noticeable impact on the production workload.
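The first strategy amounts to gating each createIndex call on the previous index being ready everywhere. A minimal sketch of that gate follows. The three callbacks are hypothetical hooks the caller would supply: one that issues the build on the primary, one that reports which members do not have the index yet, and one that blocks between polls (for example the shell's `sleep`). None of them is a MongoDB API.

```javascript
// Runs index builds strictly one after another.
// Hypothetical hooks supplied by the caller:
//   startBuild(spec)     issues the index build on the primary
//   pendingMembers(spec) returns the members where the index is not ready yet
//   wait(ms)             blocks for ms milliseconds between polls
function buildIndexesOneByOne(specs, startBuild, pendingMembers, wait, opts) {
  var pollMs = (opts && opts.pollMs) || 5000;
  var maxPolls = (opts && opts.maxPolls) || 720;
  var finished = [];
  for (var i = 0; i < specs.length; i++) {
    startBuild(specs[i]);
    var polls = 0;
    // Do not move on until every member reports the index as ready.
    while (pendingMembers(specs[i]).length > 0) {
      polls += 1;
      if (polls > maxPolls) {
        throw new Error("index " + specs[i].name + " not ready on all members after " + maxPolls + " polls");
      }
      wait(pollMs);
    }
    finished.push(specs[i].name);
  }
  return finished;
}
```

Injecting the hooks keeps the gating logic independent of how readiness is detected, which differs between server versions and management platforms.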
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
