Diagnosing a MongoDB Shard Connection Storm that Caused Replication Lag and Automatic Failover
This article details an incident in a MongoDB 3.4 sharded cluster: a sudden connection storm overwhelmed a shard primary, causing replication lag and an automatic failover. It then shows how monitoring, log analysis with mtools, and a custom log-rotation script were used to diagnose and resolve the issue.
On June 5 at around 22:30, alerts indicated replication lag on shard2 of a MongoDB 3.4 sharded cluster (3 mongos routers plus 4 shards, each shard a replica set with 1 primary and 2 secondaries). The primary, configured with priority 2, normally retains its role as long as the network is healthy.
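The member priorities mentioned above can be checked from the mongo shell; the host name below is a placeholder, not taken from the incident:

```
# Print each replica-set member's host and priority (3.4-era shell syntax);
# the --host value is a placeholder for any shard2 member.
mongo --host mongo-shard2-1:27017 --quiet --eval '
  rs.conf().members.forEach(function (m) {
    print(m.host + "  priority=" + m.priority);
  })'
```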
Grafana metrics showed a sharp rise in CPU and memory usage, while QPS dropped to zero and the number of connections surged, suggesting a connection storm that overloaded the primary.
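Connection pressure of this kind can also be confirmed directly on the node: db.serverStatus().connections reports the current, available, and total-created connection counts. A minimal check, assuming a mongod reachable on the default port:

```
# Dump the connection counters from serverStatus; a collapsing "available"
# figure alongside a spiking "current" figure points at a connection storm.
mongo --quiet --eval 'printjson(db.serverStatus().connections)'
```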
Log inspection revealed massive connection-creation failures and thread-creation errors (e.g., pthread_create failed: Resource temporarily unavailable), confirming that the primary had become unresponsive. The secondaries' heartbeat connections also failed, triggering an election that temporarily promoted a secondary to primary before the original primary recovered.
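The pthread_create failure is the classic symptom of a host running into its per-user process/thread limit under a flood of incoming connections (each connection gets its own service thread in this MongoDB version). The limits can be inspected with generic Linux commands; these are not from the original writeup:

```shell
# Threads created via pthread_create count against the per-user process limit
# (RLIMIT_NPROC); a low "max user processes" value makes thread creation fail
# with "Resource temporarily unavailable" during a connection storm.
ulimit -u

# If a mongod is running, inspect its effective limits (best-effort lookup).
MONGOD_PID=$(pgrep -o mongod || true)
if [ -n "$MONGOD_PID" ]; then
  grep -i 'processes\|open files' "/proc/$MONGOD_PID/limits"
fi
```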
To automate log management, a shell script (logrotate_mongo.sh) was created for a user with the hostManager role. The script runs hourly, rotates the MongoDB log via db.runCommand({logRotate: 1}), moves the rotated log to a backup directory, and deletes logs older than seven days.
[root@ mongod]# more logrotate_mongo.sh
#!/bin/sh
MONGO_CMD=/usr/local/mongodb/bin/mongo
KEEP_DAY=7
#flush mongod log
... (script content omitted for brevity) ...

Log file sizes between 18:00 and 23:00 grew dramatically (e.g., 71 MB at 22:00 and 215 MB at 23:00), indicating abnormal load.
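For reference, the omitted rotation logic can be sketched from the description above; the paths, log file name, and credentials here are assumptions rather than details from the source:

```
#!/bin/sh
# Sketch of an hourly MongoDB log-rotation job. Paths and auth are assumptions.
MONGO_CMD=/usr/local/mongodb/bin/mongo
LOG_DIR=/usr/local/mongodb/log
BACKUP_DIR=/data/backup/mongolog
KEEP_DAY=7

# Ask mongod to start a new log file (requires a user with the hostManager role).
$MONGO_CMD admin --eval 'db.runCommand({logRotate: 1})'

# logRotate renames the old file to mongod.log.<timestamp>; move it aside.
mv "$LOG_DIR"/mongod.log.* "$BACKUP_DIR"/ 2>/dev/null

# Remove rotated logs older than KEEP_DAY days.
find "$BACKUP_DIR" -name 'mongod.log.*' -mtime +"$KEEP_DAY" -delete
```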
Using mtools, the team parsed the logs to confirm the timeline of events: primary loss, secondary promotion, rollback, recovery, and the final re-election.
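mloginfo (part of mtools) can extract replica-set state transitions directly from a mongod log with its --rsstate report; the log file name below is assumed:

```
# Summarize RSSTATE transitions (PRIMARY, SECONDARY, ROLLBACK, ...) with
# timestamps, reconstructing the election timeline from the log.
mloginfo mongod.log --rsstate
```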
... mloginfo output showing RSSTATE transitions ...

Further analysis of the mongos logs identified which application servers generated the connection surge. Between 22:00 and 23:00, IP 172.31.0.78 alone opened more than 14,000 connections, with similar spikes from other servers, confirming that the connection storm originated in the application layer.
CONNECTIONS
total opened: 58261
total closed: 58868
...
172.31.0.78 opened: 14041 closed: 14576
172.31.0.21 opened: 13898 closed: 14284
...

The incident was handed over to the developers for further investigation of the offending services.
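Per-IP counts like those shown above come from mloginfo's --connections report, run against the mongos log; the file name is assumed:

```
# Summarize opened/closed connection counts per client IP, exposing which
# application servers drove the connection surge.
mloginfo mongos.log --connections
```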
In summary, the case was straightforward to diagnose using two tools: a custom log-rotation shell script and the mtools log-analysis suite. Together they made it possible to extract the key diagnostic information quickly and noticeably improved operational efficiency.
Aikesheng Open Source Community
The Aikesheng Open Source Community provides stable, enterprise-grade open-source tools and services for MySQL, releases a premium open-source component each year on "1024" (Programmer's Day), and operates and maintains them on an ongoing basis.