Diagnosing a MongoDB Shard Connection Storm that Caused Replication Lag and Automatic Failover
This article details an incident in a MongoDB 3.4 sharded cluster: a sudden connection storm overwhelmed a shard primary, causing replication lag and an automatic failover. It then shows how monitoring, log analysis with mtools, and a custom log-rotation script were used to diagnose and resolve the issue.
On June 5 at around 22:30, alerts indicated replication lag on shard2 of a MongoDB 3.4 sharded cluster (3 mongos routers plus 4 shards, each shard a replica set with 1 primary and 2 secondaries). The primary, configured with priority 2, normally retains its role as long as the network is healthy.
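The member priorities mentioned above can be checked from the mongo shell; the host name below is a placeholder, not taken from the incident:

```
# Print each replica-set member's host and priority (3.4-era shell syntax);
# the --host value is a placeholder for any shard2 member.
mongo --host mongo-shard2-1:27017 --quiet --eval '
  rs.conf().members.forEach(function (m) {
    print(m.host + "  priority=" + m.priority);
  })'
```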
Grafana metrics showed a sharp rise in CPU and memory usage, while QPS dropped to zero and the number of connections surged, suggesting a connection storm that overloaded the primary.
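Connection pressure of this kind can also be confirmed directly on the node: db.serverStatus().connections reports the current, available, and total-created connection counts. A minimal check, assuming a mongod reachable on the default port:

```
# Dump the connection counters from serverStatus; a collapsing "available"
# figure alongside a spiking "current" figure points at a connection storm.
mongo --quiet --eval 'printjson(db.serverStatus().connections)'
```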
Log inspection revealed massive connection-creation failures and thread-creation errors (e.g., pthread_create failed: Resource temporarily unavailable), confirming that the primary had become unresponsive. The secondaries' heartbeat connections also failed, triggering an election that temporarily promoted a secondary to primary before the original primary recovered.
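The pthread_create failure is the classic symptom of a host running into its per-user process/thread limit under a flood of incoming connections (each connection gets its own service thread in this MongoDB version). The limits can be inspected with generic Linux commands; these are not from the original writeup:

```shell
# Threads created via pthread_create count against the per-user process limit
# (RLIMIT_NPROC); a low "max user processes" value makes thread creation fail
# with "Resource temporarily unavailable" during a connection storm.
ulimit -u

# If a mongod is running, inspect its effective limits (best-effort lookup).
MONGOD_PID=$(pgrep -o mongod || true)
if [ -n "$MONGOD_PID" ]; then
  grep -i 'processes\|open files' "/proc/$MONGOD_PID/limits"
fi
```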
To automate log management, a shell script (logrotate_mongo.sh) was created for a user with the hostManager role. The script runs hourly, rotates the MongoDB log via db.runCommand({logRotate: 1}), moves the rotated log to a backup directory, and deletes logs older than seven days.
[root@ mongod]# more logrotate_mongo.sh
#!/bin/sh
MONGO_CMD=/usr/local/mongodb/bin/mongo
KEEP_DAY=7
#flush mongod log
... (script content omitted for brevity) ...

Log file sizes between 18:00 and 23:00 grew dramatically (e.g., 71 MB at 22:00 and 215 MB at 23:00), indicating abnormal load.
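For reference, the omitted rotation logic can be sketched from the description above; the paths, log file name, and credentials here are assumptions rather than details from the source:

```
#!/bin/sh
# Sketch of an hourly MongoDB log-rotation job. Paths and auth are assumptions.
MONGO_CMD=/usr/local/mongodb/bin/mongo
LOG_DIR=/usr/local/mongodb/log
BACKUP_DIR=/data/backup/mongolog
KEEP_DAY=7

# Ask mongod to start a new log file (requires a user with the hostManager role).
$MONGO_CMD admin --eval 'db.runCommand({logRotate: 1})'

# logRotate renames the old file to mongod.log.<timestamp>; move it aside.
mv "$LOG_DIR"/mongod.log.* "$BACKUP_DIR"/ 2>/dev/null

# Remove rotated logs older than KEEP_DAY days.
find "$BACKUP_DIR" -name 'mongod.log.*' -mtime +"$KEEP_DAY" -delete
```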
Using mtools, the team parsed the logs to confirm the timeline of events: primary loss, secondary promotion, rollback, recovery, and the final re-election.
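mloginfo (part of mtools) can extract replica-set state transitions directly from a mongod log with its --rsstate report; the log file name below is assumed:

```
# Summarize RSSTATE transitions (PRIMARY, SECONDARY, ROLLBACK, ...) with
# timestamps, reconstructing the election timeline from the log.
mloginfo mongod.log --rsstate
```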
... mloginfo output showing RSSTATE transitions ...

Further analysis of the mongos logs identified which application servers generated the connection surge. Between 22:00 and 23:00, IP 172.31.0.78 alone opened more than 14,000 connections, with similar spikes from other servers, confirming that the connection storm originated in the application layer.
CONNECTIONS
total opened: 58261
total closed: 58868
...
172.31.0.78 opened: 14041 closed: 14576
172.31.0.21 opened: 13898 closed: 14284
...

The incident was handed over to the developers for further investigation of the offending services.
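Per-IP counts like those shown above come from mloginfo's --connections report, run against the mongos log; the file name is assumed:

```
# Summarize opened/closed connection counts per client IP, exposing which
# application servers drove the connection surge.
mloginfo mongos.log --connections
```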
In summary, the case was straightforward to diagnose using two tools: a custom log-rotation shell script and the mtools log-analysis suite. Together they made it possible to extract the key diagnostic information quickly and noticeably improved operational efficiency.
Aikesheng Open Source Community
The Aikesheng Open Source Community provides stable, enterprise-grade open-source tools and services for MySQL, releases a premium open-source component each year on "1024" (Programmer's Day), and operates and maintains them on an ongoing basis.