Why MongoDB mongos Proxies Crash Under Load and How to Fix It
A high-traffic Java service backed by MongoDB suffered intermittent latency spikes and then a full outage. The causes were excessive connection churn, a kernel-level random-number-generation bottleneck, and misconfigured client timeouts. They were diagnosed through log analysis, packet captures, and load testing, which led to concrete mitigation steps.
Problem Background
A core Java long‑connection service stores data in MongoDB. Hundreds of client machines connect to a single MongoDB cluster. The system experienced repeated performance jitter and a later "avalanche" outage where traffic dropped to zero and could not recover automatically.
Cluster Architecture
The deployment spans three data centers in a multi-active setup. Each site runs its own mongos proxies and a set of data nodes (A-site: 1 primary + 1 secondary; B-site: 2 secondaries; C-site: 1 arbiter). Clients are configured to connect to the proxies of their own site. Because the proxies are stateless, a failure in one site should not affect the others.
Fault Investigation
1. Storage node slow‑log analysis
CPU, memory, I/O, and load on all mongod nodes were normal. The slow-log threshold was set to 30 ms, and no slow-log entries appeared during the jitter periods, which indicates that the storage nodes were not the cause.
2. mongos proxy analysis
After the cluster was migrated to a custom monitoring platform, the metrics showed QPS spikes that lined up with the jitter events. The mongos logs showed thousands of connections being established and closed within a single second.
During these spikes the proxy CPU was dominated by system (sy %) time while user (us %) time remained low, and the load average reached several hundred.
3. Connection‑churn root cause
Packet captures revealed that each new connection performed a db.isMaster() call followed by SASL authentication. The first SASL step generates random numbers by reading /dev/urandom. Because mongos creates a dedicated thread per connection, many threads simultaneously read /dev/urandom, contending on a kernel spinlock and driving system‑CPU (sy %) to 100 %.
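The access pattern described above can be approximated outside mongos. The following is a minimal sketch, assuming a Linux host: each thread stands in for one connection thread and reads small nonce-sized chunks from /dev/urandom. The function name, thread count, and chunk size are hypothetical and do not come from the MongoDB source.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

#include <fcntl.h>   // open (POSIX)
#include <unistd.h>  // read, close (POSIX)

// Hypothetical helper: spawns `threads` workers, each issuing
// `reads_per_thread` read() calls of `nonce_bytes` bytes on /dev/urandom.
// Returns the total number of bytes actually read.
std::size_t read_urandom_concurrently(unsigned threads,
                                      unsigned reads_per_thread,
                                      std::size_t nonce_bytes) {
    std::atomic<std::size_t> total{0};
    std::vector<std::thread> workers;
    workers.reserve(threads);
    for (unsigned t = 0; t < threads; ++t) {
        workers.emplace_back([&total, reads_per_thread, nonce_bytes] {
            int fd = open("/dev/urandom", O_RDONLY);
            if (fd < 0) return;  // device not available
            std::vector<unsigned char> buf(nonce_bytes);
            for (unsigned i = 0; i < reads_per_thread; ++i) {
                // One system call per read, like one nonce per new connection.
                ssize_t n = read(fd, buf.data(), buf.size());
                if (n > 0) total += static_cast<std::size_t>(n);
            }
            close(fd);
        });
    }
    for (auto& w : workers) w.join();
    return total.load();
}
```

Build with `-pthread`. Calling this with a growing thread count while watching sy % in `top` is one way to check whether a given kernel shows the contention described here.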
Simulation of the Fault
Modify mongos source to delay every request by 600 ms.
Run two mongos instances on the same machine, distinguished by port.
Launch 6 000 concurrent client connections with a 500 ms timeout.
Because the injected 600 ms delay exceeds the 500 ms client timeout, every request times out. The simulation reproduced the rapid connect-disconnect pattern and the associated CPU spike.
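A rough back-of-the-envelope model shows why these parameters produce so much churn. The sketch below assumes that a client closes its connection when a request times out and reconnects immediately, and it ignores the time spent on connection setup and authentication, so the result is an upper bound. The struct and function names are hypothetical.

```cpp
#include <cstdint>

// Hypothetical result type for the estimate below.
struct ChurnEstimate {
    bool every_request_times_out;   // server delay exceeds the client timeout
    double reconnects_per_second;   // upper bound on new connections per second
};

// Assumption: each client has one request in flight, closes the connection
// on timeout, and reconnects at once.
ChurnEstimate estimate_churn(std::uint32_t clients,
                             std::uint32_t timeout_ms,
                             std::uint32_t server_delay_ms) {
    ChurnEstimate e{};  // zero-initialized: false, 0.0
    e.every_request_times_out = server_delay_ms > timeout_ms;
    if (e.every_request_times_out && timeout_ms > 0) {
        // Every client reconnects once per timeout interval.
        e.reconnects_per_second = clients * 1000.0 / timeout_ms;
    }
    return e;
}
```

With the simulation's values, `estimate_churn(6000, 500, 600)` gives at most 12 000 reconnects per second under these assumptions. Each reconnect repeats the handshake and authentication, which is where the random-number read happens.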
Kernel Version Impact
On Linux 2.6 kernels, sy % CPU reached 100 % with as few as 1 500 concurrent connections. On Linux 3.10, the same workload caused sy % to rise gradually, reaching ~30 % at 20 000 connections, indicating a performance improvement but not a complete fix.
Root‑Cause Analysis
Each new client connection triggers SASL authentication, which reads /dev/urandom to generate a nonce. When many threads read this device at the same time, they contend on a kernel spinlock (_spin_lock_irqsave), which saturates kernel-mode CPU.
#include <cstdint>

// PseudoRandom keeps its generator state in four 32-bit words.
class PseudoRandom {
    uint32_t _x;
    uint32_t _y;
    uint32_t _z;
    uint32_t _w;
};

Relevant source locations in MongoDB where /dev/urandom is accessed include the server-side SASL SCRAM-SHA-1 first step and the client-side counterpart.
Relevant Source Files
https://github.com/y123456yz/reading-and-annotate-mongodb-3.6/blob/master/mongo/src/mongo/platform/random.cpp
https://github.com/y123456yz/reading-and-annotate-mongodb-3.6/blob/master/mongo/src/mongo/transport/service_executor_adaptive.cpp
https://github.com/y123456yz/reading-and-annotate-mongodb-3.6/blob/master/mongo/src/mongo/transport/service_executor_synchronous.cpp
Mitigation Strategies
Standardize client configuration: set client timeouts on the order of seconds, list all mongos proxies in each client's configuration, and avoid pointing clients at a single proxy.
Increase the number of mongos proxies to distribute the connection load.
Replace kernel-mode random number generation with a user-space Xorshift algorithm (see the Wikipedia article on Xorshift). Generating the numbers in user space removes the /dev/urandom contention.
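To illustrate the last item, here is a minimal sketch of Marsaglia's 128-bit xorshift generator, which uses four 32-bit state words like the PseudoRandom class shown earlier. It is not necessarily the exact code used in the patched mongos. The class name is hypothetical, and the three fixed state constants are the defaults from Marsaglia's published example.

```cpp
#include <cstdint>

// Hypothetical user-space generator: Marsaglia's xorshift128.
// It never touches the kernel, so threads do not share a kernel lock.
class Xorshift128 {
public:
    // Only _x is seeded; the other words use Marsaglia's published defaults,
    // which also guarantees the state is never all zero.
    explicit Xorshift128(std::uint32_t seed)
        : _x(seed), _y(362436069u), _z(521288629u), _w(88675123u) {}

    // Returns the next 32-bit value and advances the state.
    std::uint32_t next() {
        std::uint32_t t = _x ^ (_x << 11);
        _x = _y;
        _y = _z;
        _z = _w;
        _w = _w ^ (_w >> 19) ^ t ^ (t >> 8);
        return _w;
    }

private:
    std::uint32_t _x;
    std::uint32_t _y;
    std::uint32_t _z;
    std::uint32_t _w;
};
```

Each thread would own its own instance, for example `Xorshift128 rng(seed); auto v = rng.next();`, so no locking is needed. Xorshift is fast but not cryptographically secure, which is a trade-off to weigh when it feeds an authentication nonce.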
After applying the user‑space random generator, system‑CPU dropped dramatically and proxy throughput improved several‑fold under short‑connection workloads.
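For the first mitigation listed above (standardized client configuration), one option is to generate a single connection string for every client. The sketch below builds a standard MongoDB connection URI that lists several mongos hosts and sets `connectTimeoutMS` and `socketTimeoutMS`. The helper name, host names, and timeout values are assumptions for illustration, not the values used in the incident.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: builds a MongoDB connection URI that lists every
// mongos proxy and sets explicit timeouts in milliseconds.
std::string build_mongos_uri(const std::vector<std::string>& mongos_hosts,
                             unsigned connect_timeout_ms,
                             unsigned socket_timeout_ms) {
    std::string uri = "mongodb://";
    for (std::size_t i = 0; i < mongos_hosts.size(); ++i) {
        if (i != 0) uri += ',';  // hosts are comma-separated
        uri += mongos_hosts[i];
    }
    uri += "/?connectTimeoutMS=" + std::to_string(connect_timeout_ms);
    uri += "&socketTimeoutMS=" + std::to_string(socket_timeout_ms);
    return uri;
}
```

For example, `build_mongos_uri({"mongos-a.example:27017", "mongos-b.example:27017"}, 3000, 5000)` yields a URI with both proxies and timeouts of a few seconds, so a slow response no longer forces an immediate disconnect and reconnect.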
Conclusion
The outage was caused by a combination of mis‑configured client timeouts, excessive connection churn, and a kernel‑level random‑number generation bottleneck in mongos. Normalizing client settings and moving random number generation to user space resolves the high system‑CPU issue and prevents future avalanche failures.
dbaplus Community
