Why MongoDB mongos Proxies Crash Under Load and How to Fix It

A high‑traffic Java service backed by MongoDB suffered intermittent latency spikes and, later, a full outage. Log analysis, packet captures, and load tests traced the cause to excessive connection churn, a kernel‑level random‑number‑generation bottleneck, and misconfigured client timeouts. This article walks through the diagnosis and the concrete mitigation steps.

dbaplus Community

Problem Background

A core Java service that holds long‑lived client connections stores its data in MongoDB. Hundreds of client machines connect to a single MongoDB cluster. The system experienced repeated performance jitter and, later, an "avalanche" outage in which traffic dropped to zero and did not recover automatically.

Cluster Architecture

The deployment spans three data‑centers with a multi‑active setup: each site runs its own mongos proxies and a set of data nodes (A‑site: 1 primary + 1 secondary, B‑site: 2 secondaries, C‑site: 1 arbiter). Clients are configured to connect to the proxy of the same site; proxies are stateless, so a failure in one site should not affect the others.

Fault Investigation

1. Storage node slow‑log analysis

CPU, memory, I/O and load on all mongod nodes were normal. Slow‑log thresholds were set to 30 ms; no slow‑log entries appeared during jitter periods, indicating storage nodes were not the cause.

2. mongos proxy analysis

After the cluster was migrated to a custom monitoring platform, the metrics showed QPS spikes that coincided with the jitter events. During those spikes, the mongos logs recorded thousands of connections being established and closed within a single second.

During these spikes the proxy CPU load was dominated by system (sy %) time, while user (us %) time remained low, and load averages reached hundreds.

3. Connection‑churn root cause

Packet captures revealed that each new connection performed a db.isMaster() call followed by SASL authentication. The first SASL step generates random numbers by reading /dev/urandom. Because mongos creates a dedicated thread per connection, many threads simultaneously read /dev/urandom, contending on a kernel spinlock and driving system‑CPU (sy %) to 100 %.
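The per‑connection cost can be illustrated with a minimal C++ sketch. It is not the mongos code: the helper name readKernelNonce and the 24‑byte nonce size used in the usage note are assumptions made for illustration. It shows only the kind of read that each connection thread issues against /dev/urandom during the first SASL step.

```cpp
#include <cstddef>
#include <fstream>
#include <ios>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not a MongoDB function): read `n` random bytes
// from the kernel through /dev/urandom. Every call goes through the
// kernel, which is where concurrent readers contend.
std::vector<unsigned char> readKernelNonce(std::size_t n) {
    std::ifstream urandom("/dev/urandom", std::ios::in | std::ios::binary);
    if (!urandom) {
        throw std::runtime_error("cannot open /dev/urandom");
    }
    std::vector<unsigned char> nonce(n);
    urandom.read(reinterpret_cast<char*>(nonce.data()),
                 static_cast<std::streamsize>(n));
    if (urandom.gcount() != static_cast<std::streamsize>(n)) {
        throw std::runtime_error("short read from /dev/urandom");
    }
    return nonce;
}
```

A call such as readKernelNonce(24) would run once per new connection, in that connection's own thread, so thousands of simultaneous handshakes translate into thousands of concurrent reads of the same device.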

Simulation of the Fault

Modify mongos source to delay every request by 600 ms.

Run two mongos instances on the same machine, distinguished by port.

Launch 6 000 concurrent client connections with a 500 ms timeout.

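Because the 500 ms client timeout was shorter than the 600 ms injected delay, every request timed out and forced a reconnect. The simulation reproduced the rapid connect‑disconnect pattern and the associated CPU spike.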
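The connection rate that such a setup produces can be estimated with a rough model, written here as a small C++ function. The function name reconnectsPerSecond and the model itself are assumptions: it supposes that a client whose request exceeds its timeout closes the connection and reconnects immediately, and that connection setup takes no time. It is an estimate, not a measurement.

```cpp
#include <cstdint>

// Hypothetical model: estimated number of new connections per second
// that a proxy must accept when requests outlast the client timeout.
// Assumes every timed-out client reconnects at once.
std::uint64_t reconnectsPerSecond(std::uint64_t clients,
                                  std::uint64_t serverDelayMs,
                                  std::uint64_t clientTimeoutMs) {
    if (clientTimeoutMs == 0) {
        return 0;  // treated here as "no timeout configured"
    }
    if (serverDelayMs <= clientTimeoutMs) {
        return 0;  // replies arrive in time, so connections are reused
    }
    // Each client gives up and reconnects once per timeout interval.
    return clients * 1000 / clientTimeoutMs;
}
```

With the simulation's parameters (6 000 clients, 600 ms delay, 500 ms timeout) the model gives 6 000 × 1 000 / 500 = 12 000 new connections per second, each of which repeats the handshake and authentication.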

Kernel Version Impact

On Linux 2.6 kernels, sy % CPU reached 100 % with as few as 1 500 concurrent connections. On Linux 3.10, the same workload caused sy % to rise gradually, reaching ~30 % at 20 000 connections, indicating a performance improvement but not a complete fix.

Root‑Cause Analysis

Each new client connection triggers SASL authentication, which reads /dev/urandom to generate a nonce. When many threads read this device at the same time, they contend on a kernel spinlock (_spin_lock_irqsave), which saturates kernel‑mode CPU.

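// State of the user-space pseudo-random generator: four 32-bit words
// that a xorshift step updates on every call.
class PseudoRandom {
    uint32_t _x;
    uint32_t _y;
    uint32_t _z;
    uint32_t _w;
};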

In the MongoDB source, /dev/urandom is read in the first step of server‑side SASL SCRAM‑SHA‑1 authentication and in its client‑side counterpart.

Relevant Source Files

https://github.com/y123456yz/reading-and-annotate-mongodb-3.6/blob/master/mongo/src/mongo/platform/random.cpp

https://github.com/y123456yz/reading-and-annotate-mongodb-3.6/blob/master/mongo/src/mongo/transport/service_executor_adaptive.cpp

https://github.com/y123456yz/reading-and-annotate-mongodb-3.6/blob/master/mongo/src/mongo/transport/service_executor_synchronous.cpp

Mitigation Strategies

Standardize client configuration: set timeouts on the order of seconds rather than milliseconds, list every mongos proxy in the client configuration, and avoid depending on a single proxy.

Increase the number of mongos proxies to distribute connection load.

Replace kernel‑mode random number generation with a user‑space Xorshift algorithm (see the Wikipedia article on Xorshift). Because the generator no longer reads /dev/urandom, the contention disappears.

After applying the user‑space random generator, system‑CPU dropped dramatically and proxy throughput improved several‑fold under short‑connection workloads.
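For reference, the xorshift128 variant from Marsaglia's paper can be sketched as follows. This illustrates the technique and is not the patched mongos code: the class name XorShift128, the seeding scheme, and the fixed initial constants are assumptions. The state lives entirely in user space, so generating a value involves no system call and no kernel lock.

```cpp
#include <cstdint>

// Illustrative xorshift128 generator. The four state words mirror the
// _x/_y/_z/_w fields of the PseudoRandom fragment shown earlier.
class XorShift128 {
public:
    // Assumption: only the first word is seeded; the other three use
    // the default constants from Marsaglia's xorshift paper.
    explicit XorShift128(std::uint32_t seed)
        : _x(seed), _y(362436069u), _z(521288629u), _w(88675123u) {}

    // Advance the state by one step and return the next 32-bit value.
    std::uint32_t next() {
        std::uint32_t t = _x ^ (_x << 11);
        _x = _y;
        _y = _z;
        _z = _w;
        _w = _w ^ (_w >> 19) ^ (t ^ (t >> 8));
        return _w;
    }

private:
    std::uint32_t _x;
    std::uint32_t _y;
    std::uint32_t _z;
    std::uint32_t _w;
};
```

One possible arrangement, again an assumption about the design, is to give each connection thread its own generator instance so that no state is shared between threads.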

Conclusion

The outage was caused by a combination of mis‑configured client timeouts, excessive connection churn, and a kernel‑level random‑number generation bottleneck in mongos. Normalizing client settings and moving random number generation to user space resolves the high system‑CPU issue and prevents future avalanche failures.
