How We Fixed MongoDB Outages and Boosted Performance in Production
This article outlines MongoDB's key features, describes a real‑world outage caused by misconfigured connection limits, details the root‑cause analysis and temporary remediation, and presents a comprehensive set of configuration, sharding, and hardware optimizations that dramatically improved the system's reliability and throughput.
MongoDB Features
MongoDB is a scalable, high‑performance, document‑oriented NoSQL database. It offers schema‑free storage, rich indexing (single‑field, compound, multikey for arrays, text, geospatial, TTL, etc.), range and regex queries, and built‑in GridFS for large files. Replica sets provide data safety, while sharding enables horizontal scaling for large workloads.
Incident Overview
In an internal host‑monitoring system that collects IO, CPU, memory, filesystem, and network metrics at intervals ranging from seconds to ten minutes, the MongoDB cluster (initially three shards) began experiencing intermittent connection failures. Users reported that during certain time windows they could not connect to MongoDB. The problem cleared on its own after a few minutes, but overall response times degraded.
Root‑Cause Analysis
Log inspection revealed warnings about exhausted connections. MongoDB caps incoming connections at the smaller of maxConns (configured as 3000) and 80% of the OS limit on open file descriptors. The server's ulimit for open files was 1024, so the effective limit was 819 connections (80% of 1024, rounded down), far below what the workload required.
During peak sampling periods, the number of concurrent connections approached this limit, causing the observed intermittent outages.
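The arithmetic behind the 819 figure can be sketched in a few lines. This is only an illustration of the rule as stated above; the function name is hypothetical and the code does not query a real server.

```python
def effective_connection_limit(max_conns: int, nofile_limit: int) -> int:
    """Smaller of the configured maxConns and 80% of the open-file limit."""
    # Integer math keeps the 80% figure rounded down, as in the article.
    fd_based_limit = (nofile_limit * 8) // 10
    return min(max_conns, fd_based_limit)


# Values taken from the incident: maxConns=3000, ulimit -n = 1024.
print(effective_connection_limit(3000, 1024))   # 819
# After raising the open-file limit to 64000, maxConns becomes the binding limit.
print(effective_connection_limit(3000, 64000))  # 3000
```

With the default file-descriptor limit, the configured maxConns value never takes effect; it only becomes the real ceiling once the OS limit is raised.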
Temporary Remedy
Because MongoDB replica sets allow easy parameter changes, the team increased the open files limit to 64000 on each node and restarted the MongoDB service, immediately alleviating the connection‑exhaustion symptom.
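After a change like this, it is worth confirming that the new limit is actually in effect. The sketch below uses Python's standard resource module (Unix only) to read the open-file limit of the current process. It is an assumption that the check is run under the same account and service environment as mongod; the helper names are hypothetical.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def soft_limit_at_least(required: int) -> bool:
    """True if the soft open-file limit is unlimited or >= required."""
    soft, _hard = nofile_limits()
    return soft == resource.RLIM_INFINITY or soft >= required


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"soft={soft} hard={hard} meets 64000: {soft_limit_at_least(64000)}")
```

This reports the limits inherited by the Python process, not those of a running mongod. For a running process, the values in /proc/<pid>/limits are the authoritative ones.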
In‑Depth Post‑Mortem
The team questioned whether the usage pattern, configuration, and performance monitoring were appropriate. A deeper audit uncovered several issues:
All OS parameters were left at default values.
MongoDB configuration relied mostly on defaults.
Only a few collections were sharded, and the shard key was a time field, causing most writes to target a single shard.
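The hot-shard effect of a time-based shard key can be illustrated with a toy routing model. This is a simplified sketch, not MongoDB's actual chunk or hashing logic: the split points, host names, and the MD5-based hash are all assumptions chosen for the demonstration.

```python
import bisect
import hashlib
from collections import Counter
from typing import Sequence


def route_by_range(key: int, split_points: Sequence[int]) -> int:
    """Return the index of the shard whose contiguous key range holds key."""
    return bisect.bisect_right(split_points, key)


def route_by_hash(key: str, shard_count: int) -> int:
    """Return a shard index derived from a hash of the key."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# Hypothetical split points derived from historical timestamps:
# shard 0 owns keys < 1000, shard 1 owns 1000..1999, shard 2 owns >= 2000.
split_points = [1_000, 2_000]

# New samples always carry timestamps later than any existing data,
# so every insert falls into the last range.
new_timestamps = range(5_000, 5_300)
by_time = Counter(route_by_range(ts, split_points) for ts in new_timestamps)

# The same 300 writes keyed by a hashed host name spread across shards.
by_host = Counter(route_by_hash(f"host-{i % 50}", 3) for i in range(300))

print("time key :", dict(by_time))
print("host hash:", dict(by_host))
```

In the time-keyed case all 300 writes land on shard 2, which mirrors the imbalance the audit found.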
Optimization Measures
Based on the findings, the following adjustments were applied:
System Parameter Tuning
Per-user resource limits (limits.conf syntax):
mongo soft nofile 64000
mongo hard nofile 64000
mongo soft nproc 32000
mongo hard nproc 32000
Kernel parameters (sysctl syntax):
fs.file-max=98000
kernel.pid_max=64000
kernel.threads-max=64000
vm.max_map_count=128000
Additional changes included disabling NUMA, turning off Transparent Huge Pages, and setting appropriate readahead values (0 for WiredTiger).
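Settings such as Transparent Huge Pages are easy to lose after a reboot, so a small check can help. On Linux the current mode is normally exposed in /sys/kernel/mm/transparent_hugepage/enabled as a line like "always madvise [never]", with the active option in brackets. The sketch below parses that format; the function names are hypothetical, and the path is an assumption that may differ on some distributions.

```python
from pathlib import Path

# Assumed location of the THP setting on a typical Linux host.
THP_PATH = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def active_option(text: str) -> str:
    """Return the bracketed option from text such as 'always madvise [never]'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active option found in {text!r}")


def thp_disabled(path: Path = THP_PATH) -> bool:
    """True if the file at path reports 'never' as the active THP mode."""
    return active_option(path.read_text()) == "never"


print(active_option("always madvise [never]"))  # never
```

Calling thp_disabled() on each database host from a deployment check would flag nodes where the setting has reverted.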
Disk and Filesystem Settings
Use XFS with noatime,nodiratime for data and index directories.
Prefer noop I/O scheduler for MongoDB workloads.
MongoDB Configuration
Avoid single‑node deployments; use replica sets for fault tolerance.
Run one MongoDB instance per server to prevent resource contention.
Deploy mongos routers on application servers and enable multi‑route queries.
Enable compression (Snappy by default, optional Zlib) on WiredTiger.
Separate data and index paths onto different physical disks and enable directoryPerDB.
Set a sufficiently large oplogSize to avoid replication interruptions.
Enable authentication (with the understanding that it adds some overhead).
Select an appropriate shard key that distributes inserts evenly, supports locality for CRUD, provides fine‑grained chunk splitting, and is indexed. Good shard keys have high cardinality, are not monotonically increasing, and avoid hot chunks.
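For the oplogSize item above, one rough way to judge "sufficiently large" is to multiply the write volume by the longest outage a secondary should be able to survive without a full resync. The sketch below is a back-of-the-envelope estimate only: the function names, the headroom factor, and the example figures are all hypothetical and do not come from the original system.

```python
def required_oplog_mb(write_mb_per_hour: float,
                      window_hours: float,
                      headroom: float = 1.5) -> float:
    """Estimate an oplog size that covers window_hours of writes with headroom."""
    return write_mb_per_hour * window_hours * headroom


def oplog_window_hours(oplog_mb: float, write_mb_per_hour: float) -> float:
    """Estimate how many hours of writes an oplog of oplog_mb can hold."""
    return oplog_mb / write_mb_per_hour


# Hypothetical workload: 500 MB of oplog entries per hour, 24-hour target window.
size_mb = required_oplog_mb(500, 24)
print(size_mb)                           # 18000.0
print(oplog_window_hours(size_mb, 500))  # 36.0
```

The real write rate should be measured on the running cluster at peak, since monitoring workloads with bursty sampling can fill the oplog faster than the average suggests.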
Index Management
Ensure essential indexes exist to prevent full collection scans.
Prefer compound indexes over many single‑field indexes, paying attention to field order.
Use TTL indexes where applicable.
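Field order matters in a compound index because queries are served most efficiently when the fields they filter on form a leading prefix of the index. The sketch below checks only that prefix rule. It is a deliberate simplification (it ignores sort order, range predicates, and the fact that the planner can still partially use an index on its leading field), and the field names are hypothetical.

```python
from typing import Sequence


def is_index_prefix(index_fields: Sequence[str],
                    query_fields: Sequence[str]) -> bool:
    """True if the queried fields are exactly a leading prefix of the index."""
    wanted = set(query_fields)
    if not wanted or len(wanted) > len(index_fields):
        return False
    return set(index_fields[:len(wanted)]) == wanted


# Hypothetical compound index on a metrics collection.
index = ("host", "metric", "ts")

print(is_index_prefix(index, ["host"]))            # True
print(is_index_prefix(index, ["metric", "host"]))  # True
print(is_index_prefix(index, ["metric"]))          # False
print(is_index_prefix(index, ["host", "ts"]))      # False
```

One compound index on (host, metric, ts) therefore covers the same lookups as separate indexes on host and on (host, metric), which is why it is usually preferred over several single-field indexes.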
Monitoring and Maintenance
Enable MongoDB profiling (e.g., profile=1, slowms=200) to capture slow operations.
Configure log rotation to prevent log files from growing unchecked.
Adjust driver settings for read/write splitting and failover.
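The profiling settings mentioned above correspond to MongoDB's profile database command, which takes a level and a slowms threshold. The sketch below only builds the command document; the helper name is hypothetical, and it assumes a driver such as PyMongo would be used to send it, for example with db.command(build_profile_command()).

```python
def build_profile_command(level: int = 1, slow_ms: int = 200) -> dict:
    """Build the document for MongoDB's 'profile' command.

    level 0 turns profiling off, 1 records operations slower than slow_ms,
    and 2 records all operations.
    """
    if level not in (0, 1, 2):
        raise ValueError("profiling level must be 0, 1, or 2")
    if slow_ms < 0:
        raise ValueError("slow_ms must be non-negative")
    # The command name must be the first key in the document.
    return {"profile": level, "slowms": slow_ms}


print(build_profile_command())  # {'profile': 1, 'slowms': 200}
```

Level 1 with a 200 ms threshold matches the values in the list above. Level 2 is usually reserved for short debugging sessions because it records every operation.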
Results
After applying the above changes, extensive load testing showed a substantial performance increase. The revised shard key ({Locality: 1, search: 1}) and tuned system parameters reduced latency and eliminated the connection‑exhaustion outages.
Key Takeaways
Establish robust operational processes and incident response mechanisms.
Standardize deployment templates (single instance per host, replica sets, proper sharding).
Allocate sufficient memory to hold indexes and hot data.
Use SSDs and RAID‑10 for high I/O workloads.
Prefer multi‑core CPUs; WiredTiger scales with core count.
Synchronize time across nodes with NTP.
Disable NUMA and Transparent Huge Pages for predictable performance.
Fine‑tune OS limits (open files, processes, kernel parameters) to match MongoDB's needs.
Choose shard keys wisely to avoid hot spots and ensure balanced distribution.
Maintain appropriate indexes and avoid index overuse.
Enable profiling and log rotation for observability.
dbaplus Community