How We Fixed MongoDB Outages and Boosted Performance in Production

This article outlines MongoDB's key features, describes a real‑world outage caused by misconfigured connection limits, details the root‑cause analysis and temporary remediation, and presents a comprehensive set of configuration, sharding, and hardware optimizations that dramatically improved the system's reliability and throughput.

dbaplus Community

Dec 11, 2018

MongoDB Features

MongoDB is a scalable, high‑performance document‑oriented NoSQL database. It offers schema‑free storage, rich indexing (single‑key, multikey, array, text, geospatial, TTL, etc.), range and regex queries, and built‑in GridFS for large files. Replica sets provide data safety, while sharding enables horizontal scaling for large workloads.

Incident Overview

An internal host‑monitoring system collects I/O, CPU, memory, filesystem, and network metrics at intervals ranging from seconds to ten minutes. Its MongoDB cluster (initially three shards) began experiencing intermittent connection failures: users reported that during certain time windows they could not connect to MongoDB at all, although the problem cleared on its own after a few minutes. Overall response times also degraded.

Root‑Cause Analysis

Log inspection revealed warnings about exhausted connections. The maximum number of connections is the smaller of maxConns (configured as 3000) and 80% of the OS limit on open file descriptors. The server's ulimit for open files was 1024, so the effective connection limit was 819 (80% of 1024, rounded down), far below what the workload demanded.

During peak sampling periods, the number of concurrent connections approached this limit, causing the observed intermittent outages.
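The arithmetic behind that limit can be checked with a few lines of Python. This is only a sketch of the rule as stated above (the smaller of maxConns and 80% of the open‑file limit); the function name is hypothetical and is not part of MongoDB.

```python
def effective_max_connections(max_conns: int, open_files_limit: int) -> int:
    """Return the smaller of maxConns and 80% of the open-file limit."""
    # 80% is 4/5; integer arithmetic rounds down without float error.
    fd_based_cap = open_files_limit * 4 // 5
    return min(max_conns, fd_based_cap)


if __name__ == "__main__":
    print(effective_max_connections(3000, 1024))   # 819
    print(effective_max_connections(3000, 64000))  # 3000
```

With the default limit of 1024 the file‑descriptor cap dominates; once the limit is raised to 64000, maxConns becomes the binding value.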

Temporary Remedy

Because a MongoDB replica set makes it straightforward to change parameters and restart nodes, the team raised the open files limit to 64000 on each node and restarted the MongoDB service. This immediately relieved the connection‑exhaustion symptom.
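One way to confirm that a raised limit is actually in effect is to query it from a process started in the same environment as the database. The sketch below uses Python's standard resource module (Unix only). It reports the limit of the Python process itself, so it would need to be run as the same user and through the same startup path as MongoDB to be meaningful. The helper name is hypothetical.

```python
import resource


def open_files_limit_ok(required: int) -> bool:
    """True if this process's soft open-file limit is at least `required`."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # RLIM_INFINITY means "no limit", which satisfies any requirement.
    return soft == resource.RLIM_INFINITY or soft >= required


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"soft={soft} hard={hard} meets 64000: {open_files_limit_ok(64000)}")
```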

In‑Depth Post‑Mortem

The team questioned whether the usage pattern, configuration, and performance monitoring were appropriate. A deeper audit uncovered several issues:

All OS parameters were left at default values.

MongoDB configuration relied mostly on defaults.

Only a few collections were sharded, and the shard key was a time field, causing most writes to target a single shard.

Optimization Measures

Based on the findings, the following adjustments were applied:

System Parameter Tuning

# limits.conf entries for the mongo user
mongo soft nofile 64000
mongo hard nofile 64000
mongo soft nproc 32000
mongo hard nproc 32000

# sysctl kernel parameters
fs.file-max=98000
kernel.pid_max=64000
kernel.threads-max=64000
vm.max_map_count=128000

Additional changes included disabling NUMA, turning off Transparent Huge Pages, and setting appropriate readahead values (0 for WiredTiger).
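Settings like these are easy to apply on one node and forget on another, so a small audit script can help. The following is a minimal sketch, assuming a Linux host where kernel parameters are exposed under /proc/sys and the Transparent Huge Pages mode under /sys/kernel/mm/transparent_hugepage/enabled. The function names are hypothetical, and the expected values are the ones listed above.

```python
from pathlib import Path


def parse_sysctl(text: str) -> dict:
    """Parse 'key=value' lines (sysctl.conf style) into a dict of ints."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        key, _, value = line.partition("=")
        settings[key.strip()] = int(value.strip())
    return settings


def sysctl_path(key: str) -> Path:
    """Map a sysctl key such as 'fs.file-max' to its /proc/sys path."""
    return Path("/proc/sys") / key.replace(".", "/")


def active_thp_mode(text: str) -> str:
    """Extract the bracketed mode from e.g. 'always madvise [never]'."""
    start, end = text.index("["), text.index("]")
    return text[start + 1:end]


EXPECTED = parse_sysctl("""
fs.file-max=98000
kernel.pid_max=64000
kernel.threads-max=64000
vm.max_map_count=128000
""")

if __name__ == "__main__":
    for key, want in EXPECTED.items():
        path = sysctl_path(key)
        have = path.read_text().strip() if path.exists() else "unavailable"
        print(f"{key}: expected {want}, found {have}")
    thp = Path("/sys/kernel/mm/transparent_hugepage/enabled")
    if thp.exists():
        print("THP mode:", active_thp_mode(thp.read_text()))
```

The script only reports differences; it does not change any setting.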

Disk and Filesystem Settings

Use XFS with noatime,nodiratime for data and index directories.

Prefer noop I/O scheduler for MongoDB workloads.

MongoDB Configuration

Avoid single‑node deployments; use replica sets for fault tolerance.

Run one MongoDB instance per server to prevent resource contention.

Deploy mongos routers on the application servers and configure clients with more than one mongos so that queries can be routed through several routers.

Enable compression (Snappy by default, optional Zlib) on WiredTiger.

Separate data and index paths onto different physical disks and enable directoryPerDB.

Set a sufficiently large oplogSize to avoid replication interruptions.

Enable authentication (with the understanding that it adds some overhead).

Select an appropriate shard key that distributes inserts evenly, supports locality for CRUD, provides fine‑grained chunk splitting, and is indexed. Good shard keys have high cardinality, are not monotonically increasing, and avoid hot chunks.
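The hot‑chunk effect of a monotonically increasing shard key can be illustrated with a toy simulation. This is a simplified model, not MongoDB's balancer: chunks are plain key ranges assigned to shards round‑robin, and the host names, split points, and the (host, time) compound key are assumptions chosen for illustration. They stand in for any key with a non‑monotonic leading field and are not the key the team adopted.

```python
import bisect
from collections import Counter


def shard_for(key, split_points, num_shards):
    """Range sharding: locate the chunk whose range holds `key`,
    then map chunks to shards round-robin (a simplification)."""
    chunk = bisect.bisect_right(split_points, key)
    return chunk % num_shards


NUM_SHARDS = 3
hosts = [f"host{i:02d}" for i in range(30)]  # hypothetical monitored hosts
# 300 fresh samples: every host reports at 10 new timestamps.
new_samples = [(h, t) for t in range(1000, 1010) for h in hosts]

# Time-only key: chunks were split on past timestamps, so every new
# timestamp falls into the last (open-ended) chunk.
time_splits = [250, 500, 750]
by_time = Counter(shard_for(t, time_splits, NUM_SHARDS) for _h, t in new_samples)

# Compound (host, time) key: chunks are split on the host prefix, so new
# samples land in many chunks at once.
host_splits = [(f"host{i:02d}", 0) for i in (5, 10, 15, 20, 25)]
by_host = Counter(shard_for((h, t), host_splits, NUM_SHARDS) for h, t in new_samples)

if __name__ == "__main__":
    print("time key:    ", dict(by_time))
    print("compound key:", dict(by_host))
```

In this model all 300 writes hit a single shard with the time‑only key, while the compound key spreads them evenly across the three shards.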

Index Management

Ensure essential indexes exist to prevent full collection scans.

Prefer compound indexes over many single‑field indexes, paying attention to field order.

Use TTL indexes where applicable.
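Field order matters in a compound index because the index can serve a query only through its leading fields. The sketch below models just that prefix rule for equality filters; it ignores range predicates, sorts, and everything else a real query planner considers. The index and field names are hypothetical.

```python
def usable_prefix_length(index_fields, equality_fields):
    """Count how many leading index fields the query constrains.

    A result of 0 means the query cannot use the index through its prefix.
    """
    count = 0
    for field in index_fields:
        if field not in equality_fields:
            break  # the first unconstrained field ends the usable prefix
        count += 1
    return count


if __name__ == "__main__":
    index = ["host", "metric", "ts"]  # hypothetical compound index
    print(usable_prefix_length(index, {"host", "metric"}))  # 2
    print(usable_prefix_length(index, {"metric", "ts"}))    # 0
    print(usable_prefix_length(index, {"host", "ts"}))      # 1
```

A single index on (host, metric, ts) therefore covers queries on host alone and on host plus metric, which is why one well‑ordered compound index can replace several single‑field indexes.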

Monitoring and Maintenance

Enable MongoDB profiling (e.g., profile=1, slowms=200) to capture slow operations.

Configure log rotation to prevent log files from growing unchecked.

Adjust driver settings for read/write splitting and failover.
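Driver‑side read/write splitting and failover are commonly expressed through the connection string. As a rough sketch, the helper below assembles a MongoDB URI with several seed hosts and the standard readPreference and maxPoolSize options. The host names, database name, and option values are placeholders, not the settings used in the incident described here.

```python
from urllib.parse import urlencode


def build_mongodb_uri(hosts, database, **options):
    """Join seed hosts and URI options into a mongodb:// connection string."""
    host_list = ",".join(hosts)   # several hosts give the driver failover targets
    query = urlencode(options)    # options keep their insertion order
    return f"mongodb://{host_list}/{database}?{query}"


uri = build_mongodb_uri(
    ["mongos1.example.net:27017", "mongos2.example.net:27017"],  # placeholders
    "monitoring",                                                # placeholder
    readPreference="secondaryPreferred",  # send reads to secondaries when possible
    maxPoolSize=100,                      # cap connections opened per client
)

if __name__ == "__main__":
    print(uri)
```

Capping maxPoolSize on every client also bounds the total number of connections the cluster can receive, which ties back to the connection‑exhaustion problem.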

Results

After applying the above changes, extensive load testing showed a substantial performance increase. The revised shard key ({Locality: 1, search: 1}) and tuned system parameters reduced latency and eliminated the connection‑exhaustion outages.

Key Takeaways

Establish robust operational processes and incident response mechanisms.

Standardize deployment templates (single instance per host, replica sets, proper sharding).

Allocate sufficient memory to hold indexes and hot data.

Use SSDs and RAID‑10 for high I/O workloads.

Prefer multi‑core CPUs; WiredTiger scales with core count.

Synchronize time across nodes with NTP.

Disable NUMA and Transparent Huge Pages for predictable performance.

Fine‑tune OS limits (open files, processes, kernel parameters) to match MongoDB's needs.

Choose shard keys wisely to avoid hot spots and ensure balanced distribution.

Maintain appropriate indexes and avoid index overuse.

Enable profiling and log rotation for observability.
