Databases 9 min read

How I Resolved a Critical MongoDB Sharding Failure in a Billion‑Row Cluster

This article recounts a real‑world MongoDB sharding incident on a core cluster with billions of records, detailing the root causes, step‑by‑step remediation using removeShard and movePrimary, and the lessons learned to prevent similar outages.

dbaplus Community

Jun 17, 2021

How I Resolved a Critical MongoDB Sharding Failure in a Billion‑Row Cluster

Background

A core MongoDB sharded cluster that stores a billion‑row table suffered an outage after a change to the userbucket database. The cluster consisted of three shards, each implemented as a replica set, and handled low read/write traffic (peak QPS 40‑60 k).

Cluster Overview

The cluster hosted two databases: userbucket – stores routing information for all users (≈300 documents). feeds_xxxxxxx – contains the billion‑row collection feeds_xxxxxxx.collection1, which is the only sharded collection.

Most read traffic targets shard 1; shards 2 and 3 hold only a small fraction of data. The diagram below shows the original topology.

Identified Issues

Shards 2 and 3 were largely idle and used low‑IO SATA disks, which could degrade performance of the userbucket reads.

During a movePrimary operation the primary shard for userbucket was moved from shard 3 to shard 1, but several mongos routers were not refreshed, leaving them with stale routing tables.

Remediation Procedure

Connect to any mongos instance (e.g., mongos1).

Run the movePrimary command to transfer the primary of userbucket from shard 3 to shard 1: sh.movePrimary("userbucket", "shard001") On every other mongos instance (e.g., mongos2, mongos3) force a routing table refresh: db.adminCommand({"flushRouterConfig": 1}) After confirming that the routing tables are consistent, remove the unused shards to reclaim resources:

sh.removeShard("shard002")
sh.removeShard("shard003")

User Impact

After moving the primary, some users reported that a portion of the core data became inaccessible. Only the 300 routing documents in userbucket had been updated; the stale mongos routers still held the old routing metadata, causing intermittent failures when the application tried to resolve user routes.

Flushing the router configuration on all mongos instances (or restarting the stale routers) restored full service.

Root Cause Analysis

The outage was traced to incomplete monitoring of mongos instances. Several routers were omitted from the monitoring list, so their routing tables were never refreshed after the movePrimary. Requests routed through those stale routers could not locate the required data.

Best Practices for Safe movePrimary

Method 1: Single‑router migration

Shut down all mongos routers except one. Perform the movePrimary on the remaining router, then restart the other routers. Before starting, query the config.mongos collection to ensure no hidden routers are present.

Method 2: Chunk‑driven migration

If the primary shard to be removed still hosts critical databases, enable sharding on those collections first. When removeShard is executed, MongoDB will automatically migrate the chunks to other shards. All routers receive updated routing information via the chunk versioning mechanism, eliminating the need for manual router refreshes.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

cluster troubleshooting MongoDB movePrimary removeShard

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.

Background

Cluster Overview

Identified Issues

Remediation Procedure

User Impact

Root Cause Analysis

Best Practices for Safe movePrimary

Method 1: Single‑router migration

Method 2: Chunk‑driven migration

dbaplus Community

How this landed with the community

Was this worth your time?

0 Comments

Method 1: Single‑router migration

Method 2: Chunk‑driven migration