Databases 9 min read

How I Resolved a Critical MongoDB Sharding Failure in a Billion‑Row Cluster

This article recounts a real‑world MongoDB sharding incident on a core cluster with billions of records, detailing the root causes, step‑by‑step remediation using removeShard and movePrimary, and the lessons learned to prevent similar outages.

dbaplus Community
dbaplus Community
dbaplus Community
How I Resolved a Critical MongoDB Sharding Failure in a Billion‑Row Cluster

Background

A core MongoDB sharded cluster that stores a billion‑row table suffered an outage after a change to the userbucket database. The cluster consisted of three shards, each implemented as a replica set, and handled low read/write traffic (peak QPS 40‑60 k).

Cluster Overview

The cluster hosted two databases: userbucket – stores routing information for all users (≈300 documents). feeds_xxxxxxx – contains the billion‑row collection feeds_xxxxxxx.collection1, which is the only sharded collection.

Most read traffic targets shard 1; shards 2 and 3 hold only a small fraction of data. The diagram below shows the original topology.

Cluster topology
Cluster topology

Identified Issues

Shards 2 and 3 were largely idle and used low‑IO SATA disks, which could degrade performance of the userbucket reads.

During a movePrimary operation the primary shard for userbucket was moved from shard 3 to shard 1, but several mongos routers were not refreshed, leaving them with stale routing tables.

Remediation Procedure

Connect to any mongos instance (e.g., mongos1).

Run the movePrimary command to transfer the primary of userbucket from shard 3 to shard 1: sh.movePrimary("userbucket", "shard001") On every other mongos instance (e.g., mongos2, mongos3) force a routing table refresh: db.adminCommand({"flushRouterConfig": 1}) After confirming that the routing tables are consistent, remove the unused shards to reclaim resources:

sh.removeShard("shard002")
sh.removeShard("shard003")
Removal steps
Removal steps

User Impact

After moving the primary, some users reported that a portion of the core data became inaccessible. Only the 300 routing documents in userbucket had been updated; the stale mongos routers still held the old routing metadata, causing intermittent failures when the application tried to resolve user routes.

User impact
User impact

Flushing the router configuration on all mongos instances (or restarting the stale routers) restored full service.

Root Cause Analysis

The outage was traced to incomplete monitoring of mongos instances. Several routers were omitted from the monitoring list, so their routing tables were never refreshed after the movePrimary. Requests routed through those stale routers could not locate the required data.

Root cause diagram
Root cause diagram

Best Practices for Safe movePrimary

Method 1: Single‑router migration

Shut down all mongos routers except one. Perform the movePrimary on the remaining router, then restart the other routers. Before starting, query the config.mongos collection to ensure no hidden routers are present.

Method 2: Chunk‑driven migration

If the primary shard to be removed still hosts critical databases, enable sharding on those collections first. When removeShard is executed, MongoDB will automatically migrate the chunks to other shards. All routers receive updated routing information via the chunk versioning mechanism, eliminating the need for manual router refreshes.

Method comparison
Method comparison
Workflow diagram
Workflow diagram
Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

ClustertroubleshootingMongoDBmovePrimaryremoveShard
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.