How I Resolved a Critical MongoDB Sharding Failure in a Billion‑Row Cluster
This article recounts a real‑world MongoDB sharding incident on a core cluster with billions of records, detailing the root causes, step‑by‑step remediation using removeShard and movePrimary, and the lessons learned to prevent similar outages.
Background
A core MongoDB sharded cluster that stores a billion‑row table suffered an outage after a change to the userbucket database. The cluster consisted of three shards, each implemented as a replica set, and handled low read/write traffic (peak QPS 40‑60 k).
Cluster Overview
The cluster hosted two databases:
userbucket – stores routing information for all users (≈300 documents).
feeds_xxxxxxx – contains the billion‑row collection feeds_xxxxxxx.collection1, which is the only sharded collection.
Most read traffic targets shard 1; shards 2 and 3 hold only a small fraction of data. The diagram below shows the original topology.
Identified Issues
Shards 2 and 3 were largely idle and used low‑IO SATA disks, which could degrade performance of the userbucket reads.
During a movePrimary operation the primary shard for userbucket was moved from shard 3 to shard 1, but several mongos routers were not refreshed, leaving them with stale routing tables.
Remediation Procedure
Connect to any mongos instance (e.g., mongos1).
Run the movePrimary command to transfer the primary of userbucket from shard 3 to shard 1:
db.adminCommand({ movePrimary: "userbucket", to: "shard001" })
On every other mongos instance (e.g., mongos2, mongos3), force a routing table refresh:
db.adminCommand({ flushRouterConfig: 1 })
After confirming that the routing tables are consistent, remove the unused shards to reclaim resources:
db.adminCommand({ removeShard: "shard002" })
db.adminCommand({ removeShard: "shard003" })
User Impact
After the primary was moved, some users reported that a portion of the core data had become inaccessible. Only the ≈300 routing documents in userbucket had actually been relocated, but the stale mongos routers still held the old routing metadata and kept looking for userbucket on its former primary shard. The result was intermittent failures whenever the application tried to resolve a user's route through one of those routers.
Flushing the router configuration on all mongos instances (or restarting the stale routers) restored full service.
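The flush step can be scripted so that no router is skipped. The following is a minimal sketch, not the procedure actually used during the incident: flushAllRouters and the injected connect function are hypothetical names, and the host list is assumed to come from your own inventory.

```javascript
// Hypothetical helper: run flushRouterConfig on every mongos in `hosts`.
// `connect` is injected so the function is not tied to one shell or driver.
// In mongosh it could be (assumption):
//   const connect = (host) => new Mongo(`mongodb://${host}`).getDB("admin");
function flushAllRouters(hosts, connect) {
  const results = [];
  for (const host of hosts) {
    try {
      const admin = connect(host);
      // flushRouterConfig clears the router's cached routing metadata.
      const res = admin.runCommand({ flushRouterConfig: 1 });
      results.push({ host, ok: res.ok === 1 });
    } catch (err) {
      // Record the failure instead of aborting, so the remaining routers
      // are still flushed and the operator sees which ones need attention.
      results.push({ host, ok: false, error: String(err.message || err) });
    }
  }
  return results;
}
```

Any entry with ok set to false marks a router that should be flushed again or restarted before traffic is trusted to it.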
Root Cause Analysis
The outage was traced to incomplete monitoring of mongos instances. Several routers were omitted from the monitoring list, so their routing tables were never refreshed after the movePrimary. Requests routed through those stale routers could not locate the required data.
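A gap like this can be detected by comparing the routers the cluster knows about with the monitoring inventory. The sketch below assumes the router documents come from the config.mongos collection (where _id is the router's host:port string); findUnmonitoredRouters and the monitoredHosts list are hypothetical.

```javascript
// Hypothetical check: which routers registered in config.mongos are missing
// from the monitoring inventory?
// In mongosh the documents could be fetched with (assumption):
//   db.getSiblingDB("config").mongos.find().toArray()
function findUnmonitoredRouters(configMongosDocs, monitoredHosts) {
  // Host names are compared case-insensitively.
  const monitored = new Set(monitoredHosts.map((h) => h.toLowerCase()));
  return configMongosDocs
    .map((doc) => doc._id) // "host:port" of each registered router
    .filter((id) => !monitored.has(String(id).toLowerCase()));
}
```

An empty result suggests the inventory covers every registered router; any host returned is a candidate for the kind of stale router described above.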
Best Practices for Safe movePrimary
Method 1: Single‑router migration
Shut down all mongos routers except one. Perform the movePrimary on the remaining router, then restart the other routers. Before starting, query the config.mongos collection to ensure no hidden routers are present.
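The hidden-router check can be approximated by looking at the ping timestamp each router writes to config.mongos. This is a rough sketch: recentlyActiveRouters and the window size are assumptions, and entries for decommissioned routers may linger in the collection, which is why the filter looks at recency rather than mere presence.

```javascript
// Hypothetical pre-flight check: list routers whose last ping in
// config.mongos falls within `windowMs` of `now`.
// In mongosh the documents could be fetched with (assumption):
//   db.getSiblingDB("config").mongos.find().toArray()
function recentlyActiveRouters(configMongosDocs, now, windowMs) {
  return configMongosDocs
    .filter(
      (doc) =>
        doc.ping instanceof Date &&
        now.getTime() - doc.ping.getTime() <= windowMs
    )
    .map((doc) => doc._id);
}

// Example: routers seen in the last two minutes.
// const active = recentlyActiveRouters(docs, new Date(), 2 * 60 * 1000);
```

If the list contains anything other than the single router you intend to keep running, another mongos is still alive and should be stopped before the movePrimary.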
Method 2: Chunk‑driven migration
If the primary shard to be removed still hosts critical databases, enable sharding on those collections first. When removeShard is executed, MongoDB will automatically migrate the chunks to other shards. All routers receive updated routing information via the chunk versioning mechanism, eliminating the need for manual router refreshes.
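Draining is not instantaneous: removeShard is typically re-run until it reports that the shard is gone. The sketch below interprets a single response, assuming the documented response fields state, remaining.chunks, and dbsToMove; summarizeDrain is a hypothetical helper, and the response shape should be verified against your MongoDB version.

```javascript
// Hypothetical helper: summarize one response from
//   db.adminCommand({ removeShard: "<shard name>" })
function summarizeDrain(res) {
  if (!res || res.ok !== 1) {
    return { failed: true, done: false, chunksRemaining: null, dbsToMove: [] };
  }
  const done = res.state === "completed";
  const remaining = res.remaining || {};
  return {
    failed: false,
    done,
    // The first ("started") response may carry no chunk count; report null then.
    chunksRemaining: done
      ? 0
      : remaining.chunks === undefined
      ? null
      : remaining.chunks,
    // Databases still using this shard as their primary; these need
    // movePrimary (or must be dropped) before removal can complete.
    dbsToMove: res.dbsToMove || [],
  };
}
```

A non-empty dbsToMove is a reminder that databases whose primary still sits on the draining shard must be dealt with before the removal can finish.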
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
