
MongoDB Cluster Point-in-Time Recovery (PITR) Procedure Using Shard and Config Server Restoration

This article demonstrates a step‑by‑step point‑in‑time recovery of a MongoDB sharded cluster by restoring shard instances, replaying oplog entries to a specific timestamp, updating metadata, and finally rebuilding the config server and mongos to achieve a consistent snapshot with max(id)=15.

Aikesheng Open Source Community

1. Background

Most online examples cover PITR for a single‑node MongoDB instance, and the official documentation provides recovery steps for a MongoDB cluster but lacks a concrete PITR workflow. This article presents an experimental setup that simulates an online environment and performs a PITR on a MongoDB sharded cluster.

Original cluster topology

172.16.129.170 shard1 27017 shard2 27018 config 37017 mongos 47017
172.16.129.171 shard1 27017 shard2 27018 config 37017 mongos 47017
172.16.129.172 shard1 27017 shard2 27018 config 37017 mongos 47017

For the demo we restore each shard as a single instance (sufficient for developer queries). The restored topology becomes:

172.16.129.173 shard1 27017 shard2 27018 config 37017 mongos 47017
172.16.129.174 config 37017
172.16.129.175 config 37017

The MongoDB version is Percona Server for MongoDB 4.2.13. Each shard and the config server run scheduled hot-backup and oplog-backup scripts. Test data is created via mongos by enabling sharding, creating a hashed-sharded collection, and inserting 10 documents:

use admin
db.runCommand({ enableSharding: "renkun" })
sh.shardCollection("renkun.user", { id: "hashed" })
use renkun
var tmp = [];
for (var i = 0; i < 10; i++) {
  tmp.push({ id: i, name: "Kun " + i });
}
db.user.insertMany(tmp);

After taking physical hot-backups of shard1, shard2 and the config server, another 10 documents are inserted, bringing the total to 20 documents (id 0-19). The goal is to restore the cluster to the point in time where max(id)=15.
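The recovery target can be stated concretely: of the 20 documents, only those inserted up to and including id=15 should survive the restore. A minimal Python sketch of that bookkeeping:

```python
# All documents inserted before the incident: id 0..19.
all_ids = list(range(20))

# PITR target: restore to the moment the document with id=15 was
# inserted, so every document with id <= 15 survives.
target = 15
surviving = [i for i in all_ids if i <= target]

print(len(surviving), max(surviving))  # 16 15
```

So a successful PITR should leave exactly 16 documents in the cluster.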

2. Restoring Shard Instances

2.1 Identify the target snapshot

For each shard we dump the oplog BSON file to JSON and locate the entry that inserted id=15:

bsondump oplog.rs.bson > oplog.rs.json
more oplog.rs.json | egrep "\"op\":\"i\",\"ns\":\"renkun\.user\"" | grep "\"Kun 15\""

The matching entry on shard2 carries the timestamp {"t":1623408268,"i":6}. Because mongorestore --oplogLimit is exclusive (only entries strictly before the given timestamp are replayed), we add one to the increment so the id=15 insert itself is applied, giving 1623408268:7 as the effective limit.
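The same lookup can be scripted instead of eyeballed. A hedged Python sketch that scans bsondump's JSON output (one document per line; the sample lines below are illustrative, not taken from the actual backup) and derives the exclusive --oplogLimit:

```python
import json

# Hypothetical sample lines in the shape bsondump emits (one JSON doc per line).
sample = [
    '{"ts":{"$timestamp":{"t":1623408268,"i":5}},"op":"i","ns":"renkun.user","o":{"id":14,"name":"Kun 14"}}',
    '{"ts":{"$timestamp":{"t":1623408268,"i":6}},"op":"i","ns":"renkun.user","o":{"id":15,"name":"Kun 15"}}',
]

def find_limit(lines, target_id):
    """Return the --oplogLimit string that includes the insert of target_id."""
    for line in lines:
        entry = json.loads(line)
        if (entry.get("op") == "i"
                and entry.get("ns") == "renkun.user"
                and entry["o"].get("id") == target_id):
            ts = entry["ts"]["$timestamp"]
            # --oplogLimit is exclusive, so add one to the increment.
            return f'{ts["t"]}:{ts["i"] + 1}'

print(find_limit(sample, 15))  # 1623408268:7
```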

2.2 Create a temporary restoration user

Log in as root on each shard and create a user named internal_restore with the __system role. This user is needed to replay the oplog, modify admin.system.version, and drop the local database.

use admin
db.createUser({ user: "internal_restore", pwd: "internal_restore", roles: ["__system"] })

Log in as internal_restore and drop the local database:

use local
db.dropDatabase()

2.3 Replay oplog up to the target timestamp

mongorestore -h 127.0.0.1 -u internal_restore -p "internal_restore" --port 27017 \
  --oplogReplay --oplogLimit "1623408268:7" --authenticationDatabase admin \
  /data/backup/202106111849_27017/local/oplog.rs.bson

mongorestore -h 127.0.0.1 -u internal_restore -p "internal_restore" --port 27018 \
  --oplogReplay --oplogLimit "1623408268:7" --authenticationDatabase admin \
  /data/backup/202106111850_27018/local/oplog.rs.bson

Verify the restored data on each shard:

--shard1 27017
> db.user.find().sort({"id":1})
{ "_id" : ObjectId("..."), "id" : 3, "name" : "Kun 3" }
... (records up to id 12)

--shard2 27018
> db.user.find().sort({"id":1})
{ "_id" : ObjectId("..."), "id" : 0, "name" : "Kun 0" }
... (records up to id 15)

After replay, the two shards together contain 16 documents, with a maximum id of 15.
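The per-shard outputs above can be cross-checked by taking the union of the ids on both shards. A small Python sketch (the per-shard id sets below are a hypothetical hashed distribution, not the exact one from this run):

```python
# Hypothetical hashed-key distribution of the surviving documents.
shard1_ids = {3, 5, 6, 8, 9, 12}
shard2_ids = {0, 1, 2, 4, 7, 10, 11, 13, 14, 15}

# With a hashed shard key the two sets must be disjoint and their
# union must cover exactly id 0..15 after a consistent PITR.
assert shard1_ids.isdisjoint(shard2_ids)
all_ids = shard1_ids | shard2_ids

print(len(all_ids), max(all_ids))  # 16 15
```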

2.4 Update admin.system.version

Because the shards have changed from three‑node replica sets to single instances, the metadata must be corrected. Using internal_restore on each shard:

use admin
db.system.version.deleteOne({ _id: "minOpTimeRecovery" })
db.system.version.find({"_id" : "shardIdentity" })
db.system.version.updateOne(
  { "_id" : "shardIdentity" },
  { $set : { "configsvrConnectionString" : "configdb/172.16.129.173:37017,172.16.129.174:37017,172.16.129.175:37017" } }
)

Before deletion the old records still pointed to the original config servers (172.16.129.170‑172.16.129.172).
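The replacement connection string set above follows the standard replSetName/host,host,... format. A trivial sketch of assembling it from the restored topology:

```python
# Build the new configsvrConnectionString for the restored topology
# (replica-set name and hosts taken from this article's layout).
replset = "configdb"
hosts = [
    "172.16.129.173:37017",
    "172.16.129.174:37017",
    "172.16.129.175:37017",
]
conn = f"{replset}/{','.join(hosts)}"

print(conn)
```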

3. Restoring the Config Server

Transfer the physical backup of the config server, extract it to the data directory, and start it as a single instance. Create the same internal_restore user with the __system role.

3.1 Replay oplog

mongorestore -h 127.0.0.1 -u internal_restore -p "internal_restore" --port 37017 \
  --oplogReplay --oplogLimit "1623408268:7" --authenticationDatabase admin \
  /data/backup/202106111850_37017/local/oplog.rs.bson

3.2 Modify metadata

Log in as internal_restore and update the shard entries in the config.shards collection to point to the restored shard hosts:

use local
db.dropDatabase()
use config
db.shards.find()
db.shards.updateOne({ "_id" : "repl" }, { $set : { "host" : "172.16.129.173:27017" } })
db.shards.updateOne({ "_id" : "repl2" }, { $set : { "host" : "172.16.129.173:27018" } })
db.shards.find()
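Before applying the updates above it is worth sanity-checking that every shard recorded in the old metadata has a replacement host. A hedged Python sketch (the old host strings are reconstructed from the original three-node topology; treat them as illustrative):

```python
# Old config.shards entries, reconstructed from the original topology.
old_shards = {
    "repl":  "repl/172.16.129.170:27017,172.16.129.171:27017,172.16.129.172:27017",
    "repl2": "repl2/172.16.129.170:27018,172.16.129.171:27018,172.16.129.172:27018",
}
# New single-instance hosts for the restored shards.
new_hosts = {"repl": "172.16.129.173:27017", "repl2": "172.16.129.173:27018"}

# Every old shard _id must have a replacement before we touch metadata.
missing = set(old_shards) - set(new_hosts)
assert not missing, f"no replacement host for shards: {missing}"

# Filter/update documents mirroring the updateOne calls above.
updates = [({"_id": sid}, {"$set": {"host": h}}) for sid, h in sorted(new_hosts.items())]
print(updates)
```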

3.3 Start the config server replica set

Stop the single‑instance config server, then start it in replica‑set mode with the following configuration snippet:

sharding:
  clusterRole: configsvr
replication:
  oplogSizeMB: 10240
  replSetName: configdb

Initialize the replica set and add the two remaining members:

rs.initiate()
rs.add("172.16.129.174:37017")
rs.add("172.16.129.175:37017")

The config server is now a three‑node replica set.

4. Configuring mongos

Copy the original mongos configuration file and adjust the sharding and net.bindIp parameters to reflect the new config server addresses:

sharding:
  configDB: "configdb/172.16.129.173:37017,172.16.129.174:37017,172.16.129.175:37017"
net:
  port: 47017
  bindIp: 127.0.0.1,172.16.129.173

After starting mongos, querying the renkun.user collection returns 16 documents with max(id)=15, confirming a successful PITR.

mongos> use renkun
switched to db renkun
mongos> db.user.find().sort({"id":1})
{ "_id" : ObjectId("..."), "id" : 0, "name" : "Kun 0" }
... (up to id 15)

5. Summary

MongoDB 4.x introduced multi‑document transactions, and version 4.2 added cross‑shard transactions. Restoring data involved in such transactions requires special tools not covered in this guide; therefore the presented PITR procedure applies to non‑transactional workloads.

Tags: Backup, MongoDB, Restore, Oplog, PITR, Sharded Cluster