
Diagnosing and Resolving MongoDB and Redis Replication Failures: Oplog Issues and Recovery Strategies

This article analyzes a MongoDB replication failure caused by missing oplog entries, compares logical initialization repair methods with Redis replication, and provides practical steps—including oplog size adjustment and snapshot techniques—to restore synchronization and prevent similar issues in production environments.

Aikesheng Open Source Community

Background: In a production environment running MongoDB 4.4.14 with a PSS (Primary-Secondary-Secondary) architecture, a secondary node remained stuck in the STARTUP2 state with an optime of 1970 (the Unix epoch, meaning no oplog entry had ever been applied), preventing data sync. Copying the data directory from the primary via scp did not resolve the issue.
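A stuck member like this can be confirmed from the mongo shell by listing each member's state and last applied optime (fields as reported by rs.status()):

```javascript
// Run on any reachable member of the replica set
rs.status().members.forEach(function (m) {
  print(m.name + "  " + m.stateStr + "  " + tojson(m.optimeDate));
});
```

A healthy secondary reports SECONDARY with a recent optimeDate; the faulty node here showed STARTUP2 with an optimeDate at the epoch.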

Analysis: Directly copying a live data directory can produce an inconsistent copy; the appropriate fix is a logical initialization (a fresh initial sync). After killing the faulty instance, clearing its data directory, and restarting, the node still failed to sync, prompting log inspection. The logs showed repeated OplogStartMissing errors, indicating that the oldest entry remaining in the primary's oplog was already newer than the position from which the secondary needed to resume.
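The logical-initialization procedure above can be sketched as follows; the service name and data path are assumptions and must be adjusted to the actual deployment:

```shell
# Stop the faulty secondary (service name assumed)
systemctl stop mongod
# Preserve, then recreate, the data directory (path and owner assumed)
mv /var/lib/mongo /var/lib/mongo.bak
mkdir -p /var/lib/mongo && chown mongod:mongod /var/lib/mongo
# Restart; the member re-enters STARTUP2 and performs a full initial sync
systemctl start mongod
```

Keeping the old directory as a .bak copy costs disk space but allows rollback if the initial sync cannot complete.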

Root Causes: (1) a large data volume made the initial sync take so long that the oplog entries the secondary needed were overwritten before it could fetch them; (2) heavy write activity during the sync accelerated the oplog rollover.

Resolution: Temporarily increase the oplog size (e.g., double it) so that it retains entries for longer than a full initial sync takes, then restart the secondary. After the adjustment, replication recovered.
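Since MongoDB 3.6 the oplog can be resized online with the replSetResizeOplog admin command; a minimal mongo shell sketch (the 32768 MB target is an assumed value, not the article's):

```javascript
// On the member whose oplog should grow:
var oplog = db.getSiblingDB("local").oplog.rs;
print("current max size (MB): " + oplog.stats().maxSize / 1024 / 1024);
// Resize online; the size argument is in megabytes
db.adminCommand({ replSetResizeOplog: 1, size: 32768 });
rs.printReplicationInfo();  // verify the new oplog time window
```

rs.printReplicationInfo() reports the "log length start to end" window; it should comfortably exceed the expected initial-sync duration.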

Log Excerpt: The failed initial sync recorded the OplogStartMissing error below (for practical oplog sizing guidelines, see [1]):

{"t":{"$date":"2023-08-22T19:01:15.574+08:00"},"s":"I","c":"INITSYNC","id":21192,"ctx":"ReplCoordExtern-10","msg":"Initial sync status and statistics","attr":{"status":"failed","statistics":{"failedInitialSyncAttempts":10,"maxFailedInitialSyncAttempts":10,"initialSyncStart":{"$date":"2023-08-22T09:51:56.710Z"},"totalInitialSyncElapsedMillis":4158864,"initialSyncAttempts":[{"durationMillis":418787,"status":"OplogStartMissing: error fetching oplog during initial sync :: caused by :: Our last optime fetched: { ts: Timestamp(1692698292, 26), t: 44 }. source's GTE: { ts: Timestamp(1692698294, 42), t: 44 }","syncSource":"master:27017","rollBackId":8,"operationsRetried":0,"totalTimeUnreachableMillis":0}, ...]}}

Further Discussion: Alternative repair methods include taking an LVM snapshot of the primary's data volume, or locking the primary with db.fsyncLock() before copying its data files; the latter blocks writes for the duration of the copy and is not recommended in production. Comparison with Redis replication: both perform a full data copy followed by incremental sync, but Redis relies on PSYNC/SYNC with an RDB snapshot produced by bgsave, while MongoDB clones the data and then tails the oplog.
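If the copy-based route is taken anyway, writes must be flushed and blocked for the duration of the copy; a minimal mongo shell sketch:

```javascript
db.fsyncLock();    // flush data to disk and block further writes
// ... take the filesystem copy or LVM snapshot while the lock is held ...
db.fsyncUnlock();  // release the lock as soon as the copy finishes
```

On a PSS architecture this is why the method is discouraged: the primary stops accepting writes until fsyncUnlock() runs.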

Redis Specifics: The client-output-buffer-limit for the replica client class (default hard limit 256mb, soft limit 64mb over 60 seconds) and repl-backlog-size both affect replication stability; either can be tuned at runtime via CONFIG SET.
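Both settings can be raised at runtime without a restart; a sketch with assumed values (the class is named `slave` before Redis 5.0 and `replica` from 5.0 onward):

```shell
# Allow larger pending output to replicas: hard 512mb, soft 128mb over 120s
redis-cli CONFIG SET client-output-buffer-limit "replica 512mb 128mb 120"
# Grow the replication backlog so brief disconnects can resync partially
redis-cli CONFIG SET repl-backlog-size 256mb
```

A larger backlog widens the window in which a reconnecting replica can use partial resynchronization (PSYNC) instead of forcing a full bgsave-based transfer.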

Summary: By understanding oplog mechanics and adjusting its size, as well as considering snapshot-based recovery options, administrators can effectively resolve replication stalls in both MongoDB and Redis clusters.

References:
[1] Oplog size guidelines: https://mongoing.com/blog/oplog-size
[2] Filesystem snapshot backup for MongoDB: https://www.mongodb.com/docs/v4.4/tutorial/backup-with-filesystem-snapshots/

Tags: Redis, Replication, MongoDB, Database Recovery, Oplog, Logical Initialization
Written by

Aikesheng Open Source Community

The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (1024), and continuously operates and maintains them.
