
How Zhihu Solved MongoDB Scaling Pain Points with a Cloud Migration

This article describes the problems Zhihu's security anti‑fraud system ran into with its self‑managed MongoDB cluster, the decision to move to Alibaba Cloud MongoDB, the step‑by‑step migration plan, and the operational and performance benefits gained after the cut‑over.

Xiaolei Talks DB

Technical Pain Points of the Self‑Managed On‑Premises MongoDB Cluster

Zhihu's security anti‑fraud system stores massive, complex data in MongoDB. Rapid growth caused four main problems: storage pressure forced continuous node expansion, unbalanced data‑distribution rules created hotspot shards, backups took longer and longer so the interval between them grew and raised the risk of data loss, and frequent manual operations threatened stability.

Optimal Cloud MongoDB Service Solution

To address these problems, Zhihu collaborated with Alibaba Cloud experts and proposed:

Decouple storage and compute, improve data distribution: Switch from tag‑based passive scheduling to chunk‑based proactive pre‑splitting, adjusting rules during migration for balanced distribution.

Elastic IOPS scaling: Adopt storage performance elasticity to match workload spikes, increasing IOPS during peaks and reducing it during troughs for cost efficiency.

Snapshot‑based backup: Replace physical backups with cloud‑disk snapshot backups, enabling high‑frequency (every 15 minutes) backups and fine‑grained recovery.

The chosen solution is Alibaba Cloud MongoDB on ESSD cloud disks with AutoPL elastic performance and snapshot backup.
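To make the idea of chunk‑based proactive pre‑splitting more concrete, here is a minimal Python sketch. It is not Zhihu's actual procedure: it assumes a hashed shard key whose values fall in the signed 64‑bit range, and the function names and shard names are hypothetical. It only computes evenly spaced chunk boundaries and shows how a round‑robin placement spreads the chunks across shards.

```python
from typing import Dict, List

# Assumption: hashed shard key values span the signed 64-bit integer range.
HASH_MIN = -(2**63)
HASH_MAX = 2**63 - 1


def even_split_points(num_chunks: int) -> List[int]:
    """Return num_chunks - 1 boundaries that cut the hash range into equal chunks."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    span = HASH_MAX - HASH_MIN + 1  # total number of possible hash values
    return [HASH_MIN + (span * i) // num_chunks for i in range(1, num_chunks)]


def assign_chunks(num_chunks: int, shards: List[str]) -> Dict[str, int]:
    """Place chunks round-robin and report how many chunks each shard receives."""
    if not shards:
        raise ValueError("at least one shard is required")
    counts = {shard: 0 for shard in shards}
    for chunk_index in range(num_chunks):
        counts[shards[chunk_index % len(shards)]] += 1
    return counts
```

Because the boundaries are fixed before any data arrives, every shard starts with a near‑equal share of the key space instead of waiting for the balancer to react to a hotspot. With a ranged shard key the boundaries would have to come from the real key distribution rather than from an even division.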

MongoDB Overall Migration Plan and Technical Challenges

Moving hundreds of terabytes with minute‑level cut‑over windows required strict principles:

Separate migration of the DB cluster and the ETL pipelines, so that changes outside the database do not interfere with the migration.

Flexible, controllable sync rate using Alibaba Cloud DTS for dynamic throttling without impacting the source.

Scripted cut‑over execution to reduce manual steps from minutes to seconds and ensure repeatability.

Pre‑cut‑over rehearsals and emergency rollback plans to identify and mitigate risks such as driver connection string issues.
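The scripted cut‑over and rollback principles above can be sketched as a small step runner. This is an illustration under assumptions, not the script Zhihu used: the step names and the `run_cutover` function are hypothetical, and each step is simply a pair of callables (an action and its undo).

```python
from typing import Callable, List, Tuple

# Each step is (name, run, undo); all three are supplied by the caller.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_cutover(steps: List[Step]) -> List[str]:
    """Run steps in order; if one fails, undo the completed steps in reverse order."""
    log: List[str] = []
    done: List[Step] = []
    for step in steps:
        name, run, _undo = step
        try:
            run()
        except Exception as exc:
            log.append(f"FAILED {name}: {exc}")
            for prev_name, _run, undo in reversed(done):
                undo()
                log.append(f"ROLLED BACK {prev_name}")
            return log
        done.append(step)
        log.append(f"OK {name}")
    return log
```

Running the same list of steps against a rehearsal environment first is one way to surface problems, such as a connection string that does not take effect, before the production window.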

The detailed steps include environment preparation, target instance creation, permission setup, forward and reverse DTS tasks, data validation, cut‑over preparation, script development, and final production cut‑over with monitoring and resource cleanup.
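The article does not say how data validation was performed. One possible approach, shown here purely as a sketch with a hypothetical `collection_fingerprint` helper, is to compare a document count and an order‑independent digest computed over the same collection on the source and on the target.

```python
import hashlib
import json
from typing import Any, Dict, Iterable, Tuple

MODULUS = 1 << 256  # keep the combined digest within 256 bits


def collection_fingerprint(docs: Iterable[Dict[str, Any]]) -> Tuple[int, str]:
    """Return (document count, order-independent digest) for a stream of documents."""
    count = 0
    combined = 0
    for doc in docs:
        # Canonical form: sorted keys; non-JSON types fall back to str().
        canonical = json.dumps(doc, sort_keys=True, default=str).encode("utf-8")
        digest = hashlib.sha256(canonical).digest()
        # Addition is commutative, so the order of iteration does not matter.
        combined = (combined + int.from_bytes(digest, "big")) % MODULUS
        count += 1
    return count, f"{combined:064x}"
```

At a scale of hundreds of terabytes a full scan is expensive, so a real check would more likely run per collection or per key range, or on a sample, while the sync task is still running.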

Benefits After Migration to Cloud

The migration completed in one and a half months with zero business downtime, delivering:

Significant resource cost savings through flexible compute‑storage pairing.

Operational efficiency gains via a unified cloud DB management platform.

Resolution of historic technical issues like uneven data distribution.

Improved high availability with multi‑AZ deployment and robust backup/recovery.

Final Thoughts

Despite thorough preparation, unexpected issues such as ineffective connection‑string updates, uneven pre‑splitting, and performance impacts from new sharding strategies arose. These were resolved through redeployment, disabling the balancer during pre‑splitting, and extensive performance testing before cut‑over.
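As an illustration of disabling the balancer during pre‑splitting, the sketch below wraps the split calls between MongoDB's `balancerStop` and `balancerStart` admin commands. It is an assumption‑laden example rather than the team's actual script: the function name, the namespace, and the split points are hypothetical, `admin` is any object with a pymongo‑style `command` method (for example the `admin` database of a `MongoClient` connected to a mongos), and a managed cloud instance may restrict some of these commands.

```python
from typing import Any, Dict, List


def presplit_with_balancer_off(
    admin: Any, namespace: str, split_points: List[Dict[str, Any]]
) -> None:
    """Stop the balancer, split at each point, then restart the balancer."""
    admin.command("balancerStop")
    try:
        for point in split_points:
            # "split" with "middle" creates a chunk boundary at the given key value.
            admin.command("split", namespace, middle=point)
    finally:
        # Restart the balancer even if a split fails part-way through.
        admin.command("balancerStart")
```

Restarting the balancer immediately is a choice made for this sketch. If chunks are then moved to shards by hand, it may be preferable to leave the balancer off until that placement is finished.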

Written by

Xiaolei Talks DB

Sharing daily database operations insights, from distributed databases to cloud migration. Author: Dai Xiaolei, with 10+ years of DB ops and development experience. Your support is appreciated.
