How OPPO Turned MongoDB from a Niche DB into a Core, Scalable, Multi‑Data‑Center System
This article details OPPO's journey of adopting MongoDB at massive scale, addressing common misconceptions, designing multi‑data‑center active‑active architectures, optimizing thread models and migration processes, and applying performance‑boosting techniques that yielded multi‑fold improvements for trillion‑row clusters.
Background
When joining OPPO, several high‑traffic services using MongoDB suffered frequent timeouts and jitter, prompting a migration plan to MySQL. After a month of intensive work on the service layer, storage engine, and deployment configurations, jitter was eliminated and MongoDB remained the primary data store.
Active‑Active Multi‑Data‑Center Architecture
Three deployment patterns were evaluated:
Same‑city three‑data‑center (1mongod+1mongod+1mongod) – each data‑center runs at least two proxy instances for high availability; clients use the nearest read preference. Cross‑region writes appear only when data‑centers are geographically separated.
Same‑city two‑data‑center (2mongod+2mongod+1arbiter) – adds an arbiter to reduce election overhead; the same read‑preference behavior applies.
Inter‑region three‑data‑center (1mongod+1mongod+1mongod) – uses tag‑based routing so that writes are directed to the shard whose primary resides in the same region, eliminating cross‑region writes.
Thread‑Model Bottlenecks and Optimizations
The default model creates one thread per client connection, leading to excessive memory and CPU consumption when connections reach 100 k. MongoDB provides an adaptive thread‑pool model that splits each request into tasks (network I/O, disk I/O) placed in a global queue processed by a pool of worker threads. The global lock on the queue becomes a contention point.
Optimized multi‑queue model partitions the global queue by session hash, reducing lock contention and improving throughput.
// Enable adaptive thread pool (pseudo‑code)
setParameter: 1,
"wiredTigerEngineRuntimeConfig": "cache_size=35GB, eviction=(threads_min=4,threads_max=12)"Parallel Chunk Migration for Cluster Expansion
To expand a 3‑node shard to 6 nodes, the migration workflow is:
Select chunks to move (S = min(M, N)).
Acquire a lock on config.locks for the target collection.
Notify source shards to start migration.
After migration, wait 10 s and repeat.
Key bottlenecks: distributed‑lock contention, a slow chunk on any source shard, and the fixed 10 s pause. Optimizations include disabling auto‑split or increasing chunksize, running migrations for different shards in parallel, and moving the delay into per‑shard logic with a configurable CLI parameter.
Performance‑Boosting Case Studies
Case 1 – Billion‑Row Cluster
Pre‑sharding and write‑load balancing.
WriteConcern {w:"majority"} and read‑write separation.
Disable enableMajorityReadConcern to reduce latency.
Storage‑engine cache tuning:
eviction_target: 75%
eviction_trigger: 97%
eviction_dirty_target: 3%
eviction_dirty_trigger: 25%
evict.threads_min: 4
evict.threads_max: 16Checkpoint interval reduced to avoid I/O spikes: checkpoint=(wait=30,log_size=1GB) Sharding the system.sessions collection and staggering updates eliminated massive write spikes.
Enable tcmalloc (gperftools) and increase release rate to mitigate page‑heap fragmentation.
Case 2 – Trillion‑Row Cluster
Original schema stored a single characteristic with millions of sub‑documents in an array, causing massive I/O. Redesign steps:
Group sub‑documents under a group array per characteristic (≈100× read reduction).
Add a hashNum field and split the large array into ~500 shards, each holding a manageable number of sub‑documents, staying below the 64 MB document limit.
Result: dramatic latency reduction; write bottlenecks addressed by limiting array size and using $addToSet wisely.
Cost‑Saving Migration Example
To move several hundred billion rows from a high‑IO SSD cluster to a low‑IO SATA cluster:
Copy data to an intermediate SSD cluster.
Transfer from the SSD cluster to the target SATA servers.
Load data into the destination cluster.
Using zlib compression (4.5‑7.5×) instead of the default snappy (2.2‑3.5×) achieved an ~8:1 storage reduction, allowing the workload to run on a handful of SATA nodes instead of hundreds of SSD nodes.
Future Outlook – MongoDB + SQL Fusion
MongoDB 4.2 introduced distributed transactions; 4.4 plans further NewSQL features. To ease developer adoption, a proposal is to extend mongos to translate 5‑10 % of SQL statements into native MongoDB queries, covering ~90 % of use cases while keeping per‑database least‑privilege accounts.
Operational Tools and Practices
mongostat --discover -h ip:port -u user -p pass --authenticationDatabase=admin– monitors all shard nodes.
Slow‑log analysis via
tail mongod.log -n 1000000 | grep ms | grep COLLSCAN | grep -v "getMore" | grep -v "oplog.rs".
Kill long‑running ops:
db.currentOp({"secs_running":{$gt:5}}).inprog.forEach(function(i){db.killOp(i.opid)}).
Session tagging for write routing:
sh.addShardTag('shard01','junior')
sh.addTagRange('test.users', {age: MinKey}, {age:20}, 'junior')Security: per‑database users, readWrite roles without destructive privileges, IP whitelist/blacklist, audit and traffic‑control (enterprise or Percona MongoDB), hot‑backup instead of mongodump.
Shard‑key design: ensure data dispersion and keep related reads/writes on the same shard; use range sharding for range queries.
Capacity alerts: disk >80 % → add shard; write >35 k ops/s or read >40 k ops/s on SSD → add node; CPU >80 % → scale CPU.
Key Takeaways
Systematic analysis of MongoDB internals and proper configuration can eliminate jitter and scale to trillion‑row workloads.
Active‑active multi‑data‑center designs with tag‑based routing avoid cross‑region writes.
Adaptive multi‑queue thread model and storage‑engine tuning dramatically improve throughput.
Parallel chunk migration and data‑model redesign enable N‑fold cluster expansion.
Compression and hardware‑tier selection provide massive cost savings.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
