
Master MongoDB: From Basics to Advanced Performance and Scaling

This comprehensive guide covers MongoDB's core concepts—including schema‑free design, high availability, sharding, storage engine internals, indexing, and performance tuning—while providing practical code examples, configuration steps, and best‑practice tips for developers and architects looking to master the database in production environments.

Tencent Architect

Dec 14, 2021


Introduction

MongoDB is a distributed document database that natively supports high availability, sharding, and flexible schema design. Its design philosophy keeps the server's core capabilities separate from decisions left to the client, such as write and read guarantees. This provides flexibility but makes MongoDB more complex to use than MySQL. This article gives a full-stack overview of MongoDB theory and practice, for beginners and for developers who want deeper knowledge.

Knowledge Map Overview

The author organizes MongoDB knowledge into three parts: basic knowledge, application integration, and advanced knowledge.

1. Basic Knowledge

1.1 No Schema

No Schema brings strong expressive power, easier development and iteration, and reduced operational cost because schema changes do not require DDL statements.

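MongoDB also provides optional schema validation. Rules can be written with query operators, as in the first example, or with JSON Schema via $jsonSchema, as in the second. The two statements are alternatives; run only one of them, since both create the same collection: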

db.createCollection("saky_test_validation",{validator:{
  $and:[
    {name:{$type:"string"}},
    {status:{$in:["INIT","DEL"]}}
  ]
}})

db.createCollection("saky_test_validation", {
   validator: {
      $jsonSchema: {
         bsonType: "object",
         required: ["name","status"],
         properties: {
            name: {bsonType:"string",description:"must be a string and is required"},
            status: {enum:["INIT","DEL"],description:"can only be one of the enum values and is required"}
         }
      }
   }
})
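To make the effect of these rules concrete, here is a plain JavaScript sketch that mirrors the two constraints outside the database. It is not how MongoDB evaluates validators internally; the function name checkDocument is hypothetical and only illustrates which documents such a validator would accept or reject.

```javascript
// Hypothetical stand-in for the validator: name must be a string,
// status must be one of the allowed values, and both are required.
const ALLOWED_STATUS = ["INIT", "DEL"];

function checkDocument(doc) {
  const errors = [];
  if (typeof doc.name !== "string") {
    errors.push("name must be a string and is required");
  }
  if (!ALLOWED_STATUS.includes(doc.status)) {
    errors.push("status can only be one of the enum values and is required");
  }
  return errors;
}

// A valid document produces no errors; a bad one lists every violated rule.
console.log(checkDocument({ name: "saky", status: "INIT" }));
console.log(checkDocument({ name: 42, status: "DONE" }));
```

With a real validator in place, an insert of the second document would be rejected by the server under the default validation settings.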

1.2 High Availability

MongoDB achieves high availability through replica sets composed of at least three members: a primary, one or more secondaries, and optionally an arbiter.

Primary handles all writes; if it fails, a new primary is elected.

Secondary replicates from the primary; multiple secondaries improve read scalability.

Arbiter participates in elections without storing data, useful for cost saving or multi‑zone deployments.

Key mechanisms:

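Oplog – a capped collection that records every data change, similar to the MySQL binlog; secondaries pull oplog entries to stay in sync. On the primary, a write is applied in memory and recorded in the journal (the storage engine's write-ahead log), which is flushed to disk every 100 ms by default.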

Checkpoint – every 60 s the in‑memory state is flushed to disk, enabling fast recovery after a crash.

Node election – ensures the new primary has the most up‑to‑date data.

Consequences:

After a crash, data up to the last checkpoint is recovered.

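Data written after the last checkpoint is replayed from the journal, up to its last flush to disk.

At most about 100 ms of writes (those not yet flushed to the journal) can be lost when a node crashes. Setting j: true makes the acknowledgment wait for the journal write, and WriteConcern=majority additionally protects acknowledged writes from being rolled back after a failover.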
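The recovery behaviour can be sketched as a toy timeline. This is a simplified model under stated assumptions, not MongoDB code: times are in milliseconds, recoverAfterCrash is a hypothetical name, and "log flush" stands for the periodic 100 ms flush described above.

```javascript
// Classify writes by how they would be recovered after a crash.
// All timestamps are illustrative values in milliseconds.
function recoverAfterCrash(writes, lastCheckpointAt, lastLogFlushAt) {
  return {
    // Already part of the on-disk checkpoint.
    fromCheckpoint: writes.filter(w => w.at <= lastCheckpointAt),
    // After the checkpoint, but flushed to the on-disk log: replayed.
    fromLog: writes.filter(w => w.at > lastCheckpointAt && w.at <= lastLogFlushAt),
    // Still only in memory at crash time: lost.
    lost: writes.filter(w => w.at > lastLogFlushAt),
  };
}

const writes = [
  { id: "a", at: 59000 },
  { id: "b", at: 60500 },
  { id: "c", at: 61950 },
  { id: "d", at: 62030 },
];

// Checkpoint at 60 s, last log flush at 62 s, crash shortly afterwards.
const recovered = recoverAfterCrash(writes, 60000, 62000);
console.log(recovered.lost.map(w => w.id)); // only "d" is lost
```

The window of possible loss is bounded by the flush interval, which is why waiting for the flush before acknowledging a write closes it.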

1.2.1 Replica Set

A replica set consists of:

Primary : the write entry point.

Secondary : one or more nodes that replicate from the primary.

Arbiter : a voting node without data, used to achieve a majority vote while saving resources.

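Typical three-node configurations are PSS (primary plus two secondaries) and PSA (primary, secondary, arbiter). PSA saves cost, but once either data-bearing node is lost only one copy of the data remains, so writes with WriteConcern=majority cannot be acknowledged until that node recovers.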
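The difference between PSS and PSA shows up when one data-bearing node is down. The sketch below models it with plain objects. The member shape and function names are hypothetical, and the write-majority rule (the smaller of the voting majority and the number of data-bearing voting members) is an assumption taken from the MongoDB write concern documentation.

```javascript
// Majority of voting members, used for elections.
function electionMajority(members) {
  const voting = members.filter(m => m.votes > 0).length;
  return Math.floor(voting / 2) + 1;
}

// Assumed rule for w: "majority": arbiters vote but cannot acknowledge writes.
function writeMajority(members) {
  const dataVoting = members.filter(m => m.votes > 0 && m.dataBearing).length;
  return Math.min(electionMajority(members), dataVoting);
}

function canElectPrimary(members) {
  const upVotes = members.filter(m => m.up && m.votes > 0).length;
  return upVotes >= electionMajority(members);
}

function canAckMajorityWrite(members) {
  const upData = members.filter(m => m.up && m.votes > 0 && m.dataBearing).length;
  return upData >= writeMajority(members);
}

const member = (name, dataBearing, up) => ({ name, votes: 1, dataBearing, up });

// One secondary is down in each topology.
const pss = [member("P", true, true), member("S1", true, true), member("S2", true, false)];
const psa = [member("P", true, true), member("S", true, false), member("A", false, true)];

console.log(canElectPrimary(pss), canAckMajorityWrite(pss)); // true true
console.log(canElectPrimary(psa), canAckMajorityWrite(psa)); // true false
```

In the PSA case the set still has a primary, but majority writes block because the arbiter holds no data.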

1.2.2 Read/Write Concerns

WriteConcern controls when a write is considered successful. The w parameter specifies how many nodes must acknowledge the write (0: no acknowledgment, 1: the primary only, majority, or a specific number of nodes). The j flag requires the write to be in the on-disk journal before it is acknowledged.

ReadPreference determines from which node reads are served. Modes include:

primary – read from the primary.

primaryPreferred – read primary if available, otherwise secondary.

secondary – read from a secondary.

secondaryPreferred – read secondary if available, otherwise primary.

nearest – read from the member with the lowest network latency, whether it is the primary or a secondary.

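ReadConcern levels (local, available, majority, linearizable, snapshot) control the consistency and isolation of the data returned, trading freshness against the guarantee that what was read cannot be rolled back.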
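The five node-selection modes can be sketched as a small routing function. This is a simplified model, not driver code: pickNode and the node shape are hypothetical, and real drivers choose randomly among members within a latency window rather than always taking the single fastest one. Note that nearest considers every reachable member, the primary included.

```javascript
// nodes: [{ name, role: "primary" | "secondary", up, pingMs }]
function pickNode(mode, nodes) {
  const up = nodes.filter(n => n.up);
  const primary = up.find(n => n.role === "primary") || null;
  const secondaries = up.filter(n => n.role === "secondary");
  // Lowest round-trip time wins; returns null for an empty list.
  const fastest = list =>
    list.reduce((best, n) => (best === null || n.pingMs < best.pingMs ? n : best), null);

  switch (mode) {
    case "primary": return primary;
    case "primaryPreferred": return primary || fastest(secondaries);
    case "secondary": return fastest(secondaries);
    case "secondaryPreferred": return fastest(secondaries) || primary;
    case "nearest": return fastest(up);
    default: throw new Error("unknown mode: " + mode);
  }
}

const nodes = [
  { name: "p", role: "primary", up: true, pingMs: 20 },
  { name: "s1", role: "secondary", up: true, pingMs: 5 },
  { name: "s2", role: "secondary", up: true, pingMs: 10 },
];

console.log(pickNode("primary", nodes).name);
console.log(pickNode("nearest", nodes).name);
```

Reads routed to secondaries may return slightly stale data, which is the trade-off for offloading the primary.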

2. Application Integration

2.1 Basic Performance Tests

Performance tests on a 4‑core, 8 GB sharded cluster show:

Compression ratios: Snappy ≈ 3× MySQL, Zlib ≈ 6×.

Write throughput peaks at ~3000 QPS; performance degrades as data grows, then stabilizes.

Read latency for shard‑key queries stays ~2 ms, while indexed queries depend on Mongos capacity (≈10 ms at 1400 QPS on an 8‑core, 16 GB Mongos).

2.2 Sharding Selection

Choosing the number of Mongos instances, number of shards, shard key, and sharding algorithm (range vs. hash) is guided by the performance data. Pre‑splitting and proper chunk sizing help avoid frequent splits and migrations.
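The range versus hash choice matters most for monotonically increasing keys. The sketch below routes 1,000 new sequential keys under both schemes. It is an illustration under assumptions: FNV-1a is only a convenient stand-in (MongoDB's hashed sharding uses its own hash function), and the shard count and range boundaries are made-up values.

```javascript
// Stand-in hash (FNV-1a, 32-bit); not the hash MongoDB uses.
function fnv1a(text) {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Range sharding: boundaries were fixed before the new keys arrived.
// Keys above the last boundary all fall into the final range.
function rangeShard(key, upperBounds) {
  const i = upperBounds.findIndex(bound => key < bound);
  return i === -1 ? upperBounds.length : i;
}

function hashShard(key, shardCount) {
  return fnv1a(String(key)) % shardCount;
}

function distribution(keys, route, shardCount) {
  const counts = new Array(shardCount).fill(0);
  for (const k of keys) counts[route(k)] += 1;
  return counts;
}

const shardCount = 4;
const bounds = [250, 500, 750];
const newKeys = Array.from({ length: 1000 }, (_, i) => 1000 + i);

const byRange = distribution(newKeys, k => rangeShard(k, bounds), shardCount);
const byHash = distribution(newKeys, k => hashShard(k, shardCount), shardCount);
console.log("range:", byRange); // every new key lands on the last shard
console.log("hash:", byHash);   // keys are spread across shards
```

Range sharding keeps range queries on few shards but concentrates sequential inserts on one of them; hashing spreads the writes at the cost of scatter-gather range queries.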

2.3 spring-data-mongo

Using Spring Boot with MongoDB requires adding the starter dependency, configuring connection parameters, and optionally customizing MongoTemplate to set WriteConcern or batch options.

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-mongodb</artifactId>
</dependency>

spring:
  data:
    mongodb:
      host: ${MONGO_HOST}
      port: ${MONGO_PORT}
      database: ${MONGO_DB}
      username: ${MONGO_USER}
      password: ${MONGO_PASS}

Custom MongoTemplate example that sets WriteConcern.MAJORITY:

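import com.mongodb.WriteConcern;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Primary;
import org.springframework.data.mongodb.MongoDbFactory; // deprecated in favor of MongoDatabaseFactory since Spring Data MongoDB 3.0
import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.convert.MongoConverter;

@Configuration
public class MyMongoConfig {
    @Primary
    @Bean
    public MongoTemplate mongoTemplate(MongoDbFactory factory, MongoConverter converter){
        MongoTemplate tmpl = new MongoTemplate(factory, converter);
        // Every write issued through this template waits for a majority of nodes.
        tmpl.setWriteConcern(WriteConcern.MAJORITY);
        return tmpl;
    }
}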

2.3.2 Batch Operations

MongoDB provides two batch approaches:

Ordered (default) – the batch stops on the first error; successful operations before the error remain.

Unordered – operations may run in parallel and in any order; an error does not stop the batch, and only the failing operations are reported.
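The difference in error handling can be shown with a toy in-memory collection. This is a sketch, not driver code: insertMany here is a hypothetical local function, a duplicate _id is the only error it models, and it processes documents sequentially even in unordered mode.

```javascript
// Toy collection keyed by _id; a duplicate _id is the only possible error.
function insertMany(store, docs, { ordered = true } = {}) {
  const result = { inserted: [], errors: [] };
  for (let i = 0; i < docs.length; i++) {
    const doc = docs[i];
    if (store.has(doc._id)) {
      result.errors.push({ index: i, message: "duplicate key: " + doc._id });
      if (ordered) break; // ordered: stop at the first failure
      continue;           // unordered: record the error and keep going
    }
    store.set(doc._id, doc);
    result.inserted.push(doc._id);
  }
  return result;
}

// The third document repeats _id 1 and fails in both modes.
const docs = [{ _id: 1 }, { _id: 2 }, { _id: 1 }, { _id: 3 }];
const orderedResult = insertMany(new Map(), docs, { ordered: true });
const unorderedResult = insertMany(new Map(), docs, { ordered: false });

console.log(orderedResult.inserted);   // _id 3 is never attempted
console.log(unorderedResult.inserted); // _id 3 is inserted despite the error
```

For bulk loads where individual failures are acceptable, unordered mode avoids losing the rest of the batch to one bad document.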

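Example of forcing unordered inserts in a custom MongoTemplate. It overrides insertDocumentList, a protected internal hook of Spring Data MongoDB whose signature may differ between versions, so treat it as a sketch:

// Needs java.util.List, org.bson.Document, com.mongodb.client.model.InsertManyOptions,
// plus MongoTemplate, MappedDocument, MongoConverter, and MongoDbFactory from Spring Data MongoDB.
public class MyMongoTemplate extends MongoTemplate {
    public MyMongoTemplate(MongoDbFactory factory, MongoConverter converter) {
        super(factory, converter);
    }
    @Override
    protected List<Object> insertDocumentList(String collectionName, List<Document> documents) {
        InsertManyOptions options = new InsertManyOptions().ordered(false);
        // Obtain the driver collection through execute(); the template has no "collection" field.
        execute(collectionName, collection -> {
            collection.insertMany(documents, options);
            return null;
        });
        return MappedDocument.toIds(documents);
    }
}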

2.3.3 Common Pitfalls

Pre‑splitting – creating initial chunks reduces split and migrate overhead during bulk inserts.

In‑memory sorting – avoid queries that require sorting beyond the index order; otherwise MongoDB will perform costly in‑memory sorts.

Chain replication – reduces write load on the primary but can increase write latency when WriteConcern=majority.

3. Advanced Knowledge

3.1 WiredTiger Storage Engine

WiredTiger uses B+‑tree pages. Each page contains three internal lists:

WT_ROW – data loaded from disk.

WT_UPDATE – modifications made after the page was loaded.

WT_INSERT – new inserts after the page was loaded.

Page lifecycle states include DISK (on disk), DELETED, READING, MEM (in memory), LOCKED (eviction lock), LOOKASIDE, and LIMBO.

Key processes:

Reconcile – during checkpoint, merges in‑memory changes into a new page that is flushed to disk.

Evict – LRU‑based eviction of pages when memory pressure rises.
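As a rough mental model of reconcile, the three per-page lists can be merged into a fresh page image. The sketch below is a toy under heavy assumptions: the field names rows, updates, and inserts merely mirror WT_ROW, WT_UPDATE, and WT_INSERT, and it ignores transaction visibility, versions, and page splits, all of which the real engine handles.

```javascript
// Toy page: rows loaded from disk, plus changes made since the page was loaded.
function reconcile(page) {
  const merged = new Map(page.rows.map(r => [r.key, r.value])); // WT_ROW
  for (const u of page.updates) {                               // WT_UPDATE
    if (u.deleted) merged.delete(u.key);
    else merged.set(u.key, u.value);
  }
  for (const ins of page.inserts) {                             // WT_INSERT
    merged.set(ins.key, ins.value);
  }
  // New page image, sorted by key, ready to be written to disk.
  return [...merged.entries()]
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
    .map(([key, value]) => ({ key, value }));
}

const page = {
  rows: [{ key: "a", value: 1 }, { key: "c", value: 3 }],
  updates: [{ key: "a", value: 10 }, { key: "c", deleted: true }],
  inserts: [{ key: "b", value: 2 }],
};

console.log(reconcile(page));
```

The original on-disk rows are never modified in place; the merged image replaces them, which is what lets a checkpoint capture a consistent state.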

3.2 Chunk Management

Chunks are logical data units (default 64 MB, configurable 1‑1024 MB) used for sharding. Chunk metadata includes _id, ns, min, max, shard, and history. Chunks split when size or document count exceeds thresholds, and the balancer migrates chunks to keep shard distribution even.

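Pre-splitting a hashed collection can be done when sharding it, by passing numInitialChunks. Replace <number_of_shards> with the actual shard count; MongoDB caps the value at about 8192 chunks per shard, so the expression below requests the maximum: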

sh.shardCollection("saky_db.saky_table", {"_id":"hashed"}, false, {numInitialChunks:8192*<number_of_shards>})
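Requesting the maximum is not always necessary. A hedged way to size the initial chunk count is to divide the expected data volume by the chunk size and clamp the result. The function below is a sketch: initialChunkCount is a hypothetical helper, and the 64 MB chunk size and 8192-per-shard cap are taken from the figures in this section.

```javascript
const MB = 1024 * 1024;

// Estimate how many initial chunks a collection needs.
function initialChunkCount(expectedBytes, shardCount, chunkSizeMB = 64, perShardCap = 8192) {
  const needed = Math.ceil(expectedBytes / (chunkSizeMB * MB));
  const cap = perShardCap * shardCount;
  // At least one chunk per shard, never more than the assumed cap.
  return Math.min(Math.max(needed, shardCount), cap);
}

const TiB = 1024 * 1024 * MB;
console.log(initialChunkCount(1 * TiB, 4));   // 16384
console.log(initialChunkCount(100 * TiB, 4)); // 32768 (capped)
```

The result would then be passed as numInitialChunks in place of the fixed expression.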

3.3 Consistency & High Availability

MongoDB follows the CAP/BASE trade‑offs. Replica‑set elections are based on a Raft‑inspired algorithm with three states: leader (primary), candidate, and follower (secondary). Voting rules favor the node with the most up‑to‑date oplog. After a candidate wins, a catch‑up phase ensures the new primary synchronizes any missing oplog entries before becoming writable.
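The voting rule can be sketched in a few lines. This is a simplified Raft-style model rather than MongoDB's actual implementation: terms and oplog positions are plain numbers, the names grantsVote and winsElection are hypothetical, and details such as election priorities and one vote per term are left out.

```javascript
// A voter supports a candidate only if the candidate runs in a newer term
// and its oplog is at least as up to date as the voter's own.
function grantsVote(voter, candidate) {
  return candidate.term > voter.term && candidate.lastOpTime >= voter.lastOpTime;
}

// The candidate votes for itself and needs a strict majority of all members.
function winsElection(candidate, others) {
  const memberCount = others.length + 1;
  const votes = 1 + others.filter(v => v.up && grantsVote(v, candidate)).length;
  return votes > memberCount / 2;
}

const fresh = { term: 2, lastOpTime: 100 };
const stale = { term: 2, lastOpTime: 90 };

// Old primary is down; the remaining secondary is at oplog position 100.
const afterFailure = [
  { term: 1, lastOpTime: 100, up: true },
  { term: 1, lastOpTime: 120, up: false },
];
const allUp = [
  { term: 1, lastOpTime: 100, up: true },
  { term: 1, lastOpTime: 100, up: true },
];

console.log(winsElection(fresh, afterFailure)); // true
console.log(winsElection(stale, allUp));        // false
```

Because voters refuse candidates that are behind them, a node missing recent writes cannot gather a majority while more current nodes are reachable.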

Synchronization source selection prefers nodes that:

are alive (heartbeat received),

are not more than 30 s behind the primary,

have an oplog window that covers the follower’s latest timestamp.
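These three criteria translate directly into a filter. The sketch below is a model under assumptions: timestamps are plain numbers in seconds, eligibleSyncSources and the candidate fields are hypothetical names, and real sync source selection applies further rules (such as ping time) that are omitted here.

```javascript
const MAX_LAG_SECONDS = 30;

// Keep only candidates that satisfy all three criteria.
function eligibleSyncSources(follower, candidates, primaryOpTime) {
  return candidates.filter(c =>
    c.alive &&                                          // heartbeat received
    primaryOpTime - c.lastOpTime <= MAX_LAG_SECONDS &&  // not too far behind the primary
    c.oldestOplogTime <= follower.lastOpTime &&         // oplog window starts early enough
    follower.lastOpTime <= c.lastOpTime                 // and reaches past the follower
  );
}

const follower = { lastOpTime: 100 };
const candidates = [
  { name: "a", alive: true, lastOpTime: 125, oldestOplogTime: 50 },
  { name: "b", alive: false, lastOpTime: 130, oldestOplogTime: 50 },
  { name: "c", alive: true, lastOpTime: 95, oldestOplogTime: 10 },
  { name: "d", alive: true, lastOpTime: 128, oldestOplogTime: 110 },
];

const sources = eligibleSyncSources(follower, candidates, 130);
console.log(sources.map(c => c.name)); // only "a" qualifies
```

Candidate d illustrates the oplog window rule: its oldest entry is newer than the follower's position, so the follower could not resume from it without a full resync.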

3.4 Indexes

Supported index types:

Single‑field

Compound (prefix rule applies)

Multikey (arrays)

Hashed (used with hashed sharding)

Geospatial (2d/2dsphere)

Text (available but slower)

Important considerations:

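Compound indexes must match the query's sort order: the sort keys should follow the index key order, with directions either all matching or all reversed. Otherwise MongoDB falls back to an in-memory sort.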

Background index builds prevent blocking but concurrent builds can exhaust CPU; monitor and create indexes sequentially.
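The sort-order rule for compound indexes can be checked mechanically. The function below is a simplified sketch: sortUsesIndex is a hypothetical helper, it only handles sorts that start at the first index key, and it ignores the case where equality filters on leading keys let a later key serve the sort.

```javascript
// index and sort are arrays of [field, direction] pairs; direction is 1 or -1.
function sortUsesIndex(index, sort) {
  if (sort.length === 0 || sort.length > index.length) return false;
  // Sort fields must be a prefix of the index fields, in the same order.
  const sameFields = sort.every(([field], i) => index[i][0] === field);
  if (!sameFields) return false;
  // Directions must match the index exactly or be its exact inverse.
  const forward = sort.every(([, dir], i) => dir === index[i][1]);
  const backward = sort.every(([, dir], i) => dir === -index[i][1]);
  return forward || backward;
}

const index = [["a", 1], ["b", -1]];

console.log(sortUsesIndex(index, [["a", 1], ["b", -1]])); // true
console.log(sortUsesIndex(index, [["a", -1], ["b", 1]])); // true (index walked backwards)
console.log(sortUsesIndex(index, [["a", 1], ["b", 1]]));  // false (mixed directions)
console.log(sortUsesIndex(index, [["b", -1]]));           // false (not a prefix)
```

When the check fails, either the index or the query's sort specification needs to change to avoid the in-memory sort.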

The explain command supports three verbosity modes: queryPlanner, executionStats, and allPlansExecution. Each adds detail to the previous one, helping developers analyze index usage and performance.
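Reading explain output usually means looking for two warning signs in the winning plan: a COLLSCAN stage (no index used) and a SORT stage (in-memory sort). The sketch below walks a plan tree to find them. The sample object is hand-written in the shape of the classic query engine's output, not captured from a server, and planWarnings is a hypothetical helper.

```javascript
// Collect stage names by following inputStage / inputStages links.
function collectStages(plan, found = []) {
  if (!plan) return found;
  found.push(plan.stage);
  if (plan.inputStage) collectStages(plan.inputStage, found);
  for (const child of plan.inputStages || []) collectStages(child, found);
  return found;
}

function planWarnings(explainOutput) {
  const stages = collectStages(explainOutput.queryPlanner.winningPlan);
  const warnings = [];
  if (stages.includes("COLLSCAN")) warnings.push("full collection scan");
  if (stages.includes("SORT")) warnings.push("in-memory sort");
  return warnings;
}

// Hand-written samples shaped like explain output.
const unindexed = {
  queryPlanner: {
    winningPlan: { stage: "SORT", inputStage: { stage: "COLLSCAN" } },
  },
};
const indexed = {
  queryPlanner: {
    winningPlan: { stage: "FETCH", inputStage: { stage: "IXSCAN" } },
  },
};

console.log(planWarnings(unindexed));
console.log(planWarnings(indexed));
```

In executionStats mode, comparing the number of documents examined with the number returned gives a further signal of how selective the chosen index is.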

Conclusion

This article equips readers with foundational concepts, practical configuration steps, performance insights, and deep architectural knowledge needed to use MongoDB efficiently in production environments.
