
Milvus Storage Tuning in Practice: 25× Query Speedup and Three Tricks to Cut Memory Usage by Half

This article walks through Milvus 2.3‑2.6.x storage optimizations—Mmap, tiered storage, and clustering compaction—explaining their principles, configuration hierarchy, benchmark results, and concrete deployment templates that together can boost query performance up to 25‑fold while halving memory consumption.


1. Storage modes: Full‑load vs Tiered storage

Milvus loads a collection with load_collection. In full‑load mode (the default before version 2.6.4), all field data, index files, and metadata are cached in memory, giving low query latency but high memory consumption. When the dataset exceeds the memory capacity of a QueryNode, loading fails.

Tiered storage (available from Milvus 2.6.4) decouples metadata caching from segment loading:

Only lightweight metadata (schema, index info, block mapping) is cached on load_collection.

Field data and index files are fetched on demand.

Cold data is evicted automatically, freeing space for hot data.

Typical trade‑offs:

Load speed: full‑load is slow because it copies all data; tiered storage is fast because it loads only metadata.

Memory usage: full‑load consumes a large amount of RAM; tiered storage keeps RAM usage low.

First‑query latency: full‑load gives the lowest latency; tiered storage may add latency on the first access of cold data.

Applicable data size: full‑load works when the total size is smaller than the QueryNode memory; tiered storage is designed for datasets that exceed a single node's memory.

Choose full‑load when the collection is small and memory is abundant; otherwise adopt tiered storage.
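For reference, loading a collection is a single call; under tiered storage the same call only caches metadata. The sketch below assumes the pymilvus MilvusClient, a local server URI, and a hypothetical collection named products:

from pymilvus import MilvusClient

# URI and collection name are illustrative assumptions
client = MilvusClient(uri="http://localhost:19530")

# Full-load mode pulls all field data and indexes into memory at this point;
# with tiered storage (2.6.4+) only schema/index metadata is cached and the call returns quickly.
client.load_collection(collection_name="products")

# Optionally check the load state before issuing queries
print(client.get_load_state(collection_name="products"))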

2. Mmap (memory‑mapped files)

Milvus introduced Mmap support in version 2.3. Mmap maps a disk file into the process address space, allowing reads/writes to operate directly on the file. Four control levels are provided (global → collection → field → index) with the following priority: field/index > collection > global.

queryNode:
  mmap:
    scalarField: false   # scalar field raw data
    scalarIndex: false   # scalar index (only inverted index supported)
    vectorField: false   # vector raw data
    vectorIndex: false   # vector index
    mmapDirPath: /milvus/mmap_data

After changing a collection‑level or index‑level Mmap setting, the collection must be released (release_collection) and loaded again (load_collection) for the change to take effect.
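For example, a collection‑level Mmap override can be applied through collection properties followed by a release/load cycle. This is a minimal sketch assuming the pymilvus MilvusClient; the mmap.enabled property name follows the Milvus documentation, and the URI and collection name are illustrative:

from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # URI assumed for illustration

# Release first: Mmap changes only take effect on the next load
client.release_collection(collection_name="products")

# Collection-level override (takes priority over the global queryNode.mmap defaults)
client.alter_collection_properties(
    collection_name="products",
    properties={"mmap.enabled": True},
)

# Reload so the new Mmap setting is applied
client.load_collection(collection_name="products")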

When to enable Mmap: large datasets with uneven query frequency, many cold records, or cost‑sensitive workloads where a modest latency increase is acceptable.

When not to enable Mmap: latency‑critical online services or environments with abundant memory.

3. Tiered storage four‑stage workflow

Stage 1 – Lazy Load

load_collection caches only segment metadata (schema, index info, block mapping). No data blocks are loaded, so the call returns instantly.

Stage 2 – Warm‑up

Warm‑up pre‑loads the specified fields or indexes right after load, before the first query arrives, eliminating cold‑start delay. The configuration priority is field/index > collection > cluster. Each target has only two options: sync (pre‑load) or disable (skip).

client.create_collection(
    collection_name="products",
    schema=schema,
    properties={
        "warmup.scalarField": "sync",
        "warmup.scalarIndex": "sync",
        "warmup.vectorField": "disable",
        "warmup.vectorIndex": "sync",
    }
)

Stage 3 – Partial Load

During query execution Milvus loads required data blocks on demand. Field data is loaded at block granularity; indexes are loaded as whole segments. This step is automatic.

Stage 4 – Eviction

When memory or disk usage exceeds configured watermarks, Milvus evicts least‑recently‑used data. Two eviction modes are available:

Synchronous eviction: triggered during a query that exceeds the limits; the query pauses until space is reclaimed.

Asynchronous eviction: a background thread periodically cleans up without blocking queries.

Both modes can be enabled simultaneously.

4. Eviction tuning: watermarks and TTL

Watermarks define upper and lower resource thresholds. Exceeding the high watermark triggers eviction; dropping below the low watermark stops it.

queryNode:
  segcore:
    tieredStorage:
      evictionEnabled: true
      backgroundEvictionEnabled: true
      memoryLowWatermarkRatio: 0.75
      memoryHighWatermarkRatio: 0.80
      diskLowWatermarkRatio: 0.75
      diskHighWatermarkRatio: 0.80
      cacheTtl: 604800   # 7 days (seconds)
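To make the watermark mechanism concrete, the toy sketch below mimics the high/low‑threshold behaviour on an LRU‑style cache. It is a conceptual illustration only, not Milvus internals:

from collections import OrderedDict

def evict_to_low_watermark(cache: OrderedDict, capacity: float,
                           high: float = 0.80, low: float = 0.75) -> None:
    """Evict least-recently-used entries once usage crosses the high watermark,
    and keep evicting until usage drops back below the low watermark."""
    used = sum(cache.values())          # cache maps block id -> size
    if used / capacity < high:          # below the high watermark: nothing to do
        return
    while cache and sum(cache.values()) / capacity > low:
        cache.popitem(last=False)       # drop the least recently used block

# Example: a 100-unit cache holding three blocks (85% used, above the 0.80 high watermark)
cache = OrderedDict(a=30, b=30, c=25)
evict_to_low_watermark(cache, capacity=100)
print(cache)                            # eviction stops once usage falls to or below 75%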

Guidelines (derived from Milvus documentation):

Keep the gap between the high and low watermarks in the 0.05–0.10 range to avoid either overly frequent eviction or overly long eviction cycles.

Do not set the high watermark above 0.80, to preserve a buffer for traffic spikes.

Start with the default 75%–80% range and adjust based on observed utilization.

TTL (Time‑To‑Live) controls how long unused data stays cached; it works only when backgroundEvictionEnabled: true. Typical choices:

Short TTL (hours‑day) for highly dynamic data.

Long TTL (days) for relatively stable data.

TTL = 0 (disabled) for pure hot‑data scenarios.

5. Clustering compaction (query‑performance multiplier)

Clustering compaction physically reorganizes data according to a clustering key, enabling intelligent segment pruning during queries. Example: using user_id as the clustering key allows Milvus to skip 99 % of segments for user_id = 1000 queries.
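For instance, a filtered search like the one below is exactly the kind of query that benefits from pruning. The sketch assumes the pymilvus MilvusClient and the user_id/embedding schema introduced later in this section; the URI, collection name, and query vector are illustrative:

import random
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")   # URI assumed for illustration
query_vector = [random.random() for _ in range(768)]  # stand-in 768-dim query vector

# Exact-match filter on the clustering key: Milvus can prune segments whose
# user_id range does not contain 1000 and scan only the remaining blocks.
results = client.search(
    collection_name="user_behavior",   # collection name assumed for illustration
    data=[query_vector],
    filter="user_id == 1000",
    limit=10,
    output_fields=["id", "content"],
)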

Benchmark (20 million‑vector, 768‑dim LAION dataset; clustering key: an Int64 field named key):

No filter – prune 0 %, latency 1685 ms, QPS 17.75

key > 200 && key < 800 – prune 40.2 %, latency 1045 ms, QPS 28.38

key > 200 && key < 600 – prune 59.8 %, latency 829 ms, QPS 35.78

key > 200 && key < 400 – prune 79.5 %, latency 550 ms, QPS 54.00

key == 1000 – prune 99 %, latency 68 ms, QPS 431.41 (≈ 25× speedup)

Exact‑match queries see the largest benefit; queries without a filter on the clustering key gain no improvement.

Configuration steps (Python SDK):

from pymilvus import CollectionSchema, FieldSchema, DataType
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="user_id", dtype=DataType.INT64, is_clustering_key=True),
    FieldSchema(name="content", dtype=DataType.VARCHAR, max_length=1000),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768)
]
schema = CollectionSchema(fields, description="User behavior data")
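Continuing the snippet, the collection can then be created from that schema and indexed as usual. The collection name, URI, and index parameters below are illustrative assumptions:

from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # URI assumed for illustration

# Create the collection with the clustering key declared in the schema above
client.create_collection(collection_name="user_behavior", schema=schema)

# Index the vector field so filtered searches remain efficient
index_params = client.prepare_index_params()
index_params.add_index(field_name="embedding", index_type="HNSW",
                       metric_type="L2", params={"M": 16, "efConstruction": 200})
client.create_index(collection_name="user_behavior", index_params=index_params)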

Supported clustering‑key types: Int8, Int16, Int32, Int64, Float, Double, VarChar.

Manual compaction:

# Trigger clustering compaction
collection.compact(is_clustering=True)
# Check status
state = collection.get_compaction_state(is_clustering=True)
# Wait until finished
collection.wait_for_compaction_completed(is_clustering=True)

Automatic compaction can be enabled in dataCoord.compaction with tunable intervals and thresholds (e.g., triggerInterval, minInterval, maxInterval).

Key considerations for selecting a clustering key:

High‑frequency filter field (e.g., tenant ID, timestamp).

High‑cardinality field (uniform value distribution).

Business‑critical field (tenant ID for multi‑tenant, timestamp for time‑series).

If a collection already defines a partition key, it can be reused as the clustering key via usePartitionKeyAsClusteringKey: true.
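As a rough sketch of that reuse path: the partition key is declared on the schema, and the cluster‑level switch (typically common.usePartitionKeyAsClusteringKey in milvus.yaml) makes it double as the clustering key. Beyond those two names, the field layout here is illustrative:

from pymilvus import CollectionSchema, FieldSchema, DataType

# tenant_id is the partition key; with usePartitionKeyAsClusteringKey: true set
# cluster-wide, the same field is reused as the clustering key, so no separate
# is_clustering_key declaration is needed.
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="tenant_id", dtype=DataType.INT64, is_partition_key=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
]
schema = CollectionSchema(fields, description="Multi-tenant data; partition key reused for clustering")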

6. Scenario‑based configuration templates

Scenario 1 – Real‑time low‑latency retrieval

queryNode:
  segcore:
    tieredStorage:
      warmup:
        scalarField: sync
        scalarIndex: sync
        vectorField: disable   # on‑demand
        vectorIndex: sync
      evictionEnabled: true
      backgroundEvictionEnabled: true
      memoryLowWatermarkRatio: 0.75
      memoryHighWatermarkRatio: 0.80
      diskLowWatermarkRatio: 0.75
      diskHighWatermarkRatio: 0.80
      cacheTtl: 0   # disable TTL

Vector indexes are pre‑loaded (sync) to avoid cold‑start latency; raw vectors stay on disk and are fetched as needed.

Scenario 2 – Offline batch analysis

queryNode:
  segcore:
    tieredStorage:
      warmup:
        scalarField: disable
        scalarIndex: disable
        vectorField: disable
        vectorIndex: disable
      evictionEnabled: true
      backgroundEvictionEnabled: true
      memoryLowWatermarkRatio: 0.70
      memoryHighWatermarkRatio: 0.85
      diskLowWatermarkRatio: 0.70
      diskHighWatermarkRatio: 0.85
      cacheTtl: 86400   # 1 day

All warm‑up is disabled for fast loading; the wider watermark range tolerates resource spikes; a TTL of one day releases stale data after the batch job completes.

Scenario 3 – Mixed deployment

# Online collection – critical data pre‑loaded
client.create_collection(
    collection_name="online_search",
    schema=schema,
    properties={
        "warmup.scalarField": "sync",
        "warmup.scalarIndex": "sync",
        "warmup.vectorField": "disable",
        "warmup.vectorIndex": "sync",
    }
)

# Offline collection – everything on‑demand
client.create_collection(
    collection_name="offline_analysis",
    schema=schema,
    properties={
        "warmup.scalarField": "disable",
        "warmup.scalarIndex": "disable",
        "warmup.vectorField": "disable",
        "warmup.vectorIndex": "disable",
    }
)

Global watermarks are set to middle values; per‑collection warm‑up decides the latency vs. resource trade‑off.

7. Tuning advice and common pitfalls

Start conservatively: enable Mmap first and observe the impact; if memory is still insufficient, switch to tiered storage; enable clustering compaction only when queries contain filters on the clustering key.

Monitor key metrics: cache hit rate, memory/disk utilization, eviction frequency, P99 query latency.

Configuration interactions: warmup=sync + mmap=true keeps pre‑loaded data on disk, while warmup=sync + mmap=false keeps it in memory. Mixing them unintentionally can nullify the expected latency gains.

Release before reload: after changing collection‑level Mmap settings, always call release_collection and then load_collection.

Clustering compaction cost: compaction consumes CPU and I/O; schedule it during low‑traffic periods and set a reasonable minInterval to avoid overly frequent runs.

Over‑commit ratio: tiered storage allows the total data size to exceed memory, but keep the over‑commit ratio below 0.7 to prevent excessive eviction and latency spikes.
