Milvus Storage Tuning in Practice: 25× Query Speedup and Three Tricks to Cut Memory Usage by Half
This article walks through Milvus 2.3‑2.6.x storage optimizations—Mmap, tiered storage, and clustering compaction—explaining their principles, configuration hierarchy, benchmark results, and concrete deployment templates that together can boost query performance up to 25‑fold while halving memory consumption.
1. Storage modes: Full‑load vs Tiered storage
Milvus loads a collection with load_collection. In full‑load mode (default before version 2.6.4) all field data, index files and metadata are cached in memory, giving low query latency but high memory consumption. When the dataset exceeds the memory capacity of a QueryNode, loading fails.
Tiered storage (available from Milvus 2.6.4) decouples metadata caching from segment loading:
Only lightweight metadata (schema, index info, block mapping) is cached on load_collection.
Field data and index files are fetched on demand.
Cold data is evicted automatically, freeing space for hot data.
Typical trade‑offs:
Load speed: full‑load is slow because it copies all data; tiered storage is fast because it loads only metadata.
Memory usage: full‑load consumes a large amount of RAM; tiered storage keeps RAM usage low.
First‑query latency: full‑load gives the lowest latency; tiered storage may add latency on the first access of cold data.
Applicable data size: full‑load works when the total size is smaller than the QueryNode memory; tiered storage is designed for datasets that exceed a single node’s memory.
Choose full‑load when the collection is small and memory is abundant; otherwise adopt tiered storage.
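The decision rule above can be sketched as a tiny helper. This is illustrative only; `choose_storage_mode` and its inputs are not part of the Milvus API, and in practice you should budget for index size and per-node overhead as well:

```python
def choose_storage_mode(dataset_gb: float, querynode_mem_gb: float) -> str:
    """Rule of thumb from the trade-offs above (illustrative, not a Milvus API).

    Full-load requires the whole collection (field data + indexes + metadata)
    to fit in a QueryNode's memory; otherwise tiered storage is the safe choice.
    """
    if dataset_gb < querynode_mem_gb:
        return "full-load"   # lowest latency, highest RAM
    return "tiered"          # on-demand loading, low RAM


print(choose_storage_mode(10, 64))    # small collection, ample memory
print(choose_storage_mode(500, 64))   # dataset exceeds a single node
```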
2. Mmap (memory‑mapped files)
Milvus introduced Mmap support in version 2.3. Mmap maps a disk file into the process address space, allowing reads/writes to operate directly on the file. Four control levels are provided (global → collection → field → index) with the following priority: field/index > collection > global.
queryNode:
  mmap:
    scalarField: false  # scalar field raw data
    scalarIndex: false  # scalar index (only inverted index supported)
    vectorField: false  # vector raw data
    vectorIndex: false  # vector index
  mmapDirPath: /milvus/mmap_data
After changing a collection‑level or index‑level Mmap setting, the collection must be released (release_collection) and loaded again (load_collection) for the change to take effect.
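The four‑level hierarchy behaves like a cascading override. A minimal sketch of the resolution logic (`effective_mmap` is illustrative only, not Milvus internals; `None` stands for "not set at this level"):

```python
from typing import Optional


def effective_mmap(global_cfg: bool,
                   collection_cfg: Optional[bool] = None,
                   field_cfg: Optional[bool] = None) -> bool:
    """Resolve the effective Mmap flag for one field or index.

    Mirrors the documented priority: field/index > collection > global.
    The first level that is explicitly set wins.
    """
    for level in (field_cfg, collection_cfg, global_cfg):
        if level is not None:
            return level
    return False


# Global config disables Mmap, but a collection-level override wins:
print(effective_mmap(False, collection_cfg=True))          # True
# A field-level setting beats both collection and global:
print(effective_mmap(True, collection_cfg=True, field_cfg=False))  # False
```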
When to enable Mmap: large datasets with uneven query frequency, many cold records, or cost‑sensitive workloads where a modest latency increase is acceptable.
When not to enable Mmap: latency‑critical online services or environments with abundant memory.
3. Tiered storage four‑stage workflow
Stage 1 – Lazy Load
load_collection caches only segment metadata (schema, index info, block mapping). No data blocks are loaded, so the call returns almost instantly.
Stage 2 – Warm‑up
Warm‑up pre‑loads specified fields or indexes before they become queryable, eliminating the cold‑start delay. Configuration priority is field/index > collection > cluster. Only two options exist for each target: sync (pre‑load) or disable (skip).
client.create_collection(
    collection_name="products",
    schema=schema,
    properties={
        "warmup.scalarField": "sync",
        "warmup.scalarIndex": "sync",
        "warmup.vectorField": "disable",
        "warmup.vectorIndex": "sync",
    }
)
Stage 3 – Partial Load
During query execution Milvus loads required data blocks on demand. Field data is loaded at block granularity; indexes are loaded as whole segments. This step is automatic.
Stage 4 – Eviction
When memory or disk usage exceeds configured watermarks, Milvus evicts least‑recently‑used data. Two eviction modes are available:
Synchronous eviction: triggered during a query that exceeds the limits; the query pauses until space is reclaimed.
Asynchronous eviction: a background thread periodically cleans up without blocking queries.
Both modes can be enabled simultaneously.
4. Eviction tuning: watermarks and TTL
Watermarks define upper and lower resource thresholds. Exceeding the high watermark triggers eviction; dropping below the low watermark stops it.
queryNode:
  segcore:
    tieredStorage:
      evictionEnabled: true
      backgroundEvictionEnabled: true
      memoryLowWatermarkRatio: 0.75
      memoryHighWatermarkRatio: 0.80
      diskLowWatermarkRatio: 0.75
      diskHighWatermarkRatio: 0.80
      cacheTtl: 604800 # 7 days (seconds)
Guidelines (derived from Milvus documentation):
Keep the gap between the high and low watermarks between 0.05 and 0.10 to avoid frequent eviction or overly long eviction cycles.
Do not set the high watermark above 0.80 to preserve buffer for traffic spikes.
Start with the default 75 % – 80 % range and adjust based on observed utilization.
TTL (Time‑To‑Live) controls how long unused data stays cached; it works only when backgroundEvictionEnabled: true. Typical choices:
Short TTL (hours‑day) for highly dynamic data.
Long TTL (days) for relatively stable data.
TTL = 0 (disabled) for pure hot‑data scenarios.
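The high/low watermark pair forms a hysteresis loop: eviction starts above the high mark and runs until usage drops below the low mark. A minimal sketch of that behavior (illustrative model, not Milvus internals):

```python
def eviction_action(usage: float, low: float = 0.75, high: float = 0.80) -> str:
    """Watermark hysteresis as described above (illustrative only).

    Above the high watermark eviction is triggered; once usage falls
    below the low watermark eviction stops. In the gap between the two,
    whatever the evictor is currently doing simply continues.
    """
    if usage >= high:
        return "start-eviction"
    if usage <= low:
        return "stop-eviction"
    return "keep-current-state"   # inside the gap: no state change


print(eviction_action(0.85))   # over the high watermark
print(eviction_action(0.78))   # inside the 0.75-0.80 gap
```

A gap that is too narrow makes the evictor flap between start and stop; one that is too wide means each eviction cycle frees a large amount of data at once, which is why the guidelines above suggest a 0.05–0.10 spread.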
5. Clustering compaction (query‑performance multiplier)
Clustering compaction physically reorganizes data according to a clustering key, enabling intelligent segment pruning during queries. Example: using user_id as the clustering key allows Milvus to skip 99 % of segments for user_id = 1000 queries.
Benchmark (20 million 768‑dim LAION vectors, clustering key = key of type Int64)
No filter – prune 0 %, latency 1685 ms, QPS 17.75
key > 200 && key < 800 – prune 40.2 %, latency 1045 ms, QPS 28.38
key > 200 && key < 600 – prune 59.8 %, latency 829 ms, QPS 35.78
key > 200 && key < 400 – prune 79.5 %, latency 550 ms, QPS 54.00
key == 1000 – prune 99 %, latency 68 ms, QPS 431.41 (≈ 25× speedup)
Exact‑match queries see the largest benefit; queries without a filter on the clustering key gain no improvement.
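The pruning mechanism can be modeled in a few lines: after clustering compaction each segment holds a narrow range of clustering‑key values, so Milvus only needs to scan segments whose range overlaps the filter. This is an illustrative model (`prune_segments` is not Milvus internals):

```python
def prune_segments(segments, lo, hi):
    """Keep only segments whose [min_key, max_key] range can match the
    filter lo <= key <= hi; all other segments are pruned (skipped).

    Illustrative model of clustering-key segment pruning.
    """
    return [(mn, mx) for (mn, mx) in segments if mx >= lo and mn <= hi]


# After clustering compaction, segments cover disjoint key ranges:
segments = [(0, 249), (250, 499), (500, 749), (750, 999)]

# Range filter "key > 200 && key < 400": only two segments remain to scan.
print(prune_segments(segments, 200, 400))   # [(0, 249), (250, 499)]
```

Without compaction the same keys would be scattered across all segments, every range would overlap the filter, and nothing could be pruned, which matches the "no filter, prune 0 %" row in the benchmark.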
Configuration steps (Python SDK):
from pymilvus import CollectionSchema, FieldSchema, DataType

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="user_id", dtype=DataType.INT64, is_clustering_key=True),
    FieldSchema(name="content", dtype=DataType.VARCHAR, max_length=1000),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768)
]
schema = CollectionSchema(fields, description="User behavior data")
Supported clustering‑key types: Int8, Int16, Int32, Int64, Float, Double, VarChar.
Manual compaction:
# Trigger clustering compaction
collection.compact(is_clustering=True)
# Check status
state = collection.get_compaction_state(is_clustering=True)
# Wait until finished
collection.wait_for_compaction_completed(is_clustering=True)
Automatic compaction can be enabled in dataCoord.compaction with tunable intervals and thresholds (e.g., triggerInterval, minInterval, maxInterval).
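A sketch of the corresponding milvus.yaml fragment. Key names are taken from the clustering‑compaction section of the Milvus configuration reference, but defaults and availability vary by version, so verify them against your own milvus.yaml:

```yaml
dataCoord:
  compaction:
    clustering:
      enable: true           # allow clustering compaction at all
      autoEnable: false      # set true to trigger compaction automatically
      triggerInterval: 600   # seconds between automatic trigger checks
      minInterval: 3600      # minimum seconds between two runs per collection
      maxInterval: 259200    # force a run if none has happened for 3 days
```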
Key considerations for selecting a clustering key:
High‑frequency filter field (e.g., tenant ID, timestamp).
High‑cardinality field (uniform value distribution).
Business‑critical field (tenant ID for multi‑tenant, timestamp for time‑series).
If a collection already defines a partition key, it can be reused as the clustering key via usePartitionKeyAsClusteringKey: true.
6. Scenario‑based configuration templates
Scenario 1 – Real‑time low‑latency retrieval
queryNode:
  segcore:
    tieredStorage:
      warmup:
        scalarField: sync
        scalarIndex: sync
        vectorField: disable # on‑demand
        vectorIndex: sync
      evictionEnabled: true
      backgroundEvictionEnabled: true
      memoryLowWatermarkRatio: 0.75
      memoryHighWatermarkRatio: 0.80
      diskLowWatermarkRatio: 0.75
      diskHighWatermarkRatio: 0.80
      cacheTtl: 0 # disable TTL
Vector indexes are pre‑loaded (sync) to avoid cold‑start latency; raw vectors stay on disk and are fetched as needed.
Scenario 2 – Offline batch analysis
queryNode:
  segcore:
    tieredStorage:
      warmup:
        scalarField: disable
        scalarIndex: disable
        vectorField: disable
        vectorIndex: disable
      evictionEnabled: true
      backgroundEvictionEnabled: true
      memoryLowWatermarkRatio: 0.70
      memoryHighWatermarkRatio: 0.85
      diskLowWatermarkRatio: 0.70
      diskHighWatermarkRatio: 0.85
      cacheTtl: 86400 # 1 day
All warm‑up is disabled for fast loading; the wider watermark range tolerates resource spikes; a TTL of one day releases stale data after the batch job.
Scenario 3 – Mixed deployment
# Online collection – critical data pre‑loaded
client.create_collection(
    collection_name="online_search",
    schema=schema,
    properties={
        "warmup.scalarField": "sync",
        "warmup.scalarIndex": "sync",
        "warmup.vectorField": "disable",
        "warmup.vectorIndex": "sync",
    }
)

# Offline collection – everything on‑demand
client.create_collection(
    collection_name="offline_analysis",
    schema=schema,
    properties={
        "warmup.scalarField": "disable",
        "warmup.scalarIndex": "disable",
        "warmup.vectorField": "disable",
        "warmup.vectorIndex": "disable",
    }
)
Global watermarks are set to middle values; per‑collection warm‑up decides the latency vs. resource trade‑off.
7. Tuning advice and common pitfalls
Start conservatively: enable Mmap first and observe the impact; if memory is still insufficient, switch to tiered storage; enable clustering compaction only when queries filter on the clustering key.
Monitor key metrics: cache hit rate, memory/disk utilization, eviction frequency, P99 query latency.
Configuration interactions: warmup=sync + mmap=true stores pre‑loaded data on disk; warmup=sync + mmap=false stores it in memory. Mixing them unintentionally can nullify the expected latency gains.
Release before reload: after changing collection‑level Mmap settings, always call release_collection and then load_collection.
Clustering compaction cost: compaction consumes CPU and I/O; schedule it during low‑traffic periods and set a reasonable minInterval to avoid frequent runs.
Over‑commit ratio: tiered storage allows the total data size to exceed memory, but keep the over‑commit ratio below 0.7 to prevent excessive eviction and latency spikes.
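The warmup/Mmap interaction above reduces to a small decision table. A sketch (`preload_location` is illustrative shorthand, not a Milvus API):

```python
def preload_location(warmup: str, mmap_enabled: bool) -> str:
    """Where data ends up after loading, per the interaction rule above.

    Illustrative only: warmup='sync' + mmap=True  -> mapped file on disk,
                       warmup='sync' + mmap=False -> resident in memory,
                       warmup='disable'           -> fetched on demand.
    """
    if warmup != "sync":
        return "loaded on demand"
    return "disk (mmap)" if mmap_enabled else "memory"


# Pre-loading with Mmap enabled warms the disk cache, not RAM, so it
# will not deliver in-memory latency even though warm-up looks "on":
print(preload_location("sync", True))    # disk (mmap)
print(preload_location("sync", False))   # memory
```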