Building and Optimizing a Milvus‑Based Vector Search Platform
This article describes the background, technology selection, architecture, deployment, performance tuning, and operational practices of a Milvus-driven vector retrieval platform, covering cloud-native deployment, index choices, capacity planning, and real-world application cases that reduce recall latency and improve resource efficiency.
With the rapid development of computer and machine‑learning technologies, feature vectors have become a common way to describe multimedia data, and vector retrieval has turned into a universal demand across many online services.
At Home (the automotive data platform), nine separate vector retrieval engines had been deployed, leading to wasted resources, high maintenance costs, duplicated custom development, and performance bottlenecks, most notably in the Vearch engine.
After evaluating open‑source solutions, the team selected Milvus as the underlying engine for a new vector search platform because of its stable, highly available, maintainable, feature‑rich, and high‑performance architecture.
Milvus 2.x Architecture: The system is built as microservices with distinct roles: etcd for metadata, object storage for vectors, Proxy as a unified access layer, DataNode/DataCoord for writes, IndexNode/IndexCoord for index building, QueryNode/QueryCoord for queries, and RootCoord for DDL coordination and global timestamps. Worker nodes (IndexNode, QueryNode, DataNode) handle the actual vector operations, while Coord nodes manage task distribution.
Cloud-Native Support: Milvus components are stateless; metadata lives in etcd and vector data in object storage, allowing native Kubernetes deployment with dynamic scaling of individual roles based on load.
Infrastructure & Deployment: The platform runs Milvus clusters on the Home cloud K8s cluster, leveraging Prometheus for monitoring and Elasticsearch for logging. Indexes include IVF-FLAT for small collections (≈100k vectors) and HNSW for large-scale data, with recommended nlist/nprobe tuning.
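As a sketch of the index split described above, the choice can be encoded as pymilvus-style parameter dictionaries. The size threshold and the specific nlist/nprobe/M/ef values here are illustrative starting points, not the platform's production settings:

```python
def pick_index_params(num_vectors: int, small_threshold: int = 100_000) -> dict:
    """Choose index parameters based on collection size.

    Threshold and parameter values are illustrative defaults,
    not the platform's exact production configuration.
    """
    if num_vectors <= small_threshold:
        # IVF-FLAT keeps exact vectors inside coarse clusters.
        # nlist ~ sqrt(N) is a common starting point; nprobe trades
        # recall against latency at search time.
        nlist = max(1, int(num_vectors ** 0.5))
        return {
            "index_type": "IVF_FLAT",
            "metric_type": "L2",
            "params": {"nlist": nlist},
            "search_params": {"nprobe": max(1, nlist // 16)},
        }
    # HNSW suits large collections: M controls graph degree,
    # efConstruction controls build quality, and ef is the
    # search-time recall/latency knob.
    return {
        "index_type": "HNSW",
        "metric_type": "L2",
        "params": {"M": 16, "efConstruction": 200},
        "search_params": {"ef": 64},
    }
```

In practice the returned `params` dictionary would be passed to the collection's index-creation call, and `search_params` supplied per query.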
Replica & Sharding Strategy: For small collections, a single shard with multiple replicas is advised; for large collections (millions of vectors), the number of shards is bounded by available memory, and the replica count should match the number of QueryNode instances. Capacity planning from performance tests shows each QueryNode (12 CPU / 16 GB) sustains ~500 QPS, each Proxy (4 CPU / 8 GB) ~1,200 QPS, and a full platform instance ~3,500 QPS.
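Using the per-node throughput figures from those performance tests, a back-of-the-envelope sizing calculation looks like the following. The safety factor is an assumed margin for traffic spikes, not a number from the source:

```python
import math

def plan_capacity(target_qps: float,
                  qps_per_querynode: float = 500,
                  qps_per_proxy: float = 1200,
                  safety_factor: float = 1.4) -> dict:
    """Estimate node counts for a target QPS.

    Per-node throughput defaults come from the performance tests above
    (QueryNode 12 CPU/16 GB ~500 QPS, Proxy 4 CPU/8 GB ~1,200 QPS).
    The 1.4x safety factor is an assumed headroom margin.
    """
    peak_qps = target_qps * safety_factor  # plan for spikes above target
    return {
        "query_nodes": math.ceil(peak_qps / qps_per_querynode),
        "proxies": math.ceil(peak_qps / qps_per_proxy),
    }
```

For the ~3,500 QPS a full platform instance handles, this sketch suggests roughly ten QueryNodes and five Proxies once headroom is included.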
Platform Optimizations:
Weak-consistency queries no longer request timestamps from RootCoord; instead, the Proxy assigns timestamps locally, eliminating one RPC round-trip and lowering TP99 latency.
When allocating Segments, the rebalance checker is disabled so that Segments are placed on the shard‑leader node, eliminating cross‑node queries for large collections.
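The timestamp optimization above can be sketched as a proxy-local allocator. This assumes Milvus's hybrid-timestamp layout (physical milliseconds in the high bits, an 18-bit logical counter in the low bits); the class and method names are hypothetical, not Milvus internals:

```python
import itertools
import time

# Assumed hybrid-timestamp layout: high bits = physical time in ms,
# low 18 bits = logical counter.
LOGICAL_BITS = 18

class LocalTsAllocator:
    """Illustrative proxy-local timestamp allocator.

    For weak-consistency searches, the Proxy can stamp requests itself
    instead of fetching a timestamp from RootCoord over RPC; that RPC
    round-trip is exactly what the optimization above removes.
    """

    def __init__(self) -> None:
        self._logical = itertools.count()

    def allocate(self) -> int:
        physical_ms = int(time.time() * 1000)
        logical = next(self._logical) & ((1 << LOGICAL_BITS) - 1)
        return (physical_ms << LOGICAL_BITS) | logical
```

Strong-consistency reads would still need a globally ordered timestamp from RootCoord; this shortcut only applies when eventual consistency is acceptable.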
Application Case (Non-Plain-Text Recall): The platform supports 23 algorithmic vector models and 24 non-plain-text recall pipelines, handling vector ingestion from Hive, periodic synchronization to Milvus A/B tables, and online model deployment. Compared to Vearch, the new system reduces recall timeout rates by 3-7x and overall latency by 55%, while simplifying integration and cutting deployment time in half.
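The periodic Hive-to-Milvus sync with A/B tables can be sketched as two physical collections behind one serving alias: each sync loads the standby copy, then flips the alias (in practice via Milvus's collection-alias API). The collection names and helper functions here are illustrative, not the platform's:

```python
def next_ab_collection(alias_map: dict, alias: str, base: str) -> str:
    """Return the standby collection to load the next Hive export into.

    A/B pattern: two physical collections ("<base>_a" / "<base>_b")
    sit behind one serving alias; whichever is not currently serving
    receives the fresh data. Names are illustrative.
    """
    serving = alias_map.get(alias, f"{base}_a")
    return f"{base}_b" if serving == f"{base}_a" else f"{base}_a"

def flip_alias(alias_map: dict, alias: str, standby: str) -> None:
    """Point the serving alias at the freshly loaded collection."""
    alias_map[alias] = standby
```

Because readers only ever query the alias, the switch is atomic from their point of view, and the previous collection remains available for rollback until the next sync cycle.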
Future Plans: Adopt Milvus's upsert functionality (not yet released at the time of writing) and collaborate with the community to reduce strong-consistency query latency for deduplication scenarios.
HomeTech tech sharing