Implementing Full‑Text Document Search with Elasticsearch and Milvus
This article describes how to combine Elasticsearch’s keyword matching with Milvus’s vector‑based semantic search to build a scalable document search service, covering data preprocessing, architecture, query handling, custom scoring, DSL configuration, and result merging.
After introducing the growing need for efficient document search in large teams, the article presents a practical implementation that integrates Elasticsearch (ES) and Milvus to handle around 80,000 documents with a click‑through rate of roughly 65%.
The architecture consists of two main pipelines: a data‑cleaning and indexing stage that splits documents, classifies sections, generates embeddings, and stores them in Milvus, and a query stage where the backend first creates a vector representation of the user’s query, calls Milvus for vector search, and then invokes ES for keyword‑based recall.
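The splitting and classification step of the indexing pipeline can be illustrated with a minimal sketch. It assumes the documents are Markdown, that "# " marks a title and "## " an h2 heading, and that everything else is body text; the article does not state its input format or fragment schema, so the Fragment shape and the splitDocument name are hypothetical. Embedding generation and the Milvus insert are left out.

```typescript
// Hypothetical fragment shape; the article does not specify its schema.
type SectionType = "title" | "h2" | "body";

interface Fragment {
  docId: string;
  type: SectionType;
  text: string;
}

// Split a document into classified fragments.
// Assumption: Markdown input, "# " = title, "## " = h2, anything else = body.
function splitDocument(docId: string, markdown: string): Fragment[] {
  const fragments: Fragment[] = [];
  let body: string[] = [];

  // Emit the accumulated body lines as one fragment, then reset the buffer.
  const flushBody = (): void => {
    const text = body.join("\n").trim();
    if (text.length > 0) fragments.push({ docId, type: "body", text });
    body = [];
  };

  for (const line of markdown.split("\n")) {
    if (line.startsWith("## ")) {
      flushBody();
      fragments.push({ docId, type: "h2", text: line.slice(3).trim() });
    } else if (line.startsWith("# ")) {
      flushBody();
      fragments.push({ docId, type: "title", text: line.slice(2).trim() });
    } else {
      body.push(line);
    }
  }
  flushBody();
  return fragments;
}
```

Each fragment keeps its docId so that later stages can trace a hit back to its parent document, which matters once several fragments of the same document match a query.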
Custom scoring is applied to boost results based on content type, e.g., titles receive a weight of 2, h2 headings 1.5, and body text 1. The final ranking merges ES and Milvus results after normalizing scores, de‑duplicating, and applying additional weighting to favor Milvus‑derived hits.
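The content-type weighting can be sketched as a small re-scoring function. The weights (2, 1.5, 1) come from the article; the ScoredHit shape and the decision to apply the weights in backend code are assumptions, since the article does not say whether the boost is applied inside ES or afterwards.

```typescript
type ContentType = "title" | "h2" | "body";

// Weights stated in the article: title 2, h2 1.5, body 1.
const TYPE_WEIGHT: Record<ContentType, number> = { title: 2, h2: 1.5, body: 1 };

// Hypothetical hit shape used only for this illustration.
interface ScoredHit {
  id: string;
  type: ContentType;
  score: number;
}

// Multiply each score by the weight of its content type, then sort descending.
function applyTypeWeight(hits: ScoredHit[]): ScoredHit[] {
  return hits
    .map((h) => ({ ...h, score: h.score * TYPE_WEIGHT[h.type] }))
    .sort((a, b) => b.score - a.score);
}
```

With these weights, a title match with a raw score of 0.5 ends up ahead of a body match with a raw score of 0.8.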
For the vector‑search part, the article shows a Node.js SDK call to Milvus with parameters such as partition_names, nprobe, metric_type, limit, offset, and an optional filter. It explains why batch retrieval and client‑side pagination are used to avoid duplicate fragments caused by document slicing.
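One way to picture the batch retrieval and client-side pagination is the sketch below. Because a document is sliced into several fragments, two pages fetched with a server-side offset can each contain a fragment of the same document, and de-duplicating one page at a time cannot catch that. Fetching one larger batch, collapsing it to one hit per document, and slicing pages locally avoids the problem. The SearchFn type stands in for whatever Milvus client call the backend makes and is hypothetical, as are the VectorHit shape and the batch size of 100. The sketch also assumes a metric where a higher score means a closer match.

```typescript
// Hypothetical hit shape; assumes higher score = more similar.
interface VectorHit {
  fragmentId: string;
  docId: string;
  score: number;
}

// Hypothetical wrapper around the real Milvus search call.
type SearchFn = (limit: number) => Promise<VectorHit[]>;

// Keep only the best-scoring fragment for each document, sorted by score.
function dedupeByDocument(hits: VectorHit[]): VectorHit[] {
  const best = new Map<string, VectorHit>();
  for (const hit of hits) {
    const current = best.get(hit.docId);
    if (current === undefined || hit.score > current.score) {
      best.set(hit.docId, hit);
    }
  }
  return Array.from(best.values()).sort((a, b) => b.score - a.score);
}

// Slice one page (1-based) out of an already de-duplicated list.
function paginate(hits: VectorHit[], page: number, pageSize: number): VectorHit[] {
  const start = (page - 1) * pageSize;
  return hits.slice(start, start + pageSize);
}

// Fetch one batch, de-duplicate it, and page through it on the client.
async function searchPage(
  search: SearchFn,
  page: number,
  pageSize: number,
  batchSize = 100, // assumed value, not from the article
): Promise<VectorHit[]> {
  const batch = await search(batchSize);
  return paginate(dedupeByDocument(batch), page, pageSize);
}
```

The trade-off is that pages beyond the batch size are unreachable unless the batch is enlarged, which is usually acceptable for a search UI where users rarely go past the first few pages.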
The ES side relies on a JSON DSL that combines multiple sub‑queries: title, content, code block, and enhanced searches. Title queries use match and match_phrase_prefix with minimum_should_match: '80%' and slop: 2; content uses strict match_phrase; code blocks use match_phrase_prefix and wildcard. Additional enhancements add wildcard clauses for non‑Chinese terms and for numeric or variable names.
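A DSL of this kind might be assembled as in the sketch below. The field names title, content, and code are assumptions, since the article does not list its index mapping. Attaching minimum_should_match to the match clause and slop to the match_phrase_prefix clause is one reading of the description. The rule for when to add wildcard clauses (whitespace-separated tokens made only of ASCII letters, digits, and underscores) is likewise a hypothetical stand-in for the article's "non-Chinese terms and numeric or variable names".

```typescript
type QueryClause = Record<string, unknown>;

interface SearchDsl {
  query: { bool: { should: QueryClause[]; minimum_should_match: number } };
}

// Build a bool/should query. Field names "title", "content", "code" are assumed.
function buildSearchDsl(query: string): SearchDsl {
  const should: QueryClause[] = [
    // Title: tolerant term match plus a prefix phrase match with some slack.
    { match: { title: { query, minimum_should_match: "80%" } } },
    { match_phrase_prefix: { title: { query, slop: 2 } } },
    // Content: strict phrase match.
    { match_phrase: { content: { query } } },
    // Code blocks: prefix phrase match.
    { match_phrase_prefix: { code: { query } } },
  ];

  // Enhancement (assumed rule): add a wildcard clause for each ASCII token,
  // which covers identifiers and numbers but skips Chinese text.
  const asciiTokens = query.split(/\s+/).filter((t) => /^[A-Za-z0-9_]+$/.test(t));
  for (const token of asciiTokens) {
    should.push({ wildcard: { code: { value: `*${token}*` } } });
  }

  // At least one of the should clauses has to match.
  return { query: { bool: { should, minimum_should_match: 1 } } };
}
```

Leading wildcards such as `*token*` can be expensive on large indices, so restricting them to identifier-like tokens, as here, keeps the extra cost bounded.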
Result merging follows a five‑step process: (1) decide the proportion of ES vs. Milvus hits (default 6:4, adjusted to 8:2 for variable‑name queries); (2) normalize scores by dividing by each engine’s maximum; (3) apply a multiplier greater than 1 to ES scores so that ES results can rank ahead of Milvus results when desired; (4) boost Milvus scores above a threshold of 0.7; and (5) de‑duplicate and re‑rank the combined list.
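The five steps can be sketched as a single merge function. The 6:4 and 8:2 proportions and the 0.7 threshold come from the article. The ES multiplier and Milvus boost values used in the usage note (1.2 and 1.1) are placeholders, because the article gives no concrete numbers. Applying the threshold to the normalized score, de-duplicating by docId, and treating higher Milvus scores as better are also assumptions.

```typescript
interface Hit {
  docId: string;
  score: number;
  source: "es" | "milvus";
}

interface MergeOptions {
  total: number;           // how many results to return overall
  esShare: number;         // 0.6 by default, 0.8 for variable-name queries
  esMultiplier: number;    // assumed value; the article gives no number
  milvusThreshold: number; // 0.7 in the article
  milvusBoost: number;     // assumed value; the article gives no number
}

const byScoreDesc = (a: Hit, b: Hit): number => b.score - a.score;

// Step 2: divide every score by the engine's maximum score.
function normalize(hits: Hit[]): Hit[] {
  const max = Math.max(0, ...hits.map((h) => h.score));
  return hits.map((h) => ({ ...h, score: max > 0 ? h.score / max : 0 }));
}

function mergeResults(es: Hit[], milvus: Hit[], opts: MergeOptions): Hit[] {
  // Step 1: split the result budget between the two engines.
  const esQuota = Math.round(opts.total * opts.esShare);
  const milvusQuota = opts.total - esQuota;
  const esTop = es.slice().sort(byScoreDesc).slice(0, esQuota);
  const milvusTop = milvus.slice().sort(byScoreDesc).slice(0, milvusQuota);

  // Steps 2 and 3: normalize, then scale ES scores.
  const esScaled = normalize(esTop).map((h) => ({
    ...h,
    score: h.score * opts.esMultiplier,
  }));

  // Steps 2 and 4: normalize, then boost confident Milvus hits.
  const milvusScaled = normalize(milvusTop).map((h) => ({
    ...h,
    score: h.score > opts.milvusThreshold ? h.score * opts.milvusBoost : h.score,
  }));

  // Step 5: keep the higher-scoring entry per document, then re-rank.
  const best = new Map<string, Hit>();
  for (const hit of esScaled.concat(milvusScaled)) {
    const current = best.get(hit.docId);
    if (current === undefined || hit.score > current.score) best.set(hit.docId, hit);
  }
  return Array.from(best.values()).sort(byScoreDesc);
}
```

A call such as `mergeResults(esHits, milvusHits, { total: 10, esShare: 0.6, esMultiplier: 1.2, milvusThreshold: 0.7, milvusBoost: 1.1 })` would cover the default case; passing `esShare: 0.8` corresponds to the variable-name adjustment.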
In conclusion, the combined ES + Milvus approach yields better relevance than using either engine alone, though the current click‑through rate remains modest and further DSL tuning, model optimization, and feature enhancements are planned.
Rare Earth Juejin Tech Community
Juejin, a tech community that helps developers grow.