Implementing Full‑Text Document Search with Elasticsearch and Milvus

This article describes how to combine Elasticsearch’s keyword matching with Milvus’s vector‑based semantic search to build a scalable document search service, covering data preprocessing, architecture, query handling, custom scoring, DSL configuration, and result merging.

Rare Earth Juejin Tech Community

Jul 18, 2024

After introducing the growing need for efficient document search in large teams, the article presents a practical implementation that integrates Elasticsearch (ES) and Milvus to handle around 80,000 documents with a click‑through rate of roughly 65%.

The architecture consists of two main pipelines. The data‑cleaning and indexing stage splits documents, classifies sections, generates embeddings, and stores them in Milvus. In the query stage, the backend first creates a vector representation of the user’s query, calls Milvus for vector search, and then invokes ES for keyword‑based recall.
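The query stage can be sketched as a small orchestration function. This is a minimal illustration only: `run_query` and its three callables (`embed`, `vector_search`, `keyword_search`) are hypothetical names standing in for the embedding model, the Milvus call, and the ES call, none of which are specified in the article.

```python
from typing import Any, Callable, Dict, List, Sequence

Hit = Dict[str, Any]  # assumed shape, e.g. {"doc_id": "a", "score": 0.9}


def run_query(
    query: str,
    embed: Callable[[str], Sequence[float]],
    vector_search: Callable[[Sequence[float]], List[Hit]],
    keyword_search: Callable[[str], List[Hit]],
) -> Dict[str, List[Hit]]:
    """Run the query stage in the order described: embed, vector recall, keyword recall."""
    query_vector = embed(query)                # 1. vector representation of the query
    vector_hits = vector_search(query_vector)  # 2. semantic recall (Milvus in the article)
    keyword_hits = keyword_search(query)       # 3. keyword recall (ES in the article)
    return {"milvus": vector_hits, "es": keyword_hits}
```

Passing the engines in as callables keeps the flow testable without a running Milvus or ES cluster; the two hit lists are merged in a later step.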

Custom scoring is applied to boost results based on content type, e.g., titles receive a weight of 2, h2 headings 1.5, and body text 1. The final ranking merges ES and Milvus results after normalizing scores, de‑duplicating, and applying additional weighting to favor Milvus‑derived hits.
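A minimal sketch of the content‑type weighting, assuming the weight is applied as a plain multiplier on the raw score. The weights come from the text above; the function name, the section labels, and the fallback of 1.0 for unknown types are illustrative assumptions.

```python
# Weights from the article: title 2, h2 heading 1.5, body text 1.
SECTION_WEIGHTS = {"title": 2.0, "h2": 1.5, "body": 1.0}


def weighted_score(raw_score: float, section_type: str) -> float:
    """Scale a raw relevance score by the weight of the section it came from.

    Unknown section types fall back to a weight of 1.0 (an assumption).
    """
    return raw_score * SECTION_WEIGHTS.get(section_type, 1.0)
```

With these weights, a title fragment and a body fragment that share the same raw score end up a factor of two apart.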

For the vector‑search part, the article shows a Node.js SDK call to Milvus with parameters such as partition_names, nprobe, metric_type, limit, offset, and optional filter. It explains why batch retrieval and client‑side pagination are used to avoid duplicate fragments caused by document slicing.
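The batch‑then‑paginate idea can be illustrated without the SDK call itself. The sketch below assumes each hit is a dict with hypothetical `doc_id` and `score` fields and that a higher score is better (as with inner‑product or cosine metrics; a distance metric such as L2 would need the comparison reversed).

```python
from typing import Any, Dict, List

Hit = Dict[str, Any]  # assumed shape: {"doc_id": str, "score": float, ...}


def dedupe_and_paginate(hits: List[Hit], page: int, page_size: int) -> List[Hit]:
    """Keep the best-scoring fragment per document, then slice out one page.

    `hits` is the whole batch fetched from the vector store in one call;
    `page` is 1-based.
    """
    best: Dict[str, Hit] = {}
    for hit in hits:
        doc_id = hit["doc_id"]
        # Several fragments of one document may be returned; keep only the best.
        if doc_id not in best or hit["score"] > best[doc_id]["score"]:
            best[doc_id] = hit
    ranked = sorted(best.values(), key=lambda h: h["score"], reverse=True)
    start = (page - 1) * page_size
    return ranked[start:start + page_size]
```

Paginating on the server with `limit` and `offset` alone would count fragments rather than documents, so a page could contain the same document more than once; de‑duplicating the batch first avoids that.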

The ES side relies on a JSON DSL that combines multiple sub‑queries: title, content, code block, and enhanced searches. Title queries use match and match_phrase_prefix with minimum_should_match: '80%' and slop: 2; content uses strict match_phrase; code blocks use match_phrase_prefix and wildcard. Additional enhancements add wildcard clauses for non‑Chinese terms and numeric or variable names.
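One way such a DSL body might be assembled is shown below as a Python dict. The query types and the `minimum_should_match` and `slop` values come from the description above; the field names `title`, `content`, and `code`, the top‑level `bool`/`should` layout, and the identifier‑detecting regex are assumptions made for illustration.

```python
import re
from typing import Any, Dict, List

# Assumption: a query made only of ASCII letters, digits and a few symbols
# is treated as a non-Chinese term, number, or variable name.
IDENTIFIER_LIKE = re.compile(r"[A-Za-z0-9_.$-]+")


def build_es_query(text: str) -> Dict[str, Any]:
    """Combine title, content, code-block and enhanced sub-queries in one bool query."""
    should: List[Dict[str, Any]] = [
        # Title: tolerant match plus prefix-phrase match.
        {"match": {"title": {"query": text, "minimum_should_match": "80%"}}},
        {"match_phrase_prefix": {"title": {"query": text, "slop": 2}}},
        # Content: strict phrase match.
        {"match_phrase": {"content": {"query": text}}},
        # Code blocks: prefix-phrase match and wildcard.
        {"match_phrase_prefix": {"code": {"query": text}}},
        {"wildcard": {"code": {"value": f"*{text}*"}}},
    ]
    if IDENTIFIER_LIKE.fullmatch(text):
        # Enhanced search: extra wildcard clause for identifier-like input.
        should.append({"wildcard": {"content": {"value": f"*{text}*"}}})
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}
```

Wildcard patterns with a leading `*` are expensive in Elasticsearch, which is one reason to add them only for identifier‑like input rather than for every query.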

Result merging follows a five‑step process: (1) decide the proportion of ES to Milvus hits (default 6:4, adjusted to 8:2 for variable‑name queries); (2) normalize scores by dividing by each engine’s maximum; (3) scale ES scores by a tunable coefficient so that ES results do not outrank Milvus results when that is undesirable (the source gives the coefficient as greater than 1, which has this effect only if it is applied as a divisor); (4) boost Milvus scores above a threshold of 0.7; and (5) de‑duplicate and re‑rank the combined list.
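The five steps can be sketched as follows. This is an illustration under stated assumptions, not the article's implementation: the hit shape (`doc_id`, `score`), the `es_factor` and `milvus_boost` parameters and their defaults, and applying the 0.7 threshold to the normalized score are all assumptions. Step (3) is exposed as a plain `es_factor` parameter because its exact value and direction are tuning choices not pinned down here.

```python
from typing import Any, Dict, List

Hit = Dict[str, Any]  # assumed shape: {"doc_id": str, "score": float}


def _normalize(hits: List[Hit]) -> List[Hit]:
    """Step 2: divide every score by the engine's maximum score."""
    top = max((h["score"] for h in hits), default=0.0)
    if top <= 0:
        return [dict(h, score=0.0) for h in hits]
    return [dict(h, score=h["score"] / top) for h in hits]


def merge_results(
    es_hits: List[Hit],
    milvus_hits: List[Hit],
    limit: int = 10,
    es_ratio: float = 0.6,          # 6:4 by default; 0.8 for variable-name queries
    es_factor: float = 1.0,         # hypothetical tuning knob for step 3
    milvus_threshold: float = 0.7,  # threshold from the article
    milvus_boost: float = 1.2,      # hypothetical boost value for step 4
) -> List[Hit]:
    # Step 1: split the result budget between the two engines.
    es_quota = round(limit * es_ratio)
    milvus_quota = limit - es_quota
    by_score = lambda h: h["score"]
    es_top = sorted(es_hits, key=by_score, reverse=True)[:es_quota]
    mv_top = sorted(milvus_hits, key=by_score, reverse=True)[:milvus_quota]

    # Steps 2 and 3: normalize, then scale ES scores.
    merged = [dict(h, score=h["score"] * es_factor, source="es")
              for h in _normalize(es_top)]
    # Steps 2 and 4: normalize, then boost confident Milvus hits.
    for h in _normalize(mv_top):
        score = h["score"]
        if score > milvus_threshold:
            score *= milvus_boost
        merged.append(dict(h, score=score, source="milvus"))

    # Step 5: keep the best entry per document and re-rank.
    best: Dict[str, Hit] = {}
    for h in merged:
        if h["doc_id"] not in best or h["score"] > best[h["doc_id"]]["score"]:
            best[h["doc_id"]] = h
    return sorted(best.values(), key=by_score, reverse=True)
```

Normalizing per engine first is what makes the later scaling meaningful, since raw ES relevance scores and Milvus similarity scores are not on a common scale.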

In conclusion, the combined ES + Milvus approach yields better relevance than using either engine alone. The click‑through rate of roughly 65% still leaves room for improvement, and further DSL tuning, model optimization, and feature enhancements are planned.
