Enterprise Semantic Search: Key Q&A on Scoring, Recall, LSH, Chunking, and Embedding Dimensions
This article answers practical questions about enterprise semantic search, explaining how Reciprocal Rank Fusion normalizes mixed scoring, how to control vector result size, the trade‑offs of LSH parameters, word‑ and sentence‑based chunking strategies with version‑specific defaults, and flexible embedding dimensionality.
Scoring Normalization with RRF
Vector scores typically lie in the 0‑1 range, while keyword scores (e.g., TF‑IDF, BM25) are unbounded, so a weighted sum of the two is not meaningful. The core issue in hybrid search is that the two scores live on different scales. Reciprocal Rank Fusion (RRF), introduced as a paid feature in Elasticsearch 8.9, sidesteps the problem by combining the rank position of each document in every result list rather than its raw score. This "ranking democracy" works without tuning and across unrelated relevance indicators.
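A minimal sketch of the rank-based fusion idea follows. It is an illustration, not Elasticsearch's implementation: the function name `rrf_fuse` and the document IDs are hypothetical, and the constant 60 is the value commonly used for RRF. Each document receives 1 / (rank_constant + rank) from every list it appears in, and the contributions are summed.

```python
from collections import defaultdict


def rrf_fuse(rankings, rank_constant=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    Only positions matter: raw vector or BM25 scores are never compared.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists, each already sorted by its own scoring method.
vector_ranking = ["doc_a", "doc_b", "doc_c"]   # e.g. by cosine similarity
keyword_ranking = ["doc_c", "doc_a", "doc_d"]  # e.g. by BM25

fused = rrf_fuse([vector_ranking, keyword_ranking])
print(fused)
```

Documents that rank well in both lists (here `doc_a` and `doc_c`) rise to the top, regardless of how large or small their original scores were.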
Recall and Result Size Controls
A vector search reports the total number of hits, but the number of results actually returned is controlled separately: setting size: 10 limits the final output to ten documents. The semantic query may also specify candidates: 50, in which case the vector stage retrieves fifty candidates before they are merged with the keyword results, so the reported hit count can exceed ten even though only ten documents come back. Setting candidates equal to size keeps the work per query to a minimum, while raising candidates and leaving size unchanged gives the ranking a larger pool to choose from and can improve result quality.
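The interaction between the two settings can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not an Elasticsearch request: `vector_stage` and `hybrid_search` are hypothetical helpers, and the merge is plain concatenation with de-duplication, whereas a real engine would re-rank the merged pool.

```python
def vector_stage(vector_scores, candidates):
    """Return the IDs of the `candidates` highest-scoring vector hits."""
    ranked = sorted(vector_scores, key=lambda d: (-vector_scores[d], d))
    return ranked[:candidates]


def hybrid_search(vector_scores, keyword_hits, candidates, size):
    """Merge vector candidates with keyword hits, then keep `size` results."""
    pool = vector_stage(vector_scores, candidates)
    # De-duplicate while preserving order (simplified merge, no re-ranking).
    merged = list(dict.fromkeys(pool + keyword_hits))
    return {"total": len(merged), "hits": merged[:size]}


# 100 hypothetical documents with strictly decreasing vector scores.
vector_scores = {f"doc{i}": 1.0 - i / 100 for i in range(100)}
keyword_hits = ["doc3", "kw1", "kw2"]

result = hybrid_search(vector_scores, keyword_hits, candidates=50, size=10)
print(result["total"], len(result["hits"]))
```

The merged pool holds more documents than `size`, but only `size` of them are returned, which is why the reported total and the length of the result list can differ.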
LSH Parameter Guidance
LSH (Locality‑Sensitive Hashing) uses two key parameters:
L : the number of hash tables; increasing L improves recall but adds storage cost and query latency.
k : the number of hash functions per table; increasing k improves precision but reduces recall and raises computation cost.
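The trade-off between L and k can be made concrete with the standard collision-probability formula. Assuming the hash functions are independent and that two items agree on a single hash function with probability p, they share a bucket in one table with probability p^k and become candidates in at least one of L tables with probability 1 - (1 - p^k)^L. The function name and the sample values of p below are illustrative assumptions.

```python
def candidate_probability(p, k, L):
    """Probability that two items collide in at least one of L hash tables.

    p: probability the items agree on one hash function (higher = more similar)
    k: hash functions concatenated per table
    L: number of independent tables
    """
    return 1.0 - (1.0 - p ** k) ** L


# Compare a similar pair (p = 0.8) with a dissimilar pair (p = 0.3).
for k, L in [(4, 1), (4, 8), (8, 8)]:
    similar = candidate_probability(0.8, k, L)
    dissimilar = candidate_probability(0.3, k, L)
    print(f"k={k} L={L} similar={similar:.4f} dissimilar={dissimilar:.4f}")
```

Raising L lifts the probability for both pairs (more recall, more tables to store and probe), while raising k lowers it for both but much more sharply for the dissimilar pair (more precision, less recall).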
Chunking Large Documents for Vectorization
Long source fields degrade embedding quality and can exceed the model's token limit. The solution is to split documents into smaller chunks and embed each chunk separately.
Word‑based Chunking
max_chunk_size : maximum number of words per chunk (required).
overlap : number of overlapping words between consecutive chunks (required, ≤ ½ max_chunk_size).
Mechanism: fill a chunk to the maximum size, then start the next chunk, overlapping the specified word count to preserve context.
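The mechanism can be sketched in a few lines of Python. This is an illustration of the idea only, not the Elasticsearch implementation: `chunk_words` is a hypothetical helper and words are split on whitespace, which is an assumption about tokenization.

```python
def chunk_words(text, max_chunk_size, overlap):
    """Split text into chunks of at most max_chunk_size words.

    Consecutive chunks share `overlap` words to preserve context.
    """
    if max_chunk_size < 1 or overlap < 0:
        raise ValueError("max_chunk_size must be >= 1 and overlap >= 0")
    if overlap * 2 > max_chunk_size:
        raise ValueError("overlap must not exceed half of max_chunk_size")

    words = text.split()
    step = max_chunk_size - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_chunk_size]))
        if start + max_chunk_size >= len(words):
            break  # the last window already reached the end of the text
        start += step
    return chunks


sample = "one two three four five six seven eight nine ten"
word_chunks = chunk_words(sample, max_chunk_size=4, overlap=1)
for chunk in word_chunks:
    print(chunk)
```

With a chunk size of 4 and an overlap of 1, each chunk begins with the last word of the previous one.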
Sentence‑based Chunking
max_chunk_size : maximum number of words per chunk (required).
sentence_overlap : number of overlapping sentences between chunks (required, 0 or 1).
Mechanism: split input into blocks that contain complete sentences; each block (except the first) shares the overlapping sentences with the previous block, prioritizing sentence integrity over full block fill.
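A simplified sketch of sentence-based chunking is shown below. It is not the Elasticsearch implementation: `chunk_sentences` is a hypothetical helper, sentences are detected with a naive regular expression, and a single sentence longer than max_chunk_size is kept whole rather than split further.

```python
import re


def chunk_sentences(text, max_chunk_size, sentence_overlap):
    """Group complete sentences into chunks of roughly max_chunk_size words."""
    if sentence_overlap not in (0, 1):
        raise ValueError("sentence_overlap must be 0 or 1")

    # Naive sentence splitter: break after '.', '!' or '?' followed by spaces.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        if current and count + n_words > max_chunk_size:
            chunks.append(" ".join(current))
            # Carry the last sentence over when an overlap is requested.
            # The carried sentence may push the next chunk slightly past
            # max_chunk_size; this sketch favors sentence integrity.
            current = current[-1:] if sentence_overlap else []
            count = sum(len(s.split()) for s in current)
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "Alpha beta gamma. Delta epsilon. Zeta eta theta iota. Kappa."
sentence_chunks = chunk_sentences(sample, max_chunk_size=6, sentence_overlap=1)
for chunk in sentence_chunks:
    print(chunk)
```

Chunks end at sentence boundaries even when that leaves them shorter than the maximum, and with an overlap of 1 each chunk after the first starts with the final sentence of its predecessor.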
Default settings changed in Elasticsearch 8.16:
From 8.16 onward: strategy = sentence chunking, max_chunk_size = 250, sentence_overlap = 1.
Before 8.16: strategy = word chunking, max_chunk_size = 250, overlap = 100.
Embedding Dimensionality
The models nomic‑embed‑text‑v1 and nomic‑embed‑text‑v1.5 both default to 768 dimensions. nomic‑embed‑text‑v1.5 is trained with Matryoshka Representation Learning, so its embeddings can be shortened to anywhere from 64 to 768 dimensions, allowing users to choose 256 or 512 to reduce storage and compute costs with minimal performance loss.
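A minimal sketch of how a Matryoshka-style embedding is shortened: keep the leading dimensions and re-normalize to unit length. The helpers `truncate_embedding` and `storage_bytes` are hypothetical, the input vector is a stand-in rather than real model output, and a model's own documentation may prescribe additional steps (for example a normalization before truncation), so treat this as an illustration of the principle only.

```python
import math


def truncate_embedding(vector, dims):
    """Keep the first `dims` values and rescale the result to unit length."""
    if not 1 <= dims <= len(vector):
        raise ValueError("dims must be between 1 and the vector length")
    head = list(vector[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale for an all-zero vector
    return [x / norm for x in head]


def storage_bytes(dims, bytes_per_value=4):
    """Raw size of one vector, assuming 4-byte float32 values."""
    return dims * bytes_per_value


# Stand-in for a 768-dimensional embedding.
full = [math.sin(i + 1) for i in range(768)]
short = truncate_embedding(full, 256)

print(len(short), storage_bytes(768), storage_bytes(256))
```

Going from 768 to 256 dimensions cuts the raw per-vector storage to a third, and similarity computations shrink in proportion.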
References:
Elasticsearch RRF documentation: https://www.elastic.co/docs/reference/elasticsearch/rest-apis/reciprocal-rank-fusion
Elastic chunking blog: https://www.elastic.co/search-labs/blog/elasticsearch-chunking-inference-api-endpoints
LSH overview: https://medium.com/@sarthakjoshi_9398/understanding-locality-sensitive-hashing-lsh-a-powerful-technique-for-similarity-search-a95b090bdc4a
Elasticsearch 8.16 release notes: https://discuss.elastic.co/t/what-s-new-in-elastic-8-16/370418
Mingyi World Elasticsearch
