Elasticsearch: BM25, TF‑IDF, Dense Vectors, kNN, L2 & Cosine Distances, RRF

This article is a technical guide to Elasticsearch's core retrieval models, BM25 and TF‑IDF, and to modern vector search with dense_vector fields, kNN, and L2 and cosine distances. It then shows how to combine keyword and semantic results through hybrid search and Reciprocal Rank Fusion (RRF), with practical configuration examples.


BM25 and TF‑IDF Basics

TF‑IDF (Term Frequency‑Inverse Document Frequency) scores a term by multiplying how often it appears in a document (TF) by the log‑scaled rarity of the term across the corpus (IDF):

TF = term frequency in the document
IDF = log(total_documents / (documents_containing_term + 1))
TF‑IDF = TF * IDF
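
As a worked example (using the natural log, with values chosen for illustration): in a corpus of 1,000 documents, a term that appears 3 times in a document and occurs in 9 documents overall scores:

TF = 3
IDF = log(1000 / (9 + 1)) = log(100) ≈ 4.61
TF‑IDF = 3 × 4.61 ≈ 13.8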

TF‑IDF is simple and works for basic keyword search, but it cannot capture synonyms, is vulnerable to keyword stuffing, and does not consider document length.

BM25 has been the default similarity in Elasticsearch since 5.x, replacing TF‑IDF. It adds term‑frequency saturation and document‑length normalization, controlled by two tunable parameters:

k1 (default 1.2) – controls how quickly term‑frequency contributions saturate.

b (default 0.75) – controls the strength of length normalization.

The BM25 scoring formula is:

score(d, q) = Σ [ IDF(t) × (TF(t,d) × (k1 + 1)) / (TF(t,d) + k1 × (1 - b + b × |d| / avg_dl)) ]
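
To see the saturation effect, assume k1 = 1.2, b = 0.75, and a document of exactly average length, so the length term (1 - b + b × |d| / avg_dl) reduces to 1:

TF = 3:  (3 × 2.2) / (3 + 1.2)   = 6.6 / 4.2  ≈ 1.57 × IDF(t)
TF = 30: (30 × 2.2) / (30 + 1.2) = 66 / 31.2  ≈ 2.12 × IDF(t)

A tenfold increase in term frequency lifts the contribution by only about 35%, which is what blunts keyword stuffing.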

Example query (default BM25):

GET /products/_search
{
  "query": {
    "match": {
      "description": "lightweight breathable dress"
    }
  }
}
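
If the defaults ever need adjusting, k1 and b are configured as a custom similarity in the index settings and referenced from the field mapping. A minimal sketch (the similarity name and parameter values here are illustrative, not recommendations):

PUT /products
{
  "settings": {
    "index": {
      "similarity": {
        "tuned_bm25": { "type": "BM25", "k1": 1.4, "b": 0.6 }
      }
    }
  },
  "mappings": {
    "properties": {
      "description": { "type": "text", "similarity": "tuned_bm25" }
    }
  }
}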

Dense Vector and Semantic Search

Modern language models (e.g., BERT, Sentence‑BERT) encode a sentence into a fixed‑length float array called an embedding. Elasticsearch stores embeddings in a dense_vector field.

Mapping example (384‑dimensional title vectors, cosine similarity):

PUT /my_vector_index
{
  "mappings": {
    "properties": {
      "title_vector": {
        "type": "dense_vector",
        "dims": 384,
        "similarity": "cosine"
      },
      "content_vector": {
        "type": "dense_vector",
        "dims": 768,
        "similarity": "cosine"
      }
    }
  }
}
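
Indexing a document then supplies the full float array; the vector length must match dims exactly, or Elasticsearch rejects the document. A minimal sketch (document ID and values illustrative, elided for readability):

PUT /my_vector_index/_doc/1
{
  "title_vector": [0.12, -0.45, ..., 0.07]
}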

Typical dense vectors (illustrative):

"I like apples" → [0.8, 0.3, -0.1, 0.9]

"I love fruit" → [0.7, 0.4, 0.0, 0.85]

"The weather is nice" → [-0.2, 0.6, 0.9, 0.1]

k‑Nearest Neighbors (kNN) retrieves the k most similar vectors. A typical kNN query:

GET /my_vector_index/_search
{
  "knn": {
    "field": "title_vector",
    "query_vector": [0.1, 0.2, ..., 0.384],
    "k": 10,
    "num_candidates": 100
  }
}

Parameters:

field – the dense_vector field to search.

query_vector – the embedding of the user query.

k – the number of nearest neighbors to return.

num_candidates – how many candidate vectors to examine per shard before final ranking (higher improves recall but increases latency).

Distance Metrics

L2 (Euclidean) distance: L2(v1, v2) = sqrt( Σ (v1_i - v2_i)^2 ). Sensitive to vector magnitude; useful when magnitude carries meaning, e.g. raw image features.

Cosine distance: cosine_similarity(v1, v2) = (v1·v2) / (||v1|| × ||v2||) and cosine_distance = 1 - cosine_similarity. Because it normalizes for magnitude, it is the usual default for text embeddings.

Dot product (inner product): dot(v1, v2) = Σ v1_i × v2_i. When vectors are already normalized to unit length, the dot product equals cosine similarity and is computationally cheaper.
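
A quick check with two unit-length vectors shows the equivalence: for v1 = [0.6, 0.8] and v2 = [0.8, 0.6], both of norm 1:

dot(v1, v2) = 0.6 × 0.8 + 0.8 × 0.6 = 0.96
cosine_similarity(v1, v2) = 0.96 / (1 × 1) = 0.96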

Metric selection guideline:

If vectors are normalized → use dot_product (fastest) or cosine.

If vectors are not normalized → use l2_norm.

When unsure → cosine is a safe default for text.
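
The choice is expressed in the mapping through the similarity parameter. For example, a field for non‑normalized image‑feature vectors (the index name, field name, and dims here are illustrative):

PUT /image_index
{
  "mappings": {
    "properties": {
      "image_vector": {
        "type": "dense_vector",
        "dims": 512,
        "similarity": "l2_norm"
      }
    }
  }
}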

Hybrid Search Design

A hybrid index combines keyword search (BM25) and semantic search (kNN). Example index definition that supports both:

PUT my-hybrid-index
{
  "settings": {
    "index": {
      "number_of_shards": 1
    }
  },
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "ik_max_word" },
      "content": { "type": "text", "analyzer": "ik_max_word" },
      "title_vector": {
        "type": "dense_vector",
        "dims": 384,
        "index": true,
        "similarity": "cosine",
        "index_options": { "type": "hnsw", "m": 16, "ef_construction": 128 }
      },
      "content_vector": {
        "type": "dense_vector",
        "dims": 768,
        "index": true,
        "similarity": "cosine",
        "index_options": { "type": "hnsw", "m": 16, "ef_construction": 128 }
      }
    }
  }
}

Three query modes:

Pure BM25 – exact keyword matching.

GET my-hybrid-index/_search
{
  "query": { "match": { "title": "deep learning" } }
}

Pure kNN – semantic similarity only.

GET my-hybrid-index/_search
{
  "knn": {
    "field": "title_vector",
    "query_vector": [0.1, -0.2, ..., 0.384],
    "k": 10,
    "num_candidates": 100
  }
}

Hybrid (RRF fusion) – combines BM25 and kNN rankings without manual weight tuning.

GET my-hybrid-index/_search
{
  "size": 20,
  "query": { "match": { "title": "deep learning" } },
  "knn": {
    "field": "title_vector",
    "query_vector": [0.1, -0.2, ..., 0.384],
    "k": 20,
    "num_candidates": 100
  },
  "rank": {
    "rrf": { "window_size": 100, "rank_constant": 20 }
  }
}

Reciprocal Rank Fusion (RRF) scores a document d by summing the reciprocal of its rank in each result list:

RRF_score(d) = Σ [ 1 / (k + rank_i(d)) ]

The smoothing constant k (the rank_constant parameter in the query above; a common value is 60) is unrelated to the kNN parameter k. Example calculation with k = 60:

Document A: rank 2 in BM25, rank 3 in kNN → score ≈ 1/(60+2) + 1/(60+3) ≈ 0.032.

Document B: rank 10 in BM25, rank 1 in kNN → score ≈ 1/(60+10) + 1/(60+1) ≈ 0.0307.

Because Document A ranks well in both lists, it outranks Document B after fusion.

Performance Optimisation

Settings that affect kNN speed and memory (in Elasticsearch, the HNSW parameters live in each dense_vector field's index_options):

index_options.ef_construction – higher values improve HNSW graph quality at the cost of indexing time and memory (128 is a balanced default).

index_options.m – the number of connections per HNSW node (default 16).

number_of_shards and number_of_replicas – adjust for data volume and fault tolerance.

Query‑time tuning:

num_candidates – larger values increase recall but also latency.

k – set according to the required result size.

Apply filters (e.g., term or range) to reduce the search space before kNN execution.
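
In Elasticsearch 8.x the filter goes inside the knn clause, where it is applied during the approximate search rather than after it. A sketch against the hybrid index above (the category field and its value are hypothetical):

GET my-hybrid-index/_search
{
  "knn": {
    "field": "title_vector",
    "query_vector": [0.1, -0.2, ..., 0.384],
    "k": 10,
    "num_candidates": 100,
    "filter": { "term": { "category": "machine-learning" } }
  }
}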

Frequently Asked Questions

Can a single field support both BM25 and kNN? No. BM25 works on text/keyword fields, while kNN requires a dense_vector field.

Is the similarity parameter mutable? No. It is fixed at mapping creation; changing it requires re‑creating the index.

Do BM25 parameters need tuning? Defaults (k1=1.2, b=0.75) work for most cases; adjust only for specific relevance issues.

How to choose a distance function?

Normalized text embeddings → dot_product (cheapest) or cosine.

Non‑normalized vectors (e.g., raw image features) → l2_norm.

Uncertain → cosine is a safe default for textual semantics.

Key Takeaways

Use BM25 for exact keyword relevance; use kNN with dense vectors for semantic similarity.

Define the distance metric at mapping time – it cannot be changed later.

Hybrid search with RRF provides a simple, weight‑free way to combine both signals while preserving relevance.

Monitor latency and recall; tune ef_construction, num_candidates, and k as data volume grows.
