Deep Dive into Elasticsearch semantic_text, dense_vector, and sparse_vector

This article explains how Elasticsearch supports vector search through three field types—semantic_text, dense_vector, and sparse_vector—detailing their definitions, ideal use cases, query syntax, advantages, limitations, and guidance for selecting the right type in real‑world search applications.

Mingyi World Elasticsearch

Embedding vectors

An embedding is a numeric vector representation of unstructured data (text, images, etc.) that captures semantic information, so that similarity between items can be computed with measures such as cosine similarity.
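
As a minimal sketch of the similarity calculation mentioned above, the following Python function computes cosine similarity between two vectors. The function name and the sample vectors are illustrative only and are not part of any Elasticsearch API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score close to 1.0 regardless of length.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors score 0.0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Because the result depends only on direction, two embeddings of different magnitude can still be judged highly similar.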

Elasticsearch vector field types

dense_vector

The dense_vector field stores dense vectors generated by deep‑learning models (e.g., OpenAI text-embedding-ada-002 – 1536 dimensions, Hugging Face all-MiniLM-L6-v2 – 384 dimensions). It is suitable when:

Embeddings are produced externally (e.g., Hugging Face, Cohere) and need to be persisted.

Custom similarity functions such as cosineSimilarity, dotProduct, or l2norm are required.

Complex search scenarios (RAG, recommendation, personalized search) demand high flexibility.

Querying dense_vector

POST test/_search
{
  "knn": {
    "field": "my_dense_vector",
    "k": 10,
    "num_candidates": 50,
    "query_vector": [0.1, 0.2, -0.3]
  }
}

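The following end-to-end example creates an index with a four-dimensional dense_vector field that uses cosine similarity, bulk-indexes three documents, and runs an approximate kNN search against it: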

DELETE products
PUT /products
{
  "mappings": {
    "properties": {
      "description": {"type": "text", "analyzer": "standard"},
      "description_vector": {
        "type": "dense_vector",
        "dims": 4,
        "index": true,
        "similarity": "cosine"
      }
    }
  }
}

POST /products/_bulk
{ "index": {"_id": "1"} }
{ "description": "Lightweight summer backpack", "description_vector": [0.5, -0.3, 0.2, 0.1] }
{ "index": {"_id": "2"} }
{ "description": "Durable travel backpack", "description_vector": [-0.1, 0.4, -0.2, 0.3] }
{ "index": {"_id": "3"} }
{ "description": "Compact everyday backpack", "description_vector": [0.3, 0.1, -0.4, 0.2] }

POST /products/_search
{
  "knn": {
    "field": "description_vector",
    "query_vector": [0.4, -0.2, 0.3, 0.1],
    "k": 3,
    "num_candidates": 10
  },
  "_source": ["description"]
}
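
The kNN search above is approximate. Where exact, script-based scoring is wanted (the custom scoring use case listed for dense_vector), the same query vector can be passed to a script_score query that calls cosineSimilarity. The sketch below builds such a request body as a Python dict and reproduces the cosine ranking locally for the three sample vectors. The variable names are illustrative, and the `+ 1.0` offset is an assumption made here to keep script scores non-negative.

```python
import json
import math

query_vector = [0.4, -0.2, 0.3, 0.1]

# Body for POST /products/_search; scores every matching document exactly.
script_score_body = {
    "query": {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "cosineSimilarity(params.query_vector, 'description_vector') + 1.0",
                "params": {"query_vector": query_vector},
            },
        }
    },
    "_source": ["description"],
}

# Local check of the expected ordering, using the vectors from the bulk request.
docs = {
    "1": [0.5, -0.3, 0.2, 0.1],
    "2": [-0.1, 0.4, -0.2, 0.3],
    "3": [0.3, 0.1, -0.4, 0.2],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

scores = {doc_id: cosine(query_vector, vec) + 1.0 for doc_id, vec in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

print(json.dumps(script_score_body, indent=2))
print(ranking)  # ['1', '3', '2']
```

With these sample vectors, document 1 points in nearly the same direction as the query, document 3 is roughly orthogonal to it, and document 2 points away from it.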

Pros

High flexibility, supports any external model.

Customizable ranking via script scoring.

Cons

Requires manual generation and storage of vectors.

Higher configuration and maintenance effort.

sparse_vector

The sparse_vector field stores vectors where most dimensions are zero and only a few tokens have non‑zero weights. It is typically produced by token‑level models such as ELSER or SPLADE and is useful for:

Word‑level precise matching.

Hybrid search strategies that combine semantic and token matching.

Scenarios demanding storage efficiency and transparent token contributions.

Index mapping and bulk insert example

PUT products_0703
{
  "mappings": {
    "properties": {
      "description": {"type": "text", "analyzer": "standard"},
      "description_vector": {"type": "sparse_vector"}
    }
  }
}

POST products_0703/_bulk
{ "index": {"_id": "1"} }
{ "description": "Lightweight summer backpack", "description_vector": {"lightweight": 0.5, "summer": 0.3, "backpack": 0.2} }
{ "index": {"_id": "2"} }
{ "description": "Durable travel backpack", "description_vector": {"durable": 0.4, "travel": 0.3, "backpack": 0.2} }
{ "index": {"_id": "3"} }
{ "description": "Compact everyday backpack", "description_vector": {"compact": 0.4, "everyday": 0.3, "backpack": 0.2} }

Querying sparse_vector

POST products_0703/_search
{
  "query": {
    "sparse_vector": {
      "field": "description_vector",
      "query_vector": {"lightweight": 0.5, "backpack": 0.3}
    }
  },
  "_source": ["description"]
}
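
Conceptually, a sparse vector query rewards documents that share weighted tokens with the query. The Python sketch below models this as a plain dot product over shared tokens and exposes each token's contribution. The tokens and weights are illustrative, patterned on the sample documents, and the `_score` values Elasticsearch actually returns may differ from this simplified arithmetic.

```python
def token_contributions(query_weights, doc_weights):
    """Per-token products for tokens present in both the query and the document."""
    return {
        token: weight * doc_weights[token]
        for token, weight in query_weights.items()
        if token in doc_weights
    }

docs = {
    "1": {"lightweight": 0.5, "summer": 0.3, "backpack": 0.2},
    "2": {"durable": 0.4, "travel": 0.3, "backpack": 0.2},
    "3": {"compact": 0.4, "everyday": 0.3, "backpack": 0.2},
}
query = {"lightweight": 0.5, "backpack": 0.3}

# The document score is the sum of its token contributions.
contributions = {doc_id: token_contributions(query, vec) for doc_id, vec in docs.items()}
scores = {doc_id: sum(parts.values()) for doc_id, parts in contributions.items()}
best = max(scores, key=scores.get)

for doc_id, parts in contributions.items():
    print(doc_id, parts, round(scores[doc_id], 4))
```

Document 1 is the only one matching both query tokens, so it ranks first; documents 2 and 3 match only the shared "backpack" token and tie. This per-token breakdown is what makes sparse results comparatively easy to debug.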

If a trained model is available, the inference endpoint can generate sparse vectors automatically:

{
  "query": {
    "sparse_vector": {
      "field": "field_sparse",
      "inference_id": "elser_inference",
      "query": "search text"
    }
  }
}

Pros

High storage efficiency (only non‑zero values stored).

Precise token‑level matching; easy to debug because token contributions are explicit.

Well suited for hybrid semantic‑token search.

Cons

Requires a model that supports sparse vectors; generation logic is more complex.

Semantic coverage may be limited compared with dense embeddings.

semantic_text

The semantic_text field is a newer, paid‑license feature that automatically generates and stores embedding vectors via an Elasticsearch inference endpoint, removing the need for manual vector handling.

Typical use cases:

Rapid prototyping for newcomers to semantic search.

Projects that prefer automatic embedding generation without external pipelines.

Long‑text handling with built‑in chunking.

Reduced engineering effort thanks to pre‑configured mappings and inference endpoints.

Querying semantic_text

{
  "query": {
    "semantic": {
      "field": "semantic_text_field",
      "query": "search text"
    }
  }
}
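
The query above presumes an index whose mapping declares a semantic_text field bound to an inference endpoint. A minimal sketch of such a mapping, written as a Python dict that could serve as the body of a create-index request, might look like the following. The index name `my-semantic-index` and the endpoint id `elser_inference` are placeholders chosen for illustration, and the endpoint is assumed to already exist in the cluster.

```python
import json

index_name = "my-semantic-index"  # hypothetical index name

mapping_body = {
    "mappings": {
        "properties": {
            # Field name matches the one used in the semantic query.
            "semantic_text_field": {
                "type": "semantic_text",
                "inference_id": "elser_inference",  # assumed, pre-existing endpoint id
            }
        }
    }
}

# Print the request in console form for inspection.
print(f"PUT {index_name}")
print(json.dumps(mapping_body, indent=2))
```

Documents indexed into this field are sent as plain text; the embeddings are produced by the configured endpoint rather than supplied by the client.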

Pros

Easy to use; embeddings are generated automatically.

Supports text chunking for large documents.

Minimal configuration required.

Cons

Does not accept externally generated embeddings.

Flexibility is limited to the models that can be reached through an Elasticsearch inference endpoint.

Choosing the appropriate field type

Selection depends on the required control over embedding generation, the complexity of the ranking logic, and storage considerations:

dense_vector – highest flexibility; external embeddings and custom similarity functions; higher operational cost.

sparse_vector – efficient storage; precise token‑level matching; requires a sparse‑vector model.

semantic_text – fastest start; automatic embedding generation; limited to models exposed through an inference endpoint and requires a paid license.

Embedding generation per field type

dense_vector: external dense embeddings (e.g., OpenAI, Hugging Face) must be generated before indexing.

sparse_vector: typically generated by token‑based models (ELSER, SPLADE) or via an inference endpoint.

semantic_text: fully handled by Elasticsearch’s inference endpoint at index and query time.
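
Because dense_vector embeddings must exist before indexing, a typical pipeline computes them on the client and attaches them to each document in the bulk request. The sketch below shows the shape of that step. `fake_embed` is a hypothetical stand-in for a real embedding model call (it produces deterministic but meaningless vectors), and the field names are assumed to follow the products example.

```python
import hashlib
import json

DIMS = 4  # must match "dims" in the dense_vector mapping

def fake_embed(text, dims=DIMS):
    """Placeholder for an external embedding model; NOT a real model."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Map the first `dims` bytes to floats in [-1.0, 1.0].
    return [round(b / 127.5 - 1.0, 4) for b in digest[:dims]]

def build_bulk_payload(docs):
    """Return an NDJSON string suitable as the body of POST /products/_bulk."""
    lines = []
    for doc_id, description in docs.items():
        lines.append(json.dumps({"index": {"_id": doc_id}}))
        lines.append(json.dumps({
            "description": description,
            "description_vector": fake_embed(description),
        }, ensure_ascii=False))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"

payload = build_bulk_payload({
    "1": "lightweight summer backpack",
    "2": "durable travel backpack",
})
print(payload)
```

Swapping `fake_embed` for a real model call is the only change needed, provided the model's output dimension matches the mapping.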

Practical e‑commerce search example

Prototype quickly: use semantic_text with a chosen inference model (e.g., all-MiniLM-L6-v2) to obtain semantic search with minimal configuration.

Improve ranking: add a dense_vector field, generate high‑quality embeddings externally, and query with kNN to refine result ordering.

Hybrid precision: incorporate a sparse_vector field using an ELSER model to boost token‑level relevance.
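
When results from several of these fields are combined, the separate ranked lists have to be merged. The article does not prescribe a merging method; one widely used option is reciprocal rank fusion (RRF), sketched below in plain Python. The document ids, the three input rankings, and the constant `k=60` are illustrative assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked id lists; each list adds 1 / (k + rank) to a document's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

semantic_hits = ["1", "3", "2"]  # e.g. from a semantic_text query
knn_hits = ["1", "2", "3"]       # e.g. from a dense_vector kNN search
sparse_hits = ["3", "1", "2"]    # e.g. from a sparse_vector query

fused = reciprocal_rank_fusion([semantic_hits, knn_hits, sparse_hits])
print(fused)  # ['1', '3', '2']
```

RRF uses only rank positions, so it sidesteps the problem that scores from dense, sparse, and semantic queries are not on a comparable scale.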

Conclusion

Elasticsearch provides three vector field types that address different needs:

semantic_text – ideal for rapid development and automatic embedding handling.

dense_vector – offers the most flexibility for complex, custom similarity scenarios.

sparse_vector – excels at precise token‑level matching and efficient storage, making it suitable for hybrid search strategies.

