Deep Dive into Elasticsearch semantic_text, dense_vector, and sparse_vector
This article explains how Elasticsearch supports vector search through three field types—semantic_text, dense_vector, and sparse_vector—detailing their definitions, ideal use cases, query syntax, advantages, limitations, and guidance for selecting the right type in real‑world search applications.
Embedding vectors
An embedding vector converts unstructured data (text, image, etc.) into a numeric vector that captures semantic information, enabling similarity calculations such as cosine similarity.
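As a minimal sketch of the similarity calculation mentioned above, the following Python snippet computes cosine similarity between toy vectors. The 4-dimensional values and the names `query`, `doc_a`, and `doc_b` are illustrative assumptions; real embedding models emit hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy "embeddings" (assumed values, for illustration only).
query = [0.4, -0.2, 0.3, 0.1]
doc_a = [0.5, -0.3, 0.2, 0.1]
doc_b = [-0.1, 0.4, -0.2, 0.3]

# A higher value means the two vectors point in a more similar direction.
print(cosine_similarity(query, doc_a))
print(cosine_similarity(query, doc_b))
```

In this toy data, `doc_a` points in nearly the same direction as `query` and therefore scores higher than `doc_b`.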
Elasticsearch vector field types
dense_vector
The dense_vector field stores dense vectors generated by deep‑learning models (e.g., OpenAI text-embedding-ada-002 – 1536 dimensions, Hugging Face all-MiniLM-L6-v2 – 384 dimensions). It is suitable when:
Embeddings are produced externally (e.g., Hugging Face, Cohere) and need to be persisted.
Custom similarity functions such as cosineSimilarity, dotProduct, or l2norm are required.
Complex search scenarios (RAG, recommendation, personalized search) demand high flexibility.
Querying dense_vector
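POST test/_search
{
  "knn": {
    "field": "my_dense_vector",
    "k": 10,
    "num_candidates": 50,
    "query_vector": [0.1, 0.2, -0.3]
  }
}
For custom scoring, combine script_score with a similarity function such as cosineSimilarity. The complete example below creates an index, loads three documents, and runs a kNN search: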
DELETE products
PUT /products
{
  "mappings": {
    "properties": {
      "description": {"type": "text", "analyzer": "standard"},
      "description_vector": {
        "type": "dense_vector",
        "dims": 4,
        "index": true,
        "similarity": "cosine"
      }
    }
  }
}
POST /products/_bulk
{ "index": {"_id": "1"} }
{ "description": "lightweight summer backpack", "description_vector": [0.5, -0.3, 0.2, 0.1] }
{ "index": {"_id": "2"} }
{ "description": "durable travel backpack", "description_vector": [-0.1, 0.4, -0.2, 0.3] }
{ "index": {"_id": "3"} }
{ "description": "compact everyday backpack", "description_vector": [0.3, 0.1, -0.4, 0.2] }
POST /products/_search
{
  "knn": {
    "field": "description_vector",
    "query_vector": [0.4, -0.2, 0.3, 0.1],
    "k": 3,
    "num_candidates": 10
  },
  "_source": ["description"]
}
Pros
High flexibility, supports any external model.
Customizable ranking via script scoring.
Cons
Requires manual generation and storage of vectors.
Higher configuration and maintenance effort.
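The custom-scoring option described for dense_vector can be sketched as a script_score request body. The Python helper below only builds the JSON body; the function name `build_script_score_body` is hypothetical, and the field name `description_vector` is assumed to be a dense_vector field in the target index. `cosineSimilarity` is the Painless vector function named in the list above, and 1.0 is added because Elasticsearch does not allow negative scores.

```python
import json


def build_script_score_body(query_vector, field="description_vector", size=3):
    """Build a _search body that scores every matching document by cosine similarity."""
    return {
        "size": size,
        "query": {
            "script_score": {
                # match_all means every document is scored (an exact, brute-force scan).
                "query": {"match_all": {}},
                "script": {
                    # Shift the result by 1.0 so the final score is never negative.
                    "source": f"cosineSimilarity(params.query_vector, '{field}') + 1.0",
                    "params": {"query_vector": list(query_vector)},
                },
            }
        },
        "_source": ["description"],
    }


body = build_script_score_body([0.4, -0.2, 0.3, 0.1])
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a `_search` request. Replacing `match_all` with a filter narrows the set of documents the script has to score, which matters because this approach does not use the approximate kNN index.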
sparse_vector
The sparse_vector field stores vectors where most dimensions are zero and only a few tokens have non‑zero weights. It is typically produced by token‑level models such as ELSER or SPLADE and is useful for:
Word‑level precise matching.
Hybrid search strategies that combine semantic and token matching.
Scenarios demanding storage efficiency and transparent token contributions.
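To illustrate why token contributions are transparent, the sketch below models sparse-vector matching as a dot product over the tokens that the query and a document share. This is a conceptual approximation, not Elasticsearch's exact scoring; the tokens, weights, and the helper name `sparse_dot` are assumptions made for illustration.

```python
def sparse_dot(query_weights, doc_weights):
    """Sum the weight products of tokens that appear in both sparse vectors."""
    return sum(
        weight * doc_weights[token]
        for token, weight in query_weights.items()
        if token in doc_weights
    )


# Hypothetical token weights, as a sparse model might produce them.
docs = {
    "1": {"lightweight": 0.5, "summer": 0.3, "backpack": 0.2},
    "2": {"durable": 0.4, "travel": 0.3, "backpack": 0.2},
}
query = {"lightweight": 0.5, "backpack": 0.3}

# Rank documents by their overlap score, highest first.
ranked = sorted(docs, key=lambda doc_id: sparse_dot(query, docs[doc_id]), reverse=True)

for doc_id in ranked:
    # Per-token contributions show exactly which tokens drove the score.
    contributions = {t: query[t] * docs[doc_id][t] for t in query if t in docs[doc_id]}
    print(doc_id, round(sparse_dot(query, docs[doc_id]), 4), contributions)
```

Only shared tokens contribute, so a document that matches both query tokens outranks one that matches a single token.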
Index mapping and bulk insert example
PUT products_0703
{
  "mappings": {
    "properties": {
      "description": {"type": "text", "analyzer": "standard"},
      "description_vector": {"type": "sparse_vector"}
    }
  }
}
POST products_0703/_bulk
{ "index": {"_id": "1"} }
{ "description": "lightweight summer backpack", "description_vector": {"lightweight": 0.5, "summer": 0.3, "backpack": 0.2} }
{ "index": {"_id": "2"} }
{ "description": "durable travel backpack", "description_vector": {"durable": 0.4, "travel": 0.3, "backpack": 0.2} }
{ "index": {"_id": "3"} }
{ "description": "compact everyday backpack", "description_vector": {"compact": 0.4, "everyday": 0.3, "backpack": 0.2} }
Querying sparse_vector
POST products_0703/_search
{
  "query": {
    "sparse_vector": {
      "field": "description_vector",
      "query_vector": {"lightweight": 0.5, "backpack": 0.3}
    }
  },
  "_source": ["description"]
}
If an inference endpoint backed by a sparse model such as ELSER is available, the sparse_vector query can take an inference_id and the raw query text instead, and Elasticsearch generates the query vector at search time:
{
  "query": {
    "sparse_vector": {
      "field": "field_sparse",
      "inference_id": "elser_inference",
      "query": "search text"
    }
  }
}
Pros
High storage efficiency (only non‑zero values stored).
Precise token‑level matching; easy to debug because token contributions are explicit.
Well suited for hybrid semantic‑token search.
Cons
Requires a model that supports sparse vectors; generation logic is more complex.
Semantic coverage may be limited compared with dense embeddings.
semantic_text
The semantic_text field is a newer, paid‑license feature that automatically generates and stores embedding vectors via an Elasticsearch inference endpoint, removing the need for manual vector handling.
Typical use cases:
Rapid prototyping for newcomers to semantic search.
Projects that prefer automatic embedding generation without external pipelines.
Long‑text handling with built‑in chunking.
Reduced engineering effort thanks to pre‑configured mappings and inference endpoints.
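As a hedged sketch of the configuration involved, the Python helpers below build the mapping and query bodies for a semantic_text field. The helper names and the inference endpoint ID `my-inference-endpoint` are hypothetical; the endpoint is assumed to have been created beforehand, and the `inference_id` mapping parameter should be checked against the Elasticsearch version in use.

```python
import json


def build_semantic_text_mapping(field_name, inference_id):
    """Build an index-creation body with a single semantic_text field."""
    return {
        "mappings": {
            "properties": {
                field_name: {
                    "type": "semantic_text",
                    # ID of an existing inference endpoint (assumed to exist).
                    "inference_id": inference_id,
                }
            }
        }
    }


def build_semantic_query(field_name, text):
    """Build a _search body that runs a semantic query against the field."""
    return {"query": {"semantic": {"field": field_name, "query": text}}}


mapping = build_semantic_text_mapping("semantic_text_field", "my-inference-endpoint")
search = build_semantic_query("semantic_text_field", "lightweight backpack for summer")

print(json.dumps(mapping, indent=2))
print(json.dumps(search, indent=2))
```

Documents are then indexed with plain text in `semantic_text_field`; no vector appears in the request, because the embedding step runs inside Elasticsearch.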
Querying semantic_text
{
  "query": {
    "semantic": {
      "field": "semantic_text_field",
      "query": "search text"
    }
  }
}
Pros
Easy to use; embeddings are generated automatically.
Supports text chunking for large documents.
Minimal configuration required.
Cons
Does not accept externally generated embeddings.
Flexibility is limited to the models that can be served through an Elasticsearch inference endpoint.
Choosing the appropriate field type
Selection depends on the required control over embedding generation, the complexity of the ranking logic, and storage considerations:
dense_vector – highest flexibility; external embeddings and custom similarity functions; higher operational cost.
sparse_vector – efficient storage; precise token‑level matching; requires a sparse‑vector model.
semantic_text – fastest start; automatic embedding generation; limited to models served through an inference endpoint and requires a paid license.
Embedding generation per field type
dense_vector: external dense embeddings (e.g., OpenAI, Hugging Face) must be generated before indexing.
sparse_vector: typically generated by token‑based models (ELSER, SPLADE) or via an inference endpoint.
semantic_text: fully handled by Elasticsearch’s inference endpoint at index and query time.
Practical e‑commerce search example
Prototype quickly: use semantic_text with a chosen inference model (e.g., all-MiniLM-L6-v2) to obtain semantic search with minimal configuration.
Improve ranking: add a dense_vector field, generate high‑quality embeddings externally, and query with kNN to refine result ordering.
Hybrid precision: incorporate a sparse_vector field using an ELSER model to boost token‑level relevance.
Conclusion
Elasticsearch provides three vector field types that address different needs:
semantic_text – ideal for rapid development and automatic embedding handling.
dense_vector – offers the most flexibility for complex, custom similarity scenarios.
sparse_vector – excels at precise token‑level matching and efficient storage, making it suitable for hybrid search strategies.
Mingyi World Elasticsearch
