Elasticsearch Distributed Search Mechanisms: query_then_fetch and dfs_query_then_fetch
Elasticsearch provides two search types—query_then_fetch (default) and dfs_query_then_fetch—each involving a multi-step process where the client node distributes queries to relevant shards, shards execute searches using local or global term frequencies, aggregate results, and retrieve full documents, with noted trade‑offs.
Elasticsearch Distributed Search Mechanisms
Elasticsearch has two search types, referred to as search_type:
• query_then_fetch (default) •
dfs_query_then_fetchquery_then_fetch
The execution steps are:
User initiates a search request, which reaches a node in the cluster.
The query is sent to all relevant shard partitions.
Each shard independently executes the query, performs sorting and pagination, and scores using the shard's own Local term/document frequencies.
Each shard returns only metadata (e.g., _id and _score) to the coordinating node.
The coordinating node merges the shard results, re‑sorts, paginates, and selects the final result documents (metadata only).
Based on the metadata, the coordinating node fetches the full document data from the appropriate shards on disk.
The detailed document data is assembled and returned to the user.
Drawback: Because each shard scores using its own term frequencies rather than global frequencies, uneven data distribution can cause scoring bias and affect relevance.
dfs_query_then_fetch
The dfs_query_then_fetch mechanism is similar to query_then_fetch but differs in two key ways:
User initiates a search request to a node in the cluster.
A preliminary phase gathers global term/document frequencies from all shards.
The query is then sent to all relevant shards.
Each shard executes the query independently, using Global term/document frequencies for scoring.
Shards return only metadata (e.g., _id and _score) to the coordinating node.
The coordinating node merges, sorts, paginates, and selects the final result documents (metadata only).
Based on the metadata, the coordinating node fetches the full document data from the appropriate shards.
The detailed documents are assembled and returned to the user.
Drawback: This method consumes more resources and is generally not recommended unless global scoring is essential.
Tips
Although Elasticsearch offers two search types, the default query_then_fetch is usually sufficient.
For modest data volumes (e.g., 20‑50 GB), configuring a single primary shard per index can simplify searching.
When the document source is not needed, setting _source: false avoids unnecessary network and disk I/O.
Deep pagination (e.g., using from + size to retrieve the last 10 records of the first 10 k) can be resource‑intensive because each shard processes up to the requested size before the coordinating node merges and sorts the results.
System Architect Go
Programming, architecture, application development, message queues, middleware, databases, containerization, big data, image processing, machine learning, AI, personal growth.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
