Demystifying Elasticsearch: How Clusters Start, Join, and Process Reads/Writes
This article explains Elasticsearch’s core mechanisms, covering the cluster startup sequence, node discovery and election, handling of failed nodes, cluster management APIs, and the detailed read/write processes including coordinating nodes, shard allocation, memory buffering, transaction logs, and query‑then‑fetch execution.
Startup Process
When an Elasticsearch node starts, it pings the hosts listed in discovery.zen.ping.unicast.hosts (the Zen discovery seed list; renamed discovery.seed_hosts in 7.x) to obtain a list of DiscoveryNode objects describing the other nodes it can reach.
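As a sketch, a minimal elasticsearch.yml for this unicast discovery might look like the following; the host addresses are placeholders:

```yaml
# elasticsearch.yml — seed hosts the node pings at startup
discovery.zen.ping.unicast.hosts:
  - "10.0.0.1:9300"
  - "10.0.0.2:9300"
# require a quorum of master-eligible nodes before electing a master
discovery.zen.minimum_master_nodes: 2
```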
Master Election
If no active master is discovered from the ping responses, the node enters the master-election process, in which only master-eligible nodes participate.
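Conceptually (a toy sketch, not the actual ElectMasterService code), Zen-style election picks the smallest node ID among the visible master-eligible candidates, and only proceeds when a quorum of them is visible:

```python
# Toy sketch of Zen-style master election: among visible
# master-eligible nodes, pick the one with the smallest node ID,
# but only if a quorum (minimum_master_nodes) can be seen.
def elect_master(candidate_ids, minimum_master_nodes):
    if len(candidate_ids) < minimum_master_nodes:
        return None  # not enough candidates visible: no master elected
    return min(candidate_ids)

# Usage: three master-eligible nodes, quorum of 2
print(elect_master(["node-b", "node-a", "node-c"], 2))  # node-a
print(elect_master(["node-b"], 2))                      # None
```

The quorum requirement is what prevents a split-brain: two partitions cannot both reach minimum_master_nodes and elect rival masters.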
Startup Tasks
After a node wins the election and becomes the master, it starts its recurring background tasks, such as periodic checks that can trigger shard rebalancing.
Cluster Management
Once the cluster is up, users can manage and monitor it via the REST API.
New Node Join Process
When a new node starts, it pings the configured discovery hosts (unicast by default; multicast was supported only in early versions), locates the current master, and sends it a join request. The master adds the node to the cluster state and publishes the updated state to all members.
Failure Detection
The master periodically pings every node, and every node pings the master in return. If a node fails to respond within the configured timeout, it is removed from the cluster, triggering shard reallocation: replicas of any lost primary shards are promoted to primary, and if the master itself is lost, a new election begins.
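The promotion step can be illustrated with a toy model (hypothetical names, not Elasticsearch's allocation code): when a node is marked failed, any primary it held is replaced by promoting one of its surviving replicas.

```python
# Toy model of failover: each shard has copies on several nodes.
# When a node fails, drop its copies; if a shard lost its primary,
# promote one of the surviving replicas.
def promote_replicas(shard_table, failed_node):
    for shard, copies in shard_table.items():
        survivors = [c for c in copies if c["node"] != failed_node]
        if survivors and not any(c["primary"] for c in survivors):
            survivors[0]["primary"] = True  # promote a replica
        shard_table[shard] = survivors
    return shard_table

table = {
    0: [{"node": "n1", "primary": True}, {"node": "n2", "primary": False}],
    1: [{"node": "n2", "primary": True}, {"node": "n3", "primary": False}],
}
promote_replicas(table, "n1")
print(table[0])  # shard 0's replica on n2 is now the primary
```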
Cluster Management and Monitoring
Users can query cluster health ( GET /_cluster/health ), retrieve settings ( GET /_cluster/settings ), and view node and index statistics via the API.
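A sketch of interpreting the health response follows; the `sample` dict mirrors the shape of a real GET /_cluster/health response, which in practice you would fetch with an HTTP client against port 9200:

```python
# Interpret the "status" field of a GET /_cluster/health response.
def summarize_health(health):
    status = health["status"]          # "green", "yellow", or "red"
    if status == "green":
        return "all primary and replica shards allocated"
    if status == "yellow":
        return "all primaries allocated, some replicas unassigned"
    return "some primary shards are unassigned"

# Representative (not live) response body
sample = {
    "cluster_name": "demo",
    "status": "yellow",
    "number_of_nodes": 1,
    "unassigned_shards": 5,
}
print(summarize_health(sample))  # all primaries allocated, some replicas unassigned
```

A single-node cluster with replicas configured is typically yellow, since replicas cannot be allocated on the same node as their primary.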
Write Operation Principle
The coordinating node uses the routing value (which defaults to the document ID) to determine the target primary shard:
shard = hash(routing) % number_of_primary_shards
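This formula can be sketched in a few lines. Elasticsearch actually uses a murmur3 hash of the routing value; `zlib.crc32` stands in here only to make the sketch runnable:

```python
import zlib

# Sketch of the routing formula: hash the routing value (defaults
# to the document ID) modulo the number of primary shards.
def route(doc_id, num_primary_shards, routing=None):
    key = routing if routing is not None else doc_id
    return zlib.crc32(key.encode("utf-8")) % num_primary_shards

# The same ID always lands on the same shard — which is why the
# number of primary shards cannot change after index creation.
assert route("user-42", 5) == route("user-42", 5)
print(route("user-42", 5))  # a shard number in 0..4
```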
Each shard first writes the request to an in-memory buffer, which is periodically refreshed (every 1 second by default) into the filesystem cache, making the documents searchable. To guarantee durability, every operation is also appended to the transaction log (translog). A flush fsyncs the segments to disk and truncates the translog; it is triggered either on a timer (default 30 minutes) or when the translog grows beyond 512 MB.
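The interaction between the buffer, refresh, translog, and flush can be captured in a toy model (a deliberately simplified sketch, not real Lucene mechanics):

```python
# Toy model of the write path: every op goes to the in-memory
# buffer AND the translog; refresh() makes buffered docs searchable;
# flush() persists segments and truncates the translog.
class Shard:
    def __init__(self):
        self.buffer, self.segments, self.translog = [], [], []

    def index(self, doc):
        self.buffer.append(doc)     # not yet searchable
        self.translog.append(doc)   # durability record

    def refresh(self):              # default interval: 1 s
        self.segments.extend(self.buffer)
        self.buffer.clear()         # docs now searchable

    def flush(self):                # default: 30 min or 512 MB translog
        self.refresh()
        self.translog.clear()       # segments on disk; log truncated

s = Shard()
s.index({"id": 1})
print(len(s.segments))  # 0 — not searchable before refresh
s.refresh()
print(len(s.segments))  # 1 — searchable after refresh
s.flush()
print(len(s.translog))  # 0 — translog truncated on flush
```

The split explains why Elasticsearch is "near real-time": a document is durable immediately (translog) but searchable only after the next refresh.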
Read Operation Principle
Search follows a two-phase "Query-then-Fetch" process. In the query phase, the coordinating node broadcasts the request to one copy (primary or replica) of every relevant shard. Each shard executes the query locally, builds a priority queue of its top from + size documents, and returns document IDs and scores to the coordinating node.
In the fetch phase, the coordinating node determines which documents need to be retrieved, sends GET requests to the relevant shards, and merges the results before returning the final response to the client.
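The two phases can be sketched as follows (a simplified model: per-shard hit lists are merged by score, and only the final page is fetched):

```python
import heapq

# Query phase: each shard returns its local top (from + size) hits
# as (score, doc_id). Fetch phase: the coordinator keeps only the
# global [from, from + size) window and retrieves those documents.
def query_then_fetch(shard_hits, docs_by_id, frm, size):
    merged = heapq.merge(*[sorted(h, reverse=True) for h in shard_hits],
                         reverse=True)
    window = list(merged)[frm:frm + size]
    return [docs_by_id[doc_id] for _, doc_id in window]

docs = {"a": "doc-a", "b": "doc-b", "c": "doc-c", "d": "doc-d"}
shard0 = [(0.9, "a"), (0.4, "c")]
shard1 = [(0.7, "b"), (0.2, "d")]
print(query_then_fetch([shard0, shard1], docs, 0, 2))  # ['doc-a', 'doc-b']
```

This also shows why deep pagination is expensive: every shard must return from + size candidates, so the merge cost grows with the page offset.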
Note: The “DFS Query‑then‑Fetch” variant performs a preliminary phase to gather term and document frequencies for more accurate scoring, at the cost of performance.
Summary
The article covered Elasticsearch’s internal workings, from cluster startup and node discovery to read/write mechanics and monitoring APIs, highlighting its distributed architecture, fault tolerance, and real‑time search capabilities.
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.