Data Consistency Strategies for Big Data Applications: Simple Replication, HDFS Pipeline, and Elasticsearch

The article explains three approaches to ensuring data consistency in big‑data systems—basic multi‑node replication, HDFS pipeline replication, and Elasticsearch primary‑replica replication—detailing their workflows, advantages, and drawbacks.

Big Data Technology & Architecture

Big‑data applications need both high availability and consistent data. Availability is usually achieved by keeping redundant replicas, and keeping those replicas consistent with one another then becomes the key challenge. This article summarizes three replica‑consistency strategies: simple multi‑node replication, and the architectures used by HDFS and Elasticsearch.

1. Simple multi‑node replication

The client sends the write request to several nodes in parallel. Each node writes the data and replies, and the write is considered successful once a predefined number of nodes have acknowledged it.

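Advantages: Because the nodes write in parallel, write latency is bounded by the slowest node whose acknowledgment is required, not by the sum of the individual write times.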

Disadvantages: Network I/O on the client is high, because the client itself must send a full copy of the data to every node.
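
The workflow above can be sketched in a few lines of Python. This is a minimal illustration, not a production client: the `Node` class, the `replicated_write` function, and the `required_acks` parameter are hypothetical names introduced here, and the nodes are in-memory stand-ins for real storage servers.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


class Node:
    """Hypothetical in-memory storage node, used only for illustration."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.store = {}

    def write(self, key, value):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unreachable")
        self.store[key] = value
        return self.name


def replicated_write(nodes, key, value, required_acks):
    """Send the write to every node in parallel and count acknowledgments."""
    acked = []
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        # The client sends one copy of the data per node.
        futures = [pool.submit(node.write, key, value) for node in nodes]
        for future in as_completed(futures):
            try:
                acked.append(future.result())
            except ConnectionError:
                continue  # a failed node does not count toward the total
            if len(acked) >= required_acks:
                break  # enough acks; leaving the pool still waits for stragglers
    return len(acked) >= required_acks, sorted(acked)


nodes = [Node("node-a"), Node("node-b", healthy=False), Node("node-c")]
ok, acked = replicated_write(nodes, "k", "v", required_acks=2)
print(ok, acked)  # True ['node-a', 'node-c']
```

Note that the client submits the same payload once per node, which is exactly the network cost listed under the disadvantages.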

2. HDFS replica write consistency

HDFS uses a chained pipeline: the client writes to the first DataNode, which forwards the data to the next DataNode, and so on. The write propagates down the chain, and the acknowledgments travel back up the chain to the client.

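Advantages: Strong consistency, because the client receives an acknowledgment only after every DataNode in the pipeline has written the data. The client also sends the data only once, to the first DataNode.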

Disadvantages: Every replica must write successfully before the operation is considered successful, and each write has to traverse the whole chain, which lowers throughput.
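
To make the chained flow concrete, here is a minimal sketch in Python. It is not the HDFS API: `DataNode`, `receive`, and `build_pipeline` are hypothetical names, the data is kept in memory, and the sketch leaves out the failure handling that a real HDFS pipeline performs.

```python
class DataNode:
    """Hypothetical stand-in for an HDFS DataNode; not the real HDFS API."""

    def __init__(self, name, downstream=None, healthy=True):
        self.name = name
        self.downstream = downstream  # next node in the chain, or None
        self.healthy = healthy
        self.packets = []

    def receive(self, packet):
        """Store the packet, forward it downstream, and return the ack path."""
        if not self.healthy:
            raise OSError(f"{self.name} failed to write")
        self.packets.append(packet)
        if self.downstream is None:
            return [self.name]  # the last node starts the ack chain
        acks = self.downstream.receive(packet)
        return acks + [self.name]  # the ack travels back toward the client


def build_pipeline(names):
    """Link the named nodes into a chain and return the first node."""
    node = None
    for name in reversed(names):
        node = DataNode(name, downstream=node)
    return node


head = build_pipeline(["dn1", "dn2", "dn3"])
# The client sends the packet once, to the head of the pipeline only.
print(head.receive(b"packet-0"))  # ['dn3', 'dn2', 'dn1']
```

The returned list shows the order in which acknowledgments arrive: the last DataNode acknowledges first, and the client hears back only after the whole chain has written the packet.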

3. Elasticsearch replica write consistency

The write is first sent to the primary (in Elasticsearch terms, the primary shard). After the primary succeeds, it forwards the request to all replicas and waits for their acknowledgments before responding to the client.

Advantages: Good performance. Because the replicas are written in parallel, write latency is the primary write time plus the longest replica write time.

Disadvantages: Every write depends on the primary, and because the primary must forward the data to every replica, large data volumes can strain its network I/O.
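
The primary-replica flow and its latency can be sketched as follows. This is an illustrative model, not the Elasticsearch client API: `Shard`, `index`, `primary_write`, and the `write_ms` values are hypothetical, and write times are simulated numbers rather than measured ones.

```python
from concurrent.futures import ThreadPoolExecutor


class Shard:
    """Hypothetical shard copy with a fixed, simulated write time in ms."""

    def __init__(self, name, write_ms):
        self.name = name
        self.write_ms = write_ms
        self.docs = {}

    def index(self, doc_id, doc):
        self.docs[doc_id] = doc
        return self.write_ms  # simulated cost; nothing actually sleeps


def primary_write(primary, replicas, doc_id, doc):
    """Write on the primary, fan out to the replicas, wait for every ack."""
    primary_ms = primary.index(doc_id, doc)  # the primary must succeed first
    if not replicas:
        return primary_ms
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        # The primary, not the client, sends the data to each replica.
        replica_ms = list(pool.map(lambda r: r.index(doc_id, doc), replicas))
    # Replicas run in parallel, so only the slowest one adds to the latency.
    return primary_ms + max(replica_ms)


primary = Shard("primary", write_ms=5)
replicas = [Shard("replica-1", write_ms=7), Shard("replica-2", write_ms=12)]
latency = primary_write(primary, replicas, "1", {"title": "hello"})
print(latency)  # 17
```

With these assumed numbers the result is 5 + max(7, 12) = 17 ms, which matches the latency formula given under the advantages.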

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
