Overview of the Hadoop Ecosystem and Modern Big Data Technologies

This article provides a comprehensive overview of Hadoop and its surrounding ecosystem, detailing core components, storage principles, key algorithms, and a wide range of modern big‑data technologies such as Spark, Flink, Kafka, NoSQL databases, and cloud‑based processing platforms.

Architecture Digest

Mar 28, 2016

The article begins with background on Hadoop, noting its evolution over more than a decade and introducing the post‑Hadoop era, in which complementary technologies such as NoSQL databases are used alongside it.

It then explains Hadoop’s two core components: HDFS for reliable, distributed storage, and MapReduce for locality‑aware processing, in which computation is scheduled on the nodes that already hold the data. It also introduces Amazon EMR, a managed Hadoop service that runs on EC2 and S3 and suits occasional large‑scale jobs.
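To make the MapReduce model concrete, here is a minimal single-process sketch of the classic word count. It is not the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real job would run the map and reduce steps on many nodes with the framework handling the shuffle.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an in-memory input."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Calling `word_count(["a b a", "b c"])` returns `{"a": 2, "b": 2, "c": 1}`. In a cluster, each call to `map_phase` would ideally run on a node that stores the corresponding HDFS block, which is the locality the paragraph above refers to.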

Key ecosystem projects are listed, including Pig, Hive, HBase, Sqoop, Flume, ZooKeeper, and distributions such as Cloudera, Hortonworks, and MapR, each with its own strengths.

The principles section discusses data storage hierarchy, locality, and the trade‑offs between memory, disk, and SSD for performance and durability.

Several probabilistic algorithms used in big data are described: HyperLogLog for estimating the number of distinct elements (cardinality), the Bloom filter for membership testing (false positives are possible, false negatives are not), and the Count‑Min Sketch for frequency estimation.
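Two of these structures are small enough to sketch directly. The code below is an illustrative, unoptimized version of a Bloom filter and a Count-Min Sketch; the class names, default sizes, and the SHA-256 salting scheme in `_indexes` are assumptions chosen for readability, not taken from the article or from any particular library.

```python
import hashlib
from typing import Iterator


def _indexes(item: str, count: int, modulus: int) -> Iterator[int]:
    """Derive `count` indexes in [0, modulus) by salting SHA-256 with a seed."""
    for seed in range(count):
        digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
        yield int.from_bytes(digest[:8], "big") % modulus


class BloomFilter:
    """Set membership with possible false positives and no false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple; a real filter packs bits.
        self.bits = bytearray(num_bits)

    def add(self, item: str) -> None:
        for idx in _indexes(item, self.num_hashes, self.num_bits):
            self.bits[idx] = 1

    def might_contain(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[idx]
                   for idx in _indexes(item, self.num_hashes, self.num_bits))


class CountMinSketch:
    """Frequency estimates that can overcount but never undercount."""

    def __init__(self, width: int = 256, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def add(self, item: str, count: int = 1) -> None:
        # Each row uses its own hash, so one item touches one cell per row.
        for row, col in enumerate(_indexes(item, self.depth, self.width)):
            self.table[row][col] += count

    def estimate(self, item: str) -> int:
        # Collisions only inflate cells, so the minimum is the tightest bound.
        return min(self.table[row][col]
                   for row, col in enumerate(_indexes(item, self.depth, self.width)))
```

After adding an item, `might_contain` always returns `True` for it, and `estimate` never returns less than the true count. Both structures trade a bounded error for fixed memory, which is why they appear in systems that cannot afford to store every key.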

The CAP theorem is introduced, followed by a brief survey of distributed‑system algorithms and concepts such as Paxos, Gossip protocol, Quorum, vector clocks, Byzantine fault tolerance, and two‑phase commit.
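Of the concepts listed above, vector clocks are the easiest to show in a few lines. The following is a minimal sketch, assuming clocks are plain dictionaries from node name to counter; the helper names `tick`, `merge`, and `compare` are hypothetical.

```python
from typing import Dict

Clock = Dict[str, int]


def tick(clock: Clock, node: str) -> Clock:
    """Return a copy of `clock` with `node`'s counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(a: Clock, b: Clock) -> Clock:
    """Combine two clocks by taking the per-node maximum."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}


def compare(a: Clock, b: Clock) -> str:
    """Classify the causal relation between two clocks."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a happened before b
    if b_le_a:
        return "after"       # a happened after b
    return "concurrent"      # neither dominates: a conflict to resolve
```

For example, if `a = tick({}, "n1")`, `b = tick(a, "n2")`, and `c = tick(a, "n1")`, then `compare(a, b)` is `"before"` while `compare(b, c)` is `"concurrent"`, because each of `b` and `c` has seen an update the other has not. A store that detects concurrent versions this way can keep both and reconcile them later.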

More advanced systems are covered next: Google’s Spanner, F1, and Dremel for storage and querying; Spark for in‑memory batch, graph, and stream processing; Flink, with its SQL‑style query optimization; and Kafka as a real‑time data pipeline, with a note on the Confluent Platform.
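The idea behind Kafka as a pipeline is an append-only log that several consumers read at their own pace. The toy model below illustrates only that idea; it is not the Kafka client API, and `TopicLog`, `append`, and `poll` are hypothetical names. Real Kafka also partitions topics and lets consumers commit offsets explicitly, which this sketch omits.

```python
from typing import Dict, List


class TopicLog:
    """A single in-memory, append-only log with per-group read offsets."""

    def __init__(self) -> None:
        self._records: List[str] = []
        # Consumer group name -> offset of the next record to deliver.
        self._offsets: Dict[str, int] = {}

    def append(self, record: str) -> int:
        """Append a record and return the offset it was stored at."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, group: str, max_records: int = 10) -> List[str]:
        """Return the next records for `group` and advance its offset."""
        start = self._offsets.get(group, 0)
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch
```

Because records are never removed when read, a second consumer group can replay the same log from offset 0 without affecting the first. That decoupling of producers from consumers is what makes a log useful as a shared data pipeline.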

The streaming frameworks Storm and Samza are presented, along with the Lambda architecture, which combines batch and stream processing, and Summingbird, which unifies the two under a single programming model.
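The Lambda architecture can be reduced to three functions: a batch layer that recomputes a view from all data up to some cutoff, a speed layer that covers only the events since then, and a query that merges the two. The sketch below assumes page-view events shaped as `{"page": ..., "ts": ...}`; the event shape and function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

Event = Dict[str, object]


def batch_view(events: Iterable[Event], cutoff: int) -> Counter:
    """Batch layer: count page views for events at or before `cutoff`."""
    return Counter(e["page"] for e in events if e["ts"] <= cutoff)


def realtime_view(events: Iterable[Event], cutoff: int) -> Counter:
    """Speed layer: count page views for events after `cutoff` only."""
    return Counter(e["page"] for e in events if e["ts"] > cutoff)


def query(page: str, batch: Counter, realtime: Counter) -> int:
    """Serving layer: merge both views to answer with up-to-date totals."""
    return batch[page] + realtime[page]


events = [
    {"page": "home", "ts": 1},
    {"page": "home", "ts": 5},
    {"page": "docs", "ts": 7},
    {"page": "home", "ts": 9},
]
batch = batch_view(events, cutoff=5)
recent = realtime_view(events, cutoff=5)
```

Here `query("home", batch, recent)` returns 3: two views from the batch layer plus one from the speed layer. Note that the counting logic is written twice, once per layer, which is the duplication that a single programming model such as Summingbird's aims to remove.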

The article then surveys NoSQL databases, highlighting Cassandra’s column‑family model and eventual consistency, and discusses SQL‑on‑Hadoop projects such as Hive, Spark SQL, Impala, Presto, Tajo, and Drill.

Other notable technologies include Druid for real‑time analytics, the Berkeley Data Analytics Stack (BDAS) projects Mesos, Tachyon, and BlinkDB, and cloud data warehousing with Amazon Redshift.

Overall, the piece serves as a high‑level reference for architects and engineers designing large‑scale data processing systems.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.