Exploring the Apache Big Data Ecosystem: Hadoop, Spark, Flink, and More
This article surveys the rapidly evolving big data landscape by reviewing a wide range of Apache projects—including Hadoop, Spark, Flink, HBase, Kudu, Impala, Kafka, and others—detailing their core components, architectures, strengths, and typical use‑cases for building distributed data platforms.
Introduction
In recent years the big data industry has grown rapidly, spawning many distributed products and architectures. The author shares tools and impressions gathered from practical experience, aiming to sketch a panoramic view of the distributed ecosystem.
Industry Landscape
Matt Turck's 2019 AI and big data landscape diagram (published on his blog) maps the companies and data-related products in the field. Many of the open-source products on it are Apache Software Foundation projects.
Apache Hadoop
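The core Hadoop project consists of HDFS, MapReduce, and YARN; HBase is a separate Apache project that runs on top of HDFS. HDFS splits files into blocks (128 MB by default) that are stored on DataNodes (DN), while the NameNode (NN) holds the file system metadata and the mapping from blocks to DataNodes. Each block is replicated three times by default. Hadoop 2.x introduced a standby NN for high availability, with failover managed by the ZooKeeper Failover Controller (ZKFC), and Federation, which relieves the single-NN bottleneck by splitting the namespace across several NameNodes.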
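As a rough illustration of the block and replication arithmetic, the sketch below estimates how many blocks a file occupies and how much raw disk it consumes. It assumes the 128 MB default block size and a replication factor of 3; the function name block_layout is hypothetical and is not part of any Hadoop API.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB default block size
REPLICATION = 3                  # assumed default replication factor


def block_layout(file_size_bytes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (number of blocks, raw bytes consumed across all DataNodes)."""
    if file_size_bytes <= 0:
        return 0, 0
    # Integer ceiling division: a partly filled last block still counts as a block.
    blocks = (file_size_bytes + block_size - 1) // block_size
    # The last block is not padded, so raw usage is simply size * replication.
    return blocks, file_size_bytes * replication


blocks, raw = block_layout(1024 ** 3)      # a 1 GiB file
print(blocks, raw // (1024 * 1024))        # 8 blocks, 3072 MiB of raw storage
```

The NameNode only tracks the block list and locations, which is why many small files (one block entry each) strain it more than a few large ones.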
YARN manages cluster resources through a ResourceManager (RM) and per-node NodeManagers (NM). Each application launches an ApplicationMaster (AM), which requests containers from the RM; the granted containers then run on NM nodes.
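The request flow can be pictured with a toy model. This is only a sketch under strong assumptions: the classes below are hypothetical stand-ins, the scheduler is a naive first-fit on memory, and nothing here corresponds to the real YARN client API or its schedulers.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    name: str
    free_mb: int
    containers: list = field(default_factory=list)


class ResourceManager:
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id, memory_mb):
        """Grant one container on the first node with enough free memory, else None."""
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.free_mb -= memory_mb
                node.containers.append((app_id, memory_mb))
                return node.name
        return None


class ApplicationMaster:
    def __init__(self, app_id, rm):
        self.app_id = app_id
        self.rm = rm

    def request(self, count, memory_mb):
        """Ask the RM for `count` containers; return the hosts that were granted."""
        grants = [self.rm.allocate(self.app_id, memory_mb) for _ in range(count)]
        return [host for host in grants if host is not None]


rm = ResourceManager([NodeManager("nm1", 4096), NodeManager("nm2", 2048)])
am = ApplicationMaster("app-1", rm)
print(am.request(count=4, memory_mb=2048))  # ['nm1', 'nm1', 'nm2']
```

The fourth request goes unfulfilled because the cluster is out of memory; a real AM would keep the request pending until resources free up.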
Apache HBase & Kudu
HBase is a distributed wide-column (column-family) store that uses a Write-Ahead Log (WAL) for durability and Log-Structured Merge (LSM) trees for efficient writes. It consists of an HMaster and RegionServers, coordinated by ZooKeeper. Kudu offers similar functionality but does not rely on ZooKeeper (it uses Raft consensus instead) and has its own storage format.
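To make the WAL and LSM ideas concrete, here is a minimal in-memory sketch. It assumes a tiny flush threshold and keeps everything in Python lists and dicts; TinyLSM and its methods are hypothetical names, and real HBase adds regions, HFiles on HDFS, tombstones, and much more.

```python
class TinyLSM:
    def __init__(self, flush_threshold=2):
        self.wal = []         # append-only log, written before the in-memory update
        self.memstore = {}    # mutable in-memory buffer
        self.sstables = []    # immutable sorted runs, oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.wal.append((key, value))   # 1. record the write for durability
        self.memstore[key] = value      # 2. apply the cheap in-memory write
        if len(self.memstore) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        # Persist the buffer as one sorted run; the logged entries are now redundant.
        self.sstables.append(sorted(self.memstore.items()))
        self.memstore = {}
        self.wal = []

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        for run in reversed(self.sstables):   # the newest run wins
            for k, v in run:
                if k == key:
                    return v
        return None

    def compact(self):
        merged = {}
        for run in self.sstables:   # oldest to newest, so later values overwrite
            merged.update(run)
        self.sstables = [sorted(merged.items())]


store = TinyLSM()
store.put("a", 1)
store.put("b", 2)   # reaches the threshold and triggers a flush
store.put("a", 3)   # newer value lives in the memstore
print(store.get("a"), store.get("b"))  # 3 2
```

Writes never modify an existing run, which is what keeps them sequential; the cost is paid later by reads that check several runs and by compaction.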
Apache Spark
Spark, which originated at UC Berkeley's AMPLab, accelerates batch processing by keeping intermediate data in memory and by planning each job as a DAG of stages whose tasks run in parallel. It also provides Spark Streaming, Structured Streaming, Spark SQL, and MLlib. However, its heavy memory use can make it less stable than MapReduce when memory is tight.
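The benefit of keeping intermediate data in memory can be shown with a small lazy-evaluation sketch. It is single-threaded plain Python, not PySpark: LazyDataset and its methods are hypothetical names that only mimic the idea of a lineage graph whose cached nodes are computed once and reused.

```python
class LazyDataset:
    """A node in a lineage graph: a source or a transformation of a parent node."""

    def __init__(self, compute, parent=None):
        self._compute = compute
        self._parent = parent
        self._cache_enabled = False
        self._cached = None
        self.evaluations = 0   # how many times this node was actually computed

    @classmethod
    def from_list(cls, items):
        return cls(lambda _: list(items))

    def map(self, fn):
        return LazyDataset(lambda rows: [fn(r) for r in rows], parent=self)

    def filter(self, pred):
        return LazyDataset(lambda rows: [r for r in rows if pred(r)], parent=self)

    def cache(self):
        self._cache_enabled = True
        return self

    def collect(self):
        # Nothing runs until an action such as collect() walks the lineage.
        if self._cache_enabled and self._cached is not None:
            return self._cached
        upstream = self._parent.collect() if self._parent else None
        result = self._compute(upstream)
        self.evaluations += 1
        if self._cache_enabled:
            self._cached = result
        return result


squares = LazyDataset.from_list(range(5)).map(lambda x: x * x).cache()
evens = squares.filter(lambda x: x % 2 == 0)
odds = squares.filter(lambda x: x % 2 == 1)
print(evens.collect(), odds.collect(), squares.evaluations)  # [0, 4, 16] [1, 9] 1
```

Without the cache() call, squares would be recomputed for each downstream branch, which is the cost that disk-based MapReduce chains pay between jobs.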
Apache Flink
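Flink grew out of the Stratosphere research project in Berlin and was commercialized by data Artisans (acquired by Alibaba in 2019 and since renamed Ververica). It is a native stream-processing engine that treats batch as a bounded stream, so it supports both batch and streaming workloads. Key features include state management, checkpointing, windowing, and watermarks.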
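Windowing and watermarks are easier to grasp with a worked example. The sketch below counts events per key in tumbling event-time windows and uses a watermark that trails the largest timestamp seen by a fixed bound. It is plain Python, not the Flink API; the function name, the firing rule (a window fires once the watermark reaches its end), and the choice to drop late events are simplifying assumptions.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size, max_out_of_orderness):
    """events: iterable of (event_time, key) in arrival order.

    Returns (emitted, dropped): emitted holds (window_start, key, count) in
    firing order; dropped holds events that arrived after their window fired.
    """
    open_windows = defaultdict(lambda: defaultdict(int))  # start -> key -> count
    emitted, dropped = [], []
    watermark = float("-inf")
    for event_time, key in events:
        window_start = event_time - (event_time % window_size)
        if window_start + window_size <= watermark:
            dropped.append((event_time, key))   # its window already fired
        else:
            open_windows[window_start][key] += 1
        # The watermark trails the largest timestamp seen so far.
        watermark = max(watermark, event_time - max_out_of_orderness)
        ready = sorted(w for w in open_windows if w + window_size <= watermark)
        for start in ready:
            for k, count in sorted(open_windows.pop(start).items()):
                emitted.append((start, k, count))
    return emitted, dropped


events = [(1, "a"), (4, "a"), (12, "b"), (3, "a"), (15, "b"), (25, "a"), (2, "a")]
print(tumbling_window_counts(events, window_size=10, max_out_of_orderness=5))
# ([(0, 'a', 3), (10, 'b', 2)], [(2, 'a')])
```

The event at time 3 arrives out of order but is still counted because the watermark had only reached 7; the event at time 2 arrives after the watermark passed 10 and is treated as late. The window starting at 20 stays open because no later event advances the watermark far enough.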
Apache Impala
Impala is an in-memory SQL query engine written in C++ that queries data in HDFS, HBase, and Kudu. It is faster than MapReduce-based querying, but it is less widely adopted than Spark.
Apache ZooKeeper
ZooKeeper provides distributed coordination services such as locks, configuration management, and leader election. It uses the ZAB protocol and a leader-follower architecture.
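Leader election on top of such a coordination service is commonly built from ephemeral sequential nodes: each candidate registers a node, and the holder of the lowest sequence number leads. The sketch below imitates that recipe with an in-memory class. It is not a ZooKeeper client and does not implement ZAB; ElectionDir and its methods are hypothetical names.

```python
class ElectionDir:
    """Stand-in for a znode directory holding ephemeral sequential children."""

    def __init__(self):
        self._next_seq = 0
        self.members = {}   # sequence number -> client name

    def join(self, client):
        seq = self._next_seq
        self._next_seq += 1     # sequence numbers only grow, never get reused
        self.members[seq] = client
        return seq

    def leave(self, seq):
        # Models a session expiring, which deletes the client's ephemeral node.
        self.members.pop(seq, None)

    def leader(self):
        return self.members[min(self.members)] if self.members else None


election = ElectionDir()
seq_a = election.join("worker-a")
seq_b = election.join("worker-b")
print(election.leader())   # worker-a
election.leave(seq_a)      # worker-a crashes or loses its session
print(election.leader())   # worker-b
```

Because a rejoining client always receives a higher sequence number, it cannot displace the current leader.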
Apache Sqoop
Sqoop transfers data between relational databases and HDFS. It supports both import and export, each configured through many command-line parameters; Sqoop 2 moves to a more complex, server-based architecture.
Apache Flume
Flume is a distributed data ingestion tool with Source, Channel, and Sink components, supporting various data sources (files, Netcat, JMS, HTTP) and sinks (HBase, HDFS, Kafka, etc.).
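The Source, Channel, and Sink roles can be sketched as a bounded buffer between a producer and a consumer. This is a simplified, single-threaded illustration: MemoryChannel and run_agent are hypothetical names, the sink is just a Python list, and a real Flume agent runs these components concurrently with transactional handoff.

```python
from collections import deque


class MemoryChannel:
    """Bounded in-memory buffer between a source and a sink (capacity >= 1)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()

    def put(self, event):
        if len(self.events) >= self.capacity:
            return False        # channel full: the source must wait
        self.events.append(event)
        return True

    def take(self, batch_size):
        batch = []
        while self.events and len(batch) < batch_size:
            batch.append(self.events.popleft())
        return batch


def run_agent(lines, channel, sink, batch_size=2):
    """Source pushes lines into the channel; the sink drains it in batches."""
    for line in lines:
        if not channel.put(line):
            sink.extend(channel.take(batch_size))   # make room, then retry
            channel.put(line)
    while channel.events:                           # final drain
        sink.extend(channel.take(batch_size))


delivered = []
run_agent(["e1", "e2", "e3", "e4", "e5"], MemoryChannel(capacity=3), delivered)
print(delivered)  # ['e1', 'e2', 'e3', 'e4', 'e5']
```

The channel decouples the two sides, so a slow sink causes back-pressure on the source rather than data loss.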
Apache Kafka
Kafka is a distributed messaging system that has evolved into a streaming platform with Kafka Streams. Each topic is split into partitions, and messages are ordered within a partition. Kafka achieves high throughput through sequential disk writes and mmap.
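A partitioned, append-only log with per-partition offsets can be modeled in a few lines. The sketch below is not a Kafka client: Topic, send, and read are hypothetical names, the log lives in Python lists rather than on disk, and CRC32 stands in for the hash that Kafka's default partitioner applies to keys (murmur2).

```python
import zlib


class Topic:
    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key, value):
        """Append value to the partition chosen by the key; return (partition, offset)."""
        # Equal keys always hash to the same partition, which preserves per-key order.
        p = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1

    def read(self, partition, offset, max_records=10):
        """Return up to max_records messages starting at the given offset."""
        return self.partitions[partition][offset:offset + max_records]


topic = Topic(num_partitions=3)
p1, o1 = topic.send("user-1", "login")
p2, o2 = topic.send("user-1", "purchase")
print(p1 == p2, o2 - o1)      # True 1
print(topic.read(p1, o1))     # ['login', 'purchase']
```

Consumers track only an offset per partition, so reading is a sequential scan from that position and the broker never has to delete individual messages on consumption.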
Apache Ranger & Sentry
Both provide fine‑grained security for the big data stack. Sentry integrates via plugins into Impala, Hive, HDFS, etc., while Ranger supports a broader set of components (HBase, Hive, YARN, Storm, Solr, Kafka, Atlas) through Ranger Admin and plugins.
Apache Atlas
Atlas manages metadata and data lineage, supporting sources like Hive, Sqoop, and Storm, and offers both batch and hook‑based metadata ingestion.
Apache Kylin
Kylin is an OLAP‑oriented distributed data warehouse that builds pre‑computed cubes stored in HBase, providing multi‑dimensional analysis and integration with BI tools such as Tableau and Superset.
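Cube pre-computation means aggregating the measure for every combination of dimensions ahead of time, so that a query becomes a lookup. The sketch below does this for a SUM over a toy dataset; build_cube and the sample rows are illustrative assumptions, and Kylin's actual build process, storage layout, and pruning of combinations are far more involved.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Pre-aggregate SUM(measure) for every subset of dimensions (every cuboid)."""
    cube = {}
    for r in range(len(dimensions) + 1):
        for dims in combinations(dimensions, r):
            cuboid = defaultdict(int)
            for row in rows:
                cuboid[tuple(row[d] for d in dims)] += row[measure]
            cube[dims] = dict(cuboid)
    return cube


rows = [
    {"region": "EU", "year": 2019, "sales": 10},
    {"region": "EU", "year": 2020, "sales": 5},
    {"region": "US", "year": 2019, "sales": 7},
]
cube = build_cube(rows, ("region", "year"), "sales")
print(cube[("region",)][("EU",)])   # 15
print(cube[("year",)][(2019,)])     # 17
print(cube[()][()])                 # 22
```

With n dimensions there are 2^n cuboids, which is why the number of dimensions in a cube has to be kept under control.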
Apache Hive & Tez
Hive provides a SQL‑like interface on HDFS, originally using MapReduce, later optimized with Hive on Spark and Hive on Tez (which adds DAG‑based parallelism).
Presto
Presto is an in‑memory distributed query engine supporting many connectors for federated queries. It excels in low‑latency analytics but can suffer from resource contention and lacks a mature web UI.
Apache Parquet & ORC
Parquet and ORC are columnar storage formats optimized for analytical workloads, offering better compression and scan efficiency than row-oriented storage. ORC often performs better in Hive-based stacks, while Parquet is more widely used in data lake solutions; the actual gap depends on the query engine and the workload.
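Why a columnar layout compresses and scans well can be seen on a tiny example. The sketch below pivots rows into columns and applies run-length encoding, one of the encodings both formats use. It does not read or write real Parquet or ORC files; to_columns and run_length_encode are hypothetical helper names.

```python
def to_columns(rows):
    """Pivot a non-empty list of row dicts into a dict of column lists."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


rows = [
    {"country": "DE", "amount": 5},
    {"country": "DE", "amount": 7},
    {"country": "DE", "amount": 1},
    {"country": "FR", "amount": 2},
]
columns = to_columns(rows)
print(run_length_encode(columns["country"]))  # [['DE', 3], ['FR', 1]]
print(sum(columns["amount"]))                 # 15, computed without touching "country"
```

Values of the same column sit next to each other, so repeats compress well and an aggregate over one column never has to read the others.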
Apache Griffin
Griffin, an eBay‑originated data quality monitoring platform, provides data validation, alerting, and visual reporting for ETL pipelines.
Apache Zeppelin
Zeppelin is an online notebook similar to Jupyter, supporting multiple interpreters (Spark, Flink, Hive, etc.) and enabling collaborative data exploration and visualization.
Apache Superset
Superset is an open‑source data visualization tool for building dashboards, comparable to Redash and Metabase.
Tableau
Tableau is a commercial BI platform offering drag‑and‑drop dashboard creation, extensive data source support, and robust user management.
TPCx‑BB
TPCx-BB is a TPC benchmark for big data systems that simulates a retail workload. It measures performance by running a fixed set of 30 queries, a mix of SQL and machine-learning tasks, over large structured, semi-structured, and unstructured datasets.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
