Exploring the Apache Big Data Ecosystem: Hadoop, Spark, Flink, and More

This article surveys the rapidly evolving big data landscape by reviewing a wide range of Apache projects—including Hadoop, Spark, Flink, HBase, Kudu, Impala, Kafka, and others—detailing their core components, architectures, strengths, and typical use‑cases for building distributed data platforms.

Alibaba Cloud Developer

Jul 22, 2020

Introduction

In recent years the big data industry has grown rapidly, spawning many distributed products and architectures. The author shares tools and impressions gathered from practical experience, aiming to sketch a panoramic view of the distributed ecosystem.

Industry Landscape

Matt Turck’s 2019 AI and big data industry diagram (from his blog) maps the companies and data-related products in this space. Many of the open-source projects on it are hosted by the Apache Software Foundation.

Apache Hadoop

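Hadoop’s core modules are HDFS, MapReduce, and YARN; HBase is a separate Apache project that runs on top of HDFS. HDFS splits files into blocks (128 MB by default) and stores them on DataNodes (DN), while the NameNode (NN) keeps the file-system metadata and block locations. Each block is replicated, with three copies by default. Hadoop 2.x introduced a standby NN for high availability (with failover managed by ZKFC) and Federation, which spreads the namespace across multiple NameNodes to relieve the single-NN bottleneck.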
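The block and replication defaults translate into simple storage arithmetic. The sketch below assumes a 128 MB block size and three replicas per block; the function name and the returned fields are illustrative, not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 128   # default HDFS block size cited above
REPLICATION = 3       # assumed default number of copies per block


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Estimate how many blocks one file needs and how much raw disk it uses."""
    blocks = math.ceil(file_size_mb / block_size_mb) if file_size_mb > 0 else 0
    return {
        "blocks": blocks,
        "block_replicas": blocks * replication,
        # A partially filled last block only uses as much disk as it holds,
        # so raw usage is file size times replication.
        "raw_storage_mb": file_size_mb * replication,
    }


footprint = hdfs_footprint(1000)
print(footprint)
```

For a 1000 MB file this yields 8 blocks, 24 block replicas across the DataNodes, and 3000 MB of raw storage.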

YARN manages cluster resources through a ResourceManager (RM) and per-node NodeManagers (NM). Each application launches an ApplicationMaster (AM), which requests containers from the RM; the granted containers are then allocated and run on NM nodes.
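The request flow between AM, RM, and NM can be sketched as a toy simulation. Everything here is hypothetical: the class and method names are invented for illustration, and the first-fit placement is an assumption that is far simpler than YARN's real schedulers.

```python
class NodeManager:
    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb


class ResourceManager:
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, memory_mb):
        """Grant a container on the first node with enough free memory."""
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.free_mb -= memory_mb
                return {"node": node.name, "memory_mb": memory_mb}
        return None  # no capacity left; a real scheduler would keep the request pending


class ApplicationMaster:
    def __init__(self, rm):
        self.rm = rm
        self.containers = []

    def request(self, count, memory_mb):
        """Ask the RM for `count` containers and keep the ones that were granted."""
        for _ in range(count):
            container = self.rm.allocate(memory_mb)
            if container is not None:
                self.containers.append(container)
        return self.containers


rm = ResourceManager([NodeManager("nm-1", 4096), NodeManager("nm-2", 2048)])
am = ApplicationMaster(rm)
granted = am.request(count=4, memory_mb=2048)
print(granted)
```

With 6 GB of total capacity, only three of the four requested 2 GB containers are granted: two on nm-1 and one on nm-2.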

Apache HBase & Kudu

HBase is a distributed column-family store that uses a Write-Ahead Log (WAL) for durability and Log-Structured Merge (LSM) trees for efficient writes. Its architecture consists of an HMaster and RegionServers, coordinated by ZooKeeper. Kudu offers similar functionality but does not rely on ZooKeeper and uses its own file format.
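The interplay of WAL and LSM tree can be illustrated with a small in-memory model. This is a minimal sketch of the general technique, not HBase's implementation: the class `TinyLSM`, its methods, and the flush threshold are all hypothetical.

```python
import bisect


class TinyLSM:
    def __init__(self, memstore_limit=3):
        self.wal = []          # append-only log, written before the in-memory buffer
        self.memstore = {}     # recent writes, held in memory
        self.segments = []     # immutable sorted runs, oldest first
        self.memstore_limit = memstore_limit

    def put(self, key, value):
        self.wal.append((key, value))   # log first, so a crash can be replayed
        self.memstore[key] = value
        if len(self.memstore) >= self.memstore_limit:
            self.flush()

    def flush(self):
        """Write the buffer out as one sorted, immutable segment."""
        self.segments.append(sorted(self.memstore.items()))
        self.memstore = {}
        self.wal.clear()                # flushed entries no longer need the log

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        for segment in reversed(self.segments):   # newest segment wins
            keys = [k for k, _ in segment]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return segment[i][1]
        return None

    def compact(self):
        """Merge all segments into one, keeping the newest value per key."""
        merged = {}
        for segment in self.segments:
            merged.update(segment)
        self.segments = [sorted(merged.items())] if merged else []


store = TinyLSM(memstore_limit=2)
store.put("row1", "a")
store.put("row2", "b")   # reaches the limit and triggers a flush
store.put("row1", "c")   # newer value stays in the memstore for now
print(store.get("row1"), store.get("row2"))
```

Writes only ever append to the log and the buffer, which is why this layout favors write-heavy workloads; reads pay for it by checking several places until compaction merges the segments.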

Apache Spark

Spark, originating from UC Berkeley, accelerates batch processing by keeping intermediate data in memory and using a DAG to parallelize tasks. It also provides Spark Streaming, Structured Streaming, Spark SQL, and MLlib. However, Spark’s high memory consumption can make it less stable than MapReduce when memory is tight.
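The DAG idea rests on lazy evaluation: transformations only record lineage, and work happens when a result is requested. The toy class below illustrates that one aspect in plain Python. It is not Spark's API, it does no parallel scheduling or caching, and the names `LazyDataset`, `from_list`, and `lineage` are hypothetical.

```python
class LazyDataset:
    """A toy lazily evaluated dataset that records how it was derived."""

    def __init__(self, compute, lineage):
        self._compute = compute    # function that produces the data on demand
        self.lineage = lineage     # names of the steps leading to this dataset

    @classmethod
    def from_list(cls, items):
        data = list(items)
        return cls(lambda: list(data), ["source"])

    def map(self, fn):
        return LazyDataset(
            lambda: [fn(x) for x in self._compute()],
            self.lineage + ["map"],
        )

    def filter(self, predicate):
        return LazyDataset(
            lambda: [x for x in self._compute() if predicate(x)],
            self.lineage + ["filter"],
        )

    def collect(self):
        """Trigger the whole recorded chain and return the result."""
        return self._compute()


numbers = LazyDataset.from_list(range(1, 6))
squares_of_even = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
result = squares_of_even.collect()   # nothing is computed before this call
print(squares_of_even.lineage, result)
```

Because the full chain is known before execution, an engine built on this principle can plan stages and recompute lost pieces from lineage.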

Apache Flink

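Flink grew out of the Stratosphere research project in Berlin and was commercialized by data Artisans, the company founded by its creators (acquired by Alibaba in 2019 and renamed Ververica). It is a true stream-processing engine that supports both batch and streaming workloads. Key features include state management, checkpointing, windowing, and watermarks.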
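Windowing and watermarks are easier to follow with a concrete trace. The sketch below counts events per key in tumbling event-time windows and fires a window once the watermark passes its end. It is a simplified model written in plain Python, not Flink's API; the function name, the bounded out-of-orderness rule, and the policy of dropping late events are assumptions for illustration.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size, max_out_of_orderness):
    """Count events per key and window.

    `events` is an iterable of (event_time, key) in arrival order.
    Returns (emitted, dropped): fired window results and late events.
    """
    open_windows = defaultdict(lambda: defaultdict(int))  # window start -> key -> count
    emitted, dropped = [], []
    watermark = float("-inf")
    for event_time, key in events:
        start = (event_time // window_size) * window_size
        if start + window_size <= watermark:
            dropped.append((event_time, key))      # its window has already fired
        else:
            open_windows[start][key] += 1
        # The watermark trails the highest timestamp seen by a fixed bound.
        watermark = max(watermark, event_time - max_out_of_orderness)
        ready = sorted(w for w in open_windows if w + window_size <= watermark)
        for w in ready:
            for k, count in sorted(open_windows.pop(w).items()):
                emitted.append((w, k, count))
    return emitted, dropped


events = [(1, "a"), (4, "a"), (12, "b"), (3, "a"), (15, "a"), (27, "b"), (8, "a")]
emitted, dropped = tumbling_window_counts(events, window_size=10, max_out_of_orderness=5)
print(emitted)
print(dropped)
```

The out-of-order event at time 3 is still counted because the watermark has not yet passed the end of its window, while the event at time 8 arrives after that window has fired and is dropped. The window starting at 20 stays open because the input ends before the watermark reaches 30.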

Apache Impala

Impala is a C++-based, in-memory SQL query engine for HDFS, HBase, and Kudu. It delivers faster queries than traditional MapReduce, but it is less widely adopted than Spark.

Apache ZooKeeper

ZooKeeper provides distributed coordination services such as locks, configuration management, and leader election. It uses the ZAB protocol and a leader-follower architecture.
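Leader election for client applications is commonly built on ephemeral sequential nodes: each participant creates one, and the holder of the lowest sequence number leads. The simulation below models that recipe in memory only. It does not talk to a real ensemble and does not implement ZAB; the class `ElectionSim` and its methods are hypothetical.

```python
import itertools


class ElectionSim:
    """In-memory stand-in for the ephemeral-sequential-node election recipe."""

    def __init__(self):
        self._sequence = itertools.count()
        self.members = {}   # sequence number -> client name

    def join(self, client):
        """Register a participant, as if creating an ephemeral sequential node."""
        seq = next(self._sequence)
        self.members[seq] = client
        return seq

    def leave(self, seq):
        """Model a session ending, which makes the ephemeral node disappear."""
        self.members.pop(seq, None)

    def leader(self):
        """The participant holding the lowest sequence number leads."""
        return self.members[min(self.members)] if self.members else None


election = ElectionSim()
a = election.join("node-a")
b = election.join("node-b")
c = election.join("node-c")
first_leader = election.leader()
election.leave(a)                 # the current leader fails
second_leader = election.leader()
print(first_leader, second_leader)
```

When the leader's node vanishes, the next-lowest participant takes over without any extra negotiation, which is what makes the recipe attractive.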

Apache Sqoop

Sqoop transfers data between relational databases and HDFS, supporting both import and export through a large set of command-line parameters. Sqoop 2 moves to a more complex, server-based architecture.

Apache Flume

Flume is a distributed data ingestion tool with Source, Channel, and Sink components, supporting various data sources (files, Netcat, JMS, HTTP) and sinks (HBase, HDFS, Kafka, etc.).

Apache Kafka

Kafka is a distributed messaging system that evolved into a streaming platform with Kafka Streams. It stores messages in ordered partitions and achieves high throughput through sequential disk writes and mmap.
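The partition model can be sketched as a set of append-only lists, where a message's position in its list is its offset. This is a toy model, not a Kafka client: `TinyTopic` is a hypothetical name, and CRC32 is an arbitrary stand-in for whatever key hash a real producer uses.

```python
import zlib


class TinyTopic:
    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key, value):
        """Append to the partition chosen by the key; return (partition, offset)."""
        # Assumption: CRC32 replaces the real client-side partitioner.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        return partition, len(self.partitions[partition]) - 1

    def read(self, partition, offset):
        """Return every message in a partition from the given offset onward."""
        return self.partitions[partition][offset:]


topic = TinyTopic(num_partitions=3)
p1, o1 = topic.send("user-1", "login")
p2, o2 = topic.send("user-1", "click")
print(p1 == p2, o1, o2)
print(topic.read(p1, o1))
```

Messages with the same key always land in the same partition, so their relative order is preserved; ordering across different partitions is not guaranteed.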

Apache Ranger & Sentry

Both provide fine‑grained security for the big data stack. Sentry integrates via plugins into Impala, Hive, HDFS, etc., while Ranger supports a broader set of components (HBase, Hive, YARN, Storm, Solr, Kafka, Atlas) through Ranger Admin and plugins.

Apache Atlas

Atlas manages metadata and data lineage, supporting sources like Hive, Sqoop, and Storm, and offers both batch and hook‑based metadata ingestion.

Apache Kylin

Kylin is an OLAP‑oriented distributed data warehouse that builds pre‑computed cubes stored in HBase, providing multi‑dimensional analysis and integration with BI tools such as Tableau and Superset.
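Cube pre-computation means aggregating a measure for every combination of dimensions ahead of query time. The sketch below does this for a tiny in-memory fact table. It illustrates the general idea only, not Kylin's build engine or storage layout; the function `build_cube` and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Pre-aggregate `measure` for every subset of `dimensions` (one cuboid each)."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            cuboid = defaultdict(int)
            for row in rows:
                cuboid[tuple(row[d] for d in dims)] += row[measure]
            cube[dims] = dict(cuboid)
    return cube


sales = [
    {"region": "EU", "product": "book", "amount": 10},
    {"region": "EU", "product": "pen", "amount": 5},
    {"region": "US", "product": "book", "amount": 7},
]
cube = build_cube(sales, ["region", "product"], "amount")
print(cube[("region",)])
print(cube[()])
```

A query such as "total amount by region" becomes a lookup in the `("region",)` cuboid instead of a scan. The price is that the number of cuboids doubles with each added dimension.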

Apache Hive & Tez

Hive provides a SQL-like interface over data in HDFS. It originally compiled queries into MapReduce jobs and later gained faster execution engines: Hive on Spark and Hive on Tez (which adds DAG-based parallelism).

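Presto

Presto, originally developed at Facebook, is released under the Apache License but is not an Apache Software Foundation project. It is an in-memory distributed query engine that supports many connectors for federated queries. It excels at low-latency analytics but can suffer from resource contention and lacks a mature web UI.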

Apache Parquet & ORC

Parquet and ORC are columnar storage formats optimized for analytical workloads, offering better compression and scan efficiency than row-oriented storage. ORC is often reported to outperform Parquet, particularly in Hive-centric stacks, although results depend on the engine and the data; Parquet is widely used in data lake solutions.
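Why a columnar layout compresses and scans well can be shown with a few rows. The sketch below pivots rows into columns and applies run-length encoding, one of the encodings columnar formats rely on. It is a conceptual illustration in plain Python, not the Parquet or ORC file format; the helper names and sample data are hypothetical.

```python
def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return [tuple(pair) for pair in encoded]


rows = [
    {"country": "DE", "status": "ok", "latency_ms": 12},
    {"country": "DE", "status": "ok", "latency_ms": 15},
    {"country": "DE", "status": "error", "latency_ms": 90},
    {"country": "FR", "status": "ok", "latency_ms": 11},
]
columns = to_columns(rows)
encoded_country = run_length_encode(columns["country"])
# An aggregate over one column touches only that column's values.
avg_latency = sum(columns["latency_ms"]) / len(columns["latency_ms"])
print(encoded_country, avg_latency)
```

Values of one column sit next to each other, so repeated values collapse into short runs, and a query that needs only `latency_ms` never has to read the other columns.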

Apache Griffin

Griffin, an eBay‑originated data quality monitoring platform, provides data validation, alerting, and visual reporting for ETL pipelines.

Apache Zeppelin

Zeppelin is an online notebook similar to Jupyter, supporting multiple interpreters (Spark, Flink, Hive, etc.) and enabling collaborative data exploration and visualization.

Apache Superset

Superset is an open‑source data visualization tool for building dashboards, comparable to Redash and Metabase.

Tableau

Tableau is a commercial BI platform offering drag‑and‑drop dashboard creation, extensive data source support, and robust user management.

TPCx‑BB

TPCx-BB is a benchmark for big data systems that simulates an online retail workload. It measures performance through a suite of 30 queries over large datasets, combining SQL with machine-learning and other analytical operations.
