
Overview of Apache Big Data Ecosystem Tools

This article surveys the Apache big‑data ecosystem: Hadoop's storage and resource management; the column stores HBase and Kudu; the compute engines Spark, Flink, Impala, and Presto; coordination with ZooKeeper; ingestion with Sqoop and Flume; messaging with Kafka; security with Ranger and Sentry; metadata governance with Atlas; OLAP with Kylin; SQL on Hadoop with Hive; data quality with Griffin; notebooks with Zeppelin; visualization with Superset and Tableau; and the TPCx‑BB benchmark. It ends with a notice about an Alibaba analysis competition.

Amap Tech

Jul 23, 2020


This article provides a panoramic overview of the Apache big‑data ecosystem, describing the purpose, core components, and practical observations of many widely used projects.

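Apache Hadoop : The core Hadoop project comprises HDFS, MapReduce, and YARN; HBase is a separate Apache project that runs on top of HDFS. HDFS splits files into blocks that are stored and replicated on DataNodes (DN), while the NameNode (NN) holds the file‑system metadata. A standby NN provides high availability, and federation scales the metadata service across multiple NameNodes. YARN manages cluster resources through a ResourceManager and per‑node NodeManagers; each application runs an ApplicationMaster that requests containers from the ResourceManager.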
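To make the division of labor in HDFS concrete, here is a minimal sketch of how a file could be cut into blocks and how a NameNode-style block map could record which DataNodes hold each replica. The function name `plan_blocks`, the host names, and the round-robin placement are assumptions made for illustration; real HDFS placement is rack-aware and far more involved. The 128 MiB block size and replication factor of 3 are the usual HDFS defaults.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MiB, the usual HDFS default block size
REPLICATION = 3                 # the usual HDFS default replication factor


def plan_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a block map: block index -> (block length, list of replica hosts).

    Hypothetical helper: placement here is plain round-robin, which is an
    assumption for illustration and not the real HDFS placement policy.
    """
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    n_blocks = math.ceil(file_size / block_size)
    block_map = {}
    for i in range(n_blocks):
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - i * block_size)
        hosts = [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
        block_map[i] = (length, hosts)
    return block_map


# A 300 MiB file on four illustrative DataNodes.
block_map = plan_blocks(300 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
for index, (length, hosts) in block_map.items():
    print(index, length, hosts)
```

The block map is metadata only, which is why it can live on the NameNode while the bytes themselves stay on the DataNodes.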

Apache HBase & Kudu : HBase is a distributed column‑family store built on HDFS. It uses a Write‑Ahead Log (WAL) for durability and Log‑Structured Merge Trees (LSM) for write efficiency. Kudu offers a similar storage model but does not rely on ZooKeeper, and it stores data in its own format on local disks rather than in HDFS.
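The WAL-plus-LSM write path can be sketched in a few lines. This is a toy model under stated assumptions, not HBase's implementation: the class `TinyLSM`, its flush threshold, and the in-memory lists standing in for on-disk files are all hypothetical.

```python
class TinyLSM:
    """Toy LSM store: WAL append, in-memory memtable, sorted immutable runs."""

    def __init__(self, memtable_limit=2):
        self.wal = []          # stand-in for an append-only log file on disk
        self.memtable = {}     # recent writes, held in memory
        self.runs = []         # flushed sorted runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.wal.append((key, value))   # 1. log first, for durability
        self.memtable[key] = value      # 2. then apply in memory
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Write the memtable out as one sorted, immutable run.
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.wal = []                   # flushed entries no longer need the log

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):  # the newest run wins
            for k, v in run:             # real stores use indexes, not a scan
                if k == key:
                    return v
        return None

    def recover(self):
        """Rebuild the memtable by replaying the WAL, as after a crash."""
        self.memtable = dict(self.wal)


store = TinyLSM()
store.put("row1", "a")
store.put("row2", "b")   # reaches the limit and triggers a flush
store.put("row1", "c")   # the newer value lives in the memtable
print(store.get("row1"), store.get("row2"))  # prints: c b
```

Writes never modify a flushed run in place, which is what keeps them sequential and cheap; the cost is that reads may have to consult several runs.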

Apache Spark : A fast distributed compute engine that improves on MapReduce by keeping intermediate data in memory and using DAG scheduling. Spark supports batch processing, Spark Streaming (micro‑batch), Structured Streaming, SparkSQL, and MLlib. Because it holds working data in memory, it consumes more memory than MapReduce and can be less stable when memory is tight.
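The idea of recording transformations as a plan and running them only when a result is requested can be illustrated without Spark itself. The sketch below is plain Python, not the Spark API; the class `LazyDataset` and its methods are hypothetical names that only mimic the shape of the idea, and the "plan" here is a linear chain rather than a general DAG.

```python
class LazyDataset:
    """Records transformations and runs them only when collect() is called."""

    def __init__(self, source, ops=()):
        self._source = source
        self._ops = tuple(ops)   # the recorded plan, in order

    def map(self, fn):
        # Returns a new dataset with a longer plan; no data is touched yet.
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    def collect(self):
        # The "action": chain lazy iterators so each element flows through
        # every step without materializing intermediate lists.
        data = iter(self._source)
        for kind, fn in self._ops:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)


squares = LazyDataset(range(10)).map(lambda x: x * x)
even_squares = squares.filter(lambda x: x % 2 == 0)
print(even_squares.collect())  # prints: [0, 4, 16, 36, 64]
```

Because nothing runs until `collect()`, the engine sees the whole plan before executing it, which is what makes scheduling and pipelining decisions possible.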

Apache Flink : A true stream‑processing engine (unlike Spark’s micro‑batch) that also supports batch workloads. Key features include state management, checkpointing, windowing, and watermarks.
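Windowing and watermarks are easier to follow with a small simulation. The sketch below counts events per tumbling event-time window and uses a watermark that trails the highest timestamp seen by a fixed delay. It is plain Python rather than the Flink API; the function name, the parameters, and the sample events are assumptions for illustration, and the firing rule is simplified.

```python
def tumbling_window_counts(events, window_size, max_delay):
    """Count events per tumbling event-time window.

    events: iterable of (event_time, payload) in arrival order.
    The watermark is (highest event time seen) - max_delay. A window fires
    once the watermark reaches its end; an event whose window has already
    been passed by the watermark is treated as late.
    """
    open_windows = {}        # window start -> running count
    fired, late = [], []
    max_ts = None
    for ts, payload in events:
        max_ts = ts if max_ts is None else max(max_ts, ts)
        watermark = max_ts - max_delay
        start = ts - ts % window_size
        if start + window_size <= watermark:
            late.append((ts, payload))
        else:
            open_windows[start] = open_windows.get(start, 0) + 1
        # Fire every open window whose end the watermark has reached.
        for s in sorted(w for w in open_windows if w + window_size <= watermark):
            fired.append((s, open_windows.pop(s)))
    return fired, late, open_windows


# Illustrative events: timestamp 9 arrives out of order but in time,
# timestamp 3 arrives after its window has already fired.
events = [(1, "a"), (4, "b"), (11, "c"), (9, "d"), (13, "e"), (3, "f"), (25, "g")]
fired, late, still_open = tumbling_window_counts(events, window_size=10, max_delay=2)
print(fired)       # prints: [(0, 3), (10, 2)]
print(late)        # prints: [(3, 'f')]
print(still_open)  # prints: {20: 1}
```

The `max_delay` parameter is the trade-off knob: a larger value tolerates more disorder but delays every result by the same amount.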

Apache Impala : A C++‑based, in‑memory SQL query engine for HDFS, HBase, and Kudu, offering low‑latency analytics but seeing limited adoption compared with Spark.

Apache ZooKeeper : Provides coordination services such as distributed locks, configuration management, and leader election using the ZAB protocol with a leader‑follower architecture.
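Leader election on top of ZooKeeper is commonly built from ephemeral sequential znodes: each candidate creates one, the lowest sequence number leads, and every other candidate watches the node just below its own. The sketch below simulates that recipe in memory. It does not model the ZAB protocol and is not the ZooKeeper client API; the class `ElectionPath` and its methods are hypothetical.

```python
import itertools


class ElectionPath:
    """In-memory stand-in for a znode directory of ephemeral sequential nodes."""

    def __init__(self):
        self._seq = itertools.count()
        self.members = {}   # sequence number -> client id

    def join(self, client_id):
        # Models creating an ephemeral sequential node.
        seq = next(self._seq)
        self.members[seq] = client_id
        return seq

    def leave(self, seq):
        # Models session expiry deleting the client's ephemeral node.
        self.members.pop(seq, None)

    def leader(self):
        # The client holding the lowest sequence number is the leader.
        return self.members[min(self.members)] if self.members else None

    def watch_target(self, seq):
        """Sequence number a client should watch: the next-lower live node."""
        lower = [s for s in self.members if s < seq]
        return max(lower) if lower else None


election = ElectionPath()
a = election.join("client-a")
b = election.join("client-b")
c = election.join("client-c")
print(election.leader())         # prints: client-a
print(election.watch_target(c))  # prints: 1
election.leave(a)                # client-a's session ends
print(election.leader())         # prints: client-b
```

Watching only the immediate predecessor means a leader failure wakes one client instead of all of them.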

Apache Sqoop : Facilitates bulk data transfer between relational databases and HDFS, supporting both import and export operations.

Apache Flume : A distributed data ingestion service with Source‑Channel‑Sink architecture, supporting various data sources (files, Netcat, JMS, HTTP) and sinks (HBase, HDFS, Kafka, etc.).
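The Source-Channel-Sink flow can be sketched as a producer that puts events into a bounded buffer and a consumer that drains it in batches, returning a batch to the buffer if delivery fails. This is an illustration in plain Python, not Flume's API or configuration; `MemoryChannel`, `run_source`, and `drain_to_sink` are hypothetical names.

```python
from collections import deque


class MemoryChannel:
    """Bounded in-memory buffer between a source and a sink."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._events = deque()

    def put(self, event):
        if len(self._events) >= self.capacity:
            raise OverflowError("channel full")
        self._events.append(event)

    def take_batch(self, n):
        batch = []
        while self._events and len(batch) < n:
            batch.append(self._events.popleft())
        return batch

    def rollback(self, batch):
        # Put an undelivered batch back at the head, preserving its order.
        self._events.extendleft(reversed(batch))


def run_source(lines, channel):
    """Source: wrap each input line in an event and hand it to the channel."""
    for line in lines:
        channel.put({"headers": {}, "body": line})


def drain_to_sink(channel, deliver, batch_size=2):
    """Sink: deliver batches until the channel is empty; roll back on failure."""
    while True:
        batch = channel.take_batch(batch_size)
        if not batch:
            return
        try:
            deliver(batch)
        except Exception:
            channel.rollback(batch)
            raise


channel = MemoryChannel(capacity=10)
run_source(["line 1", "line 2", "line 3"], channel)
delivered = []
drain_to_sink(channel, delivered.extend)
print([event["body"] for event in delivered])
```

The channel decouples the two sides: the source can keep accepting data while the sink is slow, up to the channel's capacity.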

Apache Kafka : A distributed messaging system written in Scala and Java that has grown into a streaming platform, offering high throughput, ordering within each partition, and zero‑copy transfer to consumers via the sendfile system call.
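Ordering in Kafka holds per partition, and records that share a key go to the same partition. A minimal sketch of that behavior follows. It is not the Kafka client API: the `Topic` class is hypothetical, and `zlib.crc32` is used as a stand-in hash (Kafka's default partitioner hashes keys with murmur2).

```python
import zlib


class Topic:
    """Toy topic: a fixed number of append-only partition logs."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key, value):
        # Same key -> same hash -> same partition, so per-key order is kept.
        # crc32 is an assumption for illustration, not Kafka's actual hash.
        p = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append((key, value))
        offset = len(self.partitions[p]) - 1
        return p, offset


topic = Topic(num_partitions=3)
first = topic.send("user-42", "login")
topic.send("user-7", "view")
second = topic.send("user-42", "click")

# Both records for user-42 sit in one partition, in send order.
partition = topic.partitions[first[0]]
print([value for key, value in partition if key == "user-42"])
```

There is no ordering guarantee across partitions, so a consumer that needs a total order over several keys has to use a single partition or reorder downstream.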

Apache Ranger & Sentry : Security frameworks for fine‑grained access control across Hadoop components (HDFS, HBase, Hive, YARN, etc.).

Apache Atlas : Metadata governance tool that tracks data lineage, schema, and lifecycle information, supporting Hive, Sqoop, Storm, and more.

Apache Kylin : An OLAP engine that builds pre‑computed cubes stored in HBase for fast multidimensional analytics, integrating with BI tools like Tableau and Superset.
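Pre-computing a cube means aggregating the measure for every combination of dimensions ahead of time, so a query becomes a lookup. The sketch below shows the idea on a tiny made-up dataset. It is not Kylin's build process; `build_cube`, the column names, and the sample rows are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Pre-aggregate `measure` for every subset of `dimensions`.

    Returns {dimension tuple: {value tuple: summed measure}}. With n
    dimensions this yields 2**n groupings, which is why real cube builders
    prune combinations.
    """
    cube = {}
    for r in range(len(dimensions) + 1):
        for dims in combinations(dimensions, r):
            totals = defaultdict(int)
            for row in rows:
                totals[tuple(row[d] for d in dims)] += row[measure]
            cube[dims] = dict(totals)
    return cube


# Made-up sample data for illustration only.
rows = [
    {"city": "A", "year": 2019, "sales": 10},
    {"city": "A", "year": 2020, "sales": 5},
    {"city": "B", "year": 2020, "sales": 7},
]
cube = build_cube(rows, ["city", "year"], "sales")

print(cube[()][()])               # grand total, prints: 22
print(cube[("city",)][("A",)])    # prints: 15
print(cube[("year",)][(2020,)])   # prints: 12
```

Query time no longer depends on the number of raw rows, at the cost of build time and storage that grow with the number of dimension combinations.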

Apache Hive & Tez : Hive provides SQL‑like querying on HDFS data, originally using MapReduce, later optimized with Hive on Spark and Tez (DAG‑based execution) for better performance.

Presto : A distributed, in‑memory SQL query engine (originated at Facebook) that can query multiple data sources via connectors, but can be resource‑intensive and lacks a mature UI.

Parquet & ORC : Columnar storage formats optimized for analytical workloads. Parquet is widely used (for example, it is the file format underlying Delta Lake), while ORC often offers better performance but is supported by fewer tools.
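Why a columnar layout suits analytics can be seen in a small sketch: an aggregate reads one column instead of every row, and a column of repeated values compresses well with run-length encoding, a technique both formats use. The code below is plain Python and does not read or write either file format; the helper names and sample rows are made up for illustration.

```python
def to_columns(rows):
    """Pivot a list of row dicts into one list per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1][1] += 1
        else:
            encoded.append([v, 1])
    return encoded


# Made-up sample rows for illustration only.
rows = [
    {"country": "CN", "amount": 3},
    {"country": "CN", "amount": 4},
    {"country": "US", "amount": 5},
]
columns = to_columns(rows)

# An aggregate touches only the column it needs.
print(sum(columns["amount"]))                  # prints: 12
# Sorted or low-cardinality columns shrink under run-length encoding.
print(run_length_encode(columns["country"]))   # prints: [['CN', 2], ['US', 1]]
```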

Apache Griffin : An open‑source data quality monitoring platform (originated at eBay) that provides validation, alerting, and visual dashboards.

Apache Zeppelin : An interactive notebook similar to Jupyter, supporting many back‑ends (Spark, Flink, Hive, etc.) for collaborative data exploration.

Apache Superset : An open‑source data visualization tool for building dashboards, comparable to Redash or Metabase.

Tableau : A commercial BI platform offering drag‑and‑drop dashboard creation, extensive data source support, and enterprise‑grade user management.

TPCx‑BB : A benchmark from the Transaction Processing Performance Council that simulates an online retail workload to evaluate big‑data cluster performance.

The article ends with a promotional notice for an Alibaba‑initiated algorithm competition on vehicle video analysis.

