Master the Complete Big Data Ecosystem in One Article
This article provides a comprehensive overview of the big data ecosystem. It covers nine core technology categories, from data collection and storage to computation, analysis, scheduling, and underlying infrastructure, along with tool comparisons and selection guidelines that help readers quickly build a complete big data knowledge system.
As the big data industry evolves, its ecosystem of technologies continuously iterates. This article presents a complete knowledge system of the big data ecosystem, organized into nine core categories.
1. Data Collection Framework
Data collection (also called data synchronization) aggregates massive, scattered data from the Internet, mobile Internet, and IoT for further processing.
Flume, Logstash, Filebeat – commonly used for real‑time log data collection (details in Table 1).
Sqoop, DataX – used for offline collection from relational databases (details in Table 2).
Canal, Maxwell – used for real‑time collection from relational databases (details in Table 3).
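The difference between offline and real-time collection from a relational database can be sketched in plain Python. This is only an illustration under stated assumptions: the change-event format and the function names `full_snapshot` and `apply_change` are hypothetical and are not the APIs of Sqoop, DataX, Canal, or Maxwell. The offline style copies the whole table in one batch, while the real-time style applies individual change events as they arrive.

```python
def full_snapshot(source_rows):
    """Offline style: copy every row of the source table in one batch."""
    return {row["id"]: dict(row) for row in source_rows}


def apply_change(replica, event):
    """Real-time style: apply a single change event to the replica."""
    op = event["op"]
    if op in ("insert", "update"):
        replica[event["row"]["id"]] = dict(event["row"])
    elif op == "delete":
        replica.pop(event["row"]["id"], None)
    else:
        raise ValueError(f"unknown operation: {op}")
    return replica


source = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
replica = full_snapshot(source)            # one-off batch load

changes = [                                # hypothetical change-event format
    {"op": "update", "row": {"id": 2, "name": "robert"}},
    {"op": "insert", "row": {"id": 3, "name": "carol"}},
    {"op": "delete", "row": {"id": 1}},
]
for event in changes:
    apply_change(replica, event)           # replica follows the source row by row
```

After the loop, the replica reflects the three changes without another full copy of the table, which is the main appeal of the real-time approach for large tables.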
2. Data Storage Framework
HDFS – solves massive data storage but does not support single‑record updates.
HBase – a distributed NoSQL database built on HDFS; supports updates but not traditional SQL.
Kudu – a hybrid component between HDFS and HBase, supporting both updates and SQL analytics, though its adoption is limited.
Kafka – provides high‑throughput temporary buffering for massive data.
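The buffering role described for Kafka can be pictured as an append-only log that each consumer reads at its own pace. The sketch below is a toy model in plain Python, not the Kafka client API; the class `BufferLog` and its methods are hypothetical names chosen for illustration.

```python
from collections import defaultdict


class BufferLog:
    """Toy append-only log; hypothetical, not the Kafka client API."""

    def __init__(self):
        self._records = []                 # records are only ever appended
        self._offsets = defaultdict(int)   # next offset to read, per consumer

    def append(self, record):
        """Producer side: add a record and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, consumer, max_records=10):
        """Consumer side: read from this consumer's own offset and advance it."""
        start = self._offsets[consumer]
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch


log = BufferLog()
for i in range(5):
    log.append(f"event-{i}")

fast = log.poll("fast-consumer", max_records=5)   # reads everything at once
slow = log.poll("slow-consumer", max_records=2)   # lags behind, loses nothing
```

Because every consumer tracks its own position, a slow downstream system can catch up later while producers keep writing at full speed.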
3. Distributed Resource Management Framework
Traditional IT resources are fixed, but big‑data workloads demand dynamic scaling. Distributed resource managers such as YARN, Kubernetes, and Mesos address this need (see Figure 5).
4. Data Computation Framework
Data computation is divided into offline and real‑time processing.
(1) Offline Computation – evolved through three major generations:
MapReduce – the first‑generation engine for large‑scale distributed parallel processing.
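Tez – the second‑generation engine, which expresses a job as a DAG of tasks; it has low visibility in practice and is rarely used on its own, typically running underneath tools such as Hive.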
Spark – features in‑memory computation: it keeps intermediate results in RAM where possible instead of writing them to disk between stages, which greatly improves performance. It also offers many high‑level operators suited to iterative and complex calculations.
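The MapReduce model mentioned above can be illustrated with a single-process word count. This is a minimal sketch of the map, shuffle, and reduce phases in plain Python, not Hadoop code; the function names are illustrative. In a real MapReduce job the intermediate pairs are written to disk between phases, which is the cost that Spark's in-memory approach reduces.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order on a small in-memory input."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["big data big ideas", "data pipelines"])
```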
(2) Real‑time Computation – typical scenario: Alibaba’s Double‑11 sales dashboard, where transaction totals are updated instantly.
Storm – used for real‑time distributed computation.
Flink – a newer engine with better performance and ecosystem than Storm.
Spark Streaming – provides second‑level real‑time capabilities within Spark.
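The practical difference between event-at-a-time engines and Spark Streaming's micro-batch approach shows up in how often a dashboard total is refreshed. The sketch below uses plain Python and hypothetical order amounts. It groups batches by record count for simplicity, whereas Spark Streaming groups records by time interval.

```python
def per_event_totals(amounts):
    """Record-at-a-time: the total is refreshed after every single event."""
    total = 0
    snapshots = []
    for amount in amounts:
        total += amount
        snapshots.append(total)
    return snapshots


def micro_batch_totals(amounts, batch_size):
    """Micro-batch: events are grouped, and the total is refreshed once per batch."""
    total = 0
    snapshots = []
    for start in range(0, len(amounts), batch_size):
        total += sum(amounts[start:start + batch_size])
        snapshots.append(total)
    return snapshots


orders = [120, 80, 300, 50, 450]          # hypothetical order amounts
live = per_event_totals(orders)           # one update per order
batched = micro_batch_totals(orders, 2)   # one update per batch of two orders
```

Both approaches end at the same final total; the micro-batch version simply publishes fewer, slightly delayed updates.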
5. Data Analysis Framework
Typical offline OLAP engines: Hive, Impala, Kylin. Typical real‑time OLAP engines: ClickHouse, Druid, Doris.
Hive – stable, but its execution efficiency is only moderate.
Impala – high execution efficiency due to in‑memory processing, but stability is average.
Kylin – provides millisecond‑level responses on petabyte‑scale data via pre‑computation.
ClickHouse, Druid & Doris – Druid and Doris support high concurrency; Druid’s SQL support is limited, ClickHouse uses a non‑standard SQL dialect, and Doris supports standard SQL.
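The pre-computation idea behind Kylin's fast responses can be sketched as follows: aggregate the measure once for every combination of dimensions, then answer queries with a lookup instead of a scan. This is a toy model in plain Python, not Kylin's implementation; the dimension names, the sample rows, and the functions `build_cube` and `query` are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("region", "product")  # hypothetical dimension names


def build_cube(rows, measure="amount"):
    """Pre-aggregate the measure for every subset of dimensions."""
    cube = defaultdict(float)
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, size):
            for row in rows:
                key = (dims, tuple(row[d] for d in dims))
                cube[key] += row[measure]
    return dict(cube)


def query(cube, **filters):
    """Answer an aggregate query with a single dictionary lookup."""
    dims = tuple(d for d in DIMENSIONS if d in filters)
    return cube.get((dims, tuple(filters[d] for d in dims)), 0.0)


sales = [
    {"region": "east", "product": "phone", "amount": 100.0},
    {"region": "east", "product": "laptop", "amount": 250.0},
    {"region": "west", "product": "phone", "amount": 70.0},
]
cube = build_cube(sales)
east_total = query(cube, region="east")
```

The trade-off is visible even at this scale: building the cube touches every row once per dimension combination, so storage and build time grow quickly with the number of dimensions, while query time stays constant.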
6. Task Scheduling Framework
Tools such as Azkaban, Oozie, and DolphinScheduler support both routine timed tasks and complex jobs with multi‑level dependencies, offering distributed execution and stable performance. Their selection criteria are shown in Table 7.
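Multi-level dependencies of this kind are usually modeled as a directed acyclic graph, and a valid run order is a topological ordering of that graph. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later); the job names are hypothetical, and real schedulers add timing, retries, and distributed execution on top of this ordering step.

```python
from graphlib import TopologicalSorter

# Hypothetical job names; each key lists the jobs it depends on.
dependencies = {
    "ingest_logs": set(),
    "ingest_orders": set(),
    "clean_data": {"ingest_logs", "ingest_orders"},
    "daily_report": {"clean_data"},
}


def execution_order(deps):
    """Return one valid run order in which every job follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
```

If the dependency graph contains a cycle, `static_order()` raises `graphlib.CycleError`, which mirrors how a scheduler would reject a workflow that can never start.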
7. Underlying Infrastructure Framework
ZooKeeper provides essential coordination services such as naming and configuration management, and is used by Hadoop HA, HBase, Kafka, and other components.
8. Data Retrieval Framework
For full‑text search, compare Lucene, Solr, and Elasticsearch across usability, extensibility, stability, cluster operation difficulty, integration depth, and community activity (see Table 8).
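All three tools build on the same core data structure, the inverted index, which maps each term to the documents that contain it. The sketch below is a minimal plain-Python version with whitespace tokenization and AND-only queries; it illustrates the idea and is not the API of Lucene, Solr, or Elasticsearch. The sample documents and function names are hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term in the query (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "Spark keeps intermediate results in memory",
    2: "Flink is a real-time computation engine",
    3: "Spark Streaming offers second-level latency",
}
index = build_index(docs)
hits = search(index, "spark streaming")
```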
9. Big Data Cluster Installation & Management Framework
To transition from traditional data processing to big data, enterprises need a stable platform comprising components such as Flume, Kafka, Hadoop, Hive, HBase, Spark, and Flink, often deployed on hundreds or thousands of machines.
Manual installation is labor‑intensive and prone to version conflicts. Vendors therefore offer integrated platforms:
HDP (Hortonworks Data Platform) – open‑source, free, uses Ambari for UI‑based installation.
CDH (Cloudera Distribution Including Apache Hadoop) – commercial, uses Cloudera Manager; offers a 30‑day trial, then requires a license for advanced features.
CDP (Cloudera Data Platform) – released after Cloudera’s acquisition of Hortonworks; integrates the best components of HDP and CDH and supports private and hybrid cloud deployments from version 7.0 onward.
The content above is extracted from the book "Big Data Technology and Architecture Illustrated – Practical Guide".
Past Memory Big Data
A popular big-data architecture channel with over 100,000 developers. Publishes articles on Spark, Hadoop, Flink, Kafka and more. Visit the Past Memory Big Data blog at https://www.iteblog.com. Search "Past Memory" on Google or Baidu.