
Master the Big Data Ecosystem: 9 Core Technology Frameworks Explained

This article provides a comprehensive overview of the big data ecosystem, detailing nine essential technology categories—including data collection, storage, computation, analysis, resource management, retrieval, underlying infrastructure, and cluster installation—while comparing popular tools and illustrating their typical use‑cases with diagrams.

Python Crawling & Data Mining

Aug 12, 2022


As the big data industry matures, its ecosystem continuously evolves. The author, drawing on personal experience in China’s big data sector, presents a complete knowledge map covering nine core technology categories.

1. Data Collection Framework

Data collection (also called data synchronization) aggregates massive, scattered data from the Internet, mobile, and IoT into a unified source. Common tools are:

Flume, Logstash, Filebeat – real‑time log collection (see Table 1).

Sqoop, DataX – offline extraction from relational databases (see Table 2).

Canal, Maxwell – real‑time extraction from relational databases (see Table 3).
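The difference between offline and incremental extraction can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the `orders` table, its `updated_at` column, and the checkpoint logic are hypothetical, and polling a timestamp column is a simplified stand-in. It is not how Sqoop, DataX, Canal, or Maxwell are implemented (Canal and Maxwell read the MySQL binlog rather than polling).

```python
import sqlite3

def full_extract(conn):
    """Offline style: read the whole table in one batch."""
    return conn.execute(
        "SELECT id, name, updated_at FROM orders ORDER BY id"
    ).fetchall()

def incremental_extract(conn, last_seen):
    """Incremental style: read only rows changed after the last checkpoint."""
    return conn.execute(
        "SELECT id, name, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()

# Hypothetical source table, created in memory for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)", [(1, "a", 100), (2, "b", 110)]
)

# First run: take a full snapshot and remember the newest timestamp seen.
snapshot = full_extract(conn)
checkpoint = max(row[2] for row in snapshot)

# The source keeps changing after the snapshot.
conn.execute("INSERT INTO orders VALUES (3, 'c', 120)")
conn.execute("UPDATE orders SET name = 'b2', updated_at = 130 WHERE id = 2")

# Later runs: pull only what changed since the checkpoint.
changes = incremental_extract(conn, checkpoint)
```

The full snapshot grows with the table, while the incremental pull stays proportional to the number of changed rows, which is the main reason real‑time extraction tools exist alongside batch ones.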

2. Data Storage Framework

Rapid data growth drives distributed storage solutions such as:

HDFS – massive batch storage; does not support per‑record updates.

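HBase – HDFS‑based NoSQL store with per‑record update capability, but no native SQL support.

Kudu – sits between HDFS and HBase, supporting both record updates and fast analytical scans; SQL‑style analytics are typically run through an engine such as Impala.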

Kafka – high‑throughput message queue for temporary buffering.
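The buffering role described for Kafka can be illustrated with a toy append‑only log in which each consumer tracks its own read offset. This is a minimal in‑memory sketch, not Kafka's API: the class `AppendOnlyLog` and its methods are hypothetical names invented for the example.

```python
class AppendOnlyLog:
    """Toy in-memory log: records are only appended, never updated in place."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Store a record and return its offset (position in the log)."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records starting at the given offset."""
        return self._records[offset:offset + max_records]


log = AppendOnlyLog()
for event in ["click", "view", "purchase"]:
    log.append(event)

# The consumer reads at its own pace and advances its offset itself.
consumer_offset = 0
batch = log.read(consumer_offset, max_records=2)
consumer_offset += len(batch)
rest = log.read(consumer_offset)
```

Because the producer only appends and the consumer only moves its offset forward, a slow consumer does not block the producer; the log absorbs the difference in speed.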

3. Distributed Resource Management

Traditional static server resources cannot meet the dynamic demands of big‑data tasks. Modern frameworks such as YARN, Kubernetes, and Mesos provide elastic resource allocation (see Figure 5).
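As a rough illustration of elastic allocation, the sketch below hands out memory "containers" from whichever node has room and returns the memory to the pool when a task finishes. It assumes a deliberately simple first‑fit policy; `ToyResourceManager`, the node names, and the memory sizes are all hypothetical, and real schedulers in YARN, Kubernetes, or Mesos are far more elaborate.

```python
class ToyResourceManager:
    """First-fit memory allocator over a fixed set of nodes (illustration only)."""

    def __init__(self, node_memory_mb):
        self.free = dict(node_memory_mb)   # node name -> free memory in MB
        self.allocations = {}              # container id -> (node, memory in MB)
        self._next_id = 0

    def allocate(self, memory_mb):
        """Return a container id, or None if no node has enough free memory."""
        for node, free_mb in self.free.items():
            if free_mb >= memory_mb:
                self.free[node] -= memory_mb
                container_id = self._next_id
                self._next_id += 1
                self.allocations[container_id] = (node, memory_mb)
                return container_id
        return None

    def release(self, container_id):
        """Give a finished container's memory back to its node."""
        node, memory_mb = self.allocations.pop(container_id)
        self.free[node] += memory_mb


rm = ToyResourceManager({"node1": 4096, "node2": 2048})
c1 = rm.allocate(3072)   # fits on node1
c2 = rm.allocate(2048)   # node1 is too full, so node2 is used
c3 = rm.allocate(2048)   # nothing fits right now
rm.release(c1)           # task finished, memory returns to node1
c4 = rm.allocate(2048)   # the same request now succeeds
```

The point of the sketch is the last three lines: a request that fails at one moment succeeds once another task releases its resources, which a static assignment of servers to jobs cannot offer.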

4. Data Computation Framework

Computation splits into offline and real‑time processing.

Offline: MapReduce (first generation), Tez (rarely used), Spark (in‑memory, high performance).

Real‑time: Storm, Flink (new generation, superior performance), Spark Streaming.
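The map, shuffle, and reduce stages behind the MapReduce model can be sketched in plain Python with a word count. This runs in a single process and only mirrors the programming model; the function names and sample lines are illustrative assumptions, not Hadoop's API.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

lines = ["big data big ideas", "data pipelines"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network. In MapReduce, intermediate results are written to disk between stages, which is the cost that in‑memory engines such as Spark aim to reduce.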

5. Data Analysis Framework

Typical OLAP engines for offline analysis include Hive, Impala, and Kylin; for real‑time analysis, ClickHouse, Druid, and Doris. Their strengths and differences are summarized in Tables 5 and 6.

6. Underlying Infrastructure

ZooKeeper provides essential coordination services, such as a shared namespace and configuration management, for components such as Hadoop HA, HBase, and Kafka.

7. Data Retrieval Framework

Full‑text search engines are evaluated on usability, scalability, stability, and community activity. The comparison of Lucene, Solr, and Elasticsearch is shown in Table 8.
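Lucene, and therefore Solr and Elasticsearch, which are built on it, relies on an inverted index that maps each term to the documents containing it. A minimal sketch of that idea follows; the sample documents, whitespace tokenization, and AND‑only query semantics are simplifying assumptions, not Lucene's actual analysis or scoring.

```python
from collections import defaultdict

def build_index(docs):
    """Map every lower-cased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical documents for illustration.
docs = {
    1: "Kafka message queue",
    2: "HBase NoSQL store",
    3: "Kafka and HBase",
}
index = build_index(docs)
kafka_hits = search(index, "kafka")
both_hits = search(index, "kafka hbase")
no_hits = search(index, "solr")
```

A lookup touches only the posting sets for the query terms rather than scanning every document, which is why this structure scales to large collections.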

8. Big‑Data Cluster Installation & Management

Building a reliable big‑data platform requires integrating dozens of components (Flume, Kafka, Hadoop, Hive, HBase, Spark, Flink, etc.) across hundreds of machines. Integrated distributions simplify this task:

HDP – Hortonworks Data Platform, open‑source, managed via Ambari.

CDH – Cloudera's Distribution including Apache Hadoop, a commercial distribution managed with Cloudera Manager (30‑day trial).

CDP – Cloudera Data Platform, merges HDP and CDH, supports private and hybrid clouds.

Overall, a complete big‑data platform comprises data collection, storage, computation, analysis, monitoring, and management layers, each with multiple technology choices that can be combined to meet specific business requirements.

