
An Introduction to Big Data Concepts, Hadoop Ecosystem, and Common Frameworks

This article provides a comprehensive overview of big data fundamentals, including the 4V characteristics, the Hadoop 2.0 layered architecture, a comparison between Hadoop and Spark, classification of common big‑data tools, and the typical offline and real‑time data processing workflows.

360 Quality & Efficiency

Oct 15, 2018


The term "big data" has become ubiquitous. Its basic definition is: data sets so large in volume and so varied in form that traditional databases cannot adequately store, manage, or process them.

The main characteristics of big data are usually summarized as the 4V model: Volume, Variety, Velocity, and Veracity. Some descriptions extend this to a 6V model by adding Valence (connectivity) and Value.

The Hadoop 2.0 ecosystem is presented as a four‑layer architecture: a storage layer (HDFS), a resource and data‑management layer (YARN), a compute engine layer (MapReduce), and a query‑analysis layer (Hive, Pig). Spark spans both the compute engine layer and the query‑analysis layer.
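To make the compute engine layer more concrete, the following is a minimal single-process sketch of the map, shuffle, and reduce steps behind a word count. It is plain Python rather than the Hadoop API, and the function names (map_phase, shuffle_phase, reduce_phase, word_count) are illustrative assumptions, not names from the original article.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word, as a mapper would per record."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group values by key; a real framework does this between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the counts collected for each word."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    # Hypothetical driver: chains the three phases in one process.
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["big data", "Big tools for big data"]))
```

In a real cluster the map and reduce steps run in parallel on many nodes and the shuffle moves data across the network; the sketch only shows the data flow.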

Compared directly, Hadoop is a complete distributed data infrastructure that provides storage, computation, and resource scheduling, while Spark is a specialized processing engine with no storage layer of its own; it relies on external storage systems such as HDFS.

Common big‑data tools and frameworks are grouped into six categories:

Computing frameworks: offline (Hadoop MapReduce, Spark) and real‑time (Storm, Spark Streaming, Flink).

Storage frameworks: file systems (HDFS, Tachyon (since renamed Alluxio), KFS), NoSQL databases (HBase, MongoDB, Redis), and search engines (Elasticsearch, Solr).

Resource management: YARN, Mesos.

Log collection: Flume, Logstash.

Messaging systems: Kafka, StormMQ, ZeroMQ, RabbitMQ.

Query and analysis tools: Hive, Impala, Pig, Presto, Phoenix, SparkSQL, Drill, Kylin, Druid.

Big‑data tasks are divided into offline (batch) and real‑time (streaming) jobs. Each typically involves three stages: extracting data from the source, transforming it, and loading it into the target system (the extract, transform, load pattern).
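The three stages can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the source is an in-memory CSV string standing in for HDFS, Hive, or MySQL, the target is a Python list standing in for a real sink, and the column names (user_id, amount) and function names are hypothetical.

```python
import csv
import io

# Hypothetical source data; the second record has a missing amount.
RAW = "user_id,amount\n1,10.5\n2,\n1,4.5\n"


def extract(text):
    """Extract: read raw records from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop incomplete records and convert fields to typed values."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip records with a missing amount
        cleaned.append({"user_id": int(row["user_id"]),
                        "amount": float(row["amount"])})
    return cleaned


def load(rows, target):
    """Load: append the cleaned records to the target and report the count."""
    target.extend(rows)
    return len(rows)


if __name__ == "__main__":
    sink = []
    print(load(transform(extract(RAW)), sink), sink)
```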

Offline tasks often read from HDFS, Hive, or MySQL, while real‑time tasks commonly consume data from Kafka. Transformation operations include aggregations (group‑by) and joins; a retention‑analysis example shows why the join and aggregation operators matter.
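A retention calculation shows how join and aggregation work together. The sketch below is not the article's own example: it assumes two hypothetical inputs, a list of (user, registration date) pairs and a list of (user, active date) pairs, joins them on the user and the day after registration, and then groups by registration date.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical sample data, invented for illustration.
registrations = [
    ("u1", date(2018, 10, 1)),
    ("u2", date(2018, 10, 1)),
    ("u3", date(2018, 10, 2)),
]
activity = [
    ("u1", date(2018, 10, 2)),
    ("u2", date(2018, 10, 3)),
    ("u3", date(2018, 10, 3)),
    ("u1", date(2018, 10, 2)),  # duplicate event, removed by the set below
]


def next_day_retention(registrations, activity):
    """Return {registration_date: share of users active the following day}."""
    active = set(activity)  # the hash lookup plays the role of the join
    cohort_size = defaultdict(int)
    retained = defaultdict(int)
    for user, reg_day in registrations:
        cohort_size[reg_day] += 1  # aggregation: group by registration date
        if (user, reg_day + timedelta(days=1)) in active:
            retained[reg_day] += 1
    return {day: retained[day] / cohort_size[day] for day in cohort_size}


if __name__ == "__main__":
    print(next_day_retention(registrations, activity))
```

In Hive or Spark SQL the same logic would typically be a join between the two tables followed by a GROUP BY on the registration date.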

Finally, latency (offline vs. real‑time) and processing mode (batch vs. streaming) are two separate dimensions. Spark Streaming, for example, processes data in micro‑batches and still achieves near‑real‑time latency.
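The micro‑batch idea can be illustrated without any framework. The sketch below is plain Python, not the Spark Streaming API: it assumes events arrive as (timestamp in seconds, value) pairs already sorted by time, cuts them into fixed intervals, and runs an ordinary batch computation on each interval. The micro_batches name and the sample events are hypothetical.

```python
from itertools import groupby


def micro_batches(events, interval):
    """Yield (window_start, values) for consecutive fixed-length windows.

    Assumes events are (timestamp_seconds, value) pairs sorted by timestamp.
    """
    for window, group in groupby(events, key=lambda e: int(e[0] // interval)):
        yield window * interval, [value for _, value in group]


# Hypothetical event stream: (timestamp in seconds, value).
events = [(0.2, 1), (0.9, 2), (1.1, 3), (2.5, 4), (2.7, 5)]

# Each micro-batch is processed with a normal batch operation (here, a sum).
sums = [(start, sum(values)) for start, values in micro_batches(events, 1)]

if __name__ == "__main__":
    print(sums)
```

A shorter interval lowers latency at the cost of more, smaller batches, which is the trade-off behind describing micro‑batch processing as near‑real‑time rather than real‑time.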

The article concludes by summarizing these foundational concepts as essential knowledge for anyone beginning to work with big data platforms.
