Big Data Overview: Definitions, Applications, Technology Stack, and Core Components (Hadoop, HDFS, MapReduce, YARN, Hive, HBase)
This comprehensive article explains big data concepts, definitions from Gartner and IBM, real‑world use cases, the Hadoop ecosystem architecture, and detailed introductions to HDFS, MapReduce, YARN, Hive, and HBase, including practical examples and shell commands.
Author: songzeng, Tencent IEG Growth Platform Backend Development Engineer
Big Data Overview
Definition of Big Data
Gartner research institute defines “big data” as information assets that require new processing models to achieve stronger decision‑making, insight discovery, and process optimization, and that must adapt to massive, rapidly growing, and diverse data.
McKinsey Global Institute describes big data as a data set whose size, acquisition, storage, management, and analysis capabilities far exceed those of traditional database tools, characterized by four features: massive scale, fast data flow, diverse data types, and low value density.
IBM proposes the 5V characteristics of big data:
Volume, Velocity, Variety, Value (low value density), and Veracity.
Big data is like a rapidly flowing river of sand; we know valuable nuggets exist, but extracting them requires skill and repeated refinement.
Big Data Application Scenarios
Big‑data techniques have been used historically both in China and abroad; they are not exclusive to modern times.
For example, Sun Bin's "reducing the cooking stoves" ruse at the Battle of Maling fed his opponent misleading numbers, turning the opponent's own data analysis against him.
Today, big data plays a crucial role in many industries, such as traffic‑jam prediction and agricultural pest control.
Big Data Technology Map
The big‑data processing pipeline involves many technologies, listed below:
Data Ingestion
Sqoop, Flume, Logstash, Kibana
Data Storage
HDFS, Ceph
NoSQL Databases
HBase, MongoDB, Cassandra, Neo4j, Redis
Data Computing
Hadoop MapReduce, Spark, Storm, Flink
Data Analysis
Hive, Pig, Impala, Presto, Kylin, Druid
Resource Management
Mesos, YARN
Cluster Management & Monitoring
Ambari, Ganglia, Nagios, Cloudera Manager
Machine Learning
Mahout, Spark MLlib
Others
Kafka, StormMQ, Zookeeper
Several of these technologies are described in detail in the remainder of this article.
Common Six Components
The following six components cover data storage, computing, analysis, resource scheduling, and data warehousing, representing the typical big‑data processing workflow.
Hadoop
Definition: Hadoop is an open‑source software framework under the Apache umbrella, written in Java, that provides a platform for developing and running large‑scale data processing applications.
Main purpose: It addresses the storage of massive data and the analysis/computation of massive data.
Consider the example of counting the occurrences of each word in a billion‑word dataset.
1. Single‑machine traversal
2. Multi‑threaded traversal
3. Manually split the data across machines, process, then manually aggregate results
The first method is slow, the second is complex and still limited, and the third is cumbersome. Hadoop automates the third approach: it distributes the workload across a cluster, handling data splitting and result aggregation, making such tasks trivial.
Hadoop now refers not only to the core framework but to the broader Hadoop ecosystem, which includes auxiliary projects such as Zookeeper for high availability and Hive for easier data access.
Hadoop Evolution History
2002 – After the first Internet bubble burst, Doug Cutting started an open‑source search engine project called Nutch.
Two years later, Google published the GFS and MapReduce papers. Cutting realized his project needed to adopt these ideas, leading to a Java implementation inspired by Google’s C++ design.
In 2006, after several years at a startup, Cutting joined Yahoo, which was interested in the underlying GFS/MapReduce concepts. The project was renamed Hadoop, after his son’s toy elephant.
This marked Hadoop’s first evolution.
Hadoop Architecture
Initially Hadoop consisted of HDFS (storage) and MapReduce (computation). MapReduce handled both computation and resource scheduling, while HDFS handled storage.
Later, to decouple computation from resource management, Hadoop 2.0 introduced YARN.
Hadoop Advantages
High reliability: multiple data replicas ensure no data loss when a node fails.
High scalability: can easily add thousands of cheap machines to a cluster.
High efficiency: dynamic data movement balances load across nodes.
High fault tolerance: failed tasks are automatically reassigned.
Understanding HDFS and MapReduce deepens appreciation of these advantages.
HDFS
Definition: Hadoop Distributed File System (HDFS) is a distributed file system implementation suitable for running on commodity hardware.
Main purpose: It stores massive amounts of data across hundreds or thousands of machines, provides high throughput, and maintains replicas for fault tolerance and reliability.
HDFS aims for high throughput for sequential reads rather than low‑latency random access.
From a user’s perspective, HDFS appears as a single file system; the underlying distribution across machines is hidden.
HDFS Architecture
A typical HDFS cluster consists of a NameNode (metadata server) and multiple DataNodes (storage servers). The NameNode maintains the namespace and directory tree, while DataNodes store actual data blocks.
The Secondary NameNode assists in checkpointing the NameNode metadata.
DataNodes store file blocks (128 MB by default) and serve read/write requests from clients; they create, delete, and replicate blocks under instruction from the NameNode.
HDFS DataNode
Example: two racks, each with three DataNodes, each storing a data block.
Key questions:
How is the block size chosen? Larger blocks reduce addressing (seek) overhead; a common rule of thumb is to size the block so that seek time is about 1% of transfer time, i.e. block size ≈ transfer speed × (seek time / 1%). With a 10 ms seek and roughly 100 MB/s disks, this gives about 100 MB, rounded up to the 128 MB default.
What is the replication strategy? HDFS keeps three replicas by default: the first on the writer's node (or a random node), the second on a node in a different rack, and the third on another node in that second rack.
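The block-size rule of thumb can be made concrete with a quick calculation (illustrative numbers: a 10 ms seek and a 100 MB/s disk transfer rate; the 1% figure is the common heuristic that seek time should be a small fraction of transfer time):

```python
def ideal_block_mb(seek_ms, transfer_mb_per_s, seek_fraction=0.01):
    # Choose the block size so that seek time is only `seek_fraction`
    # (commonly 1%) of the total transfer time for one block.
    transfer_s = (seek_ms / 1000) / seek_fraction   # 10 ms / 1% = 1 s
    return transfer_s * transfer_mb_per_s           # 1 s * 100 MB/s = 100 MB

print(ideal_block_mb(10, 100))  # 100.0 — rounded up to the 128 MB default
```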
HDFS Write Process
1. Client contacts NameNode via RPC to request file creation.
2. NameNode checks existence and permissions, then creates metadata.
3. Client splits the file into 128 MB blocks and asks NameNode for a list of three DataNodes for each block.
4. Client streams data to the first DataNode, which forwards it to the next, forming a pipeline.
5. Data is broken into packets, sent through the pipeline, and acknowledgments are returned. Failed packets are retried.
If a DataNode in the pipeline fails, it is removed from the pipeline and the write continues through the remaining nodes; the NameNode later detects the under‑replicated block and schedules a new replica.
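The pipelined replication in steps 4–5 can be sketched as a toy model (hypothetical names; real HDFS streams packets over TCP with asynchronous acknowledgments):

```python
def send_through_pipeline(packets, datanodes):
    """Forward each packet node-to-node along the pipeline; every node
    stores a copy, then one ack per packet travels back to the client."""
    stored = {dn: [] for dn in datanodes}
    acks = []
    for packet in packets:
        for dn in datanodes:          # client -> DN1 -> DN2 -> DN3
            stored[dn].append(packet)
        acks.append(("ack", packet))  # ack flows DN3 -> DN2 -> DN1 -> client
    return stored, acks

stored, acks = send_through_pipeline(["pkt-0", "pkt-1"], ["dn1", "dn2", "dn3"])
print(len(acks))      # 2: one ack per packet
print(stored["dn3"])  # ['pkt-0', 'pkt-1']: every replica received all packets
```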
HDFS Read Process
1. Client contacts NameNode to request a file.
2. NameNode returns the locations of the file’s blocks.
3. Client contacts the nearest DataNode for the first block.
4. DataNode streams the block to the client.
5. Client repeats for remaining blocks, handling any failed DataNodes by requesting alternate replicas.
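The replica-selection logic in steps 3 and 5 can be sketched as follows (illustrative names and distances; real HDFS measures network topology distance between client and DataNodes):

```python
def read_block(replicas, alive):
    """Try replicas in order of network distance; skip failed DataNodes
    and fall back to the next-closest replica."""
    for dn in sorted(replicas, key=lambda r: r["distance"]):
        if dn["name"] in alive:
            return dn["name"]
    raise IOError("all replicas unreachable")

replicas = [{"name": "dn1", "distance": 0},   # same node as the client
            {"name": "dn2", "distance": 2},   # same rack
            {"name": "dn3", "distance": 4}]   # different rack
print(read_block(replicas, alive={"dn2", "dn3"}))  # dn2: nearest live replica
```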
HDFS Advantages
High reliability: automatic replication and automatic recovery of lost replicas.
Handles massive data scales (GB, TB, PB) and millions of files.
Runs on cheap hardware, leveraging replication for reliability.
HDFS Disadvantages
Despite its strengths, HDFS has limitations:
Not suitable for low‑latency (millisecond) access.
Inefficient for storing a large number of small files.
A file has only a single writer at a time, and writes are append‑only; random modification of existing file contents is not supported.
MapReduce
Definition: Hadoop MapReduce is a distributed programming framework that serves as the core engine for building data‑analysis applications on Hadoop.
Main purpose: It combines user‑written business logic with built‑in components into a distributed program that runs on a Hadoop cluster.
MapReduce abstracts parallel computation using the “divide‑and‑conquer” principle, assuming the data can be split into independent chunks.
MapReduce Design Philosophy
Map tasks read input splits from storage (e.g., HDFS), process them, and output key‑value pairs. A shuffle phase merges, sorts, and groups these pairs before they are fed to Reduce tasks, which aggregate the results.
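The map → shuffle → reduce flow can be sketched in Python (a single-process illustration of the phases, not the Hadoop API):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # Map: emit a (key, value) pair per word
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: sort by key so equal keys become adjacent, then group them
    pairs.sort(key=itemgetter(0))
    return [(k, [v for _, v in grp]) for k, grp in groupby(pairs, key=itemgetter(0))]

def reduce_phase(key, values):
    # Reduce: aggregate all values belonging to one key
    return (key, sum(values))

pairs = [p for line in ["big data", "big deal"] for p in map_phase(line)]
result = [reduce_phase(k, vs) for k, vs in shuffle(pairs)]
print(result)  # [('big', 2), ('data', 1), ('deal', 1)]
```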
MapReduce Advantages
1. Easy to program: developers only need to write map and reduce functions.
2. Good scalability: adding more machines expands processing capacity.
3. High fault tolerance: failed tasks are automatically reassigned.
4. Suitable for offline processing of petabyte‑scale data.
MapReduce Disadvantages
1. Not suitable for real‑time computation.
2. Not ideal for DAG‑style workflows because each stage writes to disk, causing high I/O overhead.
YARN
Definition: Apache Hadoop YARN (Yet Another Resource Negotiator) is a generic resource management system that provides unified resource scheduling for upper‑level applications.
YARN Basic Architecture
Key components:
ResourceManager – the “queen” that allocates resources globally.
NodeManager – the “kingdoms” that execute tasks and manage local resources.
ApplicationMaster – the “general” that requests resources for a specific application (e.g., a Map task or a Reduce task).
Container – a logical bundle of CPU, memory, and disk allocated to a task.
Workflow: the client submits a job to the ResourceManager, which launches an ApplicationMaster for it; the ApplicationMaster then requests containers from the ResourceManager and works with NodeManagers to launch map and reduce tasks inside those containers.
Hive
Definition: Hive is a data‑warehouse tool built on Hadoop that maps structured data files to tables and provides SQL‑like query capabilities.
Main purpose: It translates SQL‑like statements into MapReduce jobs.
Hive Design
Hive stores metadata (database, table, column, partition information) in a relational database (e.g., MySQL) while the actual data resides in HDFS.
Hive Execution Flow
User submits HQL via client.
Hive parses the query and creates a query plan.
Hive converts the plan into a MapReduce job.
The job runs on Hadoop.
Hive Architecture
Key steps in the driver:
Parser converts SQL string to an abstract syntax tree (AST).
Compiler generates a logical execution plan.
Optimizer refines the logical plan.
Executor transforms the logical plan into a physical plan (MapReduce or Spark).
Hive Data Types
Hive supports primitive types (int, string, float, timestamp) and complex collection types such as structs, maps, and arrays.
Case Study
Requirement: Store a JSON document with strings, arrays, maps, and structs in HDFS, create a Hive table to map it, and query the data.
{
  "name": "laowang",
  "friends": ["lilei", "hanmeimei"],
  "children": {"xiao wang": 18, "xiao xiao wang": 8},
  "address": {"street": "tao yuan cun", "city": "shenzhen"}
}
Steps:
Create a Hive table with appropriate column definitions and delimiters.
Create a local test file (test.txt) matching the table format.
Load the file into HDFS.
Query the table.
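The steps above might look like the following (illustrative HiveQL; the table name, file path, and delimiter choices are assumptions, and the test file is the delimited equivalent of the JSON rather than the raw JSON itself):

```sql
-- test.txt: one delimited line corresponding to the JSON above, e.g.
-- laowang,lilei_hanmeimei,xiao wang:18_xiao xiao wang:8,tao yuan cun_shenzhen

CREATE TABLE test(
  name     STRING,
  friends  ARRAY<STRING>,
  children MAP<STRING, INT>,
  address  STRUCT<street:STRING, city:STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  COLLECTION ITEMS TERMINATED BY '_'
  MAP KEYS TERMINATED BY ':';

-- Load the local test file into the table's HDFS location
LOAD DATA LOCAL INPATH '/tmp/test.txt' INTO TABLE test;

-- Query array, map, and struct elements
SELECT friends[1], children['xiao xiao wang'], address.city
FROM test WHERE name = 'laowang';
```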
Hive Advantages
1. SQL‑like syntax enables rapid development.
2. No need to write MapReduce code, reducing learning cost.
3. Extensible to other storage engines such as HBase.
Hive Limitations
1. Limited expressive power; complex algorithms may still require custom MapReduce.
2. Relatively low performance and high latency because queries are executed as MapReduce jobs; tuning can be difficult.
HBase
Definition: HBase (Hadoop Database) is a high‑reliability, high‑performance, column‑oriented, scalable distributed storage system.
Problem solved: It stores massive data at low cost while supporting high‑concurrency random writes and real‑time queries.
HBase is a NoSQL database built on top of HDFS; HDFS provides the underlying storage, while HBase offers table‑like access.
Basic Concepts
Column‑oriented storage stores values of the same column together, improving analytical queries that need only a subset of columns.
Column families group related columns; a table defines families but not individual columns, allowing flexible schema evolution.
NoSQL refers to “Not only SQL” databases that store semi‑structured data such as JSON or XML. Types include key‑value stores (Redis), document stores (MongoDB), column‑family stores (HBase), and graph stores (Neo4j).
HBase Data Model
Key elements:
Table – rows and columns; each column can have multiple versions.
RowKey – primary key, rows are stored lexicographically.
Column Family – groups columns; columns are identified by qualifiers.
Column – identified by family:qualifier.
Logical Structure
Region – a range of rows split from a table; regions are distributed across RegionServers.
Store – stores data for a single column family within a region.
Physical Structure
Data is stored as Key‑Value entries: rowkey, column family, column qualifier, timestamp, type, and value.
MemStore buffers writes in memory; when it reaches a threshold, data is flushed to disk as StoreFiles (HFiles) stored on HDFS.
HLog (Write‑Ahead Log) records every write sequentially to ensure durability.
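The write path described above (log first, buffer in memory, flush to immutable files) can be sketched as a toy model (class name and threshold are illustrative, not HBase API):

```python
class ToyStore:
    """Minimal model of HBase's write path: WAL append, then MemStore,
    then flush to an immutable StoreFile when a threshold is reached."""
    def __init__(self, flush_threshold=3):
        self.hlog = []            # write-ahead log: sequential, durable
        self.memstore = {}        # in-memory write buffer
        self.storefiles = []      # flushed, immutable HFiles on HDFS
        self.flush_threshold = flush_threshold

    def put(self, rowkey, value):
        self.hlog.append((rowkey, value))   # 1. log first, for durability
        self.memstore[rowkey] = value       # 2. then buffer in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # 3. write the buffer out, sorted by rowkey, as one new StoreFile
        self.storefiles.append(sorted(self.memstore.items()))
        self.memstore = {}

s = ToyStore()
for i in range(4):
    s.put(f"row{i}", i)
print(len(s.storefiles), len(s.memstore))  # 1 1: one flush, one buffered write
```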
HBase Architecture
Components:
Client – entry point, communicates via RPC with HMaster and RegionServers.
HMaster – master node managing table metadata, region assignment, and load balancing; multiple masters provide HA via Zookeeper.
RegionServer – serves I/O requests; manages multiple Regions.
Zookeeper – stores metadata, coordinates master election, and tracks region locations.
RegionServer Internals
Each Region contains multiple Stores (one per column family). A Store consists of a MemStore and StoreFiles (HFiles); the Write‑Ahead Log (HLog) is shared by all Regions on a RegionServer.
HBase Summary
HBase is a NoSQL database for massive data storage on HDFS.
A row consists of a RowKey and one or more columns; columns can be added arbitrarily.
Writes, deletes, and updates are versioned using timestamps.
For reads and writes, clients first consult Zookeeper to locate the meta table and the responsible RegionServer, then communicate with that RegionServer directly.
Appendix: HBase Shell Commands
Insert data
put 'student','1001','info:sex','male'
put 'student','1001','info:age','18'
put 'student','1002','info:name','janna'
put 'student','1002','info:sex','female'
put 'student','1002','info:age','20'
Scan data
scan 'student'
scan 'student',{STARTROW=>'1001'}
scan 'student',{STARTROW=>'1001',STOPROW=>'1001'}
scan 'student',{RAW=>true,VERSIONS=>10}
Get data (requires RowKey)
get 'student','1001'
get 'student','1001','info'
get 'student','1001','info:name'
get 'student','1001',{COLUMN=>'info:name',VERSIONS=>3}
Count rows
count 'student'
Update data (same as insert)
put 'student','1001','info:name','Nick'
Delete data
delete 'student','1001','info:name'
delete 'student','1001','info'
delete 'student','1001'
truncate 'student'
IEG Growth Platform Technology Team
Official account of Tencent IEG Growth Platform Technology Team, showcasing cutting‑edge achievements across front‑end, back‑end, client, algorithm, testing and other domains.