Understanding RAID, HDFS, and MapReduce: From Storage to Distributed Computing
This article explains the storage challenges of big data, introduces RAID levels and their trade‑offs, describes the HDFS architecture with NameNode and DataNode replication, details the MapReduce programming model and execution flow, and shows how Hive translates SQL queries into MapReduce jobs.
RAID Technology
Big data processing requires solving three core storage problems: capacity, read/write speed, and reliability. RAID (Redundant Array of Independent Disks) addresses these by combining multiple disks into logical arrays.
RAID0 : Stripes data across N disks for up to N‑times higher throughput but provides no redundancy; if any single disk fails, the data on the whole array is lost.
RAID1 : Mirrors data on two disks, offering instant failover; any one disk can fail without data loss.
RAID10 : Combines RAID0 striping with RAID1 mirroring, improving performance while keeping redundancy, at the cost of 50% storage efficiency.
RAID3 : Uses a dedicated parity disk; data is split into N‑1 parts and written in parallel. Every write also updates the parity disk, which makes it a bottleneck and wears it faster than the data disks, so RAID3 is rarely used.
RAID5 : Distributes parity across all disks, removing the dedicated‑parity‑disk bottleneck of RAID3; the array survives the failure of any one disk.
RAID6 : Stores two independent parity blocks, allowing the system to survive simultaneous failure of two disks.
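The single-parity idea behind RAID3 and RAID5 can be illustrated with byte-wise XOR. The following is a minimal sketch, not a RAID implementation: the class name, method names, and the tiny 4-byte "blocks" are made up for illustration, and it does not cover RAID6's second parity block, which needs a different code.

```java
import java.util.Arrays;

// Hypothetical demo class; names and block sizes are illustrative only.
class XorParityDemo {

    // Parity block = byte-wise XOR of all given blocks (all blocks must have equal length).
    static byte[] xorAll(byte[][] blocks) {
        byte[] out = new byte[blocks[0].length];
        for (byte[] block : blocks) {
            for (int i = 0; i < out.length; i++) {
                out[i] ^= block[i];
            }
        }
        return out;
    }

    // Rebuild one lost data block by XOR-ing the parity with the surviving data blocks.
    static byte[] rebuild(byte[][] survivors, byte[] parity) {
        byte[][] all = Arrays.copyOf(survivors, survivors.length + 1);
        all[survivors.length] = parity;
        return xorAll(all);
    }

    public static void main(String[] args) {
        byte[] d0 = {1, 2, 3, 4};
        byte[] d1 = {9, 8, 7, 6};
        byte[] d2 = {5, 5, 5, 5};
        byte[] parity = xorAll(new byte[][] {d0, d1, d2});

        // Pretend the disk holding d1 failed and recover it from the rest.
        byte[] recovered = rebuild(new byte[][] {d0, d2}, parity);
        System.out.println(Arrays.equals(recovered, d1)); // prints true
    }
}
```

Because XOR is its own inverse, the same routine computes the parity and rebuilds a missing block, which is why one parity block is enough to survive one disk failure.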
RAID can be implemented via hardware (RAID cards) or software (OS‑level RAID). While effective for a single server, massive data volumes require a distributed approach.
HDFS Architecture
Hadoop Distributed File System (HDFS) extends the RAID concept to a cluster of servers, providing petabyte‑scale storage with parallel read/write and fault tolerance.
Key components:
NameNode : Stores metadata (file names, block IDs, locations) and manages block replication (default three copies).
DataNode : Holds actual data blocks; each block is replicated across multiple DataNodes.
When a client writes a file, the NameNode allocates free blocks, returns their IDs and target DataNodes, and the client streams data directly to the first DataNode, which forwards it to the others.
Block replication ensures that if a DataNode fails, the remaining copies keep the data accessible. DataNodes send heartbeats to the NameNode; missing heartbeats trigger automatic replication of under‑replicated blocks.
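The heartbeat and re-replication behaviour described above can be modelled with a toy, metadata-only simulation. This is a sketch under stated assumptions rather than Hadoop code: ToyNameNode, its methods, and the 10-second timeout are hypothetical, and a real NameNode instructs a surviving DataNode to copy the block instead of merely updating a map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy model only: class, method, and field names are hypothetical, not Hadoop's API.
class ToyNameNode {
    static final int REPLICATION = 3;   // default replica count mentioned in the text
    final long timeoutMs;               // assumed heartbeat timeout
    final Map<String, Long> lastHeartbeat = new TreeMap<>();          // DataNode -> last seen
    final Map<String, Set<String>> blockLocations = new HashMap<>();  // block -> DataNodes

    ToyNameNode(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    void heartbeat(String dataNode, long nowMs) {
        lastHeartbeat.put(dataNode, nowMs);
    }

    void addReplica(String blockId, String dataNode) {
        blockLocations.computeIfAbsent(blockId, k -> new HashSet<>()).add(dataNode);
    }

    // Forget replicas on nodes whose heartbeat is overdue, then assign each
    // under-replicated block to live nodes that do not hold it yet.
    List<String> checkReplication(long nowMs) {
        List<String> live = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (nowMs - e.getValue() <= timeoutMs) {
                live.add(e.getKey());
            }
        }
        List<String> actions = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : blockLocations.entrySet()) {
            Set<String> holders = e.getValue();
            holders.retainAll(live);
            for (String node : live) {
                if (holders.size() >= REPLICATION) {
                    break;
                }
                if (holders.add(node)) {
                    actions.add(e.getKey() + " -> " + node);
                }
            }
        }
        return actions;
    }

    public static void main(String[] args) {
        ToyNameNode nn = new ToyNameNode(10_000);
        for (String dn : new String[] {"dn1", "dn2", "dn3", "dn4"}) {
            nn.heartbeat(dn, 0);
        }
        for (String dn : new String[] {"dn1", "dn2", "dn3"}) {
            nn.addReplica("blk_1", dn);
        }
        // dn2 goes silent; the other nodes keep reporting.
        for (String dn : new String[] {"dn1", "dn3", "dn4"}) {
            nn.heartbeat(dn, 20_000);
        }
        System.out.println(nn.checkReplication(20_000)); // prints [blk_1 -> dn4]
    }
}
```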
MapReduce Programming Model
MapReduce splits a large computation into two phases:
Map : Processes input <key, value> pairs and emits intermediate <key, value> pairs.
Reduce : Receives all values associated with the same key, aggregates them, and produces final output.
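The two phases can be illustrated in memory without any Hadoop dependency. The sketch below is only an illustration of the model: the class name, the "city,temperature" record format, and the choice of a maximum as the aggregate are assumptions made for the example.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of the model, not Hadoop code. All names are hypothetical.
class MiniMapReduce {

    // Map: one input record "city,temperature" -> one intermediate <city, temperature> pair.
    static Map.Entry<String, Integer> map(String record) {
        String[] parts = record.split(",");
        return new AbstractMap.SimpleEntry<>(parts[0].trim(), Integer.parseInt(parts[1].trim()));
    }

    // Reduce: all values that share a key -> a single aggregated value (here, the maximum).
    static int reduce(String key, List<Integer> values) {
        int max = Integer.MIN_VALUE;
        for (int v : values) {
            max = Math.max(max, v);
        }
        return max;
    }

    static Map<String, Integer> run(List<String> records) {
        // Group intermediate values by key; the framework does this between the two phases.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String record : records) {
            Map.Entry<String, Integer> kv = map(record);
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("oslo,3", "cairo,31", "oslo,7", "cairo,28")));
        // prints {cairo=31, oslo=7}
    }
}
```

In a real cluster the map calls run in parallel on different nodes and the grouping step moves data over the network, but the contract of the two functions is the same.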
Example: WordCount counts word frequencies in a text corpus.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: split each input line into tokens and emit <word, 1>.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce: sum the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}

After the map phase, Hadoop performs a shuffle step: each map output is partitioned (by default, hash(key) % numReduceTasks), and each reducer fetches its partition from the map nodes over HTTP. The reducer then sorts and merges the fetched data so that all values with the same key are handed to a single reduce call.
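The default partitioning rule can be tried out in plain Java. This is a small sketch with a hypothetical class name; it reproduces the hash-and-modulo arithmetic and shows why the sign bit is cleared with a mask rather than with Math.abs.

```java
// Hypothetical demo class; only the arithmetic mirrors the default partitioning rule.
class PartitionDemo {

    // Clear the sign bit of the hash code, then take the remainder.
    static int partition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;
        // Equal keys always land on the same reducer.
        for (String key : new String[] {"apple", "banana", "apple"}) {
            System.out.println(key + " -> reducer " + partition(key, reducers));
        }

        // A key whose hash code is Integer.MIN_VALUE shows why the mask matters:
        // Math.abs(Integer.MIN_VALUE) is still negative, so abs-then-modulo
        // would yield a negative partition index.
        Object worstCase = new Object() {
            @Override
            public int hashCode() {
                return Integer.MIN_VALUE;
            }
        };
        System.out.println(Math.abs(worstCase.hashCode()) % 3); // prints -2
        System.out.println(partition(worstCase, 3));            // prints 0
    }
}
```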
Hadoop's default partitioner, HashPartitioner, computes the partition as follows:

/** Use {@link Object#hashCode()} to partition. */
public int getPartition(K2 key, V2 value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}

MapReduce Execution Flow
In classic (Hadoop 1.x) MapReduce, the job lifecycle involves several processes:
The application process (the client) uploads the job JAR to HDFS and submits the job to the JobTracker.
The JobTracker creates a job graph, divides the input blocks into map tasks, and schedules them.
The TaskTracker on each node receives tasks; where possible, a node is given tasks whose input block is stored locally on that node (data locality).
Map tasks read local HDFS blocks and emit intermediate data; reduce tasks write the final results back to HDFS.
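The data-locality preference in the steps above can be sketched as a simple task-picking rule. This is an illustrative assumption about the shape of the decision, not the JobTracker's actual scheduler; the class, the method, and the host names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of locality-aware task assignment; not Hadoop's scheduler.
class LocalityScheduler {

    // pending maps each map task to the hosts that store a replica of its input block.
    // Returns a data-local task for the requesting host if one exists, otherwise the
    // first pending task, or null when nothing is left. The chosen task is removed.
    static String pickTask(String host, Map<String, List<String>> pending) {
        String chosen = null;
        for (Map.Entry<String, List<String>> e : pending.entrySet()) {
            if (chosen == null) {
                chosen = e.getKey();          // fallback: first pending task
            }
            if (e.getValue().contains(host)) {
                chosen = e.getKey();          // preferred: input block is local
                break;
            }
        }
        if (chosen != null) {
            pending.remove(chosen);
        }
        return chosen;
    }

    public static void main(String[] args) {
        Map<String, List<String>> pending = new LinkedHashMap<>();
        pending.put("map-0", Arrays.asList("nodeA", "nodeB"));
        pending.put("map-1", Arrays.asList("nodeB", "nodeC"));
        pending.put("map-2", Arrays.asList("nodeA", "nodeC"));

        System.out.println(pickTask("nodeC", pending)); // prints map-1 (local to nodeC)
        System.out.println(pickTask("nodeA", pending)); // prints map-0 (local to nodeA)
        System.out.println(pickTask("nodeB", pending)); // prints map-2 (no local task left)
        System.out.println(pickTask("nodeB", pending)); // prints null
    }
}
```

Running a task on a non-local node still works, but the input block then has to be read over the network, which is the cost the locality preference tries to avoid.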
Hive: Turning SQL into MapReduce
Hive provides a SQL‑like interface that compiles queries into MapReduce jobs, allowing analysts to write familiar SELECT … FROM … GROUP BY … statements.
Example group‑by query:
SELECT pageid, age, count(1) FROM pv_users GROUP BY pageid, age;

Hive parses the SQL, builds an execution plan, and generates a DAG of MapReduce tasks. Simple filters and projections may need only a map phase. Aggregations such as the GROUP BY above use the grouping columns as the shuffle key, and joins require both map and reduce phases, with a table‑specific tag on each record so the reducer can tell the source tables apart.
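To make the group-by case concrete, the sketch below mimics in memory what a map and reduce pair for this query has to compute. It is not the code Hive generates: the class name, the int[] row layout, and the comma-joined composite key are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of GROUP BY pageid, age as a map/reduce computation.
class GroupByAsMapReduce {

    // Map side: the GROUP BY columns become a composite key; the emitted value is 1.
    static String mapKey(int pageid, int age) {
        return pageid + "," + age;
    }

    // Rows are {pageid, age}. The reduce side is collapsed into merge(),
    // which sums the 1s that share the same composite key.
    static Map<String, Integer> run(List<int[]> pvUsers) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int[] row : pvUsers) {
            counts.merge(mapKey(row[0], row[1]), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<int[]> pvUsers = Arrays.asList(
                new int[] {1, 25}, new int[] {2, 25}, new int[] {1, 25}, new int[] {1, 32});
        for (Map.Entry<String, Integer> e : run(pvUsers).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
        // prints:
        // 1,25 -> 2
        // 1,32 -> 1
        // 2,25 -> 1
    }
}
```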
Join example:
SELECT pv.pageid, u.age FROM page_view pv JOIN user u ON (pv.userid = u.userid);

In practice, most big‑data workloads are expressed as Hive SQL, and the generated MapReduce jobs handle the heavy lifting of distributed computation.
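The tagged join mentioned earlier can also be sketched in memory. This is a simplified illustration of a reduce-side inner join, not Hive's generated plan: the class, the tag strings "pv" and "u", and the String[] row layouts are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a reduce-side join that uses per-table tags.
class TaggedJoinDemo {

    // A shuffled record: the tag names the source table, the value is the column the query needs.
    static final class Tagged {
        final String tag;
        final String value;

        Tagged(String tag, String value) {
            this.tag = tag;
            this.value = value;
        }
    }

    // pageViews rows are {pageid, userid}; users rows are {userid, age}.
    // Returns the joined rows as "pageid,age" strings.
    static List<String> join(List<String[]> pageViews, List<String[]> users) {
        // "Map" phase: key every record by userid and tag it with its source table.
        Map<String, List<Tagged>> byUser = new TreeMap<>();
        for (String[] pv : pageViews) {
            byUser.computeIfAbsent(pv[1], k -> new ArrayList<>()).add(new Tagged("pv", pv[0]));
        }
        for (String[] u : users) {
            byUser.computeIfAbsent(u[0], k -> new ArrayList<>()).add(new Tagged("u", u[1]));
        }

        // "Reduce" phase: for each userid, split the records by tag and pair them up.
        List<String> out = new ArrayList<>();
        for (List<Tagged> group : byUser.values()) {
            List<String> pageids = new ArrayList<>();
            List<String> ages = new ArrayList<>();
            for (Tagged t : group) {
                if (t.tag.equals("pv")) {
                    pageids.add(t.value);
                } else {
                    ages.add(t.value);
                }
            }
            for (String pageid : pageids) {
                for (String age : ages) {
                    out.add(pageid + "," + age);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> pageViews = Arrays.asList(
                new String[] {"1", "111"}, new String[] {"2", "111"}, new String[] {"1", "222"});
        List<String[]> users = Arrays.asList(
                new String[] {"111", "25"}, new String[] {"222", "32"});
        for (String row : join(pageViews, users)) {
            System.out.println(row);
        }
        // prints:
        // 1,25
        // 2,25
        // 1,32
    }
}
```

Without the tag, the reducer would receive a mixed list of values for each userid and could not tell a pageid from an age.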
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
