
From Hadoop to Spark: A Complete Roadmap to Becoming a Big Data Architect

This guide walks beginners through the essential big‑data ecosystem—from understanding Hadoop’s core components and mastering MapReduce, to using Hive, SparkSQL, Kafka, and real‑time frameworks like Storm, while also covering data ingestion, export, scheduling, and introductory machine‑learning techniques.

21CTO

Introduction

Becoming a big‑data architect requires a clear learning path that runs from SQL to NoSQL and from beginner to expert level. This article outlines three career directions: platform building, optimization, and operations; big‑data development, design, and architecture; and data analysis and mining.

4V Characteristics of Big Data

Volume: massive data sizes, from terabytes to petabytes

Variety: many data types (structured, unstructured, logs, video, images, geo‑location, etc.)

Value: high commercial value that must be extracted through analysis and machine learning

Velocity: processing speed requirements that go beyond offline batch processing

Common Open‑Source Big‑Data Frameworks

File storage: Hadoop HDFS, Tachyon, KFS

Batch processing: Hadoop MapReduce, Spark

Streaming: Storm, Spark Streaming, S4, Heron

K‑V/NoSQL stores: HBase, Redis, MongoDB

Resource management: YARN, Mesos

Log collection and visualization: Flume, Scribe, Logstash, Kibana (dashboards over collected logs)

Message systems: Kafka, StormMQ, ZeroMQ, RabbitMQ

Query/analysis: Hive, Impala, Pig, Presto, Phoenix, SparkSQL, Drill, Flink, Kylin, Druid

Coordination service: ZooKeeper

Cluster management & monitoring: Ambari, Ganglia, Nagios, Cloudera Manager

Data mining/ML: Mahout, Spark MLlib

Data sync: Sqoop

Job scheduling: Oozie

Chapter 1: Getting Started with Hadoop

1.1 Learn to search – Use Google or Baidu to solve problems.

1.2 Official documentation – The primary reference for beginners.

1.3 Run Hadoop – Hadoop is the foundation for most big‑data frameworks.

1.4 Basic Hadoop commands – HDFS directory operations, file upload/download, submit MapReduce examples, view Web UI and logs.

1.5 Core concepts – Hadoop 1.0/2.0, MapReduce, HDFS, NameNode, DataNode, JobTracker, TaskTracker, YARN, ResourceManager, NodeManager.

1.6 Write a MapReduce program – Follow the WordCount example (Java, Shell, or Python via Hadoop Streaming).
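
The WordCount exercise in 1.6 can be rehearsed locally before touching a cluster. The following is a minimal sketch in plain Python, not the official Hadoop example: the function names are hypothetical, and the sort step stands in for the shuffle that Hadoop performs between the map and reduce phases. With Hadoop Streaming, the mapper and reducer would be separate scripts that read lines from standard input and write tab‑separated key/value pairs to standard output.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one (word, 1) pair per whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_pairs(sorted_pairs):
    """Reducer: sum the counts for each word.

    The input must already be sorted by word, which is what the
    shuffle/sort phase guarantees for the input of each reducer.
    """
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def wordcount(lines):
    """Local stand-in for the full map -> shuffle/sort -> reduce pipeline."""
    return dict(reduce_pairs(sorted(map_lines(lines))))


print(wordcount(["hello world", "hello hadoop"]))
# prints {'hadoop': 1, 'hello': 2, 'world': 1}
```

Keeping the map and reduce logic in small functions makes it easy to check the result here and later compare it with the output of the real job.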

Chapter 2: Faster WordCount with SQL

2.1 Learn SQL – Essential for data analysis.

2.2 SQL‑based WordCount – SELECT word, COUNT(1) FROM wordcount GROUP BY word;

2.3 Hive on Hadoop – Hive provides a data‑warehouse interface that translates SQL into MapReduce jobs.

2.4 Install and configure Hive – Get Hive running the same way as Hadoop in Chapter 1: search for guides and follow the official documentation.

2.5 Use Hive – Create a wordcount table and run the SQL from 2.2, then compare results with the MapReduce version.

2.6 How Hive works – Hive SQL is compiled into MapReduce tasks.

2.7 Basic Hive commands – Create/drop tables, load data, download data, partitioning, etc.
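
The query from 2.2 can be tried without a cluster. The sketch below runs it against an in‑memory SQLite database purely as a stand‑in for Hive; the function name is hypothetical, and the ORDER BY clause is added only to make the output order predictable. Hive would compile the same statement into MapReduce tasks instead of executing it locally.

```python
import sqlite3


def sql_wordcount(words):
    """Load words into a one-column table and count them with GROUP BY."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE wordcount (word TEXT)")
        conn.executemany(
            "INSERT INTO wordcount (word) VALUES (?)",
            [(word,) for word in words],
        )
        # Same query as in 2.2, plus ORDER BY for a stable result order.
        rows = conn.execute(
            "SELECT word, COUNT(1) FROM wordcount GROUP BY word ORDER BY word"
        ).fetchall()
    finally:
        conn.close()
    return rows


print(sql_wordcount("hello world hello hadoop".split()))
```

Compared with a hand‑written MapReduce program, the whole job is a single declarative statement, which is the point of Chapter 2.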

Chapter 3: Ingesting Data into Hadoop

3.1 HDFS PUT – Command‑line data upload, often scripted.

3.2 HDFS API – Programmatic writes via Java, Python, etc.

3.3 Sqoop – Transfers data between relational databases and Hadoop/Hive using MapReduce.

3.4 Flume – Distributed log collection that ships data to HDFS in near real time.

3.5 DataX – Alibaba’s open‑source data‑exchange tool, similar to Sqoop.
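
The scripted upload mentioned in 3.1 might look like the sketch below. It wraps the standard `hdfs dfs -put` command (with `-f` to overwrite an existing file); the function names and the paths in the usage line are hypothetical, and the default dry‑run mode only prints the command so the script can be checked on a machine without Hadoop installed.

```python
import shlex
import subprocess


def build_put_command(local_path, hdfs_dir, overwrite=False):
    """Build the argument list for `hdfs dfs -put`."""
    cmd = ["hdfs", "dfs", "-put"]
    if overwrite:
        cmd.append("-f")  # replace the target file if it already exists
    cmd += [local_path, hdfs_dir]
    return cmd


def upload(local_path, hdfs_dir, overwrite=False, dry_run=True):
    """Print the command in dry-run mode, otherwise execute it."""
    cmd = build_put_command(local_path, hdfs_dir, overwrite)
    if dry_run:
        print(shlex.join(cmd))
        return 0
    # check=True raises CalledProcessError if the upload fails.
    return subprocess.run(cmd, check=True).returncode


# Hypothetical paths, shown in dry-run mode only.
upload("/var/log/app/access.log", "/data/raw/logs", overwrite=True)
```

Passing the command as a list rather than a shell string avoids quoting problems when file names contain spaces.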

Chapter 4: Exporting Data from Hadoop

4.1 HDFS GET – Download files from HDFS.

4.2 HDFS API – Programmatic reads.

4.3 Sqoop – Sync HDFS or Hive tables back to relational databases.

4.4 DataX – Same purpose as Sqoop, with broader source support.
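
For the Sqoop export in 4.3, a scheduled script usually just assembles a command line. The sketch below builds one from standard `sqoop export` options (`--connect`, `--username`, `--password-file`, `--table`, `--export-dir`, `--input-fields-terminated-by`); the JDBC URL, table name, paths, and the comma delimiter are all hypothetical and depend on how the Hive table was actually stored.

```python
import shlex


def build_sqoop_export(jdbc_url, table, export_dir, username,
                       password_file, field_delim=","):
    """Build the argument list for exporting an HDFS directory to a table."""
    return [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,  # avoids a password on the command line
        "--table", table,
        "--export-dir", export_dir,
        # Must match the delimiter used by the files in export_dir (assumed here).
        "--input-fields-terminated-by", field_delim,
    ]


# Hypothetical values for illustration; the command is printed, not executed.
command = build_sqoop_export(
    jdbc_url="jdbc:mysql://db.example.com/report",
    table="wordcount",
    export_dir="/user/hive/warehouse/wordcount",
    username="etl",
    password_file="/user/etl/.db_password",
)
print(shlex.join(command))
```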

Chapter 5: Faster SQL on Hadoop

Hive on MapReduce is slow for interactive work. Newer engines such as SparkSQL, Impala, and Presto keep all or part of the intermediate data in memory, which makes queries faster. The author prefers SparkSQL for its versatility.
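
As a rough illustration of the SparkSQL route, the sketch below runs WordCount as SQL through PySpark. It assumes PySpark is installed and that the input is a plain text file; the function names, view name, and application name are hypothetical. The helper takes an existing session, so the Spark import is only needed when `main` is actually called.

```python
# Each line of a text file arrives in a single column named `value`.
WORDCOUNT_SQL = """
SELECT word, COUNT(1) AS cnt
FROM (SELECT explode(split(value, ' ')) AS word FROM lines) t
WHERE word <> ''
GROUP BY word
"""


def sparksql_wordcount(spark, input_path):
    """Register the text file as a temporary view and run the query."""
    spark.read.text(input_path).createOrReplaceTempView("lines")
    return spark.sql(WORDCOUNT_SQL)


def main(input_path):
    # Imported here so the helper above can be read without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("wordcount-sketch")
        .master("local[*]")  # local mode; a cluster would use another master
        .getOrCreate()
    )
    try:
        sparksql_wordcount(spark, input_path).show()
    finally:
        spark.stop()
```

Calling `main` with the path of a local text file would print the word counts as a table. The SQL is close to the Hive version from Chapter 2, so existing queries carry over with little change.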

Chapter 6: Kafka for Multi‑Consumer Architecture

Kafka lets data be collected once and then consumed independently by multiple downstream systems, and it complements Flume in real‑time log pipelines.
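
The "collect once, consume many times" idea comes down to an append‑only log plus a separate read position per consumer group. The toy model below illustrates that idea in plain Python; it is not the Kafka client API, it ignores partitions and persistence, and the class and group names are hypothetical.

```python
from collections import defaultdict


class TopicLog:
    """Toy single-partition stand-in for a topic (not the Kafka API)."""

    def __init__(self):
        self._records = []
        self._offsets = defaultdict(int)  # consumer group -> next offset

    def append(self, record):
        """Producer side: store the record once and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, group, max_records=10):
        """Consumer side: return unread records for this group only."""
        start = self._offsets[group]
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch


log = TopicLog()
for line in ["GET /a", "GET /b", "POST /c"]:
    log.append(line)

# Two hypothetical consumer groups read the same data independently.
archived = log.poll("hdfs-archiver")
counted = log.poll("realtime-stats", max_records=2)
```

Because each group tracks its own offset, adding a new downstream consumer does not require collecting the data again or changing the producer.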

Chapter 7: Task Scheduling and Monitoring

7.1 Apache Oozie – Workflow scheduler for Hadoop jobs.

7.2 Other schedulers – Azkaban, Light‑Task‑Scheduler, Zeus, and custom solutions.
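
Workflow schedulers such as Oozie and Azkaban run jobs as a dependency graph: a task starts only after everything it depends on has finished. The sketch below shows that core idea with Python's standard `graphlib` module (Python 3.9+). It is an illustration under simplifying assumptions, with hypothetical task names, and it leaves out retries, timing, and monitoring.

```python
from graphlib import TopologicalSorter


def run_workflow(dependencies, actions):
    """Run each task after all of its prerequisites.

    dependencies maps a task name to the set of tasks it depends on;
    actions maps a task name to a zero-argument callable.
    A cyclic graph raises graphlib.CycleError before any task runs.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    for task in order:
        actions[task]()
    return order


# Hypothetical daily pipeline: ingest -> load into Hive -> export.
dependencies = {
    "load_hive": {"ingest"},
    "export": {"load_hive"},
}
actions = {
    "ingest": lambda: print("ingest raw logs"),
    "load_hive": lambda: print("load Hive table"),
    "export": lambda: print("export results"),
}
run_workflow(dependencies, actions)
```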

Chapter 8: Real‑Time Processing

8.1 Storm – Low‑latency stream processing (millisecond level).

8.2 Spark Streaming – Micro‑batch streaming; can be combined with Kafka for real‑time analytics.
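
The difference between the two models is that Storm handles each record as it arrives, while Spark Streaming collects records into short time windows and processes each window as a small batch. The sketch below imitates the micro‑batch side in plain Python; the function name, the event format, and the batch length are assumptions for illustration, not part of either framework.

```python
from collections import Counter


def micro_batches(events, batch_seconds):
    """Group (timestamp, word) events into fixed windows and count words.

    Returns a dict mapping each window start time to its word counts,
    ordered by window start.
    """
    batches = {}
    for timestamp, word in events:
        window_start = int(timestamp // batch_seconds) * batch_seconds
        batches.setdefault(window_start, Counter())[word] += 1
    return {start: dict(counts) for start, counts in sorted(batches.items())}


# Hypothetical events: (seconds since start, word).
events = [(0.5, "a"), (1.2, "b"), (2.1, "a"), (3.9, "a")]
print(micro_batches(events, batch_seconds=2))
# prints {0: {'a': 1, 'b': 1}, 2: {'a': 2}}
```

A shorter batch interval lowers latency but adds scheduling overhead per batch, which is the main trade‑off against per‑record processing.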

Chapter 9: Exposing Data to Business

Offline delivery: periodic dumps via Sqoop/DataX.

Real‑time services: low‑latency queries using HBase, Redis, MongoDB, Elasticsearch.

OLAP: Impala, Presto, SparkSQL, Kylin for large‑scale analytical queries.

Ad‑hoc queries: Impala, Presto, SparkSQL.

The guiding principle is to keep the architecture simple and stable.

Chapter 10: Introductory Machine Learning

Typical use cases include classification, clustering, and recommendation. Learning path: solid math foundation, Python programming, then Spark MLlib for ready‑made algorithms.
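
To make the clustering use case concrete, here is a minimal k‑means sketch in plain Python. It is a teaching aid rather than Spark MLlib code: the function name is hypothetical, the points are two‑dimensional, and the caller supplies the initial centroids so the result is reproducible. A library such as MLlib provides a distributed version of this kind of algorithm.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points with caller-supplied initial centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: (point[0] - centroids[i][0]) ** 2
                + (point[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            (sum(x for x, _ in cluster) / len(cluster),
             sum(y for _, y in cluster) / len(cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


points = [(0, 0), (0, 2), (10, 10), (10, 12)]
centers, groups = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centers)
# prints [(0.0, 1.0), (10.0, 11.0)]
```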
