From Hadoop to Spark: A Complete Roadmap to Becoming a Big Data Architect
This guide walks beginners through the essential big‑data ecosystem—from understanding Hadoop’s core components and mastering MapReduce, to using Hive, SparkSQL, Kafka, and real‑time frameworks like Storm, while also covering data ingestion, export, scheduling, and introductory machine‑learning techniques.
Introduction
Big‑data architects need a clear learning path, covering everything from SQL to NoSQL, and from beginner to master. The article outlines three development directions: platform building/optimization/operations, big‑data development/design/architecture, and data analysis/mining.
4V Characteristics of Big Data
Volume: massive data sets, growing from terabytes (TB) to petabytes (PB)
Variety: many data types (structured, unstructured, logs, video, images, geo-location, etc.)
Value: high commercial value that must be extracted through analysis and machine learning
Velocity: results are needed quickly, which offline batch processing alone cannot deliver
Common Open‑Source Big‑Data Frameworks
File storage: Hadoop HDFS, Tachyon (since renamed Alluxio), KFS
Batch processing: Hadoop MapReduce, Spark
Streaming: Storm, Spark Streaming, S4, Heron
K‑V/NoSQL stores: HBase, Redis, MongoDB
Resource management: YARN, Mesos
Log collection: Flume, Scribe, Logstash (with Kibana for visualizing the collected logs)
Message systems: Kafka, StormMQ, ZeroMQ, RabbitMQ
Query/analysis: Hive, Impala, Pig, Presto, Phoenix, SparkSQL, Drill, Flink, Kylin, Druid
Coordination service: ZooKeeper
Cluster management & monitoring: Ambari, Ganglia, Nagios, Cloudera Manager
Data mining/ML: Mahout, Spark MLlib
Data sync: Sqoop
Job scheduling: Oozie
Chapter 1: Getting Started with Hadoop
1.1 Learn to search – Use Google or Baidu to solve problems.
1.2 Official documentation – The primary reference for beginners.
1.3 Run Hadoop – Hadoop is the foundation for most big‑data frameworks.
1.4 Basic Hadoop commands – HDFS directory operations, file upload and download, submitting the example MapReduce jobs, and checking the Web UI and job logs.
1.5 Core concepts – Hadoop 1.0/2.0, MapReduce, HDFS, NameNode, DataNode, JobTracker, TaskTracker, YARN, ResourceManager, NodeManager.
1.6 Write a MapReduce program – Follow the WordCount example (Java, Shell, or Python via Hadoop Streaming).
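The WordCount logic from 1.6 can be sketched in plain Python. This is a minimal local illustration of the map and reduce phases, not a full Hadoop Streaming job: the function names map_words and reduce_counts are hypothetical, and sorted() stands in for the shuffle-and-sort step that Hadoop performs between the two phases.

```python
from itertools import groupby


def map_words(lines):
    """Map phase: emit a (word, 1) pair for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_counts(pairs):
    """Reduce phase: sum the counts for each word.

    Hadoop delivers mapper output to the reducer sorted by key;
    sorted() plays that role in this local sketch.
    """
    for word, group in groupby(sorted(pairs), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


# Local dry run on two sample lines (illustrative data only).
sample_lines = ["hello hadoop", "hello spark"]
for word, total in reduce_counts(map_words(sample_lines)):
    print(f"{word}\t{total}")
```

In a real Hadoop Streaming job the two phases would typically live in separate scripts that read from standard input and write tab-separated pairs to standard output.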
Chapter 2: Faster WordCount with SQL
2.1 Learn SQL – Essential for data analysis.
2.2 SQL-based WordCount – SELECT word, COUNT(1) FROM wordcount GROUP BY word;
2.3 Hive on Hadoop – Hive provides a data-warehouse interface that translates SQL into MapReduce jobs.
2.4 Install and configure Hive – Get Hive running the same way you set up Hadoop in Chapter 1.
2.5 Use Hive – Create a wordcount table and run the SQL from 2.2, then compare results with the MapReduce version.
2.6 How Hive works – Hive SQL is compiled into MapReduce tasks.
2.7 Basic Hive commands – Create and drop tables, load data, export data, work with partitions, etc.
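To try the WordCount query from 2.2 without a cluster, it can be run against an in-memory SQLite database. This is only a sketch: SQLite stands in for Hive here, the helper name sql_wordcount is hypothetical, and an ORDER BY clause is added so the output order is predictable.

```python
import sqlite3


def sql_wordcount(words):
    """Load words into a one-column table and count them with GROUP BY."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE wordcount (word TEXT)")
        conn.executemany(
            "INSERT INTO wordcount (word) VALUES (?)",
            [(word,) for word in words],
        )
        # Same query as in 2.2, plus ORDER BY for a stable result order.
        rows = conn.execute(
            "SELECT word, COUNT(1) FROM wordcount GROUP BY word ORDER BY word"
        ).fetchall()
    finally:
        conn.close()
    return rows


print(sql_wordcount("to be or not to be".split()))
```

The point of the comparison is that one declarative statement replaces the hand-written map and reduce functions; Hive applies the same idea at cluster scale.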
Chapter 3: Ingesting Data into Hadoop
3.1 HDFS PUT – Command‑line data upload, often scripted.
3.2 HDFS API – Programmatic writes via Java, Python, etc.
3.3 Sqoop – Transfers data between relational databases and Hadoop/Hive using MapReduce.
3.4 Flume – Distributed log collection and transport to HDFS (real‑time).
3.5 DataX – Alibaba’s open‑source data‑exchange tool, similar to Sqoop.
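Since the HDFS PUT step in 3.1 is often scripted, here is a minimal sketch of such a script. The helper name hdfs_put_command and both paths are hypothetical; the sketch only builds the command line, so it runs on a machine without Hadoop installed.

```python
import shlex


def hdfs_put_command(local_path, hdfs_dir, overwrite=False):
    """Build the argument list for uploading one local file to HDFS."""
    cmd = ["hdfs", "dfs", "-put"]
    if overwrite:
        cmd.append("-f")  # replace the destination file if it already exists
    cmd += [local_path, hdfs_dir]
    return cmd


# Hypothetical local file and HDFS target directory.
cmd = hdfs_put_command("/data/logs/access.log", "/user/demo/logs/", overwrite=True)
print(shlex.join(cmd))

# On a machine with a Hadoop client installed, the upload itself would be:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell-quoting problems when file names contain spaces.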
Chapter 4: Exporting Data from Hadoop
4.1 HDFS GET – Download files from HDFS.
4.2 HDFS API – Programmatic reads.
4.3 Sqoop – Sync HDFS or Hive tables back to relational databases.
4.4 DataX – Same purpose as Sqoop, with broader source support.
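A Sqoop export from HDFS back to a relational table, as in 4.3, might be assembled like this. Treat it as a sketch under assumptions: the helper name, JDBC URL, user, table, and directory are all hypothetical, a comma delimiter is assumed for the exported files, and authentication options are left out.

```python
import shlex


def sqoop_export_command(jdbc_url, username, table, export_dir, field_delimiter=","):
    """Build the argument list for a basic `sqoop export` run."""
    return [
        "sqoop", "export",
        "--connect", jdbc_url,        # JDBC URL of the target database
        "--username", username,
        "--table", table,             # target table, which must already exist
        "--export-dir", export_dir,   # HDFS directory holding the files to export
        "--input-fields-terminated-by", field_delimiter,
    ]


# Hypothetical connection details, for illustration only.
cmd = sqoop_export_command(
    jdbc_url="jdbc:mysql://db.example.com/sales",
    username="etl_user",
    table="daily_report",
    export_dir="/user/hive/warehouse/daily_report",
)
print(shlex.join(cmd))
```

The field delimiter has to match how the files were written; it is a parameter here because that depends on how the Hive table was defined.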
Chapter 5: Faster SQL on Hadoop
Hive running on MapReduce is slow for interactive queries. Newer engines such as SparkSQL, Impala, and Presto execute queries fully or partially in memory and return results much faster. The author prefers SparkSQL for its versatility.
Chapter 6: Kafka for Multi‑Consumer Architecture
Kafka lets data be collected once and consumed independently by multiple downstream systems, complementing Flume in real-time log pipelines.
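The collect-once, consume-many idea can be illustrated with a toy in-memory model. This is not the Kafka client API: the TopicLog class and the consumer group names are hypothetical, and the sketch models only an append-only log with one read offset per consumer group.

```python
class TopicLog:
    """Toy stand-in for a single topic partition: an append-only message list."""

    def __init__(self):
        self._messages = []
        self._offsets = {}  # consumer group name -> next offset to read

    def produce(self, message):
        """Append a message; messages are never removed when they are read."""
        self._messages.append(message)

    def poll(self, group, max_messages=10):
        """Return the next unread messages for this group and advance its offset."""
        start = self._offsets.get(group, 0)
        batch = self._messages[start:start + max_messages]
        self._offsets[group] = start + len(batch)
        return batch


log = TopicLog()
for line in ["GET /home", "GET /cart", "POST /order"]:
    log.produce(line)  # each log line is collected once

# Two independent consumers each see the full stream at their own pace.
print(log.poll("hdfs-archiver"))
print(log.poll("realtime-dashboard", max_messages=2))
print(log.poll("realtime-dashboard"))
```

Because every group keeps its own offset, adding a new downstream consumer does not require collecting the data again or changing the producers.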
Chapter 7: Task Scheduling and Monitoring
7.1 Apache Oozie – Workflow scheduler for Hadoop jobs.
7.2 Other schedulers – Azkaban, Light‑Task‑Scheduler, Zeus, and custom solutions.
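Workflow schedulers such as Oozie run jobs in dependency order. The core idea can be sketched with Python's standard graphlib module (Python 3.9 or later). The job names below are hypothetical, and the sketch covers ordering only, not timing, retries, or monitoring.

```python
from graphlib import TopologicalSorter


def execution_order(dependencies):
    """Return job names in an order that respects every dependency.

    `dependencies` maps each job to the set of jobs that must finish first.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical daily pipeline: each job depends on the one before it.
workflow = {
    "ingest_logs": set(),
    "build_hive_table": {"ingest_logs"},
    "daily_report": {"build_hive_table"},
    "export_to_mysql": {"daily_report"},
}
print(execution_order(workflow))
```

A circular dependency makes static_order() raise graphlib.CycleError, which mirrors the requirement that a workflow must not contain cycles.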
Chapter 8: Real‑Time Processing
8.1 Storm – Low‑latency stream processing (millisecond level).
8.2 Spark Streaming – Micro‑batch streaming; can be combined with Kafka for real‑time analytics.
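The micro-batch model mentioned in 8.2 can be sketched without Spark: incoming events are grouped into fixed time windows, and each window is processed as a small batch. The function name micro_batch_counts and the sample events are hypothetical; this is plain Python, not the Spark Streaming API.

```python
from collections import Counter, defaultdict


def micro_batch_counts(events, batch_seconds):
    """Group (timestamp, text) events into fixed windows and count words per window."""
    batches = defaultdict(Counter)
    for timestamp, text in events:
        # Start time of the window this event falls into.
        batch_start = int(timestamp // batch_seconds) * batch_seconds
        batches[batch_start].update(text.split())
    return {start: dict(counts) for start, counts in sorted(batches.items())}


# Hypothetical log events as (seconds since start, message) pairs.
events = [(0.5, "error timeout"), (1.2, "error"), (5.1, "ok"), (9.9, "error ok")]
print(micro_batch_counts(events, batch_seconds=5))
```

The batch interval sets the trade-off: a result is available only once its window closes, so micro-batching has higher latency than the per-event processing described for Storm.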
Chapter 9: Exposing Data to Business
Offline delivery: periodic dumps via Sqoop/DataX.
Real‑time services: low‑latency queries using HBase, Redis, MongoDB, Elasticsearch.
OLAP: Impala, Presto, SparkSQL, Kylin for large‑scale analytical queries.
Ad‑hoc queries: Impala, Presto, SparkSQL.
The guiding principle is to keep the architecture simple and stable.
Chapter 10: Introductory Machine Learning
Typical use cases include classification, clustering, and recommendation. Learning path: build a solid math foundation, learn Python programming, then use Spark MLlib for its ready-made algorithms.
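As a first taste of clustering, here is a minimal one-dimensional k-means in plain Python. It is a learning sketch under assumptions, not Spark MLlib: the function name kmeans_1d is hypothetical, the starting centroids are supplied by the caller, and the loop runs a fixed number of iterations instead of testing for convergence.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Cluster numbers by repeatedly assigning each to its nearest centroid."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put every point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: abs(point - centroids[i]),
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


centroids, clusters = kmeans_1d([1, 2, 3, 10, 11, 12], centroids=[0, 5])
print(centroids)
print(clusters)
```

Library implementations add the parts this sketch leaves out, such as choosing initial centroids, handling many dimensions, and distributing the work across a cluster.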
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
