Designing Scalable Hadoop‑Based Data Analytics Platforms: Architecture & Best Practices
This article explains how enterprises can build a scalable data analytics platform on Hadoop by outlining the multi‑layer architecture, storage options, data synchronization methods, and ETL/offline computation techniques, while highlighting practical component choices such as Hive, HBase, Spark, and Oozie.
Data Analytics Platform Architecture Overview
A scalable analytics platform can be logically divided into six layers:
Landing (Access) Layer : Temporary storage of raw source data, often called ODS. Data is kept in its original schema without transformation.
Integration Layer : Persistent storage of integrated enterprise data, modeling business entities and events. This layer serves as the single source of truth (the data warehouse).
Presentation Layer : Data models optimized for BI queries and performance, commonly referred to as data marts.
Semantic Layer : Defines how data is presented to end‑users and enforces access control, typically via reporting tools.
End‑User Applications : Dashboards, reports, charts, and other visualisations that consume the semantic layer.
Metadata : Stores definitions, lineage, and processing details for every data object.
All layers are backed by a data lake built on the Hadoop ecosystem, allowing raw, integrated, and presentation data to coexist as files or tables.
The overall data processing flow is illustrated below.
Hadoop‑Based Implementation
1. Storage Layer
HDFS provides the distributed file system foundation. On top of HDFS two databases are typically used:
Hive : A SQL‑on‑HDFS engine that stores table metadata in the Hive metastore and translates SQL statements into MapReduce or Spark jobs. Hive supports both text formats (CSV, JSON) and columnar binary formats (ORC, Parquet). For the Landing layer raw files are usually kept as CSV/JSON without partitioning; the Integration layer prefers ORC/Parquet with partitioning to accelerate offline queries.
HBase : A NoSQL key‑value store optimized for low‑latency random reads and high write throughput. It is suitable for real‑time access patterns that cannot be satisfied efficiently by Hive.
2. Data Ingestion
Data reaches the Landing layer through two complementary mechanisms:
Batch (scheduled) sync : Executed at fixed intervals using sqoop for full loads or incremental loads. Full loads are practical for small tables; large tables rely on incremental capture to keep source and platform in sync.
Streaming sync : Change‑data‑capture events are published to Kafka (or another MQ). Consumers read the events and apply incremental updates in near real‑time.
3. Processing and ETL
YARN manages cluster resources and schedules jobs. The preferred compute engine is Spark on YARN because:
Spark SQL and Spark RDD provide a high‑level, memory‑centric API that is easier for developers than classic MapReduce.
Spark can read/write both Hive tables and HBase tables, enabling unified pipelines.
ETL pipelines are typically expressed in Spark SQL or Hive SQL. Hive 2.0+ also supports stored procedures, but Spark SQL is usually chosen for performance‑critical jobs.
4. Workflow Orchestration
Complex pipelines are chained together using an orchestration engine such as Oozie . Oozie defines directed acyclic graphs (DAGs) of actions (Sqoop import, Spark job, Hive query, etc.) and handles retries, dependencies, and scheduling.
5. Presentation and Consumption
After ETL, data is materialized into data marts. Options for serving the marts include:
Loading the results back into a traditional RDBMS for legacy reporting tools.
Keeping the data in Hive/HBase and exposing it directly to BI tools.
Building OLAP cubes with Apache Kylin , which reads from Hive and provides a high‑performance SQL interface for multidimensional analysis.
6. Metadata Management
The Hive metastore and external catalog services record table definitions, partitioning schemes, and lineage information. This metadata underpins the Semantic layer, enabling consistent data definitions and access control across all end‑user applications.
In summary, the platform replaces traditional single‑node databases with a Hadoop stack (HDFS, Hive, HBase, Spark, Kafka, Oozie) while preserving the same logical architecture. This design delivers horizontal scalability for both storage and compute, supports batch and real‑time ingestion, and provides flexible consumption options for downstream analytics.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
