Big Data 10 min read

Designing Scalable Hadoop‑Based Data Analytics Platforms: Architecture & Best Practices

This article explains how enterprises can build a scalable data analytics platform on Hadoop by outlining the multi‑layer architecture, storage options, data synchronization methods, and ETL/offline computation techniques, while highlighting practical component choices such as Hive, HBase, Spark, and Oozie.

dbaplus Community

Jun 14, 2018

Designing Scalable Hadoop‑Based Data Analytics Platforms: Architecture & Best Practices

Data Analytics Platform Architecture Overview

A scalable analytics platform can be logically divided into six layers:

Landing (Access) Layer : Temporary storage of raw source data, often called ODS. Data is kept in its original schema without transformation.

Integration Layer : Persistent storage of integrated enterprise data, modeling business entities and events. This layer serves as the single source of truth (the data warehouse).

Presentation Layer : Data models optimized for BI queries and performance, commonly referred to as data marts.

Semantic Layer : Defines how data is presented to end‑users and enforces access control, typically via reporting tools.

End‑User Applications : Dashboards, reports, charts, and other visualisations that consume the semantic layer.

Metadata : Stores definitions, lineage, and processing details for every data object.

All layers are backed by a data lake built on the Hadoop ecosystem, allowing raw, integrated, and presentation data to coexist as files or tables.

The overall data processing flow is illustrated below.

Hadoop‑Based Implementation

1. Storage Layer

HDFS provides the distributed file system foundation. On top of HDFS two databases are typically used:

Hive : A SQL‑on‑HDFS engine that stores table metadata in the Hive metastore and translates SQL statements into MapReduce or Spark jobs. Hive supports both text formats (CSV, JSON) and columnar binary formats (ORC, Parquet). For the Landing layer raw files are usually kept as CSV/JSON without partitioning; the Integration layer prefers ORC/Parquet with partitioning to accelerate offline queries.

HBase : A NoSQL key‑value store optimized for low‑latency random reads and high write throughput. It is suitable for real‑time access patterns that cannot be satisfied efficiently by Hive.

2. Data Ingestion

Data reaches the Landing layer through two complementary mechanisms:

Batch (scheduled) sync : Executed at fixed intervals using sqoop for full loads or incremental loads. Full loads are practical for small tables; large tables rely on incremental capture to keep source and platform in sync.

Streaming sync : Change‑data‑capture events are published to Kafka (or another MQ). Consumers read the events and apply incremental updates in near real‑time.

3. Processing and ETL

YARN manages cluster resources and schedules jobs. The preferred compute engine is Spark on YARN because:

Spark SQL and Spark RDD provide a high‑level, memory‑centric API that is easier for developers than classic MapReduce.

Spark can read/write both Hive tables and HBase tables, enabling unified pipelines.

ETL pipelines are typically expressed in Spark SQL or Hive SQL. Hive 2.0+ also supports stored procedures, but Spark SQL is usually chosen for performance‑critical jobs.

4. Workflow Orchestration

Complex pipelines are chained together using an orchestration engine such as Oozie . Oozie defines directed acyclic graphs (DAGs) of actions (Sqoop import, Spark job, Hive query, etc.) and handles retries, dependencies, and scheduling.

5. Presentation and Consumption

After ETL, data is materialized into data marts. Options for serving the marts include:

Loading the results back into a traditional RDBMS for legacy reporting tools.

Keeping the data in Hive/HBase and exposing it directly to BI tools.

Building OLAP cubes with Apache Kylin , which reads from Hive and provides a high‑performance SQL interface for multidimensional analysis.

6. Metadata Management

The Hive metastore and external catalog services record table definitions, partitioning schemes, and lineage information. This metadata underpins the Semantic layer, enabling consistent data definitions and access control across all end‑user applications.

In summary, the platform replaces traditional single‑node databases with a Hadoop stack (HDFS, Hive, HBase, Spark, Kafka, Oozie) while preserving the same logical architecture. This design delivers horizontal scalability for both storage and compute, supports batch and real‑time ingestion, and provides flexible consumption options for downstream analytics.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Hive ETL Data Lake Spark Hadoop Data Architecture

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.