
Unpacking the Core Technologies Behind Modern Big Data Platforms

From data ingestion to real‑time analytics, this guide breaks down the essential layers of a typical big‑data platform: collection methods, HDFS storage, Hive/Spark analysis, data sharing mechanisms, application use cases, streaming with Spark Streaming, and the need for robust scheduling and monitoring.

dbaplus Community

Jul 13, 2022

1. Data Collection

The collection layer ingests raw data from heterogeneous sources and writes it to HDFS (the data lake). Typical sources and recommended ingestion tools are:

Website logs – Deploy a Flume agent on each log server; the agent streams logs in real time to HDFS.

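Business databases (MySQL, Oracle, SQL Server, etc.) – Use DataX (an open‑source ETL framework from Alibaba) to pull data into HDFS. DataX can be extended for custom logic. Sqoop is a common alternative, but it launches a MapReduce job for every transfer regardless of data size, and it requires every Hadoop node that may run a map task to have network access to the source DB, which is often impractical. Flume can also be configured, with some custom development, to sync database changes to HDFS in near real time.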

FTP/HTTP feeds – Schedule periodic pulls from partner sites; DataX can cover this need as well (it ships an FTP reader and can be extended with custom reader plugins).

Ad‑hoc or manual sources – Small programs or lightweight APIs can write directly to HDFS for one‑off loads.
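As a rough illustration of the database‑to‑HDFS path above, the sketch below builds a DataX‑style job description in Python and serializes it to JSON. The field names follow the `mysqlreader` and `hdfswriter` plugins as commonly documented, but treat them as assumptions and check them against the plugin docs for your DataX version. The credentials, NameNode address, table, and paths are all hypothetical placeholders, and typing every column as `string` is a simplification.

```python
import json


def build_datax_job(jdbc_url, table, columns, hdfs_path, channels=2):
    """Return a DataX-style job (MySQL reader -> HDFS writer) as a plain dict."""
    reader = {
        "name": "mysqlreader",
        "parameter": {
            "username": "etl_user",    # placeholder credentials (assumption)
            "password": "CHANGE_ME",
            "column": columns,
            "connection": [{"table": [table], "jdbcUrl": [jdbc_url]}],
        },
    }
    writer = {
        "name": "hdfswriter",
        "parameter": {
            "defaultFS": "hdfs://namenode:8020",  # hypothetical NameNode address
            "fileType": "orc",
            "path": hdfs_path,
            "fileName": table,
            # Simplification: every column is written as a string.
            "column": [{"name": c, "type": "string"} for c in columns],
            "writeMode": "append",
            "fieldDelimiter": "\t",
        },
    }
    return {
        "job": {
            "setting": {"speed": {"channel": channels}},
            "content": [{"reader": reader, "writer": writer}],
        }
    }


if __name__ == "__main__":
    job = build_datax_job(
        "jdbc:mysql://db-host:3306/shop", "orders", ["id", "amount"], "/data/raw/orders"
    )
    print(json.dumps(job, indent=2))
```

Generating the job file from code rather than writing it by hand makes it easier to keep one template per source type and vary only the table, columns, and target path.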

2. Data Storage and Analysis

HDFS is the primary storage layer for both raw and processed data. For batch analytics the following engines are commonly used:

Hive – Provides a SQL‑like interface, rich data types, built‑in functions, and the high‑compression ORC file format. Hive queries are compiled to MapReduce or Tez jobs, dramatically reducing development effort compared to hand‑written MapReduce.

MapReduce – Still available for custom Java jobs when low‑level control is required.

Spark & SparkSQL – Offers in‑memory processing that is often several times faster than MapReduce, especially for multi‑stage or iterative jobs. Spark runs on YARN, so it shares the existing Hadoop cluster and HDFS without a separately deployed Spark cluster, and SparkSQL can query Hive tables directly.
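To make the Hive point concrete: a query such as `SELECT page, COUNT(*) FROM access_log GROUP BY page` is conceptually executed as a map, shuffle, and reduce sequence. The single‑process Python sketch below mimics those three phases. The table and column names are hypothetical, and a real Hive plan adds combiners, partitioning, and distributed execution that are not modeled here.

```python
from collections import defaultdict


def map_phase(rows):
    """Emit (page, 1) for every log row, like a mapper for COUNT(*) ... GROUP BY page."""
    for row in rows:
        yield row["page"], 1


def shuffle_phase(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key to produce the final counts."""
    return {key: sum(values) for key, values in grouped.items()}


def page_views(rows):
    """Run the three phases end to end on an in-memory list of rows."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    log = [{"page": "/home"}, {"page": "/cart"}, {"page": "/home"}]
    print(page_views(log))
```

Writing this by hand for every report is exactly the effort that a one‑line SQL statement in Hive or SparkSQL removes.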

3. Data Sharing

After analysis, result sets are copied from HDFS to downstream stores that business services can query efficiently:

Relational databases (e.g., MySQL, PostgreSQL) for structured reports.

NoSQL stores such as HBase or Redis for low‑latency look‑ups.

DataX can be reused to synchronize processed data from HDFS to these targets.

Real‑time pipelines may write results directly to the sharing layer (e.g., Redis) without an intermediate batch step.
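One practical concern when syncing results into a relational store is that the sync job may be re‑run after a failure, so the load should be idempotent. The sketch below illustrates the idea with Python's built‑in `sqlite3` as a stand‑in for MySQL or PostgreSQL. The table and column names are hypothetical, and `INSERT OR REPLACE` is SQLite syntax: MySQL would typically use `INSERT ... ON DUPLICATE KEY UPDATE` and PostgreSQL `INSERT ... ON CONFLICT`.

```python
import sqlite3


def publish_daily_report(conn, rows):
    """Upsert (report_date, page, views) rows so re-running the sync is harmless."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS daily_page_views (
               report_date TEXT NOT NULL,
               page        TEXT NOT NULL,
               views       INTEGER NOT NULL,
               PRIMARY KEY (report_date, page)
           )"""
    )
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT OR REPLACE INTO daily_page_views (report_date, page, views) "
            "VALUES (?, ?, ?)",
            rows,
        )


def views_for(conn, report_date):
    """Return [(page, views), ...] for one day, ordered by page."""
    cur = conn.execute(
        "SELECT page, views FROM daily_page_views WHERE report_date = ? ORDER BY page",
        (report_date,),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    publish_daily_report(conn, [("2022-07-12", "/home", 10), ("2022-07-12", "/cart", 3)])
    publish_daily_report(conn, [("2022-07-12", "/home", 12)])  # corrected re-run
    print(views_for(conn, "2022-07-12"))
```

Keying the target table on the reporting date plus the business key means a corrected batch simply overwrites the earlier rows instead of duplicating them.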

4. Data Application

Business applications (CRM, ERP) and reporting tools consume data from the sharing layer. Typical usage patterns include:

Ad‑hoc queries – Analysts, developers, and managers often need to run arbitrary SQL against the data. Hive can be slow for interactive workloads; SparkSQL provides much lower latency while remaining Hive‑compatible. Impala is an alternative low‑latency SQL engine, but adopt it only if the platform can accommodate an additional service.

OLAP on large volumes – Many commercial OLAP tools cannot read directly from HDFS, forcing a copy to a relational store, which does not scale. Custom services can fetch data from HBase or HDFS on‑demand based on user‑selected dimensions and metrics.

General purpose APIs – Example: a Redis‑backed service that returns user attributes to any downstream module.
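The custom OLAP service mentioned above boils down to grouping detail rows by whichever dimensions the user picks and summing the chosen metrics. A minimal in‑memory sketch follows. The function name, row layout, and field names are hypothetical, and a real service would push this aggregation down to HBase scans or a SQL engine rather than loop in Python.

```python
from collections import defaultdict


def roll_up(rows, dimensions, metrics):
    """Group rows by the chosen dimensions and sum the chosen metrics."""
    totals = defaultdict(lambda: dict.fromkeys(metrics, 0))
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        for m in metrics:
            totals[key][m] += row[m]
    # One output row per distinct dimension combination, sorted by the key.
    return [
        {**dict(zip(dimensions, key)), **sums}
        for key, sums in sorted(totals.items())
    ]


if __name__ == "__main__":
    detail = [
        {"region": "east", "device": "ios", "views": 5, "clicks": 1},
        {"region": "east", "device": "android", "views": 7, "clicks": 2},
        {"region": "west", "device": "ios", "views": 4, "clicks": 0},
    ]
    print(roll_up(detail, ["region"], ["views", "clicks"]))
    print(roll_up(detail, ["device"], ["views"]))
```

Because dimensions and metrics are plain parameters, the same endpoint can serve any combination the reporting front end asks for.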

5. Real‑time Computing

For low‑latency use cases such as live traffic monitoring or ad‑impression tracking, a streaming framework is required. The platform adopts Spark Streaming because it avoids adding a separate component (e.g., Storm) while meeting throughput and latency requirements.

Typical streaming pipeline:

Flume agent (log server) → Spark Streaming (YARN) → Aggregation logic → Redis (real‑time store)

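Business services read the latest metrics from Redis with sub‑second response times. Because Spark Streaming processes data in micro‑batches, the freshness of those metrics is bounded by the batch interval.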
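The aggregation step in this pipeline can be pictured as micro‑batching: events are cut into fixed time windows, counted per key, and the counts are added to running totals in the real‑time store. The pure‑Python sketch below imitates that behavior with a dict standing in for Redis. It does not use the Spark Streaming API, and the event layout, key names, and `last_batch_start` field are hypothetical.

```python
from collections import Counter, defaultdict


def micro_batches(events, batch_seconds):
    """Group (timestamp, key) events into fixed windows, yielded in time order."""
    batches = defaultdict(list)
    for ts, key in events:
        window_start = int(ts // batch_seconds) * batch_seconds
        batches[window_start].append(key)
    for window_start in sorted(batches):
        yield window_start, batches[window_start]


def run_stream(events, batch_seconds, store):
    """Count keys per batch and add the counts to running totals in `store`."""
    for window_start, keys in micro_batches(events, batch_seconds):
        for key, count in Counter(keys).items():
            # With a real Redis client this would be an INCRBY on the key.
            store[key] = store.get(key, 0) + count
        store["last_batch_start"] = window_start
    return store


if __name__ == "__main__":
    impressions = [(0.5, "ad_1"), (1.2, "ad_2"), (4.9, "ad_1"), (5.1, "ad_1"), (9.0, "ad_2")]
    print(run_stream(impressions, batch_seconds=5, store={}))
```

Writing one increment per key per batch, rather than one per event, keeps the load on the store proportional to the number of distinct keys instead of the raw event rate.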

6. Task Scheduling and Monitoring

The platform orchestrates a large number of jobs (collection, sync, batch analysis, streaming) that have complex dependencies. A robust scheduler must:

Define execution order (e.g., analysis starts only after the corresponding collection job finishes).

Handle retries, failure alerts, and resource allocation on YARN.

Provide visibility into job status and performance metrics for operations teams.

Implementations typically use workflow engines such as Apache Oozie, Airflow, or commercial equivalents, but the essential requirement is a centralized system that guarantees reliable, ordered execution of all data‑pipeline tasks.
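The core of these requirements (ordered execution, retries, and failure alerts) can be sketched in a few lines with the standard‑library `graphlib` module (Python 3.9+). This is a toy under stated assumptions, not how Oozie or Airflow work internally: the task names are hypothetical, tasks run sequentially in one process, the "alert" is just a print, and the run stops at the first task that exhausts its retries even if some remaining tasks do not depend on it.

```python
from graphlib import TopologicalSorter


def run_pipeline(dependencies, tasks, max_retries=2):
    """Run tasks in dependency order with retries.

    dependencies: {task_name: set of task names it depends on}
    tasks: {task_name: zero-argument callable}
    Returns the names of completed tasks in execution order.
    """
    completed = []
    for name in TopologicalSorter(dependencies).static_order():
        for attempt in range(1, max_retries + 2):
            try:
                tasks[name]()
                completed.append(name)
                break
            except Exception as exc:
                print(f"ALERT: {name} failed on attempt {attempt}: {exc}")
        else:
            # Every attempt failed: stop so that nothing downstream runs.
            raise RuntimeError(f"{name} failed after {max_retries + 1} attempts")
    return completed


if __name__ == "__main__":
    deps = {"analyze": {"collect"}, "export": {"analyze"}}
    jobs = {
        "collect": lambda: print("collect logs into HDFS"),
        "analyze": lambda: print("run batch analysis"),
        "export": lambda: print("sync results to the sharing layer"),
    }
    print(run_pipeline(deps, jobs))
```

A production scheduler adds what this sketch leaves out: time‑based triggers, parallel execution of independent branches, resource‑aware submission to YARN, and a status view for operations teams.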

