
Core Technologies and Architecture of a Big Data Platform

This article explains the typical architecture of a big‑data platform, detailing its four core layers (data collection, storage & analysis, data sharing, and data application) and the key technologies involved, such as Flume, DataX, HDFS, Hive, Spark, Spark Streaming, and task‑scheduling components.


The article starts with a schematic of a typical big‑data platform architecture, which usually consists of four logical layers: data collection, storage & analysis, data sharing, and data application.

Data Collection: The goal is to ingest data from various sources into the storage layer, optionally performing light cleansing. Common sources include website logs (collected via Flume agents to HDFS), business databases (MySQL, Oracle, SQL Server) where tools like DataX replace heavyweight Sqoop, FTP/HTTP sources, and manually entered data accessed through simple APIs or mini‑programs.
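The log‑collection path can be sketched as a Flume agent tailing a web server log and writing to HDFS. This is a minimal illustrative config; the agent name, file paths, and HDFS URI are assumptions, not from the article.

```properties
# Hypothetical Flume agent: tail nginx access logs into HDFS
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# TAILDIR source follows the log file and survives agent restarts
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /var/log/nginx/access.log
a1.sources.r1.channels = c1

# File channel buffers events durably between source and sink
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /data/flume/checkpoint
a1.channels.c1.dataDirs = /data/flume/data

# HDFS sink rolls a new file every 5 minutes, partitioned by date
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://nameservice1/logs/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.rollInterval = 300
```

In practice one agent runs per web server, all fanning into the same dated HDFS directory that the analysis layer reads.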

Storage & Analysis: HDFS is presented as the optimal data lake. Offline analysis is typically performed with Hive (SQL‑like, ORC compression, high performance) or MapReduce for Java‑centric developers. Spark and SparkSQL are highlighted for faster processing and seamless integration with YARN, while Spark Streaming is used for near‑real‑time analytics.
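The Hive‑with‑ORC pattern mentioned above looks roughly like the following sketch; the database, table, and column names are made up for illustration.

```sql
-- Hypothetical ORC-backed fact table, partitioned by day
CREATE TABLE IF NOT EXISTS dw.page_views (
  user_id STRING,
  url     STRING,
  ts      BIGINT
)
PARTITIONED BY (dt STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');

-- Typical offline daily aggregation over a single partition
SELECT url, COUNT(DISTINCT user_id) AS uv
FROM dw.page_views
WHERE dt = '2024-01-01'
GROUP BY url;
```

The same query runs unchanged under SparkSQL against the Hive metastore, which is why the article recommends SparkSQL where latency matters.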

Data Sharing: After analysis, results are stored in relational or NoSQL databases to be consumed by downstream services. DataX can also synchronize processed data from HDFS back to these target stores, and real‑time results may be written directly to the sharing layer.
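An HDFS‑to‑MySQL sync with DataX is driven by a JSON job description pairing a reader with a writer. The sketch below shows the general shape; paths, credentials, table, and column names are all hypothetical.

```json
{
  "job": {
    "setting": { "speed": { "channel": 3 } },
    "content": [{
      "reader": {
        "name": "hdfsreader",
        "parameter": {
          "defaultFS": "hdfs://nameservice1",
          "path": "/warehouse/dw.db/daily_uv/dt=2024-01-01/*",
          "fileType": "orc",
          "column": ["*"]
        }
      },
      "writer": {
        "name": "mysqlwriter",
        "parameter": {
          "username": "etl",
          "password": "******",
          "column": ["url", "uv"],
          "connection": [{
            "jdbcUrl": "jdbc:mysql://db-host:3306/report",
            "table": ["daily_uv"]
          }]
        }
      }
    }]
  }
}
```

The scheduler typically runs such a job right after the Hive analysis job completes, landing results where reporting systems can query them.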

Data Applications: Includes business systems (CRM, ERP), reporting tools (FineReport), ad‑hoc queries (often requiring direct HDFS access and preferring SparkSQL over Hive for speed), OLAP workloads (which may need custom extraction from HDFS/HBase due to volume), and generic data interfaces (e.g., a Redis‑based user attribute service).

Real‑Time Data Computation: To meet low‑latency requirements, the platform uses Spark Streaming together with Flume for log collection, storing aggregated results in Redis for immediate consumption by business services.

Task Scheduling & Monitoring: The platform orchestrates numerous jobs (collection, sync, analysis) with complex dependencies, requiring a robust scheduler and monitoring system to manage execution order and health.
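The dependency problem a scheduler solves can be illustrated with a topological sort over a job graph. This is a minimal sketch using Python's stdlib `graphlib`; the job names are hypothetical, not from the article.

```python
# Minimal sketch of dependency-aware job ordering (hypothetical job names).
from graphlib import TopologicalSorter

# Each job maps to the set of jobs it must wait for.
jobs = {
    "flume_collect":  set(),                               # log ingestion
    "datax_sync_in":  set(),                               # DB -> HDFS sync
    "hive_analysis":  {"flume_collect", "datax_sync_in"},  # offline analysis
    "datax_sync_out": {"hive_analysis"},                   # HDFS -> MySQL
    "report_refresh": {"datax_sync_out"},                  # downstream report
}

# static_order() yields each job only after all of its dependencies.
order = list(TopologicalSorter(jobs).static_order())
print(order)
```

A production scheduler (Azkaban, Oozie, Airflow, and the like) adds triggering, retries, and alerting on top of exactly this ordering logic.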

Tags: big data, DataX, Spark, Hadoop, data architecture, Flume, data ingestion
Written by Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
