Core Technologies and Architecture of a Big Data Platform
This article outlines a typical big data platform architecture. It covers the four core layers (data collection, data storage and analysis, data sharing, and data application) along with real-time computation and task scheduling, and describes key technologies such as Flume, DataX, HDFS, Hive, Spark, Spark Streaming, and Redis.
A typical big data platform architecture, as used by many companies, consists of four core layers: data collection, data storage and analysis, data sharing, and data application. Real-time computation and task scheduling and monitoring support these layers.
1. Data Collection
The data collection layer gathers data from various sources and stores it in the data storage layer, often performing simple cleaning along the way. Common data sources include:
Website logs – collected in real time by deploying Flume agents on log servers and storing the logs in HDFS.
Business databases – such as MySQL, Oracle, SQL Server. Tools like DataX (or Sqoop for smaller workloads) are used to sync data to HDFS.
FTP/HTTP sources – periodic data fetched from partners, also supported by DataX.
Other sources – manually entered data that can be provided via a simple API or mini‑program.
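The "simple cleaning" step mentioned above could look roughly like the following sketch. It assumes access logs in Common Log Format, which the article does not specify; the pattern and the function names clean_log_line and clean_lines are illustrative only.

```python
import re
from typing import Optional

# Assumed layout: Common Log Format. Real website logs may differ.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def clean_log_line(line: str) -> Optional[dict]:
    """Parse one access-log line; return None if the line is malformed."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A "-" size means no response body was sent; normalize it to 0.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


def clean_lines(lines):
    """Keep only the lines that parse successfully."""
    return [r for r in map(clean_log_line, lines) if r is not None]
```

Dropping malformed lines at collection time keeps the storage layer free of records that every downstream job would otherwise have to filter out again.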
2. Data Storage and Analysis
HDFS is the de‑facto storage solution for a big data warehouse. For offline analysis, Hive is the preferred tool because of its rich data types, built‑in functions, support for the ORC file format (which compresses well), and SQL‑like interface, which is far more concise than writing MapReduce jobs by hand. Hadoop's MapReduce remains available for custom Java‑based analysis. Spark integrates well with Hive and YARN, is easy to deploy on an existing Hadoop cluster, and is typically much faster than MapReduce because it keeps intermediate data in memory where possible. SparkSQL provides fast, Hive‑compatible query capabilities.
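To illustrate why a SQL-like interface is more concise than hand-written MapReduce, the sketch below spells out the map, shuffle, and reduce steps of a page-view count in plain Python. It is a single-process stand-in, not Hadoop code, and the table and column names in the comment are hypothetical.

```python
from collections import defaultdict
from itertools import chain

# Equivalent HiveQL for comparison (hypothetical table and column names):
#   SELECT url, COUNT(*) AS pv FROM access_log GROUP BY url;


def map_phase(record):
    """Emit a (url, 1) pair for each record, as a mapper would."""
    yield record["url"], 1


def shuffle(pairs):
    """Group values by key, mimicking the shuffle between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts for one key, as a reducer would."""
    return key, sum(values)


def page_views(records):
    """Run the three phases in sequence and return {url: count}."""
    pairs = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

The whole pipeline collapses into the one-line query in the comment, which is the conciseness argument made above.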
3. Data Sharing
After analysis, results need to be made available to downstream systems. This is achieved by synchronizing data from HDFS to relational databases or NoSQL stores. DataX can again be used to move processed data from HDFS to these target systems, and some real‑time results may be written directly to the sharing layer.
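A minimal sketch of the synchronization step, assuming the analysis results have been exported as tab-separated text. SQLite stands in for the relational target here, and the daily_pv table and its columns are hypothetical. In the architecture described above, DataX would do this job.

```python
import csv
import io
import sqlite3


def load_results(conn, tsv_text):
    """Load tab-separated (day, url, pv) rows into daily_pv; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_pv ("
        "day TEXT, url TEXT, pv INTEGER, PRIMARY KEY (day, url))"
    )
    rows = [
        (r[0], r[1], int(r[2]))
        for r in csv.reader(io.StringIO(tsv_text), delimiter="\t")
        if r  # skip blank lines
    ]
    # INSERT OR REPLACE keeps the load idempotent: re-running a sync job
    # overwrites rows with the same key instead of duplicating them.
    with conn:
        conn.executemany("INSERT OR REPLACE INTO daily_pv VALUES (?, ?, ?)", rows)
    return len(rows)
```

Making the load idempotent matters because sync jobs are commonly re-run after a failure, and the sharing layer should end up in the same state either way.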
4. Data Application
Data in the sharing layer powers various applications:
Business products (CRM, ERP) – directly query the shared data.
Reports (FineReport, custom dashboards) – use pre‑aggregated data.
Ad‑hoc queries – required by developers, operators, analysts, or managers, often executed directly against the storage layer.
OLAP – many OLAP tools cannot read HDFS directly, so data is often synced to relational databases; for very large volumes, custom solutions read from HDFS or HBase.
Other data interfaces – generic or customized APIs (e.g., fetching user attributes from Redis) that serve multiple business services.
5. Real‑time Computation
Growing business demand for low‑latency insights (e.g., website traffic, ad exposure) requires a distributed, high‑throughput, low‑delay framework. Storm is mature, but Spark Streaming is chosen here for its simplicity; because it processes data in micro‑batches, its latency is somewhat higher than Storm's, which is acceptable for these use cases. Spark Streaming is used to implement real‑time website traffic statistics and ad‑effect metrics. Flume collects logs from web and ad servers and streams them to Spark Streaming, which processes the data and stores the results in Redis for instant access.
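The micro-batch model behind this pipeline can be sketched in plain Python as follows. This is a single-process illustration of the idea, not the Spark Streaming API; the event shape (timestamp, url) and the function names are assumptions. In the pipeline described above, each batch's totals would be written to Redis instead of being yielded.

```python
from collections import Counter, defaultdict


def micro_batches(events, batch_seconds):
    """Group (timestamp, url) events into fixed-length batches, ordered by time."""
    batches = defaultdict(list)
    for timestamp, url in events:
        batches[int(timestamp // batch_seconds)].append(url)
    for batch_id in sorted(batches):
        yield batch_id, batches[batch_id]


def running_page_views(events, batch_seconds=5):
    """Yield the cumulative per-URL counts after each batch is processed."""
    totals = Counter()
    for batch_id, urls in micro_batches(events, batch_seconds):
        totals.update(urls)  # merge this batch into the running state
        yield batch_id, dict(totals)
```

Results become visible only once per batch interval, which is why the batch length sets a lower bound on the latency of this approach.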
6. Task Scheduling and Monitoring
A big data platform runs numerous tasks—data collection, synchronization, analysis, etc.—with complex dependencies (e.g., analysis must wait for collection). A robust scheduling and monitoring system is essential to orchestrate and track the execution of all these tasks.
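Dependency-aware ordering of this kind can be sketched with Python's standard graphlib module (Python 3.9 or later). The task names and the graph are hypothetical, and a real scheduler would also handle timing, retries, and alerting.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it must wait for.
TASKS = {
    "collect_logs": set(),
    "sync_business_db": set(),
    "analyze_traffic": {"collect_logs"},
    "analyze_orders": {"sync_business_db"},
    "export_to_mysql": {"analyze_traffic", "analyze_orders"},
}


def run_in_waves(tasks):
    """Return tasks grouped into waves; tasks within a wave can run in parallel."""
    sorter = TopologicalSorter(tasks)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)  # mark finished so dependent tasks become ready
    return waves
```

Here analysis tasks are released only after the collection or sync task they depend on has been marked done, matching the "analysis must wait for collection" constraint.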
Architects' Tech Alliance
