What Is a Big Data Platform and How to Design Its Architecture?

This article explains what a big data platform is, outlines the eight components of its overall architecture, walks through the technical stack from data sources to applications, and describes the key subsystems: catalog management, data acquisition, asset management, governance, sharing, development, and analysis.

ITFLY8 Architecture Home

Jun 21, 2021

1. What Is a Big Data Platform

A big data platform typically runs its compute tasks on distributed batch or real‑time frameworks such as Hadoop, Spark, Storm, Flink, and Blink. Its purpose is to serve business needs by solving problems that arise from large data volumes, heavy computation, or semi‑structured and unstructured data.

2. Big Data Platform Architecture Design

Overall Architecture

The platform consists of eight components: catalog management, data integration, data asset management, data governance, data development, data analysis, data sharing, and data security.

Catalog Management – inventory and organize business data, publish data catalogs, guide data acquisition, management, governance, development, and sharing.

Data Integration – provide foundational services for data ingestion, supporting both structured and unstructured data, and preprocessing.

Data Asset Management – manage data standards, metadata, and resources to increase asset value.

Data Governance – standardize data creation and usage, continuously improve data quality.

Data Development – offer development, analysis, and mining functions; non‑technical users can use graphical IDEs.

Data Analysis – provide multi‑level analysis such as basic queries, cross‑summaries, drill‑down, and multidimensional analysis.

Data Sharing – enable exchange of data across departments, formats, and systems.

Data Security – include encryption, masking, backup, and audit logging.
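As an illustration of the masking item under Data Security, here is a minimal sketch. The function names, the sample record, and the salt value are hypothetical and chosen only for this example; a real platform would likely manage salts or keys in a dedicated secrets service.

```python
import hashlib


def mask_phone(phone: str, visible: int = 4) -> str:
    """Replace all but the last `visible` characters with '*'."""
    if len(phone) <= visible:
        return "*" * len(phone)
    return "*" * (len(phone) - visible) + phone[-visible:]


def pseudonymize(value: str, salt: str) -> str:
    """Return a salted SHA-256 hex digest of the value.

    The same input and salt always give the same digest, so masked
    records can still be joined on this field without exposing it.
    """
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


# Hypothetical record used only to show the two functions together.
record = {"name": "Alice", "phone": "13800001234"}
masked = {
    "name": pseudonymize(record["name"], salt="demo-salt"),
    "phone": mask_phone(record["phone"]),
}
```

Partial masking keeps a field recognizable to support staff, while pseudonymization suits fields that are only used as join keys.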

Technical Architecture

The technical stack from bottom to top includes the data source layer, data acquisition layer, data storage layer, data processing layer, and data application layer.

The data source layer includes unstructured data (images, audio, video), typically stored as BLOBs, and semi‑structured data, stored in XML or CLOB fields.

The data acquisition layer aggregates data from heterogeneous sources, using tools such as Sqoop (transfer between Hadoop and relational databases), Flume (log collection), message queues, and Kettle (ETL).

Sqoop – import/export between Hadoop/Hive and relational databases.

Flume – distributed log collection and aggregation.

Message Queue – asynchronous app‑to‑app communication.

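Kettle – open‑source ETL suite whose tools include Spoon (graphical designer), Pan (runs transformations from the command line), Kitchen (runs jobs from the command line), and Chef (job designer in early versions).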
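The asynchronous, app‑to‑app pattern that a message queue provides can be sketched in a single process with Python's standard library. This is only an analogy: `queue.Queue` stands in for a real broker, and the function names and the `None` sentinel are assumptions made for the example.

```python
import queue
import threading


def producer(q: queue.Queue, events: list) -> None:
    """Publish each event, then a None sentinel to signal the end."""
    for event in events:
        q.put(event)
    q.put(None)


def consumer(q: queue.Queue, sink: list) -> None:
    """Consume events until the sentinel arrives; uppercase each one."""
    while True:
        item = q.get()
        if item is None:
            break
        sink.append(item.upper())


def run_pipeline(events: list) -> list:
    """Run one producer and one consumer that share a bounded queue."""
    q = queue.Queue(maxsize=100)  # bounded, so a slow consumer applies backpressure
    sink = []
    threads = [
        threading.Thread(target=producer, args=(q, events)),
        threading.Thread(target=consumer, args=(q, sink)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sink
```

The producer never calls the consumer directly. It only writes to the queue, which is what lets the two sides run and scale independently.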

Data Storage Layer

Storage options include relational databases that use MPP (massively parallel processing) technology for large‑scale parallel workloads, and NoSQL databases in four families: key‑value (Redis), column‑oriented (HBase), document (SequoiaDB), and graph (Neo4j).
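To make the four NoSQL data models concrete, the sketch below represents each with plain Python structures. The keys, field names, and the `neighbors` helper are invented for illustration and do not reflect the APIs of the databases named above.

```python
# Key-value: an opaque value looked up by a single key.
kv_store = {"session:42": "user-7"}

# Column-oriented: row key -> column family -> column -> value.
wide_row = {
    "user-7": {
        "profile": {"name": "Ann", "city": "Oslo"},
        "stats": {"logins": 12},
    }
}

# Document: a self-contained, nested record.
document = {
    "_id": "order-1",
    "customer": "user-7",
    "items": [{"sku": "A1", "qty": 2}],
}

# Graph: nodes with typed edges, stored here as an adjacency list.
graph = {"user-7": [("FOLLOWS", "user-9")], "user-9": []}


def neighbors(g: dict, node: str, relation: str) -> list:
    """Return the nodes reachable from `node` over edges of one type."""
    return [dst for rel, dst in g.get(node, []) if rel == relation]
```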

Distributed file systems like HDFS and FastDFS provide fault‑tolerant, scalable storage for massive files.

Full‑text search engines such as Solr and Elasticsearch offer indexing and search capabilities.
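Full‑text search engines are built around an inverted index, which maps each term to the documents that contain it. The following is a minimal sketch of that idea only. The tokenizer and the AND‑only query semantics are simplifying assumptions, and nothing here mirrors the Solr or Elasticsearch APIs.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: dict) -> dict:
    """Map every term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: dict, query: str) -> set:
    """Return ids of documents containing all query terms (AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents.
docs = {
    1: "Hadoop stores files in HDFS",
    2: "Spark reads files from HDFS",
    3: "Redis is a key-value store",
}
index = build_index(docs)
```

A query such as `search(index, "hdfs files")` only intersects two small sets rather than scanning every document, which is why lookups stay fast as the corpus grows.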

Data Processing Layer

Offline processing runs batch jobs with MapReduce or Spark over data stored in HDFS or an MPP database, often loading the results into Hive for reporting.
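The batch model behind MapReduce can be sketched in plain Python as three phases: map, shuffle, and reduce. This single‑process word count only illustrates the data flow. It is not Hadoop's API, and the function names are chosen for this example.

```python
from collections import defaultdict


def map_phase(line: str):
    """Emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs) -> dict:
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list) -> tuple:
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines: list) -> dict:
    """Run map, shuffle, and reduce over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
```

In a real cluster the map and reduce functions run in parallel on many nodes, and the shuffle moves data across the network. The logic per key stays the same.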

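Real‑time processing handles each record as it arrives, using stream engines such as Storm or Flink, so that results are available shortly after an event occurs rather than after a scheduled batch run.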
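One common real‑time pattern is counting events per fixed (tumbling) time window and emitting each window as soon as it closes. The sketch below assumes events arrive in timestamp order as `(timestamp_seconds, key)` tuples. Both the input shape and the function name are assumptions for illustration, and stream engines handle out‑of‑order data with mechanisms not shown here.

```python
def windowed_counts(events, window_seconds: int):
    """Yield (window_start, counts) each time a window closes.

    Assumes events are ordered by timestamp. Windows that receive
    no events are not emitted.
    """
    current_start = None
    counts = {}
    for ts, key in events:
        start = (ts // window_seconds) * window_seconds
        if current_start is None:
            current_start = start
        elif start != current_start:
            # An event for a later window arrived: close the current one.
            yield current_start, counts
            current_start, counts = start, {}
        counts[key] = counts.get(key, 0) + 1
    if current_start is not None:
        # Flush the final window when the stream ends.
        yield current_start, counts
```

Because the function is a generator, a downstream consumer can act on each closed window immediately instead of waiting for the whole stream.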

3. Big Data Platform System Design

Key Sub‑systems

Catalog Management System – creates and publishes business data catalogs, supports classification, review, and release.

Data Acquisition System – builds scalable data transmission channels.

Data Asset Management System – manages standards, metadata, resources, and asset inventory.

Data Governance System – enforces data quality rules, cleansing, and processing.

Data Sharing System – based on data resource catalogs, supports cross‑department, cross‑level, and cross‑region data exchange with discovery, request, usage, and download capabilities.

Data Development System – uses big‑data or AI components for analysis and mining, offering IDE, component libraries, and scheduling.

Data Analysis System – connects to various databases and warehouses, providing drag‑and‑drop analytics, statistical charts, and multidimensional visualizations.
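The scheduling capability listed for the Data Development System usually means running tasks in dependency order. A minimal sketch using Python's standard `graphlib` module (Python 3.9+) follows. The task names and the `run` helper are hypothetical, and a production scheduler would add retries, timing, and parallel execution.

```python
from graphlib import TopologicalSorter


def execution_order(dependencies: dict) -> list:
    """Return task names ordered so that every task follows its prerequisites.

    `dependencies` maps each task to the set of tasks it depends on.
    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


def run(dependencies: dict, tasks: dict) -> list:
    """Call each task in dependency order and collect (name, result) pairs."""
    log = []
    for name in execution_order(dependencies):
        log.append((name, tasks[name]()))
    return log


# Hypothetical four-step pipeline: each step depends on the previous one.
dependencies = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "report": {"aggregate"},
}
tasks = {name: (lambda n=name: f"{n} done") for name in dependencies}
```

Declaring dependencies rather than a fixed order lets the scheduler detect cycles up front and, in a fuller implementation, run independent tasks concurrently.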

Source: https://zhuanlan.zhihu.com/p/81999971
