Comprehensive Overview of Data Middle Platform Architecture, Components, and Practices
This article summarizes the main concepts of the data middle platform: data aggregation, collection tools, offline and real-time development, data governance, the service layer, warehouse construction, and operational practices, showing how enterprises build and manage a unified data ecosystem.
The content below summarizes data middle platform theory, drawing on the book "Data Middle Platform".
Data Middle Platform
Data Aggregation
Data aggregation is a core capability of the data middle platform: it collects data from heterogeneous sources across the network into centralized storage for downstream processing and modeling. Aggregation methods include database synchronization, embedded event tracking, web crawling, and message queues; by timeliness, they fall into offline batch aggregation and real-time collection.
Data Collection Tools
Canal, DataX, Sqoop
Data Development
The data development module serves developers and analysts, offering offline, real‑time, and algorithm development tools.
Offline Development
Job Scheduling
• Dependency scheduling: a job starts only after all its parent jobs have completed. For example, Job B can be scheduled only after Jobs A and C finish.
• Time scheduling: a job can be set to start at a specific time, e.g., Job B starts after 05:00.
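The two rules above can be sketched in a few lines. This is a minimal illustration, not an actual scheduler API: the job names, the `deps` structure, and the `ready_jobs` helper are all assumptions made for the example.

```python
from datetime import time

def ready_jobs(deps, completed, now, start_after=None):
    """Return jobs whose parents have all finished and whose
    earliest start time (if any) has passed.

    deps        -- {job: set of parent jobs}
    completed   -- set of finished jobs
    now         -- current wall-clock time (datetime.time)
    start_after -- {job: earliest datetime.time}, optional
    """
    start_after = start_after or {}
    ready = []
    for job, parents in deps.items():
        if job in completed:
            continue
        if not parents <= completed:       # dependency scheduling rule
            continue
        earliest = start_after.get(job)
        if earliest is not None and now < earliest:  # time scheduling rule
            continue
        ready.append(job)
    return ready
```

With `deps = {"A": set(), "B": {"A", "C"}, "C": set()}` and B constrained to start after 05:00, B becomes ready only once A and C have completed and the clock has passed 05:00.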
Baseline Control
In long‑running big‑data offline jobs, predictive algorithms estimate completion times; when a job cannot finish on time, the scheduler alerts operations staff for early intervention.
Heterogeneous Storage
Enterprise storage engines are diverse. The offline development center builds specific plugins for each engine (e.g., Oracle plugin, Hive/Spark/MR plugins for Hadoop). Users create jobs via the UI, and the system automatically selects the appropriate plugin at execution time.
Code Validation
SQL tasks undergo strict pre‑execution checks to detect issues early.
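A minimal sketch of such pre-execution checks. The three rules chosen here (no `SELECT *`, no `DELETE`/`UPDATE` without `WHERE`, balanced parentheses) are illustrative assumptions, not the platform's actual validation rules.

```python
import re

def validate_sql(sql):
    """Return a list of problems found in a SQL statement."""
    problems = []
    text = sql.strip().rstrip(";")
    lowered = text.lower()
    if re.search(r"\bselect\s+\*", lowered):
        problems.append("avoid SELECT *: list columns explicitly")
    if lowered.startswith(("delete", "update")) and " where " not in lowered:
        problems.append("DELETE/UPDATE without WHERE clause")
    if text.count("(") != text.count(")"):
        problems.append("unbalanced parentheses")
    return problems
```

A real validator would parse the statement properly and also check table existence, column types, and permissions before the job is allowed to run.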
Multi‑Environment Cascading
Supports various environment needs with isolated Hive databases, YARN queues, or even separate Hadoop clusters. Environments include:
• Single environment: only one production environment.
• Classic environment: development with masked data, production with real data.
• Complex environment: external users get a masked environment; after testing, models are promoted to internal development.
Recommended Dependencies
As the business deepens, developers must manage a growing number of jobs. The system helps them locate upstream jobs and avoid circular dependencies by analyzing table-level lineage graphs, running loop detection, and returning a list of suitable candidate nodes.
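The loop-detection step can be sketched as a reachability check over the lineage graph: adding a dependency edge from a parent table to a child creates a cycle exactly when the child can already reach the parent. The graph representation and table names below are illustrative.

```python
def reaches(edges, src, dst):
    """Depth-first search: can dst be reached from src?

    edges -- {table: set of downstream tables}
    """
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(edges.get(node, ()))
    return False

def would_create_cycle(edges, parent, child):
    """Adding the edge parent -> child creates a cycle
    iff child already reaches parent."""
    return reaches(edges, child, parent)
```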
Data Permissions
Multiple engines have separate permission systems (e.g., Oracle, HANA, LibrA), making permission requests cumbersome. Strategies include:
• RBAC (Role-Based Access Control) – e.g., Cloudera Sentry, Huawei FI.
• PBAC (Policy-Based Access Control) – e.g., Hortonworks Ranger.
Permissions are usually managed by big‑data or database ops staff; developers request access through a centralized permission‑management portal, which records approvals for auditing.
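A minimal RBAC sketch in the spirit of the Sentry-style model mentioned above: roles carry privileges, users carry roles, and an access check is an intersection of the two. The role names, users, and privilege tuples are illustrative only.

```python
# Illustrative role and user tables; a real deployment would load
# these from the centralized permission-management portal.
ROLE_GRANTS = {
    "analyst": {("hive", "sales_db", "select")},
    "etl":     {("hive", "sales_db", "select"),
                ("hive", "sales_db", "insert")},
}
USER_ROLES = {"alice": {"analyst"}, "bob": {"etl"}}

def is_allowed(user, engine, database, action):
    """A user may act if any of their roles grants the privilege."""
    return any((engine, database, action) in ROLE_GRANTS.get(role, set())
               for role in USER_ROLES.get(user, set()))
```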
Real‑Time Development
• Metadata management
• SQL-driven development
• Component-based development
Intelligent Operations
Integrated tools for job management, code deployment, operations, monitoring, and alerting improve efficiency. Features include re‑run, downstream re‑run, and data back‑fill.
Data System
Built on the aggregation and development modules, the middle platform provides core data-warehouse capabilities and supports a comprehensive enterprise data system: full-domain coverage, a clear hierarchical structure, consistent and accurate data, better performance, lower cost, and ease of use. The approach applies across industries such as real estate, securities, retail, manufacturing, and media.
ODS Layer (Raw Data)
Collects source system data, preserving original business process information with minimal transformation; supports incremental sync with delta tables for large datasets.
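The incremental-sync idea can be illustrated with a snapshot-plus-delta merge: yesterday's full snapshot is combined with today's delta table, keyed by primary key, so changed rows overwrite and new rows append. The row shapes and column names below are assumptions for the example.

```python
def merge_snapshot(full, delta, key="id"):
    """Merge a full snapshot with a delta table.

    Delta rows overwrite snapshot rows with the same key;
    rows with new keys are appended. Returns rows sorted by key.
    """
    merged = {row[key]: row for row in full}
    for row in delta:
        merged[row[key]] = row          # upsert by primary key
    return sorted(merged.values(), key=lambda r: r[key])
```

In Hive this is typically expressed as a `FULL OUTER JOIN` (or `row_number()` dedup) between the previous partition and the delta partition, writing a new full partition each day.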
Unified Data Warehouse Layer (DW)
Includes detailed (DWD) and summary (DWS) layers, reorganizing source data into standardized metrics and dimensions for unified business reporting.
Application Data Layer (ADS)
Extracts data from DW/TDM to serve specific business needs, providing tailored datasets for downstream applications.
Data Asset Management
Manages catalogs, metadata, quality, lineage, and lifecycle, presenting assets visually to enhance data awareness and support value‑driven applications.
Data Governance
Covers standard management, metadata, quality, security, and lifecycle governance.
Data Service System
Transforms data into service capabilities, exposing APIs for query and analysis.
Query Service
Accepts query conditions and returns data via API, supporting indexed identifiers, filter items, sorting, and pagination.
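The contract described above (filters, sorting, pagination over an indexed dataset) can be sketched in plain Python. The parameter names and shapes are assumptions about what such a query API might look like, not a real interface.

```python
def query(rows, filters=None, sort_by=None, descending=False,
          page=1, page_size=10):
    """Filter rows by exact match, sort, then return one page."""
    result = list(rows)
    for field, value in (filters or {}).items():
        result = [r for r in result if r.get(field) == value]
    if sort_by:
        result.sort(key=lambda r: r[sort_by], reverse=descending)
    start = (page - 1) * page_size          # pages are 1-indexed
    return result[start:start + page_size]
```

A production query service would push the filters and sort down to the storage engine's index rather than scanning in application code; the sketch only shows the API semantics.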
Analysis Service
Provides high‑performance multi‑source analysis (Hive, ES, Greenplum, MySQL, Oracle, files) with instant queries, multi‑dimensional analysis, and flexible business integration.
Recommendation Service
Delivers personalized recommendations by mining user‑item behavior, supporting industry‑specific logic and various scenarios (cold start, active browsing), with continuous model optimization.
Crowd‑Targeting Service
Filters users based on tag combinations, supports audience sizing, and integrates with multiple channels (SMS, email, marketing platforms).
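Tag-combination filtering and audience sizing map naturally onto set algebra: each tag indexes the set of user ids carrying it, and an audience is the intersection of required tags minus any excluded tags. The tag names and index structure below are illustrative.

```python
def build_audience(tag_index, all_of=(), none_of=()):
    """Build an audience from a tag index.

    tag_index -- {tag: set of user ids}
    all_of    -- tags a user must carry (intersection)
    none_of   -- tags that exclude a user (difference)
    Returns (audience set, audience size).
    """
    if not all_of:
        return set(), 0
    audience = set.intersection(*(tag_index.get(t, set()) for t in all_of))
    for t in none_of:
        audience -= tag_index.get(t, set())
    return audience, len(audience)
```

The size is what the audience-sizing feature reports before the user set is pushed to a delivery channel such as SMS or email.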
Offline Platform
Covers the product function diagram, the scheduling module, the overall architecture, FTP-based task dependencies, and the "Huatuo" diagnostic platform for task analysis.
Real‑Time Platform
Meituan‑Dianping
Uses Grafana for embedded monitoring.
Bilibili
Features SQL‑based programming, DAG drag‑and‑drop, integrated operations; built on BSQL, YARN, Flink, Kafka, HBase, Redis, RocksDB, and supports AI, search, recommendation, and real‑time ETL scenarios.
NetEase
Real‑time stream processing covers advertising, e‑commerce, search, and recommendation workloads.
Event Management
Coordinates Server (request initiator), Kernel (executor), and Admin (result confirmer) modules to ensure reliable distributed task execution with high availability.
Platform Task State Management
Server handles initial state; Admin manages YARN‑related interactions.
Task Debugging
SQL tasks support debugging with custom CSV inputs; sloth‑server assembles requests, invokes kernels, and collects logs.
Log Retrieval
Filebeat ships logs to Kafka, Logstash parses them, and Elasticsearch stores them for Kibana visualization and search.
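A toy version of the Logstash parsing step in this pipeline: turning a raw task-log line into a structured record suitable for indexing in Elasticsearch. The log format assumed here is an illustration, not the platform's actual format.

```python
import re

# Assumed line format: "<date> <time> <LEVEL> <task-id> <message>"
LOG_PATTERN = re.compile(
    r"^(?P<ts>\S+ \S+) (?P<level>[A-Z]+) (?P<task>\S+) (?P<msg>.*)$")

def parse_log_line(line):
    """Return a dict of structured fields, or None on no match."""
    m = LOG_PATTERN.match(line.strip())
    return m.groupdict() if m else None
```

In the real pipeline this parsing lives in a Logstash grok filter; the structured fields are then stored in Elasticsearch and searched through Kibana.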
Monitoring
Metrics are collected through an InfluxDB-based reporting component and NetEase's in-house NTSDB time-series database, and are viewable via Grafana and the alerting modules.
Alerting
Sloth stream platform supports failure, latency, and custom rule alerts, delivering notifications through internal chat, email, phone, or SMS.
Real‑Time Data Warehouse
Collects logs and event data into Kafka, processes them in real‑time, extracts ODS details, aggregates into Redis, Kudu, etc., and serves data via APIs to front‑end applications.
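The aggregation step in this flow can be sketched in pure Python: events as they would arrive from Kafka are rolled up into per-key, per-minute counters, mimicking what would then be written to Redis or Kudu for API serving. The event fields are assumptions for the example.

```python
from collections import Counter

def aggregate_events(events):
    """Count page views per (page, minute) window.

    events -- iterable of {"page": str, "ts": int seconds}
    """
    counts = Counter()
    for e in events:
        minute = e["ts"] - e["ts"] % 60     # truncate to the minute
        counts[(e["page"], minute)] += 1
    return counts
```

A real job would do this continuously in Flink with windowed state and write each updated counter to Redis, rather than batching a finished list.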
E‑Commerce Applications – Data Analysis
Real‑time activity analysis, homepage resource analysis, funnel metrics, and profit calculations.
E‑Commerce Applications – Search & Recommendation
Handles real‑time user footprints, features, CTR/CVR modeling, homepage carousel, and activity selection with UV/PV statistics.
Offline vs. Real‑Time Data Warehouse
Building an Offline Warehouse
Defines a data warehouse as a subject-oriented, integrated, time-variant, read-only data collection that supports decision making. Its goals include building data assets and providing decision information. ETL bridges the offline and real-time pipelines, moving data across the warehouse layers.
ETL
Supports diverse sources (text, logs, RDBMS, NoSQL) using tools like DataX, Sqoop, Kettle, Informatica, ensuring scheduled, non‑blocking data sync.
Layered Architecture
ODS (raw), Stage (buffer), DWD (detail), DIM (dimension), DW (fact), DM (application) layers each serve specific processing and storage purposes.
Code Standards
Enforces script header comments, naming conventions, and field naming consistency across models.
Differences Between Offline and Real‑Time Warehouses
Offline warehouses use Sqoop/DataX/Hive for T+1 data with daily batch jobs; real‑time warehouses ingest raw data via Canal into Kafka, store in OLAP systems like HBase, and provide minute‑level or sub‑second query capabilities.
Data Middle Platform Solutions
Industry‑specific implementations (retail, securities, etc.) with metrics such as RPS (Revenue Per Search) and ROI (Return on Investment).
Disclaimer: Thanks to the original author for the content. If there are copyright issues, please contact us.
Architects' Tech Alliance