
Unlocking the Power of Data Middle Platforms: Key Concepts and Best Practices

This article provides a comprehensive overview of data middle platforms, covering data aggregation, collection tools, offline and real‑time development, scheduling, baseline control, heterogeneous storage, data governance, service layers, monitoring, and the architectural differences between offline and real‑time data warehouses.

ITFLY8 Architecture Home

May 3, 2021

Data Middle Platform

The data middle platform (also called the data middle office) supplies core capabilities such as data aggregation, unified storage, and processing to support downstream analytics and applications.

Data Aggregation

Data aggregation collects data from heterogeneous networks and sources into the platform for centralized storage, supporting both batch and real‑time ingestion.

Methods: database sync, event tracking, web crawling, message queues.

Modes: offline batch aggregation and real‑time collection.

Data Collection Tools

Canal

DataX

Sqoop

Data Development

Provides offline, real‑time, and algorithm development tools for developers and analysts.

Offline Development

Job Scheduling

Dependency scheduling: a job starts only after all parent jobs finish.

Time scheduling: a job can be set to start at a specific time.
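As a minimal sketch of dependency scheduling, the standard-library graphlib module (Python 3.9+) can produce an order in which every job runs only after all of its parents. The job names below are hypothetical, and a real scheduler would also handle time triggers, retries, and parallel execution.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: each key maps to the set of parent jobs it depends on.
dependencies = {
    "ods_orders": set(),
    "ods_users": set(),
    "dwd_orders": {"ods_orders", "ods_users"},
    "dws_daily_sales": {"dwd_orders"},
}

def run_order(deps):
    """Return a job order in which every job appears after all its parents."""
    return list(TopologicalSorter(deps).static_order())

order = run_order(dependencies)
print(order)
```

Jobs with no unfinished parents (here the two ODS jobs) may appear in either order, which is where a scheduler is free to run them in parallel.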

Baseline Control

Predicts job completion time with algorithms; alerts operations when jobs cannot finish on schedule.
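The article does not say which prediction algorithm is used. The sketch below assumes the simplest possible one, the mean of past run durations, and flags a job whose predicted finish time falls after its baseline deadline. All names and values are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean

def predicted_finish(start, past_durations_min):
    """Estimate finish time as start plus the mean of past run durations."""
    return start + timedelta(minutes=mean(past_durations_min))

def breaches_baseline(start, past_durations_min, deadline):
    """True when the predicted finish time falls after the baseline deadline."""
    return predicted_finish(start, past_durations_min) > deadline

start = datetime(2021, 5, 3, 1, 0)
deadline = datetime(2021, 5, 3, 2, 0)  # hypothetical baseline: done by 02:00
history = [50, 70, 75]                 # hypothetical past durations in minutes

if breaches_baseline(start, history, deadline):
    print("ALERT: job is predicted to miss its baseline")
```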

Heterogeneous Storage

Supports multiple storage and compute engines (Oracle, Hive, Spark, MapReduce) via dedicated plugins that are automatically selected based on job type.
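One common way to select a plugin by job type is a registry keyed on the type name. The sketch below is an illustration only: the decorator, the handler names, and the returned strings are hypothetical stand-ins for real engine submission logic.

```python
# Registry mapping a job type to its plugin (all names are hypothetical).
PLUGINS = {}

def register(job_type):
    """Decorator that registers a handler for one job type."""
    def wrap(handler):
        PLUGINS[job_type] = handler
        return handler
    return wrap

@register("hive")
def submit_hive(script):
    # Stand-in for submitting the script to Hive.
    return f"[hive] {script}"

@register("spark")
def submit_spark(script):
    # Stand-in for submitting the script to Spark.
    return f"[spark] {script}"

def submit(job_type, script):
    """Pick the plugin automatically from the job type."""
    try:
        handler = PLUGINS[job_type]
    except KeyError:
        raise ValueError(f"no plugin registered for job type {job_type!r}")
    return handler(script)

print(submit("hive", "SELECT 1"))
```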

Code Validation

SQL checkers enforce strict pre‑execution validation for common SQL tasks.
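The article does not list the rules its SQL checker enforces. As a rough sketch, a pre-execution check can be a set of patterns that each flag one risky construct. The three rules below are assumptions chosen for illustration, and a production checker would parse the SQL rather than rely on regular expressions.

```python
import re

# Hypothetical rules as (pattern, message) pairs.
RULES = [
    (re.compile(r"^\s*(delete|update)\b(?!.*\bwhere\b)", re.I | re.S),
     "DELETE/UPDATE without WHERE"),
    (re.compile(r"\bselect\s+\*", re.I), "SELECT * is not allowed"),
    (re.compile(r"^\s*drop\s+table\b", re.I), "DROP TABLE is not allowed"),
]

def check_sql(sql):
    """Return the list of rule violations for one SQL statement."""
    return [message for pattern, message in RULES if pattern.search(sql)]

for statement in ("DELETE FROM orders", "SELECT id FROM orders"):
    print(statement, "->", check_sql(statement) or "ok")
```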

Multi‑Environment Cascade

Single environment: one production environment.

Classic environment: development (masked data) → production (real data).

Complex environment: external and internal users with separate controlled spaces.

Recommended Dependencies

Build a table‑level lineage graph of upstream and downstream jobs.

Identify suitable upstream jobs via lineage analysis.

Detect and remove circular dependencies.

Return a list of appropriate dependency nodes.
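The steps above can be sketched as follows, assuming each job declares the tables it reads and writes. The job metadata is hypothetical. The sketch only detects a cycle and leaves its removal to the caller.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical job metadata: the tables each job reads and writes.
JOBS = {
    "load_orders": {"reads": [], "writes": ["ods_orders"]},
    "clean_orders": {"reads": ["ods_orders"], "writes": ["dwd_orders"]},
    "daily_sales": {"reads": ["dwd_orders"], "writes": ["dws_daily_sales"]},
}

def recommend_upstream(jobs):
    """Map each job to the jobs that write the tables it reads."""
    producer = {t: name for name, job in jobs.items() for t in job["writes"]}
    return {
        name: sorted({producer[t] for t in job["reads"] if t in producer} - {name})
        for name, job in jobs.items()
    }

def find_cycle(upstream):
    """Return the nodes of one dependency cycle, or None if there is none."""
    try:
        TopologicalSorter(upstream).prepare()
    except CycleError as err:
        return err.args[1]  # list of nodes; first and last are the same
    return None

deps = recommend_upstream(JOBS)
print(deps)
print(find_cycle(deps))
```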

Data Permissions

Different engines have independent permission systems (e.g., Oracle, HANA, LibrA).

Role‑based access control (RBAC, e.g., Apache Sentry as shipped by Cloudera) and policy‑based access control (PBAC, e.g., Apache Ranger as shipped by Hortonworks) are common strategies.

Permissions are usually managed by data‑cluster or DB admins; a centralized permission center provides UI for requests, approvals, and auditing.
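To make the RBAC idea concrete, here is a minimal sketch in which users hold roles and roles hold (table, action) grants. The users, roles, and tables are hypothetical, and the sketch does not reflect the API of Sentry, Ranger, or any specific permission center.

```python
# Hypothetical grants: role -> set of (table, action) pairs.
ROLE_GRANTS = {
    "analyst": {("dws_daily_sales", "select")},
    "engineer": {("dwd_orders", "select"), ("dwd_orders", "insert")},
}
# Hypothetical role assignments: user -> set of roles.
USER_ROLES = {"alice": {"analyst"}, "bob": {"analyst", "engineer"}}

def is_allowed(user, table, action):
    """A user may act on a table if any of their roles grants (table, action)."""
    return any(
        (table, action) in ROLE_GRANTS.get(role, set())
        for role in USER_ROLES.get(user, set())
    )

print(is_allowed("alice", "dwd_orders", "insert"))
print(is_allowed("bob", "dwd_orders", "insert"))
```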

Real‑time Development

Metadata management

SQL‑driven development

Component‑based modular design

Intelligent Operations

Integrated tools for task management, code deployment, monitoring, and alerting improve efficiency; features include job rerun, downstream rerun, and data back‑fill.
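A downstream rerun needs the set of jobs that depend, directly or indirectly, on the job being rerun. A minimal sketch, assuming the scheduler exposes a parent-to-children mapping (the mapping and job names here are hypothetical):

```python
from collections import deque

# Hypothetical scheduling graph: job -> jobs that directly depend on it.
CHILDREN = {
    "ods_orders": ["dwd_orders"],
    "dwd_orders": ["dws_daily_sales", "dws_user_orders"],
    "dws_daily_sales": ["ads_sales_report"],
}

def downstream_jobs(children, start):
    """Return every job reachable from `start`, in breadth-first order."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        job = queue.popleft()
        for child in children.get(job, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
                order.append(child)
    return order

print(downstream_jobs(CHILDREN, "dwd_orders"))
```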

Data System Architecture

The platform builds a full data ecosystem consisting of ODS (raw source layer), DW (unified warehouse), TDM (tag data model), and ADS (application data layer).

ODS Layer

Collects raw business data with minimal transformation; table names follow the pattern ODS_<system>_<table>. Incremental tables use a _delta suffix.
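The naming rule can be captured in a small helper. The function name and the example system and table names are hypothetical. The pattern and the _delta suffix come from the convention described above.

```python
def ods_table_name(system, table, incremental=False):
    """Build ODS_<system>_<table>, adding _delta for incremental tables."""
    name = f"ODS_{system}_{table}"
    return name + "_delta" if incremental else name

print(ods_table_name("crm", "customer"))
print(ods_table_name("crm", "customer", incremental=True))
```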

DW Layer

DWD – detailed data layer

DWS – summary data layer

Re‑organizes business data into consistent metrics and dimensions for unified analysis.

TDM Layer

Object‑oriented modeling creates a cross‑domain tag system by linking IDs across business domains.

ADS Layer

Extracts data from DW/TDM to serve specific business needs, delivering APIs for downstream applications.

Data Asset Management

Manages catalogs, metadata, quality, lineage, and lifecycle, providing a visual overview of enterprise data assets.

Data Governance

Includes standard management, metadata, quality, security, and lifecycle governance.

Data Service System

Query Service

Provides API‑based data retrieval with configurable identifiers, filters, sorting, and pagination.
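As an illustration of what such an API does behind the scenes, the sketch below applies equality filters, then sorting, then 1-based pagination to in-memory rows. The parameter names and sample rows are assumptions, and a real service would push these operations down to the storage engine.

```python
# Hypothetical sample rows.
ROWS = [
    {"id": 1, "city": "A", "amount": 30},
    {"id": 2, "city": "B", "amount": 10},
    {"id": 3, "city": "A", "amount": 20},
    {"id": 4, "city": "A", "amount": 40},
]

def query(rows, filters=None, sort_by=None, descending=False, page=1, page_size=10):
    """Apply equality filters, then sorting, then 1-based pagination."""
    filters = filters or {}
    matched = [r for r in rows if all(r.get(k) == v for k, v in filters.items())]
    if sort_by is not None:
        matched.sort(key=lambda r: r[sort_by], reverse=descending)
    start = (page - 1) * page_size
    return {"total": len(matched), "items": matched[start:start + page_size]}

result = query(ROWS, filters={"city": "A"}, sort_by="amount",
               descending=True, page=1, page_size=2)
print(result)
```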

Analysis Service

Supports multi‑source connections (Hive, ES, Greenplum, MySQL, Oracle, files).

High‑performance ad‑hoc queries for billion‑row datasets.

Multi‑dimensional analysis and deep data mining.

Flexible integration with business systems.

Recommendation Service

Generates personalized recommendations (the "thousand users, thousand faces" idea, in which each user sees different results) by mining user‑item interactions; supports industry‑specific logic, cold‑start and active‑user scenarios, and continuous model optimization.

Audience Segmentation Service

Filters users based on tag combinations, offers count metrics, and can export results to downstream channels (SMS, WeChat, marketing platforms).
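Tag-based filtering maps naturally onto set operations. The sketch below assumes a hypothetical index from each tag to the set of user IDs carrying it, and supports "all of", "any of", and "none of" conditions. The returned set gives both the audience and its count.

```python
# Hypothetical tag index: tag -> set of user IDs carrying that tag.
TAG_INDEX = {
    "female": {1, 2, 3},
    "vip": {2, 3, 4},
    "churned": {3},
}

def segment(index, all_of=(), any_of=(), none_of=()):
    """Return the users matching a combination of tag conditions."""
    users = set().union(*index.values())  # start from every known user
    for tag in all_of:
        users &= index.get(tag, set())
    if any_of:
        users &= set().union(*(index.get(t, set()) for t in any_of))
    for tag in none_of:
        users -= index.get(tag, set())
    return users

audience = segment(TAG_INDEX, all_of=["female", "vip"], none_of=["churned"])
print(len(audience), sorted(audience))
```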

Offline Platform

Provides job orchestration, scheduling, dependency management, and debugging tools. Visual diagrams illustrate product functions, scheduling modules, and overall architecture.

Real‑time Platform

Examples from Meituan‑Dianping, Bilibili, and NetEase demonstrate:

SQL‑based and DAG‑based programming.

Unified metadata, lineage, permission, and operation management.

Data ingestion via App logs, DB binlog, server logs, and Kafka.

Computation built on Flink/Saber with YARN scheduling.

State handling using RocksDB, MapDB, Redis.

Outputs to Kafka, HBase, ES, MySQL, TiDB for downstream AI/BI.

Event Management

Three modules coordinate task actions: Server (receives and validates events), Kernel (executes shell scripts on the cluster), and Admin (confirms results and ensures high availability via redundancy and hot‑standby).

Platform Task Status Management

Server initiates tasks, Admin monitors YARN status, and state transitions are reflected in the UI.
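State tracking of this kind is often modeled as a small state machine that rejects illegal transitions. The article does not list the platform's actual states, so the state names and allowed transitions below are assumptions made for illustration.

```python
from enum import Enum

class Status(Enum):
    SUBMITTED = "submitted"
    RUNNING = "running"
    FINISHED = "finished"
    FAILED = "failed"
    KILLED = "killed"

# Hypothetical transition table; states without an entry are terminal.
ALLOWED = {
    Status.SUBMITTED: {Status.RUNNING, Status.FAILED, Status.KILLED},
    Status.RUNNING: {Status.FINISHED, Status.FAILED, Status.KILLED},
}

def transition(current, new):
    """Return the new status, or raise if the transition is not allowed."""
    if new not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    return new

status = transition(Status.SUBMITTED, Status.RUNNING)
status = transition(status, Status.FINISHED)
print(status.value)
```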

Task Debugging

SQL tasks can be debugged with custom CSV inputs; the sloth‑server assembles requests, invokes the kernel, and collects logs.

Log Retrieval

Filebeat ships node logs to Kafka, Logstash parses them, and Elasticsearch stores them for Kibana visualization and UI search.

Monitoring & Alerting

Metrics are collected via InfluxDB (or NetEase’s NTSDB) and displayed in Grafana; alerts can be sent through internal chat, email, SMS, or phone, with configurable thresholds for QPS, latency, and failure events.
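Threshold-based alerting reduces to comparing each collected metric against its configured limit. A minimal sketch, with hypothetical metric names and threshold values, that leaves delivery (chat, email, SMS, phone) out of scope:

```python
# Hypothetical thresholds; real values would be configured per task.
THRESHOLDS = {"min_qps": 100, "max_latency_ms": 500, "max_failures": 0}

def evaluate(metrics, thresholds=THRESHOLDS):
    """Return one alert message per breached threshold."""
    alerts = []
    if metrics["qps"] < thresholds["min_qps"]:
        alerts.append(f"QPS {metrics['qps']} below {thresholds['min_qps']}")
    if metrics["latency_ms"] > thresholds["max_latency_ms"]:
        alerts.append(
            f"latency {metrics['latency_ms']} ms above {thresholds['max_latency_ms']} ms"
        )
    if metrics["failures"] > thresholds["max_failures"]:
        alerts.append(f"{metrics['failures']} failure event(s)")
    return alerts

print(evaluate({"qps": 50, "latency_ms": 120, "failures": 0}))
```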

Offline vs. Real‑time Warehouse

Offline warehouses (Sqoop, DataX, Hive) provide T+1 data, meaning each day's data becomes available the following day after a daily batch load, while real‑time warehouses ingest via Canal/Kafka into low‑latency stores (HBase, Redis) for minute‑level queries.

ETL

ETL tools (DataX, Sqoop, Kettle, Informatica) handle diverse sources (text, logs, RDBMS, NoSQL) and schedule incremental loads to avoid resource contention.
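One widely used way to implement incremental loads is a watermark: each run extracts only rows changed since the last recorded timestamp, then advances the watermark. The article does not say which mechanism its tools use, so this is a sketch of the general technique with hypothetical column names and data.

```python
# Hypothetical source rows; ISO-8601 date strings sort chronologically.
SOURCE = [
    {"id": 1, "updated_at": "2021-05-01"},
    {"id": 2, "updated_at": "2021-05-02"},
    {"id": 3, "updated_at": "2021-05-03"},
]

def incremental_rows(rows, last_watermark):
    """Return rows updated after the watermark, plus the new watermark."""
    fresh = [r for r in rows if r["updated_at"] > last_watermark]
    new_mark = max((r["updated_at"] for r in fresh), default=last_watermark)
    return fresh, new_mark

batch, watermark = incremental_rows(SOURCE, "2021-05-01")
print([r["id"] for r in batch], watermark)
```

Running the function again with the returned watermark yields an empty batch until new rows arrive, which keeps each load small.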

Layered Architecture

Stage buffer layer – transactional daily increments.

ODS – raw data aligned with online sources.

DWD/DW – dimensional and fact models (star schema).

DA – application‑oriented aggregated layer.

Code Standards

Script header, encoding, and comment conventions.

One table per file; file name matches table name.

Clear field naming to avoid synonyms and ambiguity.

Key Differences

Offline warehouses rely on batch tools (Sqoop, DataX) for T+1 data.

Real‑time warehouses use streaming tools (Canal, Kafka) and low‑latency stores for minute‑level queries.


