Unlocking the Power of Data Middle Platforms: Key Concepts and Best Practices
This article provides a comprehensive overview of data middle platforms, covering data aggregation, collection tools, offline and real‑time development, scheduling, baseline control, heterogeneous storage, data governance, service layers, monitoring, and the architectural differences between offline and real‑time data warehouses.
Data Middle Platform
The data middle platform supplies core capabilities such as data aggregation, unified storage, and processing to support downstream analytics and applications.
Data Aggregation
Data aggregation collects data from heterogeneous networks and source systems into the platform for centralized storage, through both batch and real‑time ingestion.
Methods: database sync, event tracking, web crawling, message queues.
Modes: offline batch aggregation and real‑time collection.
Data Collection Tools
Canal
DataX
Sqoop
Data Development
Provides offline, real‑time, and algorithm development tools for developers and analysts.
Offline Development
Job Scheduling
Dependency scheduling: a job starts only after all parent jobs finish.
Time scheduling: a job can be set to start at a specific time.
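As a minimal sketch of dependency scheduling, the snippet below uses Python's standard `graphlib` module to release a job only after all of its parents have finished. The job names and the `run_in_dependency_order` helper are hypothetical, and a real scheduler would launch work where the comment indicates.

```python
from graphlib import TopologicalSorter

# Hypothetical job graph: each job maps to the set of parent jobs it waits for.
deps = {
    "ods_orders": set(),
    "ods_users": set(),
    "dwd_orders": {"ods_orders", "ods_users"},
    "dws_daily_sales": {"dwd_orders"},
}

def run_in_dependency_order(deps):
    """Return job names in an order where every job follows all of its parents."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    executed = []
    while sorter.is_active():
        # get_ready() yields only jobs whose parents are all marked done.
        for job in sorted(sorter.get_ready()):
            executed.append(job)  # a real scheduler would launch the job here
            sorter.done(job)      # marking it done unlocks its children
    return executed

order = run_in_dependency_order(deps)
print(order)
```

Time scheduling could be layered on top by holding a ready job until its configured start time has passed; that part is left out of the sketch.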
Baseline Control
Predicts job completion time with algorithms; alerts operations when jobs cannot finish on schedule.
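The article does not say which prediction algorithm is used. As one simple assumption, the sketch below estimates the finish time from the mean of past run durations and flags a job whose predicted finish falls after its baseline deadline. The function names and sample values are illustrative only.

```python
from datetime import datetime, timedelta
from statistics import mean

def predict_finish(start, past_durations_min):
    """Estimate finish time as start plus the mean of past durations (minutes)."""
    return start + timedelta(minutes=mean(past_durations_min))

def breaches_baseline(start, past_durations_min, baseline):
    """True when the predicted finish time falls after the baseline deadline."""
    return predict_finish(start, past_durations_min) > baseline

start = datetime(2024, 1, 1, 2, 0)     # job started at 02:00
baseline = datetime(2024, 1, 1, 3, 0)  # must be finished by 03:00
history = [50, 70, 90]                 # past runs averaged 70 minutes

late = breaches_baseline(start, history, baseline)
print(late)  # an alert would be raised when this is True
```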
Heterogeneous Storage
Supports multiple compute engines (Oracle, Hive, Spark, MapReduce) via dedicated plugins that are automatically selected based on job type.
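One way to picture plugin selection by job type is a small registry keyed on the type name. This is a sketch under assumptions: the `plugin` decorator, the `dispatch` function, and the runner functions are hypothetical, and the runners only return a string instead of talking to a real engine.

```python
# Hypothetical plugin registry: maps a job type to the function that runs it.
PLUGINS = {}

def plugin(job_type):
    """Decorator that registers a runner function for one job type."""
    def register(func):
        PLUGINS[job_type] = func
        return func
    return register

@plugin("hive")
def run_hive(script):
    return f"submitted to Hive: {script}"

@plugin("spark")
def run_spark(script):
    return f"submitted to Spark: {script}"

def dispatch(job):
    """Pick the plugin from the job's type; fail loudly for unknown types."""
    try:
        runner = PLUGINS[job["type"]]
    except KeyError:
        raise ValueError(f"no plugin registered for job type {job['type']!r}") from None
    return runner(job["script"])

result = dispatch({"type": "hive", "script": "SELECT 1"})
print(result)
```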
Code Validation
SQL checkers enforce strict pre‑execution validation for common SQL tasks.
Multi‑Environment Cascade
Single environment: one production environment.
Classic environment: development (masked data) → production (real data).
Complex environment: external and internal users with separate controlled spaces.
Recommended Dependencies
Build a table‑level lineage graph of upstream and downstream jobs.
Identify suitable upstream jobs via lineage analysis.
Detect and remove circular dependencies.
Return a list of appropriate dependency nodes.
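The four steps above can be sketched roughly as follows. The job metadata, the `existing_parents` map, and both helper functions are hypothetical. A candidate parent is any job that writes a table the target job reads, and a candidate is dropped if it already depends on the target, since adding it would close a cycle.

```python
# Hypothetical job metadata: the tables each job reads and writes.
jobs = {
    "A": {"reads": {"t3"}, "writes": {"t1"}},
    "B": {"reads": {"t1"}, "writes": {"t2"}},
    "C": {"reads": {"t2"}, "writes": {"t3"}},
}
# Dependencies already configured: job -> set of parent jobs.
existing_parents = {"B": {"A"}, "C": {"B"}}

def depends_on(job, target, parents):
    """True if `target` is reachable from `job` by following parent links."""
    stack, seen = [job], set()
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent == target:
                return True
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return False

def recommend_dependencies(job, jobs, parents):
    """Suggest parent jobs for `job` from table-level lineage, skipping cycles."""
    reads = jobs[job]["reads"]
    # A job is a candidate parent if it writes a table that `job` reads.
    candidates = {
        name for name, meta in jobs.items()
        if name != job and meta["writes"] & reads
    }
    # Drop candidates that already depend on `job`; they would form a cycle.
    return [c for c in sorted(candidates) if not depends_on(c, job, parents)]

print(recommend_dependencies("C", jobs, existing_parents))
print(recommend_dependencies("A", jobs, existing_parents))
```

In this sample, `C` reads a table written by `B`, so `B` is recommended. `A` reads a table written by `C`, but `C` already depends on `A` through `B`, so that candidate is removed.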
Data Permissions
Different engines have independent permission systems (e.g., Oracle, HANA, LibrA).
RBAC (e.g., Cloudera Sentry) and PBAC (e.g., Hortonworks Ranger) are common strategies.
Permissions are usually managed by data‑cluster or DB admins; a centralized permission center provides UI for requests, approvals, and auditing.
Real‑time Development
Metadata management
SQL‑driven development
Component‑based modular design
Intelligent Operations
Integrated tools for task management, code deployment, monitoring, and alerting improve efficiency; features include job rerun, downstream rerun, and data back‑fill.
Data System Architecture
The platform builds a full data ecosystem consisting of ODS (raw source layer), DW (unified warehouse), TDM (tag data model), and ADS (application data layer).
ODS Layer
Collects raw business data with minimal transformation; table names follow the pattern ODS_<system>_<table>. Incremental tables use a _delta suffix.
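The naming rule can be captured in a small helper. This is only a sketch: the function name and the sample system and table names are made up, and the article does not specify letter case, so inputs are used as given.

```python
def ods_table_name(system, table, incremental=False):
    """Build an ODS table name as ODS_<system>_<table>, adding _delta for incremental tables."""
    name = f"ODS_{system}_{table}"
    return f"{name}_delta" if incremental else name

full_name = ods_table_name("crm", "customer")
delta_name = ods_table_name("crm", "customer", incremental=True)
print(full_name, delta_name)
```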
DW Layer
DWD – detailed data layer
DWS – summary data layer
Re‑organizes business data into consistent metrics and dimensions for unified analysis.
TDM Layer
Object‑oriented modeling creates a cross‑domain tag system by linking IDs across business domains.
ADS Layer
Extracts data from DW/TDM to serve specific business needs, delivering APIs for downstream applications.
Data Asset Management
Manages catalogs, metadata, quality, lineage, and lifecycle, providing a visual overview of enterprise data assets.
Data Governance
Includes standard management, metadata, quality, security, and lifecycle governance.
Data Service System
Query Service
Provides API‑based data retrieval with configurable identifiers, filters, sorting, and pagination.
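To make the filter, sort, and pagination options concrete, here is an in-memory sketch of what such a query endpoint might do with its parameters. The `query` function, its parameter names, and the sample rows are assumptions; a real service would translate these options into a query against the underlying store.

```python
def query(rows, filters=None, sort_by=None, descending=False, page=1, page_size=10):
    """Filter rows by exact field match, sort them, and return one page."""
    filters = filters or {}
    matched = [r for r in rows if all(r.get(k) == v for k, v in filters.items())]
    if sort_by is not None:
        matched.sort(key=lambda r: r[sort_by], reverse=descending)
    start = (page - 1) * page_size  # pages are 1-based
    return {"total": len(matched), "items": matched[start:start + page_size]}

rows = [
    {"id": 1, "city": "Beijing", "amount": 30},
    {"id": 2, "city": "Shanghai", "amount": 50},
    {"id": 3, "city": "Beijing", "amount": 10},
    {"id": 4, "city": "Beijing", "amount": 20},
]

page1 = query(rows, filters={"city": "Beijing"}, sort_by="amount",
              descending=True, page=1, page_size=2)
page2 = query(rows, filters={"city": "Beijing"}, sort_by="amount",
              descending=True, page=2, page_size=2)
print(page1["total"], [r["id"] for r in page1["items"]])
```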
Analysis Service
Supports multi‑source connections (Hive, ES, Greenplum, MySQL, Oracle, files).
High‑performance ad‑hoc queries for billion‑row datasets.
Multi‑dimensional analysis and deep data mining.
Flexible integration with business systems.
Recommendation Service
Generates personalized recommendations (a different result for each user, sometimes described as "a thousand faces for a thousand users") by mining user‑item interactions; supports industry‑specific logic, cold‑start and active‑user scenarios, and continuous model optimization.
Audience Segmentation Service
Filters users based on tag combinations, offers count metrics, and can export results to downstream channels (SMS, WeChat, marketing platforms).
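Tag-based filtering reduces to set operations over each user's tags. The sketch below assumes a hypothetical in-memory `user_tags` map and a `segment` function with "must have all" and "must have none" conditions; the count metric is simply the size of the result.

```python
# Hypothetical tag store: user ID -> set of tags attached to that user.
user_tags = {
    "u1": {"vip", "female", "beijing"},
    "u2": {"vip", "male"},
    "u3": {"female", "beijing"},
}

def segment(user_tags, all_of=(), none_of=()):
    """Return sorted user IDs that carry every tag in all_of and no tag in none_of."""
    required, excluded = set(all_of), set(none_of)
    return sorted(
        user for user, tags in user_tags.items()
        if required <= tags and not (excluded & tags)
    )

beijing_women = segment(user_tags, all_of=["female", "beijing"])
non_vip = segment(user_tags, all_of=["female", "beijing"], none_of=["vip"])
print(len(beijing_women), beijing_women)
```

The resulting ID list is what would be exported to a downstream channel.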
Offline Platform
Provides job orchestration, scheduling, dependency management, and debugging tools. Visual diagrams illustrate product functions, scheduling modules, and overall architecture.
Real‑time Platform
Examples from Meituan‑Dianping, Bilibili, and NetEase demonstrate:
SQL‑based and DAG‑based programming.
Unified metadata, lineage, permission, and operation management.
Data ingestion via App logs, DB binlog, server logs, and Kafka.
Computation built on Flink/Saber with YARN scheduling.
State handling using RocksDB, MapDB, Redis.
Outputs to Kafka, HBase, ES, MySQL, TiDB for downstream AI/BI.
Event Management
Three modules coordinate task actions: Server (receives and validates events), Kernel (executes shell scripts on the cluster), and Admin (confirms results and ensures high availability via redundancy and hot‑standby).
Platform Task Status Management
Server initiates tasks, Admin monitors YARN status, and state transitions are reflected in the UI.
Task Debugging
SQL tasks can be debugged with custom CSV inputs; the sloth‑server assembles requests, invokes the kernel, and collects logs.
Log Retrieval
Filebeat ships node logs to Kafka, Logstash parses them, and Elasticsearch stores them for Kibana visualization and UI search.
Monitoring & Alerting
Metrics are collected via InfluxDB (or NetEase’s NTSDB) and displayed in Grafana; alerts can be sent through internal chat, email, SMS, or phone, with configurable thresholds for QPS, latency, and failure events.
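Threshold-based alerting can be sketched as a comparison of current metric values against configured limits. The metric names, limits, and `check_alerts` function here are illustrative assumptions, and the sketch only covers "value exceeds limit" rules; it does not query InfluxDB or send notifications.

```python
# Hypothetical thresholds: alert when a metric rises above its limit.
thresholds = {"qps": 1000, "latency_ms": 500, "failures": 0}

def check_alerts(metrics, thresholds):
    """Return the sorted names of metrics whose value exceeds the configured limit."""
    return sorted(
        name for name, limit in thresholds.items()
        if metrics.get(name, 0) > limit
    )

current = {"qps": 1500, "latency_ms": 120, "failures": 2}
alerts = check_alerts(current, thresholds)
print(alerts)  # each name would be routed to chat, email, SMS, or phone
```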
Offline vs. Real‑time Warehouse
Offline warehouses (Sqoop, DataX, Hive) provide T+1 data: data generated on day T becomes available on day T+1 after a daily batch refresh. Real‑time warehouses ingest changes via Canal/Kafka into low‑latency serving stores (HBase, Redis) for minute‑level queries.
ETL
ETL tools (DataX, Sqoop, Kettle, Informatica) handle diverse sources (text, logs, RDBMS, NoSQL) and schedule incremental loads to avoid resource contention.
Layered Architecture
Stage buffer layer – transactional daily increments.
ODS – raw data aligned with online sources.
DWD/DW – dimensional and fact models (star schema).
DA – application‑oriented aggregated layer.
Code Standards
Script header, encoding, and comment conventions.
One table per file; file name matches table name.
Clear field naming to avoid synonyms and ambiguity.
Key Differences
Offline warehouses rely on batch tools (Sqoop, DataX) for T+1 data.
Real‑time warehouses use streaming tools (Canal, Kafka) and low‑latency serving stores for minute‑level queries.
