How Modern Data Middle Platforms Power Real‑Time and Offline Analytics
This article provides a comprehensive technical overview of data middle platforms, covering data aggregation, offline and real‑time development, smart operations, data asset management, governance, service layers, platform implementations, warehouse layering, and key differences between offline and real‑time data warehouses.
Data Middle Platform Overview
A Data Middle Platform (DMP) centralizes heterogeneous data sources, providing unified collection, storage, processing, and modeling capabilities for downstream business applications.
Data Aggregation
Core tools gather data from databases, event logs, web crawlers, and message queues. Aggregation can be batch (offline) or streaming (real‑time).
Typical tools: Canal, DataX, Sqoop
Offline Development
Job Scheduling
Dependency scheduling – a job starts only after all parent jobs finish.
Time scheduling – a job can be configured to start at a specific time (e.g., 05:00).
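The two scheduling modes can be combined in a single readiness check. The following is a minimal Python sketch, not any particular platform's scheduler; the job names, the `parents` map, and the 05:00 start time are hypothetical values chosen for illustration.

```python
from datetime import datetime, time
from graphlib import TopologicalSorter

def runnable_jobs(parents, finished, start_times, now):
    """Return unfinished jobs whose parents are all done and whose start time has passed."""
    ready = []
    for job, deps in parents.items():
        if job in finished:
            continue
        deps_done = all(dep in finished for dep in deps)          # dependency scheduling
        time_ok = now.time() >= start_times.get(job, time.min)    # time scheduling
        if deps_done and time_ok:
            ready.append(job)
    return sorted(ready)

def execution_order(parents):
    """One full run order that respects every parent -> child dependency."""
    return list(TopologicalSorter(parents).static_order())

# Hypothetical job graph: each key lists its parent jobs.
parents = {
    "ods_orders": [],
    "dwd_orders": ["ods_orders"],
    "dws_sales": ["dwd_orders"],
}
start_times = {"ods_orders": time(5, 0)}  # do not start before 05:00
```

Before 05:00 nothing is runnable; once `ods_orders` has finished, `dwd_orders` becomes the only runnable job.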
Baseline Control
The completion time of long‑running jobs is predicted; if a job cannot finish on schedule, alerts are sent to operators for early intervention.
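A baseline check might look roughly like the sketch below. It assumes the prediction is simply the start time plus the mean of past run durations, which is a simplification chosen for illustration; the function names and the message format are hypothetical.

```python
from datetime import datetime, timedelta

def predict_finish(started_at, history_minutes):
    """Predict finish time as start + mean of past durations (an assumed, simplistic model)."""
    mean_minutes = sum(history_minutes) / len(history_minutes)
    return started_at + timedelta(minutes=mean_minutes)

def baseline_alert(job, started_at, history_minutes, baseline):
    """Return an alert message if the predicted finish misses the baseline, else None."""
    predicted = predict_finish(started_at, history_minutes)
    if predicted > baseline:
        late_by = int((predicted - baseline).total_seconds() // 60)
        return (f"{job}: predicted finish {predicted:%H:%M} is "
                f"{late_by} min past baseline {baseline:%H:%M}")
    return None
```

Because the check runs while the job is still executing, operators hear about a likely miss before the deadline rather than after it.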
Heterogeneous Storage
Plugins are built for each compute engine (Oracle, Hive, Spark, MapReduce) so that jobs automatically use the appropriate plugin.
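One common way to get this automatic selection is a registry keyed by engine name. The sketch below is illustrative only: the `register` decorator, the plugin functions, and their return strings are hypothetical, and the plugins merely describe what they would submit.

```python
from typing import Callable, Dict

PLUGINS: Dict[str, Callable[[str], str]] = {}

def register(engine: str):
    """Decorator that registers a plugin function under an engine name."""
    def decorator(fn):
        PLUGINS[engine.lower()] = fn
        return fn
    return decorator

@register("hive")
def run_hive(script: str) -> str:
    return f"[hive plugin] would submit: {script}"

@register("spark")
def run_spark(script: str) -> str:
    return f"[spark plugin] would submit: {script}"

def submit(engine: str, script: str) -> str:
    """Dispatch a job to the plugin registered for its engine."""
    try:
        plugin = PLUGINS[engine.lower()]
    except KeyError:
        raise ValueError(f"no plugin registered for engine {engine!r}") from None
    return plugin(script)
```

Adding support for another engine then means registering one more function, with no change to the dispatch code.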
Code Validation
SQL checkers enforce strict pre‑execution validation for common task types.
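The article does not list the actual rules, so the following sketch uses three hypothetical ones (no `SELECT *`, no `DELETE`/`UPDATE` without `WHERE`, balanced parentheses) purely to show the shape of a rule‑based pre‑execution check. A production checker would normally parse the SQL rather than rely on regular expressions.

```python
import re

def check_sql(sql: str) -> list:
    """Return a list of rule violations; an empty list means the statement may run."""
    text = sql.strip()
    if not text:
        return ["empty statement"]
    problems = []
    if re.search(r"\bselect\s+\*", text, re.IGNORECASE):
        problems.append("SELECT * is not allowed; list columns explicitly")
    if re.match(r"(delete|update)\b", text, re.IGNORECASE) and not re.search(
        r"\bwhere\b", text, re.IGNORECASE
    ):
        problems.append("DELETE/UPDATE without WHERE")
    if text.count("(") != text.count(")"):
        problems.append("unbalanced parentheses")
    return problems
```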
Multi‑Environment Cascading
Separate environments (single, classic, complex) provide isolated Hive databases, YARN queues, and even distinct Hadoop clusters, supporting resource and permission isolation.
Recommended Dependencies
Based on table‑level lineage graphs, the system identifies upstream jobs, removes cycles, and returns a list of suitable dependencies.
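As a rough illustration, dependency recommendation can be derived from which tables each job reads and writes. In the sketch below, the `reads`, `writes`, and `existing_deps` maps are hypothetical inputs, and "removing cycles" is interpreted as dropping any candidate that already depends on the job in question.

```python
def recommend_dependencies(job, reads, writes, existing_deps):
    """Suggest upstream jobs for `job` from table-level lineage, skipping cycle-causing ones."""
    # Map each table to the jobs that produce it.
    producers = {}
    for producer, tables in writes.items():
        for table in tables:
            producers.setdefault(table, set()).add(producer)

    # Candidates are the producers of every table this job reads.
    candidates = set()
    for table in reads.get(job, ()):
        candidates |= producers.get(table, set())
    candidates.discard(job)  # a job never depends on itself

    def depends_on(start, target):
        """True if `start` already reaches `target` through existing dependencies."""
        stack, seen = [start], set()
        while stack:
            current = stack.pop()
            if current == target:
                return True
            if current in seen:
                continue
            seen.add(current)
            stack.extend(existing_deps.get(current, ()))
        return False

    # A candidate that already depends on `job` would close a cycle, so drop it.
    return sorted(c for c in candidates if not depends_on(c, job))
```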
Data Permissions
Different engines have independent permission systems (e.g., Oracle, HANA, Libra). Common strategies include RBAC (Cloudera Sentry) and PBAC (Hortonworks Ranger). A centralized UI lets applicants request access and auditors approve it, and it logs every action for later audit.
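The request, approve, and audit flow can be sketched with a small in‑memory RBAC model. This is an illustration only: the `AccessDesk` class and its methods are hypothetical and do not reflect the Sentry or Ranger APIs.

```python
from dataclasses import dataclass, field

@dataclass
class AccessDesk:
    role_grants: dict = field(default_factory=dict)  # role -> set of (table, action)
    user_roles: dict = field(default_factory=dict)   # user -> set of roles
    audit_log: list = field(default_factory=list)    # every action, in order

    def request(self, user, role):
        """An applicant asks for a role; nothing is granted yet."""
        self.audit_log.append(("request", user, role))
        return {"user": user, "role": role, "status": "pending"}

    def approve(self, ticket, approver):
        """An approver grants the requested role and the decision is logged."""
        ticket["status"] = "approved"
        self.user_roles.setdefault(ticket["user"], set()).add(ticket["role"])
        self.audit_log.append(("approve", approver, ticket["user"], ticket["role"]))

    def is_allowed(self, user, table, action):
        """RBAC check: allowed if any of the user's roles grants (table, action)."""
        return any(
            (table, action) in self.role_grants.get(role, set())
            for role in self.user_roles.get(user, set())
        )
```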
Real‑time Development
Metadata Management
SQL‑Driven Processing
Component‑Based Development
Smart Operations
Integrated tools handle job management, code deployment, monitoring, and alerting, enabling actions such as task re‑run, downstream re‑run, and data back‑fill.
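A downstream re‑run needs the set of tasks reachable from the one being repaired, in an order that respects dependencies. The sketch below assumes a hypothetical `children` map from each task to its direct downstream tasks.

```python
from graphlib import TopologicalSorter

def downstream_rerun_plan(children, start):
    """Return `start` plus all transitive downstream tasks, parents before children."""
    # Collect everything reachable from the starting task.
    reachable, stack = set(), [start]
    while stack:
        task = stack.pop()
        if task in reachable:
            continue
        reachable.add(task)
        stack.extend(children.get(task, ()))

    # Build a predecessor map restricted to the reachable tasks, then sort it.
    predecessors = {task: set() for task in reachable}
    for parent in reachable:
        for child in children.get(parent, ()):
            predecessors[child].add(parent)
    return list(TopologicalSorter(predecessors).static_order())
```

Tasks outside the affected subgraph are left alone, which keeps a re‑run or back‑fill as small as the lineage allows.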
Data Asset Management
Manages catalogs, metadata, data quality, lineage, and lifecycle, presenting assets visually to improve data awareness and support downstream value extraction.
Data Service System
Transforms stored data into services accessed via APIs, supporting query, analysis, recommendation, and audience‑targeting use cases.
Offline Platform (Suning Example)
Cross‑task dependencies are realized via an FTP event mechanism: a marker file on the FTP server signals that upstream processing is complete and triggers downstream tasks.
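The marker‑file handshake amounts to polling a directory listing until the expected file appears. The sketch below keeps the listing behind a `list_names` callable so it can be backed by an FTP listing (for example `ftplib.FTP.nlst`) or by anything else; the marker name, poll interval, and retry limit are hypothetical.

```python
import time

def wait_for_marker(list_names, marker, poll_seconds=30.0, max_polls=120, sleep=time.sleep):
    """Poll `list_names()` until `marker` appears; return True if found, False on timeout."""
    for attempt in range(max_polls):
        if marker in list_names():
            return True            # upstream finished: downstream may start
        if attempt < max_polls - 1:
            sleep(poll_seconds)    # wait before checking the listing again
    return False
```

A downstream task would call this with a marker such as `orders_20240101.done` (an assumed naming scheme) and start only when it returns `True`.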
Immediate tasks are placed at the head of a DelayQueue; periodic tasks use Quartz; dependency‑driven tasks rely on ZooKeeper listeners.
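The DelayQueue behavior can be approximated in Python with a heap ordered by due time. This is a stand‑in for illustration, not Java's `java.util.concurrent.DelayQueue`; the class and method names are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
import heapq
import itertools

class DelayQueue:
    """Tasks become available once their due time has passed; the earliest due task comes first."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker that preserves insertion order

    def put(self, task, run_at):
        heapq.heappush(self._heap, (run_at, next(self._seq), task))

    def poll(self, now):
        """Return the earliest task whose due time is <= now, or None if none is due."""
        if self._heap and self._heap[0][0] <= now:
            return heapq.heappop(self._heap)[2]
        return None
```

An immediate task is enqueued with a due time of "now" or earlier, so it sorts ahead of anything scheduled for later and is picked up on the next poll.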
Real‑time Platform
Meituan
Integrates Grafana for embedded monitoring.
Bilibili
Features SQL‑based programming, DAG drag‑and‑drop, and unified managed operations.
The platform consists of real‑time ingestion and computation. Ingestion routes logs, binlogs, and service logs to Kafka or HDFS via the internal Lancer system. Computation uses BSQL on YARN; Flink manages the execution pool and supports MySQL, Redis, HBase as dimension tables. State is stored in RocksDB with extensions to MapDB and Redis for IO‑intensive workloads.
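Conceptually, a dimension‑table join enriches each streaming event with attributes looked up from an external store. The plain‑Python sketch below only illustrates that idea and is not BSQL or Flink code; the `user_id` key, the `lookup` callable, and the unbounded cache are assumptions.

```python
def enrich(events, lookup, cache=None):
    """Yield each event merged with its dimension row, caching lookups by key."""
    cache = {} if cache is None else cache
    for event in events:
        key = event["user_id"]
        if key not in cache:
            # In practice this would be a read against MySQL, Redis, or HBase.
            cache[key] = lookup(key)
        yield {**event, **(cache[key] or {})}
```

Caching matters because the external store is typically far slower than the stream; a real system would also bound the cache and expire stale entries.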
Use Cases
AI engineering: streaming joins for advertising, search, recommendation.
Real‑time feature support for player and CDN quality monitoring (live, PCU, stall rate).
User growth: channel analysis and optimization.
Real‑time ETL: dashboards, boards, and live reporting.
NetEase
Supports advertising, e‑commerce dashboards, ETL, analytics, recommendation, risk control, search, and live streaming.
Event Management
Three modules coordinate task operations:
Server – receives requests, validates data, assembles the event, and forwards it to the Kernel.
Kernel – executes shell scripts on the cluster.
Admin – confirms results, updates status, and writes to ZooKeeper.
Distributed locks in ZooKeeper guarantee single‑operator execution; high availability is achieved via multiple Server instances, Kernel monitoring, and hot‑standby Admin.
Task Debugging
SQL tasks can be debugged by uploading CSV inputs; the sloth‑server assembles the request, invokes the Kernel, and collects logs.
Log Retrieval
Filebeat ships task logs from each YARN node to Kafka; Logstash parses them; Elasticsearch stores them so they can be searched in Kibana and displayed directly in the platform UI.
Monitoring
Metrics are collected by an InfluxDB‑based component; NetEase’s custom time‑series database (NTSDB) provides dynamic scaling and high availability. Metrics are visualized in Grafana and can also trigger alerts.
Alerting
Supports alerts for task failures, data lag, failover, and custom rules (e.g., QPS thresholds). Notification channels include internal chat tools, email, phone, and SMS.
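Custom rules such as QPS thresholds can be modeled as a predicate over a metric value. The following sketch is illustrative; the `Rule` class, metric names, thresholds, and channel names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    metric: str
    breached: Callable[[float], bool]  # returns True when the value should alert

def evaluate(rules, metrics):
    """Return the names of rules whose metric is present and breaches its condition."""
    return [
        rule.name
        for rule in rules
        if rule.metric in metrics and rule.breached(metrics[rule.metric])
    ]

def notify(alerts, channels):
    """Fan each alert out to every configured channel (here just paired up, not sent)."""
    return [(channel, alert) for alert in alerts for channel in channels]

rules = [
    Rule("qps_too_low", "qps", lambda value: value < 100),
    Rule("lag_too_high", "lag_seconds", lambda value: value > 60),
]
```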
Real‑time Data Warehouse
Data is ingested into Kafka and processed in real time; the results are written to stores such as Redis and Kudu and then served to front‑end applications via data services.
Data Warehouse Layers
ODS (Operational Data Store) – raw data from source systems with minimal transformation.
DWD (Detail Layer) – cleansed, enriched data, often stored in Kafka or Redis.
DWS (Summary Layer) – aggregated, modeled data for business reporting.
TDM (Tag Data Model) – object‑centric integration across domains via ID mapping.
ADS (Application Data Layer) – extracts data for specific business needs.
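To make the layering concrete, the toy pipeline below moves a few rows from ODS through DWD and DWS to an ADS result. The order rows, field names, and the "top shop" query are invented for illustration, and the TDM layer is omitted.

```python
from collections import defaultdict

# ODS: raw rows as they arrive from the source system (hypothetical sample data).
ods = [
    {"order_id": "1", "shop": " A ", "amount": "10.5"},
    {"order_id": "2", "shop": "B", "amount": "bad"},
    {"order_id": "3", "shop": "A", "amount": "4.5"},
]

def to_dwd(rows):
    """DWD: cleanse and type the detail rows, dropping ones that cannot be parsed."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        clean.append(
            {"order_id": int(row["order_id"]), "shop": row["shop"].strip(), "amount": amount}
        )
    return clean

def to_dws(rows):
    """DWS: aggregate detail rows into a per-shop summary."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["shop"]] += row["amount"]
    return dict(totals)

def to_ads(summary, top_n=1):
    """ADS: extract what one specific application needs, here the top shops by sales."""
    return sorted(summary.items(), key=lambda item: item[1], reverse=True)[:top_n]
```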
Differences Between Offline and Real‑time Warehouses
Offline warehouses use Sqoop, DataX, and Hive to build T+1 data refreshed daily, offering high accuracy and stability. Real‑time warehouses ingest data via Canal into Kafka and then store it in low‑latency serving stores such as HBase, providing minute‑level to sub‑second latency at the cost of lower accuracy and stability.
Key Implementation Points for Real‑time Warehouses
End‑to‑end latency and traffic monitoring.
Rapid fault recovery.
Back‑track capability to consume data from arbitrary time windows.
Hybrid query model: real‑time queries for fresh data, offline paths to correct T+1 data.
Data map and lineage documentation.
Real‑time data quality monitoring, initially rule‑based.
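Two of the points above lend themselves to a short sketch: the hybrid query model and rule‑based quality checks. The code below assumes results are keyed by date string and that quality rules are simple named predicates; both are assumptions made for illustration.

```python
def hybrid_query(offline, realtime):
    """Merge results so corrected offline (T+1) values override real-time ones."""
    merged = dict(realtime)
    merged.update(offline)  # offline values win wherever they exist
    return dict(sorted(merged.items()))

def quality_violations(record, rules):
    """Return the names of the rules that `record` fails."""
    return [name for name, check in rules.items() if not check(record)]

# Hypothetical rule set for an order record.
order_rules = {
    "amount_non_negative": lambda r: r.get("amount", 0) >= 0,
    "has_order_id": lambda r: bool(r.get("order_id")),
}
```

With this merge, yesterday's figure comes from the offline path once it is available, while today's figure still comes from the real‑time path.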
Code Standards
Script header comments follow Google style guidelines.
One table per file; file name matches table name.
Field names avoid synonyms and ambiguity, especially in model layers.
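The one‑table‑per‑file rule is easy to check automatically. The sketch below assumes each script is a `.sql` file containing a `CREATE TABLE` statement, which the article does not state; the regular expression is a rough approximation rather than a full SQL parser.

```python
import re
from pathlib import PurePath

CREATE_RE = re.compile(
    r"create\s+table\s+(?:if\s+not\s+exists\s+)?([\w.]+)", re.IGNORECASE
)

def check_script(path, sql):
    """Return violations of 'one table per file; file name matches table name'."""
    tables = CREATE_RE.findall(sql)
    if len(tables) != 1:
        return [f"expected exactly one CREATE TABLE, found {len(tables)}"]
    table = tables[0].split(".")[-1]  # ignore an optional database prefix
    stem = PurePath(path).stem
    if stem != table:
        return [f"file name {stem!r} does not match table name {table!r}"]
    return []
```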
Overall, a well‑designed data middle platform unifies data collection, processing, governance, and service delivery, enabling both offline batch analytics and low‑latency real‑time insights across industries.
IT Architects Alliance
