How Modern Data Middle Platforms Power Real‑Time and Offline Analytics

This article provides a comprehensive technical overview of data middle platforms, covering data aggregation, offline and real‑time development, smart operations, data asset management, governance, service layers, platform implementations, warehouse layering, and key differences between offline and real‑time data warehouses.

IT Architects Alliance

May 25, 2021

Data Middle Platform Overview

A Data Middle Platform (DMP) centralizes heterogeneous data sources, providing unified collection, storage, processing, and modeling capabilities for downstream business applications.

Data Aggregation

Core tools gather data from databases, event logs, web crawlers, and message queues. Aggregation can be batch (offline) or streaming (real‑time).

Typical tools: Canal, DataX, Sqoop

Offline Development

Job Scheduling

Dependency scheduling – a job starts only after all parent jobs finish.

Time scheduling – a job can be configured to start at a specific time (e.g., 05:00).
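The two scheduling modes can be combined in a single readiness check. The following is a minimal sketch, not any platform's actual scheduler; the job names, the `PARENTS` graph, and the `NOT_BEFORE` table are hypothetical.

```python
from datetime import time
from graphlib import TopologicalSorter

# Hypothetical job graph: each job maps to the set of parent jobs it waits for.
PARENTS = {
    "load_orders": set(),
    "load_users": set(),
    "join_orders_users": {"load_orders", "load_users"},
    "daily_report": {"join_orders_users"},
}

# Hypothetical time triggers: a job may not start before this wall-clock time.
NOT_BEFORE = {"daily_report": time(5, 0)}


def runnable_jobs(finished, now):
    """Return jobs whose parents have all finished and whose start time has passed."""
    ready = []
    for job, parents in PARENTS.items():
        if job in finished:
            continue
        if not parents <= finished:
            continue  # dependency scheduling: some parent is still pending
        if now < NOT_BEFORE.get(job, time(0, 0)):
            continue  # time scheduling: too early to start
        ready.append(job)
    return sorted(ready)


def execution_order():
    """One valid order in which a single worker could run every job."""
    return list(TopologicalSorter(PARENTS).static_order())
```

In this sketch `daily_report` stays blocked until both conditions hold: its parent has finished and the clock has reached 05:00.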

Baseline Control

Long‑running jobs are predicted for completion time; if a job cannot finish on schedule, alerts are sent to operators for early intervention.
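A baseline check can be sketched as a comparison between a predicted finish time and a deadline. The predictor below (start time plus the mean of past durations) is an assumption chosen for illustration; the source does not say how completion time is predicted, and the function names are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean


def predict_finish(started_at, past_durations_min):
    """Predict completion as start time plus the mean of past run durations."""
    return started_at + timedelta(minutes=mean(past_durations_min))


def baseline_alert(job, started_at, past_durations_min, baseline):
    """Return an alert message if the predicted finish misses the baseline, else None."""
    predicted = predict_finish(started_at, past_durations_min)
    if predicted <= baseline:
        return None
    overrun_min = int((predicted - baseline).total_seconds() // 60)
    return (
        f"{job}: predicted finish {predicted:%H:%M} "
        f"misses baseline {baseline:%H:%M} by {overrun_min} min"
    )
```

Because the check runs on a prediction rather than on the actual finish, operators can be notified while the job is still running.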

Heterogeneous Storage

A plugin is built for each storage and compute engine (Oracle, Hive, Spark, MapReduce), so each job automatically uses the plugin that matches its engine.

Code Validation

SQL checkers enforce strict pre‑execution validation for common task types.
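As a rough illustration of pre-execution validation, the sketch below rejects a statement when it matches any rule in a list. The three rules are hypothetical examples, not the checks any particular platform enforces, and a production checker would parse the SQL rather than rely on regular expressions.

```python
import re

# Hypothetical rules: (pattern, message shown to the developer).
RULES = [
    (re.compile(r"\bselect\s+\*", re.IGNORECASE),
     "avoid SELECT *; list columns explicitly"),
    (re.compile(r"\bdrop\s+(table|database)\b", re.IGNORECASE),
     "DROP statements are not allowed in scheduled tasks"),
    (re.compile(r"\bdelete\s+from\s+\w+(\.\w+)?\s*;?\s*$", re.IGNORECASE),
     "DELETE without a WHERE clause"),
]


def validate_sql(sql):
    """Return the list of rule violations; an empty list means the SQL may run."""
    text = " ".join(sql.split())  # collapse whitespace so patterns stay simple
    return [message for pattern, message in RULES if pattern.search(text)]
```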

Multi‑Environment Cascading

Separate environments (single, classic, complex) provide isolated Hive databases, YARN queues, and even distinct Hadoop clusters, supporting resource and permission isolation.

Recommended Dependencies

Based on table‑level lineage graphs, the system identifies upstream jobs, removes cycles, and returns a list of suitable dependencies.
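One way to read this: a job's candidate dependencies are the jobs that produce its input tables, minus any candidate that would create a cycle. The sketch below follows that reading under stated assumptions; the `LINEAGE` data and all function names are hypothetical.

```python
# Hypothetical table-level lineage: job -> (input tables, output tables).
LINEAGE = {
    "job_ods_orders": ({"src.orders"}, {"ods.orders"}),
    "job_dwd_orders": ({"ods.orders"}, {"dwd.orders"}),
    # This job reads its own output (incremental merge), which forms a self-cycle.
    "job_dws_sales": ({"dwd.orders", "dws.sales"}, {"dws.sales"}),
}


def direct_upstream(job):
    """Jobs that produce any table this job reads."""
    producer = {t: j for j, (_, outputs) in LINEAGE.items() for t in outputs}
    inputs, _ = LINEAGE[job]
    return {producer[t] for t in inputs if t in producer}


def depends_on(job, target, seen=None):
    """True if `job` depends on `target` directly or transitively."""
    seen = set() if seen is None else seen
    for parent in direct_upstream(job):
        if parent == target:
            return True
        if parent not in seen:
            seen.add(parent)
            if depends_on(parent, target, seen):
                return True
    return False


def recommend_dependencies(job):
    """Upstream producers, excluding the job itself and anything that depends on it."""
    return sorted(
        c for c in direct_upstream(job)
        if c != job and not depends_on(c, job)
    )
```

Here `job_dws_sales` is recommended only `job_dwd_orders`; the self-reference is dropped because accepting it would make the job wait on itself.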

Data Permissions

Each engine has its own independent permission system (e.g., Oracle, HANA, Libra). Common strategies include RBAC (Cloudera Sentry) and PBAC (Hortonworks Ranger). A centralized UI lets applicants request access and auditors approve it, and it logs every action for audit.

Real‑time Development

Metadata Management

SQL‑Driven Processing

Component‑Based Development

Smart Operations

Integrated tools handle job management, code deployment, monitoring, and alerting, enabling actions such as task re‑run, downstream re‑run, and data back‑fill.
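The difference between a task re-run and a downstream re-run can be sketched as a graph traversal: the former touches one task, the latter the task plus everything that consumes its output, in dependency order. The `CHILDREN` graph and function names below are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical DAG: task -> tasks that consume its output.
CHILDREN = {
    "ods_orders": ["dwd_orders"],
    "dwd_orders": ["dws_sales", "dws_refunds"],
    "dws_sales": ["ads_dashboard"],
    "dws_refunds": ["ads_dashboard"],
    "ads_dashboard": [],
}


def downstream_of(task):
    """All tasks reachable from `task`, excluding the task itself."""
    affected, stack = set(), [task]
    while stack:
        current = stack.pop()
        for child in CHILDREN.get(current, []):
            if child not in affected:
                affected.add(child)
                stack.append(child)
    return affected


def rerun_plan(task, include_downstream=False):
    """Return the tasks to re-run, ordered so parents run before children."""
    targets = {task} | (downstream_of(task) if include_downstream else set())
    # Rebuild parent sets restricted to the affected tasks.
    parents = {t: set() for t in targets}
    for parent, children in CHILDREN.items():
        for child in children:
            if parent in targets and child in targets:
                parents[child].add(parent)
    return list(TopologicalSorter(parents).static_order())
```

A data back-fill would typically call the same plan once per missing date; that loop is omitted here.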

Data Asset Management

Manages catalogs, metadata, data quality, lineage, and lifecycle, presenting assets visually to improve data awareness and support downstream value extraction.

Data Service System

Transforms stored data into services accessed via APIs, supporting query, analysis, recommendation, and audience‑targeting use cases.

Offline Platform (Suning Example)

Cross‑task dependencies are realized via an FTP event mechanism: a marker file on the FTP server signals that upstream processing is complete and triggers downstream tasks.
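The marker-file handshake can be sketched as a polling loop. To keep the example self-contained, the FTP lookup is passed in as a callable (`marker_exists`) rather than implemented; that callable, the marker path, and the function names are all assumptions for illustration.

```python
import time


def wait_for_marker(marker_exists, marker, poll_seconds=1.0, max_polls=60,
                    sleep=time.sleep):
    """Poll until the upstream marker file appears; return False on timeout."""
    for _ in range(max_polls):
        if marker_exists(marker):
            return True
        sleep(poll_seconds)
    return False


def run_downstream_when_ready(marker_exists, marker, task, **poll_options):
    """Run `task` only after the upstream marker has been observed."""
    if wait_for_marker(marker_exists, marker, **poll_options):
        return task()
    raise TimeoutError(f"marker {marker!r} not seen; upstream may have failed")
```

In a real deployment `marker_exists` would check the FTP server for the file; the upstream job writes the marker as its final step so that its presence implies the data is complete.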

Immediate tasks are placed at the head of a DelayQueue; periodic tasks use Quartz; dependency‑driven tasks rely on ZooKeeper listeners.
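The DelayQueue named above is a Java class; the Python sketch below only mimics the relevant behavior with a heap, to show why an immediate task ends up ahead of tasks scheduled for later. The class and method names are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
import heapq
import itertools


class DelayQueue:
    """Tasks ordered by the time at which they become runnable."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps insertion order stable

    def put(self, task, run_at):
        heapq.heappush(self._heap, (run_at, next(self._seq), task))

    def put_immediate(self, task, now):
        # Zero delay: the task sorts ahead of everything scheduled after `now`.
        self.put(task, run_at=now)

    def poll(self, now):
        """Return the next task whose run time has passed, or None."""
        if self._heap and self._heap[0][0] <= now:
            return heapq.heappop(self._heap)[2]
        return None
```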

Real‑time Platform

Meituan

Integrates Grafana for embedded monitoring.

Bilibili

Features SQL‑based programming, DAG drag‑and‑drop, and unified managed operations.

The platform consists of real‑time ingestion and computation. Ingestion routes logs, binlogs, and service logs to Kafka or HDFS via the internal Lancer system. Computation uses BSQL on YARN; Flink manages the execution pool and supports MySQL, Redis, HBase as dimension tables. State is stored in RocksDB with extensions to MapDB and Redis for IO‑intensive workloads.

Use Cases

AI engineering: streaming joins for advertising, search, recommendation.

Real‑time feature support for player and CDN quality monitoring (live, PCU, stall rate).

User growth: channel analysis and optimization.

Real‑time ETL: dashboards and live reporting.

NetEase

Supports advertising, e‑commerce dashboards, ETL, analytics, recommendation, risk control, search, and live streaming.

Event Management

Three modules coordinate task operations:

Server – receives requests, validates data, assembles the event, and forwards it to the Kernel.

Kernel – executes shell scripts on the cluster.

Admin – confirms results, updates status, and writes to ZooKeeper.

Distributed locks in ZooKeeper guarantee single‑operator execution; high availability is achieved via multiple Server instances, Kernel monitoring, and hot‑standby Admin.

Task Debugging

SQL tasks can be debugged by uploading CSV inputs; the sloth‑server assembles the request, invokes the Kernel, and collects logs.

Log Retrieval

Filebeat ships task logs from each YARN node to Kafka; Logstash parses them; Elasticsearch stores them so they can be searched in Kibana or displayed directly in the platform UI.

Monitoring

Metrics are collected by an InfluxDB‑based component; NetEase’s custom time‑series database (NTSDB) provides dynamic scaling and high availability. The collected metrics are either visualized in Grafana or used to trigger alerts.

Alerting

Supports alerts for task failures, data lag, failover, and custom rules (e.g., QPS thresholds). Notification channels include internal chat tools, email, phone, and SMS.
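Custom rules of this kind can be modeled as a metric name, a predicate, and a list of channels. The sketch below is illustrative only; the rule definitions, thresholds, and channel names are hypothetical, and actual delivery to chat, email, phone, or SMS is left out.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    metric: str
    breached: Callable[[float], bool]
    channels: tuple


# Hypothetical rules and thresholds.
RULES = [
    Rule("low QPS", "qps", lambda v: v < 100, ("chat", "email")),
    Rule("data lag", "lag_seconds", lambda v: v > 300, ("chat", "sms", "phone")),
]


def evaluate(metrics):
    """Return (rule name, channels) for every rule whose metric breaches it."""
    alerts = []
    for rule in RULES:
        value = metrics.get(rule.metric)
        if value is not None and rule.breached(value):
            alerts.append((rule.name, rule.channels))
    return alerts
```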

Real‑time Data Warehouse

Data is ingested into Kafka and processed in real time; results are written to stores such as Redis and Kudu, then served to front‑end applications via data services.

Data Warehouse Layers

ODS (Operational Data Store) – raw data from source systems with minimal transformation.

DWD (Detail Layer) – cleansed, enriched data, often stored in Kafka or Redis.

DWS (Summary Layer) – aggregated, modeled data for business reporting.

TDM (Tag Data Model) – object‑centric integration across domains via ID mapping.

ADS (Application Data Layer) – extracts data for specific business needs.
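The flow through the layers can be illustrated with a toy pipeline over in-memory records. This is a simplified sketch: the sample rows, field names, and functions are hypothetical, and the TDM layer is omitted because it requires cross-domain ID mapping.

```python
from collections import defaultdict

# ODS: raw records as they arrive from the source system (hypothetical sample).
ods_orders = [
    {"order_id": "1", "user": " alice ", "amount": "30.0", "dt": "2021-05-24"},
    {"order_id": "2", "user": "bob", "amount": "bad", "dt": "2021-05-24"},
    {"order_id": "3", "user": "alice", "amount": "12.5", "dt": "2021-05-25"},
]


def to_dwd(rows):
    """DWD: cleanse and type the detail rows, dropping records that fail parsing."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        clean.append({
            "order_id": int(row["order_id"]),
            "user": row["user"].strip(),
            "amount": amount,
            "dt": row["dt"],
        })
    return clean


def to_dws(rows):
    """DWS: aggregate detail rows into per-day, per-user totals."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["dt"], row["user"])] += row["amount"]
    return dict(totals)


def to_ads(summary, dt):
    """ADS: extract what one application needs, here a single day's revenue."""
    return sum(value for (day, _), value in summary.items() if day == dt)
```

Each layer reads only from the one below it, which is what lets a fix in DWD propagate to every summary and application table built on top.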

Differences Between Offline and Real‑time Warehouses

Offline warehouses use Sqoop, DataX, and Hive to build T+1 data (data for day T becomes available on day T+1) that is refreshed daily, offering high accuracy and stability. Real‑time warehouses ingest data via Canal into Kafka, then write results to serving stores such as HBase, providing minute‑level to sub‑second latency at the cost of lower accuracy and stability.

Key Implementation Points for Real‑time Warehouses

End‑to‑end latency and traffic monitoring.

Rapid fault recovery.

Back‑track capability to consume data from arbitrary time windows.

Hybrid query model: real‑time queries for fresh data, offline paths to correct T+1 data.

Data map and lineage documentation.

Real‑time data quality monitoring, initially rule‑based.
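The hybrid query model in the list above can be sketched as a per-day choice between two result sets: the offline value wins once the T+1 batch has produced it, and the real-time value fills in for days the batch has not reached. The function and parameter names are hypothetical; dates are ISO strings so they compare correctly as text.

```python
def hybrid_query(days, offline, realtime, offline_ready_through):
    """Serve each day from the offline result when it exists, else from real time.

    `offline_ready_through` is the last day the offline batch has completed.
    """
    result = {}
    for day in days:
        if day <= offline_ready_through and day in offline:
            result[day] = ("offline", offline[day])
        elif day in realtime:
            result[day] = ("realtime", realtime[day])
    return result
```

For example, with offline data ready through 2021-05-24, a query over 24 and 25 May returns the corrected batch figure for the 24th and the streaming figure for the 25th.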

Code Standards

Script header comments follow Google style guidelines.

One table per file; file name matches table name.

Field names avoid synonyms and ambiguity, especially in model layers.
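The "one table per file" rule lends itself to an automated check. The sketch below is one possible lint, assuming that tables are declared with `CREATE TABLE` and that the file name should equal the table name without its database prefix; both assumptions, and the function name, are illustrative.

```python
import re
from pathlib import PurePath

CREATE_TABLE = re.compile(
    r"\bcreate\s+table\s+(?:if\s+not\s+exists\s+)?([\w.]+)", re.IGNORECASE
)


def check_script(path, sql):
    """Return a list of violations of the one-table-per-file naming rule."""
    tables = CREATE_TABLE.findall(sql)
    problems = []
    if len(tables) != 1:
        problems.append(f"expected exactly one CREATE TABLE, found {len(tables)}")
    stem = PurePath(path).stem
    for table in tables:
        # Compare only the table part of a qualified name such as db.table.
        if table.split(".")[-1].lower() != stem.lower():
            problems.append(f"table {table} does not match file name {stem}")
    return problems
```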

Overall, a well‑designed data middle platform unifies data collection, processing, governance, and service delivery, enabling both offline batch analytics and low‑latency real‑time insights across industries.
