
Unlocking the Data Middle Platform: From Ingestion to Real‑Time Analytics

This article provides a comprehensive overview of data middle platform concepts, covering data aggregation, collection tools, development modules, job scheduling, baseline control, heterogeneous storage, permission management, real‑time and offline processing, governance, services, and implementation details for building robust big‑data solutions.

ITFLY8 Architecture Home

Dec 18, 2020

Data Middle Platform

The data middle platform (DMP) is a core capability layer that aggregates data from heterogeneous networks and data sources into a centralized repository for downstream processing and modeling.

Data Aggregation

Core tools include database synchronization, event tracking, web crawlers, and message queues, with both offline batch and real‑time collection methods.

Data Collection Tools

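Canal: incremental synchronization based on parsing MySQL binlogs, suited to real‑time collection.

DataX: offline batch synchronization between heterogeneous data sources.

Sqoop: bulk transfer between relational databases and Hadoop.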

Data Development

Provides offline, real‑time, and algorithm development tools for developers and analysts.

Offline Development

Job Scheduling

Dependency scheduling: a job starts only after all parent jobs finish.

Time scheduling: a job can be set to start at a specific time.
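
The two scheduling rules can be combined into a single readiness check. The following is a minimal sketch, not the platform's actual scheduler; the `Job` class, the `ready_jobs` function, and the job names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Job:
    name: str
    parents: List[str] = field(default_factory=list)  # upstream job names
    start_at: Optional[datetime] = None               # earliest start time, if any


def ready_jobs(jobs, finished, now):
    """Return names of unfinished jobs that are allowed to start at `now`."""
    ready = []
    for job in jobs:
        if job.name in finished:
            continue
        # Dependency scheduling: every parent must have finished.
        parents_done = all(p in finished for p in job.parents)
        # Time scheduling: the configured start time must have passed.
        time_ok = job.start_at is None or now >= job.start_at
        if parents_done and time_ok:
            ready.append(job.name)
    return ready


# Hypothetical three-job chain for a nightly batch.
JOBS = [
    Job("ods_orders", start_at=datetime(2020, 12, 18, 1, 0)),
    Job("dwd_orders", parents=["ods_orders"]),
    Job("ads_report", parents=["dwd_orders"], start_at=datetime(2020, 12, 18, 6, 0)),
]
```

In this sketch `ads_report` waits for both conditions: its parent must be finished and the clock must have reached 06:00.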

Baseline Control

Baseline control predicts each job's completion time algorithmically; if a job is unlikely to finish by its deadline, the scheduler alerts operations staff so they can intervene early.
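
The article does not say which prediction algorithm is used. As a rough illustration only, the sketch below assumes the simplest possible stand-in: the mean duration of recent runs. The function names and the sample numbers are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean


def predict_finish(start, history_minutes):
    """Predict the finish time as start + mean duration of past runs (an assumption)."""
    return start + timedelta(minutes=mean(history_minutes))


def breaches_baseline(start, history_minutes, deadline):
    """True if the predicted finish time is later than the baseline deadline."""
    return predict_finish(start, history_minutes) > deadline


# Hypothetical job that started at 02:00 and usually takes about an hour.
start = datetime(2020, 12, 18, 2, 0)
history = [50, 60, 70]  # durations of recent runs, in minutes
if breaches_baseline(start, history, deadline=datetime(2020, 12, 18, 2, 45)):
    print("ALERT: job is predicted to miss its baseline")
```

A production system would presumably also account for the remaining upstream chain and for data volume, which this sketch ignores.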

Heterogeneous Storage

Different storage and compute engines (Oracle, Hive, Spark, MapReduce, etc.) each have a dedicated plugin; the platform automatically selects the appropriate plugin based on the job type.
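
One common way to select a plugin by job type is a registry keyed on the type name. The sketch below shows that pattern under assumed names (`EnginePlugin`, `register`, `select_plugin`); the `submit` methods only format a string and do not talk to any real engine.

```python
PLUGINS = {}  # job type -> plugin class


def register(job_type):
    """Class decorator that records a plugin under its job type."""
    def decorator(cls):
        PLUGINS[job_type] = cls
        return cls
    return decorator


class EnginePlugin:
    def submit(self, code):
        raise NotImplementedError


@register("hive")
class HivePlugin(EnginePlugin):
    def submit(self, code):
        return f"[hive] {code}"  # placeholder for a real submission


@register("spark")
class SparkPlugin(EnginePlugin):
    def submit(self, code):
        return f"[spark] {code}"  # placeholder for a real submission


def select_plugin(job_type):
    """Return a plugin instance for the job type, or fail loudly."""
    try:
        return PLUGINS[job_type.lower()]()
    except KeyError:
        raise ValueError(f"no plugin registered for job type {job_type!r}") from None
```

With this layout, supporting a new engine means adding one decorated class; the scheduler code that calls `select_plugin` does not change.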

Code Validation

SQL checkers enforce strict pre‑execution validation for common SQL tasks.

Multi‑Environment Cascading

Supports single, classic, and complex environments, each with isolated Hive databases, YARN queues, and possibly separate Hadoop clusters.

Recommended Dependencies

Uses table‑level lineage graphs to find upstream jobs, performs cycle detection, and returns suitable dependency lists.
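
The recommendation step can be sketched as a lookup over table-level lineage followed by a reachability check. The data structures below (`reads`, `producer`, `depends_on`) and all job and table names are assumptions made for the example, not the platform's real model.

```python
def reaches(src, target, depends_on):
    """True if `target` is reachable from `src` by following upstream edges."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(depends_on.get(node, ()))
    return False


def recommend_dependencies(job, reads, producer, depends_on):
    """
    reads:      job -> set of tables the job reads
    producer:   table -> job that writes the table
    depends_on: job -> set of jobs it already depends on
    """
    # Upstream candidates are the jobs that produce this job's input tables.
    candidates = {producer[t] for t in reads.get(job, ()) if t in producer}
    candidates.discard(job)
    # Cycle detection: skip a candidate that already depends on `job`,
    # because depending on it would close a loop.
    return sorted(c for c in candidates if not reaches(c, job, depends_on))


reads = {
    "job_report": {"dwd.orders", "dim.users"},
    "job_orders": {"ods.orders"},
}
producer = {"dwd.orders": "job_orders", "dim.users": "job_users"}
depends_on = {"job_users": {"job_report"}}  # job_users already runs after job_report

print(recommend_dependencies("job_report", reads, producer, depends_on))
```

Here `job_users` is filtered out because it already depends on `job_report`, so only `job_orders` is suggested.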

Data Permissions

Addresses challenges of diverse permission systems (e.g., Oracle, HANA, Sentry, Ranger) and provides a unified UI for request, approval, and audit of data access.

Real‑Time Development

Metadata management

SQL‑driven processing

Componentized development

Intelligent Operations

Integrates task management, code deployment, monitoring, and alerting to improve efficiency, supporting re‑run, downstream re‑run, and data back‑fill.

Data System

Combines data aggregation and development modules to form a traditional data warehouse capability, enabling the construction of an enterprise‑wide data system.

Full‑domain coverage

Clear hierarchical structure

Accurate and consistent data

Performance optimization

Cost reduction through data sharing

Ease of use with pre‑processed data

Data Asset Management

Manages catalogs, metadata, quality, lineage, and lifecycle, presenting assets to enhance data awareness.

Data Governance

Includes standards, metadata, quality, security, and lifecycle management.

Data Service System

Transforms data assets into services, enabling rapid development of business middle platforms.

Query Service

Provides API‑based data retrieval with configurable identifiers, filters, sorting, and pagination.
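
To make the filter, sort, and pagination parameters concrete, here is a small in-memory sketch. The `query` function, its parameter names, and the sample rows are hypothetical; a real query service would push these operations down to the storage engine.

```python
def query(rows, filters=None, sort_by=None, descending=False, page=1, page_size=10):
    """Apply equality filters, then sorting, then pagination to a list of dicts."""
    filters = filters or {}
    matched = [r for r in rows if all(r.get(k) == v for k, v in filters.items())]
    if sort_by is not None:
        matched.sort(key=lambda r: r[sort_by], reverse=descending)
    start = (page - 1) * page_size
    return {
        "total": len(matched),  # total matches, before pagination
        "page": page,
        "items": matched[start:start + page_size],
    }


ROWS = [
    {"id": 1, "city": "Hangzhou", "gmv": 30},
    {"id": 2, "city": "Beijing", "gmv": 50},
    {"id": 3, "city": "Hangzhou", "gmv": 10},
    {"id": 4, "city": "Hangzhou", "gmv": 20},
]

first_page = query(ROWS, {"city": "Hangzhou"}, sort_by="gmv", descending=True, page_size=2)
```

Returning `total` alongside the page lets a caller compute the number of pages without issuing a second request.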

Analysis Service

Supports multi‑source data access, high‑performance ad‑hoc queries, multi‑dimensional analysis, and deep data mining.

Recommendation Service

Generates personalized recommendations by mining user‑item interactions, supporting industry‑specific logic, various scenarios, and continuous model optimization.

Crowd Service

Selects target user groups based on tag combinations and exposes them via API, with support for audience sizing and multi‑channel integration.
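
Tag combinations map naturally onto set operations. The sketch below assumes a hypothetical inverted index from tag name to user IDs and a `select_audience` function with `all_of`, `any_of`, and `none_of` parameters; none of these names come from the article.

```python
def select_audience(tag_index, all_of=(), any_of=(), none_of=()):
    """Return the set of user IDs matching the tag combination."""
    if all_of:
        # Users must carry every tag in all_of.
        users = set.intersection(*(set(tag_index.get(t, ())) for t in all_of))
    else:
        # No required tags: start from every tagged user.
        users = set().union(*tag_index.values())
    if any_of:
        # Users must carry at least one tag in any_of.
        users &= set().union(*(tag_index.get(t, ()) for t in any_of))
    if none_of:
        # Users must carry no tag in none_of.
        users -= set().union(*(tag_index.get(t, ()) for t in none_of))
    return users


TAG_INDEX = {
    "female": {1, 2, 3},
    "age_25_34": {2, 3, 4},
    "churned": {3},
}

audience = select_audience(TAG_INDEX, all_of=["female", "age_25_34"], none_of=["churned"])
audience_size = len(audience)  # audience sizing before pushing to a channel
```

At production scale the same logic is usually run on compressed bitmaps rather than Python sets, but the set algebra is the same.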

Offline Platform

Illustrates product functions, scheduling modules, and overall architecture for batch processing.

Real‑Time Platform

Meituan Dianping

Uses Grafana for embedded monitoring.

bilibili

SQL‑based programming

DAG drag‑and‑drop

Integrated operation and maintenance

The real‑time platform consists of a transmission layer and a computation layer, together with unified metadata, lineage, and permission management. The transmission layer ingests logs, binlogs, and app data into Kafka. The computation layer runs Flink jobs written in BSQL and scheduled on YARN, keeping state in RocksDB, MapDB, or Redis. Results are written to Kafka, HBase, ES, MySQL, or TiDB for downstream AI, BI, and reporting.

Event Management

Coordinates the Server (request handling), Kernel (execution), and Admin (verification) modules so that only one operator can act on a task at a time, while keeping the platform highly available.

Platform Task State Management

Server initiates tasks and creates distributed locks; Kernel executes shell scripts; Admin monitors locks, updates YARN status, and releases locks for subsequent operations.
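
The lock hand-off between Server and Admin can be illustrated with an in-memory stand-in. This is only a sketch: `InMemoryTaskLocks` replaces the real distributed lock, the function names are hypothetical, and the set of YARN states that trigger a release is an assumption.

```python
class InMemoryTaskLocks:
    """Single-process stand-in for a distributed lock service."""

    def __init__(self):
        self._holders = {}  # task_id -> operation currently holding the lock

    def acquire(self, task_id, operation):
        if task_id in self._holders:
            return False
        self._holders[task_id] = operation
        return True

    def release(self, task_id):
        self._holders.pop(task_id, None)


def begin_operation(locks, task_id, operation):
    """Server side: take the lock before handing the task to the Kernel."""
    if not locks.acquire(task_id, operation):
        return "rejected"
    return "submitted"


# Assumption: the lock is released once YARN reports one of these states.
SETTLED = {"RUNNING", "FINISHED", "FAILED", "KILLED"}


def on_yarn_status(locks, task_id, status, states):
    """Admin side: record the YARN status and release the lock when settled."""
    states[task_id] = status
    if status in SETTLED:
        locks.release(task_id)
```

While a start request is still in flight (for example, the application is only `ACCEPTED`), a second request on the same task is rejected; once Admin sees a settled state, the next operation can proceed.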

Task Debugging

SQL tasks can be debugged with custom CSV inputs; Sloth‑server assembles requests, invokes kernels, collects logs, and returns results.

Log Retrieval

Filebeat forwards task logs to Kafka, Logstash parses them, and ES stores them for Kibana UI search and on‑screen display.

Monitoring & Alerting

Metrics are stored in InfluxDB and a proprietary time‑series database (NTSDB) and visualized in Grafana; alerts are sent through internal chat, email, phone, or SMS.

Real‑Time Data Warehouse

Collects logs and event data into Kafka, processes them in real‑time, writes detailed ODS data to Redis/Kudu, and serves applications via data services.

Offline vs. Real‑Time Data Warehouse

Offline: built with Sqoop, DataX, Hive; provides T+1 data refreshed daily.

Real‑time: ingests via Canal to Kafka, stores in HBase/OLAP, offers minute‑level or sub‑second queries.

Data Warehouse Construction

Involves data collection, processing, archiving, and application, supporting reporting, ad‑hoc queries, BI, analysis, mining, and model training.

Key Points for Real‑Time Warehouse

End‑to‑end latency and traffic monitoring

Rapid fault recovery

Back‑track capability for specific time windows

Hybrid query: real‑time for recent data, offline for T+1 correction

Data map and lineage

Real‑time data quality monitoring
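
The hybrid query point in the list above can be sketched as a merge of two result sets: offline values win for days whose T+1 batch has landed, and real-time values fill in the rest. The `hybrid_query` function, its parameters, and the sample figures are hypothetical.

```python
from datetime import date


def hybrid_query(offline, realtime, offline_ready_through):
    """
    offline / realtime: dict mapping day -> metric value
    offline_ready_through: last day for which the offline (T+1) result exists
    """
    merged = dict(realtime)  # start from the real-time view
    for day, value in offline.items():
        if day <= offline_ready_through:
            merged[day] = value  # offline result corrects the real-time one
    return dict(sorted(merged.items()))


offline = {date(2020, 12, 16): 100, date(2020, 12, 17): 205}
realtime = {date(2020, 12, 17): 200, date(2020, 12, 18): 42}

result = hybrid_query(offline, realtime, offline_ready_through=date(2020, 12, 17))
```

In this example the real-time figure for December 17 (200) is replaced by the corrected offline figure (205), while December 18 is still served from the real-time path.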

Solution Overview

Provides industry‑specific implementations (retail, securities, manufacturing, media, etc.) and emphasizes the need for a powerful OLAP database to support both offline and real‑time analytics.

