Unlocking the Data Middle Platform: From Ingestion to Real‑Time Analytics
This article provides a comprehensive overview of data middle platform concepts, covering data aggregation, collection tools, development modules, job scheduling, baseline control, heterogeneous storage, permission management, real‑time and offline processing, governance, services, and implementation details for building robust big‑data solutions.
Data Middle Platform
A data middle platform (DMP) is a core capability that aggregates information from heterogeneous networks and data sources into a centralized repository for downstream processing and modeling.
Data Aggregation
Core collection mechanisms include database synchronization, event tracking, web crawlers, and message queues. Data can be collected in offline batches or in real time.
Data Collection Tools
Canal
DataX
Sqoop
Data Development
Provides offline, real‑time, and algorithm development tools for developers and analysts.
Offline Development
Job Scheduling
Dependency scheduling: a job starts only after all parent jobs finish.
Time scheduling: a job can be set to start at a specific time.
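The two scheduling rules above can be combined into a single readiness check. The following is a minimal sketch, not the platform's actual scheduler; the function name, the job names, and the dictionary layout are all assumptions made for illustration.

```python
from datetime import datetime

def runnable_jobs(parents, start_at, finished, now):
    """Return the jobs that may start at `now`, sorted by name.

    parents:  job -> list of parent jobs (dependency scheduling)
    start_at: job -> earliest start time (time scheduling, optional per job)
    finished: set of jobs that have already completed
    """
    ready = []
    for job, deps in parents.items():
        if job in finished:
            continue
        # Dependency rule: every parent job must have finished.
        deps_done = all(dep in finished for dep in deps)
        # Time rule: jobs without a configured time are not held back.
        time_ok = job not in start_at or now >= start_at[job]
        if deps_done and time_ok:
            ready.append(job)
    return sorted(ready)

# Hypothetical job graph and start times, for illustration only.
parents = {
    "ods_orders": [],
    "dim_user": [],
    "dwd_orders": ["ods_orders"],
    "ads_report": ["dwd_orders", "dim_user"],
}
start_at = {
    "ods_orders": datetime(2024, 1, 1, 1, 0),
    "dim_user": datetime(2024, 1, 1, 2, 0),
}
```

Calling runnable_jobs repeatedly as jobs finish and time advances walks the graph from the source jobs toward the reports.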
Baseline Control
Predicts job completion time using algorithms; if a job cannot finish on time, the scheduler alerts operations staff for early intervention.
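The source does not say which prediction algorithm is used. As one simple possibility, the sketch below estimates a job's finish time by adding historical average run times along its dependency chain and compares the result with the baseline deadline. The function names, the use of averages, and the assumption that every unfinished job can start as soon as its parents are done are all illustrative.

```python
from datetime import datetime, timedelta

def predict_finish(job, parents, avg_minutes, finished_at, now):
    """Estimate when `job` will finish.

    parents:     job -> list of parent jobs
    avg_minutes: job -> historical average run time in minutes (assumed input)
    finished_at: job -> actual finish time for jobs that are already done
    Jobs that are currently running are treated as if they start at `now`.
    """
    if job in finished_at:
        return finished_at[job]
    parent_finishes = [
        predict_finish(p, parents, avg_minutes, finished_at, now)
        for p in parents.get(job, [])
    ]
    # A job cannot start before its last parent finishes, nor in the past.
    earliest_start = max(parent_finishes + [now])
    return earliest_start + timedelta(minutes=avg_minutes[job])

def breaks_baseline(job, baseline, parents, avg_minutes, finished_at, now):
    """True if the predicted finish time is later than the baseline deadline."""
    return predict_finish(job, parents, avg_minutes, finished_at, now) > baseline
```

A scheduler could run this check periodically and alert operations staff as soon as breaks_baseline turns true, which is typically well before the deadline itself passes.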
Heterogeneous Storage
Each storage or compute engine (Oracle, Hive, Spark, MapReduce, etc.) has a dedicated plugin; the platform automatically selects the appropriate plugin based on the job type.
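Selecting a plugin by job type is commonly done with a registry keyed on the type. The sketch below shows that pattern; the registry, the decorator, and the placeholder plugins (which only return a string instead of submitting anything) are assumptions, not the platform's real plugin interface.

```python
PLUGINS = {}  # job type -> plugin callable (hypothetical registry)

def register(job_type):
    """Decorator that registers a plugin for one job type."""
    def decorator(func):
        PLUGINS[job_type] = func
        return func
    return decorator

@register("hive")
def run_hive(job):
    # Placeholder: a real plugin would submit the SQL to Hive.
    return f"[hive] {job['sql']}"

@register("spark")
def run_spark(job):
    # Placeholder: a real plugin would submit the application to Spark.
    return f"[spark] {job['main_class']}"

def dispatch(job):
    """Pick the plugin that matches the job's type and run it."""
    job_type = job["type"]
    plugin = PLUGINS.get(job_type)
    if plugin is None:
        raise ValueError(f"no plugin registered for job type {job_type!r}")
    return plugin(job)
```

With this layout, supporting a new engine means registering one more function; the dispatch code does not change.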
Code Validation
SQL checkers enforce strict pre‑execution validation for common SQL tasks.
Multi‑Environment Cascading
Supports single, classic, and complex environments, each with isolated Hive databases, YARN queues, and possibly separate Hadoop clusters.
Recommended Dependencies
Uses table‑level lineage graphs to find upstream jobs, performs cycle detection, and returns suitable dependency lists.
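As a rough illustration of the idea, the sketch below recommends upstream jobs by matching the tables a job reads against the tables other jobs write, then drops any candidate that already depends on the job, since adding it would create a cycle. The data layout and all names are assumptions.

```python
def reaches(start, target, depends_on):
    """True if `target` is a direct or transitive upstream of `start`."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(depends_on.get(node, ()))
    return False

def recommend_dependencies(job, reads, writes, depends_on):
    """Suggest upstream jobs for `job` from table-level lineage.

    reads:      job -> set of tables the job reads
    writes:     job -> set of tables the job writes
    depends_on: job -> set of upstream jobs already configured
    """
    # Invert the write lineage: table -> jobs that produce it.
    producers = {}
    for producer, tables in writes.items():
        for table in tables:
            producers.setdefault(table, set()).add(producer)

    candidates = set()
    for table in reads.get(job, ()):
        candidates |= producers.get(table, set())
    candidates.discard(job)

    # Cycle detection: skip candidates that already sit downstream of `job`.
    return sorted(c for c in candidates if not reaches(c, job, depends_on))
```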
Data Permissions
Addresses challenges of diverse permission systems (e.g., Oracle, HANA, Sentry, Ranger) and provides a unified UI for request, approval, and audit of data access.
Real‑Time Development
Metadata management
SQL‑driven processing
Componentized development
Intelligent Operations
Integrates task management, code deployment, monitoring, and alerting to improve efficiency, supporting re‑run, downstream re‑run, and data back‑fill.
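A downstream re-run needs the set of jobs that depend, directly or indirectly, on the job being repaired. The sketch below collects that set with a breadth-first traversal; the function name and the job graph format are assumptions, and the actual execution order is left to the dependency scheduler.

```python
from collections import deque

def downstream_jobs(job, children):
    """Return every job that depends on `job`, in breadth-first discovery order.

    children: job -> list of jobs that consume its output
    The returned order is not a schedule; dependency scheduling still
    decides when each job may actually start.
    """
    seen = {job}
    queue = deque([job])
    found = []
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                found.append(child)
                queue.append(child)
    return found
```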
Data System
Combines data aggregation and development modules to form a traditional data warehouse capability, enabling the construction of an enterprise‑wide data system.
Full‑domain coverage
Clear hierarchical structure
Accurate and consistent data
Performance optimization
Cost reduction through data sharing
Ease of use with pre‑processed data
Data Asset Management
Manages catalogs, metadata, quality, lineage, and lifecycle, presenting assets to enhance data awareness.
Data Governance
Includes standards, metadata, quality, security, and lifecycle management.
Data Service System
Transforms data assets into services, enabling rapid development of business middle platforms.
Query Service
Provides API‑based data retrieval with configurable identifiers, filters, sorting, and pagination.
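To make the filter, sort, and pagination options concrete, here is a small in-memory sketch of what such a query endpoint might do behind the API. The function signature and the response shape are assumptions; a real service would push these operations down to the storage engine.

```python
def query(rows, filters=None, sort_by=None, descending=False, page=1, page_size=10):
    """Filter, sort, and paginate a list of row dictionaries.

    filters: field -> required value (equality match only in this sketch)
    page:    1-based page number
    """
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be at least 1")
    filters = filters or {}
    matched = [
        row for row in rows
        if all(row.get(field) == value for field, value in filters.items())
    ]
    if sort_by is not None:
        matched.sort(key=lambda row: row[sort_by], reverse=descending)
    start = (page - 1) * page_size
    return {"total": len(matched), "items": matched[start:start + page_size]}
```

Returning the total alongside the current page lets a caller compute how many pages remain without issuing a second request.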
Analysis Service
Supports multi‑source data access, high‑performance ad‑hoc queries, multi‑dimensional analysis, and deep data mining.
Recommendation Service
Generates personalized recommendations by mining user‑item interactions, supporting industry‑specific logic, various scenarios, and continuous model optimization.
Crowd Service
Selects target user groups based on tag combinations and exposes them via API, with support for audience sizing and multi‑channel integration.
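Tag-based selection reduces to set operations over each user's tags. The sketch below supports three kinds of conditions (must have all, must have at least one, must have none); the parameter names and tag values are assumptions, and audience sizing is simply the size of the result.

```python
def select_audience(user_tags, all_of=(), any_of=(), none_of=()):
    """Return the set of users whose tags satisfy the combination.

    user_tags: user id -> set of tags
    all_of:    tags the user must have, every one of them
    any_of:    tags of which the user must have at least one (ignored if empty)
    none_of:   tags the user must not have
    """
    all_of, any_of, none_of = set(all_of), set(any_of), set(none_of)
    selected = set()
    for user, tags in user_tags.items():
        if not all_of <= tags:
            continue
        if any_of and not (any_of & tags):
            continue
        if none_of & tags:
            continue
        selected.add(user)
    return selected
```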
Offline Platform
Illustrates product functions, scheduling modules, and overall architecture for batch processing.
Real‑Time Platform
Meituan Dianping
Uses Grafana for embedded monitoring.
bilibili
SQL‑based programming
DAG drag‑and‑drop
Integrated operation and maintenance
The real‑time platform consists of a transmission layer and a computation layer, together with unified metadata, lineage, and permission management. The transmission layer ingests logs, binlogs, and app data into Kafka. The computation layer runs on Flink (BSQL) with YARN scheduling, uses RocksDB, MapDB, and Redis for state, and writes results to Kafka, HBase, ES, MySQL, and TiDB for downstream AI, BI, and reporting.
Event Management
Coordinates the Server (request handling), Kernel (execution), and Admin (verification) modules so that only one operation acts on a task at a time, while keeping the platform highly available.
Platform Task State Management
Server initiates tasks and creates distributed locks; Kernel executes shell scripts; Admin monitors locks, updates YARN status, and releases locks for subsequent operations.
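The lock handoff described above can be sketched as follows. The source does not name the lock service, so an in-process dictionary stands in for the distributed lock here, and the class, function names, and status strings are all hypothetical.

```python
class TaskLocks:
    """In-process stand-in for a distributed lock service (assumption)."""

    def __init__(self):
        self._held = {}  # task id -> operation currently holding the lock

    def acquire(self, task_id, operation):
        if task_id in self._held:
            return False
        self._held[task_id] = operation
        return True

    def release(self, task_id):
        self._held.pop(task_id, None)

def server_submit(locks, task_id, operation, kernel):
    """Server: take the task's lock, then hand the operation to the kernel."""
    if not locks.acquire(task_id, operation):
        return "rejected"  # another operation on this task is still in flight
    kernel(task_id, operation)
    return "accepted"

def admin_confirm(locks, task_id, yarn_state, task_status):
    """Admin: record the state observed on YARN, then release the lock."""
    task_status[task_id] = yarn_state
    locks.release(task_id)
```

In this sketch a second operation on the same task is rejected until Admin has confirmed the outcome of the first, which matches the sequence of lock creation, execution, and release given above.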
Task Debugging
SQL tasks can be debugged with custom CSV inputs; Sloth‑server assembles requests, invokes kernels, collects logs, and returns results.
Log Retrieval
Filebeat forwards task logs to Kafka, Logstash parses them, and ES stores them for Kibana UI search and on‑screen display.
Monitoring & Alerting
Metrics are collected via InfluxDB and a proprietary time‑series database (NTSDB), visualized in Grafana, with alerts sent through internal chat, email, phone, or SMS.
Real‑Time Data Warehouse
Collects logs and event data into Kafka, processes them in real‑time, writes detailed ODS data to Redis/Kudu, and serves applications via data services.
Offline vs. Real‑Time Data Warehouse
Offline: built with Sqoop, DataX, Hive; provides T+1 data refreshed daily.
Real‑time: ingests changes via Canal into Kafka, stores them in HBase or an OLAP store, and offers minute‑level or sub‑second queries.
Data Warehouse Construction
Involves data collection, processing, archiving, and application, supporting reporting, ad‑hoc queries, BI, analysis, mining, and model training.
Key Points for Real‑Time Warehouse
End‑to‑end latency and traffic monitoring
Rapid fault recovery
Back‑track capability for specific time windows
Hybrid query: real‑time for recent data, offline for T+1 correction
Data map and lineage
Real‑time data quality monitoring
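The hybrid query point in the list above can be illustrated with a small merge function: offline T+1 results are treated as authoritative for every day they cover, and real‑time results fill in the days after that. The function name, the per-day dictionary layout, and the cutoff parameter are assumptions.

```python
from datetime import date

def hybrid_daily_totals(offline, realtime, offline_through):
    """Merge per-day metrics from the offline and real-time warehouses.

    offline:         date -> value from the T+1 batch (corrected data)
    realtime:        date -> value from the streaming pipeline
    offline_through: last date for which the offline batch has completed
    """
    # Recent days come from the real-time side.
    merged = {day: value for day, value in realtime.items() if day > offline_through}
    # Covered days come from the offline side, overriding any real-time value.
    merged.update(
        {day: value for day, value in offline.items() if day <= offline_through}
    )
    return dict(sorted(merged.items()))
```

When the next batch completes, advancing offline_through by one day replaces that day's real‑time figure with the corrected offline one.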
Solution Overview
Provides industry‑specific implementations (retail, securities, manufacturing, media, etc.) and emphasizes the need for a powerful OLAP database to support both offline and real‑time analytics.
ITFLY8 Architecture Home
