Data Middle Platform: Concepts, Architecture, and Implementation

This article provides a comprehensive overview of data middle platforms, covering data aggregation, collection tools, offline and real‑time development, job scheduling, data governance, multi‑layer architecture, ETL processes, and various industry use cases, illustrating how enterprises build and manage unified data assets.

Architects' Tech Alliance

Data Middle Platform

The data middle platform (DMP) is a unified data layer that aggregates heterogeneous data sources, provides centralized storage, and supports downstream data processing, modeling, and analytics.

Data Aggregation

Collection tools pull data from heterogeneous sources through database synchronization, embedded tracking (event instrumentation), web crawlers, or message queues, and support both batch and real‑time ingestion.

Data Collection Tools

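Typical tools include Canal (incremental sync based on the MySQL binlog), DataX (offline batch synchronization between heterogeneous sources), and Sqoop (bulk transfer between relational databases and Hadoop).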

Data Development

Provides offline, real‑time, and algorithm development environments for developers and analysts.

Offline Development

Features job scheduling with dependency‑based and time‑based triggers, baseline control for long‑running jobs, plugins for heterogeneous storage and compute engines (e.g., Oracle, Hive, Spark), SQL code validation, and multi‑environment cascading (single, classic, and complex environment setups).
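
The combination of time‑based and dependency‑based triggers can be sketched as a readiness check: a job becomes runnable once its scheduled time has passed and every upstream job has succeeded. This is a minimal illustration, not the scheduler of any particular platform; the job names, the `JOBS` table, and `runnable_jobs` are hypothetical.

```python
from datetime import datetime

# Hypothetical job definitions: name -> (scheduled time, upstream job names).
JOBS = {
    "ods_orders": (datetime(2024, 1, 2, 1, 0), []),
    "dw_orders": (datetime(2024, 1, 2, 2, 0), ["ods_orders"]),
    "ads_sales_report": (datetime(2024, 1, 2, 3, 0), ["dw_orders"]),
}


def runnable_jobs(jobs, succeeded, now):
    """Return jobs whose time trigger has fired and whose upstream jobs all succeeded."""
    ready = []
    for name, (scheduled_at, upstream) in jobs.items():
        if name in succeeded:
            continue  # already done, nothing to trigger
        time_ok = now >= scheduled_at
        deps_ok = all(dep in succeeded for dep in upstream)
        if time_ok and deps_ok:
            ready.append(name)
    return sorted(ready)


now = datetime(2024, 1, 2, 2, 30)
first = runnable_jobs(JOBS, succeeded=set(), now=now)
second = runnable_jobs(JOBS, succeeded={"ods_orders"}, now=now)
```

At 02:30, `dw_orders` has passed its time trigger but stays blocked until `ods_orders` succeeds, while `ads_sales_report` is held back by its 03:00 schedule regardless of upstream state.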

Recommended Dependencies

Uses table‑level lineage graphs to suggest upstream jobs, detects cycles, and returns suitable dependency lists.
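
One way to picture this recommendation step, assuming the platform knows which tables each job reads and writes: the producers of a job's input tables are its suggested upstream jobs, and the resulting job graph is checked for cycles before the list is returned. The `JOB_IO` metadata and both function names are hypothetical; cycle detection here relies on Python's standard `graphlib` module.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical metadata: the tables each job reads and writes.
JOB_IO = {
    "load_orders": {"reads": [], "writes": ["ods.orders"]},
    "load_users": {"reads": [], "writes": ["ods.users"]},
    "build_order_detail": {
        "reads": ["ods.orders", "ods.users"],
        "writes": ["dw.order_detail"],
    },
}


def recommend_upstream(job_io, reads):
    """Suggest the jobs that produce any table listed in `reads`."""
    producers = {table: job for job, io in job_io.items() for table in io["writes"]}
    return sorted({producers[t] for t in reads if t in producers})


def has_cycle(job_io):
    """Return True if the job graph derived from table lineage contains a cycle."""
    graph = {
        job: set(recommend_upstream(job_io, io["reads"]))
        for job, io in job_io.items()
    }
    try:
        TopologicalSorter(graph).prepare()  # raises CycleError on a cycle
    except CycleError:
        return True
    return False


suggested = recommend_upstream(JOB_IO, JOB_IO["build_order_detail"]["reads"])
```

Tables with no known producer (for example, externally loaded data) are simply skipped rather than reported, which is a simplification of this sketch.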

Data Permissions

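Different engines ship with different permission systems, which makes access hard to manage consistently. The platform supports both RBAC (role‑based access control, e.g., Sentry) and PBAC (policy‑based access control, e.g., Ranger) and exposes them through a centralized permission management portal.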
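
To make the difference between the two models concrete, the sketch below evaluates one access request against a role‑based grant table and a policy list behind a single entry point, which is roughly the job of a centralized portal. It does not use the Sentry or Ranger APIs; `Policy`, `USER_ROLES`, `ROLE_GRANTS`, `POLICIES`, and `is_allowed` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy: listed users may perform listed actions under a resource prefix."""
    resource_prefix: str
    users: frozenset
    actions: frozenset


# Role-based model: users map to roles, roles map to (resource, action) grants.
USER_ROLES = {"alice": {"analyst"}}
ROLE_GRANTS = {"analyst": {("hive.dw.orders", "select")}}

# Policy-based model: rules are attached to resources directly.
POLICIES = [Policy("hbase.rt.", frozenset({"bob"}), frozenset({"read"}))]


def is_allowed(user, resource, action):
    """Single entry point that consults both permission models."""
    via_role = any(
        (resource, action) in ROLE_GRANTS.get(role, set())
        for role in USER_ROLES.get(user, set())
    )
    via_policy = any(
        resource.startswith(p.resource_prefix)
        and user in p.users
        and action in p.actions
        for p in POLICIES
    )
    return via_role or via_policy
```

For example, `is_allowed("alice", "hive.dw.orders", "select")` is granted through her role, while `is_allowed("bob", "hbase.rt.clicks", "read")` is granted through a policy on the resource prefix.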

Real‑time Development

Includes metadata management, SQL‑driven programming, componentized development, and intelligent operations such as task management, code release, monitoring, and alerting.

Data System

Defines a layered architecture: ODS (raw source), DW (detail and summary layers), TDM (tag data), and ADS (application data). Emphasizes full‑domain coverage, clear hierarchy, data consistency, performance, cost reduction, and usability.
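
A clear hierarchy is easier to keep when it is checked mechanically. The sketch below assumes, purely for illustration, that every table name starts with its layer prefix (`ods_`, `dw_`, `tdm_`, `ads_`) and that data should only flow from earlier layers to later ones; it then reports lineage edges that point backwards. The naming convention and the function names are assumptions, not something the source prescribes.

```python
# Layer order as listed above, from raw source data to application data.
LAYER_ORDER = ["ods", "dw", "tdm", "ads"]
RANK = {layer: i for i, layer in enumerate(LAYER_ORDER)}


def layer_of(table):
    """Derive the layer from a table name such as 'dw_order_detail' (assumed convention)."""
    prefix = table.split("_", 1)[0].lower()
    if prefix not in RANK:
        raise ValueError(f"table {table!r} has no recognised layer prefix")
    return prefix


def backward_edges(lineage):
    """Return (upstream, downstream) pairs where a later layer feeds an earlier one."""
    return [
        (up, down)
        for up, down in lineage
        if RANK[layer_of(up)] > RANK[layer_of(down)]
    ]


lineage = [
    ("ods_orders", "dw_order_detail"),
    ("dw_order_detail", "tdm_user_tags"),
    ("ads_sales_board", "dw_order_detail"),  # deliberately wrong direction
]
problems = backward_edges(lineage)
```

Here `problems` contains only the `ads_sales_board` to `dw_order_detail` edge, the one dependency that runs against the layer order.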

Data Asset Management

Manages catalogs, metadata, quality, lineage, and lifecycle, providing visibility and value assessment for enterprise data assets.

Data Governance

Covers standards, metadata, quality, security, and lifecycle management.
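
Quality management usually comes down to rules evaluated against tables. As a minimal sketch, the two checks below measure completeness (null rate of a column) and uniqueness (duplicate keys) over rows held as dictionaries. The rule set, the sample rows, and the function names are illustrative assumptions; the source does not specify which rules a platform should enforce.

```python
def null_rate(rows, column):
    """Fraction of rows where `column` is missing or None (0.0 for an empty table)."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def duplicate_keys(rows, key):
    """Sorted list of key values that appear more than once."""
    seen, duplicates = set(), set()
    for row in rows:
        value = row[key]
        if value in seen:
            duplicates.add(value)
        seen.add(value)
    return sorted(duplicates)


# Hypothetical sample rows.
orders = [
    {"order_id": 1, "user_id": 10},
    {"order_id": 2, "user_id": None},
    {"order_id": 2, "user_id": 11},
    {"order_id": 3, "user_id": 12},
]
user_id_null_rate = null_rate(orders, "user_id")
repeated_order_ids = duplicate_keys(orders, "order_id")
```

In practice such results would be compared against thresholds and fed into alerting, but those thresholds are a policy decision outside this sketch.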

Data Service System

Transforms data assets into services via query, analysis, recommendation, and audience‑circling APIs, supporting various business scenarios.
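
Audience circling can be understood as set algebra over tags: select users who carry all required tags and none of the excluded ones. The sketch below assumes an in‑memory index from tag to user IDs; a production service would typically back this with bitmaps or a search engine. `TAG_INDEX`, the tag names, and `circle_audience` are hypothetical.

```python
# Hypothetical tag index: tag -> set of user IDs carrying that tag.
TAG_INDEX = {
    "city=Shanghai": {1, 2, 3},
    "age=25-34": {2, 3, 4},
    "churn_risk=high": {3},
}


def circle_audience(include, exclude=()):
    """Users that carry every tag in `include` and none of the tags in `exclude`."""
    if not include:
        return set()
    # set.intersection returns a new set, so the index itself is never mutated.
    audience = set.intersection(*(TAG_INDEX.get(tag, set()) for tag in include))
    for tag in exclude:
        audience -= TAG_INDEX.get(tag, set())
    return audience


target = circle_audience(
    include=["city=Shanghai", "age=25-34"],
    exclude=["churn_risk=high"],
)
```

Users 2 and 3 match both included tags; user 3 is then dropped by the exclusion, leaving `{2}` as the target audience.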

Offline Platform

Illustrates architecture, scheduling modules, and task dependency mechanisms (e.g., FTP event triggers, diagnostic platform).

Real‑time Platform

Describes implementations at Meituan‑Dianping and Bilibili, featuring data ingestion (Canal, Kafka), real‑time computation (Flink, BSQL), state management (RocksDB, Redis), and downstream storage (Redis, HBase, Elasticsearch, MySQL).
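
The role of state in real‑time computation can be illustrated without a streaming engine. The sketch below counts events per key in one‑minute tumbling windows, with a plain dictionary standing in for the keyed state that an engine would keep in a backend such as RocksDB. It is plain Python, not the Flink or BSQL API, and the event format and names are assumptions.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # tumbling window size, chosen for illustration


def tumbling_counts(events):
    """Count events per (window_start, key).

    `events` is an iterable of (epoch_seconds, key) pairs.
    """
    state = defaultdict(int)  # stands in for keyed state in a state backend
    for timestamp, key in events:
        # Align the timestamp to the start of its window.
        window_start = timestamp - timestamp % WINDOW_SECONDS
        state[(window_start, key)] += 1
    return dict(state)


# Hypothetical click events: (epoch seconds, page key).
events = [(0, "home"), (59, "home"), (60, "home"), (61, "cart")]
counts = tumbling_counts(events)
```

The events at seconds 0 and 59 fall into the window starting at 0, while those at 60 and 61 open a new window. A real engine additionally handles out‑of‑order events and decides when a window is final, which this sketch leaves out.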

Event Management & Task Status

Explains Server‑Kernel‑Admin workflow, distributed locks, high‑availability strategies, and task debugging with SQL‑based input simulation.

Log Retrieval & Monitoring

Logs are collected via Filebeat → Kafka → Logstash → Elasticsearch, visualized in Kibana; metrics are monitored with InfluxDB and a proprietary time‑series DB, with alerts via various channels.

Real‑time vs. Offline Data Warehouse

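Offline warehouses use batch tools (Sqoop, DataX, Hive) to deliver T+1 data, meaning data for a given day becomes available the following day. Real‑time warehouses ingest via Canal/Kafka and write results to low‑latency serving stores such as HBase, so that queries reflect data within about a minute.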
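
The T+1 convention is easiest to see as date arithmetic: a batch run on a given day processes the previous day's partition. The sketch below assumes daily partitions named `YYYYMMDD`, which is a common but not universal choice; the function name is hypothetical.

```python
from datetime import date, timedelta


def t_plus_1_partition(run_date):
    """Return the business-date partition processed by a batch run on `run_date`."""
    business_date = run_date - timedelta(days=1)
    return business_date.strftime("%Y%m%d")  # assumed partition naming


partition = t_plus_1_partition(date(2024, 3, 1))
```

A run on 1 March 2024 therefore reads the partition for 29 February 2024, so the freshest data an offline report can show is always at least one day old.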

Industry Solutions

Provides case studies for retail, securities, e‑commerce, manufacturing, media, and law enforcement, highlighting specific KPI definitions (RPS, ROI, CPC, CPA, CPM, CVR, CTR, PV).
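
Several of these KPIs are simple ratios over the same handful of counters. The sketch below uses the commonly accepted advertising definitions, which may differ in detail from those in the original case studies; RPS and PV are left out because RPS has more than one common expansion and PV is a raw count rather than a ratio. The function and parameter names are assumptions.

```python
def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def ad_kpis(impressions, clicks, conversions, cost, revenue):
    """Compute common advertising KPIs from raw counters."""
    return {
        "CTR": safe_div(clicks, impressions),       # click-through rate
        "CVR": safe_div(conversions, clicks),       # conversion rate
        "CPC": safe_div(cost, clicks),              # cost per click
        "CPA": safe_div(cost, conversions),         # cost per acquisition
        "CPM": safe_div(cost * 1000, impressions),  # cost per thousand impressions
        "ROI": safe_div(revenue - cost, cost),      # return on investment
    }


# Hypothetical campaign counters.
kpis = ad_kpis(impressions=10_000, clicks=200, conversions=10, cost=100.0, revenue=300.0)
```

Fixing such definitions once in shared code is one way a data middle platform keeps metrics consistent across business lines that would otherwise each compute them slightly differently.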
