Data Middle Platform: Concepts, Architecture, and Implementation
This article provides a comprehensive overview of data middle platforms, covering data aggregation, collection tools, offline and real‑time development, job scheduling, data governance, multi‑layer architecture, ETL processes, and various industry use cases, illustrating how enterprises build and manage unified data assets.
Data Middle Platform
The data middle platform (DMP) is a unified data layer that aggregates heterogeneous data sources, provides centralized storage, and supports downstream data processing, modeling, and analytics.
Data Aggregation
Collection tools gather data from varied networks and sources through database synchronization, embedded tracking, web crawlers, or message queues, and support both batch and real‑time ingestion.
Data Collection Tools
Typical tools include Canal, DataX, and Sqoop.
Data Development
Provides offline, real‑time, and algorithm development environments for developers and analysts.
Offline Development
Features job scheduling with dependency and time‑based triggers, baseline control for long‑running jobs, plugins for heterogeneous storage and compute engines (e.g., Oracle, Hive, Spark), SQL code validation, and multi‑environment cascading (single, classic, and complex environments).
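As a minimal sketch of how dependency and time‑based triggers can combine, the function below treats a job as runnable only when its scheduled time has passed and every upstream job has succeeded. The job names, the dictionary layout, and the function name runnable_jobs are hypothetical and do not come from any particular scheduler.

```python
from datetime import datetime


def runnable_jobs(jobs, finished, now):
    """Return jobs whose time trigger has fired and whose upstream jobs are all done.

    jobs maps a job name to {"start_at": datetime, "upstream": [job names]}.
    finished is the set of job names that have already succeeded.
    """
    ready = []
    for name, spec in jobs.items():
        if name in finished:
            continue  # already ran in this cycle
        time_ok = spec["start_at"] <= now
        deps_ok = all(up in finished for up in spec["upstream"])
        if time_ok and deps_ok:
            ready.append(name)
    return sorted(ready)


# Hypothetical nightly jobs, for illustration only.
jobs = {
    "sync_orders": {"start_at": datetime(2024, 1, 2, 0, 30), "upstream": []},
    "build_dw_orders": {"start_at": datetime(2024, 1, 2, 1, 0), "upstream": ["sync_orders"]},
    "daily_report": {"start_at": datetime(2024, 1, 2, 6, 0), "upstream": ["build_dw_orders"]},
}
now = datetime(2024, 1, 2, 2, 0)
first_wave = runnable_jobs(jobs, finished=set(), now=now)
second_wave = runnable_jobs(jobs, finished={"sync_orders"}, now=now)
```

A real scheduler would also track failures, retries, and baselines; this sketch covers only the trigger condition.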
Recommended Dependencies
Uses table‑level lineage graphs to suggest upstream jobs, detects cycles, and returns suitable dependency lists.
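One way to picture this recommendation step, assuming lineage is available as simple job‑to‑table mappings: collect the jobs that produce the tables a new job reads, then drop any candidate that already depends on the new job, since adding it would create a cycle. The function names and sample jobs are hypothetical.

```python
from collections import defaultdict


def depends_on(job, target, job_inputs, producers):
    """Return True if `job` reads, directly or transitively, from `target`."""
    seen, stack = set(), [job]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        for table in job_inputs.get(current, ()):
            for upstream in producers.get(table, ()):
                if upstream == target:
                    return True
                stack.append(upstream)
    return False


def recommend_dependencies(job_inputs, job_outputs, new_job):
    """Suggest upstream jobs for new_job from table-level lineage."""
    # Index each table by the jobs that write it.
    producers = defaultdict(set)
    for job, tables in job_outputs.items():
        for table in tables:
            producers[table].add(job)

    # Candidates are the producers of every table the new job reads.
    candidates = set()
    for table in job_inputs.get(new_job, ()):
        candidates |= producers.get(table, set())
    candidates.discard(new_job)

    # Skip candidates that already depend on new_job: they would close a cycle.
    return sorted(
        c for c in candidates
        if not depends_on(c, new_job, job_inputs, producers)
    )


# Hypothetical lineage: "feedback" reads the summary and writes back to ods_orders.
job_inputs = {
    "load_orders": set(),
    "load_users": set(),
    "order_summary": {"ods_orders", "ods_users"},
    "feedback": {"dw_order_summary"},
}
job_outputs = {
    "load_orders": {"ods_orders"},
    "load_users": {"ods_users"},
    "order_summary": {"dw_order_summary"},
    "feedback": {"ods_orders"},
}
suggested = recommend_dependencies(job_inputs, job_outputs, "order_summary")
```

Here "feedback" is excluded because it already consumes the output of "order_summary".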
Data Permissions
Addresses the challenge of disparate permission systems across engines by supporting role‑based access control (RBAC, e.g., Sentry) and policy‑based access control (PBAC, e.g., Ranger) behind a centralized permission management portal.
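The idea of a single permission check that spans engines can be sketched with a small RBAC model: users hold roles, and roles hold permissions scoped to an engine, a resource, and an action. This is an illustration only. It does not use or mirror the Sentry or Ranger APIs, and every name in it is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Permission:
    engine: str    # e.g. "hive"
    resource: str  # e.g. "dw.orders"
    action: str    # e.g. "select"


# Hypothetical role and user assignments kept in one central place.
ROLE_PERMISSIONS = {
    "analyst": {
        Permission("hive", "dw.orders", "select"),
        Permission("hbase", "rt.orders", "read"),
    },
    "etl_developer": {
        Permission("hive", "dw.orders", "select"),
        Permission("hive", "dw.orders", "insert"),
    },
}
USER_ROLES = {
    "alice": {"analyst"},
    "bob": {"etl_developer"},
}


def is_allowed(user, engine, resource, action):
    """Return True if any of the user's roles grants the requested permission."""
    wanted = Permission(engine, resource, action)
    return any(
        wanted in ROLE_PERMISSIONS.get(role, set())
        for role in USER_ROLES.get(user, set())
    )
```

For example, is_allowed("alice", "hive", "dw.orders", "insert") is False, while the same request for "bob" is True.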
Real‑time Development
Includes metadata management, SQL‑driven programming, componentized development, and intelligent operations such as task management, code release, monitoring, and alerting.
Data System
Defines a layered architecture: ODS (raw source), DW (detail and summary layers), TDM (tag data), and ADS (application data). Emphasizes full‑domain coverage, clear hierarchy, data consistency, performance, cost reduction, and usability.
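A clear hierarchy can be checked mechanically. The sketch below assumes two things the source does not state: that table names carry their layer as a prefix (ods_, dw_, tdm_, ads_), and that a table should read only from its own layer or a lower one. Under those assumptions it lists every lineage edge that reads upward.

```python
# Layers from lowest to highest, as listed in the text.
LAYER_ORDER = ["ods", "dw", "tdm", "ads"]


def layer_of(table):
    """Return the layer index encoded in a table-name prefix (assumed convention)."""
    prefix = table.split("_", 1)[0].lower()
    if prefix not in LAYER_ORDER:
        raise ValueError(f"unknown layer prefix in table name: {table}")
    return LAYER_ORDER.index(prefix)


def hierarchy_violations(lineage):
    """Return (table, upstream) pairs where a table reads from a higher layer.

    lineage maps a table name to the list of tables it reads from.
    """
    violations = []
    for table, upstreams in lineage.items():
        for upstream in upstreams:
            if layer_of(upstream) > layer_of(table):
                violations.append((table, upstream))
    return violations


# Hypothetical lineage with one deliberate violation.
lineage = {
    "dw_order_detail": ["ods_orders"],
    "tdm_user_tags": ["dw_order_detail"],
    "ads_dashboard": ["dw_order_detail", "tdm_user_tags"],
    "dw_bad_shortcut": ["ads_dashboard"],
}
violations = hierarchy_violations(lineage)
```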
Data Asset Management
Manages catalogs, metadata, quality, lineage, and lifecycle, providing visibility and value assessment for enterprise data assets.
Data Governance
Covers standards, metadata, quality, security, and lifecycle management.
Data Service System
Turns data assets into services through query, analysis, recommendation, and audience‑selection APIs that support a range of business scenarios.
Offline Platform
Illustrates architecture, scheduling modules, and task dependency mechanisms (e.g., FTP event triggers, diagnostic platform).
Real‑time Platform
Describes implementations at Meituan‑Dianping and Bilibili, featuring data ingestion (Canal, Kafka), real‑time computation (Flink, BSQL), state management (RocksDB, Redis), and downstream storage (Redis, HBase, Elasticsearch, MySQL).
Event Management & Task Status
Explains Server‑Kernel‑Admin workflow, distributed locks, high‑availability strategies, and task debugging with SQL‑based input simulation.
Log Retrieval & Monitoring
Logs are collected via Filebeat → Kafka → Logstash → Elasticsearch, visualized in Kibana; metrics are monitored with InfluxDB and a proprietary time‑series DB, with alerts via various channels.
Real‑time vs. Offline Data Warehouse
Offline warehouses use batch tools (Sqoop, DataX, Hive) to deliver T+1 data, meaning data for day T becomes available on day T+1. Real‑time warehouses ingest via Canal and Kafka and write results to low‑latency serving stores (e.g., HBase) for sub‑minute queries.
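To make the T+1 convention concrete, the helper below returns the partition that a batch job processes on a given run date, which is the previous calendar day. The function name and the yyyymmdd partition format are assumptions chosen for illustration.

```python
from datetime import date, timedelta


def t_plus_1_partition(run_date):
    """Return the partition (yyyymmdd) that a T+1 batch run on run_date processes."""
    business_date = run_date - timedelta(days=1)  # day T, processed on day T+1
    return business_date.strftime("%Y%m%d")


# A job running on 1 March 2024 processes data for 29 February 2024.
partition = t_plus_1_partition(date(2024, 3, 1))
```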
Industry Solutions
Provides case studies for retail, securities, e‑commerce, manufacturing, media, and law enforcement, highlighting specific KPI definitions (RPS, ROI, CPC, CPA, CPM, CVR, CTR, PV).
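Several of these KPIs follow from a handful of raw counts. The sketch below uses the commonly cited advertising definitions: CTR as clicks per impression, CVR as conversions per click, CPC, CPA, and CPM as cost per click, per conversion, and per thousand impressions, and ROI as net return over cost. These formulas are an assumption here, since the source does not spell out its own definitions and organizations vary, particularly for ROI. RPS is omitted because its meaning is not given in the text.

```python
def safe_ratio(numerator, denominator):
    """Divide, returning None when the denominator is zero."""
    return numerator / denominator if denominator else None


def campaign_kpis(impressions, clicks, conversions, cost, revenue):
    """Compute common advertising KPIs from raw campaign totals."""
    net_return = revenue - cost
    return {
        "CTR": safe_ratio(clicks, impressions),        # click-through rate
        "CVR": safe_ratio(conversions, clicks),        # conversion rate
        "CPC": safe_ratio(cost, clicks),               # cost per click
        "CPA": safe_ratio(cost, conversions),          # cost per acquisition
        "CPM": safe_ratio(cost * 1000, impressions),   # cost per thousand impressions
        "ROI": safe_ratio(net_return, cost),           # (revenue - cost) / cost
    }


# Hypothetical campaign totals.
kpis = campaign_kpis(
    impressions=10_000, clicks=200, conversions=10, cost=100.0, revenue=250.0
)
```

With these sample totals the campaign has a CTR of 2%, a CVR of 5%, and an ROI of 1.5.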
Architects' Tech Alliance
Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
