
Comprehensive Overview of Data Middle Platform Architecture and Practices

This article provides a detailed introduction to data middle platform concepts, covering data aggregation, ingestion tools, offline and real‑time development, data governance, service layers, monitoring, and deployment patterns, illustrating how enterprises build unified data ecosystems across various industries.

Architecture Digest

May 7, 2021

Data Middle Platform

Data aggregation is a core capability of a data middle platform: it collects data from heterogeneous networks and data sources into a centralized store for downstream processing. Aggregation methods include database synchronization, event tracking, web crawling, and message queues, in both batch and real‑time modes.

Data Ingestion Tools

Canal, DataX, Sqoop

Data Development

The data development module serves developers and analysts, offering offline, real‑time, and algorithmic development tools.

Offline Development

Job Scheduling

• Dependency scheduling: a job starts only after all parent jobs finish.
• Time scheduling: a job can be set to start at a specific time.
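The two scheduling rules can be combined into a single readiness check. The following is a minimal sketch, not the platform's implementation; the job names, the `parents` and `start_at` fields, and the `is_runnable` helper are all hypothetical.

```python
from datetime import datetime

def is_runnable(job, finished, now):
    """Return True when every parent job has finished and the start time (if any) has passed."""
    parents_done = all(parent in finished for parent in job["parents"])
    start_at = job.get("start_at")
    time_reached = start_at is None or now >= start_at
    return parents_done and time_reached

# Hypothetical job definitions: one time-scheduled job and one dependent job.
jobs = {
    "ods_orders": {"parents": [], "start_at": datetime(2021, 5, 7, 1, 0)},
    "dwd_orders": {"parents": ["ods_orders"]},
}

now = datetime(2021, 5, 7, 1, 30)
print(is_runnable(jobs["ods_orders"], finished=set(), now=now))
print(is_runnable(jobs["dwd_orders"], finished=set(), now=now))
print(is_runnable(jobs["dwd_orders"], finished={"ods_orders"}, now=now))
```

A scheduler loop would call such a check for every pending job each time a job finishes or the clock advances.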

Baseline Control

For long‑running big‑data jobs, the platform uses predictive algorithms to estimate completion time and alerts operations staff when jobs lag, allowing proactive intervention.
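As an illustration of the idea, the sketch below predicts a finish time from the mean of past run durations and compares it with a baseline deadline. The article does not say which predictive algorithm is used, so the averaging approach and the function names here are assumptions.

```python
from datetime import datetime, timedelta
from statistics import mean

def predicted_finish(started_at, history_minutes):
    """Estimate the finish time as start time plus the mean historical duration."""
    return started_at + timedelta(minutes=mean(history_minutes))

def breaches_baseline(started_at, history_minutes, baseline):
    """Return True when the predicted finish time falls after the baseline deadline."""
    return predicted_finish(started_at, history_minutes) > baseline

started = datetime(2021, 5, 7, 2, 0)
history = [50, 60, 70]  # durations of previous runs in minutes (made-up values)

print(predicted_finish(started, history))
print(breaches_baseline(started, history, baseline=datetime(2021, 5, 7, 2, 45)))
```

A production system would likely weight recent runs, account for input volume, and re-estimate as upstream jobs complete.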

Heterogeneous Storage

Each storage or compute engine (Oracle, Hive, Spark, MapReduce) has a dedicated plugin; users create jobs of various types, and the system automatically selects the matching plugin.
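One common way to implement automatic plugin selection is a registry keyed by job type. The sketch below assumes that design; the class names, the `register` decorator, and the `submit` method are hypothetical and do not describe any specific platform's API.

```python
PLUGINS = {}

def register(job_type):
    """Class decorator that records a plugin under the given job type."""
    def decorator(cls):
        PLUGINS[job_type] = cls
        return cls
    return decorator

@register("hive")
class HivePlugin:
    def submit(self, script):
        return f"[hive] {script}"

@register("spark")
class SparkPlugin:
    def submit(self, script):
        return f"[spark] {script}"

def select_plugin(job_type):
    """Look up and instantiate the plugin for a job type, failing loudly if none exists."""
    try:
        return PLUGINS[job_type]()
    except KeyError:
        raise ValueError(f"no plugin registered for job type {job_type!r}") from None

print(select_plugin("hive").submit("SELECT 1"))
```

Adding support for a new engine then only requires registering one more class, without touching the job creation code.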

Code Validation

SQL tasks undergo strict pre‑execution checks to catch errors early.
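The article does not describe how the checks are performed. As a rough illustration of the principle, the sketch below uses Python's built‑in `sqlite3` module as a stand‑in engine: prefixing a statement with `EXPLAIN` makes SQLite compile it against the schema without running the query, so syntax errors and unknown tables or columns surface early. The `validate_sql` helper and the sample schema are hypothetical, and a Hive or Spark platform would use its own parser instead.

```python
import sqlite3

def validate_sql(sql, schema_ddl):
    """Return None if the statement compiles against the schema, else the error message."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)   # create empty tables that mirror the target schema
        conn.execute("EXPLAIN " + sql)   # compile only; the query itself is not executed
        return None
    except sqlite3.Error as exc:
        return str(exc)
    finally:
        conn.close()

ddl = "CREATE TABLE orders (id INTEGER, amount REAL);"
print(validate_sql("SELECT id, amount FROM orders", ddl))
print(validate_sql("SELEC id FROM orders", ddl))
print(validate_sql("SELECT missing_col FROM orders", ddl))
```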

Multi‑Environment Cascading

Supports single, classic, and complex environments with isolated Hive databases, Yarn queues, and even separate Hadoop clusters, enabling fine‑grained resource and permission control.

Recommended Dependencies

Uses table‑level lineage graphs to find upstream jobs, performs cycle detection, and returns suitable dependency lists.
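A minimal sketch of this recommendation step is shown below, assuming lineage is available as a mapping from table to producing job and existing dependencies as a mapping from job to upstream jobs. All names are hypothetical. A candidate is dropped when it already depends, directly or transitively, on the job being configured, because adding it would create a cycle.

```python
def reaches(start, target, depends_on):
    """Depth-first search: does `start` depend on `target`, directly or transitively?"""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(depends_on.get(node, ()))
    return False

def recommend_dependencies(job, reads, produced_by, depends_on):
    """Suggest upstream jobs for `job` based on the tables it reads, skipping cycle-forming ones."""
    candidates = {produced_by[table] for table in reads if table in produced_by}
    candidates.discard(job)
    return sorted(c for c in candidates if not reaches(c, job, depends_on))

produced_by = {"ods.orders": "load_orders", "dws.summary": "build_summary"}
depends_on = {"build_summary": {"build_detail"}}

# build_detail reads a table whose producer already depends on build_detail itself.
print(recommend_dependencies("build_detail", ["ods.orders", "dws.summary"], produced_by, depends_on))
```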

Data Permissions

Addresses challenges of diverse permission systems (e.g., Oracle, HANA, Sentry, Ranger) and provides a unified UI for request, approval, and audit of data access.

Real‑Time Development

Features metadata management, SQL‑driven workflows, and componentized development.

Intelligent Operations

Integrates task management, code release, monitoring, and alerting to streamline operations such as job reruns and data backfills.

Data System

Combines data aggregation and development modules to form a traditional data warehouse capability, establishing a comprehensive enterprise data ecosystem.

ODS Layer (Raw Data)

Collects source system data with minimal transformation, preserving original fields and adding timestamps; supports full and incremental loads, and stores both structured and semi‑structured data.

DataX Synchronization Steps

1) Identify source and target tables. 2) Map fields, optionally adding date, partition, or source identifiers. 3) Configure incremental or conditional sync. 4) Clean target data. 5) Launch sync task. 6) Verify correctness. 7) Publish task to production with rate limits, fault tolerance, and alerting.
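Several of these steps map onto a DataX job description. The sketch below builds one as a Python dictionary and prints it as JSON, assuming a MySQL source and an HDFS target. The key names follow the job layout of DataX's `mysqlreader` and `hdfswriter` plugins as commonly documented and should be checked against the deployed version; the host names, credentials, table, and the `updated_at` column are placeholders. Cleaning the target (step 4) is assumed to happen outside this job, for example by dropping the target partition first.

```python
import json

def build_datax_job(table, columns, bizdate, channels=2, max_bytes_per_sec=1048576):
    """Build a DataX job description; `columns` is a list of (name, type) pairs."""
    reader = {
        "name": "mysqlreader",
        "parameter": {
            "username": "reader_user",                 # placeholder credentials
            "password": "******",
            "column": [name for name, _ in columns],   # step 2: field mapping
            "where": f"updated_at >= '{bizdate}'",     # step 3: incremental condition
            "connection": [{
                "table": [table],                      # step 1: source table
                "jdbcUrl": ["jdbc:mysql://source-host:3306/shop"],
            }],
        },
    }
    writer = {
        "name": "hdfswriter",
        "parameter": {
            "defaultFS": "hdfs://namenode:8020",
            "fileType": "orc",
            "path": f"/warehouse/ods/{table}/dt={bizdate}",  # step 2: date partition
            "fileName": table,
            "column": [{"name": n, "type": t} for n, t in columns],
            "writeMode": "append",
            "fieldDelimiter": "\t",
        },
    }
    setting = {
        "speed": {"channel": channels, "byte": max_bytes_per_sec},  # step 7: rate limit
        "errorLimit": {"record": 0, "percentage": 0.02},            # step 7: fault tolerance
    }
    return {"job": {"setting": setting, "content": [{"reader": reader, "writer": writer}]}}

job = build_datax_job("orders", [("id", "bigint"), ("amount", "double")], "2021-05-07")
print(json.dumps(job, indent=2))
```

Generating the configuration from code keeps the field mapping, partition path, and incremental condition consistent across many tables.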

Unified Data Warehouse (DW)

Includes detailed (DWD) and aggregated (DWS) layers, providing a unified, business‑oriented view of data across domains.

Tag Data Layer (TDM)

Object‑oriented modeling creates cross‑domain tag sets for deep analysis.

Application Data Layer (ADS)

Extracts and transforms data from DW/TDM to meet specific business and performance needs.

Data Asset Management

Manages catalogs, metadata, quality, lineage, and lifecycle, offering a visual representation of enterprise data assets.

Data Governance

Covers standards, metadata, quality, security, and lifecycle management.

Data Service System

Transforms data assets into services via APIs for querying, analysis, recommendation, and audience segmentation.

Query Service

Provides configurable query identifiers, filters, sorting, and pagination, exposing results via API.
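To make the filter, sort, and pagination options concrete, here is a small in‑memory sketch. The `run_query` function, its parameters, and the sample rows are hypothetical; a real query service would translate the same options into a query against the underlying store.

```python
def run_query(rows, filters=None, sort_by=None, descending=False, page=1, page_size=10):
    """Apply equality filters, optional sorting, and 1-based pagination to a list of dicts."""
    filters = filters or {}
    matched = [r for r in rows if all(r.get(k) == v for k, v in filters.items())]
    if sort_by is not None:
        matched.sort(key=lambda r: r[sort_by], reverse=descending)
    start = (page - 1) * page_size
    return {"total": len(matched), "items": matched[start:start + page_size]}

rows = [
    {"id": 1, "city": "SH", "amount": 30},
    {"id": 2, "city": "BJ", "amount": 10},
    {"id": 3, "city": "SH", "amount": 20},
]

print(run_query(rows, filters={"city": "SH"}, sort_by="amount", page=1, page_size=1))
```

Returning the total alongside the current page lets API clients render pagination controls without a second request.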

Analysis Service

Supports multi‑source connectivity (Hive, ES, Greenplum, MySQL, Oracle, files), high‑performance ad‑hoc queries, multi‑dimensional analysis, and flexible business integration.

Recommendation Service

Generates recommendation APIs from historical logs and real‑time data, supporting industry‑specific logic, cold‑start handling, and continuous model optimization.

Audience Segmentation Service

Filters users based on tag combinations, provides count metrics, and integrates with downstream channels (SMS, WeChat, marketing platforms).
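Tag combinations can be expressed with set operations. The sketch below assumes each user carries a set of tags and supports "all of", "any of", and "none of" conditions; the function name, parameters, and sample tags are illustrative only.

```python
def segment(user_tags, all_of=(), any_of=(), none_of=()):
    """Return the sorted user IDs whose tags satisfy the combination of conditions."""
    required, optional, excluded = set(all_of), set(any_of), set(none_of)
    selected = []
    for user, tags in user_tags.items():
        if not required <= tags:           # must carry every required tag
            continue
        if optional and not (optional & tags):  # must carry at least one optional tag
            continue
        if excluded & tags:                # must carry no excluded tag
            continue
        selected.append(user)
    return sorted(selected)

user_tags = {
    "u1": {"vip", "female"},
    "u2": {"vip", "churn_risk"},
    "u3": {"female"},
}

audience = segment(user_tags, all_of=["vip"], none_of=["churn_risk"])
print(audience, len(audience))
```

The length of the result is the count metric, and the list itself is what would be handed to a downstream channel.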

Offline Platform

Describes SuNing's offline scheduling architecture, task dependency handling via FTP markers, and high‑availability components (Server, Kernel, Admin).

Real‑Time Platform

Highlights Meituan‑Dianping and Bilibili implementations, including SQL‑based programming, DAG drag‑and‑drop, unified metadata, and Flink‑based computation with state management in RocksDB/Redis.

Scenarios

AI engineering (ads, search, recommendation), real‑time quality monitoring, user growth analysis, and real‑time ETL dashboards.

Event Management

Ensures single‑operator task execution using Server (request), Kernel (execution), and Admin (verification) modules with distributed locks and Zookeeper coordination.

Task State Management

Server initiates tasks, Admin monitors YARN status, and both modules provide high availability via horizontal scaling and hot‑standby.

Task Debugging

SQL tasks can be debugged with custom CSV inputs; Sloth‑server assembles requests, invokes kernels, and aggregates logs.

Log Retrieval

Filebeat ships node logs to Kafka, Logstash parses them, and Elasticsearch stores them for Kibana UI and in‑app log search.

Monitoring

Uses InfluxDB/ntsdb metrics visualized in Grafana and alerting modules for threshold‑based notifications.

Alerting

Supports failure, latency, and custom rule alerts with multiple notification channels (chat, email, SMS).
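Failure, latency, and custom rules can share one evaluation path if each rule is a named predicate with a list of channels. The sketch below assumes that representation; the rule fields, task fields, and thresholds are made up for illustration.

```python
def evaluate(task, rules):
    """Return (rule name, channels) for every rule whose condition matches the task."""
    return [(rule["name"], rule["channels"]) for rule in rules if rule["when"](task)]

rules = [
    {"name": "failure", "when": lambda t: t["status"] == "FAILED", "channels": ["sms", "email"]},
    {"name": "latency", "when": lambda t: t["delay_seconds"] > 300, "channels": ["chat"]},
    # Custom rule: any predicate over the task's metrics can be plugged in.
    {"name": "low_throughput", "when": lambda t: t["records_per_sec"] < 100, "channels": ["email"]},
]

task = {"status": "RUNNING", "delay_seconds": 600, "records_per_sec": 5000}
print(evaluate(task, rules))
```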

Real‑Time Data Warehouse

Collects logs and events into Kafka, processes them via Flink/Saber, stores results in Redis, Kudu, etc., and serves AI, BI, and reporting use cases.

Offline vs Real‑Time Data Warehouse

Offline warehouses rely on batch tools (Sqoop, DataX, Hive) to deliver T+1 data; real‑time warehouses ingest changes via Canal into Kafka and store results in serving systems such as HBase for sub‑second queries.

Data Middle Platform Solutions

Provides industry‑specific blueprints (retail, finance, media) with metrics such as RPS and ROI, emphasizing the need for a powerful OLAP database to achieve low‑latency analytics.
