
Mastering Data Middle Platforms: From Ingestion to Real‑Time Analytics

This comprehensive guide explains the concepts, architecture, and best practices of data middle platforms, covering data aggregation, ingestion tools, offline and real‑time development, data governance, service layers, and implementation details for building scalable big‑data solutions.

ITFLY8 Architecture Home

Jul 7, 2021

Data Middle Platform

Data aggregation is a core capability of the data middle platform: it collects data from heterogeneous networks and data sources into centralized storage for downstream processing. Aggregation methods include database synchronization, tracking points (event instrumentation), web crawlers, and message queues, and each can run in offline batch or real‑time mode.

Data Collection Tools

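Common tools include Canal (incremental sync based on parsing MySQL binlogs), DataX (offline batch synchronization between heterogeneous data sources), and Sqoop (bulk transfer between relational databases and Hadoop).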

Data Development

The data development module provides offline, real‑time, and algorithm development tools for developers and analysts.

Offline Development

Job Scheduling

• Dependency scheduling: a job starts only after all parent jobs finish.
• Time scheduling: a job can be set to start at a specific time.
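The two scheduling modes can be combined in a single readiness check. The following is a minimal sketch, not the platform's actual scheduler: the job names, the `PARENTS` graph, and the `NOT_BEFORE` start times are all hypothetical.

```python
from datetime import datetime
from graphlib import TopologicalSorter

# Hypothetical job graph: each job maps to the set of parent jobs it waits on.
PARENTS = {
    "ods_orders": set(),
    "ods_users": set(),
    "dw_order_detail": {"ods_orders", "ods_users"},
    "ads_daily_sales": {"dw_order_detail"},
}

# Hypothetical earliest start times; jobs not listed have no time constraint.
NOT_BEFORE = {"ads_daily_sales": datetime(2021, 7, 7, 6, 0)}


def runnable_jobs(finished, now):
    """Return unfinished jobs whose parents are all done and whose start time has passed."""
    ready = []
    for job, parents in PARENTS.items():
        if job in finished:
            continue
        if not parents <= finished:
            continue  # dependency scheduling: wait for every parent
        if now < NOT_BEFORE.get(job, datetime.min):
            continue  # time scheduling: too early to start
        ready.append(job)
    return sorted(ready)


def execution_order():
    """One valid run order for the whole graph; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(PARENTS).static_order())
```

A scheduler loop would call `runnable_jobs` each time a job finishes or the clock advances, and submit whatever it returns.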

Baseline Control

For long‑running big‑data jobs, the platform uses predictive algorithms to estimate completion time; if a job is not expected to finish normally, alerts notify operations staff so they can intervene early.
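The source does not say which predictive algorithm is used. As an illustration only, the sketch below assumes a naive linear extrapolation from the job's progress, falling back to the mean of past run durations; the function names and parameters are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean


def predict_finish(started_at, now, progress, history_minutes):
    """Estimate when a running job will finish.

    progress is the completed fraction in (0, 1]. If no progress is known yet (0),
    fall back to the mean duration of past runs, given in minutes.
    """
    if progress > 0:
        elapsed = now - started_at
        return started_at + elapsed / progress  # linear extrapolation (assumption)
    return started_at + timedelta(minutes=mean(history_minutes))


def breaches_baseline(predicted_finish, baseline):
    """True when the predicted finish is later than the agreed baseline time."""
    return predicted_finish > baseline
```

When `breaches_baseline` returns true, the platform would raise an alert well before the deadline passes, which is the point of baseline control.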

Heterogeneous Storage

Different storage and compute engines (Oracle, Hadoop, Hive, Spark, MapReduce) each have a dedicated plugin; the platform automatically selects the appropriate plugin based on job type.
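Plugin selection by job type is commonly implemented as a registry lookup. The sketch below assumes that design; the `plugin` decorator, the handlers, and their return strings are hypothetical placeholders and do not submit anything to a real engine.

```python
from typing import Callable, Dict

# Hypothetical registry: job type -> handler that knows how to run that job.
PLUGINS: Dict[str, Callable[[str], str]] = {}


def plugin(job_type):
    """Decorator that registers a handler for one job type."""
    def register(func):
        PLUGINS[job_type] = func
        return func
    return register


@plugin("hive")
def run_hive(script):
    return f"[hive] {script}"  # placeholder for a real submission


@plugin("spark")
def run_spark(script):
    return f"[spark] {script}"  # placeholder for a real submission


def dispatch(job_type, script):
    """Pick the plugin that matches the job type and hand it the script."""
    try:
        handler = PLUGINS[job_type.lower()]
    except KeyError:
        raise ValueError(f"no plugin registered for job type {job_type!r}") from None
    return handler(script)
```

Adding support for another engine then means registering one more handler, with no change to `dispatch`.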

Code Validation

SQL checkers enforce strict controls on common SQL tasks to catch issues early.
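The source does not list the rules its checker enforces. The sketch below assumes three illustrative rules and uses regular expressions for brevity; a production checker would more likely parse the SQL than pattern-match it.

```python
import re

# Illustrative rules only (assumptions): (rule name, pattern, message).
RULES = [
    ("select-star",
     re.compile(r"\bselect\s+\*", re.I),
     "avoid SELECT *; list columns explicitly"),
    ("delete-without-where",
     re.compile(r"^\s*delete\s+from\b(?!.*\bwhere\b)", re.I | re.S),
     "DELETE without a WHERE clause"),
    ("drop-table",
     re.compile(r"^\s*drop\s+table\b", re.I),
     "DROP TABLE is not allowed in scheduled jobs"),
]


def check_sql(statement):
    """Return (rule name, message) for every rule the statement violates."""
    return [(name, message)
            for name, pattern, message in RULES
            if pattern.search(statement)]
```

Running the check at submit time, before a job reaches the scheduler, is what lets issues be caught early.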

Multi‑Environment Cascading

Supports single, classic, and complex environments with isolated Hive databases, Yarn queues, and even separate Hadoop clusters, enabling fine‑grained resource and permission control.

Recommended Dependencies

Uses table‑level lineage graphs to identify upstream jobs, performs cycle detection, and returns suitable dependency lists.
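One way to sketch this: map each input table to the job that produces it, add the new job to the job graph, and reject the recommendation if that creates a cycle. The table names, job names, and both metadata dictionaries below are hypothetical.

```python
# Hypothetical lineage metadata: table -> producing job, and job -> input tables.
TABLE_PRODUCER = {
    "ods.orders": "job_sync_orders",
    "ods.users": "job_sync_users",
    "dw.order_detail": "job_build_detail",
}
JOB_INPUTS = {
    "job_sync_orders": [],
    "job_sync_users": [],
    "job_build_detail": ["ods.orders", "ods.users"],
}


def job_graph(job_inputs, table_producer):
    """Map each job to the set of jobs that produce its input tables."""
    return {job: {table_producer[t] for t in tables if t in table_producer}
            for job, tables in job_inputs.items()}


def has_cycle(graph):
    """Depth-first search; state 1 = on the current path, 2 = fully explored."""
    state = {}

    def visit(node):
        if state.get(node) == 1:
            return True  # reached a node already on the current path
        if state.get(node) == 2:
            return False
        state[node] = 1
        if any(visit(parent) for parent in graph.get(node, ())):
            return True
        state[node] = 2
        return False

    return any(visit(node) for node in graph)


def recommend(new_job, inputs, outputs, job_inputs=JOB_INPUTS, table_producer=TABLE_PRODUCER):
    """Recommend upstream jobs for new_job, refusing if it would create a cycle."""
    producers = {**table_producer, **{t: new_job for t in outputs}}
    readers = {**job_inputs, new_job: list(inputs)}
    graph = job_graph(readers, producers)
    if has_cycle(graph):
        raise ValueError(f"adding {new_job!r} would create a dependency cycle")
    return sorted(graph[new_job])
```

For example, a new job that reads `dw.order_detail` gets `job_build_detail` recommended as its parent, while a job that reads `dw.order_detail` and writes `ods.orders` is rejected.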

Data Permissions

The main challenge is that each engine has its own permission system, so a unified model is needed: RBAC (e.g., Sentry) or PBAC (e.g., Ranger). A centralized permission center then provides UI‑driven request, approval, and audit workflows.
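The request, approval, and audit workflow can be sketched as a small in-memory model. This is an assumption-laden illustration: the `PermissionCenter` class and its methods are hypothetical, and it does not call Sentry or Ranger. A real permission center would translate approved grants into engine-specific policies.

```python
from dataclasses import dataclass, field


@dataclass
class PermissionCenter:
    """Hypothetical in-memory permission center."""
    grants: set = field(default_factory=set)      # (user, resource, action)
    pending: dict = field(default_factory=dict)   # request id -> (user, resource, action)
    audit_log: list = field(default_factory=list)
    next_id: int = 1

    def request(self, user, resource, action):
        """File an access request and return its id."""
        request_id = self.next_id
        self.next_id += 1
        self.pending[request_id] = (user, resource, action)
        self.audit_log.append(("request", user, request_id))
        return request_id

    def approve(self, request_id, approver):
        """Turn a pending request into a grant and record who approved it."""
        grant = self.pending.pop(request_id)
        self.grants.add(grant)
        self.audit_log.append(("approve", approver, request_id))

    def is_allowed(self, user, resource, action):
        return (user, resource, action) in self.grants
```

Every state change appends to `audit_log`, so the audit trail is a by-product of the workflow rather than a separate step.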

Real‑Time Development

Key components: metadata management, SQL‑driven development, componentized programming.

Intelligent Operations

Integrated tools for job management, code release, monitoring, and alerting improve efficiency; features include rerun, downstream rerun, and data backfill.

Data Architecture

The platform supports a layered data model: ODS (raw source layer), DW (detail and summary layers), TDM (tag data layer), and ADS (application data layer). Each layer serves specific purposes such as raw data preservation, dimensional modeling, and business‑specific data extraction.

Data Asset Management

Manages catalogs, metadata, quality, lineage, and lifecycle, providing a visual overview of enterprise data assets.

Data Governance

Includes standards, metadata, quality, security, and lifecycle management.

Data Service System

Transforms data assets into services via query APIs, analysis APIs, recommendation APIs, and audience segmentation APIs, enabling data‑driven business applications.

Offline Platform

Covers the offline platform's product functions, scheduling modules, and overall architecture, including how task dependencies are handled via FTP events and distributed locking.

Real‑Time Platform

Describes implementations at Meituan‑Dianping, Bilibili, and NetEase, covering real‑time transmission (logs, binlog) and computation (Flink, BSQL), state management, and integration with downstream stores like Kafka, HBase, Redis, and OLAP databases.

Event Management

The Server initiates events, the Kernel executes the logic via shell scripts, and the Admin confirms the results. Distributed locks ensure that only one operator acts on an event at a time, and high availability is achieved through horizontal scaling and hot standby.
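The source does not name the lock service it uses. To illustrate the single-operator guarantee, the sketch below uses an in-memory stand-in with lease expiry; the `LockTable` class is hypothetical, and a real deployment would keep this state in a shared store so that every Kernel instance sees the same locks.

```python
class LockTable:
    """In-memory stand-in for a shared lock store (hypothetical)."""

    def __init__(self):
        self._locks = {}  # event key -> (owner, expires_at)

    def acquire(self, key, owner, now, ttl=30.0):
        """Take the lock if it is free, expired, or already held by this owner."""
        holder = self._locks.get(key)
        if holder is not None and holder[0] != owner and holder[1] > now:
            return False  # another operator holds a live lease
        self._locks[key] = (owner, now + ttl)
        return True

    def release(self, key, owner):
        """Release the lock, but only if this owner still holds it."""
        holder = self._locks.get(key)
        if holder is not None and holder[0] == owner:
            del self._locks[key]
```

The lease expiry (`ttl`) lets a hot-standby instance take over an event once the lease lapses if the original holder crashes without releasing its lock.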

Offline vs Real‑Time Data Warehouses

Offline warehouses use batch tools (Sqoop, DataX, Hive) for T+1 data; real‑time warehouses ingest via Canal to Kafka and store in OLAP systems for sub‑minute queries.

Solution Overview

Provides industry‑specific examples (retail, securities, manufacturing) and key metrics such as RPS, ROI, CPC, CPA, CPM, CVR, CTR, PV, ADPV, and ADIMP.


