
Unlocking Data Middle Platform: From Ingestion to Real‑Time Analytics

This article provides a comprehensive overview of data middle platform concepts, covering data aggregation, ingestion tools, offline and real‑time development, scheduling, baseline control, heterogeneous storage, recommendation dependencies, data permissions, layered data architecture (ODS, DW, DWD, DWS, TDM, ADS), asset management, governance, service APIs, query and analysis services, as well as monitoring, alerting, and operational best practices for building robust big‑data solutions.

ITFLY8 Architecture Home

Feb 4, 2021

Data Middle Platform

The data middle platform must provide core tools for data aggregation: collecting heterogeneous data from networks and source systems into a centralized store for downstream processing.

Data Aggregation

Collection methods include database synchronization, event tracking, web crawling, and message queues. Collection runs in either offline batch mode or real‑time mode.

Data Ingestion Tools

Canal

DataX

Sqoop

Data Development

Provides offline, real‑time, and algorithm development tools for developers and analysts.

Offline Development

Job Scheduling

• Dependency scheduling: a job starts only after all parent jobs finish.

• Time scheduling: a job can be set to start at a specific time.
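The two scheduling modes above can be combined in a few lines. The following is a minimal sketch, not the platform's actual scheduler: the job names and the `is_runnable` helper are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from datetime import datetime
from graphlib import TopologicalSorter

# Hypothetical jobs: each maps to the set of parent jobs it waits for.
PARENTS = {
    "ods_orders": set(),
    "dwd_orders": {"ods_orders"},
    "dws_sales": {"dwd_orders"},
    "ads_report": {"dws_sales"},
}

def execution_order(parents):
    """Order jobs so that every job comes after all of its parents."""
    return list(TopologicalSorter(parents).static_order())

def is_runnable(job, parents, finished, start_times, now):
    """A job may start once all parents have finished and its start time has passed."""
    deps_done = parents.get(job, set()) <= finished
    # Jobs without a configured start time are only gated by their parents.
    time_ok = now >= start_times.get(job, datetime.min)
    return deps_done and time_ok

order = execution_order(PARENTS)
```

Here `execution_order` covers dependency scheduling and the `start_times` lookup covers time scheduling; a real scheduler would also track failures and retries.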

Baseline Control

Predicts job completion time using algorithms; alerts operations staff when jobs cannot finish on time.
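The article does not say which prediction algorithm is used. As an illustration only, the sketch below assumes the simplest possible one: predicted finish time is the start time plus the mean of past run durations. The function names are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean

def predict_finish(start, past_durations_min):
    """Predict completion as start time plus the mean historical duration (assumed model)."""
    return start + timedelta(minutes=mean(past_durations_min))

def breaches_baseline(start, past_durations_min, deadline):
    """True when the predicted finish is later than the baseline deadline."""
    return predict_finish(start, past_durations_min) > deadline

start = datetime(2021, 2, 4, 1, 0)
history = [50, 60, 70]  # minutes taken by previous runs (made-up values)
needs_alert = breaches_baseline(start, history, deadline=datetime(2021, 2, 4, 1, 30))
```

When `needs_alert` is true, operations staff would be notified before the deadline actually passes.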

Heterogeneous Storage

The platform provides a plugin for each engine (e.g., Oracle, Hive, Spark, MapReduce) and runs each job on the matching engine automatically, based on its job type.

Code Validation

SQL checkers enforce strict pre‑execution validation for common SQL tasks.

Multi‑Environment Cascading

Supports single, classic, and complex environment modes, isolated from one another by separate Hive databases, YARN queues, or even separate Hadoop clusters.

Recommended Dependencies

Uses table‑level lineage graphs to find upstream jobs, performs cycle detection, and returns suitable dependency lists.
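One way to read this description: look up which job produces each table the new job reads, then drop any candidate that already depends on the new job, since adding it would close a cycle. The sketch below makes that assumption; the data structures and names (`producers`, `parents`, `recommend_dependencies`) are hypothetical.

```python
def depends_on(job, target, parents):
    """True if `job` depends on `target`, directly or transitively."""
    stack, seen = [job], set()
    while stack:
        current = stack.pop()
        if current == target:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, ()))
    return False

def recommend_dependencies(job, reads, producers, parents):
    """Suggest upstream jobs for `job` from table-level lineage.

    reads:     tables the job reads
    producers: table -> job that writes it
    parents:   job -> set of existing parent jobs
    """
    candidates = {producers[t] for t in reads if t in producers} - {job}
    # Skip candidates that are already downstream of `job` (would create a cycle).
    return sorted(c for c in candidates if not depends_on(c, job, parents))

producers = {"dwd.orders": "job_dwd", "dws.sales": "job_dws"}
parents = {"job_dwd": set(), "job_dws": {"job_new"}}
suggested = recommend_dependencies("job_new", ["dwd.orders", "dws.sales"], producers, parents)
```

In this example `job_dws` is rejected because it already runs after `job_new`.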

Data Permissions

Multiple engines each come with their own permission system, which makes access hard to manage consistently. The platform supports role‑based access control (RBAC, e.g., Sentry) and policy‑based access control (PBAC, e.g., Ranger), and provides a unified permission portal for request, approval, and audit.
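To make the unified portal idea concrete, here is a toy role‑based model that sits in front of several engines and records every check for audit. It is a sketch only: the class and its methods are hypothetical and do not reflect the Sentry or Ranger APIs.

```python
class PermissionPortal:
    """Toy unified portal: one role model in front of several engines."""

    def __init__(self):
        self.role_grants = {}  # role -> {(engine, resource, action)}
        self.user_roles = {}   # user -> {role}
        self.audit_log = []    # every check is recorded for audit

    def grant(self, role, engine, resource, action):
        self.role_grants.setdefault(role, set()).add((engine, resource, action))

    def assign(self, user, role):
        self.user_roles.setdefault(user, set()).add(role)

    def check(self, user, engine, resource, action):
        wanted = (engine, resource, action)
        allowed = any(
            wanted in self.role_grants.get(role, set())
            for role in self.user_roles.get(user, set())
        )
        self.audit_log.append((user, *wanted, allowed))
        return allowed

portal = PermissionPortal()
portal.grant("analyst", "hive", "dws.sales", "select")
portal.assign("alice", "analyst")
can_read = portal.check("alice", "hive", "dws.sales", "select")
can_drop = portal.check("alice", "hive", "dws.sales", "drop")
```

A real portal would translate approved requests into engine‑specific grants rather than answer checks itself.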

Real‑Time Development

Key components: metadata management, SQL‑driven processing, and componentized development.

Intelligent Operations

Integrates task management, code deployment, monitoring, and alerting to improve efficiency, including job rerun and data backfill.

Data Architecture

With data aggregation and development in place, the platform has the capabilities of a traditional data warehouse and can be used to build an enterprise‑wide data system.

Core characteristics of the data system:

Full‑domain coverage

Clear hierarchical structure

Accurate and consistent data

Performance optimization

Cost reduction through data sharing

Ease of use for downstream applications

Layered Data Model

ODS (Raw Source Layer)

Collects and stores raw business data with minimal transformation; retains original fields and adds timestamps.

DataX synchronization steps:

Identify source and target tables.

Configure field mapping; add date/partition info.

Set incremental or conditional sync conditions.

Clean target tables.

Start sync task.

Validate correctness.

Publish to production scheduling with rate limits, fault tolerance, and alerts.
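Several of the steps above map onto fields of a DataX job file. The sketch below builds such a job description as a Python dictionary for a MySQL to HDFS sync. The table, columns, hosts, and paths are made up, and the parameter names follow the `mysqlreader` and `hdfswriter` plugins as commonly documented; verify them against the plugin documentation for your DataX version.

```python
import json

def build_datax_job(biz_date, channels=2):
    """Build a DataX-style job description for one day's increment (illustrative values)."""
    return {
        "job": {
            "setting": {
                "speed": {"channel": channels},  # rate limiting via concurrency
                "errorLimit": {"record": 0},     # fault tolerance: allow no dirty records
            },
            "content": [{
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "reader_user",
                        "password": "******",
                        "column": ["id", "amount", "updated_at"],      # field mapping
                        "where": f"DATE(updated_at) = '{biz_date}'",   # incremental condition
                        "connection": [{
                            "table": ["orders"],
                            "jdbcUrl": ["jdbc:mysql://db.example.internal:3306/shop"],
                        }],
                    },
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "defaultFS": "hdfs://namenode.example.internal:8020",
                        "fileType": "orc",
                        "path": f"/warehouse/ods.db/ods_orders/dt={biz_date}",  # date partition
                        "fileName": "ods_orders",
                        "writeMode": "append",
                        "fieldDelimiter": "\t",
                        "column": [
                            {"name": "id", "type": "bigint"},
                            {"name": "amount", "type": "double"},
                            {"name": "updated_at", "type": "string"},
                        ],
                    },
                },
            }],
        }
    }

job = build_datax_job("2021-02-03")
job_json = json.dumps(job, indent=2)
```

Cleaning the target partition, validating row counts, and wiring up alerts are separate steps that this sketch does not cover.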

DW (Unified Warehouse Layer)

Includes detailed (DWD) and summary (DWS) layers, reorganizing business data into consistent metrics and dimensions.

TDM (Tag Data Layer)

Object‑oriented modeling creates a unified tag system across business domains for deep analysis.

ADS (Application Data Layer)

Extracts and processes data from DW/TDM to meet specific business and performance needs.

Data Asset Management

Manages catalogs, metadata, quality, lineage, and lifecycle, providing visualized asset views to enhance data awareness.

Data Governance

Covers standards, metadata, quality, security, and lifecycle management.

Data Service System

Transforms data assets into services via APIs, enabling query, analysis, recommendation, and audience‑segmentation capabilities.

Query Service

Accepts query parameters, returns results, supports indexing, filtering, sorting, and pagination.
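The behavior described above can be sketched over an in‑memory list of rows. This is an illustration under simple assumptions (equality filters only, one sort key); the `query` function and its parameters are hypothetical, and a production service would push these operations down to an indexed store.

```python
def query(rows, filters=None, sort_by=None, descending=False, page=1, page_size=10):
    """Filter rows by equality, sort them, and return one page plus the total count."""
    filters = filters or {}
    matched = [r for r in rows if all(r.get(k) == v for k, v in filters.items())]
    if sort_by is not None:
        matched.sort(key=lambda r: r[sort_by], reverse=descending)
    start = (page - 1) * page_size
    return {"total": len(matched), "items": matched[start:start + page_size]}

rows = [
    {"id": 1, "city": "Beijing", "amount": 30},
    {"id": 2, "city": "Shanghai", "amount": 10},
    {"id": 3, "city": "Beijing", "amount": 20},
    {"id": 4, "city": "Beijing", "amount": 50},
]
result = query(rows, filters={"city": "Beijing"}, sort_by="amount", descending=True, page=1, page_size=2)
```

Returning the total alongside the page lets callers render pagination controls without a second request.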

Analysis Service

Provides high‑performance ad‑hoc queries across multiple sources (Hive, ES, Greenplum, MySQL, Oracle) with millisecond‑level response for large datasets.

Recommendation Service

Generates personalized recommendations using behavior logs and real‑time data, supporting industry‑specific logic and continuous model optimization.

Audience Segmentation Service

Filters users based on tag combinations, supports count verification, and integrates with downstream channels (SMS, email, marketing platforms).
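Tag‑combination filtering reduces naturally to set operations. The sketch below assumes each user carries a set of tags and that a segment is defined by tags that must all be present and tags that must be absent; the `segment` function and the sample tags are hypothetical.

```python
def segment(user_tags, all_of=(), none_of=()):
    """Return users who carry every tag in `all_of` and none of the tags in `none_of`."""
    required, excluded = set(all_of), set(none_of)
    return sorted(
        user for user, tags in user_tags.items()
        if required <= tags and not (excluded & tags)
    )

user_tags = {
    "u1": {"female", "age_25_34", "high_value"},
    "u2": {"female", "age_25_34", "churn_risk"},
    "u3": {"male", "age_25_34", "high_value"},
}
audience = segment(user_tags, all_of=["female", "age_25_34"], none_of=["churn_risk"])
audience_size = len(audience)  # count verification before sending to a channel
```

The resulting user list is what would be handed to a downstream channel such as SMS or email.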

Offline Platform

Features product functions, scheduling, and task dependency management using FTP event triggers, distributed locks, and high‑availability modules (Server, Kernel, Admin).

Real‑Time Platform

Real‑time platforms such as those implemented by Meituan‑Dianping and Bilibili combine real‑time transmission (logs, binlog) with Flink‑based computation, offer BSQL for SQL‑driven analytics, and support state storage in Redis or RocksDB.

Event Management

Coordinates task operations via Server (request), Kernel (execution), and Admin (verification) to ensure single‑operator safety and high availability.

Task State Management

Server initiates state changes; Admin monitors YARN status and updates the UI accordingly.

Task Debugging

SQL tasks can be debugged with custom CSV inputs; results are returned by the kernel.

Log Retrieval

Filebeat ships logs to Kafka, Logstash parses them, and Elasticsearch stores them for Kibana visualization and UI display.

Monitoring & Alerting

Metrics are collected via InfluxDB/ntsdb and visualized in Grafana; alerts trigger via internal chat, email, or SMS based on thresholds (e.g., QPS, latency).
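Threshold‑based alerting can be illustrated without any monitoring stack. The rules and metric names below are made up, and in practice the evaluation would typically be configured in the monitoring system itself rather than hand‑written.

```python
import operator

OPS = {">": operator.gt, "<": operator.lt}

# Hypothetical rules: (metric name, comparison, threshold)
RULES = [("qps", ">", 1000), ("latency_ms", ">", 500)]

def evaluate(metrics, rules=RULES):
    """Return one alert message per rule whose threshold is breached."""
    alerts = []
    for name, op, limit in rules:
        value = metrics.get(name)
        if value is not None and OPS[op](value, limit):
            alerts.append(f"{name}={value} breaches threshold {op} {limit}")
    return alerts

alerts = evaluate({"qps": 1200, "latency_ms": 120})
```

Each message in `alerts` would then be routed to internal chat, email, or SMS.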

Offline vs Real‑Time Warehouse

Offline warehouses use batch tools (Sqoop, DataX, Hive) for T+1 data; real‑time warehouses ingest via Canal to Kafka and store in HBase for sub‑minute queries.

Key Implementation Points

End‑to‑end latency and traffic monitoring.

Rapid fault recovery.

Time‑range data replay.

Data lineage mapping.

Real‑time data quality rules.
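Two of the points above, time‑range replay and real‑time quality rules, are easy to show in miniature. The sketch assumes events are dictionaries with a `ts` timestamp; the field names and rules are hypothetical.

```python
from datetime import datetime

def replay(events, start, end):
    """Return events with start <= ts < end, oldest first."""
    in_range = [e for e in events if start <= e["ts"] < end]
    return sorted(in_range, key=lambda e: e["ts"])

def violations(event, required=("user_id", "amount")):
    """List quality-rule violations for a single event."""
    problems = [f"missing {field}" for field in required if event.get(field) is None]
    amount = event.get("amount")
    if amount is not None and amount < 0:
        problems.append("negative amount")
    return problems

events = [
    {"ts": datetime(2021, 2, 4, 10, 5), "user_id": "u2", "amount": -3},
    {"ts": datetime(2021, 2, 4, 10, 1), "user_id": "u1", "amount": 20},
    {"ts": datetime(2021, 2, 4, 11, 0), "user_id": None, "amount": 5},
]
window = replay(events, datetime(2021, 2, 4, 10, 0), datetime(2021, 2, 4, 11, 0))
report = [violations(e) for e in window]
```

In a streaming system the replay would read from a retained log (for example a Kafka topic) by offset or timestamp instead of filtering a list.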

Conclusion

A robust data middle platform integrates ingestion, processing, storage, governance, and services to enable scalable, reliable, and business‑driven analytics across industries.
