Unlocking Data Middle Platform: From Ingestion to Real‑Time Analytics
This article gives an overview of data middle platform concepts: data aggregation, ingestion tools, offline and real‑time development, scheduling, baseline control, heterogeneous storage, recommended dependencies, data permissions, the layered data architecture (ODS, DW, DWD, DWS, TDM, ADS), asset management, governance, service APIs, and query and analysis services. It also covers monitoring, alerting, and operational practices for building robust big‑data solutions.
Data Middle Platform
The data middle platform must provide core tools for data aggregation: collecting data from heterogeneous networks and sources into a centralized store for downstream processing.
Data Aggregation
Methods include database synchronization, event tracking, web crawling, and message queues. Collection runs in either offline batch mode or real‑time mode.
Data Ingestion Tools
Canal
DataX
Sqoop
Data Development
Provides offline, real‑time, and algorithm development tools for developers and analysts.
Offline Development
Job Scheduling
• Dependency scheduling: a job starts only after all parent jobs finish.
• Time scheduling: a job can be set to start at a specific time.
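The two scheduling rules can be combined into a single readiness check. The following is a minimal sketch, not the platform's actual scheduler; the job names, the `parents` and `start_at` fields, and the `is_runnable` function are hypothetical and chosen only for illustration.

```python
from datetime import datetime, time

def is_runnable(job, finished, now):
    """Return True when every parent has finished and the start time has passed."""
    # Dependency scheduling: all parent jobs must be in the finished set.
    parents_done = all(p in finished for p in job.get("parents", []))
    # Time scheduling: an optional earliest start time (hypothetical field).
    start_at = job.get("start_at")
    time_ok = start_at is None or now.time() >= start_at
    return parents_done and time_ok

# Hypothetical job definitions for illustration.
jobs = {
    "ods_orders": {"parents": [], "start_at": time(1, 0)},
    "dwd_orders": {"parents": ["ods_orders"]},
    "dws_sales": {"parents": ["dwd_orders", "ods_orders"]},
}
```

A scheduler loop would call `is_runnable` for each pending job on every tick and add the job to `finished` once it completes.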
Baseline Control
Predicts job completion time using algorithms; alerts operations staff when jobs cannot finish on time.
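The article does not say which prediction algorithm is used. As a stand-in, the sketch below assumes the simplest possible estimate: each job takes its historical average duration and starts as soon as its slowest parent is predicted to finish. The function names and the `avg_minutes` input are assumptions made for illustration.

```python
from datetime import datetime, timedelta

def predicted_finish(job, parents, avg_minutes, now, memo=None):
    """Estimate when `job` finishes, assuming no job in the chain has started yet."""
    if memo is None:
        memo = {}
    if job not in memo:
        # A job becomes ready when its slowest parent is predicted to finish.
        ready = max(
            (predicted_finish(p, parents, avg_minutes, now, memo)
             for p in parents.get(job, [])),
            default=now,
        )
        memo[job] = ready + timedelta(minutes=avg_minutes[job])
    return memo[job]

def breaches_baseline(job, parents, avg_minutes, now, deadline):
    """True when the predicted finish is later than the baseline deadline."""
    return predicted_finish(job, parents, avg_minutes, now) > deadline
```

When `breaches_baseline` returns True, the platform would notify operations staff before the deadline is actually missed. A production system would likely also account for running jobs, queue wait time, and variance in durations.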
Heterogeneous Storage
Provides a plugin for each engine (e.g., Oracle, Hive, Spark, MapReduce) so that each job is dispatched automatically to the right engine based on its job type.
Code Validation
SQL checkers enforce strict pre‑execution validation for common SQL tasks.
Multi‑Environment Cascading
Supports single, classic, and complex environments with isolated Hive databases, Yarn queues, and even separate Hadoop clusters.
Recommended Dependencies
Uses table‑level lineage graphs to find upstream jobs, performs cycle detection, and returns suitable dependency lists.
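A minimal sketch of this idea, assuming lineage is available as plain dictionaries: `reads` maps a job to the tables it reads, `produced_by` maps a table to the job that writes it, and `depends_on` holds the dependencies already configured. All three structures and both function names are hypothetical.

```python
def reaches(start, target, depends_on):
    """True if `start` already depends, directly or transitively, on `target`."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(depends_on.get(node, []))
    return False

def recommend_dependencies(job, reads, produced_by, depends_on):
    """Suggest upstream jobs for `job`, skipping any that would create a cycle."""
    # Upstream candidates are the jobs that produce the tables this job reads.
    candidates = {produced_by[t] for t in reads[job] if t in produced_by}
    candidates.discard(job)
    # Adding job -> c creates a cycle when c already depends on job.
    return sorted(c for c in candidates if not reaches(c, job, depends_on))
```

Tables with no known producer (for example, externally loaded tables) are simply ignored here; a real implementation might surface them to the user instead.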
Data Permissions
Addresses challenges of multiple engines with separate permission systems, supporting RBAC (e.g., Sentry) and PBAC (e.g., Ranger), and provides a unified permission portal for request, approval, and audit.
Real‑Time Development
Key components: metadata management, SQL‑driven processing, and componentized development.
Intelligent Operations
Integrates task management, code deployment, monitoring, and alerting to improve efficiency, including job rerun and data backfill.
Data Architecture
With data aggregation and data development in place, the platform has the capabilities of a traditional data warehouse and can be used to build an enterprise‑wide data system.
Core characteristics of the data system:
Full‑domain coverage
Clear hierarchical structure
Accurate and consistent data
Performance optimization
Cost reduction through data sharing
Ease of use for downstream applications
Layered Data Model
ODS (Raw Source Layer)
Collects and stores raw business data with minimal transformation; retains original fields and adds timestamps.
DataX synchronization steps:
Identify source and target tables.
Configure field mapping; add date/partition info.
Set incremental or conditional sync conditions.
Clean target tables.
Start sync task.
Validate correctness.
Publish to production scheduling with rate limits, fault tolerance, and alerts.
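The core of these steps (field mapping, partition stamping, an incremental condition, cleaning the target, and validation) can be illustrated without DataX itself. The sketch below uses Python's built-in `sqlite3` as a stand-in for both source and target; it is not DataX configuration, and the table names, the `dt` partition column, and the `sync_partition` function are assumptions. Identifiers are interpolated into SQL directly, so it is only suitable for trusted, hard-coded inputs.

```python
import sqlite3

def sync_partition(conn, source, target, columns, dt, where):
    """Copy rows matching `where` from source into the `dt` partition of target."""
    cols = ", ".join(columns)
    # Clean the target partition first so that reruns are idempotent.
    conn.execute(f"DELETE FROM {target} WHERE dt = ?", (dt,))
    # Copy the mapped fields and stamp each row with the partition date.
    conn.execute(
        f"INSERT INTO {target} ({cols}, dt) "
        f"SELECT {cols}, ? FROM {source} WHERE {where}",
        (dt,),
    )
    # Validate: return source and target row counts for comparison.
    src = conn.execute(f"SELECT COUNT(*) FROM {source} WHERE {where}").fetchone()[0]
    dst = conn.execute(f"SELECT COUNT(*) FROM {target} WHERE dt = ?", (dt,)).fetchone()[0]
    return src, dst

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
conn.execute("CREATE TABLE ods_orders (id INTEGER, amount REAL, updated_at TEXT, dt TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-01"), (2, 20.0, "2024-01-02"), (3, 30.0, "2024-01-02")],
)
counts = sync_partition(
    conn, "orders", "ods_orders", ["id", "amount", "updated_at"],
    dt="2024-01-02", where="updated_at = '2024-01-02'",
)
```

Comparing the two counts is the simplest correctness check; rate limiting, fault tolerance, and alerting would be configured in the production scheduler rather than in the sync logic.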
DW (Unified Warehouse Layer)
Comprises the detail layer (DWD) and the summary layer (DWS), which reorganize business data into consistent metrics and dimensions.
TDM (Tag Data Layer)
Object‑oriented modeling creates a unified tag system across business domains for deep analysis.
ADS (Application Data Layer)
Extracts and processes data from DW/TDM to meet specific business and performance needs.
Data Asset Management
Manages catalogs, metadata, quality, lineage, and lifecycle, providing visualized asset views to enhance data awareness.
Data Governance
Covers standards, metadata, quality, security, and lifecycle management.
Data Service System
Transforms data assets into services via APIs, enabling query, analysis, recommendation, and audience‑segmentation capabilities.
Query Service
Accepts query parameters, returns results, supports indexing, filtering, sorting, and pagination.
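A minimal in-memory sketch of such a service, assuming rows are dictionaries and filters are exact-match conditions. The `query` function and its parameters are hypothetical; indexing is left out because it belongs to the storage engine behind the service.

```python
def query(rows, filters=None, sort_by=None, descending=False, page=1, page_size=10):
    """Filter, sort, and paginate a list of dict rows."""
    # Filtering: keep rows where every filter field matches exactly.
    matched = [
        r for r in rows
        if all(r.get(k) == v for k, v in (filters or {}).items())
    ]
    # Sorting: optional, on a single field.
    if sort_by is not None:
        matched.sort(key=lambda r: r[sort_by], reverse=descending)
    # Pagination: pages are 1-based.
    start = (page - 1) * page_size
    return {
        "total": len(matched),
        "page": page,
        "items": matched[start:start + page_size],
    }
```

Returning `total` alongside the page lets the caller render pagination controls without issuing a second count query.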
Analysis Service
Provides high‑performance ad‑hoc queries across multiple sources (Hive, ES, Greenplum, MySQL, Oracle) with millisecond‑level response for large datasets.
Recommendation Service
Generates personalized recommendations using behavior logs and real‑time data, supporting industry‑specific logic and continuous model optimization.
Audience Segmentation Service
Filters users based on tag combinations, supports count verification, and integrates with downstream channels (SMS, email, marketing platforms).
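Tag-combination filtering maps naturally onto set operations. The sketch below assumes each user's tags are held as a Python set; the `segment` function and its `all_of`, `any_of`, and `none_of` parameters are hypothetical names for "must have all", "must have at least one", and "must have none".

```python
def segment(user_tags, all_of=(), any_of=(), none_of=()):
    """Return the sorted user IDs whose tag set satisfies the combination."""
    all_of, any_of, none_of = set(all_of), set(any_of), set(none_of)
    return sorted(
        uid for uid, tags in user_tags.items()
        if all_of <= tags                      # has every required tag
        and (not any_of or tags & any_of)      # has at least one optional tag
        and not (tags & none_of)               # has no excluded tag
    )
```

Count verification is then just `len(segment(...))`, checked before the resulting list is pushed to a downstream channel.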
Offline Platform
Features product functions, scheduling, and task dependency management using FTP event triggers, distributed locks, and high‑availability modules (Server, Kernel, Admin).
Real‑Time Platform
Real‑time platforms such as those built at Meituan‑Dianping and Bilibili combine real‑time transmission (logs, binlog) with Flink‑based computation. BSQL provides SQL‑driven analytics, and state can be stored in Redis or RocksDB.
Event Management
Coordinates task operations via Server (request), Kernel (execution), and Admin (verification) to ensure single‑operator safety and high availability.
Task State Management
Server initiates state changes; Admin monitors YARN status and updates the UI accordingly.
Task Debugging
SQL tasks can be debugged with custom CSV inputs; results are returned by the kernel.
Log Retrieval
Filebeat ships logs to Kafka, Logstash parses them, and Elasticsearch stores them for Kibana visualization and UI display.
Monitoring & Alerting
Metrics are stored in InfluxDB/ntsdb and visualized in Grafana. Alerts are sent via internal chat, email, or SMS when a metric such as QPS or latency crosses its threshold.
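Threshold-based alerting can be sketched as a small rule evaluator. The rule format, the metric names, the threshold values, and the `evaluate` function below are assumptions for illustration, not the configuration of any specific monitoring product.

```python
# Hypothetical rules: alert when QPS drops too low or latency climbs too high.
RULES = [
    {"metric": "qps", "op": "lt", "threshold": 100, "channels": ["chat"]},
    {"metric": "latency_ms", "op": "gt", "threshold": 500, "channels": ["chat", "sms"]},
]

def evaluate(sample, rules):
    """Return one alert per rule whose threshold the sample crosses."""
    alerts = []
    for rule in rules:
        value = sample.get(rule["metric"])
        if value is None:
            continue  # metric missing from this sample
        if rule["op"] == "gt":
            breached = value > rule["threshold"]
        else:
            breached = value < rule["threshold"]
        if breached:
            alerts.append({
                "metric": rule["metric"],
                "value": value,
                "channels": rule["channels"],
            })
    return alerts
```

Each returned alert carries the channels it should be routed to, so delivery (chat, email, SMS) can be handled by a separate component.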
Offline vs Real‑Time Warehouse
Offline warehouses use batch tools (Sqoop, DataX, Hive) for T+1 data; real‑time warehouses ingest via Canal to Kafka and store in HBase for sub‑minute queries.
Key Implementation Points
End‑to‑end latency and traffic monitoring.
Rapid fault recovery.
Time‑range data replay.
Data lineage mapping.
Real‑time data quality rules.
Conclusion
A robust data middle platform integrates ingestion, processing, storage, governance, and services to enable scalable, reliable, and business‑driven analytics across industries.
ITFLY8 Architecture Home
