Mastering Data Middle Platforms: From Ingestion to Real‑Time Analytics
This comprehensive guide explains the concepts, architecture, and best practices of data middle platforms, covering data aggregation, ingestion tools, offline and real‑time development, data governance, service layers, and implementation details for building scalable big‑data solutions.
Data Middle Platform
Data aggregation is a core function of the data middle platform: it collects data from heterogeneous networks and data sources into centralized storage for downstream processing. Aggregation methods include database synchronization, event tracking, web crawlers, and message queues, each in either offline batch or real‑time mode.
Data Collection Tools
Canal, DataX, Sqoop
Data Development
The data development module provides offline, real‑time, and algorithm development tools for developers and analysts.
Offline Development
Job Scheduling
• Dependency scheduling: a job starts only after all parent jobs finish.
• Time scheduling: a job can be set to start at a specific time.
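The two scheduling rules can be combined into a single readiness check. The following is a minimal sketch, not the platform's actual scheduler; the job dictionary layout and the names `is_ready`, `parents`, and `start_at` are assumptions made for illustration.

```python
from datetime import datetime

def is_ready(job, finished, now):
    """Return True if the job may start under both scheduling rules."""
    # Dependency scheduling: every parent job must already have finished.
    parents_done = all(parent in finished for parent in job["parents"])
    # Time scheduling: the configured start time, if any, must have passed.
    start_at = job.get("start_at")
    time_ok = start_at is None or now >= start_at
    return parents_done and time_ok

# Hypothetical job definitions.
jobs = {
    "load_orders": {"parents": [], "start_at": datetime(2024, 1, 1, 1, 0)},
    "build_report": {"parents": ["load_orders"], "start_at": None},
}
now = datetime(2024, 1, 1, 2, 0)

print(is_ready(jobs["load_orders"], set(), now))   # True
print(is_ready(jobs["build_report"], set(), now))  # False: parent not finished
```

Once `load_orders` is added to the finished set, `build_report` becomes ready on the next check.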
Baseline Control
For long‑running big‑data jobs, the platform uses predictive algorithms to estimate completion time; if a job is not expected to finish on time, alerts notify operations staff so they can intervene early.
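A baseline check of this kind can be sketched as follows. The source does not say which predictive algorithm is used, so the average of past run durations stands in for it here; the function names and sample figures are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean

def predicted_finish(started_at, past_durations_min):
    """Estimate the finish time of a running job."""
    # Stand-in predictor (an assumption): the mean of past run durations.
    return started_at + timedelta(minutes=mean(past_durations_min))

def breaches_baseline(started_at, past_durations_min, baseline):
    """True if the predicted finish time falls after the baseline deadline."""
    return predicted_finish(started_at, past_durations_min) > baseline

started = datetime(2024, 1, 1, 2, 0)
baseline = datetime(2024, 1, 1, 3, 0)
history = [50, 70, 90]  # minutes taken by earlier runs (made-up values)

if breaches_baseline(started, history, baseline):
    # A real platform would notify operations staff here.
    print("ALERT: job is predicted to miss its baseline")
```

The point of checking the prediction rather than the actual finish time is that the alert fires while there is still time to intervene.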
Heterogeneous Storage
Different storage and compute engines (Oracle, Hadoop, Hive, Spark, MapReduce) each have a dedicated plugin; the platform automatically selects the appropriate plugin based on job type.
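Plugin selection by job type is commonly done with a registry. The sketch below assumes a simple in-process registry; `PLUGINS`, `register`, `dispatch`, and the job fields are hypothetical names, and the plugin bodies only return a string instead of submitting real work.

```python
PLUGINS = {}  # job type -> plugin function

def register(job_type):
    """Decorator that registers a plugin under a job type."""
    def wrap(fn):
        PLUGINS[job_type] = fn
        return fn
    return wrap

@register("hive")
def run_hive(job):
    return f"hive plugin runs: {job['script']}"

@register("spark")
def run_spark(job):
    return f"spark plugin runs: {job['script']}"

def dispatch(job):
    """Pick the plugin that matches the job type and run the job with it."""
    try:
        plugin = PLUGINS[job["type"]]
    except KeyError:
        raise ValueError(f"no plugin registered for job type {job['type']!r}") from None
    return plugin(job)

print(dispatch({"type": "hive", "script": "daily_orders.sql"}))
```

Supporting a new engine then only requires registering one more function; the dispatch logic stays unchanged.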
Code Validation
SQL checkers enforce strict controls on common SQL tasks to catch issues early.
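A rule‑based checker might look like the sketch below. The source does not list the actual rules, so the two shown here (no `SELECT *`, no `DELETE` without `WHERE`) are assumptions, and the regular expressions are deliberately naive; a production checker would parse the SQL instead.

```python
import re

# Each rule: (name, pattern that signals a violation, message). Illustrative only.
RULES = [
    ("select-star",
     re.compile(r"\bselect\s+\*", re.IGNORECASE),
     "avoid SELECT *; list columns explicitly"),
    ("delete-without-where",
     re.compile(r"^\s*delete\s+from\b(?!.*\bwhere\b)", re.IGNORECASE | re.DOTALL),
     "DELETE statement has no WHERE clause"),
]

def check_sql(sql):
    """Return (rule name, message) for every rule the statement violates."""
    return [(name, msg) for name, pattern, msg in RULES if pattern.search(sql)]

for statement in ("SELECT * FROM orders",
                  "DELETE FROM orders",
                  "SELECT id FROM orders WHERE dt = '2024-01-01'"):
    print(statement, "->", check_sql(statement))
```

Running the checker before a job is submitted rejects the first two statements and lets the third through.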
Multi‑Environment Cascading
Supports single, classic, and complex environments with isolated Hive databases, Yarn queues, and even separate Hadoop clusters, enabling fine‑grained resource and permission control.
Recommended Dependencies
Uses table‑level lineage graphs to identify upstream jobs, performs cycle detection, and returns suitable dependency lists.
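The recommendation step can be sketched as a lookup followed by a cycle check. This is an illustrative sketch under stated assumptions: `producers` maps each table to the job that writes it, `deps` maps each job to its existing upstream jobs, and all table and job names are made up.

```python
def reaches(start, target, deps):
    """True if `target` is reachable from `start` by following upstream edges."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(deps.get(node, ()))
    return False

def recommend_upstreams(job, input_tables, producers, deps):
    """Suggest upstream jobs for `job` from the tables it reads."""
    # Table-level lineage: the producer of each input table is a candidate.
    candidates = sorted({producers[t] for t in input_tables if t in producers} - {job})
    # Cycle detection: skip a candidate that already depends on `job`.
    return [c for c in candidates if not reaches(c, job, deps)]

producers = {"dw.orders": "job_orders", "dw.users": "job_users"}
deps = {"job_users": {"job_report"}}  # job_users already depends on job_report

print(recommend_upstreams("job_report",
                          ["dw.orders", "dw.users", "ods.raw"],
                          producers, deps))  # ['job_orders']
```

`job_users` is dropped because adding it as an upstream of `job_report` would close a loop, and `ods.raw` yields no suggestion because no known job produces it.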
Data Permissions
Challenges include disparate permission systems across engines and the need for unified RBAC (e.g., Sentry) or PBAC (e.g., Ranger) solutions; a centralized permission center provides UI‑driven request, approval, and audit workflows.
Real‑Time Development
Key components: metadata management, SQL‑driven development, componentized programming.
Intelligent Operations
Integrated tools for job management, code release, monitoring, and alerting improve efficiency; features include rerun, downstream rerun, and data backfill.
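A downstream rerun has to find every job affected by the rerun job and execute them in dependency order. The sketch below is one way to do that with the standard library (`graphlib` requires Python 3.9 or later); the function name and the sample graph are hypothetical.

```python
from graphlib import TopologicalSorter

def downstream_rerun_order(start_job, downstream):
    """Return start_job and all its descendants, parents before children.

    `downstream` maps a job to the jobs that depend on it directly.
    """
    # Collect every job reachable from the starting job.
    affected, stack = set(), [start_job]
    while stack:
        job = stack.pop()
        if job not in affected:
            affected.add(job)
            stack.extend(downstream.get(job, ()))
    # Build predecessor sets restricted to the affected jobs.
    preds = {job: set() for job in affected}
    for parent in affected:
        for child in downstream.get(parent, ()):
            preds[child].add(parent)
    return list(TopologicalSorter(preds).static_order())

# Hypothetical job graph: x -> a -> (b, c) -> d
downstream = {"x": ["a"], "a": ["b", "c"], "b": ["d"], "c": ["d"]}
print(downstream_rerun_order("a", downstream))
```

Rerunning from `a` leaves `x` untouched, and `d` runs only after both `b` and `c` have been rerun.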
Data Architecture
The platform supports a layered data model: ODS (raw source layer), DW (detail and summary layers), TDM (tag data layer), and ADS (application data layer). Each layer serves specific purposes such as raw data preservation, dimensional modeling, and business‑specific data extraction.
Data Asset Management
Manages catalogs, metadata, quality, lineage, and lifecycle, providing a visual overview of enterprise data assets.
Data Governance
Includes standards, metadata, quality, security, and lifecycle management.
Data Service System
Transforms data assets into services via query APIs, analysis APIs, recommendation APIs, and audience segmentation APIs, enabling data‑driven business applications.
Offline Platform
Illustrates product functions, scheduling modules, and overall architecture, including task dependency handling via FTP events and distributed locking.
Real‑Time Platform
Describes implementations at Meituan‑Dianping, Bilibili, and NetEase, covering real‑time transmission (logs, binlog) and computation (Flink, BSQL), state management, and integration with downstream stores like Kafka, HBase, Redis, and OLAP databases.
Event Management
Server initiates events, Kernel executes logic via shell scripts, and Admin confirms results; distributed locks ensure single‑operator safety, with high‑availability achieved through horizontal scaling and hot‑standby.
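The locking behavior can be illustrated with a lock that has an owner and an expiry time. The source does not name the lock service, so the in‑memory class below is only a single‑process stand‑in for a shared store with an atomic set‑if‑absent operation; the class and method names are hypothetical.

```python
import time

class InMemoryLockStore:
    """Single-process stand-in for a shared lock store (not thread-safe)."""

    def __init__(self, clock=time.monotonic):
        self._locks = {}      # lock name -> (owner, expires_at)
        self._clock = clock   # injectable so expiry can be tested

    def acquire(self, name, owner, ttl_s):
        """Take the lock if it is free or expired; return True on success."""
        now = self._clock()
        holder = self._locks.get(name)
        if holder is not None and holder[1] > now:
            return False  # someone else still holds an unexpired lock
        self._locks[name] = (owner, now + ttl_s)
        return True

    def release(self, name, owner):
        """Release the lock only if `owner` is the current holder."""
        holder = self._locks.get(name)
        if holder is not None and holder[0] == owner:
            del self._locks[name]
            return True
        return False

store = InMemoryLockStore()
print(store.acquire("event-42", "operator-a", ttl_s=30))  # True
print(store.acquire("event-42", "operator-b", ttl_s=30))  # False
print(store.release("event-42", "operator-a"))            # True
```

The expiry guards against an operator that crashes while holding the lock, and the owner check in `release` stops one operator from freeing a lock held by another.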
Offline vs Real‑Time Data Warehouses
Offline warehouses use batch tools (Sqoop, DataX, Hive) for T+1 data; real‑time warehouses ingest via Canal to Kafka and store in OLAP systems for sub‑minute queries.
Solution Overview
Provides industry‑specific examples (retail, securities, manufacturing) and key metrics such as RPS, ROI, CPC, CPA, CPM, CVR, CTR, PV, ADPV, and ADIMP.
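Several of these metrics are simple ratios over campaign counts. The sketch below uses the common advertising definitions, which is an assumption because the source does not define them; CVR in particular is taken here as conversions per click. The input figures are made up.

```python
def ratio(numerator, denominator):
    """Divide, returning None when the denominator is zero."""
    return numerator / denominator if denominator else None

def ad_metrics(impressions, clicks, conversions, cost, revenue):
    """Compute common advertising metrics from raw campaign counts."""
    return {
        "CTR": ratio(clicks, impressions),         # click-through rate
        "CVR": ratio(conversions, clicks),         # conversion rate per click
        "CPC": ratio(cost, clicks),                # cost per click
        "CPA": ratio(cost, conversions),           # cost per acquisition
        "CPM": ratio(1000 * cost, impressions),    # cost per thousand impressions
        "ROI": ratio(revenue - cost, cost),        # return on investment
    }

metrics = ad_metrics(impressions=10_000, clicks=200, conversions=10,
                     cost=100.0, revenue=250.0)
for name, value in metrics.items():
    print(name, value)
```

With these inputs the campaign has a CTR of 2%, a CVR of 5%, a CPC of 0.5, a CPA and CPM of 10, and an ROI of 1.5.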
ITFLY8 Architecture Home
