Comprehensive Overview of Data Middle Platform Architecture and Practices
This article provides a detailed introduction to data middle platform concepts, covering data aggregation, ingestion tools, offline and real‑time development, data governance, service layers, monitoring, and deployment patterns, illustrating how enterprises build unified data ecosystems across various industries.
Data Middle Platform
Data aggregation is a core capability of a data middle platform: it collects data from heterogeneous networks and data sources into a centralized store for downstream processing. Aggregation methods include database synchronization, event tracking, web crawling, and message queues, in both batch and real‑time modes.
Data Ingestion Tools
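Common ingestion tools include Canal (incremental capture based on MySQL binlogs), DataX (offline batch synchronization between heterogeneous data sources), and Sqoop (bulk transfer between relational databases and Hadoop).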
Data Development
The data development module serves developers and analysts, offering offline, real‑time, and algorithmic development tools.
Offline Development
Job Scheduling
• Dependency scheduling: a job starts only after all parent jobs finish.
• Time scheduling: a job can be set to start at a specific time.
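The two scheduling rules can be combined into a single readiness check. The following is a minimal sketch, not the platform's actual scheduler; the job names, the `runnable_jobs` function, and the data layout are hypothetical.

```python
from datetime import datetime

def runnable_jobs(parents, finished, start_times, now):
    """Return jobs whose parents have all finished and whose start time has passed."""
    ready = []
    for job, deps in parents.items():
        if job in finished:
            continue  # already done, nothing to schedule
        if not all(dep in finished for dep in deps):
            continue  # dependency scheduling: wait for every parent
        start_at = start_times.get(job)
        if start_at is not None and now < start_at:
            continue  # time scheduling: not yet due
        ready.append(job)
    return sorted(ready)

# Hypothetical job graph: dws_orders depends on two upstream jobs.
parents = {
    "ods_orders": [],
    "ods_users": [],
    "dws_orders": ["ods_orders", "ods_users"],
}
start_times = {"ods_users": datetime(2024, 1, 1, 2, 0)}
now = datetime(2024, 1, 1, 1, 0)

print(runnable_jobs(parents, set(), start_times, now))  # ['ods_orders']
```

Here `ods_users` is held back by its start time and `dws_orders` by its unfinished parents, so only `ods_orders` is runnable.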
Baseline Control
For long‑running big‑data jobs, the platform uses predictive algorithms to estimate completion time and alerts operations staff when a job is lagging, so they can intervene before the deadline is missed.
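The article does not name the predictive algorithm. As a stand-in, the sketch below predicts the finish time from the average of past run durations and flags the job when the prediction exceeds the baseline. All names and numbers are hypothetical, and times are expressed as minutes since midnight to keep the example short.

```python
from statistics import mean

def predicted_finish(start_minute, past_durations):
    """Estimate the finish time as start time plus the mean historical duration."""
    return start_minute + mean(past_durations)

def is_lagging(start_minute, past_durations, baseline_minute):
    """True when the predicted finish falls after the agreed baseline."""
    return predicted_finish(start_minute, past_durations) > baseline_minute

# Hypothetical job: started at 05:00, baseline (deadline) at 05:45.
history = [50, 60, 70]  # minutes taken by the last three runs
if is_lagging(start_minute=300, past_durations=history, baseline_minute=345):
    print("alert: job is predicted to miss its baseline")
```

A production system would likely weight recent runs, account for upstream delays, and propagate the baseline to parent jobs; the average is only the simplest possible predictor.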
Heterogeneous Storage
Each storage or compute engine (Oracle, Hive, Spark, MapReduce) has a dedicated plugin; users create jobs of various types and the system automatically selects the matching plugin.
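Automatic plugin selection is commonly implemented as a registry keyed by job type. The sketch below illustrates that pattern under the assumption that each plugin is a plain function; the `plugin` decorator, `submit`, and the handlers are hypothetical and only return a description instead of calling a real engine.

```python
PLUGINS = {}  # job type -> handler function

def plugin(job_type):
    """Decorator that registers a handler for one job type."""
    def register(func):
        PLUGINS[job_type] = func
        return func
    return register

@plugin("hive")
def run_hive(script):
    return f"hive plugin would run: {script}"

@plugin("spark")
def run_spark(script):
    return f"spark plugin would run: {script}"

def submit(job_type, script):
    """Pick the plugin that matches the job type, or fail loudly."""
    try:
        handler = PLUGINS[job_type]
    except KeyError:
        raise ValueError(f"no plugin registered for job type {job_type!r}") from None
    return handler(script)

print(submit("hive", "SELECT 1"))
```

Supporting a new engine then means registering one more handler, without touching `submit`.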
Code Validation
SQL tasks undergo strict pre‑execution checks to catch errors early.
Multi‑Environment Cascading
Supports single, classic, and complex environments with isolated Hive databases, YARN queues, and even separate Hadoop clusters, enabling fine‑grained resource and permission control.
Recommended Dependencies
Uses table‑level lineage graphs to find upstream jobs, performs cycle detection, and returns suitable dependency lists.
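A minimal sketch of this idea, assuming lineage is available as a mapping from each table to the job that produces it. The function name, job names, and table names are hypothetical. An upstream job is skipped when it already depends on the job being configured, since adding it would close a cycle.

```python
def recommend_dependencies(job, reads, producers, existing_deps):
    """Suggest upstream jobs for `job` from the tables it reads.

    producers: table -> job that writes it (table-level lineage)
    existing_deps: job -> set of jobs it already depends on
    """
    def depends_on(start, target):
        # Depth-first search over the existing dependency edges.
        stack, seen = [start], set()
        while stack:
            current = stack.pop()
            if current == target:
                return True
            if current in seen:
                continue
            seen.add(current)
            stack.extend(existing_deps.get(current, ()))
        return False

    recommended = []
    for table in reads:
        upstream = producers.get(table)
        if upstream is None or upstream == job:
            continue  # no known producer, or the job writes the table itself
        if depends_on(upstream, job):
            continue  # upstream already depends on this job: would form a cycle
        if upstream not in recommended:
            recommended.append(upstream)
    return recommended

producers = {
    "ods.orders": "load_orders",
    "dwd.orders": "clean_orders",
    "ads.report": "build_report",
}
existing_deps = {
    "clean_orders": {"load_orders"},
    "build_report": {"clean_orders"},
}
print(recommend_dependencies(
    "clean_orders", ["ods.orders", "ads.report"], producers, existing_deps
))  # ['load_orders']
```

`build_report` is rejected because it already depends on `clean_orders`.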
Data Permissions
Addresses challenges of diverse permission systems (e.g., Oracle, HANA, Sentry, Ranger) and provides a unified UI for request, approval, and audit of data access.
Real‑Time Development
Features metadata management, SQL‑driven workflows, and componentized development.
Intelligent Operations
Integrates task management, code release, monitoring, and alerting to streamline operations such as job reruns and data backfills.
Data System
Combines data aggregation and development modules to form a traditional data warehouse capability, establishing a comprehensive enterprise data ecosystem.
ODS Layer (Raw Data)
Collects source system data with minimal transformation, preserving original fields and adding timestamps; supports full and incremental loads, and stores both structured and semi‑structured data.
DataX Synchronization Steps
1) Identify source and target tables.
2) Map fields, optionally adding date, partition, or source identifiers.
3) Configure incremental or conditional sync.
4) Clean target data.
5) Launch sync task.
6) Verify correctness.
7) Publish task to production with rate limits, fault tolerance, and alerting.
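Several of these steps map onto fields of a DataX job file. The sketch below builds such a file in Python for a MySQL to MySQL sync. The key names follow the DataX `mysqlreader` and `mysqlwriter` job layout as commonly documented and should be checked against the plugin documentation for the version in use. Hosts, credentials, table names, and the `updated_at` column are hypothetical.

```python
import json

def build_datax_job(source_table, target_table, columns, biz_date):
    """Assemble a DataX-style job description as a Python dict."""
    reader = {
        "name": "mysqlreader",
        "parameter": {
            "username": "reader_user",   # hypothetical credentials
            "password": "***",
            "column": columns,           # step 2: field mapping
            # Step 3: incremental sync via a filter condition.
            # biz_date is assumed to come from the scheduler, not from user input.
            "where": f"updated_at >= '{biz_date}'",
            "connection": [{
                "table": [source_table],
                "jdbcUrl": ["jdbc:mysql://source-host:3306/shop"],
            }],
        },
    }
    writer = {
        "name": "mysqlwriter",
        "parameter": {
            "username": "writer_user",
            "password": "***",
            "writeMode": "insert",
            "column": columns,
            # Step 4: clean the target slice before loading so reruns are idempotent.
            "preSql": [f"DELETE FROM {target_table} WHERE updated_at >= '{biz_date}'"],
            "connection": [{
                "jdbcUrl": "jdbc:mysql://ods-host:3306/ods",
                "table": [target_table],
            }],
        },
    }
    return {
        "job": {
            # Step 7: rate limit (channels) and fault tolerance (error limit).
            "setting": {"speed": {"channel": 2}, "errorLimit": {"record": 0}},
            "content": [{"reader": reader, "writer": writer}],
        }
    }

job = build_datax_job("orders", "ods_orders", ["id", "amount", "updated_at"], "2024-01-01")
print(json.dumps(job, indent=2))
```

The printed JSON would be saved to a file and passed to the DataX launcher; verification (step 6) and alerting remain outside the job file.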
Unified Data Warehouse (DW)
Includes detailed (DWD) and aggregated (DWS) layers, providing a unified, business‑oriented view of data across domains.
Tag Data Layer (TDM)
Object‑oriented modeling creates cross‑domain tag sets for deep analysis.
Application Data Layer (ADS)
Extracts and transforms data from DW/TDM to meet specific business and performance needs.
Data Asset Management
Manages catalogs, metadata, quality, lineage, and lifecycle, offering a visual representation of enterprise data assets.
Data Governance
Covers standards, metadata, quality, security, and lifecycle management.
Data Service System
Transforms data assets into services via APIs for querying, analysis, recommendation, and audience segmentation.
Query Service
Provides configurable query identifiers, filters, sorting, and pagination, exposing results via API.
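One way to picture such a service is a function that turns a query configuration into a parameterized SQL statement. This is a minimal sketch under that assumption. The `build_query` name, the table, and the columns are hypothetical, and the `?` placeholder style would need to match the target database driver.

```python
def build_query(table, allowed_columns, filters, sort_by, descending, page, page_size):
    """Turn a query configuration into (sql, params) with placeholders."""
    # Only whitelisted columns may appear in filters or sorting.
    for column in list(filters) + [sort_by]:
        if column not in allowed_columns:
            raise ValueError(f"column {column!r} is not exposed by this query")

    sql = f"SELECT * FROM {table}"
    if filters:
        sql += " WHERE " + " AND ".join(f"{column} = ?" for column in filters)
    sql += f" ORDER BY {sort_by} {'DESC' if descending else 'ASC'}"
    sql += " LIMIT ? OFFSET ?"

    # Filter values first, then pagination values, matching placeholder order.
    params = list(filters.values()) + [page_size, (page - 1) * page_size]
    return sql, params

sql, params = build_query(
    table="ads_orders",
    allowed_columns={"city", "amount"},
    filters={"city": "Nanjing"},
    sort_by="amount",
    descending=True,
    page=2,
    page_size=10,
)
print(sql)
print(params)
```

Keeping values in `params` rather than in the SQL string lets the API accept caller input without exposing the backing store to injection.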
Analysis Service
Supports multi‑source connectivity (Hive, ES, Greenplum, MySQL, Oracle, files), high‑performance ad‑hoc queries, multi‑dimensional analysis, and flexible business integration.
Recommendation Service
Generates recommendation APIs from historical logs and real‑time data, supporting industry‑specific logic, cold‑start handling, and continuous model optimization.
Audience Segmentation Service
Filters users based on tag combinations, provides count metrics, and integrates with downstream channels (SMS, WeChat, marketing platforms).
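Tag-based filtering reduces to set operations when each tag maps to the set of users carrying it. The sketch below assumes such an in-memory index; real platforms typically use bitmaps or an OLAP engine for the same logic. The tag names and user IDs are hypothetical.

```python
def segment(tag_index, include, exclude=()):
    """Users that carry every tag in `include` and none of the tags in `exclude`."""
    if not include:
        return set()
    # Intersect the user sets of all required tags (returns a new set).
    users = set.intersection(*(tag_index.get(tag, set()) for tag in include))
    for tag in exclude:
        users -= tag_index.get(tag, set())
    return users

tag_index = {
    "female": {1, 2, 3},
    "vip": {2, 3, 4},
    "churned": {3},
}
audience = segment(tag_index, include=["female", "vip"], exclude=["churned"])
print(len(audience), sorted(audience))  # 1 [2]
```

The count is what the service would report back before the audience is pushed to a downstream channel.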
Offline Platform
Describes Suning's offline scheduling architecture, task dependency handling via FTP markers, and high‑availability components (Server, Kernel, Admin).
Real‑Time Platform
Highlights Meituan‑Dianping and Bilibili implementations, including SQL‑based programming, DAG drag‑and‑drop, unified metadata, and Flink‑based computation with state management in RocksDB/Redis.
Scenarios
AI engineering (ads, search, recommendation), real‑time quality monitoring, user growth analysis, and real‑time ETL dashboards.
Event Management
Ensures that each task operation is executed by only one operator at a time, using the Server (request), Kernel (execution), and Admin (verification) modules together with distributed locks and ZooKeeper coordination.
Task State Management
Server initiates tasks, Admin monitors YARN status, and both modules provide high availability via horizontal scaling and hot‑standby.
Task Debugging
SQL tasks can be debugged with custom CSV inputs; Sloth‑server assembles requests, invokes kernels, and aggregates logs.
Log Retrieval
Filebeat ships node logs to Kafka, Logstash parses them, and Elasticsearch stores them for Kibana UI and in‑app log search.
Monitoring
Uses InfluxDB/ntsdb metrics visualized in Grafana and alerting modules for threshold‑based notifications.
Alerting
Supports failure, latency, and custom rule alerts with multiple notification channels (chat, email, SMS).
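A minimal sketch of rule-based alerting, assuming each rule is a predicate over a task's status record paired with its notification channels. The rule names, thresholds, and task fields are hypothetical, and the function only returns what would be sent instead of contacting any channel.

```python
def fired_alerts(task, rules):
    """Return (rule_name, channel) pairs for every rule the task violates."""
    fired = []
    for name, (predicate, channels) in rules.items():
        if predicate(task):
            fired.extend((name, channel) for channel in channels)
    return fired

rules = {
    # Failure alert: the task ended in a failed state.
    "failure": (lambda t: t["status"] == "FAILED", ["chat", "sms"]),
    # Latency alert: processing delay above a 300-second threshold.
    "latency": (lambda t: t["delay_seconds"] > 300, ["email"]),
}

task = {"name": "dws_orders", "status": "RUNNING", "delay_seconds": 420}
print(fired_alerts(task, rules))  # [('latency', 'email')]
```

Custom rules fit the same shape: any function that takes the task record and returns a boolean can be added to `rules`.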
Real‑Time Data Warehouse
Collects logs and events into Kafka, processes them via Flink/Saber, stores results in Redis, Kudu, etc., and serves AI, BI, and reporting use cases.
Offline vs Real‑Time Data Warehouse
Offline warehouses rely on batch tools (Sqoop, DataX, Hive) to deliver T+1 data; real‑time warehouses ingest changes via Canal into Kafka and store results in OLAP or low‑latency serving stores such as HBase for sub‑second queries.
Data Middle Platform Solutions
Provides industry‑specific blueprints (retail, finance, media) with metrics such as RPS and ROI, emphasizing the need for a powerful OLAP database to achieve low‑latency analytics.
Architecture Digest
Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
