
How Data Middle Platforms Transform Ingestion, Governance, and Real‑Time Analytics

This article outlines the core concepts of a data middle platform, covering data aggregation, ingestion tools, offline and real‑time development, data governance, service layers, and practical implementation details such as ODS, DWD, and monitoring, illustrating how enterprises build scalable, secure data ecosystems.

Programmer DD

May 23, 2020

Data Middle Platform

This article summarizes the theory behind data middle platforms, including areas where Flink platformization still needs improvement, with reference to "Data Middle Platform".

Data Aggregation

Data aggregation is a core capability that a data middle platform must provide. It collects data from heterogeneous networks and data sources into the platform for centralized storage, preparing it for downstream processing and modeling. Aggregation methods include database synchronization, embedded tracking, web crawling, and message queues; by timeliness, they divide into offline batch aggregation and real‑time collection.

Data Collection Tools

Canal

DataX

Sqoop

Data Development

The data development module targets developers and analysts, offering offline, real‑time, and algorithm development tools.

Offline Development

Job Scheduling

Dependency scheduling: a job starts only after all parent jobs finish. In Figure 64, Job B starts only after Jobs A and C complete.

Time scheduling: a job can be set to start at a specific time. In Figure 64, Job B starts only after 05:00.
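The two trigger types can be sketched as a single readiness check. This is a minimal illustration, not any particular scheduler's implementation; the `Job` class and `is_runnable` function are hypothetical names, and the sketch assumes a job may carry both a dependency trigger and a time trigger at once, as Job B does in the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, time
from typing import Optional, Set


@dataclass
class Job:
    name: str
    parents: Set[str] = field(default_factory=set)  # jobs that must finish first
    earliest_start: Optional[time] = None           # optional time trigger


def is_runnable(job: Job, finished: Set[str], now: datetime) -> bool:
    """A job may start once all parents are done and its start time has passed."""
    parents_done = job.parents <= finished
    time_ok = job.earliest_start is None or now.time() >= job.earliest_start
    return parents_done and time_ok


job_b = Job("B", parents={"A", "C"}, earliest_start=time(5, 0))
print(is_runnable(job_b, {"A"}, datetime(2020, 5, 23, 6, 0)))        # C not finished
print(is_runnable(job_b, {"A", "C"}, datetime(2020, 5, 23, 4, 30)))  # before 05:00
print(is_runnable(job_b, {"A", "C"}, datetime(2020, 5, 23, 5, 1)))   # both conditions met
```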

Baseline Control

Large‑scale offline jobs can run for a long time, so data may arrive late when it is needed urgently. Baseline control uses an algorithm to predict each job's completion time; if a job is not expected to finish on time and dynamic adjustments do not help, the scheduling center alerts operations staff early, leaving them enough time to intervene.
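One simple way to picture the prediction step is to sum the average historical runtime of the jobs still to run and compare the result with the baseline. The sketch below assumes exactly that; the source does not say which algorithm is used, and `predict_finish`, `should_alert`, and the safety buffer are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import List


def predict_finish(now: datetime, remaining_history_minutes: List[List[float]]) -> datetime:
    """Estimate completion by summing the mean historical runtime of each remaining job."""
    total = sum(mean(runs) for runs in remaining_history_minutes)
    return now + timedelta(minutes=total)


def should_alert(predicted: datetime, baseline: datetime, buffer: timedelta) -> bool:
    """Alert when the predicted finish plus a safety buffer passes the baseline."""
    return predicted + buffer > baseline


now = datetime(2020, 5, 23, 2, 0)
history = [[30, 40, 35], [90, 110, 100]]   # past runtimes of the two remaining jobs
predicted = predict_finish(now, history)
baseline = datetime(2020, 5, 23, 4, 30)
print(predicted, should_alert(predicted, baseline, timedelta(minutes=30)))
```

A production system would likely use a richer model (critical path, cluster load, data volume), but the alerting decision has the same shape.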

Heterogeneous Storage

Enterprise storage and compute engines are diversified. The offline development center builds specific plugins for each engine (e.g., Oracle plugin, Hive/Spark/MR plugins for Hadoop). Users create jobs of various types in the UI; at execution, the system automatically selects the appropriate plugin.
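Selecting a plugin by job type can be pictured as a small registry lookup. The following is a sketch under that assumption; the registry, decorator, and handler names are hypothetical, and the handlers only return a description instead of contacting a real engine.

```python
from typing import Callable, Dict

PLUGINS: Dict[str, Callable[[str], str]] = {}


def plugin(job_type: str):
    """Register a handler for one job type (hypothetical registry, not a real API)."""
    def register(fn):
        PLUGINS[job_type] = fn
        return fn
    return register


@plugin("hive")
def run_hive(script: str) -> str:
    return f"[hive] would submit: {script}"


@plugin("oracle")
def run_oracle(script: str) -> str:
    return f"[oracle] would execute: {script}"


def execute(job_type: str, script: str) -> str:
    """Pick the plugin by job type at execution time."""
    try:
        handler = PLUGINS[job_type]
    except KeyError:
        raise ValueError(f"no plugin registered for job type {job_type!r}")
    return handler(script)


print(execute("hive", "SELECT COUNT(1) FROM dwd_orders"))
```

Adding support for a new engine then means registering one more handler, without touching the dispatch code.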

Code Validation

For common SQL task types, the SQL checker enforces strict controls to detect problems before execution.

Multi‑Environment Cascade

Environment cascading accommodates differing enterprise environment requirements and simplifies resource and permission control and isolation. Each environment has an independent Hive database and Yarn queue, and possibly a different Hadoop cluster. Common environments include:

Single environment: only one production environment, simple internal management.

Classic environment: development environment holds masked data for testing; production follows a release process with real data.

Complex environment: external users get a masked‑control environment; after testing, external models are published to the internal development environment.

Recommended Dependencies

As business deepens, developers continuously add jobs. The system must accurately locate upstream jobs and avoid cycles.

Core principle: build a table‑level lineage graph of upstream and downstream job inputs/outputs.

Use lineage analysis to find suitable upstream jobs.

Perform cycle detection and discard jobs that form loops.

Return a list of appropriate nodes.
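The steps above can be sketched on a toy lineage graph. The job metadata and function names below are hypothetical; the sketch assumes each job declares the tables it reads and writes, and treats a candidate as cyclic when it is already downstream of the new job.

```python
from typing import Dict, List, Set

JobMeta = Dict[str, Set[str]]

# Hypothetical job metadata: each job reads some tables and writes others.
JOBS: Dict[str, JobMeta] = {
    "load_orders":  {"inputs": set(),          "outputs": {"ods_orders"}},
    "clean_orders": {"inputs": {"ods_orders"}, "outputs": {"dwd_orders"}},
    "daily_report": {"inputs": {"dwd_orders"}, "outputs": {"ads_report"}},
}


def downstream_of(job: str, jobs: Dict[str, JobMeta]) -> Set[str]:
    """All jobs reachable from `job` by following output-table -> input-table edges."""
    seen: Set[str] = set()
    stack = [job]
    while stack:
        current = stack.pop()
        for other, meta in jobs.items():
            if other not in seen and jobs[current]["outputs"] & meta["inputs"]:
                seen.add(other)
                stack.append(other)
    return seen


def recommend_upstream(new_job: str, jobs: Dict[str, JobMeta]) -> List[str]:
    """Jobs that produce the new job's input tables, minus any that would close a loop."""
    producers = [
        name for name, meta in jobs.items()
        if name != new_job and meta["outputs"] & jobs[new_job]["inputs"]
    ]
    reachable = downstream_of(new_job, jobs)  # depending on these would form a cycle
    return sorted(p for p in producers if p not in reachable)


jobs = dict(JOBS)
jobs["summary"] = {"inputs": {"dwd_orders"}, "outputs": {"dws_summary"}}
jobs["fix_orders"] = {"inputs": {"ads_report"}, "outputs": {"ods_orders"}}
print(recommend_upstream("summary", jobs))     # producer of dwd_orders
print(recommend_upstream("fix_orders", jobs))  # daily_report is downstream, so it is dropped
```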

Data Permissions

Enterprise heterogeneous compute engines pose the following data‑permission challenges:

Some engines have independent permission systems (e.g., Oracle, HANA, LibrA), requiring separate permission requests.

For the same engine, different vendors implement different permission systems. Hadoop itself lacks a permission system; vendors provide either RBAC (e.g., Cloudera Sentry, Huawei FI) or PBAC (e.g., Hortonworks Ranger).

Data permissions are managed by big‑data cluster or DB administrators; developers cannot operate them directly, leading to heavy operational burden.

A centralized permission‑management UI lets requesters apply for permissions and managers approve or reject them, with full audit trails.

Real‑Time Development

Metadata management

SQL‑driven development

Componentized development

Intelligent Operations

Integrated tools for task management, code deployment, operations, monitoring, and alerting improve efficiency. Features include re‑run, downstream re‑run, and data back‑fill.

Data System

With the data aggregation and development modules in place, the middle platform has the basic capabilities of a traditional data warehouse and can be used to build an enterprise data system. The data system is the flesh and blood of the middle platform; both developers and users work with its data.

Data System Characteristics

Full‑domain coverage: the data system stores all business process data, ensuring the business middle platform can always find needed data.

Clear structure: vertical layers, horizontal subject domains, and business processes make the hierarchy easy to understand.

Accurate and consistent data: unified naming, meaning, and calculation standards, with dedicated modeling teams ensuring consistency.

Performance improvement: standardized design, appropriate data models, and usage‑aware optimization enhance performance.

Cost reduction: shared data avoids siloed duplicate builds, saving compute, storage, and labor costs.

Ease of use: the farther downstream, the easier data consumption becomes, with pre‑processing and optional redundancy.

Industry‑Specific Data Systems

Real estate

Securities

Retail

Manufacturing

Media

Legal services

ODS Layer (Raw Data Layer)

Collect and aggregate business‑system data while preserving original business‑process information. Only simple integration, unstructured‑to‑structured conversion, or date‑stamp enrichment is performed; deep cleaning is avoided.

Table name: ODS_<system abbreviation>_<source table name>

Field names and types remain consistent with source systems.

For large tables, incremental sync creates both full and delta tables (suffix _delta).

For semi‑structured data (logs, files), store both raw and structured versions.
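As a small illustration of the naming rules above, a helper might build full and delta table names like this. The function name and the choice not to normalize letter case are assumptions, not part of the source.

```python
def ods_table_name(system_abbr: str, source_table: str, delta: bool = False) -> str:
    """Build an ODS table name: ODS_<system abbreviation>_<source table>[_delta]."""
    name = f"ODS_{system_abbr}_{source_table}"
    return f"{name}_delta" if delta else name


print(ods_table_name("crm", "customer"))              # full table
print(ods_table_name("crm", "customer", delta=True))  # incremental (delta) table
```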

DataX Synchronization Steps

Identify source business tables and target ODS tables.

Configure field mapping; target tables may add collection date, partition, or source identifiers.

If incremental or conditional sync is needed, set synchronization conditions.

Clean target table data.

Start the sync task to load data into ODS.

Validate that the task runs correctly and data is accurate.

Publish the sync task to production scheduling, configuring rate limits, fault tolerance, quality monitoring, and alerts.
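To make the steps more concrete, the sketch below assembles a DataX job description in Python and prints it as JSON. It assumes a MySQL source and an HDFS‑backed ODS table, using DataX's `mysqlreader` and `hdfswriter` plugins; every host, credential, table, column, and path is a placeholder, and the exact parameters should be checked against the plugin documentation for the DataX version in use.

```python
import json

# All hosts, credentials, tables, columns, and paths below are placeholders.
job = {
    "job": {
        "setting": {
            "speed": {"channel": 2},      # rate limit: number of concurrent channels
            "errorLimit": {"record": 0},  # fault tolerance: tolerate no dirty records
        },
        "content": [{
            "reader": {
                "name": "mysqlreader",
                "parameter": {
                    "username": "reader_user",
                    "password": "change_me",
                    "column": ["id", "customer_id", "amount"],
                    "where": "updated_at >= '2020-05-22'",  # incremental sync condition
                    "connection": [{
                        "table": ["orders"],
                        "jdbcUrl": ["jdbc:mysql://mysql.example.internal:3306/shop"],
                    }],
                },
            },
            "writer": {
                "name": "hdfswriter",
                "parameter": {
                    "defaultFS": "hdfs://namenode.example.internal:8020",
                    "fileType": "orc",
                    "path": "/warehouse/ods.db/ods_shop_orders/dt=2020-05-22",
                    "fileName": "ods_shop_orders",
                    "writeMode": "append",
                    "fieldDelimiter": "\t",
                    "column": [  # field mapping: same order as the reader columns
                        {"name": "id", "type": "BIGINT"},
                        {"name": "customer_id", "type": "BIGINT"},
                        {"name": "amount", "type": "DOUBLE"},
                    ],
                },
            },
        }],
    }
}

print(json.dumps(job, indent=2))
```

The printed JSON would typically be saved to a file and handed to DataX's `datax.py` launcher; cleaning the target partition and the post‑run validation are left to the surrounding scheduler in this sketch.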

DW Layer (Unified Data Warehouse)

Comprises the detail (DWD) and summary (DWS) layers, mirroring traditional data‑warehouse functions: modeling and storing the full history of business data.

TDM Layer (Tag Data Model)

Object‑oriented modeling integrates cross‑domain object data via ID mapping, forming a comprehensive tag system for deep analysis and mining.

ADS Layer (Application Data Service)

Extracts data from the unified warehouse and tag layers, processing it for specific business needs and assembling it for downstream applications.

Data Asset Management

Manages catalogs, metadata, data quality, lineage, and lifecycle, presenting assets visually to raise enterprise data awareness.

Data Governance

Traditional data governance includes standard management, metadata management, quality management, security management, and lifecycle management.

Data Service System

Transforms data into service capabilities, allowing data to participate in business and accelerate middle‑platform development.

Query Service

Accepts specific query conditions and returns matching data via APIs. Supports indexed query identifiers, filter items, and configurable sorting and pagination.

Analysis Service

Leverages high‑performance big‑data analysis components for associative analysis; results are exposed via APIs. Supports multi‑source data access (Hive, ES, Greenplum, MySQL, Oracle, files), high‑speed ad‑hoc queries, multi‑dimensional analysis, and deep data mining.

Recommendation Service

Provides personalized recommendations by ingesting historical logs and real‑time access data, generating recommendation APIs for upper‑layer applications. Supports industry‑specific recommendation logic, cold‑start and active‑user scenarios, and continuous model optimization.

Audience Selection (Circle‑People) Service

Filters user groups based on tag combinations, exposing results via APIs. Supports audience selection, audience sizing for budget control, and multi‑channel integration (file export, SMS, WeChat, marketing systems).
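Filtering by tag combinations amounts to set operations over a tag index. The sketch below assumes an in‑memory index from tag to user IDs; the tags, IDs, and the `select_audience` function are hypothetical, and a real service would query a tag store instead.

```python
from typing import Dict, Iterable, Set

# Hypothetical tag index: tag -> set of user IDs carrying that tag.
TAG_INDEX: Dict[str, Set[int]] = {
    "city:shanghai": {1, 2, 3, 5},
    "gender:female": {2, 3, 4},
    "churn_risk:high": {3, 6},
}


def select_audience(
    all_of: Iterable[str] = (),
    any_of: Iterable[str] = (),
    none_of: Iterable[str] = (),
) -> Set[int]:
    """Combine tags with AND (all_of), OR (any_of), and NOT (none_of)."""
    any_of = list(any_of)
    result = set().union(*TAG_INDEX.values())  # start from every known user
    for tag in all_of:
        result &= TAG_INDEX.get(tag, set())
    if any_of:
        result &= set().union(*(TAG_INDEX.get(t, set()) for t in any_of))
    for tag in none_of:
        result -= TAG_INDEX.get(tag, set())
    return result


audience = select_audience(
    all_of=["city:shanghai", "gender:female"],
    none_of=["churn_risk:high"],
)
print(sorted(audience), len(audience))  # members, and audience size for budgeting
```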

Offline Platform

Suning's offline platform serves as the example here: its architecture, its scheduling modules, a task‑flow dependency implementation based on an FTP event mechanism, and the "Huatuo" diagnostic platform.

Real‑Time Platform

Meituan‑Dianping

Uses Grafana embedded within the platform.

Bilibili

SQL‑based programming

DAG drag‑and‑drop programming

Integrated managed operations

The real‑time platform consists of real‑time transmission and computation. Transmission ingests APP logs, DB binlogs, server or system logs into Kafka/HDFS. Computation builds on BSQL, scheduled by YARN. Flink provides the execution pool, supporting MySQL, Redis, HBase as dimension tables. State is stored in RocksDB, with extensions to MapDB and Redis to alleviate IO bottlenecks. Processed data flows to real‑time warehouses (Kafka, HBase, ES, MySQL, TiDB) and downstream AI/BI/reporting.

NetEase

NetEase stream computing covers advertising, e‑commerce dashboards, ETL, analytics, recommendation, risk control, search, and live streaming.

Event Management

Ensures that only one user operates on a task at a time, through three modules:

Server: receives event requests, validates data, assembles, and forwards to Kernel.

Kernel: executes event logic by issuing shell commands to the cluster.

Admin: confirms execution results, ensuring correctness.

Example (task start): the Server creates a distributed lock for the task, which the Admin monitors. The Server submits the event to the Kernel and updates the DB status to "starting". Once the shell command succeeds, a Zookeeper node is written; the Admin detects the change, queries YARN for the running status, and updates the DB to "running". The lock is then released so that other users can operate on the task.
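The locking part of this flow can be sketched with an in‑memory stand‑in. This is only an illustration of the single‑operator idea: the `TaskLocks` class and both functions are hypothetical, the Kernel call is omitted, and a real deployment would hold the lock in a coordination service such as Zookeeper.

```python
import threading
from typing import Dict


class TaskLocks:
    """In-memory stand-in for the distributed lock (a real system would use Zookeeper)."""

    def __init__(self) -> None:
        self._guard = threading.Lock()
        self._owners: Dict[str, str] = {}

    def acquire(self, task_id: str, user: str) -> bool:
        with self._guard:
            if task_id in self._owners:
                return False  # someone else is already operating on this task
            self._owners[task_id] = user
            return True

    def release(self, task_id: str, user: str) -> None:
        with self._guard:
            if self._owners.get(task_id) == user:
                del self._owners[task_id]


locks = TaskLocks()
status: Dict[str, str] = {}


def start_task(task_id: str, user: str) -> bool:
    """Server side: take the lock and mark the task as starting."""
    if not locks.acquire(task_id, user):
        return False
    status[task_id] = "starting"
    return True


def confirm_running(task_id: str, user: str) -> None:
    """Admin side: once the task is confirmed running, update status and free the lock."""
    status[task_id] = "running"
    locks.release(task_id, user)


print(start_task("job-1", "alice"))  # lock acquired
print(start_task("job-1", "bob"))    # rejected while alice's operation is in flight
confirm_running("job-1", "alice")
print(status["job-1"], start_task("job-1", "bob"))
```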

High availability is achieved by scaling Server horizontally, Server monitoring Kernel health and restarting it, and hot‑standby Admin taking over if the primary fails.

Platform Task Status Management

Server controls initial states; Admin handles all YARN‑related state interactions.

Task Debugging

SQL tasks support debugging by uploading CSV files as source and dimension inputs. The designated Kernel executes the debug, with sloth‑server assembling requests, invoking the Kernel, returning results, and collecting logs.

Log Retrieval

Filebeat agents on each YARN node ship task logs to Kafka; Logstash parses and stores them in Elasticsearch. Logs are visualized via Kibana for developers/operations and displayed in the UI for user search.

Monitoring

Metrics are monitored using InfluxDB and NetEase‑developed NTSDB. Users can view metrics via Grafana or receive alerts through internal chat, email, phone, or SMS.

Alarm

The Sloth stream‑compute platform supports alerts for task failures, data latency, failover, and custom rules (e.g., input QPS below a threshold). Alert channels include internal chat tools, email, phone, and SMS, with optional suppression intervals to avoid repeated alerts during debugging.
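A custom rule with a suppression interval can be sketched as follows. The class and its parameters are hypothetical and do not reflect Sloth's actual rule engine; the sketch assumes a rule fires when input QPS falls below a threshold, and at most once per suppression interval.

```python
from datetime import datetime, timedelta
from typing import Optional


class QpsAlertRule:
    """Fire when input QPS drops below a threshold, at most once per suppression interval."""

    def __init__(self, threshold: float, suppress_for: timedelta) -> None:
        self.threshold = threshold
        self.suppress_for = suppress_for
        self._last_fired: Optional[datetime] = None

    def check(self, qps: float, now: datetime) -> bool:
        if qps >= self.threshold:
            return False
        if self._last_fired is not None and now - self._last_fired < self.suppress_for:
            return False  # still inside the suppression window
        self._last_fired = now
        return True


rule = QpsAlertRule(threshold=100.0, suppress_for=timedelta(minutes=10))
t0 = datetime(2020, 5, 23, 12, 0)
print(rule.check(50.0, t0))                          # below threshold: fires
print(rule.check(40.0, t0 + timedelta(minutes=5)))   # suppressed
print(rule.check(40.0, t0 + timedelta(minutes=15)))  # window elapsed: fires again
```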

Real‑Time Data Warehouse

Many NetEase products have built real‑time warehouses, still being refined. Real‑time warehouses ingest logs and tracking data into Kafka, process them via the real‑time compute platform, extract detailed ODS data, perform aggregation and dimension joins, and write results to Redis, Kudu, etc., exposing them through data services for front‑end consumption.

E‑Commerce Applications – Data Analysis

Real‑time activity analysis, homepage resource analysis, funnel tracking, and real‑time profit calculation.

E‑Commerce Applications – Search & Recommendation

Includes real‑time user footprints, real‑time user features, real‑time item features, real‑time CTR/CVR sample construction, homepage carousel, activity selection, and UV/PV real‑time statistics.

E‑Commerce Marketing Metrics

CPC (Cost Per Click)

CPA (Cost Per Action)

CPM (Cost Per Mille)

CVR (Conversion Rate)

CTR (Click‑Through Rate)

PV (Page View)

ADPV (Advertisement Page View)

ADIMP (Advertisement Impression)

PV price (Revenue per PV)
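Several of these metrics are simple ratios, which the sketch below spells out using their common definitions. The function names and sample figures are illustrative; note in particular that CVR is computed here as conversions per click, whereas some teams divide by visits instead.

```python
def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def ctr(clicks: int, impressions: int) -> float:
    """Click-through rate: clicks per impression."""
    return _ratio(clicks, impressions)


def cvr(conversions: int, clicks: int) -> float:
    """Conversion rate, here taken as conversions per click."""
    return _ratio(conversions, clicks)


def cpc(cost: float, clicks: int) -> float:
    """Cost per click."""
    return _ratio(cost, clicks)


def cpm(cost: float, impressions: int) -> float:
    """Cost per thousand impressions."""
    return _ratio(cost * 1000, impressions)


def cpa(cost: float, actions: int) -> float:
    """Cost per action (e.g., per order or registration)."""
    return _ratio(cost, actions)


# Illustrative campaign: 20,000 impressions, 400 clicks, 20 conversions, cost 1,000.
print(ctr(400, 20000), cvr(20, 400), cpc(1000.0, 400), cpm(1000.0, 20000), cpa(1000.0, 20))
```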

Offline vs. Real‑Time Data Warehouse

Building an Offline Warehouse from Scratch

A data warehouse is a subject‑oriented, integrated, time‑variant, non‑volatile collection of data used to support enterprise management and decision‑making.

Goals: turn data into managed assets and provide information for decision‑making.

ETL process connects offline and real‑time pipelines, enabling data flow throughout the enterprise.

Data layering provides low coupling and high cohesion, preventing frequent re‑architecting when business or data changes.

Data integration breaks silos, ensuring unified data services.

Standardized design yields maintainable, highly extensible solutions.

Monitoring and support include quality monitoring, scheduling, metadata, and security management.

Service orientation offers API access, self‑service query platforms, and OLAP analysis.

ETL

Business data originates from diverse sources (text, logs, RDBMS, NoSQL). ETL tools must cover these scenarios. Common tools: DataX, Sqoop, Kettle, Informatica.

ETL typically runs after midnight to avoid impacting production systems and to meet defined scheduling windows.

Layering

Stages include raw (ODS), detail (DWD), summary (DWS), and application (ADS) layers, each serving specific processing and consumption needs.

Code Standards

Script header comments follow Google style for encoding, documentation, and SQL conventions.

One file per table; temporary tables only within the file; file name matches table name.

Field naming eliminates synonyms and polysemy, especially in the model layer.

Differences Between Offline and Real‑Time Warehouses

Offline warehouses use Sqoop, DataX, Hive, etc., delivering T+1 data via scheduled batch jobs.

Real‑time warehouses ingest raw data via tools like Canal into Kafka, store results in OLAP stores such as HBase, and provide minute‑ or second‑level query latency.

Choosing a powerful OLAP database is crucial for real‑time warehouse viability.

Data Layer Overview

ODS: raw data layer, facts stored in Kafka.

DWD: detailed layer, supports joins, stored in Kafka/Redis.

DIM: dimension data, stored in HBase.

DM: application layer – MySQL for summary metrics, Greenplum for detailed multi‑dimensional analysis, HBase for high‑concurrency summaries, Redis for top‑N lists.

Data Middle Platform Solutions

Industry‑specific implementations (e.g., retail) illustrate revenue‑per‑search (RPS) and ROI metrics.

Overall, the data middle platform integrates ingestion, processing, governance, and service layers to enable scalable, secure, and business‑driven data ecosystems.


Written by

Programmer DD

A tinkering programmer and author of "Spring Cloud Microservices in Action"
