Evolving from Data Warehouses to Data Middle Platforms: Architecture & Practices
This talk reviews China's big‑data evolution from early enterprise data warehouses to modern data middle platforms, outlines core architectural components, technology selections, data development practices, lifecycle and quality management, and shares practical Q&A insights for building scalable, cost‑effective data infrastructures.
1. Big Data Evolution in China
The speaker first revisits the first decade of the 21st century when enterprise data warehouses (EDW) dominated, driven by vendors such as IBM, Oracle, and Teradata. Implementations required large‑scale hardware, commercial relational databases (Oracle, DB2, SQL Server) and costly ETL/OLAP suites, mainly serving finance, telecom, retail, and manufacturing.
From 2010 to 2015, the rapid growth of the mobile internet sparked the big‑data platform era. Hadoop clusters built on inexpensive PC servers enabled large‑scale data processing. The concept of a data lake emerged to ingest raw structured and unstructured data, reducing the complex modeling steps of traditional warehouses. Applications expanded beyond decision analytics to include search, recommendation, A/B testing, and user profiling.
In the current stage (the "data middle platform" era), over a decade of technical accumulation has produced four key capabilities:
Data unification : a common data model, unified metrics and tags, improving standardization and reusability.
Tool componentization : reusable pipelines, processing, storage, and visualization components to avoid duplicate development.
Service‑oriented data access : standardized APIs and visualization products that decouple consumers from underlying data stores.
Organizational clarity : dedicated data platform teams focus on platform development while business, product, and analysis teams consume data services.
Data middle platforms also enable extensive online‑offline integration, real‑time fraud detection, recommendation, and even data‑as‑a‑product monetization. Cloud computing reduces infrastructure overhead, allowing elastic scaling without on‑premise data‑center investments.
2. Data Middle Platform Architecture & Technical Selection
The core architecture is layered; three layers are described here:
Foundation – Data Infrastructure : data ingestion, compute, and storage platforms (self‑built or cloud services).
Public Data Zone : a data warehouse (or data lake) for shared data models, plus a unified metric/tag platform.
Application Service Layer : data APIs, multi‑dimensional query, visualization, and analytics services.
Supporting platforms run throughout:
Data Development Platform : pipelines, modeling tools, scripting, and scheduling utilities.
Data Management Platform : metadata, quality, and lifecycle management.
Typical open‑source selections (Hadoop ecosystem) include:
Extraction: Sqoop for relational data, Flume for log streams.
Storage: HDFS for batch data, Kafka for streaming.
Batch compute: Hive, Spark, optionally Tez.
Streaming compute: historically Storm and Spark Streaming, now moving to Flink.
Scheduling: Airflow, Azkaban, Oozie, DolphinScheduler.
OLAP engines: ROLAP (Presto, Trino) and MOLAP (Kylin, Druid) depending on query latency needs.
Visualization: Metabase, Superset, Redash, or proprietary UI components.
Selection criteria focus on proven use cases, openness, community activity, and compatibility with existing stacks.
3. Data Development Practices
Data processing has evolved from pure batch (daily jobs) to near‑real‑time (hourly or 15‑minute windows) and finally to lambda‑style architectures that combine batch and streaming. Recent trends favor unified stream‑batch frameworks (e.g., Flink) that allow seamless switching.
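The Lambda‑style combination of batch and streaming mentioned above can be illustrated with a toy example. This is only a sketch under stated assumptions: the event shape (`page`, `ts`), the `CUTOFF` boundary, and the function names are hypothetical, and plain in‑memory counters stand in for real batch and streaming engines.

```python
from collections import Counter

def batch_view(events, cutoff):
    """Batch layer: recompute counts over all events before the cutoff."""
    return Counter(e["page"] for e in events if e["ts"] < cutoff)

def speed_view(events, cutoff):
    """Speed layer: count only the recent events at or after the cutoff."""
    return Counter(e["page"] for e in events if e["ts"] >= cutoff)

def serve(batch, speed):
    """Serving layer: merge the two views at query time."""
    return batch + speed

# Hypothetical page-view events; `ts` is an abstract timestamp.
events = [
    {"page": "home", "ts": 1},
    {"page": "home", "ts": 5},
    {"page": "cart", "ts": 7},
    {"page": "home", "ts": 12},
]
CUTOFF = 10  # everything before this has already been processed by the batch job

merged = serve(batch_view(events, CUTOFF), speed_view(events, CUTOFF))
print(dict(merged))
```

The same counting logic appears twice, once per layer. That duplication is what unified stream‑batch frameworks aim to remove.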
Warehouse layering follows an ELT approach:
ODS (Business Data Layer) : raw source data, stored in normalized form; supports schema‑on‑write and history‑tracking "zipper table" (slowly changing dimension) techniques.
DWD/DWS (Public Data Layer) : dimensional models, fact tables, and wide tables for downstream consumption; balances redundancy with query performance.
DWA (Application Data Layer) : flexible, application‑specific marts built on top of the public layer.
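The layering above can be sketched with an in‑memory SQLite database, where each layer is derived from the one below it after the raw data has been loaded (the ELT pattern). All table names, columns, and sample rows are hypothetical, and SQLite merely stands in for Hive or Spark SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- ODS: raw orders loaded as-is from the source system (hypothetical schema).
CREATE TABLE ods_orders (
    order_id INTEGER, user_id INTEGER, amount_cents INTEGER,
    status TEXT, order_date TEXT
);
INSERT INTO ods_orders VALUES
    (1, 10, 1200, 'PAID',      '2024-01-01'),
    (2, 10,  800, 'CANCELLED', '2024-01-01'),
    (3, 11,  500, 'PAID',      '2024-01-01'),
    (4, 10,  300, 'PAID',      '2024-01-02');

-- DWD: cleaned detail fact table (only paid orders are kept).
CREATE TABLE dwd_order_fact AS
SELECT order_id, user_id, amount_cents, order_date
FROM ods_orders
WHERE status = 'PAID';

-- DWS: lightly aggregated wide table shared by downstream consumers.
CREATE TABLE dws_user_day AS
SELECT user_id, order_date,
       COUNT(*) AS order_cnt,
       SUM(amount_cents) AS paid_cents
FROM dwd_order_fact
GROUP BY user_id, order_date;

-- DWA: application-specific mart built on the public layer.
CREATE TABLE dwa_daily_revenue AS
SELECT order_date, SUM(paid_cents) AS revenue_cents
FROM dws_user_day
GROUP BY order_date;
""")

daily_revenue = conn.execute(
    "SELECT order_date, revenue_cents FROM dwa_daily_revenue ORDER BY order_date"
).fetchall()
print(daily_revenue)
```

The application mart reads only from the public layer and never from ODS directly, which is what lets the shared models be reused across applications.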
Topic classification is performed from both a business‑domain perspective (e.g., commerce, content) and a technical‑domain perspective (e.g., finance, supply‑chain), producing a three‑level hierarchy: domain → topic → sub‑topic.
The end‑to‑end data development workflow includes requirement analysis, data modeling, development & testing, and release. Modeling tools have been digitized to generate mapping documents and knowledge artifacts automatically.
4. Data Lifecycle & Quality Management
Lifecycle considerations address storage growth and compute cost. Strategies include:
Reduce volume : compression (Parquet + Snappy), data deduplication, and archiving cold data.
Control growth : retention policies based on importance and recoverability, automated archiving or deletion.
Cost allocation : attributing storage and compute costs to business owners based on usage patterns.
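A retention rule and a usage‑based cost split along these lines might look like the following sketch. The importance tiers, the day limits, and the rule "delete if recoverable, otherwise archive" are illustrative assumptions, not values given in the talk.

```python
from datetime import date

# Hypothetical retention limits in days per importance tier; None = keep forever.
RETENTION_DAYS = {"core": None, "important": 730, "normal": 180}

def lifecycle_action(importance, recoverable, last_access, today):
    """Return 'keep', 'archive', or 'delete' for one table or partition."""
    limit = RETENTION_DAYS[importance]
    if limit is None:
        return "keep"
    if (today - last_access).days <= limit:
        return "keep"
    # Expired data that can be rebuilt from upstream is deleted;
    # data that cannot be recovered is moved to cold storage instead.
    return "delete" if recoverable else "archive"

def allocate_cost(total_cost, usage_by_owner):
    """Split a shared bill across owners in proportion to their usage."""
    total_usage = sum(usage_by_owner.values())
    if total_usage <= 0:
        raise ValueError("total usage must be positive")
    return {
        owner: total_cost * used / total_usage
        for owner, used in usage_by_owner.items()
    }

today = date(2024, 1, 1)
print(lifecycle_action("normal", True, date(2023, 1, 1), today))
print(lifecycle_action("important", False, date(2021, 1, 1), today))
print(allocate_cost(1000.0, {"ads": 30, "search": 70}))
```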
Security measures involve encrypting sensitive fields at ingestion, isolating critical data in separate clusters, and applying tiered access controls.
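As a rough illustration of protecting sensitive fields at ingestion, the sketch below replaces them with a keyed hash (HMAC‑SHA256) from the Python standard library. Note that this is pseudonymization rather than the reversible encryption the talk refers to; it is swapped in because it needs no third‑party library. The field names and the key are hypothetical, and a real deployment would fetch the key from a key‑management service.

```python
import hashlib
import hmac

# Hypothetical set of field names treated as sensitive.
SENSITIVE_FIELDS = {"phone", "id_card"}

def protect_record(record, key):
    """Return a copy of `record` with sensitive fields replaced by an HMAC digest."""
    protected = {}
    for field, value in record.items():
        if field in SENSITIVE_FIELDS and value is not None:
            protected[field] = hmac.new(
                key, str(value).encode("utf-8"), hashlib.sha256
            ).hexdigest()
        else:
            protected[field] = value
    return protected

demo_key = b"demo-key-do-not-use"  # placeholder only
raw = {"user_id": 1, "phone": "13800000000", "city": "Hangzhou"}
safe = protect_record(raw, demo_key)
print(safe["user_id"], safe["city"], len(safe["phone"]))
```

Because the digest is deterministic for a given key, protected fields can still be used as join keys downstream without exposing the original value.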
Quality management covers accuracy, timeliness, and consistency across pre‑, in‑, and post‑processing stages, with alerting integrated into modern instant‑messaging tools.
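A post‑processing check covering the three dimensions could be sketched as follows. The thresholds, the `user_id` key field, and `send_alert` (a stand‑in for an instant‑messaging webhook) are assumptions made for illustration.

```python
from datetime import datetime

def null_rate(rows, field):
    """Share of rows whose `field` is missing or None (a simple accuracy proxy)."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(field) is None) / len(rows)

def run_checks(rows, source_count, finished_at, deadline,
               key_field="user_id", max_null_rate=0.01):
    """Return a list of alert messages; an empty list means all checks passed."""
    alerts = []
    rate = null_rate(rows, key_field)
    if rate > max_null_rate:
        alerts.append(f"accuracy: {key_field} null rate {rate:.1%} exceeds {max_null_rate:.1%}")
    if finished_at > deadline:
        alerts.append("timeliness: job finished after its deadline")
    if len(rows) != source_count:
        alerts.append(f"consistency: loaded {len(rows)} rows, source had {source_count}")
    return alerts

def send_alert(message):
    """Hypothetical stand-in for posting to an instant-messaging channel."""
    print(f"[ALERT] {message}")

rows = [{"user_id": 1}, {"user_id": None}]
alerts = run_checks(
    rows,
    source_count=3,
    finished_at=datetime(2024, 1, 2, 5, 0),
    deadline=datetime(2024, 1, 2, 6, 0),
)
for message in alerts:
    send_alert(message)
```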
5. Data Application Architecture & ROI
The application stack typically flows from the warehouse/data lake to a data‑engine layer (Presto, Kylin, Druid, MySQL), then to a unified data‑service layer exposing SQL APIs, followed by an indicator platform and multi‑dimensional query tools. Unified metadata management tracks lineage and usage, while fine‑grained permission controls ensure security.
ROI is evaluated using a simple model: value (activity, coverage, contribution) divided by cost (compute, storage). Data with low ROI may be candidates for deprecation.
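The talk does not say how the three value signals are combined, so the sketch below assumes a weighted sum with hypothetical weights and an arbitrary deprecation threshold; only the value‑over‑cost ratio itself comes from the source.

```python
def data_roi(activity, coverage, contribution,
             compute_cost, storage_cost, weights=(1.0, 1.0, 1.0)):
    """ROI = value / cost, with value assumed to be a weighted sum of the signals."""
    w_activity, w_coverage, w_contribution = weights
    value = (w_activity * activity
             + w_coverage * coverage
             + w_contribution * contribution)
    cost = compute_cost + storage_cost
    if cost <= 0:
        raise ValueError("cost must be positive")
    return value / cost

def deprecation_candidates(tables, threshold):
    """Names of tables whose ROI falls below the (hypothetical) threshold."""
    return [name for name, metrics in tables.items() if data_roi(**metrics) < threshold]

tables = {
    "dws_user_day": dict(activity=10, coverage=5, contribution=5,
                         compute_cost=8, storage_cost=2),
    "tmp_export_2019": dict(activity=0, coverage=1, contribution=0,
                            compute_cost=1, storage_cost=9),
}
print(deprecation_candidates(tables, threshold=0.5))
```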
6. Q&A Highlights
Compression: Parquet + Snappy; consider erasure coding (EC) to reduce replica storage overhead.
Batch performance: examine resource contention, task prioritization, and data skew mitigation.
Hadoop vs. MPP: Hadoop scales horizontally to thousands of nodes; MPP may hit scaling limits.
Kylin use‑cases: stable business metrics with large, deduplicated datasets; flexibility can be improved via view abstraction.
Metadata alignment: use unified metadata tools to map business concepts to technical artifacts.
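One common way to mitigate the data skew mentioned under batch performance is two‑stage aggregation with salted keys. The talk does not name a specific technique, so the following is an illustrative sketch only; in a real engine each `(key, salt)` pair would be processed by a separate task.

```python
import random
from collections import Counter

def salted_count(keys, num_salts, seed=0):
    """Count keys in two stages so that a hot key is spread over several buckets."""
    rng = random.Random(seed)
    # Stage 1: append a random salt so one hot key maps to up to `num_salts` buckets.
    partial = Counter((key, rng.randrange(num_salts)) for key in keys)
    # Stage 2: drop the salt and merge the partial counts per original key.
    final = Counter()
    for (key, _salt), count in partial.items():
        final[key] += count
    return final, partial

keys = ["hot"] * 1000 + ["cold"] * 3
final, partial = salted_count(keys, num_salts=4)
largest_bucket = max(count for (key, _salt), count in partial.items() if key == "hot")
print(final["hot"], final["cold"], largest_bucket)
```

The final counts are unchanged, but no single stage‑1 bucket has to process all rows of the hot key.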
Overall, the speaker emphasizes efficiency gains through tooling and AI, flexibility via stream‑batch convergence, cost control from the early stages, and the future impact of advances in computing (e.g., quantum computing, HTAP).
dbaplus Community
