
Real-time Data Warehouse Governance: Optimization Practices and Technical Enhancements

This article presents a comprehensive overview of the current challenges, platform architecture, governance planning, and technical optimizations—including Flink SQL, Kafka batch processing, and partitioned stream tables—used to improve resource efficiency, stability, and scalability of a large‑scale real‑time data warehouse.


01 Current Situation and Problems

The Cloud Music data‑warehouse platform has been in production for over six years, serving more than 700 users and handling 1,600+ real‑time and 7,000‑8,000 offline SQL tasks daily on a cluster of over 2,000 compute nodes, processing petabytes of raw logs each day.

Because the platform serves almost every business line, most developers (including analysts, algorithm engineers, and QA) interact with big‑data processing, leading to high resource consumption and operational pressure.

02 Platform Philosophy

The platform aims to bridge technology and business by providing a customized, business‑centric data‑service layer that differs from generic enterprise solutions, focusing on cost‑effective usage and deep integration with internal workflows.

03 Overall Architecture

Built on shared cluster services, the platform leverages Flink‑based real‑time development (Sloth), the "Mammoth" offline engine (supporting MR, SparkSQL, Jar, HiveSQL), a metadata center for lineage, and Ranger‑based security. Over 80% of tasks rely on custom components that enable fine‑grained control and bulk optimizations.

04 Why Governance Is Needed

Cost‑reduction pressure from the company.

High Kafka water‑level caused by massive traffic spikes.

Three‑fold increase in upstream data due to a new event‑tracking system.

Growing number of non‑expert users leading to frequent basic performance and configuration issues.

05 Governance Planning

The plan is divided into four parts: (1) Diagnose the current state, (2) Conduct "exercise‑style" governance on legacy tasks, (3) Apply technical optimizations, and (4) Ensure sustainable, automated governance.

5.1 Diagnose the Current State

The team integrated the platform with the group-wide Smildon monitoring service to collect real-time resource usage and cost, converting usage into monetary metrics visible to users. It also gathered task-concurrency versus input-flow data to identify abnormal resource allocations.
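The conversion from raw resource usage to a monetary figure can be sketched as follows. This is a minimal illustration, not the platform's actual billing model: the unit prices, field names, and formula are assumptions.

```python
# Illustrative sketch: turn a task's resource footprint into a cost
# figure shown to its owner, as described for the Smildon-based
# diagnosis. Unit prices below are assumed, not real platform rates.

VCORE_HOUR_PRICE = 0.05   # assumed cost per vCore-hour
MEM_GB_HOUR_PRICE = 0.01  # assumed cost per GB-hour of memory

def task_cost(vcores: int, mem_gb: int, hours: float) -> float:
    """Rough cost of holding `vcores` CPUs and `mem_gb` RAM for `hours`."""
    return round(vcores * hours * VCORE_HOUR_PRICE
                 + mem_gb * hours * MEM_GB_HOUR_PRICE, 2)

# A task holding 8 vCores and 32 GB of memory for a full day:
daily = task_cost(vcores=8, mem_gb=32, hours=24)
```

Surfacing even a rough number like this per task is often enough to prompt owners to question long-idle allocations.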

It also implemented virtual queues per department to enforce resource limits and to trigger expansion requests when usage thresholds are exceeded.
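A per-department virtual queue of this kind might look like the sketch below. The quota accounting, the 90% expansion threshold, and the return messages are illustrative assumptions, not the platform's implementation.

```python
# Minimal sketch of a per-department virtual queue: usage is tracked
# against a quota, and crossing a threshold triggers an expansion
# request instead of silently over-allocating shared resources.

class VirtualQueue:
    def __init__(self, quota_vcores: int, expand_threshold: float = 0.9):
        self.quota = quota_vcores
        self.used = 0
        self.threshold = expand_threshold  # assumed 90% trigger point

    def allocate(self, vcores: int) -> str:
        if self.used + vcores > self.quota:
            return "rejected: over quota, file an expansion request"
        self.used += vcores
        if self.used >= self.quota * self.threshold:
            return "granted: threshold exceeded, expansion request triggered"
        return "granted"

q = VirtualQueue(quota_vcores=100)
r1 = q.allocate(50)   # well under quota
r2 = q.allocate(45)   # pushes usage to 95% -> expansion flagged
r3 = q.allocate(10)   # would exceed quota -> rejected
```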

5.2 Efficient Governance

Using the collected metrics, tasks are ranked and optimized in bulk. Governance actions include:

Identifying and decommissioning unused tasks via lineage analysis and operational signals.

Adjusting unreasonable resource configurations based on per‑concurrency processing rates.

Reclaiming resources from tasks whose traffic has declined.

Technical tuning such as Flink‑SQL enhancements, Kafka batch improvements, and custom partitioned stream tables.
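The second action above, spotting unreasonable configurations from per-concurrency processing rates, can be sketched as a simple ranking. The sample tasks, field names, and the assumed healthy rate of 1,000 records/s per slot are illustrative only.

```python
# Sketch: rank tasks by records-per-second handled by each parallel
# slot; tasks far below a healthy rate are candidates for downsizing.
# All numbers and names here are illustrative assumptions.

def per_slot_rate(input_rps: float, parallelism: int) -> float:
    return input_rps / parallelism

tasks = [
    {"name": "log_etl",   "input_rps": 50000, "parallelism": 10},
    {"name": "dim_join",  "input_rps": 2000,  "parallelism": 40},
    {"name": "agg_daily", "input_rps": 9000,  "parallelism": 9},
]

HEALTHY_RATE = 1000.0  # assumed sustainable records/s per slot

# Lowest per-slot rate first: strongest candidates for reclaiming slots.
ranked = sorted(tasks, key=lambda t: per_slot_rate(t["input_rps"],
                                                   t["parallelism"]))
suspects = [t["name"] for t in ranked
            if per_slot_rate(t["input_rps"], t["parallelism"])
               < HEALTHY_RATE * 0.5]
```

Here `dim_join` processes only 50 records/s per slot across 40 slots, so it surfaces first for resource reclamation.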

5.3 Technical Optimizations

Flink SQL Optimization

Implemented pre‑deserialization filtering to avoid unnecessary JSON parsing, added asynchronous dimension‑table joins, and introduced rescale/rebalance operators to decouple Kafka read parallelism from downstream processing, dramatically improving throughput.
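The pre-deserialization filtering idea can be illustrated with a small sketch: check the raw bytes for a marker before paying for a full JSON parse. In the platform this logic would live inside a custom Flink DeserializationSchema; the event name and byte marker below are assumptions for illustration.

```python
# Hedged sketch of pre-deserialization filtering: a cheap byte-level
# substring check first, full JSON parsing only for likely matches.

import json

WANTED_EVENT = b'"event":"play"'  # assumed marker for events we keep

def parse_if_relevant(raw: bytes):
    """Skip irrelevant records without deserializing them."""
    if WANTED_EVENT not in raw:
        return None                # dropped before any JSON work
    record = json.loads(raw)       # full parse only for candidates
    return record if record.get("event") == "play" else None

msgs = [
    b'{"event":"play","song_id":1}',
    b'{"event":"pause","song_id":1}',
]
kept = [r for r in (parse_if_relevant(m) for m in msgs) if r]
```

When most traffic is irrelevant to a given task, avoiding the parse entirely is where the throughput win comes from.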

Kafka Batch Optimization

Enhanced monitoring, rebalanced partition distribution, and adopted the Sticky Partitioner with tuned batch size, linger time, and message size to reduce Kafka water‑level from 80% to 30%.
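The Sticky Partitioner behavior mentioned above can be sketched as follows. The real logic lives in the Kafka Java producer (together with `batch.size` and `linger.ms` tuning); this is a simplified simulation with assumed sizes, showing why keyless records form larger per-partition batches.

```python
# Sketch of the Sticky Partitioner idea: keyless records stick to one
# partition until the current batch fills, instead of round-robining
# each record. Sizes here are illustrative assumptions.

import random

class StickyPartitioner:
    def __init__(self, num_partitions: int, batch_size: int):
        self.n = num_partitions
        self.batch_size = batch_size       # analogous to batch.size
        self.current = random.randrange(num_partitions)
        self.filled = 0

    def partition(self, record_size: int) -> int:
        # Switch partitions only once the sticky batch is full.
        if self.filled + record_size > self.batch_size:
            self.current = random.randrange(self.n)
            self.filled = 0
        self.filled += record_size
        return self.current

p = StickyPartitioner(num_partitions=6, batch_size=16384)
first = p.partition(512)
# ~10 KB of 512-byte records all land on the same partition:
same = all(p.partition(512) == first for _ in range(20))
```

Fewer, fuller batches per partition mean fewer requests per broker, which is consistent with the water-level drop described above.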

Partitioned Stream Table

Inspired by Hive partitioning, added partition metadata to real‑time tables, modified the Kafka connector to write/read based on partition fields, and enabled automatic partition pruning, cutting unnecessary traffic and simplifying downstream development.
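Partition pruning on such a table can be illustrated with a small sketch: given metadata mapping partition-field values to the Kafka partitions they were written to, a consumer subscribes only to what its predicate needs. The partition field (`os`), values, and mapping scheme are assumptions for illustration.

```python
# Illustrative sketch of partition pruning on a partitioned stream
# table: partition metadata lets a downstream job read only the Kafka
# partitions matching its filter, cutting unnecessary traffic.

# Assumed metadata: partition-field value -> Kafka partitions holding it.
partition_index = {
    "os=android": [0, 1],
    "os=ios":     [2, 3],
    "os=pc":      [4, 5],
}

def prune(predicate_values: set) -> list:
    """Return only the Kafka partitions matching the query predicate."""
    keep = []
    for part, kafka_parts in partition_index.items():
        if part.split("=", 1)[1] in predicate_values:
            keep.extend(kafka_parts)
    return sorted(keep)

# A job filtering on os = 'ios' reads 2 of the 6 partitions:
needed = prune({"ios"})
```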

06 Future Plans

Two major directions: (1) Containerization of the data‑warehouse services on Kubernetes for fine‑grained resource isolation, precise vCore allocation, macro‑monitoring, and flexible scheduling; (2) Building an automated governance platform that stores metadata, enforces rule‑based checks before deployment, scans for violations continuously, and drives user‑initiated remediation.
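The rule-based pre-deployment checks planned for the governance platform could take a shape like the sketch below: each rule inspects a task configuration and yields violations. The rules, thresholds, and config fields are illustrative assumptions, not the platform's rule set.

```python
# Sketch of rule-based checks run before deployment: each rule is a
# generator over violations for a task config. Rules and fields here
# are assumed for illustration.

def check_parallelism(cfg):
    # Assumed heuristic: more than 1 slot per 500 records/s is wasteful.
    if cfg["parallelism"] > cfg["input_rps"] / 500:
        yield "parallelism far exceeds input rate; reduce slots"

def check_state_ttl(cfg):
    if cfg.get("state_ttl_hours") is None:
        yield "no state TTL configured; unbounded state growth risk"

RULES = [check_parallelism, check_state_ttl]

def pre_deploy_scan(cfg: dict) -> list:
    return [v for rule in RULES for v in rule(cfg)]

violations = pre_deploy_scan({"parallelism": 40, "input_rps": 2000})
```

The same rules can be re-run as a continuous scan over already-deployed tasks, which matches the "scans for violations continuously" direction described above.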

07 Q&A

Answers cover the use of partitioned stream tables for batch‑stream integration, DSL generation for unified SQL across real‑time and offline, and methodological differences between real‑time and offline data‑warehouse governance.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: stream processing, real-time data warehouse, big data governance, Flink optimization, Kafka batch, resource efficiency
Written by: Big Data Technology & Architecture (Wang Zhiwu), a big data expert dedicated to sharing big data technology.