Building an End‑to‑End Data Governance System: Challenges, Solutions & Impact

This article details DataCake's data‑governance journey, covering the problems of data silos, unclear costs, and tool fragmentation, then explains the strategic thinking, the multi‑layered solution architecture, and the measurable outcomes such as higher resource utilization and reclaimed storage.

Data Thinking Notes

01 Problems and Challenges

DataCake manages a very large volume of data assets spread across many relatively independent business lines, such as commercialization and content e‑commerce. Each line maintains its own data development team and pipelines, which gives rise to typical data‑governance problems.

Data silos: retrieving data across business lines requires knowing which line holds the data and tracking down its documentation. This often leads to duplicated storage and high communication costs.

Lack of governance guidance: tools vary from team to team (scripts, open‑source tools), so consensus on how to govern is hard to reach.

Unclear cost accounting: billing tags are inconsistent, making it difficult to quantify resource usage.

These issues make it hard for managers, developers, algorithm engineers, and governance operators to make informed decisions.

02 Thinking and Positioning

After analyzing the challenges, the team decided to build a data‑governance platform that adapts to each business’s context (“tailored to local conditions”). The initial focus is on establishing policies and tools; the platform then evolves toward a unified workflow for metadata observation, cost analysis, and governance that enables self‑service control.

The concrete goals are a unified metadata management platform, a governance workbench with a low barrier to entry, and a cost‑analysis tool that provides fine‑grained accounting for developers, managers, and operators.

03 Solution and Implementation

The solution runs on a hybrid‑cloud infrastructure and is divided into several layers:

Infrastructure layer: supports data ingestion and flow.

Platform & tool layer: integrates data sources; DE is the task‑management module.

Data collection layer: gathers raw data into the platform data warehouse.

Platform data‑warehouse layer: synchronizes data to the application layer.

Data‑application & metadata middleware: connects to the backend for metadata management and display.

Backend: modules for source data, permissions, cost, storage tasks, etc.

Metadata Management Module: defines technical, business, and operational metadata. Technical metadata covers storage, compute tasks, quality, cost, and security. The platform attaches rich descriptors to each table (region, lineage, tags, access heat, size) and lets users follow or bookmark tables.
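As a rough illustration of how such descriptors might be represented, the sketch below models one catalog entry and walks lineage to find the tables that depend on a given table. The class `TableMetadata`, its field names, the 30‑day heat window, the table names, and the function `downstream_of` are assumptions made for this example; they are not DataCake's actual schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class TableMetadata:
    """Hypothetical catalog entry; every field name here is illustrative."""
    name: str
    region: str                       # where the table is stored
    size_bytes: int                   # storage metadata
    access_count_30d: int             # "access heat" over an assumed 30-day window
    tags: set[str] = field(default_factory=set)
    upstream: list[str] = field(default_factory=list)   # lineage: tables this one is built from
    followers: set[str] = field(default_factory=set)    # users who follow or bookmark the table


def downstream_of(table: str, catalog: dict[str, TableMetadata]) -> set[str]:
    """Return every table that depends on `table`, directly or transitively."""
    # Invert the upstream edges so we can walk from a parent to its children.
    children: dict[str, list[str]] = {}
    for meta in catalog.values():
        for parent in meta.upstream:
            children.setdefault(parent, []).append(meta.name)

    seen: set[str] = set()
    stack = [table]
    while stack:
        current = stack.pop()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


# Made-up example catalog: one raw table feeding a detail table and two reports.
catalog = {
    m.name: m
    for m in [
        TableMetadata("ods.orders", "region-a", 5_000_000, 120),
        TableMetadata("dwd.orders", "region-a", 3_000_000, 80, upstream=["ods.orders"]),
        TableMetadata("ads.gmv_daily", "region-b", 10_000, 300, upstream=["dwd.orders"]),
        TableMetadata("ads.order_funnel", "region-b", 20_000, 5, upstream=["dwd.orders"]),
    ]
}

impacted = downstream_of("ods.orders", catalog)
```

Storing only the upstream edges on each entry keeps the record simple; the downstream view is derived on demand, which is one reasonable design among several.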

Governance Tool Module: a workbench for Spark task optimization (resource usage, data skew, error reduction) and storage governance (tiering, lifecycle, deletion). It offers one‑click publishing, AI‑driven recommendations, and tracks optimization impact.
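The storage side of the workbench (tiering, lifecycle, deletion) can be pictured as a rule that maps how recently a table was read to a suggested action. The following is a minimal sketch under assumed thresholds (90 and 365 days); `TableUsage`, `recommend_storage_action`, and `reclaimable_gb` are hypothetical names, and the source does not describe the platform's real rules or how its AI‑driven recommendations work.

```python
from dataclasses import dataclass


@dataclass
class TableUsage:
    """Hypothetical usage record for one table."""
    name: str
    size_gb: float
    days_since_last_access: int


def recommend_storage_action(
    table: TableUsage,
    cold_after_days: int = 90,     # assumed threshold, not from the source
    delete_after_days: int = 365,  # assumed threshold, not from the source
) -> str:
    """Suggest a lifecycle action based only on how long the table has been idle."""
    if table.days_since_last_access >= delete_after_days:
        return "propose_delete"
    if table.days_since_last_access >= cold_after_days:
        return "move_to_cold_tier"
    return "keep"


def reclaimable_gb(tables: list[TableUsage]) -> float:
    """Estimate the impact of the suggestions: storage freed if all proposed deletions go ahead."""
    return sum(
        t.size_gb for t in tables
        if recommend_storage_action(t) == "propose_delete"
    )


# Made-up inventory for illustration.
inventory = [
    TableUsage("tmp.export_2021", 512.0, 700),
    TableUsage("dwd.clicks_archive", 2048.0, 120),
    TableUsage("ads.gmv_daily", 4.0, 1),
]
actions = {t.name: recommend_storage_action(t) for t in inventory}
```

Returning a suggestion rather than deleting anything leaves the final decision to the table owner, which fits a one‑click workflow where a person confirms the action.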

Cost Analysis Module: builds a fine‑grained cost pipeline linking cloud‑provider bills to individual tasks and tables, enabling cost allocation across owners and departments.
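Once each task carries a cost figure, allocation across owners and departments is essentially a grouped sum. A minimal sketch, assuming per‑task cost records with hypothetical `owner` and `department` fields (the source does not give the actual record layout, and the names and amounts are made up):

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TaskCost:
    """Hypothetical per-task cost record produced by the cost pipeline."""
    task: str
    owner: str
    department: str
    cost: float  # in the billing currency


def roll_up(costs: list[TaskCost], key: str) -> dict[str, float]:
    """Sum task costs by the given attribute, e.g. 'owner' or 'department'."""
    totals: dict[str, float] = defaultdict(float)
    for record in costs:
        totals[getattr(record, key)] += record.cost
    return dict(totals)


# Made-up records for illustration.
task_costs = [
    TaskCost("etl_orders", "alice", "commerce", 12.5),
    TaskCost("etl_refunds", "alice", "commerce", 7.5),
    TaskCost("train_ranker", "bob", "content", 30.0),
]

by_owner = roll_up(task_costs, "owner")
by_department = roll_up(task_costs, "department")
```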

Asset Inventory Module: combines metadata observation, automated governance actions, and cost analysis into a closed loop, so users can manage and control their own assets.

04 Summary and Planning

DataCake’s governance platform now provides observable assets, one‑click governance, cost tracking, and operational governance. With 60% of employees taking part in regular governance, compute resource utilization has risen by 25% and 3.5 PB of storage has been reclaimed.

Future work will refine existing features, improve user experience, and incorporate industry best practices.

05 Q&A

Q1: Is there an evaluation system for governance effectiveness? – Yes. Each task receives a score based on its execution history, resource usage, and cost trend.
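The source does not say how these three signals are combined. Purely as an illustration, a task‑level score could be a weighted sum of normalized inputs; the function name `task_governance_score`, the input definitions, and the weights below are all assumptions for this sketch.

```python
def clamp01(x: float) -> float:
    """Clamp a value into the range [0, 1]."""
    return max(0.0, min(1.0, x))


def task_governance_score(
    success_rate: float,          # share of recent runs that succeeded, 0..1
    resource_utilization: float,  # resources actually used / resources requested, 0..1
    cost_growth: float,           # relative cost change vs. the previous period, e.g. 0.2 = +20%
    weights: tuple[float, float, float] = (0.4, 0.4, 0.2),  # assumed weights
) -> float:
    """Return a 0-100 score; higher means the task is healthier."""
    w_success, w_util, w_cost = weights
    # Flat or falling cost earns the full cost component; a doubling (or worse) earns none.
    cost_component = 1.0 - clamp01(cost_growth)
    raw = (
        w_success * clamp01(success_rate)
        + w_util * clamp01(resource_utilization)
        + w_cost * cost_component
    )
    return round(100.0 * raw, 1)


# A stable, well-sized task versus one that fails often and is getting more expensive.
healthy = task_governance_score(1.0, 1.0, 0.0)
struggling = task_governance_score(0.5, 0.5, 0.5)
```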

Q2: How are governance actions promoted? – By showing developers the cost of their tasks, publishing rankings, and offering incentives.

Q3: What is the main difficulty of cloud‑native governance? – Fine‑grained cost allocation across heterogeneous clouds (Huawei Cloud, AWS), because provider bills are coarse‑grained.

Q4: How is lineage analysis performed? – By inspecting catalog metadata to see each table's upstream and downstream dependencies and its usage.

Q5: How does the governance process close the loop? – Cost analysis → fine‑grained observation → governance tools → feedback.

Q6: How is cost allocated to tasks? – Tasks report their instance type, price, and duration in real time. These figures are then mapped to the cloud bills, and any gap between the reported total and the billed total is distributed proportionally.
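The allocation described in Q6 can be sketched as follows: estimate each task's cost as price × duration, compare the sum with the bill, and spread the difference in proportion to each task's estimate. The function name `allocate_bill`, the instance type names, and all prices and amounts are made up for illustration.

```python
def allocate_bill(
    usage: list[tuple[str, str, float]],   # (task, instance_type, hours) as reported by tasks
    hourly_price: dict[str, float],        # assumed price table per instance type
    billed_total: float,                   # what the cloud provider actually charged
) -> dict[str, float]:
    """Allocate the full bill to tasks, distributing any gap proportionally."""
    # Step 1: estimated cost per task = price * duration.
    estimated: dict[str, float] = {}
    for task, instance_type, hours in usage:
        estimated[task] = estimated.get(task, 0.0) + hourly_price[instance_type] * hours

    estimated_total = sum(estimated.values())
    if estimated_total == 0:
        raise ValueError("no reported usage to allocate the bill against")

    # Step 2: the gap is whatever the bill contains beyond the reported usage
    # (it can also be negative if the estimates overshoot the bill).
    gap = billed_total - estimated_total

    # Step 3: each task absorbs a share of the gap equal to its share of the estimate.
    return {
        task: cost + gap * cost / estimated_total
        for task, cost in estimated.items()
    }


allocation = allocate_bill(
    usage=[("etl_orders", "type_a", 8.0), ("train_ranker", "type_b", 12.0)],
    hourly_price={"type_a": 0.25, "type_b": 0.5},
    billed_total=10.0,
)
```

In this example the estimates are 2.0 and 6.0 (8.0 in total) against a bill of 10.0, so the 2.0 gap is split 25% / 75% and the tasks are charged 2.5 and 7.5. The allocated amounts always add up to the billed total, which is what makes the result reconcilable with the provider's invoice.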
