Building an End‑to‑End Data Governance System: Challenges, Solutions & Impact
This article details DataCake's data‑governance journey, covering the problems of data silos, unclear costs, and tool fragmentation, then explains the strategic thinking, the multi‑layered solution architecture, and the measurable outcomes such as higher resource utilization and reclaimed storage.
01 Problem and Challenges
DataCake manages a very large volume of data assets across many relatively independent business lines, such as commercialization and content e‑commerce. Each line maintains its own data development team and pipelines, which leads to typical data‑governance problems.
Data silos : retrieving data across business lines requires knowing which line holds the data and locating its documentation, which often results in duplicated storage and high communication cost.
Lack of governance guidance : teams use different tools (ad hoc scripts, open‑source software), and consensus on governance practices is hard to reach.
Unclear cost accounting : billing tags are messy, making it difficult to quantify resource usage.
These issues keep managers, developers, algorithm engineers and governance operators from making informed decisions.
02 Thinking and Positioning
After analyzing the challenges, the team decided to build a data‑governance platform that adapts to each business’s context (“tailored to local conditions”). The initial focus is on establishing policies and tools; the platform then evolves toward a unified workflow for metadata observation, cost analysis, and governance that enables self‑service control.
The concrete goal is to create a unified metadata management platform, a low‑threshold governance workbench, and a cost‑analysis tool that provides fine‑grained accounting for developers, managers and operators.
03 Solution and Implementation
The solution runs on a hybrid‑cloud infrastructure and is divided into several layers:
Infrastructure layer : supports data ingestion and flow.
Platform & tool layer : integrates data sources; DE is the task‑management module.
Data collection layer : gathers raw data into the platform data‑warehouse.
Platform data‑warehouse layer : synchronizes data to the application layer.
Data‑application & metadata middleware : connects to the backend for metadata management and display.
Backend : modules for source data, permissions, cost, storage tasks, etc.
Metadata Management Module : defines technical, business and operational metadata; technical metadata includes storage, compute‑task, quality, cost and security metadata. The platform provides rich descriptors (region, lineage, tags, access heat, size) and allows users to follow or bookmark tables.
Governance Tool Module : a workbench for Spark task optimization (resource usage, data skew, error reduction) and storage governance (tiering, lifecycle, deletion). It offers one‑click publishing, AI‑driven recommendations, and tracks optimization impact.
Cost Analysis Module : builds a fine‑grained cost pipeline linking cloud‑provider bills to individual tasks and tables, enabling cost allocation across owners and departments.
Asset Inventory Module : aggregates metadata observation, automated governance actions, and cost analysis into a closed loop, allowing users to self‑manage and self‑control.
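The storage‑governance idea described above (tiering, lifecycle, deletion driven by metadata such as access heat, size and lineage) can be illustrated with a minimal sketch. It is not DataCake's implementation: the `TableMeta` fields, the `recommend_action` function, the action names and the 90‑day and 365‑day thresholds are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableMeta:
    """Hypothetical subset of the descriptors a metadata platform might hold."""
    name: str
    size_gb: float
    days_since_last_access: int  # stand-in for "access heat"
    downstream_count: int        # downstream tables found via lineage


def recommend_action(table, cold_after_days=90, delete_after_days=365):
    """Suggest a storage action; both thresholds are illustrative assumptions."""
    unused = table.days_since_last_access >= delete_after_days
    # Only propose deletion when nothing downstream depends on the table.
    if unused and table.downstream_count == 0:
        return "delete_candidate"
    if table.days_since_last_access >= cold_after_days:
        return "move_to_cold_tier"
    return "keep"


def reclaimable_gb(tables):
    """Total size of the tables flagged as deletion candidates."""
    return sum(t.size_gb for t in tables
               if recommend_action(t) == "delete_candidate")


# Made-up example tables, not real DataCake assets.
tables = [
    TableMeta("ads_clicks_daily", 1200.0, 3, 5),
    TableMeta("tmp_export_2021", 800.0, 400, 0),
    TableMeta("orders_archive", 2500.0, 200, 2),
]
for t in tables:
    print(t.name, recommend_action(t))
print("reclaimable GB:", reclaimable_gb(tables))
```

Checking lineage before proposing deletion is the design choice worth noting: a table nobody reads directly may still feed downstream tables, so in this sketch it is moved to a colder tier rather than removed.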
04 Summary and Planning
DataCake’s governance now provides observable assets, one‑click governance, cost tracking, and operational governance. 60 % of employees participate in regular governance, leading to a 25 % increase in compute resource utilization and 3.5 PB of storage reclaimed.
Future work will refine existing features, improve user experience, and incorporate industry best practices.
05 Q&A
Q1: Is there an evaluation system for governance effectiveness? – Yes, task‑level scores based on historical execution, resource usage, and cost trends are used.
Q2: How are governance actions promoted? – By exposing task costs and rankings, and by offering incentives to developers.
Q3: Main difficulty of cloud‑native governance? – Fine‑grained cost allocation across heterogeneous clouds (Huawei Cloud, AWS) where provider bills are coarse.
Q4: How is lineage analysis performed? – By inspecting catalog metadata to see upstream/downstream dependencies and usage.
Q5: How does the governance process close the loop? – Cost analysis → fine‑grained observation → governance tools → feedback.
Q6: How is cost allocated to tasks? – Each task reports its instance type, price, and duration in real time; these figures are then mapped to the cloud bills, and any gap between the two is distributed proportionally.
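The allocation described in Q6 can be sketched as follows. This is a hedged illustration, not the platform's actual pipeline: the `TaskUsage` record, the `allocate_costs` function and the sample numbers are hypothetical, and the sketch assumes the gap is spread in proportion to each task's estimated cost, which the source does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskUsage:
    """One usage report from a task (hypothetical record layout)."""
    task_id: str
    instance_type: str
    hourly_price: float  # reported price per instance-hour
    hours: float         # reported run duration in hours


def allocate_costs(usages, billed_total):
    """Return {task_id: allocated cost}; the values sum to billed_total."""
    # Step 1: estimate each task's cost from its own reports.
    estimated = {}
    for u in usages:
        cost = u.hourly_price * u.hours
        estimated[u.task_id] = estimated.get(u.task_id, 0.0) + cost

    estimated_total = sum(estimated.values())
    if estimated_total == 0:
        raise ValueError("no reported usage to allocate against")

    # Step 2: the gap is whatever the provider billed beyond the estimates
    # (it is negative when the bill is lower than the estimates).
    gap = billed_total - estimated_total

    # Step 3: spread the gap in proportion to each task's estimated cost
    # (assumption: the source only says "proportionally").
    return {task_id: est + gap * est / estimated_total
            for task_id, est in estimated.items()}


# Made-up reports: estimates are 2.0 and 6.0, while the bill says 10.0.
usages = [
    TaskUsage("etl_a", "type_a", 0.5, 4.0),
    TaskUsage("etl_b", "type_a", 0.5, 12.0),
]
allocated = allocate_costs(usages, billed_total=10.0)
print(allocated)
```

Because every task absorbs a share of the gap, the allocated costs always add up to the provider's bill, which lets per‑task figures be rolled up to owners and departments without leaving an unassigned remainder.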
Data Thinking Notes
Sharing insights on data architecture, governance, and middle platforms, exploring AI in data, and linking data with business scenarios.
