Data Governance Practices and Cost Optimization at Ctrip's Data Asset Management Platform
The article outlines Ctrip's data governance framework, detailing background challenges, metadata construction, cost and quality optimization techniques, data flow improvements, platform modules, health metrics, and concludes with a summary of achievements and future directions.
Background – Data volume, cost, value, quality, and security are critical concerns for data engineers, especially in a large enterprise like Ctrip, which operates multiple data warehouses across many teams.
Governance Concept – Data governance is defined as the coordinated use of people, processes, and technology to treat data as an asset, focusing on quality, consistency, availability, security, and accessibility.
Implementation Plan
3.1 Metadata Construction – Build a metadata warehouse covering four categories: technical, operational, management, and business metadata.
Technical metadata: tables, fields, storage details.
Operational metadata: ETL scheduling, execution, lineage.
Management metadata: owners, logs, performance metrics.
Business metadata: standards, quality, dictionaries, security.
Rich technical and operational metadata enable analysis of compute/storage costs, metadata completeness, quality monitoring coverage, and temporary/unmaintained tables.
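One concrete use of such metadata is scoring how completely each table is described. The sketch below is illustrative only: the required-field list and table schema are assumptions, not Ctrip's actual metadata model.

```python
# Sketch: score a table's metadata completeness from its management/business
# fields. REQUIRED_FIELDS and the record layout are hypothetical.

REQUIRED_FIELDS = ["owner", "description", "business_domain", "retention_days"]

def completeness(record: dict) -> float:
    """Fraction of required metadata fields that are filled in."""
    filled = sum(1 for f in REQUIRED_FIELDS if record.get(f) not in (None, ""))
    return filled / len(REQUIRED_FIELDS)

table_meta = {
    "table": "dw.orders_daily",
    "owner": "alice",
    "description": "Daily order snapshot",
    "business_domain": "",      # missing -> lowers the score
    "retention_days": 365,
}

print(f"{table_meta['table']}: {completeness(table_meta):.0%} complete")
# -> dw.orders_daily: 75% complete
```

Tables below a completeness threshold can then be surfaced to their owners as governance tasks.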
3.2 Targeted Governance
3.2.1 Cost Governance
Compute cost is estimated from CPU usage per task (e.g., 10 yuan per 1M VCS). High-cost ETL jobs and ad hoc queries are identified and optimized, potentially saving millions of yuan annually.
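The per-task cost estimate above can be sketched as a simple rate calculation; the job names and the budget threshold here are illustrative assumptions, not figures from the article.

```python
# Sketch of the compute-cost estimate: cost ≈ VCS / 1,000,000 * 10 yuan.
# Jobs and the 100-yuan/day threshold are hypothetical examples.

RATE_YUAN_PER_MILLION_VCS = 10

def compute_cost(vcore_seconds: float) -> float:
    """Estimated daily cost in yuan for a task's CPU usage."""
    return vcore_seconds / 1_000_000 * RATE_YUAN_PER_MILLION_VCS

jobs = {
    "etl_orders_full_scan": 120_000_000,   # vcore-seconds per day
    "etl_users_incremental": 3_000_000,
}

# Flag jobs whose estimated daily cost exceeds the budget line.
expensive = {name: compute_cost(vcs) for name, vcs in jobs.items()
             if compute_cost(vcs) > 100}
print(expensive)  # -> {'etl_orders_full_scan': 1200.0}
```

Ranking all jobs by this estimate is what makes the "high-cost job" lists and optimization targets possible.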
Storage cost reduction focuses on long‑inactive tables, unifying formats to ORC, applying hot/cold storage tiers, and removing duplicate files, achieving a 50% reduction in new storage demand.
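Hot/cold tiering of the kind described can be driven by last-access age. A minimal sketch, assuming 30- and 180-day thresholds (the article does not publish Ctrip's actual cutoffs):

```python
# Sketch: assign a storage tier from a table's last-access date.
# The 30/180-day thresholds are assumptions for illustration.
from datetime import date

def storage_tier(last_access: date, today: date) -> str:
    age_days = (today - last_access).days
    if age_days <= 30:
        return "hot"                      # fast, fully replicated storage
    if age_days <= 180:
        return "cold"                     # cheaper media / erasure coding
    return "candidate_for_deletion"       # long-inactive: review with owner

today = date(2023, 6, 1)
print(storage_tier(date(2023, 5, 20), today))  # -> hot
print(storage_tier(date(2022, 1, 1), today))   # -> candidate_for_deletion
```

Running such a sweep over the metadata warehouse yields the long-inactive table lists that feed the deletion and tiering workflows.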
Sample Hive configuration snippets for bucketed map joins:

set hive.enforce.bucketing = true;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

Additional optimizations include merging small files, controlling reducer counts (hive.exec.reducers.bytes.per.reducer=1G, hive.exec.reducers.max=999), using common join keys, and enabling sort-merge-bucket (SMB) joins.
3.2.2 Quality Standards
Enrich table metadata, configure Data Quality Checks (DQC) with strong/weak rules, address ownerless tables, and enforce a 7‑day retention policy for temporary tables.
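The strong/weak distinction in DQC rules can be sketched as follows: a strong-rule failure blocks downstream scheduling, while a weak-rule failure only raises an alert. Rule names and thresholds below are illustrative, not Ctrip's actual configuration.

```python
# Sketch of strong vs. weak DQC rules. A strong failure blocks the pipeline;
# a weak failure only warns. The specific rules are hypothetical examples.

def run_dqc(row_count: int, null_ratio: float):
    """Return (blocked, alerts) for one table partition."""
    alerts, blocked = [], False
    # Strong rule: an empty partition must block downstream jobs.
    if row_count == 0:
        alerts.append("STRONG: zero rows")
        blocked = True
    # Weak rule: an elevated null ratio only triggers an alert.
    if null_ratio > 0.05:
        alerts.append("WEAK: null ratio above 5%")
    return blocked, alerts

print(run_dqc(0, 0.01))     # -> (True, ['STRONG: zero rows'])
print(run_dqc(1000, 0.10))  # -> (False, ['WEAK: null ratio above 5%'])
```

The same scheduler hook can enforce the 7-day retention policy by treating an over-age temporary table as a cleanup task rather than a blocking failure.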
3.2.3 Data Flow
Improve data sharing across business units while preventing permission leakage; introduce cascading approvals and sensitivity‑based approval workflows.
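Sensitivity-based approval can be modeled as a lookup from classification level to an approver chain, with higher sensitivity cascading through more approvers. The levels and role names here are assumptions for illustration.

```python
# Sketch of cascading, sensitivity-based approval routing.
# Sensitivity levels and approver roles are hypothetical.

APPROVAL_CHAINS = {
    "public":       [],                                           # auto-grant
    "internal":     ["table_owner"],
    "confidential": ["table_owner", "bu_data_admin"],
    "pii":          ["table_owner", "bu_data_admin", "security_team"],
}

def approvers(sensitivity: str) -> list:
    """Approver chain a cross-BU access request must pass, in order."""
    return APPROVAL_CHAINS[sensitivity]

print(approvers("pii"))  # -> ['table_owner', 'bu_data_admin', 'security_team']
```

Routing every request through such a chain is what lets data flow across business units without permission leakage.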
Platformization and Routine Governance
The Data Asset Management Platform (named “Dayu”) provides three core modules: asset inventory, governance tools, and health analysis.
Asset inventory – dashboards of cost, quality, and sharing metrics.
Governance – issue tagging, owner‑driven remediation.
Health analysis – scores for resource utilization, management compliance, delivery outcomes, and data security.
Health score visualizations illustrate CPU dispersion, high‑cost job ratios, and storage usage trends.
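An overall health score of this kind is typically a weighted aggregate of the dimension scores. The weights below are assumptions for illustration; the article does not publish Ctrip's formula.

```python
# Sketch: weighted health score over the four dimensions listed above.
# The weights are hypothetical, not Ctrip's published values.

WEIGHTS = {
    "resource_utilization": 0.4,
    "management_compliance": 0.2,
    "delivery_outcomes": 0.2,
    "data_security": 0.2,
}

def health_score(scores: dict) -> float:
    """Each dimension is scored 0-100; returns the weighted overall score."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

print(health_score({
    "resource_utilization": 60,   # dragged down by high-cost jobs
    "management_compliance": 90,
    "delivery_outcomes": 85,
    "data_security": 95,
}))  # -> 78.0
```

Tracking this score per team over time gives owners a single number to improve through self-service governance.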
Conclusion
Data governance is a broad, evolving discipline; Ctrip focuses on cost, quality, and flow at present, with plans for stricter future requirements and continuous platform‑enabled self‑service governance.
Recruitment Notice
Ctrip’s big‑data application development team invites candidates interested in data platform engineering and data science to apply via email.
Ctrip Technology
The official Ctrip Technology account: sharing, exchanging, and growing together.