
Data Governance Practices and Cost Optimization at Ctrip's Data Asset Management Platform

The article outlines Ctrip's data governance framework, detailing background challenges, metadata construction, cost and quality optimization techniques, data flow improvements, platform modules, health metrics, and concludes with a summary of achievements and future directions.

Ctrip Technology

Background – Data volume and cost, value, quality, and security are critical concerns for data engineers, especially in large enterprises like Ctrip with multiple data warehouses and teams.

Governance Concept – Data governance is defined as the coordinated use of people, processes, and technology to treat data as an asset, focusing on quality, consistency, availability, security, and accessibility.

Implementation Plan

3.1 Metadata Construction – Build a metadata warehouse covering four categories: technical, operational, management, and business metadata.

Technical metadata: tables, fields, storage details.

Operational metadata: ETL scheduling, execution, lineage.

Management metadata: owners, logs, performance metrics.

Business metadata: standards, quality, dictionaries, security.

Rich technical and operational metadata enable analysis of compute/storage costs, metadata completeness, quality monitoring coverage, and temporary/unmaintained tables.
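The completeness analysis described above can be sketched in a few lines. This is an illustrative model, not Ctrip's actual schema: the required fields per metadata category are assumptions chosen to mirror the four categories listed.

```python
# Illustrative sketch (not Ctrip's actual schema): score metadata
# completeness for a table whose record carries the four categories
# described above. Field names are invented for illustration.

REQUIRED_FIELDS = {
    "technical": ["columns", "storage_format"],
    "operational": ["etl_job", "lineage"],
    "management": ["owner"],
    "business": ["description", "quality_rules"],
}

def completeness(table_meta: dict) -> float:
    """Fraction of required metadata fields that are populated."""
    total = hit = 0
    for category, fields in REQUIRED_FIELDS.items():
        cat_meta = table_meta.get(category, {})
        for field in fields:
            total += 1
            if cat_meta.get(field):  # present and non-empty
                hit += 1
    return hit / total

meta = {
    "technical": {"columns": ["id", "dt"], "storage_format": "ORC"},
    "operational": {"etl_job": "job_123", "lineage": ["src.orders"]},
    "management": {},  # ownerless table: management metadata missing
    "business": {"description": "order facts", "quality_rules": []},
}
print(completeness(meta))  # five of seven required fields populated
```

Tables scoring below a threshold can then be surfaced to owners for enrichment, which is the remediation loop the platform drives.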

3.2 Targeted Governance

3.2.1 Cost Governance

Compute cost is estimated from CPU usage per task (e.g., 10 yuan per 1M VCS). High-cost ETL jobs and ad hoc queries are identified and optimized, potentially saving millions of yuan annually.
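A minimal sketch of this per-task cost estimate, using the 10-yuan-per-1M-VCS rate from the article; the task names and usage figures are invented for illustration.

```python
# Hedged sketch of the per-task compute-cost estimate described above.
# The rate (10 yuan per 1M VCS) comes from the article; the task
# records below are illustrative.

RATE_YUAN_PER_VCS = 10 / 1_000_000  # 10 yuan per 1M VCS

def task_cost_yuan(vcs_used: float) -> float:
    """Estimated compute cost of one task from its VCS consumption."""
    return vcs_used * RATE_YUAN_PER_VCS

tasks = {"etl_orders_daily": 42_000_000, "adhoc_query_777": 3_500_000}

# Rank tasks by estimated cost to surface the expensive ones first.
ranked = sorted(tasks.items(), key=lambda kv: task_cost_yuan(kv[1]), reverse=True)
for name, vcs in ranked:
    print(f"{name}: {task_cost_yuan(vcs):.0f} yuan")
```

Ranking tasks this way is what lets governance target the small set of jobs responsible for most of the spend.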

Storage cost reduction focuses on long‑inactive tables, unifying formats to ORC, applying hot/cold storage tiers, and removing duplicate files, achieving a 50% reduction in new storage demand.
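The hot/cold tiering decision for long-inactive tables can be sketched as a simple policy function. The 90-day and 365-day thresholds here are assumptions for illustration; the article does not state Ctrip's actual cutoffs.

```python
from datetime import date, timedelta

# Illustrative policy (thresholds invented): move tables unread for 90
# days to cold storage, and flag tables unread for a year for cleanup.
COLD_AFTER = timedelta(days=90)
DELETE_AFTER = timedelta(days=365)

def tiering_action(last_access: date, today: date) -> str:
    """Decide a storage action for a table from its last-access date."""
    idle = today - last_access
    if idle >= DELETE_AFTER:
        return "review-for-deletion"
    if idle >= COLD_AFTER:
        return "move-to-cold"
    return "keep-hot"
```

Driving this off last-access metadata (rather than creation date) is what distinguishes genuinely cold data from old-but-active reference tables.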

Sample Hive configuration snippets:

-- Enforce bucketing on insert so bucketed joins are valid
set hive.enforce.bucketing = true;
-- Enable bucket map joins, and sort-merge bucket (SMB) joins when buckets are sorted
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

Additional optimizations include merging small files, controlling reducer counts (hive.exec.reducers.bytes.per.reducer=1G, hive.exec.reducers.max=999), using common join keys, and enabling SMB joins.

3.2.2 Quality Standards

Enrich table metadata, configure Data Quality Checks (DQC) with strong/weak rules, address ownerless tables, and enforce a 7‑day retention policy for temporary tables.
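The strong/weak rule semantics can be sketched as follows: a failed strong rule blocks downstream jobs, while a failed weak rule only raises an alert. The rule names and thresholds here are illustrative assumptions, not Ctrip's actual DQC configuration.

```python
# Sketch of the strong/weak DQC rule semantics described above
# (rule names and thresholds are invented for illustration).

def run_dqc(row_count: int, null_ratio: float) -> dict:
    results = {
        # Strong rule: an empty partition must block downstream jobs.
        "non_empty": ("strong", row_count > 0),
        # Weak rule: a high null ratio only triggers an alert.
        "null_ratio_ok": ("weak", null_ratio <= 0.05),
    }
    block = any(kind == "strong" and not ok for kind, ok in results.values())
    alerts = [name for name, (kind, ok) in results.items() if not ok]
    return {"block_downstream": block, "alerts": alerts}

print(run_dqc(row_count=0, null_ratio=0.10))
# {'block_downstream': True, 'alerts': ['non_empty', 'null_ratio_ok']}
```

The split matters operationally: strong rules protect downstream consumers from bad data, while weak rules build a signal for owners without halting pipelines.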

3.2.3 Data Flow

Improve data sharing across business units while preventing permission leakage; introduce cascading approvals and sensitivity‑based approval workflows.
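The sensitivity-based routing could look like the sketch below. The sensitivity levels and approver roles are assumptions for illustration; the article only states that approvals cascade and depend on sensitivity.

```python
# Illustrative sketch of sensitivity-based, cascading approval routing
# (levels and approver roles are assumptions, not Ctrip's actual flow).

APPROVAL_CHAINS = {
    "public":       [],                                             # auto-approved
    "internal":     ["table_owner"],
    "sensitive":    ["table_owner", "bu_data_lead"],
    "confidential": ["table_owner", "bu_data_lead", "security_team"],
}

def approvers_for(sensitivity: str) -> list[str]:
    """Return the ordered approver chain for a requested table."""
    return APPROVAL_CHAINS[sensitivity]
```

Encoding the chain per sensitivity level keeps low-risk sharing frictionless while ensuring sensitive tables always pass through security review, which is how leakage is prevented without blocking cross-BU reuse.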

Platformization and Routine Governance

The Data Asset Management Platform (named “Dayu”) provides three core modules: asset inventory, governance tools, and health analysis.

Asset inventory – dashboards of cost, quality, and sharing metrics.

Governance – issue tagging, owner‑driven remediation.

Health analysis – scores for resource utilization, management compliance, delivery outcomes, and data security.

Health score visualizations illustrate CPU dispersion, high‑cost job ratios, and storage usage trends.
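A composite health score over the four dimensions named above could be sketched as a weighted average; the weights and sub-scores below are illustrative, not Ctrip's actual formula.

```python
# Minimal sketch of a composite health score over the four dimensions
# listed above; weights and sub-scores are invented for illustration.

WEIGHTS = {
    "resource_utilization": 0.3,
    "management_compliance": 0.3,
    "delivery_outcomes": 0.2,
    "data_security": 0.2,
}

def health_score(sub_scores: dict) -> float:
    """Weighted average of per-dimension scores in [0, 100]."""
    return sum(WEIGHTS[dim] * sub_scores[dim] for dim in WEIGHTS)

print(round(health_score({
    "resource_utilization": 80,
    "management_compliance": 90,
    "delivery_outcomes": 70,
    "data_security": 100,
}), 1))  # 85.0
```

A single score per team or warehouse makes trends comparable over time, which is what the dashboards described above rely on.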

Conclusion

Data governance is a broad, evolving discipline; Ctrip focuses on cost, quality, and flow at present, with plans for stricter future requirements and continuous platform‑enabled self‑service governance.

Recruitment Notice

Ctrip’s big‑data application development team invites candidates interested in data platform engineering and data science to apply via email.
