Data Governance Practices and Cost Optimization at Ctrip's Data Asset Management Platform
The article outlines Ctrip's data governance framework, detailing background challenges, metadata construction, cost and quality optimization techniques, data flow improvements, platform modules, health metrics, and concludes with a summary of achievements and future directions.
Background – Data volume, cost, value, quality, and security are critical concerns for data engineers, especially in a large enterprise like Ctrip, which operates multiple data warehouses across many teams.
Governance Concept – Data governance is defined as the coordinated use of people, processes, and technology to treat data as an asset, focusing on quality, consistency, availability, security, and accessibility.
Implementation Plan
3.1 Metadata Construction – Build a metadata warehouse covering four categories: technical, operational, management, and business metadata.
Technical metadata: tables, fields, storage details.
Operational metadata: ETL scheduling, execution, lineage.
Management metadata: owners, logs, performance metrics.
Business metadata: standards, quality, dictionaries, security.
Rich technical and operational metadata enable analysis of compute/storage costs, metadata completeness, quality monitoring coverage, and temporary/unmaintained tables.
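One concrete use of such metadata is scoring how completely each table is described. The sketch below is illustrative only: the required-field list and table schema are assumptions, not Ctrip's actual metadata model.

```python
# Sketch: score a table's metadata completeness from its management/business
# fields. REQUIRED_FIELDS and the record layout are hypothetical.

REQUIRED_FIELDS = ["owner", "description", "business_domain", "retention_days"]

def completeness(record: dict) -> float:
    """Fraction of required metadata fields that are filled in."""
    filled = sum(1 for f in REQUIRED_FIELDS if record.get(f) not in (None, ""))
    return filled / len(REQUIRED_FIELDS)

table_meta = {
    "table": "dw.orders_daily",
    "owner": "alice",
    "description": "Daily order snapshot",
    "business_domain": "",      # missing -> lowers the score
    "retention_days": 365,
}

print(f"{table_meta['table']}: {completeness(table_meta):.0%} complete")
# -> dw.orders_daily: 75% complete
```

Tables below a completeness threshold can then be surfaced to their owners as governance tasks.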
3.2 Targeted Governance
3.2.1 Cost Governance
Compute cost is estimated from CPU usage per task (e.g., 10 yuan per 1M VCS). High-cost ETL jobs and ad hoc queries are identified and optimized, potentially saving millions of yuan annually.
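The per-task cost estimate above can be sketched as a simple rate calculation; the job names and the budget threshold here are illustrative assumptions, not figures from the article.

```python
# Sketch of the compute-cost estimate: cost ≈ VCS / 1,000,000 * 10 yuan.
# Jobs and the 100-yuan/day threshold are hypothetical examples.

RATE_YUAN_PER_MILLION_VCS = 10

def compute_cost(vcore_seconds: float) -> float:
    """Estimated daily cost in yuan for a task's CPU usage."""
    return vcore_seconds / 1_000_000 * RATE_YUAN_PER_MILLION_VCS

jobs = {
    "etl_orders_full_scan": 120_000_000,   # vcore-seconds per day
    "etl_users_incremental": 3_000_000,
}

# Flag jobs whose estimated daily cost exceeds the budget line.
expensive = {name: compute_cost(vcs) for name, vcs in jobs.items()
             if compute_cost(vcs) > 100}
print(expensive)  # -> {'etl_orders_full_scan': 1200.0}
```

Ranking all jobs by this estimate is what makes the "high-cost job" lists and optimization targets possible.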
Storage cost reduction focuses on long‑inactive tables, unifying formats to ORC, applying hot/cold storage tiers, and removing duplicate files, achieving a 50% reduction in new storage demand.
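Hot/cold tiering of the kind described can be driven by last-access age. A minimal sketch, assuming 30- and 180-day thresholds (the article does not publish Ctrip's actual cutoffs):

```python
# Sketch: assign a storage tier from a table's last-access date.
# The 30/180-day thresholds are assumptions for illustration.
from datetime import date

def storage_tier(last_access: date, today: date) -> str:
    age_days = (today - last_access).days
    if age_days <= 30:
        return "hot"                      # fast, fully replicated storage
    if age_days <= 180:
        return "cold"                     # cheaper media / erasure coding
    return "candidate_for_deletion"       # long-inactive: review with owner

today = date(2023, 6, 1)
print(storage_tier(date(2023, 5, 20), today))  # -> hot
print(storage_tier(date(2022, 1, 1), today))   # -> candidate_for_deletion
```

Running such a sweep over the metadata warehouse yields the long-inactive table lists that feed the deletion and tiering workflows.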
Sample Hive configuration snippets for bucketed map joins:

set hive.enforce.bucketing = true;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

Additional optimizations include merging small files, controlling reducer counts (hive.exec.reducers.bytes.per.reducer=1G, hive.exec.reducers.max=999), using common join keys, and enabling sort-merge-bucket (SMB) joins.
3.2.2 Quality Standards
Enrich table metadata, configure Data Quality Checks (DQC) with strong/weak rules, address ownerless tables, and enforce a 7‑day retention policy for temporary tables.
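The strong/weak distinction in DQC rules can be sketched as follows: a strong-rule failure blocks downstream scheduling, while a weak-rule failure only raises an alert. Rule names and thresholds below are illustrative, not Ctrip's actual configuration.

```python
# Sketch of strong vs. weak DQC rules. A strong failure blocks the pipeline;
# a weak failure only warns. The specific rules are hypothetical examples.

def run_dqc(row_count: int, null_ratio: float):
    """Return (blocked, alerts) for one table partition."""
    alerts, blocked = [], False
    # Strong rule: an empty partition must block downstream jobs.
    if row_count == 0:
        alerts.append("STRONG: zero rows")
        blocked = True
    # Weak rule: an elevated null ratio only triggers an alert.
    if null_ratio > 0.05:
        alerts.append("WEAK: null ratio above 5%")
    return blocked, alerts

print(run_dqc(0, 0.01))     # -> (True, ['STRONG: zero rows'])
print(run_dqc(1000, 0.10))  # -> (False, ['WEAK: null ratio above 5%'])
```

The same scheduler hook can enforce the 7-day retention policy by treating an over-age temporary table as a cleanup task rather than a blocking failure.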
3.2.3 Data Flow
Improve data sharing across business units while preventing permission leakage; introduce cascading approvals and sensitivity‑based approval workflows.
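Sensitivity-based approval can be modeled as a lookup from classification level to an approver chain, with higher sensitivity cascading through more approvers. The levels and role names here are assumptions for illustration.

```python
# Sketch of cascading, sensitivity-based approval routing.
# Sensitivity levels and approver roles are hypothetical.

APPROVAL_CHAINS = {
    "public":       [],                                           # auto-grant
    "internal":     ["table_owner"],
    "confidential": ["table_owner", "bu_data_admin"],
    "pii":          ["table_owner", "bu_data_admin", "security_team"],
}

def approvers(sensitivity: str) -> list:
    """Approver chain a cross-BU access request must pass, in order."""
    return APPROVAL_CHAINS[sensitivity]

print(approvers("pii"))  # -> ['table_owner', 'bu_data_admin', 'security_team']
```

Routing every request through such a chain is what lets data flow across business units without permission leakage.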
Platformization and Routine Governance
The Data Asset Management Platform (named “Dayu”) provides three core modules: asset inventory, governance tools, and health analysis.
Asset inventory – dashboards of cost, quality, and sharing metrics.
Governance – issue tagging, owner‑driven remediation.
Health analysis – scores for resource utilization, management compliance, delivery outcomes, and data security.
Health score visualizations illustrate CPU dispersion, high‑cost job ratios, and storage usage trends.
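An overall health score of this kind is typically a weighted aggregate of the dimension scores. The weights below are assumptions for illustration; the article does not publish Ctrip's formula.

```python
# Sketch: weighted health score over the four dimensions listed above.
# The weights are hypothetical, not Ctrip's published values.

WEIGHTS = {
    "resource_utilization": 0.4,
    "management_compliance": 0.2,
    "delivery_outcomes": 0.2,
    "data_security": 0.2,
}

def health_score(scores: dict) -> float:
    """Each dimension is scored 0-100; returns the weighted overall score."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

print(health_score({
    "resource_utilization": 60,   # dragged down by high-cost jobs
    "management_compliance": 90,
    "delivery_outcomes": 85,
    "data_security": 95,
}))  # -> 78.0
```

Tracking this score per team over time gives owners a single number to improve through self-service governance.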
Conclusion
Data governance is a broad, evolving discipline; Ctrip focuses on cost, quality, and flow at present, with plans for stricter future requirements and continuous platform‑enabled self‑service governance.
Recruitment Notice
Ctrip’s big‑data application development team invites candidates interested in data platform engineering and data science to apply via email.
Ctrip Technology
The official Ctrip Technology account: sharing, exchanging, and growing together.