
Kuaishou's Year-Long White‑Box Cost Governance in Big Data: Engine, Data‑Warehouse, and Tool Optimizations

This article presents Kuaishou's comprehensive white‑box cost governance practice over the past year, detailing the data‑governance framework, engine and data‑warehouse white‑boxing techniques, compression algorithm replacement, HBO automatic tuning, operator analysis, and the resulting performance and cost benefits, as well as future plans.


Introduction

This article introduces our year-long white-box cost governance practice in big data. We open up the engine, data warehouse, and tooling for detailed architectural analysis, achieving deep cost control with notable results.

Main contents include the following parts:

1. Data Governance System

2. Engine White‑Boxing

3. Data‑Warehouse White‑Boxing

4. Benefit Analysis

5. Future Planning

Speaker: Feng Zanfeng, Kuaishou, Big Data Architect

Editor: Su Yu

Proofreader: Li Yao

Produced by: DataFun Community

Data Governance System

Like most companies, Kuaishou's data governance is divided into four major aspects: cost, quality, efficiency, and security.

1. Efficiency

Efficiency includes data development efficiency and data consumption efficiency. Development efficiency focuses on model development speed, while consumption efficiency focuses on model usability and query response time.

2. Security

Security is split into production‑stage security and consumption‑stage security.

3. Quality

Quality covers prevention, proactive detection, fault impact, and fault post‑mortem.

Prevention: Ensure design, development, testing, and acceptance follow standards.

Proactive Detection: Detect issues internally before users notice them; requires comprehensive monitoring and effective alerts.

Fault Impact: Track fault counts at each level and keep them within tolerable ranges.

Fault Post‑mortem: Conduct deep post‑mortems to identify root causes and ensure remedial actions are taken promptly.

4. Cost

Data cost consists of three parts: storage cost, compute cost, and traffic cost.

Storage Cost: Focus on compression ratio, compression performance, and replica count to ensure high storage density, low compression latency, and data safety with minimal replicas.

Compute Cost: CPU average utilization reflects resource scheduling ability; single CU processing volume measures engine compute power. Optimizing the engine raises the per‑CU data processing amount.

Traffic Cost: (Details omitted in source)
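The per-CU metric above boils down to a simple ratio. The sketch below is illustrative only; the units and function name are my own, not Kuaishou's actual metric definition:

```python
def per_cu_throughput(bytes_processed: float, cu_hours: float) -> float:
    """Gigabytes processed per CU-hour: a higher value means the engine
    extracts more work from each unit of purchased compute."""
    return bytes_processed / 1e9 / cu_hours
```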

This talk focuses on the cost part, achieving cost reduction through three white‑box implementations: Engine White‑Boxing, Data‑Warehouse White‑Boxing, and Tool White‑Boxing.


Engine White‑Boxing

"Engine White‑Boxing" is a project codename that includes many optimization points: HBO automatic tuning, compression algorithm replacement, engine operator analysis, etc.

1. HBO Automatic Tuning

Before HBO, tuning was manual and suffered three major drawbacks: it was difficult, its effects decayed over time as data and workloads changed, and it was costly to scale to thousands of jobs.

HBO solves these issues by analyzing historical task runs and automatically optimizing execution parameters, keeping jobs near optimal state.

HBO improves performance and reduces cost through three aspects:

(1) Reasonable resource quota: Identify CPU and memory needs and auto‑scale to avoid mismatches.

(2) Optimized task slicing: Adjust slice parameters based on task duration to avoid too short or too long slices.

(3) Task‑level parameter tuning: Adjust small‑file merging, compression algorithm, broadcast settings, etc., to boost performance.

The HBO tuning process is data‑driven:

Build profile – collect dozens of decision metrics under strict quality requirements.

Coarse‑tune – apply preset rules for an initial conservative adjustment (e.g., reserve extra CPU/Memory).

Parameter release – push tuned parameters to tasks for the next execution cycle.

Fine‑tune – based on feedback, further refine parameters.
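The coarse-tune step above can be sketched as a small rule engine. The profile fields, the 1.2× headroom factor, and the 120-second slice target are hypothetical stand-ins, not Kuaishou's actual HBO rules:

```python
from dataclasses import dataclass

@dataclass
class TaskProfile:
    # Historical metrics gathered from past runs (hypothetical field names).
    peak_cpu_cores: float
    peak_memory_gb: float
    avg_slice_seconds: float

def coarse_tune(profile: TaskProfile, headroom: float = 1.2) -> dict:
    """Apply conservative preset rules: size quotas to observed peaks
    plus headroom, and nudge slicing toward a target duration."""
    params = {}
    # Rule 1: set CPU/memory quotas to observed peak plus reserve headroom.
    params["cpu_cores"] = round(profile.peak_cpu_cores * headroom, 1)
    params["memory_gb"] = round(profile.peak_memory_gb * headroom, 1)
    # Rule 2: scale slice size so each slice lands near a 120-second target,
    # avoiding slices that are too short (scheduling overhead) or too long.
    target_seconds = 120.0
    params["slice_scale"] = round(target_seconds / max(profile.avg_slice_seconds, 1.0), 2)
    return params
```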


2. Compression Algorithm Replacement

Kuaishou's big-data storage currently uses Parquet + GZIP. Data volume keeps growing (petabytes of new data on top of exabytes of existing data), and the read-write ratio exceeds 20:1, so decompression performance is prioritized over compression speed.

The industry trend is moving from GZIP to ZSTD, which offers a compression ratio comparable to zlib with performance close to Snappy. Our tests show a 3%-12% compression-ratio improvement; we cap the ZSTD level at 12 because higher levels yield diminishing returns.

Note: ZSTD-JNI versions below 1.5.0 have bugs; upgrade to ≥ 1.5.0 for full compatibility.
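A minimal harness for comparing codecs on ratio and decompression time, using only Python's standard-library gzip and zlib. The ZSTD lines in the comments assume the third-party `zstandard` package and are shown only to indicate where it would slot in:

```python
import gzip
import time
import zlib

def measure(codec_name, compress, decompress, payload):
    """Time one compress/decompress round trip and report the ratio."""
    t0 = time.perf_counter()
    blob = compress(payload)
    t1 = time.perf_counter()
    decompress(blob)
    t2 = time.perf_counter()
    return {
        "codec": codec_name,
        "ratio": len(payload) / len(blob),
        "compress_s": t1 - t0,
        "decompress_s": t2 - t1,
    }

# Repetitive, columnar-style sample data (real benchmarks should use real files).
payload = b"user_id,event,ts\n" * 50_000

results = [
    measure("gzip-6", lambda d: gzip.compress(d, 6), gzip.decompress, payload),
    measure("zlib-9", lambda d: zlib.compress(d, 9), zlib.decompress, payload),
]
# With the third-party `zstandard` package installed, ZSTD slots in the same way:
#   import zstandard
#   c = zstandard.ZstdCompressor(level=12)
#   results.append(measure("zstd-12", c.compress,
#                          zstandard.ZstdDecompressor().decompress, payload))
```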

3. Engine Operator Analysis

Engine operator analysis deeply inspects Spark engine from multiple perspectives (execution process, physical operator, UDF functions) to make compute power explainable, identify bottlenecks, and uncover optimization opportunities.

Key analysis views:

Execution Process View: Break down computation into ten stages and analyze each stage's composition, resource usage, and time cost.

Physical Operator View: Match logical operators to physical implementations; efficient implementation and correct usage greatly affect performance.

UDF Function View: Analyze usage of hundreds of built‑in and user‑defined functions.

Data sources for engine understanding:

QueryPlan: Parse SQL to generate execution plans for operator analysis.

StackTrace: Analyze call stacks, generate histograms and flame graphs for performance profiling.

EventLog: DAG and metric logs produced during job execution.

GcLog: Precise memory usage analysis to reduce OOM risk.

Sample data shows the most time‑consuming processes: Data Scan (≈30%), Data Exchange (≈20%), Data Aggregation (≈15%), and UDF calls (≈14%). Detailed breakdown reveals heavy JSON processing (≈33%) and high‑cost UDFs, indicating key optimization targets.

Once the analysis data is clear, the response proceeds in steps:

Judge whether an optimization is reasonable and estimate how much headroom exists.

Ensure components are used correctly; even a good car needs a skilled driver.

Consider replacing components (e.g., the JSON library or the compression algorithm).

If needed, rebuild critical components.


Data‑Warehouse White‑Boxing

1. Data‑Warehouse Architecture Metrics

Before white-boxing the warehouse, we must first define what a good warehouse looks like. Key quantitative metrics include:

Completeness: Measure whether data models are comprehensive and reusable across layers.

Reusability: Evaluate reference coefficient, duplicate computation, and link depth.

Standardization: Basic requirement; hard to retrofit if ignored early.

2. Reducing Duplicate Computation

Similar operators across different queries cause massive resource waste. The workflow to identify and merge them is as follows:

Collect task list and retrieve all SQL.

Generate execution plans.

Generate operator signatures via post‑order AST traversal.

Detect duplicate operators by signature collisions.

Calculate operator cost; prioritize high‑cost duplicates.

Merge duplicates to improve speed and lower cost.
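The signature-and-collision steps above can be sketched as follows. The `PlanNode` shape and hashing scheme are simplified assumptions, not Spark's actual plan representation; in particular, sorting child signatures makes the match order-insensitive, which is a deliberate simplification:

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class PlanNode:
    # Minimal stand-in for a physical-plan operator (hypothetical shape).
    op: str                 # e.g. "Aggregate", "Join", "Scan"
    args: str = ""          # normalized expressions / table names
    children: list = field(default_factory=list)

def signature(node: PlanNode) -> str:
    """Post-order traversal: a node's signature folds in its children's
    signatures first, so identical subtrees collide on the same hash."""
    child_sigs = "".join(sorted(signature(c) for c in node.children))
    raw = f"{node.op}({node.args})[{child_sigs}]"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]

def find_duplicates(roots):
    """Group subtrees across plans by signature; buckets with more than
    one entry are candidates for merging into a shared intermediate table."""
    buckets = {}
    def walk(n):
        buckets.setdefault(signature(n), []).append(n)
        for c in n.children:
            walk(c)
    for r in roots:
        walk(r)
    return {sig: nodes for sig, nodes in buckets.items() if len(nodes) > 1}
```

Running this over two plans that share a Scan + Aggregate subtree but write to different targets reports exactly the shared subtrees as duplicates.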

Internal data shows high duplicate ratios: 43% of aggregation operators, 23% of join operators, and 4.5% of INSERT operators are similar, indicating urgent need for automated governance.

3. Reducing Link‑Level Depth

Production chains can reach 39 layers with many cross‑layer dependencies, leading to latency, high cost, and quality issues.

Short‑term solution: machine‑assisted governance using operator‑level lineage to reconstruct data processing logic, generate remediation suggestions, and let users apply them.

Long‑term vision: fully decouple logical and physical layers. Logical layer defines business models; physical layer is automatically built and continuously optimized to stay near optimal.
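Measuring link depth reduces to a longest-path computation over the lineage DAG. This sketch assumes lineage is available as a simple table-to-upstreams mapping, which is a simplification of operator-level lineage:

```python
from functools import lru_cache

def link_depth(edges):
    """Longest upstream chain per table in a lineage DAG.
    `edges` maps each table to the list of upstream tables it reads from."""
    @lru_cache(maxsize=None)
    def depth(table):
        upstreams = edges.get(table, [])
        # A source table has depth 1; otherwise 1 + deepest upstream chain.
        return 1 + max((depth(u) for u in upstreams), default=0)
    return {t: depth(t) for t in edges}
```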

4. Routine Governance Automation

Governance is often seen as post‑mortem work with low priority. To push automation, we propose a five‑step method:

Define strict standards.

Identify problems based on standards and rules.

Quality inspection – first firewall using anomaly detection and cross‑validation.

Governance preview – second firewall for critical actions, allowing users to intervene.

Fast rollback – third firewall ensuring every automated action is reversible.
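The fast-rollback firewall can be sketched as an undo log that snapshots prior state before each automated change. The interface below is a hypothetical simplification, not Kuaishou's actual governance tooling:

```python
import copy

class ReversibleAction:
    """Third-firewall sketch: every automated governance action records
    enough state to undo itself."""
    def __init__(self, state):
        self.state = state
        self._log = []  # stack of (key, previous_value) snapshots

    def apply(self, key, new_value):
        # Snapshot the prior value before mutating, so rollback can restore it.
        self._log.append((key, copy.deepcopy(self.state.get(key))))
        self.state[key] = new_value

    def rollback(self):
        # Undo in reverse order of application.
        while self._log:
            key, prev = self._log.pop()
            if prev is None:
                # Treats a None snapshot as "key was absent" — fine for this sketch.
                self.state.pop(key, None)
            else:
                self.state[key] = prev
```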


Benefit Analysis

Results: storage compression ratio increased by 5%, compute resource efficiency improved by 16%, job runtime reduced by 14% (aligned with compute gains), and job failure rate, GC time, and OOM occurrences all decreased.


Future Planning

Future work will continue to deepen and broaden the current efforts:

Data Compression: Move beyond algorithm replacement to dynamic compression and encoding.

Data‑Warehouse Architecture: Advance model design and production, especially in model reuse and abstraction.

Depth Expansion: Further deepen engine white‑boxing governance.

Next‑Generation Technologies: Explore new paradigms to break through efficiency and cost limits.

Thank you for your attention.

