How ByteDance’s DataLeap Automates Big Data Governance and Boosts Performance

ByteDance’s DataLeap suite addresses governance of a large‑scale data platform. It replaces error‑prone manual tuning with automated rule‑engine recommendations and real‑time resource adjustment, with the goals of improving stability, reducing cost, and raising overall system health across thousands of Spark, Flink, and other tasks.

ByteDance Data Platform

Feb 21, 2024

Overview

ByteDance’s data platform runs over 10,000 task queues supporting more than 50 task types (DTS, HSQL, Spark, Python, Flink, Shell, etc.). The automated computation governance framework has integrated offline tasks such as HSQL, Hive‑to‑X DTS, AB test, and Spark jobs, covering thousands of queues with over 60% optimization coverage.

Pain Points of Manual Tuning

System complexity: Hundreds of Spark parameters interact, making manual adjustments difficult and risking stability.

Dynamic changes: Varying workloads and data volumes require flexible tuning.

Lack of expertise: Data analysts focus on business logic rather than low‑level engine parameters.

Inconsistency and reproducibility: Manual tuning results differ across operators and are hard to replicate.

Optimization Challenges

Business requirements include improving stability, reducing cost, resolving task blockage, and raising system health. Typical scenarios involve trade‑offs between stability and resource utilization, cost recovery, and blockage mitigation.

Automated Solutions

Real‑time Rule Engine

The engine collects YARN container, Spark event, and Dtop status data, aggregates the metrics by app ID, and stores them in a historical database. After an observation period of 3 to 7 days, it recommends parameters to Spark and other engines. Recommendations come from heuristic rule trees that evaluate resource usage, and the engine supports two strategies:

Normal strategy: Prioritizes stability by recommending parameters based on actual resource consumption.

Aggressive strategy: Pushes resource utilization higher while handling OOM risk.
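
To make the difference between the two strategies concrete, here is a minimal Python sketch of a recommendation step. It is an illustration only, not DataLeap's implementation: the `RunMetric` record, the headroom factors, the minimum run count, and the 256 MB rounding step are all hypothetical, since the article gives no concrete values.

```python
import math
from collections import defaultdict
from dataclasses import dataclass

# All names and numbers below are illustrative assumptions, not DataLeap's.
HEADROOM = {"normal": 1.3, "aggressive": 1.1}
MIN_RUNS = 3    # stand-in for the 3 to 7 day observation period
STEP_MB = 256   # round recommendations up to this granularity


@dataclass
class RunMetric:
    app_id: str
    peak_memory_mb: int  # peak executor memory observed during one run


def recommend_memory_mb(history, strategy="normal"):
    """Return {app_id: recommended executor memory in MB} for apps with enough history."""
    # Aggregate the observed peaks by app ID.
    peaks_by_app = defaultdict(list)
    for metric in history:
        peaks_by_app[metric.app_id].append(metric.peak_memory_mb)

    recommendations = {}
    for app_id, peaks in peaks_by_app.items():
        if len(peaks) < MIN_RUNS:
            continue  # still inside the observation period
        # Size the request from actual consumption plus strategy-specific headroom.
        target = max(peaks) * HEADROOM[strategy]
        recommendations[app_id] = math.ceil(target / STEP_MB) * STEP_MB
    return recommendations


history = [
    RunMetric("app-a", 1000),
    RunMetric("app-a", 1200),
    RunMetric("app-a", 1100),
    RunMetric("app-b", 900),  # only one run, so no recommendation yet
]
normal = recommend_memory_mb(history, "normal")
aggressive = recommend_memory_mb(history, "aggressive")
```

In this sketch the aggressive strategy simply leaves less headroom above the observed peak, which is why it needs the OOM handling described in the article.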

Failed tasks trigger automatic rollback to the last stable configuration; repeated failures pause optimization.
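
The rollback behavior can be sketched as a small state machine. This is a hedged illustration under assumed names: `TaskTuningState`, `MAX_FAILED_ATTEMPTS`, and the threshold of two failed attempts are hypothetical. The only real identifier is the Spark setting `spark.executor.memory`, used here as sample data.

```python
from dataclasses import dataclass, field

MAX_FAILED_ATTEMPTS = 2  # hypothetical threshold; the article does not give one


@dataclass
class TaskTuningState:
    stable_params: dict
    active_params: dict = field(default_factory=dict)
    failed_attempts: int = 0
    paused: bool = False

    def __post_init__(self):
        if not self.active_params:
            self.active_params = dict(self.stable_params)

    def apply_recommendation(self, params):
        """Switch to recommended parameters unless optimization is paused."""
        if self.paused:
            return False
        self.active_params = dict(params)
        return True

    def report_run(self, succeeded):
        """Promote a successful recommendation, or roll back a failed one."""
        if self.active_params == self.stable_params:
            return  # runs on the stable configuration do not change tuning state
        if succeeded:
            self.stable_params = dict(self.active_params)
            self.failed_attempts = 0
            return
        # Roll back to the last stable configuration.
        self.active_params = dict(self.stable_params)
        self.failed_attempts += 1
        if self.failed_attempts >= MAX_FAILED_ATTEMPTS:
            self.paused = True  # repeated failures pause optimization


state = TaskTuningState(stable_params={"spark.executor.memory": "8g"})
state.apply_recommendation({"spark.executor.memory": "6g"})
state.report_run(succeeded=False)  # first failure: roll back to 8g
state.apply_recommendation({"spark.executor.memory": "7g"})
state.report_run(succeeded=False)  # second failure: roll back and pause
```

Keeping the stable configuration separate from the active one means a rollback never depends on reconstructing old values from logs.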

Real‑time Monitoring & Adaptive Adjustment

OOM‑adaptive handling: Isolate OOM‑prone tasks in dedicated executors.

Shuffle write splitting: Split containers when disk write thresholds are exceeded.

Shuffle tiered throttling: Allocate QPS based on task priority.

Node blacklist: Avoid scheduling on nodes with known failures.

Failure rollback and parameter management: Revert to stable parameters after failures.
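
As one example from the list above, tiered shuffle throttling could be modeled as a weighted split of a QPS budget. The sketch below is an assumption about how such an allocation might work, not the mechanism DataLeap uses: the tier names, the weights, and the `allocate_shuffle_qps` function are hypothetical.

```python
# Hypothetical priority tiers and weights; the article does not specify them.
TIER_WEIGHTS = {"high": 4, "medium": 2, "low": 1}


def allocate_shuffle_qps(total_qps, task_priorities, weights=TIER_WEIGHTS):
    """Split total_qps across tasks in proportion to their priority-tier weight."""
    total_weight = sum(weights[tier] for tier in task_priorities.values())
    if total_weight == 0:
        return {}
    return {
        task: total_qps * weights[tier] / total_weight
        for task, tier in task_priorities.items()
    }


quotas = allocate_shuffle_qps(
    700,
    {"etl_core": "high", "report_daily": "medium", "adhoc_query": "low"},
)
```

A proportional split keeps low-priority tasks running at a reduced rate instead of starving them entirely, which is one plausible reading of "tiered" throttling.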

DataLeap One‑stop Governance Suite

Provides a UI for users to launch governance actions, select optimization strategies (normal or aggressive), enable small‑file merging, and view estimated cost‑benefit outcomes.

Case Study: Queue Optimization

Before optimization, one queue was over‑provisioned before 10 a.m., causing severe blockage. After the automated recommendations were applied, resource requests dropped. CPU usage improved by 3.5% (usage up 6.2%, utilization up 46.3%), and memory requests fell 30.6% (usage down 21.8%, utilization up 24%). Average task runtime decreased by 1.7 minutes, saving roughly ¥100 per PB of CPU and memory.
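
Utilization can rise even when usage falls, as in the memory figures, provided requests fall faster. The short Python sketch below shows that arithmetic. It assumes utilization means used resources divided by requested resources, a definition the article does not state, and its numbers are invented for illustration and do not reproduce the case-study figures.

```python
def utilization(used, requested):
    """Fraction of requested resources actually used (assumed definition)."""
    return used / requested


def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


# Invented numbers for one queue, in GB of memory.
requested_before, used_before = 100, 40
requested_after, used_after = 70, 35

request_change = pct_change(requested_before, requested_after)  # requests shrink
usage_change = pct_change(used_before, used_after)              # usage shrinks less
util_before = utilization(used_before, requested_before)
util_after = utilization(used_after, requested_after)
```

Here requests drop by 30% while usage drops by only 12.5%, so utilization climbs from 0.4 to 0.5.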

Benefits

Increased tuning efficiency and accuracy.

Reduced human labor and operational costs.

Real‑time monitoring ensures optimal system state.

Limitations

Effectiveness depends on the chosen algorithms.

Explainability and controllability can be limited.

Certain edge cases still require manual expertise.

Future Directions

Metadata closed‑loop across products and tiered SLA guarantees.

User‑driven parameter recommendations with fixed values, thresholds, and masking.

Deeper integration of rule engine and algorithmic optimization.

Adapting to evolving big‑data environments and improving algorithm performance while maintaining explainability.


