How ByteDance’s DataLeap Automates Big Data Governance and Boosts Performance
ByteDance’s DataLeap suite addresses the governance of a large‑scale data platform by identifying the pain points of manual tuning, recommending parameters through an automated rule engine, and optimizing resource allocation. The result is better stability, lower cost, and improved system health across thousands of Spark, Flink, and other tasks.
Overview
ByteDance’s data platform runs over 10,000 task queues supporting more than 50 task types (DTS, HSQL, Spark, Python, Flink, Shell, etc.). The automated computation governance framework already integrates offline tasks such as HSQL, Hive‑to‑X DTS, A/B test, and Spark jobs, covering thousands of queues with over 60% optimization coverage.
Pain Points of Manual Tuning
System complexity: Hundreds of Spark parameters interact, making manual adjustments difficult and risking stability.
Dynamic changes: Varying workloads and data volumes require flexible tuning.
Lack of expertise: Data analysts focus on business logic rather than low‑level engine parameters.
Inconsistency and reproducibility: Manual tuning results differ across operators and are hard to replicate.
Optimization Challenges
Business requirements include improving stability, reducing cost, resolving task blockage, and raising system health. Typical scenarios involve trade‑offs between stability and resource utilization, cost recovery, and blockage mitigation.
Automated Solutions
Real‑time Rule Engine
The engine collects YARN container, Spark event, and Dtop status data, aggregates the metrics by app ID, and stores them in a historical database. After a 3 to 7 day observation period, it recommends parameters to Spark and other engines. It applies heuristic rule trees, evaluates resource usage, and supports two strategies:
Normal strategy: Prioritizes stability by recommending parameters based on actual resource consumption.
Aggressive strategy: Pushes resource utilization higher while handling OOM risk.
Failed tasks trigger automatic rollback to the last stable configuration; repeated failures pause optimization.
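The recommendation and rollback behavior described above can be sketched roughly as follows. This is a minimal illustration, not DataLeap's actual implementation: the headroom multipliers, the minimum run count, the failure limit, the rounding step, and the names `RunMetrics`, `recommend_memory_mb`, and `RollbackGuard` are all hypothetical. The source states only that the normal strategy follows actual consumption, that the aggressive strategy pushes utilization higher, and that failures trigger a rollback.

```python
import math
from dataclasses import dataclass

# Hypothetical headroom multipliers over observed peak usage; the article
# names the two strategies but does not give their thresholds.
HEADROOM = {"normal": 1.3, "aggressive": 1.1}
MIN_OBSERVED_RUNS = 3  # stand-in for the 3 to 7 day observation period
MAX_FAILURES = 2       # hypothetical limit before optimization pauses
MEM_STEP_MB = 256      # hypothetical rounding granularity


@dataclass
class RunMetrics:
    app_id: str
    peak_mem_mb: int


def recommend_memory_mb(history, strategy="normal"):
    """Return a memory recommendation in MB, or None if history is too short."""
    if len(history) < MIN_OBSERVED_RUNS:
        return None
    peak = max(run.peak_mem_mb for run in history)
    target = peak * HEADROOM[strategy]
    # Round up to the next multiple of MEM_STEP_MB.
    return math.ceil(target / MEM_STEP_MB) * MEM_STEP_MB


class RollbackGuard:
    """Tracks the last stable config and pauses tuning after repeated failures."""

    def __init__(self, stable_config):
        self.stable = dict(stable_config)
        self.active = dict(stable_config)
        self.failures = 0
        self.paused = False

    def apply(self, recommended):
        # Ignore new recommendations while optimization is paused.
        if not self.paused:
            self.active = dict(recommended)
        return self.active

    def report(self, succeeded):
        if succeeded:
            # A successful run promotes the active config to "last stable".
            self.stable = dict(self.active)
            self.failures = 0
        else:
            # A failed run reverts to the last stable config.
            self.active = dict(self.stable)
            self.failures += 1
            self.paused = self.failures >= MAX_FAILURES
```

In a real deployment the recommended value would presumably feed an engine setting such as Spark's `spark.executor.memory`. The aggressive strategy here simply leaves less headroom above the observed peak, which is why it needs the rollback guard more often.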
Real‑time Monitoring & Adaptive Adjustment
OOM‑adaptive handling: Isolate OOM‑prone tasks in dedicated executors.
Shuffle write splitting: Split containers when disk write thresholds are exceeded.
Shuffle tiered throttling: Allocate QPS based on task priority.
Node blacklist: Avoid scheduling on nodes with known failures.
Failure rollback and parameter management: Revert to stable parameters after failures.
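As one illustration of the adaptive measures listed above, tiered shuffle throttling could be approximated by splitting a shared QPS budget in proportion to each task's priority tier. The tier names, the weights, and the function `allocate_qps` are assumptions made for this sketch. The source says only that QPS is allocated based on task priority.

```python
# Hypothetical priority tiers and weights; the article does not specify them.
TIER_WEIGHTS = {"high": 4, "medium": 2, "low": 1}


def allocate_qps(total_qps, task_tiers):
    """Split total_qps across tasks in proportion to their tier weight.

    task_tiers maps a task ID to one of the keys in TIER_WEIGHTS.
    Returns a dict mapping each task ID to its QPS share.
    """
    total_weight = sum(TIER_WEIGHTS[tier] for tier in task_tiers.values())
    if total_weight == 0:
        return {}
    return {
        task_id: total_qps * TIER_WEIGHTS[tier] / total_weight
        for task_id, tier in task_tiers.items()
    }
```

A proportional split keeps low-priority tasks from being starved entirely while still letting high-priority shuffles take most of the budget. A strict-priority scheme would be another reasonable reading of the text.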
DataLeap One‑stop Governance Suite
Provides a UI for users to launch governance actions, select optimization strategies (normal or aggressive), enable small‑file merging, and view estimated cost‑benefit outcomes.
Case Study: Queue Optimization
Before optimization, one queue was over‑provisioned in the hours before 10 am, causing severe blockage. After the automated recommendations were applied, resource requests dropped: CPU usage improved by 3.5% (usage up 6.2%, utilization up 46.3%) and memory requests fell 30.6% (usage down 21.8%, utilization up 24%). Average task runtime decreased by 1.7 minutes, saving roughly ¥100 per PB of CPU and memory.
Benefits
Increased tuning efficiency and accuracy.
Reduced human labor and operational costs.
Real‑time monitoring helps keep the system close to its optimal state.
Limitations
Effectiveness depends on the chosen algorithms.
Explainability and controllability can be limited.
Certain edge cases still require manual expertise.
Future Directions
A closed metadata loop across products, together with tiered SLA guarantees.
User‑driven parameter recommendations that support fixed values, thresholds, and masking.
Deeper integration of rule engine and algorithmic optimization.
Adapting to evolving big‑data environments and improving algorithm performance while maintaining explainability.
ByteDance Data Platform
The ByteDance Data Platform team empowers all ByteDance business lines by lowering data‑application barriers, aiming to build data‑driven intelligent enterprises, enable digital transformation across industries, and create greater social value. Internally it supports most ByteDance units; externally it delivers data‑intelligence products under the Volcano Engine brand to enterprise customers.