Huya Offline Job Scheduling System: Design, Baseline Scheduling, and Cost Optimization

This article introduces Huya's offline job scheduling platform, covering its positioning, evolution, system architecture, baseline scheduling techniques, cost‑optimization strategies, resource‑balancing methods, and future intelligent data‑warehouse directions, illustrating how data‑driven automation improves YARN utilization and SLA compliance.

DataFunTalk

Jun 12, 2022

Huya's offline job scheduling system is a development platform dedicated to offline computing scenarios, aiming to simplify development, manage versions and permissions, and enable automated operations without developer intervention.

The platform has evolved in stages since 2018. It initially provided visual task development, version control, and collaboration, handling over 100,000 tasks per day. From 2019 to 2020, the focus shifted to cost optimization through baseline scheduling, multi‑datacenter cloud integration, and task governance. In 2021, the team explored intelligent operations such as fault analysis and self‑healing, and intelligent data warehouses are planned next.

Cost optimization was achieved by balancing resource usage across time slots, which raised YARN cluster utilization to over 90% while keeping SLA compliance above 90%.

In the system design, task configuration is DAG‑based and stored in a database, with plugins for Hive and SparkSQL and crontab‑style frequency specifications.
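
As a rough illustration of a DAG‑based task configuration, the Python sketch below models each task as a record with a plugin type, a crontab‑style frequency, and its upstream tasks, then derives one valid execution order. The `TaskConfig` fields, task names, and cron strings are hypothetical and are not taken from Huya's actual schema.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class TaskConfig:
    """One task row as it might be stored in the database (hypothetical fields)."""
    name: str
    plugin: str                      # e.g. "hive" or "sparksql"
    cron: str                        # crontab-style frequency
    upstream: list[str] = field(default_factory=list)


def execution_order(tasks):
    """Return task names so that every task appears after its upstream tasks."""
    graph = {task.name: set(task.upstream) for task in tasks}
    return list(TopologicalSorter(graph).static_order())


# Illustrative three-step chain; names and schedules are made up.
tasks = [
    TaskConfig("ods_load", "hive", "0 1 * * *"),
    TaskConfig("dwd_clean", "sparksql", "0 2 * * *", ["ods_load"]),
    TaskConfig("ads_report", "hive", "0 3 * * *", ["dwd_clean"]),
]
order = execution_order(tasks)
print(order)
```

`graphlib.TopologicalSorter` raises `CycleError` when the dependencies contain a cycle, which is a convenient validation step when a configuration is saved.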

Time dependencies are abstracted using time spans (e.g., "0+3" for every three hours) and event‑driven triggers, allowing downstream tasks to execute after upstream completion signals.
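
The event‑driven part can be pictured as follows: each upstream task emits a completion signal for a given time period, and a downstream task becomes runnable once all of its upstream signals for that period have arrived. This is a minimal sketch under that assumption. The `EventTrigger` class and the period keys are hypothetical, and the sketch does not try to parse the time‑span notation mentioned above.

```python
from collections import defaultdict


class EventTrigger:
    """Tracks completion signals per time period (hypothetical helper)."""

    def __init__(self, dependencies):
        # dependencies maps each downstream task to the set of its upstream tasks.
        self.dependencies = dependencies
        self.done = defaultdict(set)   # period -> tasks that have completed

    def on_complete(self, task, period):
        """Record a completion signal and return downstream tasks that are now ready."""
        self.done[period].add(task)
        ready = []
        for downstream, upstream in self.dependencies.items():
            # Ready only if this signal matters and all upstream tasks are done.
            if task in upstream and upstream <= self.done[period]:
                ready.append(downstream)
        return ready


trigger = EventTrigger({"report": {"load_a", "load_b"}})
first = trigger.on_complete("load_a", "2022-06-12T00")
second = trigger.on_complete("load_b", "2022-06-12T00")
```

Keying the signals by period keeps one run's completions from satisfying the dependencies of another run.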

The deployment architecture consists of a stateless agent layer that retrieves task metadata, assigns tasks to appropriate interface machines, and submits them to YARN for execution.

Baseline scheduling practices focus on three pillars: defining and managing SLA limits, reducing resource concentration during peak hours, and minimizing safety buffer capacity.

SLA is defined by user‑specified timeliness expectations for critical tasks, with hierarchical importance levels and leadership approval to ensure realistic targets.

Cost‑optimization tactics include improving ETL efficiency, adjusting unreasonable user expectations, scaling compute resources, and re‑scheduling lower‑priority tasks to balance load.
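
Re‑scheduling lower‑priority tasks to balance load can be sketched as a greedy placement: keep the load from high‑priority tasks fixed, then put each movable task into the time slot that is currently least loaded. The slot loads, task costs, and the `rebalance` function are illustrative assumptions. A real scheduler would also have to respect dependencies and SLA deadlines, which this sketch ignores.

```python
def rebalance(fixed_load, movable_tasks):
    """Greedily place movable tasks into the least-loaded time slot.

    fixed_load: load per time slot from tasks that cannot move.
    movable_tasks: mapping of task name -> estimated resource cost.
    Returns (placement, resulting load per slot).
    """
    load = list(fixed_load)
    placement = {}
    # Place the largest tasks first so small ones can fill the remaining gaps.
    for name, cost in sorted(movable_tasks.items(), key=lambda kv: -kv[1]):
        slot = min(range(len(load)), key=load.__getitem__)
        placement[name] = slot
        load[slot] += cost
    return placement, load


placement, load = rebalance([80, 20, 50], {"a": 30, "b": 20, "c": 10})
print(placement, load)
```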

Compute balancing involves prioritizing tasks, estimating real execution time by subtracting resource‑waiting periods, and applying fair‑share scheduling for urgent or temporary jobs, while monitoring YARN load to throttle submissions.
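
The runtime estimate described here amounts to simple arithmetic: wall‑clock duration minus the time spent waiting for resources. The sketch below shows that calculation together with a load‑based submission gate. The function names, the wait intervals, and the 0.9 threshold are assumptions for illustration, and how the cluster load figure is obtained from YARN is left out.

```python
from datetime import datetime, timedelta


def effective_runtime(submitted, finished, waits):
    """Wall-clock duration minus the periods spent waiting for resources."""
    wall_clock = finished - submitted
    waiting = sum((end - start for start, end in waits), timedelta())
    return wall_clock - waiting


def may_submit(cluster_load, threshold=0.9):
    """Throttle: allow a new submission only while load is below the threshold."""
    return cluster_load < threshold


submitted = datetime(2022, 6, 12, 1, 0)
finished = datetime(2022, 6, 12, 2, 0)
waits = [
    (datetime(2022, 6, 12, 1, 0), datetime(2022, 6, 12, 1, 15)),
    (datetime(2022, 6, 12, 1, 30), datetime(2022, 6, 12, 1, 35)),
]
runtime = effective_runtime(submitted, finished, waits)
```

In this example the task spent 60 minutes on the wall clock, 20 of them waiting, so its estimated real execution time is 40 minutes.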

Safety buffer capacity is managed by configuring baseline, guaranteed, and elastic compute resources, with cloud resources used for rapid scaling during peak events.

Multi‑datacenter execution is achieved by partitioning DAGs into blocks and distributing them across sites, while small‑IO tasks fill gaps, further boosting cluster utilization.
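
One plausible way to partition a DAG into blocks is to group tasks that are connected by dependencies, so that no dependency crosses a site boundary, and then hand each block to the site with the least work so far. The source does not say how Huya forms its blocks, so the connected‑component approach, the function names, and the site names below are all assumptions.

```python
from collections import defaultdict


def partition_blocks(tasks, edges):
    """Group tasks into blocks with no dependencies between different blocks."""
    adjacency = defaultdict(set)
    for upstream, downstream in edges:
        adjacency[upstream].add(downstream)
        adjacency[downstream].add(upstream)
    seen, blocks = set(), []
    for task in tasks:
        if task in seen:
            continue
        # Depth-first walk over the undirected dependency graph.
        block, stack = [], [task]
        seen.add(task)
        while stack:
            node = stack.pop()
            block.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        blocks.append(sorted(block))
    return blocks


def assign_sites(blocks, sites):
    """Assign each block to the site with the fewest tasks so far, largest block first."""
    load = {site: 0 for site in sites}
    assignment = {}
    for block in sorted(blocks, key=len, reverse=True):
        site = min(sites, key=lambda s: load[s])
        assignment[tuple(block)] = site
        load[site] += len(block)
    return assignment


blocks = partition_blocks(["a", "b", "c", "d", "e"], [("a", "b"), ("b", "c"), ("d", "e")])
assignment = assign_sites(blocks, ["dc1", "dc2"])
```

Counting tasks is only a stand‑in for load. Estimated resource cost or data locality would be more realistic weights.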

The future roadmap emphasizes intelligent task operations, automated exception diagnosis, and an intelligent data‑warehouse where 90% of ETL tasks are auto‑generated, abstracting compute details from users.

A Q&A section addresses topics such as offline vs. real‑time task allocation, DAG workflow with time attributes, multi‑datacenter resource routing, cost attribution, and scaling strategies.
