How Huolala Boosted Offline Scheduling Performance: Strategies & Lessons
Huolala’s big‑data offline platform, built from scratch, faced escalating scheduling delays as task instances grew, prompting a series of short‑ and mid‑term optimizations—including zombie task cleanup, retention policies, memory caching, algorithmic tweaks, and high‑availability enhancements—to dramatically reduce dependency computation time and sustain million‑scale daily workloads.
Introduction
The big data offline development platform is a foundational capability of Huolala’s data system, built from zero, supporting arbitrary database exchanges, rich task analysis, and a one‑stop interactive query service.
As Huolala’s business expands globally across eight lines, the data team aims to deeply optimize core capabilities, efficiently integrate middle‑platform abilities, and drive intelligent business applications, leading to increasing challenges for the platform’s performance. This article briefly introduces the recent six‑month optimizations to offline scheduling performance.
Scheduling Background & Logic
Huolala’s data architecture follows a typical Lambda model, choosing a low‑maintenance batch layer for offline processing due to early immaturity of real‑time frameworks and ease of integration.
With business expansion and the maturation of data warehousing, the offline focus shifted from supporting corporate reports to scientifically empowering business, requiring the platform to reliably support data applications, online services, and intelligent scenarios such as feature computation.
Current scheduling uses a single‑layer entity model: tasks are scanned periodically, dependencies and boundary conditions are computed in memory, and runnable instances are dispatched to distributed Runner nodes. While this ensures 100% SLA accuracy, instance count growth leads to linear degradation in dependency calculation latency and increased database pressure, potentially causing average scheduling time to reach minute‑level within a year.
Optimization Plan
Performance issues are first quantified, then targeted actions are defined. The actions are divided into short‑term and mid‑term phases, covering reduction of dependency computation, algorithmic improvements, and tiered safeguards.
Short‑Term Measures
To lower the base of each scheduling computation, “zombie tasks”—tasks without upstream/downstream dependencies or producing no data—are identified and cleaned.
Zombie Task & Instance Cleanup
Four zombie task types are defined: empty data, failure, ready, and no‑dependency. Approximately 100,000 such instances exist, growing by hundreds daily. Direct actions like freezing or forcing success reduce average scheduling computation by about three seconds per task.
Instance Retention Policy
Without a proper retention strategy, instances accumulate, stressing query and computation. The platform defines retention periods aligned with task cycles to keep instance growth within acceptable limits.
Task Cycle
Retention
Month
1 year
Week
3 months
Day
1 month
Hour
1 week
Link Prioritization
Core‑link tasks are always given higher scheduling priority than non‑core tasks, ensuring they are dispatched first when conditions are met.
Mid‑Term Measures
Space‑for‑time trade‑offs are applied by redesigning data structures and algorithms.
In‑Memory Structures
Metadata originally stored in MySQL is now cached in memory per task cycle (details, dependencies, instance lists). This enables direct cache lookup during dispatch, with cache updates synchronized on re‑run, termination, forced success, or status reports.
Algorithm Adjustments
Dependency calculation dominates scheduling. Two common dependency types are optimized:
Self‑dependency: previously examined all historical instances (complexity n). New “previous‑cycle dependency” reduces it to a single instance, achieving O(1) time.
Parent dependency: previously required a database scan per check (complexity O(n)). By sorting instances in memory and applying binary search, complexity improves to O(log n).
Current
Improved
Self‑dependency
Depends on all historical instances ( n)
Depends only on previous cycle ( 1), O(1) Parent dependency
Database query each time ( O(n))
In‑memory binary search ( O(log n))
High Availability
Cache updates and status reports are now coordinated via ZooKeeper for partition fault tolerance, and fail‑over processes include both database transaction and cache refresh to keep consistency.
Outlook
Even after these optimizations, processing millions of daily instances still reveals performance risks. Long‑term improvements will focus on redesigning the overall dispatch mechanism, including computation and delivery reconstruction and targeted scheduling capabilities.
Scheduling Mode Improvement
Current
Improved
Computation Passive calculation – Runner triggers calculation on each request. Active calculation – A dependency calculator triggers dispatch when conditions change.
Dispatch
Runner pull model.
Base push model.
Targeted Scheduling Capability
Current Runner scheduling is random. Implementing targeted scheduling allows specifying the Runner for gray‑release deployments, preventing version updates from affecting running tasks.
Author: Ling Xiao, Head of Huolala Data Platform, leading profiling, data API, offline & real‑time platform development.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
