Big Data 12 min read

How Huolala Boosted Offline Scheduling Performance: Strategies & Lessons

Huolala’s big‑data offline platform, built from scratch, faced escalating scheduling delays as task instances grew, prompting a series of short‑ and mid‑term optimizations—including zombie task cleanup, retention policies, memory caching, algorithmic tweaks, and high‑availability enhancements—to dramatically reduce dependency computation time and sustain million‑scale daily workloads.

Huolala Tech

Nov 11, 2022

How Huolala Boosted Offline Scheduling Performance: Strategies & Lessons

Introduction

The big data offline development platform is a foundational capability of Huolala’s data system, built from zero, supporting arbitrary database exchanges, rich task analysis, and a one‑stop interactive query service.

As Huolala’s business expands globally across eight lines, the data team aims to deeply optimize core capabilities, efficiently integrate middle‑platform abilities, and drive intelligent business applications, leading to increasing challenges for the platform’s performance. This article briefly introduces the recent six‑month optimizations to offline scheduling performance.

Scheduling Background & Logic

Huolala’s data architecture follows a typical Lambda model, choosing a low‑maintenance batch layer for offline processing due to early immaturity of real‑time frameworks and ease of integration.

With business expansion and the maturation of data warehousing, the offline focus shifted from supporting corporate reports to scientifically empowering business, requiring the platform to reliably support data applications, online services, and intelligent scenarios such as feature computation.

Current scheduling uses a single‑layer entity model: tasks are scanned periodically, dependencies and boundary conditions are computed in memory, and runnable instances are dispatched to distributed Runner nodes. While this ensures 100% SLA accuracy, instance count growth leads to linear degradation in dependency calculation latency and increased database pressure, potentially causing average scheduling time to reach minute‑level within a year.

Optimization Plan

Performance issues are first quantified, then targeted actions are defined. The actions are divided into short‑term and mid‑term phases, covering reduction of dependency computation, algorithmic improvements, and tiered safeguards.

Short‑Term Measures

To lower the base of each scheduling computation, “zombie tasks”—tasks without upstream/downstream dependencies or producing no data—are identified and cleaned.

Zombie Task & Instance Cleanup

Four zombie task types are defined: empty data, failure, ready, and no‑dependency. Approximately 100,000 such instances exist, growing by hundreds daily. Direct actions like freezing or forcing success reduce average scheduling computation by about three seconds per task.

Instance Retention Policy

Without a proper retention strategy, instances accumulate, stressing query and computation. The platform defines retention periods aligned with task cycles to keep instance growth within acceptable limits.

Task Cycle

Retention

Month

1 year

Week

3 months

Day

1 month

Hour

1 week

Link Prioritization

Core‑link tasks are always given higher scheduling priority than non‑core tasks, ensuring they are dispatched first when conditions are met.

Mid‑Term Measures

Space‑for‑time trade‑offs are applied by redesigning data structures and algorithms.

In‑Memory Structures

Metadata originally stored in MySQL is now cached in memory per task cycle (details, dependencies, instance lists). This enables direct cache lookup during dispatch, with cache updates synchronized on re‑run, termination, forced success, or status reports.

Algorithm Adjustments

Dependency calculation dominates scheduling. Two common dependency types are optimized:

Self‑dependency: previously examined all historical instances (complexity n). New “previous‑cycle dependency” reduces it to a single instance, achieving O(1) time.

Parent dependency: previously required a database scan per check (complexity O(n)). By sorting instances in memory and applying binary search, complexity improves to O(log n).

Current

Improved

Self‑dependency

Depends on all historical instances ( n)

Depends only on previous cycle ( 1), O(1) Parent dependency

Database query each time ( O(n))

In‑memory binary search ( O(log n))

High Availability

Cache updates and status reports are now coordinated via ZooKeeper for partition fault tolerance, and fail‑over processes include both database transaction and cache refresh to keep consistency.

Outlook

Even after these optimizations, processing millions of daily instances still reveals performance risks. Long‑term improvements will focus on redesigning the overall dispatch mechanism, including computation and delivery reconstruction and targeted scheduling capabilities.

Scheduling Mode Improvement

Current

Improved

Computation Passive calculation – Runner triggers calculation on each request. Active calculation – A dependency calculator triggers dispatch when conditions change.

Dispatch

Runner pull model.

Base push model.

Targeted Scheduling Capability

Current Runner scheduling is random. Implementing targeted scheduling allows specifying the Runner for gray‑release deployments, preventing version updates from affecting running tasks.

Author: Ling Xiao, Head of Huolala Data Platform, leading profiling, data API, offline & real‑time platform development.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

distributed systems Big Data offline scheduling

Written by

Huolala Tech

Technology reshapes logistics

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.