How Dynamic Task‑Grabbing Cuts Distributed Batch Jobs from Hours to Minutes
This article presents a detailed case study of optimizing a distributed batch processing system by replacing static shard‑key concurrency with a dynamic task‑grabbing mechanism, dramatically reducing execution time from several hours to under fifteen minutes while maintaining stable resource usage.
Background
A distributed batch system uses a distributed database architecture where the core functionality is implemented by batch programs. Performance testing must monitor server resource usage and detect anomalies, but it also needs to watch for data‑distribution imbalance that can cause some concurrent sub‑programs to run much longer, becoming a bottleneck for the whole batch job.
System Characteristics and Problem
The tested system stores business data from multiple branches of a company in a distributed database. It contains dozens of database shards and over 500 shard keys (the primary key that determines data placement). Each shard key groups data from several institutions. During batch execution, roughly 500 sub‑programs run concurrently, each handling the data belonging to a specific shard key. Because the amount of data per shard key varies widely, sub‑programs processing larger shards take significantly longer, slowing down the overall batch process.
First Optimization: Dynamic Task‑Grabbing
To eliminate the static‑concurrency bottleneck, the system adopts a "task‑grabbing" dynamic concurrency model. The approach consists of three steps:
All data to be processed are divided into tasks based on a chosen dimension (e.g., institution ID plus a specific parameter). Each task corresponds to the data of one institution for that parameter.
A task‑generation program creates a record for every task in a task table, marking each as "initialized" before any processing starts.
Worker sub‑programs no longer follow a fixed shard‑key schedule. Instead, each worker queries the task table for tasks that are still initialized and have a large data volume (high priority). After completing a task, the worker marks it as finished and fetches the next available task. This loop continues until no initialized tasks remain.
In a test scenario with 16 million rows, 300 concurrent sub‑programs processed 110 000 generated tasks. All sub‑programs finished within 33 minutes, and the total batch execution time was 32 minutes 21 seconds, with no individual sub‑program becoming a noticeable outlier.
Second Optimization: Controlling Task Size
Even after the first optimization, a few tasks still contained excessive data, and the sheer number of small tasks caused overhead when workers repeatedly grabbed and updated them. The second refinement adds two mechanisms:
A configurable parameter limits the maximum data size per task. When multiple small tasks together stay below this threshold, they are merged into a larger task; a task boundary is created when the shard key changes.
For originally large tasks, an additional dimension field is introduced to split them into smaller chunks, also respecting the maximum‑size parameter.
With the same data set and unchanged concurrency, the number of generated tasks dropped from over 100 000 to fewer than 1 000. Task generation now takes 2 minutes 40 seconds, while the data‑processing phase shortens to 12 minutes 33 seconds, yielding a clear overall performance gain.
Conclusion and Outlook
The two‑stage optimization reduced the production batch runtime from nearly four hours to roughly fifteen minutes—a reduction of over 90%—while keeping system and database resource utilization stable and free of bottlenecks. Future work will continue to explore distributed batch testing techniques, deeper system analysis, and further tuning methods to improve efficiency and reliability.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
