How a GC Alert Led Me to Split and Shard Scheduled Jobs for Better Performance
After receiving a Young Generation GC alarm on a Pinduoduo service, I traced the issue to a high‑frequency scheduled task that created a large number of short‑lived objects. I resolved it by first breaking the job into finer‑grained tasks and then sharding the work across multiple machines.
Problem Detection
One night I received a WeChat Enterprise Mail alert indicating that the Young Generation (G1) GC count exceeded the threshold. The alarm suggested excessive object creation and rapid reclamation, prompting me to investigate the next day at the office.
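To make the alarm condition concrete, here is a minimal sketch of how a young-generation collection count can be read from inside a JVM using the standard `java.lang.management` API. The class `YoungGcCounter`, its methods, and the threshold check are hypothetical and only illustrate the idea; the monitoring platform described in this article most likely collects the metric differently.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper, not part of any monitoring platform. */
final class YoungGcCounter {
    private YoungGcCounter() {}

    /** Total young-generation collections since JVM start, as reported by the JVM. */
    static long youngCollectionCount() {
        long total = 0;
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Under G1 the young collector is reported as "G1 Young Generation".
            // Matching on "Young" is an assumption; other collectors use other names.
            long count = bean.getCollectionCount(); // -1 when the count is undefined
            if (bean.getName().contains("Young") && count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** True when the collections between two samples exceed the threshold. */
    static boolean exceeds(long previousCount, long currentCount, long threshold) {
        return currentCount - previousCount > threshold;
    }
}
```

Sampling `youngCollectionCount()` at a fixed interval and passing consecutive samples to `exceeds` would approximate a "GC count over threshold" rule of the kind that fired here.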
Root Cause Analysis
Monitoring data from the CAT platform showed two peak periods, likely caused by uneven scheduling that concentrated load on a single machine. By reviewing logs around the alarm times and searching for Command entries, I identified the offending scheduled job: a task that synchronizes advertising transaction data from Toutiao.
The task pulls data for today, yesterday, the day before, and the day before that, repackages it, and reports it to downstream platforms. Although the code limits each batch to 1,000 records, the sheer volume of transaction data still forces the creation and destruction of a large number of objects, triggering the GC alarm.
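The batch limit can be pictured with the following sketch. Only the batch size of 1,000 comes from the original job; the class `BatchRunner`, the method `forEachBatch`, and the handler are hypothetical names used for illustration. Batching of this kind caps how many records are in flight at once, but it does not reduce the total number of objects allocated over a run, which would be consistent with the young-generation count staying high.

```java
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch of fixed-size batching; not the original job's code. */
final class BatchRunner {
    static final int BATCH_SIZE = 1_000; // batch limit mentioned in the article

    private BatchRunner() {}

    /** Passes the records to the handler in consecutive batches and returns the batch count. */
    static <T> int forEachBatch(List<T> records, int batchSize, Consumer<List<T>> handler) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int batches = 0;
        for (int from = 0; from < records.size(); from += batchSize) {
            int to = Math.min(from + batchSize, records.size());
            // subList returns a view, so no copy of the records is made here
            handler.accept(records.subList(from, to));
            batches++;
        }
        return batches;
    }
}
```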
Task Splitting
My first mitigation was to split the original job into three separate tasks, each handling a specific day’s data. This reduced the load per machine because the scheduler could distribute the three tasks across different nodes.
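One way to picture the split is a single task definition parameterized by a day offset, registered once per day to be synchronized. This is a sketch under assumptions: `DailySyncTask`, `daysAgo`, and `split` are hypothetical names, the offsets starting at zero are a guess, and the real scheduling platform's registration API is not shown.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical per-day task; each instance syncs exactly one day's data. */
final class DailySyncTask {
    final int daysAgo; // 0 = today, 1 = yesterday, and so on (assumed convention)

    DailySyncTask(int daysAgo) {
        if (daysAgo < 0) {
            throw new IllegalArgumentException("daysAgo must not be negative");
        }
        this.daysAgo = daysAgo;
    }

    /** The single date this task is responsible for, relative to the given day. */
    LocalDate targetDate(LocalDate today) {
        return today.minusDays(daysAgo);
    }

    /** Builds one task per day offset so a scheduler can place them on different nodes. */
    static List<DailySyncTask> split(int days) {
        List<DailySyncTask> tasks = new ArrayList<>();
        for (int offset = 0; offset < days; offset++) {
            tasks.add(new DailySyncTask(offset));
        }
        return tasks;
    }
}
```

Because each task touches only one date, the allocation burst of the original job is divided among as many nodes as the scheduler chooses to use, but any single day's volume still lands on one machine.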
Although splitting lowered the per‑machine pressure, the GC alarm re‑occurred, indicating that further optimization was needed.
Task Sharding
Instead of assigning whole tasks to individual machines, I introduced sharding: the job is divided into a fixed number of slices (e.g., 10), and each record is assigned to a slice by its identifier modulo the slice count. The slices are then distributed among the available machines, so each machine processes only the records that fall into its own slices.
For example, with two machines A and B and ten slices, A might handle slices [0‑4] and B slices [5‑9]. When processing 14,267 records, each record’s id % 10 determines which slice—and thus which machine—processes it.
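The assignment rule can be sketched as follows. The modulo rule and the example ranges come from the text above; the class `SliceAssigner`, its method names, and the contiguous-range mapping of slices to machines are assumptions for illustration. In practice the scheduling platform decides which slices each machine receives.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of id-based slice assignment; not the scheduling platform's API. */
final class SliceAssigner {
    private SliceAssigner() {}

    /** Slice a record belongs to: its id modulo the slice count. */
    static int sliceOf(long recordId, int sliceCount) {
        // floorMod keeps the result in [0, sliceCount) even for negative ids
        return (int) Math.floorMod(recordId, (long) sliceCount);
    }

    /** Assumed mapping of slices to machines as contiguous ranges, e.g. [0-4] and [5-9]. */
    static int ownerOf(int slice, int sliceCount, int machineCount) {
        return slice * machineCount / sliceCount;
    }

    /** Ids that the machine with the given index should process. */
    static List<Long> idsFor(int machineIndex, List<Long> ids, int sliceCount, int machineCount) {
        List<Long> mine = new ArrayList<>();
        for (long id : ids) {
            if (ownerOf(sliceOf(id, sliceCount), sliceCount, machineCount) == machineIndex) {
                mine.add(id);
            }
        }
        return mine;
    }
}
```

Under these assumptions, a record with id 14267 falls into slice 7 and is handled by the second machine, the one that owns slices [5‑9]. Every machine can run the same code and simply skip the records that belong to other slices.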
This approach spreads the workload evenly and avoids the single‑machine bottleneck, provided the task can be safely partitioned. Tasks that require aggregated processing across accounts, for instance, may not be suitable for sharding.
Takeaways
The monitoring platform gives clear visibility into micro‑service health, and alerts are promptly routed to owners.
A unified scheduling platform prevents scattered @Scheduled annotations and makes task management transparent.
Comprehensive documentation and internal wikis improve knowledge sharing and operational efficiency.
Technical communities (e.g., the Gavin scheduling group) offer valuable support for platform‑specific challenges.
dbaplus Community