Interview Question: How to Handle a Crashed Scheduled‑Task Server? Most Candidates Miss It
When a scheduled‑task server crashes, simply restarting it is not enough. A robust solution combines cluster deployment, distributed locks, idempotent design, checkpointing, and monitoring so that tasks resume correctly after both non‑runtime and runtime failures, as illustrated below with SpringTask + Redis and XXL‑JOB implementations.
Interview Question
When asked "If the server that runs a scheduled task crashes, how would you solve it?", the expected answer covers failover strategies, idempotency, and checkpointing rather than simply restarting the server.
Crash Scenarios
Non‑runtime crash: the server fails before the scheduled task starts (e.g., task scheduled at 02:00, server crashes at 01:00).
Runtime crash: the server fails while the task is executing (e.g., power loss or process crash halfway through).
Non‑runtime Crash – Remove Single Point of Failure
Deploy the task in a cluster so any node can acquire a lock and run the job.
Solution 1 – SpringTask + Redis Distributed Lock
Each node attempts to execute the task, but a Redis lock guarantees that only one node actually runs the business logic. If a node crashes, another node can still obtain the lock.
@Slf4j
@Component
public class UserActiveScheduledTask {

    @Autowired
    private RedissonClient redissonClient;
    @Autowired
    private UserService userService;

    /** Executes every day at 02:00. */
    @Scheduled(cron = "0 0 2 * * ?")
    public void markActiveUsers() {
        String lockKey = "scheduled:task:mark-active-users";
        RLock lock = redissonClient.getLock(lockKey);
        try {
            // tryLock(waitTime = 0, leaseTime = 60 minutes): do not wait, hold the lock for at most 60 minutes
            boolean isLocked = lock.tryLock(0, 60, TimeUnit.MINUTES);
            if (!isLocked) {
                log.info("Failed to acquire distributed lock, task will be executed by another server");
                return;
            }
            log.info("Successfully acquired lock, executing scheduled task");
            userService.markActiveUsers();
        } catch (InterruptedException e) {
            log.error("Exception while acquiring lock", e);
            Thread.currentThread().interrupt();
        } finally {
            if (lock.isHeldByCurrentThread()) {
                lock.unlock();
                log.info("Lock released");
            }
        }
    }
}
Solution 2 – XXL‑JOB Failover
XXL‑JOB is a distributed scheduler. The admin node decides which executor runs the task; if the chosen executor does not respond, XXL‑JOB automatically fails over to another executor (provided the job's route strategy is set to FAILOVER in the admin console).
@Component
public class MarkActiveUsersJob {

    @Autowired
    private UserService userService;

    @XxlJob("markActiveUsersHandler")
    public void execute() {
        XxlJobHelper.log("Start executing mark-active-users task");
        try {
            userService.markActiveUsers();
            XxlJobHelper.log("Task executed successfully");
        } catch (Exception e) {
            XxlJobHelper.log("Task failed: " + e.getMessage());
            XxlJobHelper.handleFail("Task execution exception");
        }
    }
}
# application.yml (partial)
xxl:
  job:
    admin:
      addresses: http://xxl-job-admin:8080/xxl-job-admin
    executor:
      appname: user-service
      port: 9999
Runtime Crash – Strategy Depends on Business Characteristics
Scenario 1 – Mark Active Users (Repeatable)
The operation can be repeated without side effects. A task_status table records the task state.
@Slf4j
@Service
public class UserActiveService {

    @Autowired
    private TaskStatusMapper taskStatusMapper;
    @Autowired
    private UserMapper userMapper;

    @Transactional(rollbackFor = Exception.class)
    public void markActiveUsers() {
        String taskDate = LocalDate.now().minusDays(1).toString();
        String taskId = "mark_active_users_" + taskDate;
        // Check whether the task has already completed; the job is repeatable, so skipping is safe
        TaskStatus taskStatus = taskStatusMapper.selectByTaskId(taskId);
        if (taskStatus != null && "COMPLETED".equals(taskStatus.getStatus())) {
            log.info("Task already completed, skip: {}", taskId);
            return;
        }
        // Mark the task as running
        if (taskStatus == null) {
            taskStatus = new TaskStatus();
            taskStatus.setTaskId(taskId);
            taskStatus.setStatus("RUNNING");
            taskStatus.setStartTime(LocalDateTime.now());
            taskStatusMapper.insert(taskStatus);
        } else {
            taskStatus.setStatus("RUNNING");
            taskStatusMapper.updateById(taskStatus);
        }
        try {
            List<Long> userIds = userMapper.selectOrderUsersByDate(taskDate);
            for (Long userId : userIds) {
                userMapper.updateUserActive(userId, true);
            }
            taskStatus.setStatus("COMPLETED");
            taskStatus.setEndTime(LocalDateTime.now());
            taskStatusMapper.updateById(taskStatus);
        } catch (Exception e) {
            // Caveat: because this whole method is transactional, the FAILED update below is rolled
            // back together with the business changes once the exception propagates; writing the
            // status in a new transaction (e.g. REQUIRES_NEW) avoids losing it.
            taskStatus.setStatus("FAILED");
            taskStatus.setErrorMsg(e.getMessage());
            taskStatusMapper.updateById(taskStatus);
            throw e;
        }
    }
}
# task_status table (simplified)
CREATE TABLE `task_status` (
  `id` BIGINT NOT NULL AUTO_INCREMENT,
  `task_id` VARCHAR(100) NOT NULL COMMENT 'Task ID',
  `status` VARCHAR(20) NOT NULL COMMENT 'RUNNING/COMPLETED/FAILED',
  `start_time` DATETIME COMMENT 'Start time',
  `end_time` DATETIME COMMENT 'End time',
  `error_msg` TEXT COMMENT 'Error message',
  PRIMARY KEY (`id`),
  UNIQUE KEY `uk_task_id` (`task_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
Scenario 2 – Grant Points (Non‑Idempotent)
Repeating the task would double‑grant points, causing financial loss. Idempotency is enforced via a unique index on a composite key date_userId.
@Slf4j
@Service
public class PointsGrantService {

    @Autowired
    private PointsDetailMapper pointsDetailMapper;
    @Autowired
    private UserPointsMapper userPointsMapper;

    public void grantPointsToOrderUsers() {
        String grantDate = LocalDate.now().minusDays(1).toString();
        // Query users who placed orders on grantDate (implementation omitted)
        List<UserOrderInfo> orderUsers = queryOrderUsers(grantDate);
        int successCount = 0;
        int skipCount = 0;
        for (UserOrderInfo userOrder : orderUsers) {
            try {
                // 1 point per 100 units of order amount
                int points = (int) (userOrder.getOrderAmount() * 0.01);
                boolean granted = tryGrantPoints(userOrder.getUserId(), points, grantDate);
                if (granted) {
                    successCount++;
                } else {
                    skipCount++;
                }
            } catch (Exception e) {
                log.error("Failed to grant points, userId: {}", userOrder.getUserId(), e);
            }
        }
        log.info("Points grant completed, success: {}, skip: {}", successCount, skipCount);
    }

    // Note: calling this as this.tryGrantPoints(...) bypasses the Spring proxy, so @Transactional
    // only applies if the call goes through the proxy (self-injection or a separate bean).
    @Transactional(rollbackFor = Exception.class)
    public boolean tryGrantPoints(Long userId, int points, String grantDate) {
        String idempotentKey = grantDate + "_" + userId;
        try {
            // Insert the idempotency record first; the unique index on idempotent_key rejects duplicates
            PointsDetail detail = new PointsDetail();
            detail.setIdempotentKey(idempotentKey);
            detail.setUserId(userId);
            detail.setPoints(points);
            detail.setGrantDate(grantDate);
            detail.setCreateTime(LocalDateTime.now());
            pointsDetailMapper.insert(detail);
            userPointsMapper.increasePoints(userId, points);
            return true;
        } catch (DuplicateKeyException e) {
            log.info("Points already granted, skip, userId: {}", userId);
            return false;
        }
    }
}
# points_detail table (simplified)
CREATE TABLE `points_detail` (
  `id` BIGINT NOT NULL AUTO_INCREMENT,
  `idempotent_key` VARCHAR(100) NOT NULL COMMENT 'date_userId',
  `user_id` BIGINT NOT NULL COMMENT 'User ID',
  `points` INT NOT NULL COMMENT 'Points',
  `grant_date` VARCHAR(20) NOT NULL COMMENT 'Grant date',
  `create_time` DATETIME NOT NULL COMMENT 'Create time',
  PRIMARY KEY (`id`),
  UNIQUE KEY `uk_idempotent_key` (`idempotent_key`),
  KEY `idx_user_id` (`user_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
Scenario 3 – 200 Million Users (Checkpoint‑Resume)
For massive batches, a checkpoint‑resume mechanism processes users in ascending ID order, records the maximum processed ID after each batch, and resumes from that ID after a failure.
@Slf4j
@Service
public class MassPointsGrantService {

    @Autowired
    private PointsDetailMapper pointsDetailMapper;
    @Autowired
    private UserMapper userMapper;
    @Autowired
    private TaskCheckpointMapper checkpointMapper;

    private static final int BATCH_SIZE = 1000;

    public void grantPointsToAllUsers() {
        String grantDate = LocalDate.now().minusDays(1).toString();
        String taskId = "grant_points_all_users_" + grantDate;
        Long startUserId = findCheckpoint(taskId);
        log.info("Start batch grant, startUserId: {}", startUserId);
        long processedCount = 0;
        Long currentMaxUserId = startUserId;
        while (true) {
            // Fetch the next batch in ascending ID order, starting after the last processed ID
            List<User> users = userMapper.selectBatchByIdGreaterThan(currentMaxUserId, BATCH_SIZE);
            if (users.isEmpty()) {
                log.info("All users processed, total: {}", processedCount);
                break;
            }
            for (User user : users) {
                try {
                    int points = 100; // each user gets 100 points
                    // Reuses the idempotent grant logic from Scenario 2 (unique index on date_userId)
                    tryGrantPoints(user.getId(), points, grantDate);
                    currentMaxUserId = Math.max(currentMaxUserId, user.getId());
                } catch (Exception e) {
                    log.error("Failed to grant points, userId: {}", user.getId(), e);
                }
            }
            processedCount += users.size();
            saveCheckpoint(taskId, currentMaxUserId, grantDate);
            log.info("Batch completed, processed: {}, currentMaxUserId: {}", processedCount, currentMaxUserId);
        }
        clearCheckpoint(taskId);
    }

    private Long findCheckpoint(String taskId) {
        TaskCheckpoint checkpoint = checkpointMapper.selectByTaskId(taskId);
        if (checkpoint != null) {
            log.info("Found checkpoint, continue from userId {}", checkpoint.getLastUserId());
            return checkpoint.getLastUserId();
        }
        // No checkpoint yet: fall back to the largest user ID already granted for that date, or 0
        Long maxUserId = pointsDetailMapper.selectMaxUserIdByDate(LocalDate.now().minusDays(1).toString());
        return maxUserId != null ? maxUserId : 0L;
    }

    // Plain method on purpose: @Transactional would be ignored on a private, self-invoked method,
    // and the checkpoint write is a single insert/update anyway.
    private void saveCheckpoint(String taskId, Long lastUserId, String grantDate) {
        TaskCheckpoint checkpoint = checkpointMapper.selectByTaskId(taskId);
        if (checkpoint == null) {
            checkpoint = new TaskCheckpoint();
            checkpoint.setTaskId(taskId);
            checkpoint.setGrantDate(grantDate);
            checkpoint.setLastUserId(lastUserId);
            checkpoint.setUpdateTime(LocalDateTime.now());
            checkpointMapper.insert(checkpoint);
        } else {
            checkpoint.setLastUserId(lastUserId);
            checkpoint.setUpdateTime(LocalDateTime.now());
            checkpointMapper.updateById(checkpoint);
        }
    }

    private void clearCheckpoint(String taskId) {
        checkpointMapper.deleteByTaskId(taskId);
        log.info("Task completed, checkpoint cleared: {}", taskId);
    }
}
# task_checkpoint table (simplified)
CREATE TABLE `task_checkpoint` (
  `id` BIGINT NOT NULL AUTO_INCREMENT,
  `task_id` VARCHAR(100) NOT NULL COMMENT 'Task ID',
  `last_user_id` BIGINT NOT NULL COMMENT 'Last processed user ID',
  `grant_date` VARCHAR(20) NOT NULL COMMENT 'Grant date',
  `update_time` DATETIME NOT NULL COMMENT 'Update time',
  PRIMARY KEY (`id`),
  UNIQUE KEY `uk_task_id` (`task_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
Timing comparison: without checkpointing, a crash near the end of a 19‑hour run forces a full re‑run, roughly 38 hours in total (19 h + 19 h); with checkpointing, the job resumes from the last checkpoint and finishes in roughly 20 hours (19 h + 1 h resume).
Decision Summary
Non‑runtime crash → Cluster deployment, distributed lock or XXL‑JOB failover.
Repeatable runtime task → Status flag table.
Non‑repeatable runtime task → Idempotent design with unique index.
Large‑scale batch → Checkpointing + idempotency.
Best‑Practice Template
/**
 * Scheduled-task best-practice template.
 * Integrates distributed lock, status management, checkpointing, and idempotency.
 */
@Slf4j
@Component
public class BestPracticeScheduledTask {

    @Autowired
    private RedissonClient redissonClient;
    @Autowired
    private TaskStatusMapper taskStatusMapper;
    @Autowired
    private TaskCheckpointMapper checkpointMapper;

    @Scheduled(cron = "0 0 2 * * ?")
    public void execute() {
        String lockKey = "scheduled:task:best-practice";
        RLock lock = redissonClient.getLock(lockKey);
        try {
            // Only one node in the cluster acquires the lock and runs the task
            if (!lock.tryLock(0, 60, TimeUnit.MINUTES)) {
                log.info("Lock not acquired, another server will run the task");
                return;
            }
            if (isTaskCompleted()) {
                log.info("Task already completed, skip");
                return;
            }
            markTaskRunning();
            Long checkpoint = findCheckpoint();
            log.info("Resume from checkpoint: {}", checkpoint);
            executeBusiness(checkpoint);
            markTaskCompleted();
            clearCheckpoint();
            log.info("Scheduled task executed successfully");
        } catch (Exception e) {
            log.error("Scheduled task exception", e);
            markTaskFailed(e.getMessage());
            sendAlert(e);
        } finally {
            if (lock.isHeldByCurrentThread()) {
                lock.unlock();
            }
        }
    }

    private boolean isTaskCompleted() {
        String taskId = generateTaskId();
        TaskStatus status = taskStatusMapper.selectByTaskId(taskId);
        return status != null && "COMPLETED".equals(status.getStatus());
    }

    private void markTaskRunning() {
        String taskId = generateTaskId();
        TaskStatus status = new TaskStatus();
        status.setTaskId(taskId);
        status.setStatus("RUNNING");
        status.setStartTime(LocalDateTime.now());
        taskStatusMapper.insertOrUpdate(status);
    }

    private void markTaskCompleted() {
        String taskId = generateTaskId();
        taskStatusMapper.updateStatus(taskId, "COMPLETED", LocalDateTime.now());
    }

    private void markTaskFailed(String errorMsg) {
        String taskId = generateTaskId();
        taskStatusMapper.updateStatusWithError(taskId, "FAILED", errorMsg);
    }

    private Long findCheckpoint() {
        String taskId = generateTaskId();
        TaskCheckpoint cp = checkpointMapper.selectByTaskId(taskId);
        return cp != null ? cp.getLastProcessedId() : 0L;
    }

    private void clearCheckpoint() {
        String taskId = generateTaskId();
        checkpointMapper.deleteByTaskId(taskId);
    }

    private void executeBusiness(Long checkpoint) {
        // Implement business logic here, ensuring idempotency and saving checkpoints periodically.
    }

    private void sendAlert(Exception e) {
        // Send alert to DingTalk, email, etc.
    }

    private String generateTaskId() {
        return "task_" + LocalDate.now().toString();
    }
}
Monitoring & Alerting
@Component
public class ScheduledTaskMetrics {

    @Autowired
    private MeterRegistry meterRegistry;

    /** Record execution count, success/failure, and duration. */
    public void recordTaskExecution(String taskName, long duration, boolean success) {
        meterRegistry.counter("scheduled.task.executions", "task", taskName, "status", success ? "success" : "failed").increment();
        meterRegistry.timer("scheduled.task.duration", "task", taskName).record(duration, TimeUnit.MILLISECONDS);
        if (!success) {
            meterRegistry.counter("scheduled.task.failures", "task", taskName).increment();
        }
    }
}
@Component
public class ScheduledTaskMonitor {

    @Autowired
    private AlarmService alarmService;
    @Autowired
    private RedisTemplate<String, String> redisTemplate;

    /** Runs every 5 minutes as a watchdog for the daily task. */
    @Scheduled(cron = "0 */5 * * * ?")
    public void monitorTaskExecution() {
        // Assumes the mark-active-users task writes this heartbeat key after each successful run,
        // e.g. redisTemplate.opsForValue().set(taskKey, LocalDateTime.now().toString())
        String taskKey = "scheduled:task:heartbeat:mark-active-users";
        String lastExecuteTime = redisTemplate.opsForValue().get(taskKey);
        if (lastExecuteTime == null) {
            alarmService.sendAlert("Scheduled Task Alert", "Mark-active-users task may not have run.");
            return;
        }
        LocalDateTime lastTime = LocalDateTime.parse(lastExecuteTime);
        long hoursSince = ChronoUnit.HOURS.between(lastTime, LocalDateTime.now());
        if (hoursSince > 25) { // the task should run daily; 25 hours leaves some slack
            alarmService.sendAlert("Scheduled Task Alert", String.format("Mark-active-users task has not run for %d hours.", hoursSince));
        }
    }
}
Interview Deep‑Dive Q&A
Q1: Why not rely solely on the framework’s retry mechanism? – Retries are coarse‑grained and cannot provide checkpointing or idempotent guarantees; they complement, not replace, the discussed strategies.
Q2: Does idempotency hurt performance? – The overhead is minimal (O(log n) for unique‑index lookups) and can be mitigated with a Bloom filter if needed; a hedged Bloom‑filter sketch follows this list.
Q3: How to choose checkpoint granularity? – Base it on task duration: < 1 h (no checkpoint), 1‑4 h (save every 10‑30 min), > 4 h (save every 5‑10 min); a time‑based checkpoint sketch follows this list.
Q4: How to set distributed‑lock timeout? – Timeout = normal execution time × 1.5 + buffer (10‑30 min). Example: 40 min task → 80 min lock.
Q5: When to pick XXL‑JOB vs SpringTask? – Small projects (<10 tasks) → SpringTask + Redis; medium‑large projects needing visual UI, complex dependencies → XXL‑JOB.
Q6: How to ensure checkpoint reliability? – Persist checkpoints in a relational DB (not only Redis) and update them in the same transaction as the business data; a sketch follows this list.
Q7: How to avoid duplicate execution in a cluster? – Use a distributed lock (Redis or XXL‑JOB) with proper timeout and always release it in a finally block.
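For Q2 above, here is a minimal sketch of fronting the idempotency check with a Redisson Bloom filter. The filter name, capacity, false-positive rate, and the existsByIdempotentKey mapper method are illustrative assumptions, not part of the original examples. Because a Bloom filter can return false positives, it is only used to skip work it can safely skip; the unique index stays authoritative.
@Slf4j
@Service
public class BloomAssistedPointsGrantService {

    @Autowired
    private RedissonClient redissonClient;
    @Autowired
    private PointsDetailMapper pointsDetailMapper;   // assumes a hypothetical existsByIdempotentKey query
    @Autowired
    private PointsGrantService pointsGrantService;   // the idempotent service from Scenario 2

    private RBloomFilter<String> grantedFilter;

    @PostConstruct
    public void initFilter() {
        grantedFilter = redissonClient.getBloomFilter("points:granted:bloom");
        // Capacity and false-positive rate are illustrative; tryInit is a no-op if already initialized.
        grantedFilter.tryInit(200_000_000L, 0.01);
    }

    public boolean grantOnce(Long userId, int points, String grantDate) {
        String idempotentKey = grantDate + "_" + userId;
        // contains() == false means "definitely never granted": insert directly, no existence query needed.
        // contains() == true means "possibly granted": confirm with a real query so a false positive
        // never skips a legitimate grant. The unique index remains the final guard in all cases.
        if (grantedFilter.contains(idempotentKey)
                && pointsDetailMapper.existsByIdempotentKey(idempotentKey)) {
            return false; // verified as already granted, skip
        }
        boolean granted = pointsGrantService.tryGrantPoints(userId, points, grantDate);
        if (granted) {
            grantedFilter.add(idempotentKey);
        }
        return granted;
    }
}
The win shows up mainly on re-runs after a crash, where most keys are already granted and the filter lets the job avoid futile insert attempts without risking a missed grant.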
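For Q3 above, a sketch of decoupling checkpoint frequency from batch size, so long-running jobs save progress on a time interval rather than after every batch. The 10-minute interval and the upsertLastUserId helper are assumptions for illustration.
@Service
public class TimeBasedCheckpointGrantService {

    @Autowired
    private TaskCheckpointMapper checkpointMapper;

    // Illustrative interval; per Q3, pick 5 to 30 minutes depending on how long the task runs.
    private static final Duration CHECKPOINT_INTERVAL = Duration.ofMinutes(10);

    public void processAllUsers(String taskId, List<Long> userIds) {
        Instant lastSave = Instant.now();
        long lastProcessedId = 0L;
        for (Long userId : userIds) {
            processOneUser(userId);                  // idempotent business step (omitted)
            lastProcessedId = Math.max(lastProcessedId, userId);
            // Save a checkpoint at most once per interval instead of once per batch,
            // trading a little repeated work after a crash for far fewer checkpoint writes.
            if (Duration.between(lastSave, Instant.now()).compareTo(CHECKPOINT_INTERVAL) >= 0) {
                checkpointMapper.upsertLastUserId(taskId, lastProcessedId); // hypothetical upsert
                lastSave = Instant.now();
            }
        }
        checkpointMapper.upsertLastUserId(taskId, lastProcessedId); // final checkpoint before completion
    }

    private void processOneUser(Long userId) {
        // Business logic, guarded by the unique-index idempotency from Scenario 2.
    }
}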
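For Q6 above, a sketch of committing the checkpoint in the same database transaction as the batch's business writes, so a crash can never leave the checkpoint ahead of or behind the data it describes. Mapper and entity names follow the earlier examples, but upsertLastUserId is an assumed helper.
@Service
public class TransactionalBatchGrantService {

    @Autowired
    private PointsDetailMapper pointsDetailMapper;
    @Autowired
    private UserPointsMapper userPointsMapper;
    @Autowired
    private TaskCheckpointMapper checkpointMapper;

    /**
     * One batch = one transaction: the points rows, the balance updates, and the checkpoint
     * either all commit together or all roll back. Must be called through the Spring proxy
     * (e.g. from another bean) so @Transactional actually applies.
     */
    @Transactional(rollbackFor = Exception.class)
    public void grantBatch(String taskId, String grantDate, List<Long> userIds) {
        long lastUserId = 0L;
        for (Long userId : userIds) {
            PointsDetail detail = new PointsDetail();
            detail.setIdempotentKey(grantDate + "_" + userId);
            detail.setUserId(userId);
            detail.setPoints(100);
            detail.setGrantDate(grantDate);
            detail.setCreateTime(LocalDateTime.now());
            pointsDetailMapper.insert(detail);          // unique index still rejects duplicates
            userPointsMapper.increasePoints(userId, 100);
            lastUserId = Math.max(lastUserId, userId);
        }
        // Written in the same transaction as the business rows above; rolled back together on any failure.
        checkpointMapper.upsertLastUserId(taskId, lastUserId); // hypothetical upsert method
    }
}
A production version would also need to tolerate DuplicateKeyException on re-runs, for example by filtering out already-granted user IDs before building the batch, so one duplicate does not roll back the whole batch.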
Final Summary
Distinguish non‑runtime and runtime crashes. Non‑runtime crashes are solved by clustering (SpringTask + Redis lock or XXL‑JOB failover). Runtime crashes require a strategy based on business impact: simple status flags for repeatable jobs, idempotent designs for side‑effect‑ful jobs, and checkpoint‑resume for massive batches. Combining clustering, idempotency, checkpointing, monitoring, and alerting provides high availability and reliability for scheduled tasks.
💡 Non‑runtime crash → Cluster deployment (distributed lock / failover)
💡 Runtime crash + repeatable → Task status flag
💡 Runtime crash + side‑effects → Idempotent design (unique index)
💡 Massive batch → Checkpointing + sharding + idempotency
💡 Ultimate solution → Layered protection + monitoring & alerting + manual fallback