Interview Question: How to Handle a Crashed Scheduled‑Task Server? Most Miss It

When a scheduled‑task server crashes, simply restarting it is not enough: a robust solution combines clustering, distributed locks, idempotent design, checkpointing, and monitoring so that tasks resume correctly after both non‑runtime and runtime failures, as detailed below with SpringTask + Redis and XXL‑JOB implementations.

Java Companion

Interview Question

When asked "If the server that runs a scheduled task crashes, how would you solve it?" the expected answer covers fallback strategies, idempotency, and checkpointing rather than simply restarting the server.

Crash Scenarios

Non‑runtime crash: the server fails before the scheduled task starts (e.g., task scheduled at 02:00, server crashes at 01:00).

Runtime crash: the server fails while the task is executing (e.g., power loss or process crash halfway through).

Non‑runtime Crash – Remove Single Point of Failure

Deploy the task in a cluster so any node can acquire a lock and run the job.

Solution 1 – SpringTask + Redis Distributed Lock

Each node attempts to execute the task, but a Redis lock guarantees that only one node actually runs the business logic. If a node crashes, another node can still obtain the lock.

@Slf4j // Lombok supplies the 'log' field used below
@Component
public class UserActiveScheduledTask {
    @Autowired
    private RedissonClient redissonClient;
    @Autowired
    private UserService userService;

    /** Executes every day at 02:00. */
    @Scheduled(cron = "0 0 2 * * ?")
    public void markActiveUsers() {
        String lockKey = "scheduled:task:mark-active-users";
        RLock lock = redissonClient.getLock(lockKey);
        try {
            // tryLock(waitTime=0, leaseTime=60 minutes)
            boolean isLocked = lock.tryLock(0, 60, TimeUnit.MINUTES);
            if (!isLocked) {
                log.info("Failed to acquire distributed lock, task will be executed by another server");
                return;
            }
            log.info("Successfully acquired lock, executing scheduled task");
            userService.markActiveUsers();
        } catch (InterruptedException e) {
            log.error("Exception while acquiring lock", e);
            Thread.currentThread().interrupt();
        } finally {
            if (lock.isHeldByCurrentThread()) {
                lock.unlock();
                log.info("Lock released");
            }
        }
    }
}

Solution 2 – XXL‑JOB Failover

XXL‑JOB is a distributed scheduler. The admin node decides which executor runs the task; if the chosen executor does not respond, XXL‑JOB automatically fails over to another executor.

@Component
public class MarkActiveUsersJob {
    @Autowired
    private UserService userService;

    @XxlJob("markActiveUsersHandler")
    public void execute() {
        XxlJobHelper.log("Start executing mark‑active‑users task");
        try {
            userService.markActiveUsers();
            XxlJobHelper.log("Task executed successfully");
        } catch (Exception e) {
            XxlJobHelper.log("Task failed: " + e.getMessage());
            XxlJobHelper.handleFail("Task execution exception");
        }
    }
}

# application.yml (partial)
xxl:
  job:
    admin:
      addresses: http://xxl-job-admin:8080/xxl-job-admin
    executor:
      appname: user-service
      port: 9999

Runtime Crash – Strategy Depends on Business Characteristics

Scenario 1 – Mark Active Users (Repeatable)

The operation can be repeated without side effects. A task_status table records the task state.

@Slf4j // Lombok supplies the 'log' field used below
@Service
public class UserActiveService {
    @Autowired
    private TaskStatusMapper taskStatusMapper;
    @Autowired
    private UserMapper userMapper;

    // Intentionally no wrapping @Transactional: the FAILED status written in the
    // catch block below must survive the rethrow (a single transaction would roll
    // it back together with the business updates), and the per-user updates are
    // repeatable, so partial progress is safe.
    public void markActiveUsers() {
        String taskDate = LocalDate.now().minusDays(1).toString();
        String taskId = "mark_active_users_" + taskDate;
        // check if already completed
        TaskStatus taskStatus = taskStatusMapper.selectByTaskId(taskId);
        if (taskStatus != null && "COMPLETED".equals(taskStatus.getStatus())) {
            log.info("Task already completed, skip: {}", taskId);
            return;
        }
        // mark as running
        if (taskStatus == null) {
            taskStatus = new TaskStatus();
            taskStatus.setTaskId(taskId);
            taskStatus.setStatus("RUNNING");
            taskStatus.setStartTime(LocalDateTime.now());
            taskStatusMapper.insert(taskStatus);
        } else {
            taskStatus.setStatus("RUNNING");
            taskStatusMapper.updateById(taskStatus);
        }
        try {
            List<Long> userIds = userMapper.selectOrderUsersByDate(taskDate);
            for (Long userId : userIds) {
                userMapper.updateUserActive(userId, true);
            }
            taskStatus.setStatus("COMPLETED");
            taskStatus.setEndTime(LocalDateTime.now());
            taskStatusMapper.updateById(taskStatus);
        } catch (Exception e) {
            taskStatus.setStatus("FAILED");
            taskStatus.setErrorMsg(e.getMessage());
            taskStatusMapper.updateById(taskStatus);
            throw e;
        }
    }
}

# task_status table (simplified)
CREATE TABLE `task_status` (
  `id` BIGINT NOT NULL AUTO_INCREMENT,
  `task_id` VARCHAR(100) NOT NULL COMMENT 'Task ID',
  `status` VARCHAR(20) NOT NULL COMMENT 'RUNNING/COMPLETED/FAILED',
  `start_time` DATETIME COMMENT 'Start time',
  `end_time` DATETIME COMMENT 'End time',
  `error_msg` TEXT COMMENT 'Error message',
  PRIMARY KEY (`id`),
  UNIQUE KEY `uk_task_id` (`task_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

Scenario 2 – Grant Points (Non‑Idempotent)

Repeating the task would double‑grant points, causing financial loss. Idempotency is enforced via a unique index on an idempotent‑key column of the form date_userId.

@Slf4j // Lombok supplies the 'log' field used below
@Service
public class PointsGrantService {
    @Autowired
    private PointsDetailMapper pointsDetailMapper;
    @Autowired
    private UserPointsMapper userPointsMapper;

    public void grantPointsToOrderUsers() {
        String grantDate = LocalDate.now().minusDays(1).toString();
        List<UserOrderInfo> orderUsers = queryOrderUsers(grantDate);
        int successCount = 0;
        int skipCount = 0;
        for (UserOrderInfo userOrder : orderUsers) {
            try {
                int points = (int) (userOrder.getOrderAmount() * 0.01);
                boolean granted = tryGrantPoints(userOrder.getUserId(), points, grantDate);
                if (granted) {
                    successCount++;
                } else {
                    skipCount++;
                }
            } catch (Exception e) {
                log.error("Failed to grant points, userId: {}", userOrder.getUserId(), e);
            }
        }
        log.info("Points grant completed, success: {}, skip: {}", successCount, skipCount);
    }

    // NOTE: @Transactional only takes effect when this method is invoked through
    // the Spring proxy; the direct call from grantPointsToOrderUsers() above is a
    // self-invocation that bypasses it. Move the method to a separate bean or
    // self-inject the proxy to make the transaction effective.
    @Transactional(rollbackFor = Exception.class)
    public boolean tryGrantPoints(Long userId, int points, String grantDate) {
        String idempotentKey = grantDate + "_" + userId;
        try {
            PointsDetail detail = new PointsDetail();
            detail.setIdempotentKey(idempotentKey);
            detail.setUserId(userId);
            detail.setPoints(points);
            detail.setGrantDate(grantDate);
            detail.setCreateTime(LocalDateTime.now());
            pointsDetailMapper.insert(detail);
            userPointsMapper.increasePoints(userId, points);
            return true;
        } catch (DuplicateKeyException e) {
            log.info("Points already granted, skip, userId: {}", userId);
            return false;
        }
    }
}

# points_detail table (simplified)
CREATE TABLE `points_detail` (
  `id` BIGINT NOT NULL AUTO_INCREMENT,
  `idempotent_key` VARCHAR(100) NOT NULL COMMENT 'date_userId',
  `user_id` BIGINT NOT NULL COMMENT 'User ID',
  `points` INT NOT NULL COMMENT 'Points',
  `grant_date` VARCHAR(20) NOT NULL COMMENT 'Grant date',
  `create_time` DATETIME NOT NULL COMMENT 'Create time',
  PRIMARY KEY (`id`),
  UNIQUE KEY `uk_idempotent_key` (`idempotent_key`),
  KEY `idx_user_id` (`user_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

Scenario 3 – 200 Million Users (Checkpoint‑Resume)

For massive batches, a checkpoint‑resume mechanism processes users in ascending ID order, records the maximum processed ID after each batch, and resumes from that ID after a failure.

@Slf4j // Lombok supplies the 'log' field used below
@Service
public class MassPointsGrantService {
    @Autowired
    private PointsDetailMapper pointsDetailMapper;
    @Autowired
    private UserMapper userMapper;
    @Autowired
    private TaskCheckpointMapper checkpointMapper;
    private static final int BATCH_SIZE = 1000;

    public void grantPointsToAllUsers() {
        String grantDate = LocalDate.now().minusDays(1).toString();
        String taskId = "grant_points_all_users_" + grantDate;
        Long startUserId = findCheckpoint(taskId);
        log.info("Start batch grant, startUserId: {}", startUserId);
        long processedCount = 0;
        Long currentMaxUserId = startUserId;
        while (true) {
            List<User> users = userMapper.selectBatchByIdGreaterThan(currentMaxUserId, BATCH_SIZE);
            if (users.isEmpty()) {
                log.info("All users processed, total: {}", processedCount);
                break;
            }
            for (User user : users) {
                try {
                    int points = 100; // each user gets 100 points
                    tryGrantPoints(user.getId(), points, grantDate); // idempotent grant, same pattern as PointsGrantService.tryGrantPoints above
                    currentMaxUserId = Math.max(currentMaxUserId, user.getId());
                } catch (Exception e) {
                    log.error("Failed to grant points, userId: {}", user.getId(), e);
                }
            }
            processedCount += users.size();
            saveCheckpoint(taskId, currentMaxUserId, grantDate);
            log.info("Batch completed, processed: {}, currentMaxUserId: {}", processedCount, currentMaxUserId);
        }
        clearCheckpoint(taskId);
    }

    private Long findCheckpoint(String taskId) {
        TaskCheckpoint checkpoint = checkpointMapper.selectByTaskId(taskId);
        if (checkpoint != null) {
            log.info("Found checkpoint, continue from userId {}", checkpoint.getLastUserId());
            return checkpoint.getLastUserId();
        }
        Long maxUserId = pointsDetailMapper.selectMaxUserIdByDate(LocalDate.now().minusDays(1).toString());
        return maxUserId != null ? maxUserId : 0L;
    }

    // No @Transactional here: Spring cannot proxy private methods, and this is a
    // self-invocation anyway, so the annotation would be silently ignored. The
    // checkpoint write is a single INSERT/UPDATE and is atomic on its own.
    private void saveCheckpoint(String taskId, Long lastUserId, String grantDate) {
        TaskCheckpoint checkpoint = checkpointMapper.selectByTaskId(taskId);
        if (checkpoint == null) {
            checkpoint = new TaskCheckpoint();
            checkpoint.setTaskId(taskId);
            checkpoint.setGrantDate(grantDate);
            checkpoint.setLastUserId(lastUserId);
            checkpoint.setUpdateTime(LocalDateTime.now());
            checkpointMapper.insert(checkpoint);
        } else {
            checkpoint.setLastUserId(lastUserId);
            checkpoint.setUpdateTime(LocalDateTime.now());
            checkpointMapper.updateById(checkpoint);
        }
    }

    private void clearCheckpoint(String taskId) {
        checkpointMapper.deleteByTaskId(taskId);
        log.info("Task completed, checkpoint cleared: {}", taskId);
    }
}
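The loop above relies on selectBatchByIdGreaterThan, i.e. keyset pagination (the mapper SQL is not shown in the original; it is presumably of the form WHERE id > #{lastId} ORDER BY id LIMIT #{batchSize}). A minimal plain‑Java model of that scan, illustrating why resuming from the maximum processed ID skips no users and repeats none:

```java
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;
import java.util.stream.Collectors;

public class KeysetScan {
    /** Pure-Java model of: SELECT id FROM user WHERE id > lastId ORDER BY id LIMIT batchSize */
    static List<Long> nextBatch(NavigableSet<Long> allIds, long lastId, int batchSize) {
        return allIds.tailSet(lastId, false).stream() // strictly greater than lastId, ascending
                .limit(batchSize)
                .collect(Collectors.toList());
    }
}
```

Calling nextBatch with lastId set to the checkpoint reproduces exactly the remaining batches, which is what makes checkpoint‑resume correct for ascending‑ID scans.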

# task_checkpoint table (simplified)
CREATE TABLE `task_checkpoint` (
  `id` BIGINT NOT NULL AUTO_INCREMENT,
  `task_id` VARCHAR(100) NOT NULL COMMENT 'Task ID',
  `last_user_id` BIGINT NOT NULL COMMENT 'Last processed user ID',
  `grant_date` VARCHAR(20) NOT NULL COMMENT 'Grant date',
  `update_time` DATETIME NOT NULL COMMENT 'Update time',
  PRIMARY KEY (`id`),
  UNIQUE KEY `uk_task_id` (`task_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

Timing comparison: suppose the full batch takes 19 hours and the server crashes near the end. Without checkpointing the entire run must be repeated (19 h + 19 h = 38 h); with checkpointing the task resumes from the last checkpoint and finishes in roughly one more hour (19 h + 1 h = 20 h).

Decision Summary

Non‑runtime crash → Cluster deployment, distributed lock or XXL‑JOB failover.

Repeatable runtime task → Status flag table.

Non‑repeatable runtime task → Idempotent design with unique index.

Large‑scale batch → Checkpointing + idempotency.

Best‑Practice Template

/**
 * Scheduled‑task best‑practice template
 * Integrates distributed lock, status management, checkpointing, and idempotency.
 */
@Slf4j // Lombok supplies the 'log' field used below
@Component
public class BestPracticeScheduledTask {
    @Autowired
    private RedissonClient redissonClient;
    @Autowired
    private TaskStatusMapper taskStatusMapper;
    @Autowired
    private TaskCheckpointMapper checkpointMapper;

    @Scheduled(cron = "0 0 2 * * ?")
    public void execute() {
        String lockKey = "scheduled:task:best-practice";
        RLock lock = redissonClient.getLock(lockKey);
        try {
            if (!lock.tryLock(0, 60, TimeUnit.MINUTES)) {
                log.info("Lock not acquired, another server will run the task");
                return;
            }
            if (isTaskCompleted()) {
                log.info("Task already completed, skip");
                return;
            }
            markTaskRunning();
            Long checkpoint = findCheckpoint();
            log.info("Resume from checkpoint: {}", checkpoint);
            executeBusiness(checkpoint);
            markTaskCompleted();
            clearCheckpoint();
            log.info("Scheduled task executed successfully");
        } catch (Exception e) {
            log.error("Scheduled task exception", e);
            markTaskFailed(e.getMessage());
            sendAlert(e);
        } finally {
            if (lock.isHeldByCurrentThread()) {
                lock.unlock();
            }
        }
    }

    private boolean isTaskCompleted() {
        String taskId = generateTaskId();
        TaskStatus status = taskStatusMapper.selectByTaskId(taskId);
        return status != null && "COMPLETED".equals(status.getStatus());
    }

    private void markTaskRunning() {
        String taskId = generateTaskId();
        TaskStatus status = new TaskStatus();
        status.setTaskId(taskId);
        status.setStatus("RUNNING");
        status.setStartTime(LocalDateTime.now());
        taskStatusMapper.insertOrUpdate(status);
    }

    private void markTaskCompleted() {
        String taskId = generateTaskId();
        taskStatusMapper.updateStatus(taskId, "COMPLETED", LocalDateTime.now());
    }

    private void markTaskFailed(String errorMsg) {
        String taskId = generateTaskId();
        taskStatusMapper.updateStatusWithError(taskId, "FAILED", errorMsg);
    }

    private Long findCheckpoint() {
        String taskId = generateTaskId();
        TaskCheckpoint cp = checkpointMapper.selectByTaskId(taskId);
        return cp != null ? cp.getLastProcessedId() : 0L;
    }

    private void clearCheckpoint() {
        String taskId = generateTaskId();
        checkpointMapper.deleteByTaskId(taskId);
    }

    private void executeBusiness(Long checkpoint) {
        // Implement business logic here, ensuring idempotency.
    }

    private void sendAlert(Exception e) {
        // Send alert to DingTalk, email, etc.
    }

    private String generateTaskId() {
        return "task_" + LocalDate.now().toString();
    }
}

Monitoring & Alerting

@Component
public class ScheduledTaskMetrics {
    @Autowired
    private MeterRegistry meterRegistry;

    /** Record execution count, success/failure and duration */
    public void recordTaskExecution(String taskName, long duration, boolean success) {
        meterRegistry.counter("scheduled.task.executions", "task", taskName, "status", success ? "success" : "failed").increment();
        meterRegistry.timer("scheduled.task.duration", "task", taskName).record(duration, TimeUnit.MILLISECONDS);
        if (!success) {
            meterRegistry.counter("scheduled.task.failures", "task", taskName).increment();
        }
    }
}
@Component
public class ScheduledTaskMonitor {
    @Autowired
    private AlarmService alarmService;
    @Autowired
    private RedisTemplate<String, String> redisTemplate;

    @Scheduled(cron = "0 */5 * * * ?")
    public void monitorTaskExecution() {
        String taskKey = "scheduled:task:heartbeat:mark-active-users";
        String lastExecuteTime = redisTemplate.opsForValue().get(taskKey);
        if (lastExecuteTime == null) {
            alarmService.sendAlert("Scheduled Task Alert", "Mark‑active‑users task may not have run.");
            return;
        }
        LocalDateTime lastTime = LocalDateTime.parse(lastExecuteTime);
        long hoursSince = ChronoUnit.HOURS.between(lastTime, LocalDateTime.now());
        if (hoursSince > 25) { // task should run daily
            alarmService.sendAlert("Scheduled Task Alert", String.format("Mark‑active‑users task has not run for %d hours.", hoursSince));
        }
    }
}
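The monitor above reads a heartbeat key that the task itself must write after each successful run; the original does not show that write. A sketch of the producing side (the key name is taken from the monitor; the RedisTemplate call is the assumed counterpart), with the timestamp formatted so that LocalDateTime.parse in the monitor accepts it:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class TaskHeartbeat {
    static final String HEARTBEAT_KEY = "scheduled:task:heartbeat:mark-active-users";

    /** ISO-8601 local date-time, the default format LocalDateTime.parse() expects. */
    static String heartbeatValue(LocalDateTime now) {
        return now.format(DateTimeFormatter.ISO_LOCAL_DATE_TIME);
    }

    // In the scheduled task, after a successful run:
    // redisTemplate.opsForValue().set(HEARTBEAT_KEY, heartbeatValue(LocalDateTime.now()));
}
```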

Interview Deep‑Dive Q&A

Q1: Why not rely solely on the framework’s retry mechanism? – Retries are coarse‑grained and cannot provide checkpointing or idempotent guarantees; they complement, not replace, the discussed strategies.

Q2: Does idempotency hurt performance? – The overhead is minimal (O(log n) for unique‑index lookups) and can be mitigated with Bloom filters if needed.
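For the Bloom‑filter mitigation mentioned in Q2, here is a toy in‑memory sketch; the class and hashing scheme are illustrative, and a production system would typically use a shared filter such as Redisson's RBloomFilter backed by Redis:

```java
import java.util.BitSet;

/** Toy Bloom filter: a cheap "maybe already granted?" pre-check. */
class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    SimpleBloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    void add(String key) {
        for (int i = 0; i < hashes; i++) bits.set(index(key, i));
    }

    /** false => definitely not granted yet; true => maybe granted, fall back to the DB check. */
    boolean mightContain(String key) {
        for (int i = 0; i < hashes; i++) if (!bits.get(index(key, i))) return false;
        return true;
    }

    private int index(String key, int seed) {
        int h = key.hashCode() * 31 + seed * 0x9E3779B9; // simple seeded mix
        return Math.floorMod(h, size);
    }
}
```

A false from mightContain safely skips the insert attempt on a re-run; a true still falls through to the unique‑index check, so correctness never depends on the filter.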

Q3: How to choose checkpoint granularity? – Base it on task duration: < 1 h (no checkpoint), 1‑4 h (save every 10‑30 min), > 4 h (save every 5‑10 min).

Q4: How to set distributed‑lock timeout? – Timeout = normal execution time × 1.5 + buffer (10‑30 min). Example: 40 min task → 80 min lock.
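The Q4 rule of thumb expressed as a tiny helper (the method name and buffer choice are illustrative, not from the original):

```java
import java.time.Duration;

public class LockLease {
    /** leaseTime = normal execution time x 1.5 + buffer */
    static Duration leaseTime(Duration normalExecution, Duration buffer) {
        return Duration.ofMillis((long) (normalExecution.toMillis() * 1.5)).plus(buffer);
    }
}
```

With a 40‑minute task and a 20‑minute buffer this yields the 80‑minute lease from the example.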

Q5: When to pick XXL‑JOB vs SpringTask? – Small projects (<10 tasks) → SpringTask + Redis; medium‑large projects needing visual UI, complex dependencies → XXL‑JOB.

Q6: How to ensure checkpoint reliability? – Persist checkpoints in a relational DB (not only Redis) and update them in the same transaction as business data.

Q7: How to avoid duplicate execution in a cluster? – Use a distributed lock (Redis or XXL‑JOB) with proper timeout and always release it in a finally block.

Final Summary

Distinguish non‑runtime and runtime crashes. Non‑runtime crashes are solved by clustering (SpringTask + Redis lock or XXL‑JOB failover). Runtime crashes require a strategy based on business impact: simple status flags for repeatable jobs, idempotent designs for side‑effect‑ful jobs, and checkpoint‑resume for massive batches. Combining clustering, idempotency, checkpointing, monitoring, and alerting provides high availability and reliability for scheduled tasks.

💡 Non‑runtime crash → Cluster deployment (distributed lock / failover)
💡 Runtime crash + repeatable → Task status flag
💡 Runtime crash + side‑effects → Idempotent design (unique index)
💡 Massive batch → Checkpointing + sharding + idempotency
💡 Ultimate solution → Layered protection + monitoring & alerting + manual fallback