7 Proven Retry Strategies to Keep Your System Running Smoothly

This article explores seven practical retry solutions—from simple loops and Spring Retry to Resilience4j, message queues, scheduled tasks, two‑phase commits, and distributed locks—explaining their scenarios, core code, and how they prevent costly system failures.

Su San Talks Tech

Introduction

Five years ago a refund API on an e‑commerce platform repeatedly failed due to a bank network glitch; a naive retry loop invoked the bank interface 82 times, causing duplicate refunds and losses of over a million dollars. The boss questioned why such a basic retry caused a disaster, highlighting the need for proper retry mechanisms.

This article discusses seven commonly used retry solutions to help you avoid similar pitfalls.

1. Brute-Force Loop

Problem scenario

An intern wrote a user‑registration SMS sending method that repeatedly called a third‑party SMS API inside a while loop.

Code

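// Anti-pattern, shown as written apart from the throws clause needed to compile
public void sendSms(String phone) throws InterruptedException {
    int retry = 0;
    while (retry < 5) { // blind loop: retries on every failure
        try {
            smsClient.send(phone);
            break;
        } catch (Exception e) { // catches everything, including non-retryable errors
            retry++;
            Thread.sleep(1000); // fixed 1-second sleep, no jitter
        }
    }
}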

Incident

When the SMS server was overloaded and delayed responses by 3 seconds, the loop generated tens of thousands of retries within 0.5 seconds, overwhelming the SMS platform and triggering circuit‑breaker bans that also blocked normal requests.

Lesson

Don’t use a fixed delay: a constant interval causes request bursts.

Don’t ignore exception types: non-transient errors (e.g., parameter errors) are retried unnecessarily.

Fix: add random back-off intervals and filter non-retryable exceptions.
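
The fix described above can be sketched in plain Java. This is a minimal illustration, not the code used in the incident: the class name BackoffRetry, its parameters, and the choice of picking a random delay in the upper half of an exponentially growing window are all assumptions made for the example.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical helper: exponential back-off with random jitter and an exception filter.
final class BackoffRetry {

    /** Delay before retry number `attempt` (1-based): random value in [exp/2, exp], exp = min(cap, base * 2^(attempt-1)). */
    static long jitteredDelayMillis(int attempt, long baseMillis, long capMillis) {
        long exp = Math.min(capMillis, baseMillis << Math.min(attempt - 1, 20));
        return ThreadLocalRandom.current().nextLong(exp / 2, exp + 1);
    }

    /** Runs the task up to maxAttempts times, retrying only exceptions accepted by `retryable`. */
    static <T> T call(Supplier<T> task, int maxAttempts, long baseMillis, long capMillis,
                      Predicate<RuntimeException> retryable) {
        for (int attempt = 1; ; attempt++) {
            try {
                return task.get();
            } catch (RuntimeException e) {
                if (!retryable.test(e) || attempt >= maxAttempts) {
                    throw e; // non-retryable error, or attempts exhausted
                }
                sleep(jitteredDelayMillis(attempt, baseMillis, capMillis));
            }
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            throw new IllegalStateException("interrupted while waiting to retry", ie);
        }
    }
}
```

Because each caller draws its own random delay, concurrent callers that failed at the same moment no longer retry in lockstep, and a parameter error fails on the first attempt instead of being retried.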

2. Spring Retry

Use case

Suitable for small to medium projects; annotations quickly enable basic retry and circuit‑breaker behavior (e.g., order‑status queries).

Configuration example

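@Retryable(
    value = {TimeoutException.class}, // only retry timeouts (named retryFor in Spring Retry 2.x)
    maxAttempts = 3, // 1 initial call + 2 retries
    backoff = @Backoff(delay = 1000, multiplier = 2) // waits: 1s, then 2s
)
public boolean queryOrderStatus(String orderId) throws TimeoutException {
    return httpClient.get("/order/" + orderId);
}

@Recover // fallback once all attempts fail; return type must match the retryable method
public boolean fallback(TimeoutException e, String orderId) {
    return false;
}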

Advantages

Declarative annotation: clean code, decoupled from business logic.

Exponential back-off: automatically lengthens retry intervals.

Circuit-breaker integration: combine with @CircuitBreaker to quickly stop error traffic.
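
To make the exponential back-off concrete, the following plain-Java sketch computes the waits produced by a given attempt limit, initial delay, and multiplier. BackoffSchedule is a hypothetical helper written for this illustration and is not part of Spring Retry; it only mirrors the arithmetic.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that mirrors the back-off arithmetic; not a Spring Retry class.
final class BackoffSchedule {

    /** Waits in milliseconds between attempts: maxAttempts tries leave maxAttempts - 1 gaps. */
    static List<Long> waits(int maxAttempts, long delayMillis, double multiplier) {
        List<Long> waits = new ArrayList<>();
        double next = delayMillis;
        for (int gap = 1; gap < maxAttempts; gap++) {
            waits.add((long) next);
            next *= multiplier; // each wait is the previous one times the multiplier
        }
        return waits;
    }
}
```

With maxAttempts = 3, delay = 1000 and multiplier = 2 there are only two waits, 1000 ms and 2000 ms; a 4000 ms wait would appear only with a fourth attempt.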

3. Resilience4j

Advanced scenario

For medium‑to‑large systems that need custom back‑off algorithms, circuit‑breaker policies, and multi‑layer protection (e.g., payment core services).

Core code

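import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.core.IntervalFunction;
import io.github.resilience4j.decorators.Decorators;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// 1. Retry config: exponential back-off + random jitter
RetryConfig retryConfig = RetryConfig.custom()
    .maxAttempts(3)
    .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(
        1000L, 2.0, 0.3)) // initial 1s, multiplier 2, randomization factor 0.3
    .retryOnException(e -> e instanceof TimeoutException)
    .build();

// 2. Circuit-breaker config: open when the failure rate over the last 10 calls reaches 50%
CircuitBreakerConfig cbConfig = CircuitBreakerConfig.custom()
    .slidingWindow(10, 10, CircuitBreakerConfig.SlidingWindowType.COUNT_BASED) // window size, minimum calls
    .failureRateThreshold(50)
    .build();

// 3. Combine usage
Retry retry = Retry.of("payment", retryConfig);
CircuitBreaker cb = CircuitBreaker.of("payment", cbConfig);
Supplier<Boolean> supplier = () -> paymentService.pay();
// Decorators wrap in call order, so the circuit breaker added last is the outer layer
// and records one result per retried sequence; swap the two calls to count every attempt.
Supplier<Boolean> decorated = Decorators.ofSupplier(supplier)
    .withRetry(retry)
    .withCircuitBreaker(cb)
    .decorate();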
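
As a rough illustration of what a count-based sliding window with a failure-rate threshold does, here is a simplified plain-Java sketch. CountBasedBreaker is a hypothetical class, not Resilience4j's implementation: it omits the half-open state and the wait duration, and only shows how the last N outcomes decide whether the breaker is open.

```java
// Hypothetical, simplified model of a count-based circuit breaker; not Resilience4j code.
final class CountBasedBreaker {
    private final boolean[] failures;     // ring buffer holding the last N outcomes
    private final int minimumCalls;       // no decision until this many calls are recorded
    private final double thresholdPercent;
    private long recorded;                // total number of calls recorded so far
    private int failureCount;             // failures currently inside the window

    CountBasedBreaker(int windowSize, int minimumCalls, double thresholdPercent) {
        this.failures = new boolean[windowSize];
        this.minimumCalls = minimumCalls;
        this.thresholdPercent = thresholdPercent;
    }

    synchronized void record(boolean failed) {
        int slot = (int) (recorded % failures.length);
        if (recorded >= failures.length && failures[slot]) {
            failureCount--; // the oldest outcome leaves the window
        }
        failures[slot] = failed;
        if (failed) {
            failureCount++;
        }
        recorded++;
    }

    /** Open when enough calls were seen and the failure rate reaches the threshold. */
    synchronized boolean isOpen() {
        long inWindow = Math.min(recorded, failures.length);
        if (inWindow < minimumCalls) {
            return false;
        }
        return 100.0 * failureCount / inWindow >= thresholdPercent;
    }
}
```

With a window of 10, a minimum of 10 calls and a 50% threshold, nine failures in a row do not open the breaker yet; the tenth does.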

Effect

After deploying this solution, a large e-commerce platform saw a 60% reduction in timeout rates and a near-90% drop in circuit-breaker trigger frequency.

4. MQ Queue

Applicable scenario

High-concurrency, asynchronous scenarios that can tolerate delay, such as logistics status synchronization.

Implementation principle

On first failure, push the message to a delay queue.

The queue retries consumption after a preset delay (e.g., 5 s, 30 s, 1 min).

If the maximum retry count is reached, move the message to a dead‑letter queue for manual handling.
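
The three steps above boil down to one decision per failure: wait for the next delay, or give up and dead-letter the message. A minimal plain-Java sketch of that decision follows; RedeliveryPlan and its method names are hypothetical, and the delay values are passed in by the caller.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical helper: maps the number of failed attempts to the next delay or to "dead-letter".
final class RedeliveryPlan {
    private final List<Duration> delays;

    RedeliveryPlan(List<Duration> delays) {
        this.delays = new ArrayList<>(delays); // defensive copy
    }

    /**
     * @param failedAttempts how many times consumption has failed so far (1 or more)
     * @return the delay before the next attempt, or empty when the message
     *         should go to the dead-letter queue
     */
    Optional<Duration> nextDelay(int failedAttempts) {
        if (failedAttempts < 1) {
            throw new IllegalArgumentException("failedAttempts must be >= 1");
        }
        return failedAttempts <= delays.size()
                ? Optional.of(delays.get(failedAttempts - 1))
                : Optional.empty();
    }
}
```

A plan built from 5 s, 30 s and 1 min yields those delays for the first three failures and an empty result on the fourth, which is the signal to route the message to the dead-letter queue.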

RocketMQ code

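// Producer sends a delayed message (rocketmq-spring; Message and MessageBuilder come from spring-messaging)
Message<String> message = MessageBuilder.withPayload("order data").build();
// arguments: destination, message, send timeout in ms, delay level (3 = 10 s in RocketMQ's default table)
rocketMQTemplate.syncSend("DELAY_TOPIC", message, 3000, 3);

// Consumer retries
@Component
@RocketMQMessageListener(topic = "DELAY_TOPIC",
        consumerGroup = "delay-consumer-group") // consumerGroup is required; this name is a placeholder
public class DelayConsumer implements RocketMQListener<MessageExt> {
    @Override
    public void onMessage(MessageExt message) {
        try {
            syncLogistics(message);
        } catch (Exception e) {
            // "retryCount" is a hypothetical user property that resendWithDelay is assumed to set
            String prop = message.getUserProperty("retryCount");
            int retryCount = (prop == null) ? 0 : Integer.parseInt(prop);
            // retry count +1 and resend with higher delay level
            resendWithDelay(message, retryCount + 1);
        }
    }
}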

5. Scheduled Task

Applicable scenario

Tasks that don’t require real‑time feedback and can be processed in batches, such as file imports.

Example with Spring's @Scheduled (Quartz-style cron expression)

@Scheduled(cron = "0 0/5 * * * ?") // every 5 minutes
public void retryFailedTasks() {
    List<FailedTask> list = failedTaskDao.listUnprocessed(5); // fetch failed tasks
    list.forEach(task -> {
        try {
            retryTask(task);
            task.markSuccess();
        } catch (Exception e) {
            task.incrRetryCount();
        }
        failedTaskDao.update(task);
    });
}
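
One thing a periodic sweep usually needs is a cap on retries, so that a permanently failing task does not get picked up forever. The in-memory sketch below illustrates that idea under stated assumptions: FailedTaskSweeper, its Task and Status types, and the GIVEN_UP state are hypothetical and stand in for the DAO and task entity.

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical in-memory model of a retry sweep with a retry cap.
final class FailedTaskSweeper {
    enum Status { PENDING, SUCCESS, GIVEN_UP }

    static final class Task {
        final String id;
        int retryCount;
        Status status = Status.PENDING;

        Task(String id) {
            this.id = id;
        }
    }

    /** Retries every pending task once; marks a task GIVEN_UP after maxRetries failures. */
    static void sweep(List<Task> tasks, Consumer<Task> action, int maxRetries) {
        for (Task task : tasks) {
            if (task.status != Status.PENDING) {
                continue; // already finished or abandoned
            }
            try {
                action.accept(task);
                task.status = Status.SUCCESS;
            } catch (RuntimeException e) {
                task.retryCount++;
                if (task.retryCount >= maxRetries) {
                    task.status = Status.GIVEN_UP; // stop retrying; needs manual follow-up
                }
            }
        }
    }
}
```

In a real job the sweep would run on the schedule, and the status and retry count would be persisted after each task so that a crash mid-sweep does not lose progress.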

6. Two‑Phase Commit

Applicable scenario

Strict data‑consistency requirements, such as financial transfers.

Key implementation

Phase 1: Record the operation in the database with status “in progress”.

Phase 2: Call the remote service and update the record status based on the result.

Compensation: Scan timed‑out “in‑progress” records and retry them.

Code

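// Deliberately not @Transactional: the PENDING record must be committed before the
// remote call. Inside one transaction, an exception from the bank call would roll the
// record back and the compensation scan would have nothing to retry.
public void transfer(TransferRequest req) {
    // 1. Phase 1: record the transaction (committed on its own)
    transferRecordDao.create(req, PENDING);

    // 2. Phase 2: call bank API
    boolean success;
    try {
        success = bankClient.transfer(req);
    } catch (Exception e) {
        // Outcome unknown (e.g., timeout): leave the record PENDING for the compensation scan
        return;
    }

    // 3. Update record status
    transferRecordDao.updateStatus(req.getId(), success ? SUCCESS : FAILED);

    // 4. If failed, send to async retry queue
    if (!success) {
        mqTemplate.send("TRANSFER_RETRY_QUEUE", req);
    }
}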
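
The compensation step, scanning for timed-out "in progress" records, can be sketched as a simple filter. The example below works on an in-memory list; PendingTransferScanner, TransferRecord, and the createdAt field are hypothetical names chosen for the illustration, and a real system would run the equivalent query against the database.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical in-memory model of the compensation scan.
final class PendingTransferScanner {
    enum Status { PENDING, SUCCESS, FAILED }

    static final class TransferRecord {
        final String id;
        final Status status;
        final Instant createdAt;

        TransferRecord(String id, Status status, Instant createdAt) {
            this.id = id;
            this.status = status;
            this.createdAt = createdAt;
        }
    }

    /** Returns PENDING records created more than `timeout` before `now`. */
    static List<TransferRecord> findTimedOut(List<TransferRecord> records, Instant now, Duration timeout) {
        Instant cutoff = now.minus(timeout);
        return records.stream()
                .filter(r -> r.status == Status.PENDING)
                .filter(r -> r.createdAt.isBefore(cutoff))
                .collect(Collectors.toList());
    }
}
```

Retrying a record found this way is only safe if the remote call is idempotent or its outcome can be queried first; otherwise a transfer that actually succeeded but timed out on the response would be executed twice.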

7. Distributed Lock

Applicable scenario

Prevent duplicate submissions in multi‑instance, multi‑thread environments such as flash‑sale systems.

Redis + Lua example

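// Assumes `redis` is a Spring Data Redis StringRedisTemplate; the original does not name the client.
private static final String UNLOCK_LUA =
    "if redis.call('get', KEYS[1]) == ARGV[1] then return redis.call('del', KEYS[1]) else return 0 end";

public boolean retryWithLock(String key, int maxRetry) throws InterruptedException {
    String lockKey = "api_retry_lock:" + key;
    String token = UUID.randomUUID().toString(); // identifies this lock holder
    for (int i = 0; i < maxRetry; i++) {
        // try to acquire distributed lock: set only if absent, auto-expire after 30 s
        if (Boolean.TRUE.equals(redis.opsForValue().setIfAbsent(lockKey, token, 30, TimeUnit.SECONDS))) {
            try {
                return callApi();
            } finally {
                // Lua compare-and-delete: release only if this caller still owns the lock,
                // so an expired holder cannot delete a lock acquired by someone else
                redis.execute(new DefaultRedisScript<>(UNLOCK_LUA, Long.class),
                        Collections.singletonList(lockKey), token);
            }
        }
        Thread.sleep(1000L * (i + 1)); // wait longer before each next attempt
    }
    return false;
}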
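
Why the lock value should identify its holder is easiest to see in a small in-memory model. The sketch below is an assumption-laden illustration rather than Redis code: TokenLockStore is a hypothetical class in which putIfAbsent plays the role of a set-if-absent command, remove(key, token) plays the role of a compare-and-delete script, and expire simulates the TTL firing while the holder is still working.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical in-memory stand-in for a lock store; illustrates owner tokens only.
final class TokenLockStore {
    private final ConcurrentMap<String, String> locks = new ConcurrentHashMap<>();

    /** Succeeds only if nobody currently holds the key. */
    boolean tryAcquire(String key, String token) {
        return locks.putIfAbsent(key, token) == null;
    }

    /** Compare-and-delete: removes the key only if it still holds this token. */
    boolean release(String key, String token) {
        return locks.remove(key, token);
    }

    /** Simulates the lock's TTL expiring while the holder is still running. */
    void expire(String key) {
        locks.remove(key);
    }
}
```

If holder A's lock expires and holder B acquires it, A's late release is rejected because the stored token no longer matches, so B keeps the lock. An unconditional delete would have freed B's lock and let a third caller in.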

Conclusion

Retry mechanisms are like fire extinguishers in a data center: ideally you never need them, but they must work reliably when an emergency arises. Choose the solution that fits your business's balance between retrying aggressively and protecting downstream systems.

Written by

Su San Talks Tech

Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.
