7 Proven Retry Strategies to Keep Your System Running Smoothly
This article explores seven practical retry solutions—from simple loops and Spring Retry to Resilience4j, message queues, scheduled tasks, two‑phase commits, and distributed locks—explaining their scenarios, core code, and how they prevent costly system failures.
Introduction
Five years ago, a refund API on an e‑commerce platform kept failing because of a bank network glitch. A naive retry loop called the bank interface 82 times, causing duplicate refunds and losses of over a million dollars. The boss asked how something as basic as a retry could cause a disaster, which underlines the need for properly designed retry mechanisms.
This article discusses seven commonly used retry solutions to help you avoid similar pitfalls.
1. Brutal Loop
Problem scenario
An intern wrote a user‑registration SMS sending method that repeatedly called a third‑party SMS API inside a while loop.
Code
public void sendSms(String phone) throws InterruptedException {
    int retry = 0;
    while (retry < 5) { // blind loop
        try {
            smsClient.send(phone);
            break;
        } catch (Exception e) { // catches everything, including non-retryable errors
            retry++;
            Thread.sleep(1000); // fixed 1‑second sleep
        }
    }
}
Incident
When the SMS server was overloaded and delayed responses by 3 seconds, the loop generated tens of thousands of retries within 0.5 seconds, overwhelming the SMS platform and triggering circuit‑breaker bans that also blocked normal requests.
Lesson
Don’t use a fixed delay: a constant interval causes request bursts.
Don’t ignore exception types: non‑transient errors (e.g., parameter errors) are retried unnecessarily.
Fix: add random back‑off intervals and filter non‑retryable exceptions.
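The fix described above can be sketched in plain Java. This is a minimal illustration, not the code used in the incident: the class name JitteredRetry, its parameters, and the "full jitter" choice (sleeping a random time between 0 and an exponentially growing cap) are assumptions made for the example.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Predicate;

// Hypothetical helper for illustration; names and parameters are assumptions.
final class JitteredRetry {

    /** Upper bound for the wait after failed attempt `attempt` (0-based): base * 2^attempt, capped. */
    static long backoffCapMillis(int attempt, long baseMillis, long maxMillis) {
        long exp = baseMillis << Math.min(attempt, 20); // clamp the shift to avoid overflow
        return Math.min(exp, maxMillis);
    }

    /** Calls `task` up to maxAttempts times, sleeping a random time in [0, cap] between tries. */
    static <T> T call(Callable<T> task, int maxAttempts, long baseMillis, long maxMillis,
                      Predicate<Exception> retryable) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                boolean lastAttempt = attempt + 1 >= maxAttempts;
                if (lastAttempt || !retryable.test(e)) {
                    throw e; // non-retryable error or retries exhausted: stop immediately
                }
                long cap = backoffCapMillis(attempt, baseMillis, maxMillis);
                Thread.sleep(ThreadLocalRandom.current().nextLong(cap + 1)); // random wait spreads callers out
            }
        }
    }
}
```

A call such as JitteredRetry.call(() -> smsClient.send(phone), 5, 1000, 30000, e -> e instanceof java.io.IOException) would retry only I/O failures, assuming send returns a value and reports transient failures as IOException. Because each caller sleeps a different random time, retries from many clients no longer arrive in the same instant.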
2. Spring Retry
Use case
Suitable for small to medium projects; annotations quickly enable basic retry and circuit‑breaker behavior (e.g., order‑status queries).
Configuration example
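@Retryable(
    value = {TimeoutException.class}, // only retry timeouts
    maxAttempts = 3, // includes the first call, so at most 2 retries
    backoff = @Backoff(delay = 1000, multiplier = 2) // waits 1s, then 2s
)
public boolean queryOrderStatus(String orderId) throws TimeoutException {
    return httpClient.get("/order/" + orderId);
}

@Recover // fallback once retries are exhausted; must return the same type as the @Retryable method
public boolean fallback(TimeoutException e, String orderId) {
    return false;
}
Advantages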
Declarative annotation: clean code, decoupled from business logic.
Exponential back‑off: automatically lengthens retry intervals.
Circuit‑breaker integration: combine with @CircuitBreaker to quickly stop error traffic.
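To make the exponential back‑off arithmetic concrete, the sketch below computes the waits implied by a maximum attempt count, an initial delay, and a multiplier. It only mirrors the arithmetic and is not Spring Retry's implementation; the class name BackoffSchedule is hypothetical, and any upper limit on the interval is ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that reproduces the back-off arithmetic only.
final class BackoffSchedule {

    /** Waits in milliseconds between attempts; there are maxAttempts - 1 of them. */
    static List<Long> waits(int maxAttempts, long delayMillis, double multiplier) {
        List<Long> waits = new ArrayList<>();
        double next = delayMillis;
        for (int retry = 1; retry < maxAttempts; retry++) { // the first attempt has no wait
            waits.add((long) next);
            next *= multiplier;
        }
        return waits;
    }
}
```

BackoffSchedule.waits(3, 1000, 2.0) returns [1000, 2000]: three attempts leave room for only two waits, so a 4‑second wait would require a fourth attempt.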
3. Resilience4j
Advanced scenario
For medium‑to‑large systems that need custom back‑off algorithms, circuit‑breaker policies, and multi‑layer protection (e.g., payment core services).
Core code
// 1. Retry config: exponential back‑off + random jitter
RetryConfig retryConfig = RetryConfig.custom()
    .maxAttempts(3)
    .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(
        1000L, 2.0, 0.3)) // initial 1s, multiplier 2, randomization factor 0.3
    .retryOnException(e -> e instanceof TimeoutException)
    .build();
// 2. Circuit‑breaker config: open when the failure rate reaches 50%
CircuitBreakerConfig cbConfig = CircuitBreakerConfig.custom()
    // window of the last 10 calls, evaluated once at least 10 calls are recorded
    .slidingWindow(10, 10, CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
    .failureRateThreshold(50)
    .build();
// 3. Combine usage
Retry retry = Retry.of("payment", retryConfig);
CircuitBreaker cb = CircuitBreaker.of("payment", cbConfig);
Supplier<Boolean> supplier = () -> paymentService.pay();
// Decorators wrap in call order, so the circuit breaker is outermost here
// and records one outcome per call, after the retries have run
Supplier<Boolean> decorated = Decorators.ofSupplier(supplier)
    .withRetry(retry)
    .withCircuitBreaker(cb)
    .decorate();
Effect
After deploying this solution, a large e‑commerce platform saw a 60% reduction in timeout rates and a near‑90% drop in circuit‑breaker trigger frequency.
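The count‑based sliding window behind the circuit‑breaker settings can be illustrated with a small standalone class. This is a simplified sketch, not Resilience4j's implementation: SimpleCircuitBreaker is a hypothetical name, and the half‑open state and the wait before re‑closing are left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified illustration: opens when failures in the last `windowSize` calls reach the threshold.
final class SimpleCircuitBreaker {
    private final int windowSize;
    private final double failureRateThreshold; // percent, e.g. 50
    private final Deque<Boolean> window = new ArrayDeque<>(); // true = failed call
    private int failures = 0;
    private boolean open = false;

    SimpleCircuitBreaker(int windowSize, double failureRateThreshold) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
    }

    synchronized boolean allowsCalls() {
        return !open;
    }

    synchronized void record(boolean failure) {
        window.addLast(failure);
        if (failure) {
            failures++;
        }
        if (window.size() > windowSize && window.removeFirst()) {
            failures--; // the evicted outcome was a failure
        }
        // Evaluate only when the window is full (minimum calls equal to the window size).
        if (window.size() == windowSize
                && 100.0 * failures / windowSize >= failureRateThreshold) {
            open = true;
        }
    }
}
```

With a window of 10 and a threshold of 50, five failures among the last ten recorded calls open the breaker, while four do not.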
4. MQ Queue
Applicable scenario
High‑concurrency, asynchronous scenarios that tolerate delay, such as logistics status synchronization.
Implementation principle
On first failure, push the message to a delay queue.
The queue retries consumption after a preset delay (e.g., 5 s, 30 s, 1 min).
If the maximum retry count is reached, move the message to a dead‑letter queue for manual handling.
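The routing decision in the steps above can be sketched independently of any particular message broker. RetryLadder is a hypothetical helper; the delays are the example values from the text (5 s, 30 s, 1 min), and an empty result stands for "send to the dead‑letter queue".

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper: maps the number of failed deliveries to the next delay.
final class RetryLadder {
    // Example delays from the text; real values depend on the broker and the business.
    private static final List<Duration> DELAYS = Arrays.asList(
            Duration.ofSeconds(5), Duration.ofSeconds(30), Duration.ofMinutes(1));

    /**
     * Returns the delay before the next delivery, or empty when the retry
     * budget is used up and the message belongs in the dead-letter queue.
     */
    static Optional<Duration> nextDelay(int failedDeliveries) {
        if (failedDeliveries < 1) {
            throw new IllegalArgumentException("failedDeliveries must be >= 1");
        }
        if (failedDeliveries > DELAYS.size()) {
            return Optional.empty();
        }
        return Optional.of(DELAYS.get(failedDeliveries - 1));
    }
}
```

A consumer would call nextDelay with the message's failure count, requeue with the returned delay if one is present, and otherwise hand the message over for manual handling.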
RocketMQ code
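// Producer sends a delayed message (assumes rocketmq-spring's RocketMQTemplate)
Message<String> message = MessageBuilder.withPayload("order data").build();
// Arguments: destination, message, send timeout in ms, delay level.
// Level 3 is RocketMQ's preset 10‑second level.
rocketMQTemplate.syncSend("DELAY_TOPIC", message, 3000, 3);

// Consumer retries (the consumerGroup name is illustrative)
@Component
@RocketMQMessageListener(topic = "DELAY_TOPIC", consumerGroup = "delay-consumer-group")
public class DelayConsumer implements RocketMQListener<MessageExt> {
    @Override
    public void onMessage(MessageExt message) {
        try {
            syncLogistics(message);
        } catch (Exception e) {
            // A resent message is a new message, so the retry count has to travel with it;
            // this sketch assumes it is kept in a "retryCount" user property.
            String prop = message.getUserProperty("retryCount");
            int retryCount = (prop == null) ? 0 : Integer.parseInt(prop);
            // retry count +1 and resend with higher delay level
            resendWithDelay(message, retryCount + 1);
        }
    }
}
5. Scheduled Task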
Applicable scenario
Tasks that don’t require real‑time feedback and can be processed in batches, such as file imports.
Example with Spring @Scheduled
@Scheduled(cron = "0 0/5 * * * ?") // every 5 minutes
public void retryFailedTasks() {
    List<FailedTask> list = failedTaskDao.listUnprocessed(5); // fetch failed tasks
    list.forEach(task -> {
        try {
            retryTask(task);
            task.markSuccess();
        } catch (Exception e) {
            task.incrRetryCount();
        }
        failedTaskDao.update(task);
    });
}
6. Two‑Phase Commit
Applicable scenario
Strict data‑consistency requirements, such as financial transfers.
Key implementation
Phase 1: Record the operation in the database with status “in progress”.
Phase 2: Call the remote service and update the record status based on the result.
Compensation: Scan timed‑out “in‑progress” records and retry them.
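The compensation step can be sketched as a scan over in‑memory records. This is only an illustration under assumptions: CompensationScan, TransferRecord, and the timeout parameter are hypothetical names, and a real system would run the equivalent query against the database.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for the transfer record table.
final class CompensationScan {
    enum Status { PENDING, SUCCESS, FAILED }

    static final class TransferRecord {
        final String id;
        final Instant createdAt;
        Status status = Status.PENDING;

        TransferRecord(String id, Instant createdAt) {
            this.id = id;
            this.createdAt = createdAt;
        }
    }

    /** Returns records that are still PENDING and older than `timeout`. */
    static List<TransferRecord> findTimedOut(List<TransferRecord> records,
                                             Instant now, Duration timeout) {
        List<TransferRecord> stale = new ArrayList<>();
        Instant cutoff = now.minus(timeout);
        for (TransferRecord r : records) {
            if (r.status == Status.PENDING && r.createdAt.isBefore(cutoff)) {
                stale.add(r);
            }
        }
        return stale;
    }
}
```

Each record returned by the scan is a candidate for retry. Since a timed‑out call may already have succeeded on the remote side, it is safer to query the remote status first or to send the record id as an idempotency key, so that a retry cannot pay out twice as in the refund incident from the introduction.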
Code
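// Deliberately not one @Transactional method: the PENDING record must be committed
// before the remote call, otherwise an exception would roll it back and leave
// nothing for the compensation scan to find.
public void transfer(TransferRequest req) {
    // 1. Record the transaction (committed on its own)
    transferRecordDao.create(req, PENDING);
    // 2. Call bank API; if this throws, the record stays PENDING for compensation
    boolean success = bankClient.transfer(req);
    // 3. Update record status
    transferRecordDao.updateStatus(req.getId(), success ? SUCCESS : FAILED);
    // 4. If failed, send to async retry queue
    if (!success) {
        mqTemplate.send("TRANSFER_RETRY_QUEUE", req);
    }
}
7. Distributed Lock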
Applicable scenario
Prevent duplicate submissions in multi‑instance, multi‑thread environments such as flash‑sale systems.
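One detail worth illustrating for lock‑guarded retries is that only the holder should be able to release the lock. The in‑process sketch below shows that rule with a per‑holder token; TokenLock is a hypothetical class, and it does not model expiry or multiple machines.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// In-process illustration of "acquire if absent, release only if owner".
final class TokenLock {
    private final ConcurrentMap<String, String> locks = new ConcurrentHashMap<>();

    /** Acquires the lock for `key` if nobody holds it; `token` identifies the holder. */
    boolean tryLock(String key, String token) {
        return locks.putIfAbsent(key, token) == null;
    }

    /** Releases the lock only if `token` matches the current holder. */
    boolean unlock(String key, String token) {
        return locks.remove(key, token);
    }
}
```

In a distributed setting the lock also needs an expiry, and the compare‑then‑delete release has to be atomic on the lock server, because a holder whose lock has expired could otherwise delete a lock that now belongs to another instance.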
Redis + Lua example
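// Assumes `redis` is a Spring StringRedisTemplate and callApi() is the guarded call.
private static final String RELEASE_SCRIPT =
    "if redis.call('get', KEYS[1]) == ARGV[1] then return redis.call('del', KEYS[1]) else return 0 end";

public boolean retryWithLock(String key, int maxRetry) throws InterruptedException {
    String lockKey = "api_retry_lock:" + key;
    String token = UUID.randomUUID().toString(); // identifies this lock holder
    for (int i = 0; i < maxRetry; i++) {
        // try to acquire distributed lock (SET NX with a 30-second expiry)
        if (Boolean.TRUE.equals(
                redis.opsForValue().setIfAbsent(lockKey, token, 30, TimeUnit.SECONDS))) {
            try {
                return callApi();
            } finally {
                // Release through Lua only if we still own the lock, so a lock that
                // expired and was re-acquired by another instance is not deleted.
                redis.execute(new DefaultRedisScript<>(RELEASE_SCRIPT, Long.class),
                    Collections.singletonList(lockKey), token);
            }
        }
        Thread.sleep(1000L * (i + 1)); // wait longer before each new attempt
    }
    return false;
}
Conclusion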
Retry mechanisms are like fire extinguishers in a data center: ideally you never need them, but they must work reliably when an emergency arises. Choose the solution that fits your business, balancing the "weapon" (getting the request through) against the "shield" (protecting the systems behind it).
Su San Talks Tech
Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.