Mastering Failure Recovery: Fast‑Fail, Auto‑Retry, and Resilience Patterns for Distributed Systems
This guide outlines core principles and practical solutions for building resilient backend systems, covering fast‑failure handling, automatic retries with exponential back‑off, circuit‑breaker usage, idempotency, batch job strategies, online transaction patterns, and robust message‑queue processing.
Core Principles
Use a fast‑fail strategy for predictable exceptions and an automatic recovery strategy for unpredictable ones. A circuit‑breaker (e.g., Sentinel or Hystrix) aborts the current flow when the failure rate exceeds a threshold, preventing fault propagation. Combine this with an exponential back‑off retry policy (initial interval 1 s, doubling each attempt, up to three retries).
Scenarios suited to fast‑fail: input validation errors, permission check failures, business rule violations (e.g., insufficient stock or balance), programming bugs, and calls to extremely time‑consuming interfaces.
Scenarios suited to automatic retry: network time‑outs, rate‑limiting responses, and concurrency conflicts.
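Sentinel and Hystrix expose comparable protections; as a concrete illustration of the back‑off and circuit‑breaker policy described above, here is a minimal sketch using Resilience4j as a swapped‑in library (the service name and the remote‑call supplier are illustrative):
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.core.IntervalFunction;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;
import java.time.Duration;
import java.util.function.Supplier;

public class ResilientCaller {

    // circuit breaker: open the circuit when more than half of recent calls fail,
    // then stay open for 30 s before probing the dependency again
    private final CircuitBreaker circuitBreaker = CircuitBreaker.of("inventoryService",
            CircuitBreakerConfig.custom()
                    .failureRateThreshold(50)
                    .waitDurationInOpenState(Duration.ofSeconds(30))
                    .build());

    // retry: at most 3 attempts in total, waiting 1 s and then 2 s between them
    private final Retry retry = Retry.of("inventoryService",
            RetryConfig.custom()
                    .maxAttempts(3)
                    .intervalFunction(IntervalFunction.ofExponentialBackoff(1000L, 2.0))
                    .build());

    public String callWithProtection(Supplier<String> remoteCall) {
        Supplier<String> decorated = Retry.decorateSupplier(retry,
                CircuitBreaker.decorateSupplier(circuitBreaker, remoteCall));
        return decorated.get();   // throws once the retries are exhausted or the circuit is open
    }
}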
Define clear fault boundaries and guarantee idempotent transactional operations so that state can be rolled back safely.
Never assume external services are 100 % available. Always configure timeout, retry and fallback logic (default data, queuing, or failure‑record storage) to keep the main flow alive.
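As one way to keep the main flow alive, a hedged sketch of the timeout‑plus‑fallback idea using CompletableFuture (the pricing client, SKU parameter and default price are illustrative; orTimeout requires Java 9+):
import java.math.BigDecimal;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class PriceQueryService {

    private static final BigDecimal DEFAULT_PRICE = new BigDecimal("9.99");

    // illustrative external dependency
    interface ExternalPricingClient { BigDecimal queryPrice(String skuId); }

    public BigDecimal queryPriceWithFallback(ExternalPricingClient client, String skuId) {
        return CompletableFuture
                .supplyAsync(() -> client.queryPrice(skuId))   // call the external service off the main thread
                .orTimeout(800, TimeUnit.MILLISECONDS)         // time out instead of blocking the main flow
                .exceptionally(ex -> DEFAULT_PRICE)            // fallback: default data keeps the flow alive
                .join();
    }
}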
Log each request with a globally unique request ID, exception type, key business parameters and processing latency in a uniform format for rapid troubleshooting.
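A minimal sketch of attaching a globally unique request ID to every log line via SLF4J's MDC in a servlet filter (this assumes Spring Boot with the jakarta.servlet API; the header name and log fields are illustrative):
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import java.io.IOException;
import java.util.UUID;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;
import org.springframework.stereotype.Component;
import org.springframework.web.filter.OncePerRequestFilter;

@Component
public class RequestIdLogFilter extends OncePerRequestFilter {

    private static final Logger log = LoggerFactory.getLogger(RequestIdLogFilter.class);

    @Override
    protected void doFilterInternal(HttpServletRequest request, HttpServletResponse response,
                                    FilterChain chain) throws ServletException, IOException {
        // reuse an upstream request ID if one is supplied, otherwise generate one
        String requestId = request.getHeader("X-Request-Id");
        if (requestId == null || requestId.isEmpty()) {
            requestId = UUID.randomUUID().toString();
        }
        MDC.put("requestId", requestId);   // every log line on this thread now carries the ID
        long start = System.currentTimeMillis();
        try {
            chain.doFilter(request, response);
        } finally {
            // uniform access log entry: request ID (via MDC), URI, status and latency
            log.info("uri={} status={} costMs={}",
                    request.getRequestURI(), response.getStatus(), System.currentTimeMillis() - start);
            MDC.clear();
        }
    }
}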
Scenario‑Specific Solutions
General Exception Handling Mechanism
Parameter validation: Strictly validate type, format, length and required fields for all external inputs (a validation sketch follows this list).
Dependency health checks: Use Kubernetes liveness/readiness probes together with Sentinel to monitor databases, caches and other middleware at startup and runtime.
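A hedged sketch of the fail‑fast parameter validation mentioned above, using Bean Validation annotations with Spring MVC (jakarta.validation here; the TransferRequest fields are illustrative):
import jakarta.validation.Valid;
import jakarta.validation.constraints.DecimalMin;
import jakarta.validation.constraints.NotBlank;
import jakarta.validation.constraints.NotNull;
import jakarta.validation.constraints.Size;
import java.math.BigDecimal;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

public class TransferRequest {

    @NotBlank(message = "accountNo is required")
    private String accountNo;

    @NotNull
    @DecimalMin(value = "0.01", message = "amount must be positive")
    private BigDecimal amount;

    @Size(max = 128)
    private String remark;
    // getters/setters omitted
}

@RestController
class TransferController {

    @PostMapping("/transfer")
    public String transfer(@Valid @RequestBody TransferRequest req) {
        // invalid input never reaches the business logic: Spring rejects it with a 400 before this point
        return "accepted";
    }
}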
Batch‑Job Patterns
Data import/export (short‑duration jobs): Schedule jobs with XXL‑Job and set an execution deadline; if the deadline is exceeded, raise an alarm for manual intervention. When a partial write/read failure occurs, define a failure‑count threshold, log the failed batch identifiers and abort the job if the threshold is exceeded (a sketch of this threshold logic follows this list). Record failed records to a dedicated file for later retry; if retries still exceed the threshold, escalate to manual handling.
Data export grouping: Split export data by business category into separate batch files so that problematic categories can be re‑exported independently.
Data polling (long‑running jobs): Persist batch and detail records after each processed chunk. Support parameter‑driven re‑execution of unprocessed or failed data. Define a tolerable failure‑count threshold and trigger alerts when it is exceeded. Provide an operations‑management feature that rolls batch updates back to the pre‑batch snapshot when rapid rollback is required.
Avoid calling external systems inside batch jobs; if unavoidable, use bulk APIs and record failures without retrying the whole batch.
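A minimal sketch of the failure‑count threshold idea for an import job; ImportService, FailedRecordStore and AlarmService are illustrative collaborators, and the XXL‑Job handler wiring is omitted:
import java.util.ArrayList;
import java.util.List;

public class ImportJobHandler {

    private static final int FAILURE_THRESHOLD = 100;   // tolerable number of failed records

    private final ImportService importService;
    private final FailedRecordStore failedRecordStore;
    private final AlarmService alarmService;

    public ImportJobHandler(ImportService importService, FailedRecordStore failedRecordStore,
                            AlarmService alarmService) {
        this.importService = importService;
        this.failedRecordStore = failedRecordStore;
        this.alarmService = alarmService;
    }

    public void execute(List<ImportRow> rows) {
        int failures = 0;
        List<Long> failedIds = new ArrayList<>();
        for (ImportRow row : rows) {
            try {
                importService.importOne(row);
            } catch (Exception e) {
                failures++;
                failedIds.add(row.id());
                failedRecordStore.save(row, e.getMessage());   // keep the failed record for a later retry
                if (failures > FAILURE_THRESHOLD) {
                    // too many failures: log the failed identifiers, alert, and abort the job
                    alarmService.notifyOps("import aborted, failed ids=" + failedIds);
                    throw new IllegalStateException("failure threshold exceeded: " + failures);
                }
            }
        }
    }

    // illustrative collaborators
    interface ImportService { void importOne(ImportRow row); }
    interface FailedRecordStore { void save(ImportRow row, String reason); }
    interface AlarmService { void notifyOps(String message); }
    public record ImportRow(Long id, String payload) { }
}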
Online Transactional Workloads
Single‑application web jobs: Return explicit failure messages to the front‑end on exceptions and monitor exception counts within a sliding time window for alerting.
Multi‑application coordination (no local transactions): Choose one of the following consistency patterns:
TCC (Try‑Confirm‑Cancel): Provide compensation interfaces for each step, allow limited retries, and record failures for manual or automatic compensation (see the interface sketch after this list).
Saga: A sequence of local transactions in which each step has a compensating action; when a later step fails, the compensations are executed in reverse order.
2PC / AT mode (Seata) or transactional messages (RocketMQ): Record transaction execution details for rollback; avoid retries in AT mode to prevent global lock contention.
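A hypothetical TCC participant interface for a stock‑deduction step; the essential point is that Confirm and Cancel must be idempotent so that limited retries and compensation stay safe (names are illustrative, framework wiring omitted):
// Hypothetical TCC participant for one step of the flow. Try reserves the resource,
// Confirm commits the reservation and Cancel releases it; Confirm/Cancel must be
// idempotent because the coordinator may retry them a limited number of times.
public interface StockTccAction {

    // Try phase: check and reserve stock under the global transaction id (xid)
    boolean tryReserve(String xid, long skuId, int quantity);

    // Confirm phase: turn the reservation into a real deduction
    boolean confirm(String xid);

    // Cancel phase: compensation, release the reserved stock
    boolean cancel(String xid);
}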
If strict consistency is not required, treat downstream calls as notifications and handle them asynchronously via message queues or thread pools, with limited retries and persistent failure records.
For time‑consuming operations (file upload/download, heavy calculations), split the workflow into submission and execution phases, run execution asynchronously (e.g., chunked upload, streaming download) and enforce per‑user rate limits using a cache.
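A minimal sketch of the cache‑based per‑user rate limit for the submission phase, using Spring Data Redis (the key prefix and the limit of five submissions per minute are illustrative):
import java.util.concurrent.TimeUnit;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Component;

@Component
public class UploadRateLimiter {

    private static final int LIMIT_PER_MINUTE = 5;

    private final StringRedisTemplate redisTemplate;

    public UploadRateLimiter(StringRedisTemplate redisTemplate) {
        this.redisTemplate = redisTemplate;
    }

    public boolean tryAcquire(String userId) {
        String key = "rate:upload:" + userId;
        Long count = redisTemplate.opsForValue().increment(key, 1);   // atomic counter per user
        if (count != null && count == 1L) {
            redisTemplate.expire(key, 1, TimeUnit.MINUTES);           // first hit opens a one-minute window
        }
        return count != null && count <= LIMIT_PER_MINUTE;            // reject the submission when over the limit
    }
}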
Implement idempotency on the front‑end (prevent duplicate clicks) and on the back‑end using a “one‑lock‑two‑check‑three‑update” pattern.
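A sketch of the back‑end "one‑lock‑two‑check‑three‑update" pattern, here with a Redisson distributed lock (the order repository and status values are illustrative):
import org.redisson.api.RLock;
import org.redisson.api.RedissonClient;

public class PaymentCallbackService {

    private final RedissonClient redissonClient;
    private final OrderRepository orderRepository;

    public PaymentCallbackService(RedissonClient redissonClient, OrderRepository orderRepository) {
        this.redissonClient = redissonClient;
        this.orderRepository = orderRepository;
    }

    public void confirmPayment(String orderNo) {
        RLock lock = redissonClient.getLock("lock:pay:" + orderNo);   // 1. lock on the business key
        lock.lock();
        try {
            Order order = orderRepository.findByOrderNo(orderNo);
            if ("PAID".equals(order.getStatus())) {                   // 2. check current state
                return;                                               // duplicate callback: nothing to do
            }
            order.setStatus("PAID");                                  // 3. update only after the check
            orderRepository.save(order);
        } finally {
            lock.unlock();
        }
    }

    // illustrative collaborators
    interface OrderRepository { Order findByOrderNo(String orderNo); void save(Order order); }
    public static class Order {
        private String status;
        public String getStatus() { return status; }
        public void setStatus(String status) { this.status = status; }
    }
}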
Message Processing
Message sending: Use acknowledgment mechanisms and retry delivery for failed messages. For messages that require high reliability, persist failed sends to a database or file for scheduled retries.
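For illustration, a hedged sketch of acknowledgment handling on the producer side using the Kafka client; the topic name and the FailedMessageStore that writes to the retry‑record table in the appendix are illustrative:
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderEventPublisher {

    private final KafkaProducer<String, String> producer;
    private final FailedMessageStore failedMessageStore;

    public OrderEventPublisher(KafkaProducer<String, String> producer, FailedMessageStore failedMessageStore) {
        this.producer = producer;
        this.failedMessageStore = failedMessageStore;
    }

    public void publish(String orderNo, String payload) {
        ProducerRecord<String, String> record = new ProducerRecord<>("order-events", orderNo, payload);
        producer.send(record, (metadata, exception) -> {
            if (exception != null) {
                // the broker did not acknowledge the message: persist it so a
                // scheduled job can retry from the failure-record table later
                failedMessageStore.save("ORDER_EVENT", payload, exception.getMessage());
            }
        });
    }

    // illustrative collaborator backed by the retry-record table
    interface FailedMessageStore { void save(String eventType, String message, String reason); }
}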
Message receiving (golden rule: idempotent consumption):
For short‑term deduplication, store a unique identifier (e.g., a Redis key) with a TTL (a consumer sketch follows this list).
For permanent deduplication, enforce a unique constraint or a processed‑message table in the database.
Avoid heavy processing in the consumer; if needed, store the message and process it asynchronously. On consumer exceptions, let the message remain for re‑consumption; after exceeding the retry limit, move it to a dead‑letter queue for manual handling.
When consuming batch messages, split the batch into per‑message handling, capture individual failures and apply the same retry/alert logic so that a single failure does not block the entire batch.
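Putting the receiving rules together, a minimal sketch of an idempotent consumer that uses a Redis key with a TTL for short‑term de‑duplication and lets the broker redeliver on failure (Spring Data Redis; the key prefix, TTL and message type are illustrative):
import java.time.Duration;
import org.springframework.data.redis.core.StringRedisTemplate;

public class OrderMessageConsumer {

    private final StringRedisTemplate redisTemplate;
    private final OrderService orderService;

    public OrderMessageConsumer(StringRedisTemplate redisTemplate, OrderService orderService) {
        this.redisTemplate = redisTemplate;
        this.orderService = orderService;
    }

    public void onMessage(String messageId, String payload) {
        String dedupKey = "mq:consumed:" + messageId;
        // short-term de-duplication: the marker expires after 24 hours
        Boolean first = redisTemplate.opsForValue().setIfAbsent(dedupKey, "1", Duration.ofHours(24));
        if (Boolean.FALSE.equals(first)) {
            return;                             // duplicate delivery: already processed, skip safely
        }
        try {
            orderService.process(payload);      // keep the consumer itself lightweight
        } catch (Exception e) {
            redisTemplate.delete(dedupKey);     // allow re-consumption of this message
            throw e;                            // broker redelivers; past the retry limit it goes to the dead-letter queue
        }
    }

    // illustrative collaborator
    interface OrderService { void process(String payload); }
}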
Appendix
A sample retry‑record table stores failed attempts and supports automated back‑off. Key columns include:
id – primary key (bigint).
sys_code – system identifier (varchar(8)).
event_type – type of event (varchar(32)).
message – JSON payload of the failed request (varchar(2048)).
status – processing status (tinyint, 0 = failure, 1 = success).
retry_count – number of retry attempts (int).
create_time / update_time – timestamps (datetime).
delete_flag – logical delete flag (tinyint).
tenant_id – tenant identifier for multi‑tenant isolation (int).
Typical data volume is on the order of a few hundred rows per day, retained for one month based on update_time.
Below is a Java sketch of an exponential back‑off retry handler that respects the Retry‑After header and caps the number of retries; the Response type, the retryRequest helper and the custom exceptions are application‑specific.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.springframework.stereotype.Component;

@Component
public class SmartRetryHandler {

    private static final int MAX_RETRIES = 3;

    // per-service retry counters, keyed by the caller-supplied service identifier
    private final Map<String, Integer> retryCounts = new ConcurrentHashMap<>();

    public Response handleRateLimit(Response response, String serviceKey) {
        int current = retryCounts.getOrDefault(serviceKey, 0);
        if (current >= MAX_RETRIES) {
            retryCounts.remove(serviceKey);   // reset the counter before giving up
            throw new MaxRetriesExceededException("Max retries exceeded for " + serviceKey);
        }

        // honour the server's Retry-After hint when present, otherwise back off exponentially
        String retryAfter = response.getHeader("Retry-After");
        long wait = calculateWaitTime(retryAfter, current);
        try {
            Thread.sleep(wait);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RetryInterruptedException("Retry interrupted", e);
        }

        retryCounts.put(serviceKey, current + 1);
        return retryRequest(response.getOriginalRequest());   // re-issue the original request
    }

    private long calculateWaitTime(String retryAfter, int retryCount) {
        if (retryAfter != null) {
            try {
                return Long.parseLong(retryAfter.trim()) * 1000L;   // Retry-After in seconds, converted to ms
            } catch (NumberFormatException ignored) {
                // Retry-After was an HTTP date or malformed: fall back to exponential back-off
            }
        }
        // exponential back-off: 1 s, 2 s, 4 s, ...
        return (long) Math.pow(2, retryCount) * 1000L;
    }
}