
Design and Implementation of a Distributed Retry System Based on Distributed Scheduling

This article presents a distributed retry system that uses a distributed scheduling mechanism to ensure eventual consistency and reduce manual recovery costs. It provides backend services with flexible retry strategies, automatic recovery detection, visual management, rate limiting, and intelligent retry.

58 Tech

Background

In distributed environments, service dependencies increase overall system complexity and the likelihood of failures, leading to data consistency problems. Consensus protocols such as Paxos and two‑phase commit can guarantee strong consistency, but they are complex to operate and reduce availability, so most business systems instead rely on eventual consistency achieved through retry mechanisms.

Overall Architecture

The system is composed of four main modules:

Retry Client: provided as a JAR, it reports data via the company’s MQ component and exposes RPC services (Jetty) for retry point execution.

Registration Management: handles registration of clusters, retry points, client nodes, and alarm configurations.

Retry Center: centrally controls retry strategies, collects data, listens to retry events, performs automatic retry scheduling, and implements rate‑limiting and circuit‑breaker logic.

Monitoring & Alarm: monitors retry queues and generates alerts based on cluster, retry point, and queue size.

Retry Point Concept

A retry point is the entry point in the business code where an exception occurs; it can be split into multiple points to isolate different failure paths. Each point carries configuration such as retry count, interval, and data validity period.
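
To make that configuration concrete, here is a minimal sketch of a per-retry-point configuration object. All class, field, and method names below are illustrative assumptions, not the system's actual API:

```java
// Hypothetical per-retry-point configuration sketch; names are illustrative.
public class RetryPointConfig {
    private final String pointId;       // unique id, e.g. "order-service.payNotify"
    private final int maxRetries;       // give up after this many attempts
    private final long baseIntervalMs;  // base back-off interval
    private final long validityMs;      // data older than this is discarded, not retried

    public RetryPointConfig(String pointId, int maxRetries, long baseIntervalMs, long validityMs) {
        this.pointId = pointId;
        this.maxRetries = maxRetries;
        this.baseIntervalMs = baseIntervalMs;
        this.validityMs = validityMs;
    }

    /** True while the item is still inside its data validity window. */
    public boolean isStillValid(long createdAtMs, long nowMs) {
        return nowMs - createdAtMs <= validityMs;
    }

    /** True if another retry attempt is allowed. */
    public boolean canRetry(int attemptsSoFar) {
        return attemptsSoFar < maxRetries;
    }

    public String pointId() { return pointId; }
    public long baseIntervalMs() { return baseIntervalMs; }
}
```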

Retry Flow

When an exception occurs, the retry client reports the data to the retry center, which stores it in a retry queue. An automatic retry scanner reads pending items, acquires a distributed lock per retry point, and processes up to 5,000 items per pass. A successful retry dequeues the item; a failed retry either increments the retry count and reschedules the item after a calculated back‑off interval or, once the retry threshold is exceeded, moves the item to a trigger‑retry queue.
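
The scan-and-retry loop can be sketched in memory as follows. A real deployment would use a distributed lock (e.g. via ZooKeeper) and persistent queues, and would reschedule failed items at now + f(attempts); every name below is illustrative:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Predicate;

// Minimal in-memory sketch of the scan-and-retry loop; illustrative only.
public class RetryScanner {
    static final int BATCH_SIZE = 5000; // max items processed per scan pass

    public static class RetryItem {
        public final String payload;
        public int attempts;
        public RetryItem(String payload) { this.payload = payload; }
    }

    private final Deque<RetryItem> retryQueue = new ArrayDeque<>();
    private final Deque<RetryItem> triggerQueue = new ArrayDeque<>(); // threshold exceeded
    private final ReentrantLock pointLock = new ReentrantLock();      // stand-in for a distributed lock
    private final int maxRetries;

    public RetryScanner(int maxRetries) { this.maxRetries = maxRetries; }

    public void enqueue(RetryItem item) { retryQueue.addLast(item); }

    /** One scan pass: process up to BATCH_SIZE pending items under the per-point lock. */
    public void scanOnce(Predicate<RetryItem> executeRetry) {
        if (!pointLock.tryLock()) return; // another scanner owns this retry point
        try {
            // Snapshot the batch size so re-enqueued items wait for the next pass.
            int n = Math.min(BATCH_SIZE, retryQueue.size());
            for (int i = 0; i < n; i++) {
                RetryItem item = retryQueue.pollFirst();
                if (executeRetry.test(item)) continue; // success: item stays dequeued
                item.attempts++;
                if (item.attempts >= maxRetries) {
                    triggerQueue.addLast(item); // threshold exceeded: hand off to trigger queue
                } else {
                    retryQueue.addLast(item);   // re-enqueue with incremented retry count
                }
            }
        } finally {
            pointLock.unlock();
        }
    }

    public int pendingCount() { return retryQueue.size(); }
    public int triggerCount() { return triggerQueue.size(); }
}
```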

Retry Strategies

The system supports multiple strategies:

Fixed‑interval retry.

Gradient (exponential or linear) back‑off.

Custom interval sequences.

Event‑driven manual retries.

Automatic retries driven by the scheduler.

Back‑off intervals are defined by a function f(n), where n is the retry count, allowing constant, linear, exponential, or custom patterns.
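
As a sketch, the interval families above map to concrete f(n) implementations. The class, method, and parameter names (and the overflow cap) are illustrative assumptions, not the system's actual API:

```java
// Illustrative back-off functions f(n), where n is the retry count (n >= 1).
public class Backoff {
    /** Fixed interval: f(n) = base. */
    public static long fixed(long baseMs, int n) { return baseMs; }

    /** Linear gradient: f(n) = base * n. */
    public static long linear(long baseMs, int n) { return baseMs * n; }

    /** Exponential gradient: f(n) = base * 2^(n-1), capped to avoid runaway intervals. */
    public static long exponential(long baseMs, int n, long capMs) {
        long v = baseMs << Math.min(n - 1, 30); // bound the shift to avoid overflow
        return Math.min(v, capMs);
    }

    /** Custom sequence: f(n) = intervals[n-1], repeating the last value past the end. */
    public static long custom(long[] intervalsMs, int n) {
        int idx = Math.min(n, intervalsMs.length) - 1;
        return intervalsMs[idx];
    }
}
```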

Rate Limiting

To prevent overload during massive retry bursts, a pseudo‑distributed rate limiter is implemented using ZooKeeper node watches combined with a token‑bucket algorithm, distributing the global QPS across nodes.
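
A minimal sketch of one node's share of such a token bucket, assuming the live node count is already known (the ZooKeeper watch wiring that keeps that count current is omitted, and all names are illustrative):

```java
// Sketch: each node enforces globalQps / liveNodeCount with a local token bucket.
// Illustrative only; a real limiter would re-divide QPS when ZooKeeper watches
// report node membership changes.
public class NodeTokenBucket {
    private final double nodeQps;   // this node's share of the global QPS
    private final double capacity;  // max burst size in tokens
    private double tokens;
    private long lastRefillNanos;

    public NodeTokenBucket(double globalQps, int liveNodeCount, long nowNanos) {
        this.nodeQps = globalQps / liveNodeCount;
        this.capacity = this.nodeQps; // allow at most one second's worth of burst
        this.tokens = this.capacity;
        this.lastRefillNanos = nowNanos;
    }

    /** Try to take one token; false means this node's share is exhausted. */
    public synchronized boolean tryAcquire(long nowNanos) {
        double elapsedSec = (nowNanos - lastRefillNanos) / 1e9;
        tokens = Math.min(capacity, tokens + elapsedSec * nodeQps); // refill
        lastRefillNanos = nowNanos;
        if (tokens >= 1.0) { tokens -= 1.0; return true; }
        return false;
    }
}
```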

Intelligent Retry

By analyzing retry execution logs, the system detects target service health, applies circuit‑breaker rules, and performs probe‑recovery attempts with a small data sample. Successful probes restore normal retry status, while persistent failures keep the point in a probe state.
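
One way to sketch this probe-recovery behavior is a small state machine driven by the retry outcome stream. The thresholds, state names, and transition rules below are assumptions for illustration, not the system's actual values:

```java
// Illustrative probe-recovery state machine: after repeated failures a retry
// point switches to PROBING and only a small data sample is retried; enough
// consecutive probe successes restore NORMAL status.
public class ProbeBreaker {
    public enum State { NORMAL, PROBING }

    private final int failureThreshold;   // consecutive failures before probing
    private final int probeSuccessNeeded; // consecutive probe successes to recover
    private State state = State.NORMAL;
    private int consecutiveFailures = 0;
    private int probeSuccesses = 0;

    public ProbeBreaker(int failureThreshold, int probeSuccessNeeded) {
        this.failureThreshold = failureThreshold;
        this.probeSuccessNeeded = probeSuccessNeeded;
    }

    /** Feed one retry outcome from the execution log. */
    public void record(boolean success) {
        if (state == State.NORMAL) {
            consecutiveFailures = success ? 0 : consecutiveFailures + 1;
            if (consecutiveFailures >= failureThreshold) {
                state = State.PROBING; // target looks unhealthy: probe with a small sample
                probeSuccesses = 0;
            }
        } else { // PROBING
            if (success) {
                if (++probeSuccesses >= probeSuccessNeeded) {
                    state = State.NORMAL; // probes succeeded: restore normal retry status
                    consecutiveFailures = 0;
                }
            } else {
                probeSuccesses = 0; // persistent failure keeps the point in probe state
            }
        }
    }

    public State state() { return state; }
}
```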

Conclusion

The distributed retry system enhances backend service stability by providing automated, configurable, and observable retry mechanisms, reducing manual intervention and improving fault tolerance. Future work includes automatic detection and generation of retry points to further minimize manual configuration.

Tags: distributed systems, backend development, fault tolerance, rate limiting, retry mechanism, intelligent retry
Written by

58 Tech

Official tech channel of 58, a platform for tech innovation, sharing, and communication.
