
Reducing MTTR in a High‑Availability SaaS Platform through Chaos Engineering and Middleware Resilience

This article explains how a SaaS platform for employee incentives reduces mean time to recovery (MTTR) during large‑scale promotions by applying chaos‑engineering drills, automating fault detection, and leveraging JSF middleware features such as timeout‑retry, adaptive load balancing, and circuit breaking to improve overall system stability.

JD Retail Technology

In the enterprise business domain, the Jinli platform provides a one‑stop SaaS solution for employee benefits, marketing, and incentives, serving all company employees; high availability is therefore critical. This article introduces how, before a major sales promotion, chaos‑engineering exercises were used to lower the application’s mean time to recovery (MTTR).

MTTR (Mean Time To Recovery) is the average time required to restore a product or system after a failure, covering the entire interruption period from fault occurrence to full operational recovery.

The article asks whether fault handling must rely on manual monitoring and feedback, or whether automation and targeted strategies can achieve rapid mitigation ("stopping the bleeding") and improve system stability.

It proceeds to answer these questions through a combination of short‑term measures and long‑term solutions.

Faults are ubiquitous and unavoidable.

The discussion starts with two triggers: host restarts and chaos‑engineering drills against underlying services. Both degraded system availability and performance for extended periods, especially on core interfaces that many downstream services and the customer experience depend on.

From the client's perspective, a large number of interface timeouts (including order submission) and sharp drops in availability triggered customer complaints, with the impact amplified for major B2B customers.

Since faults cannot be eliminated, the approach is to embrace them and build capabilities to detect and mitigate them, ensuring high availability.

Because over 90% of internal calls are JSF RPC calls, the focus shifts to JSF middleware’s fault‑tolerance capabilities, namely timeout‑retry, adaptive load balancing, and service circuit breaking.

Practice is the only test of truth.

PART 3.1 – About Timeout and Retry

Improper or missing timeout settings lead to slow responses that can cascade into application‑wide failures. Both internal services and external dependencies (HTTP or middleware) should have reasonable timeout‑retry policies.

Read‑heavy services can safely retry (e.g., two retries after a reasonable timeout), while write‑heavy services generally should not retry unless they are idempotent.

Timeout values should be based on the service's TP99 (or TP95) response time plus roughly a 50% buffer; for example, a service with TP99 = 6 ms yields 9 ms, which rounds up to a 10 ms timeout, paired with two retries.
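The TP99-plus-buffer rule above is simple arithmetic, but it is worth making explicit. The following sketch (the function names and sample latencies are hypothetical, not part of JSF) derives a timeout from observed latency samples using a nearest-rank percentile and a 50% buffer:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def suggest_timeout_ms(latencies_ms, p=99, buffer=0.5):
    """Timeout = TPp latency plus a relative buffer, rounded up to whole ms."""
    tp = percentile(latencies_ms, p)
    return math.ceil(tp * (1 + buffer))

# Hypothetical sample with TP99 = 6 ms, as in the article's example.
latencies = [3, 4, 4, 5, 5, 5, 6, 6, 6, 6]
print(suggest_timeout_ms(latencies))  # 9 -> round up to 10 ms in practice
```

In production the percentile would come from monitoring data rather than a raw sample list, but the shape of the calculation is the same.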

Retry attempts should be limited (typically 2‑3) to avoid excessive load that can resemble a DDoS attack; retries work best when combined with circuit breaking and fast‑failure mechanisms.
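Putting the two rules together (retry only idempotent calls, and cap the attempt count), a minimal retry wrapper might look like the sketch below. It is illustrative only; the function names and the `TimeoutError` convention are assumptions, not JSF's actual API:

```python
import time

def call_with_retry(fn, *, idempotent, max_retries=2, timeout_s=0.01, backoff_s=0.0):
    """Invoke fn(timeout_s); retry on timeout only if the call is idempotent,
    and at most max_retries additional times to avoid a self-inflicted DDoS."""
    attempts = 1 + (max_retries if idempotent else 0)
    last_exc = None
    for _ in range(attempts):
        try:
            return fn(timeout_s)
        except TimeoutError as exc:
            last_exc = exc
            if backoff_s:
                time.sleep(backoff_s)  # optional spacing between attempts
    raise last_exc

# Usage: a provider that times out twice, then succeeds.
calls = {"n": 0}
def flaky(timeout_s):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated slow provider")
    return "ok"

print(call_with_retry(flaky, idempotent=True))  # "ok" on the third attempt
```

A non-idempotent call passes `idempotent=False` and gets exactly one attempt, which is the safe default for writes.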

Beyond introducing these mechanisms, it is essential to verify their effectiveness.

Summary of timeout‑retry: proper configuration smooths request flow and, combined with failover, significantly improves interface availability.

Supplement: method‑level timeout‑retry configuration is not supported by JSF annotations; XML configuration is required for method‑level settings.

PART 3.2 – About Adaptive Load Balancing

The "shortestresponse" adaptive load‑balancing strategy aims to reduce traffic to weaker provider nodes, preventing them from degrading overall consumer latency and availability.

Potential issues include over‑concentration on high‑performance instances, response time not fully reflecting throughput, and limited benefit when provider response times are similar.

The implementation resembles a Power‑of‑Two‑Choices (P2C) algorithm, selecting two providers and comparing a weighted metric (average response time × current request count) to choose the faster one, thereby distributing load toward higher‑throughput machines.
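The P2C selection described above can be sketched in a few lines. This is a generic illustration of the technique, not JSF's implementation; the `Provider` fields and the exact weighted metric (average response time × in-flight requests) follow the article's description:

```python
import random

class Provider:
    def __init__(self, name, avg_rt_ms):
        self.name = name
        self.avg_rt_ms = avg_rt_ms  # moving-average response time
        self.in_flight = 0          # currently outstanding requests

def pick_p2c(providers, rng=random):
    """Power-of-two-choices: sample two nodes at random, then keep the one
    with the smaller (avg response time * active requests) score."""
    a, b = rng.sample(providers, 2)
    return a if a.avg_rt_ms * a.in_flight <= b.avg_rt_ms * b.in_flight else b

# Usage: a fast node is preferred over a degraded one at equal load.
fast = Provider("fast", avg_rt_ms=10); fast.in_flight = 1
slow = Provider("slow", avg_rt_ms=100); slow.in_flight = 1
print(pick_p2c([fast, slow]).name)  # "fast"
```

Sampling only two candidates keeps selection O(1) per call while still steering most traffic toward healthier, higher-throughput machines.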

Fault injection with adaptive load balancing (illustrated by the following images) demonstrates improved availability under simulated network delays.

Summary of adaptive load balancing: by directing traffic to stronger nodes from the first call, the system maintains higher availability and performance even during injected faults.

PART 3.3 – About Service Circuit Breaking

Analogous to an electrical circuit breaker, service circuit breaking protects the call chain by halting requests to an unstable service once failure thresholds are exceeded, preventing downstream impact.

It wraps vulnerable function calls in a circuit‑breaker object that monitors error rates and latency; when thresholds are crossed, calls are blocked for a defined window.
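The wrap-and-block behavior can be sketched as a small state machine. This is a minimal illustration of the pattern (consecutive-failure threshold, timed open window, single half-open probe), not the unified component mentioned below; all names and thresholds are assumptions:

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures, fails fast for
    `open_window_s` seconds, then lets one probe call through (half-open)."""
    def __init__(self, threshold=3, open_window_s=10.0, clock=time.monotonic):
        self.threshold = threshold
        self.open_window_s = open_window_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.open_window_s:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # window elapsed: half-open, allow a probe
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0  # success closes the circuit again
        return result
```

The key design point is that while the breaker is open, the vulnerable dependency receives no traffic at all; recovery depends entirely on what happens when the window expires, which is exactly the fail-back gap the next paragraph describes.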

Fault injection with circuit breaking (shown in the images) demonstrates that the circuit opens (trips) during the fault window; however, without a fail‑back strategy, and with a short open window, availability may drop again once the window expires and traffic is re‑admitted to the still‑unhealthy provider.

Supplement: a unified circuit‑breaker component is available across the organization to avoid duplicate implementations.

Note: different resilience mechanisms can conflict; circuit breaking converts partial failures into total failures to prevent cascade, which may oppose the goal of partial‑failure tolerance in distributed systems.

Conclusion: capabilities are means, stability is the goal. Continuous balancing between business demands and stability engineering is required to build a high‑availability architecture that supports long‑term growth.

References:

The power of two random choices: https://brooker.co.za/blog/2012/01/17/two-random.html

Load balancing: https://cn.dubbo.apache.org/zh-cn/overview/core-features/load-balance/#shortestresponse
