Design and Implementation of a Fault Injection Platform for High‑Availability Backend Systems
This article describes the motivation, architecture, and implementation details of a fault‑injection platform that uses Java Instrumentation and dynamic bytecode weaving to validate high‑availability strategies, isolate failures, and support zero‑cost, runtime fault injection for complex distributed backend services.
As Qunar's ticketing business grew to hundreds of applications, tighter coupling and longer call chains made a distributed, highly‑available architecture increasingly hard to build. To verify that fault‑recovery plans actually work in production, the team built a fault‑injection platform.
Background – The system topology (see Fig. 1) shows deep dependency chains and several failure scenarios: crashes in weak dependencies, traffic spikes that trigger cascading failures, and hardware or network outages. The risk compounds as the number of dependent services grows, because a request can fail whenever any service on its call path fails, and the result is measurable downtime.
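To put rough numbers on how availability degrades as dependencies multiply, the sketch below assumes that every dependency is required and that failures are independent, so the availability of a call chain is the product of its parts. The 99.9% figure and the class name are illustrative assumptions, not Qunar measurements. Under those assumptions, 50 dependencies at 99.9% each leave the chain at roughly 95.1%.

```java
// Illustrative only: assumes every dependency is required and fails independently.
final class CompoundAvailability {
    /** Availability of a call chain that needs all n dependencies, each equally available. */
    static double chainAvailability(double perService, int n) {
        return Math.pow(perService, n);
    }

    /** Expected downtime in minutes over a 30-day month for a given availability. */
    static double monthlyDowntimeMinutes(double availability) {
        double minutesPerMonth = 30 * 24 * 60; // 43,200 minutes
        return (1.0 - availability) * minutesPerMonth;
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 10, 50}) {
            double a = chainAvailability(0.999, n);
            System.out.printf("%d dependencies -> availability %.4f, about %.0f min/month down%n",
                    n, a, monthlyDowntimeMinutes(a));
        }
    }
}
```

Weak dependencies that degrade gracefully drop out of the product, which is why the degradation paths themselves need to be verified.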
High‑availability methodology – A set of best practices (Fig. 2) is presented, but the article argues that completing these steps does not by itself guarantee high availability. Runtime fault injection is needed to confirm that degradation and circuit‑breaker strategies actually work.
Fault‑injection platform overview – The platform consists of four components:
Front‑end display (WEB) showing service topology and allowing selection of methods for fault injection.
Deploy system that packages and launches the Agent and Binder on target APP machines.
Server that distributes commands, records fault‑injection state, performs permission checks, and receives Agent feedback.
Agent and Binder programs that perform bytecode enhancement on target JVMs without restarting them.
The overall architecture is illustrated in Fig. 3.
Agent architecture – Two weaving approaches are discussed: static weaving (requires restart) and dynamic weaving (runtime injection). The platform adopts dynamic weaving with a standardized API to avoid class‑loader conflicts.
The Agent uses the JDK Instrumentation‑API and HotSwap to inject code at runtime, supporting use cases such as fault injection, tracing, traffic recording, and dynamic logging.
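A minimal sketch of how an agent can hook into a running JVM through java.lang.instrument is shown here. The class names FaultAgent and TargetTransformer are hypothetical, and the transformer returns the class bytes unchanged where a real agent would rewrite them with a bytecode library; the article does not show the platform's actual code. Only the Instrumentation and ClassFileTransformer calls are real JDK API.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

// Hypothetical agent entry point; only the java.lang.instrument calls are real JDK API.
final class FaultAgent {
    /** Called by the JVM when the agent jar is attached to an already running process. */
    public static void agentmain(String targetClass, Instrumentation inst) throws Exception {
        TargetTransformer transformer = new TargetTransformer(targetClass);
        // canRetransform = true lets already-loaded classes pass through the transformer again.
        inst.addTransformer(transformer, true);
        for (Class<?> loaded : inst.getAllLoadedClasses()) {
            if (loaded.getName().equals(targetClass) && inst.isModifiableClass(loaded)) {
                inst.retransformClasses(loaded); // re-runs transform() without a restart
            }
        }
    }
}

// Hypothetical transformer that selects one class by name.
final class TargetTransformer implements ClassFileTransformer {
    private final String internalName; // JVM form, e.g. com/example/OrderService

    TargetTransformer(String className) {
        this.internalName = className.replace('.', '/');
    }

    @Override
    public byte[] transform(ClassLoader loader, String className, Class<?> classBeingRedefined,
                            ProtectionDomain protectionDomain, byte[] classfileBuffer) {
        if (!internalName.equals(className)) {
            return null; // null tells the JVM to leave this class unchanged
        }
        // A real agent would rewrite the bytes here; this sketch passes them through untouched.
        return classfileBuffer;
    }
}
```

In a packaged agent, the entry class would be declared public and named in the jar manifest's Agent-Class attribute, with Can-Retransform-Classes set to true so that retransformClasses is permitted.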
Event model – Three event types (BEFORE, RETURN, THROWS) allow code injection before method execution, before return, or after an exception. The following code snippet shows the structure of the injected logic:
    // BEFORE
    try {
        /* do something... */
        foo();
        // RETURN
        return;
    } catch (Throwable e) {
        // THROWS
    }
The model enables three capabilities: returning a custom result before execution, modifying the return value, or replacing an exception with a different outcome.
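The three capabilities can be sketched as a small listener interface plus a wrapper that mimics what the woven bytecode does around the original method. FaultEventListener and Weaver are hypothetical names introduced for illustration; the platform's real API is not shown in the article. One simplification to note: this sketch treats a null result from before() as "continue", so it cannot short-circuit with a null return value.

```java
import java.util.concurrent.Callable;

// Hypothetical listener API illustrating the BEFORE / RETURN / THROWS events.
interface FaultEventListener {
    /** BEFORE: return non-null to skip the real method and return this value instead. */
    default Object before(String method) { return null; }

    /** RETURN: may replace the value the real method produced. */
    default Object onReturn(String method, Object result) { return result; }

    /** THROWS: may rethrow (the default), throw something else, or return a substitute. */
    default Object onThrows(String method, Throwable error) throws Throwable { throw error; }
}

final class Weaver {
    /** Mimics the control flow that woven bytecode adds around the original method body. */
    static Object invoke(String method, Callable<Object> original, FaultEventListener listener)
            throws Throwable {
        Object early = listener.before(method);       // BEFORE
        if (early != null) {
            return early;                             // custom result, original never runs
        }
        Object result;
        try {
            result = original.call();
        } catch (Throwable e) {
            return listener.onThrows(method, e);      // THROWS
        }
        return listener.onReturn(method, result);     // RETURN
    }

    public static void main(String[] args) throws Throwable {
        FaultEventListener mock = new FaultEventListener() {
            @Override public Object before(String method) { return "mocked result"; }
        };
        System.out.println(invoke("quote", () -> "real result", mock));
    }
}
```

Injecting a fault is then a matter of installing a listener whose before() throws, or whose onReturn() returns a corrupted value, and removing the listener restores normal behavior.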
Class‑loader isolation – To avoid polluting the original application, the Agent and its libraries are loaded by a custom AgentClassLoader. The platform injects a Drill class into the BootstrapClassLoader for communication, then uses reflection to apply bytecode transformations to target classes loaded by the application’s ClassLoader.
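One common way to get this kind of isolation is a child-first class loader, sketched below under the assumption that the Agent's jars are passed in as URLs. The article does not show the real AgentClassLoader, so the loading order and the constructor here are illustrative guesses. For the bridge class, the JDK provides Instrumentation.appendToBootstrapClassLoaderSearch(JarFile), although the article does not say whether the platform uses that call.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical isolating loader: looks in the agent's own jars before asking the parent.
final class AgentClassLoader extends URLClassLoader {
    AgentClassLoader(URL[] agentJars, ClassLoader parent) {
        super(agentJars, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);
            // Core JDK classes must always come from the parent chain.
            if (c == null && !name.startsWith("java.")) {
                try {
                    c = findClass(name); // child-first: prefer the agent's own copy
                } catch (ClassNotFoundException ignored) {
                    // not in the agent jars; fall through to the parent
                }
            }
            if (c == null) {
                c = super.loadClass(name, false); // normal parent delegation
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }
}
```

Because the agent's libraries resolve inside this loader first, a logging or JSON library bundled with the agent cannot clash with a different version used by the application.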
Examples using Dubbo illustrate how the platform injects faults into client‑side proxies and how the Drill class invokes transformed methods.
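The original Dubbo examples are not reproduced here. As a stand-in, the sketch below uses the JDK's java.lang.reflect.Proxy to show the effect of a fault on a client-side proxy: the call fails before it reaches the remote side, and the caller's degradation path takes over. This is neither Dubbo's API nor the platform's bytecode weaving, and PriceService, FaultyProxy, and QuoteClient are hypothetical names.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.Set;

// Hypothetical remote interface standing in for a Dubbo service reference.
interface PriceService {
    int quote(String flight);
}

final class FaultyProxy {
    /** Wraps a client so that calls to the named methods fail with an injected exception. */
    static <T> T wrap(Class<T> api, T target, Set<String> faultyMethods) {
        InvocationHandler handler = (proxy, method, args) -> {
            if (faultyMethods.contains(method.getName())) {
                throw new IllegalStateException("injected fault: " + method.getName());
            }
            try {
                return method.invoke(target, args);
            } catch (InvocationTargetException e) {
                throw e.getCause(); // surface the target's own exception unchanged
            }
        };
        return api.cast(Proxy.newProxyInstance(api.getClassLoader(), new Class<?>[] {api}, handler));
    }
}

final class QuoteClient {
    /** The degradation path that fault injection is meant to exercise. */
    static int quoteOrDefault(PriceService service, String flight, int fallback) {
        try {
            return service.quote(flight);
        } catch (RuntimeException e) {
            return fallback;
        }
    }
}
```

With the fault active, quoteOrDefault returns the fallback; with an empty set of faulty methods, the call passes straight through to the real client.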
Challenges and solutions – The original approach generated large, hard‑to‑debug Drill classes. The new design isolates classes, makes event implementation compilable, and supports custom return values. A dedicated AgentClassLoader loads all Agent code, and the platform can dynamically load transformed classes without restarting the service.
Benefits – The platform supports zero‑cost onboarding, no service restarts for fault injection/removal, full topology visualization, permission checks via QSSO, automatic deployment, and extensibility for modules such as mock, traffic recording, and other fault‑injection scenarios.
Conclusion – The core of the fault‑injection platform is the Agent component, a pure‑Java AOP solution built on the Instrumentation‑API. It provides developers with a flexible, low‑overhead way to perform bytecode weaving for fault‑injection, traffic recording, and other runtime instrumentation needs.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.