
How Qunar’s Fault Injection Platform Ensures High‑Availability in Complex Backend Systems

Qunar built a fault‑injection platform that dynamically injects runtime errors into its densely coupled backend services in order to verify degradation and circuit‑breaker strategies. Its four‑part architecture comprises a web UI, a deployment system, a command server, and Java agents that use the Instrumentation API for bytecode weaving.

dbaplus Community

Background

Since its founding in 2005, Qunar’s ticket business has grown to hundreds of applications whose inter‑service dependencies and call chains have become increasingly tangled, making the construction of a distributed, highly‑available architecture a major challenge. To validate fault‑handling plans during runtime, Qunar created a fault‑injection (chaos) platform.

High‑Availability Methodology

The platform follows a typical high‑availability practice: identify every factor that can cause unavailability; put isolation, rate‑limiting, and circuit‑breaker mechanisms in place; then verify those mechanisms by injecting faults in a controlled manner.

[Figure: illustration of the typical high‑availability practice]

Fault‑Injection Platform Overview

The platform’s purpose is to test whether predefined fault‑handling plans actually take effect. It supports fault types such as runtime exceptions and timeouts, which are injected dynamically into selected services to trigger the corresponding degradation or circuit‑breaker logic.
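As a rough illustration of what "verifying a degradation plan" means, the sketch below wraps a call with a fallback and then swaps in a call that always throws. All names here (DegradationSketch, callWithFallback, injectException) are hypothetical and are not taken from Qunar's platform, which injects faults through bytecode weaving rather than by wrapping suppliers.

```java
import java.util.function.Supplier;

/** Sketch: checking that a degradation (fallback) path fires when a fault is injected. */
final class DegradationSketch {

    /** Calls the primary dependency; on any runtime failure, degrades to the fallback. */
    static <T> T callWithFallback(Supplier<T> primary, Supplier<T> fallback) {
        try {
            return primary.get();
        } catch (RuntimeException e) {
            return fallback.get();
        }
    }

    /** Hypothetical injected fault: stands in for the real call and always throws. */
    static <T> Supplier<T> injectException(Supplier<T> original) {
        return () -> {
            throw new IllegalStateException("injected fault");
        };
    }

    public static void main(String[] args) {
        Supplier<String> priceService = () -> "live-price";
        Supplier<String> cachedPrice = () -> "cached-price";

        // Healthy path: the primary answer is returned.
        System.out.println(callWithFallback(priceService, cachedPrice)); // prints "live-price"
        // Faulty path: the injected exception must trigger the fallback.
        System.out.println(callWithFallback(injectException(priceService), cachedPrice)); // prints "cached-price"
    }
}
```

If the second call did not print the cached value, the degradation plan would not be taking effect, which is exactly the kind of gap a fault‑injection run is meant to expose.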

Overall Architecture

The platform consists of four major parts:

Web UI: visualizes service topology, clusters, and methods; allows users to select a method for fault injection or removal.

Deploy System: publishes the Agent and Binder packages to target application machines and starts them. It receives the AppCode and target IP from the UI, downloads the appropriate JAR, and launches it.

Command Server: distributes commands, records fault‑injection state, performs permission checks, and receives responses from Agents via long‑lived connections.

Agent & Binder: the Agent proxies the target application and performs bytecode weaving at runtime; the Binder locates the target JVM process based on the supplied AppCode and port.

Agent Architecture

The Agent uses Java’s Instrumentation API and HotSwap to weave bytecode without restarting the JVM. Two weaving approaches exist:

Static weaving: performed at class‑generation time, requiring a JVM restart to apply changes.

Dynamic weaving: performed at runtime by renaming the original method and creating a proxy method that delegates to the injected logic.

Dynamic weaving was chosen for its flexibility, and a standardized API was built around it.
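For readers unfamiliar with the Instrumentation API, the following is a minimal sketch of how an agent can register a transformer and ask the JVM to re‑run it on an already loaded class. The class names (WeavingAgentSketch, TargetTransformer), the use of agentArgs to carry the target class name, and the placeholder weave() method are assumptions for illustration; the article does not show Qunar's actual transformer or which bytecode library it uses.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.lang.instrument.UnmodifiableClassException;
import java.security.ProtectionDomain;

/** Sketch of an agent entry point; all names here are illustrative. */
final class WeavingAgentSketch {

    /** Transformer that only touches one class, given by internal name (e.g. "com/example/OrderService"). */
    static final class TargetTransformer implements ClassFileTransformer {
        private final String targetInternalName;

        TargetTransformer(String targetInternalName) {
            this.targetInternalName = targetInternalName;
        }

        @Override
        public byte[] transform(ClassLoader loader, String className, Class<?> classBeingRedefined,
                                ProtectionDomain protectionDomain, byte[] classfileBuffer) {
            if (!targetInternalName.equals(className)) {
                return null; // null tells the JVM to leave this class unchanged
            }
            return weave(classfileBuffer);
        }

        /** Placeholder: a real agent would rewrite the bytes with a bytecode library here. */
        byte[] weave(byte[] original) {
            return original.clone();
        }
    }

    /** Invoked by the JVM when the agent is attached to a running process. */
    public static void agentmain(String agentArgs, Instrumentation inst) throws UnmodifiableClassException {
        // Assumption: agentArgs carries the internal name of the class to weave.
        TargetTransformer transformer = new TargetTransformer(agentArgs);
        inst.addTransformer(transformer, true); // true = eligible for retransformation
        for (Class<?> loaded : inst.getAllLoadedClasses()) {
            boolean isTarget = loaded.getName().replace('.', '/').equals(agentArgs);
            if (isTarget && inst.isModifiableClass(loaded)) {
                inst.retransformClasses(loaded); // re-runs the transformers on the loaded class
            }
        }
    }
}
```

Returning null for every non‑target class keeps the transformer cheap, since the JVM calls it for each class that is loaded or retransformed while it is registered.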

The event model defines three injection points:

BEFORE – code executed before the original method.

THROWS – code executed when the method throws an exception.

RETURN – code executed after the method returns normally.

Example code illustrating the three points:

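// BEFORE: runs before the original method body
try {
    /* do something... */
    foo();
    // RETURN: runs after the body completes normally
    return;
} catch (Throwable e) {
    // THROWS: runs when the body throws
}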

Using this model, the Agent can:

Return a custom result before the original method runs, skipping its execution.

Replace the return value or throw a different exception after the method completes.

Intercept a thrown exception and transform it into a different result or a normal return.
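The three capabilities above can be imitated without any bytecode weaving by using a JDK dynamic proxy, which makes the event model easier to experiment with. This is only a sketch under that substitution: the Listener interface, its method names, and PriceService are hypothetical, and the real Agent weaves these hooks into the target method instead of wrapping an interface. In this simplified version a null result from before() means "proceed", so it cannot express a custom null return.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;

final class EventModelSketch {

    /** Hypothetical listener with one hook per injection point. */
    interface Listener {
        /** BEFORE: return non-null to skip the original method and use this value instead. */
        default Object before(String method, Object[] args) { return null; }
        /** RETURN: receives the normal result; whatever is returned is what the caller sees. */
        default Object onReturn(String method, Object result) { return result; }
        /** THROWS: return normally to turn the exception into a result, or throw something else. */
        default Object onThrows(String method, Throwable error) throws Throwable { throw error; }
    }

    /** Hypothetical service used for the demo. */
    interface PriceService {
        String quote(String route);
    }

    static <T> T wrap(Class<T> iface, T target, Listener listener) {
        InvocationHandler handler = (proxy, method, args) -> {
            if (method.getDeclaringClass() == Object.class) {
                return method.invoke(target, args); // leave toString/equals/hashCode alone
            }
            Object early = listener.before(method.getName(), args);
            if (early != null) {
                return early;
            }
            Object result;
            try {
                result = method.invoke(target, args);
            } catch (InvocationTargetException e) {
                return listener.onThrows(method.getName(), e.getCause());
            }
            return listener.onReturn(method.getName(), result);
        };
        return iface.cast(Proxy.newProxyInstance(iface.getClassLoader(), new Class<?>[] {iface}, handler));
    }

    public static void main(String[] args) {
        PriceService real = route -> {
            if (route.isEmpty()) {
                throw new IllegalArgumentException("empty route");
            }
            return "price:" + route;
        };
        // BEFORE hook returns a custom result, so the original method never runs.
        PriceService mocked = wrap(PriceService.class, real, new Listener() {
            @Override public Object before(String method, Object[] a) { return "mocked"; }
        });
        // THROWS hook converts the exception into a normal return value.
        PriceService tolerant = wrap(PriceService.class, real, new Listener() {
            @Override public Object onThrows(String method, Throwable error) { return "fallback"; }
        });
        System.out.println(mocked.quote("PEK-SHA")); // prints "mocked"
        System.out.println(tolerant.quote(""));      // prints "fallback"
    }
}
```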

Class‑Loader Considerations

The Agent and its libraries are loaded by the AppClassLoader. To modify classes loaded by other class‑loaders (e.g., Tomcat’s WebappClassLoader), the Agent relies on the Instrumentation API, which supplies the target class’s bytecode regardless of which loader defined it. The Agent rewrites that bytecode, and the transformed class is then reloaded into the JVM.

Dubbo Example

When injecting a fault into a Dubbo RPC call, the steps are:

Apply AOP to the client‑side proxy of Service A calling Service B.

Start the Agent, which generates a Drill.invoke() method that throws a runtime exception.

Weave bytecode to insert Drill.invoke() at the beginning of the target method.

Change the Drill implementation to switch fault types (e.g., replace the exception with a 3‑second sleep).
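The steps above can be sketched in plain Java as follows. Only the name Drill.invoke() comes from the article; the Fault enum, the configure() method, and ServiceBClient are hypothetical stand‑ins, and the call to Drill.invoke() is written by hand here where the real Agent would weave it into the Dubbo client proxy.

```java
import java.util.concurrent.TimeUnit;

final class DrillSketch {

    /** Hypothetical fault types. */
    enum Fault { NONE, EXCEPTION, DELAY }

    /** Switchable fault hook; changing its configuration changes the fault without re-weaving. */
    static final class Drill {
        private static volatile Fault current = Fault.NONE;
        private static volatile long delayMillis = 3_000;

        static void configure(Fault fault, long millis) {
            delayMillis = millis;
            current = fault;
        }

        /** The call that would be inserted at the beginning of the target method. */
        static void invoke() {
            switch (current) {
                case EXCEPTION:
                    throw new RuntimeException("injected fault");
                case DELAY:
                    try {
                        TimeUnit.MILLISECONDS.sleep(delayMillis);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt(); // preserve the interrupt flag
                    }
                    break;
                default:
                    break;
            }
        }
    }

    /** Stand-in for the client-side proxy that Service A uses to call Service B. */
    static final class ServiceBClient {
        String call() {
            Drill.invoke(); // the line the Agent would weave in
            return "response from B";
        }
    }

    public static void main(String[] args) {
        ServiceBClient client = new ServiceBClient();
        System.out.println(client.call()); // prints "response from B"

        Drill.configure(Fault.EXCEPTION, 0);
        try {
            client.call();
        } catch (RuntimeException e) {
            System.out.println("caught: " + e.getMessage()); // prints "caught: injected fault"
        }

        Drill.configure(Fault.DELAY, 3_000); // the 3-second sleep mentioned above
        client.call();
        Drill.configure(Fault.NONE, 0);
    }
}
```

Keeping the woven call site fixed and varying only what Drill does means the target class needs to be transformed once, after which fault types can be switched by configuration alone.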

Challenges and Solutions

Initial implementations suffered from class‑pollution, cumbersome code generation, and difficulty debugging generated bytecode. The next‑generation Agent addresses three problems:

Class isolation – avoid contaminating the original application.

Compile‑time event definitions – make event logic compilable and type‑safe.

Custom result support – allow returning user‑defined outcomes.

A custom AgentClassLoader isolates all Agent classes from the application, and the woven bytecode uses reflection to invoke the specific event implementations.
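One common way to get this kind of isolation, shown here only as a sketch, is a URLClassLoader whose parent is null, so that it sees JDK core classes and its own JARs but none of the application's classes. The article does not describe how Qunar's AgentClassLoader is actually implemented; the constructor, the canLoad() and dispatch() helpers, and the use of Integer.parseInt as a stand‑in "event handler" are all assumptions.

```java
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;

final class IsolationSketch {

    /** Hypothetical isolating loader: a null parent means only bootstrap classes plus agentJars are visible. */
    static final class AgentClassLoader extends URLClassLoader {
        AgentClassLoader(URL[] agentJars) {
            super(agentJars, null);
        }
    }

    /** Reports whether the given loader can resolve a class by name. */
    static boolean canLoad(ClassLoader loader, String className) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    /** Looks up a static handler method in the agent loader and calls it reflectively. */
    static Object dispatch(ClassLoader agentLoader, String handlerClass, String method, String arg)
            throws ReflectiveOperationException {
        Class<?> type = Class.forName(handlerClass, true, agentLoader);
        Method m = type.getMethod(method, String.class);
        return m.invoke(null, arg); // null receiver because the handler is static
    }

    public static void main(String[] args) throws Exception {
        // No agent JARs are supplied in this demo, so only JDK core classes are reachable.
        try (AgentClassLoader agentLoader = new AgentClassLoader(new URL[0])) {
            System.out.println(canLoad(agentLoader, "java.lang.String"));              // prints "true"
            System.out.println(canLoad(agentLoader, IsolationSketch.class.getName())); // prints "false"
            // Stand-in for an event handler: any static method reachable from the agent loader.
            System.out.println(dispatch(agentLoader, "java.lang.Integer", "parseInt", "42")); // prints "42"
        }
    }
}
```

Because application code cannot reference classes that only the agent loader can see, a reflective lookup like dispatch() is the natural bridge between woven code and the isolated event implementations.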

Usage Guide

Typical usage consists of four steps:

Enter the target AppCode.

Select the method to fault‑inject.

Specify the target machine(s).

Trigger the fault injection.

Conclusion

The core of Qunar’s fault‑injection platform is the Agent component: a pure‑Java AOP framework built on the Instrumentation API that lets developers perform bytecode instrumentation for chaos testing, traffic recording, and other runtime extensions without restarting services.
