Fault Drill: Traffic Replication and Fault Injection Platform for Hotel Backend

The Fault‑Drill platform for hotel back‑end services combines real‑time traffic replication to shadow clusters with UI‑driven fault injection via a java‑agent, enabling developers to validate incident‑response plans, measure latency impacts, and reduce MTTR by testing normal and abnormal conditions on live traffic.

Meituan Technology Team

Background: High request volume and system complexity cause frequent failures. Lack of reliable incident response plans leads to increased MTTR.

Solution: Establish a regular fault‑drill mechanism, with supporting tools, to validate those plans and confirm that services behave correctly under both normal and abnormal conditions.

The overall plan covers capacity and performance assessment as well as fault‑drill exercises, which map service dependencies and verify contingency plans.

Traffic Replication System: Copies live traffic to shadow clusters for stress testing and fault‑scenario validation. Core features: real‑time traffic copying, configurable sampling, method‑level control, low‑cost integration.

Key design steps:

Developers annotate methods with @Copy(attribute = CopyMethodAttribute.READ_METHOD, simplingRate = 1.0f) to enable copying.

Live traffic is asynchronously forwarded to a copy‑server.

The copy‑server determines the target shadow cluster and the amplification factor, then replicates the traffic accordingly.
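
The three design steps can be illustrated with a minimal, self-contained Java sketch. It is not the platform's implementation: the annotation and enum are redeclared locally so the example compiles on its own, and DemoService, findCopy, shouldCopy, amplify, the WRITE_METHOD constant, and the single-thread forwarder are all assumptions made for illustration. Only the @Copy usage line, including the simplingRate spelling, comes from the article.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only: every type here is a local stand-in, not the platform's real API.
class TrafficCopySketch {

    // WRITE_METHOD is an assumed counterpart; the article only shows READ_METHOD.
    public enum CopyMethodAttribute { READ_METHOD, WRITE_METHOD }

    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    public @interface Copy {
        CopyMethodAttribute attribute();
        float simplingRate() default 1.0f; // spelling follows the article's annotation
    }

    // Hypothetical business service with one copy-enabled method.
    public static class DemoService {
        @Copy(attribute = CopyMethodAttribute.READ_METHOD, simplingRate = 1.0f)
        public String queryPrepayList(String hotelId) {
            return "prepay-list-for-" + hotelId;
        }
    }

    // Step 1: look up the copy configuration declared on a method, or null if absent.
    public static Copy findCopy(Class<?> type, String methodName) {
        for (Method m : type.getMethods()) {
            if (m.getName().equals(methodName)) {
                return m.getAnnotation(Copy.class);
            }
        }
        return null;
    }

    // Sampling decision: copy the call when the random draw falls below the configured rate.
    public static boolean shouldCopy(Copy copy, double randomValue) {
        return copy != null && randomValue < copy.simplingRate();
    }

    // Step 3 (copy-server side): replay one captured call `amplification` times.
    public static List<String> amplify(String capturedCall, int amplification) {
        List<String> replayed = new ArrayList<>();
        for (int i = 0; i < amplification; i++) {
            replayed.add(capturedCall);
        }
        return replayed;
    }

    public static void main(String[] args) throws Exception {
        Copy copy = findCopy(DemoService.class, "queryPrepayList");
        ExecutorService forwarder = Executors.newSingleThreadExecutor();

        // The live call is served first; copying must not change its result.
        String result = new DemoService().queryPrepayList("h-1");

        if (shouldCopy(copy, ThreadLocalRandom.current().nextDouble())) {
            // Step 2: forward off the request thread so the live path is not blocked.
            forwarder.submit(() -> System.out.println(
                "replayed " + amplify("queryPrepayList(h-1)", 5).size() + " calls"));
        }
        forwarder.shutdown();
        forwarder.awaitTermination(1, TimeUnit.SECONDS);
        System.out.println(result);
    }
}
```

Passing the random draw into shouldCopy as a parameter keeps the sampling decision deterministic under test. A real integration would presumably intercept annotated methods through an aspect or agent rather than an explicit call in main.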

Fault‑Drill System: Provides a UI‑driven platform to inject and recover from faults at the AppKey (cluster) level without requiring root privileges.

The client side loads a javaagent that fetches scripted fault actions (e.g., Thread.sleep, throw TException), compiles them into a method, and executes that method before the target method runs.
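
The "run the fault action before the target method" idea can be sketched without a real agent. The example below swaps in a JDK dynamic proxy in place of javaagent bytecode instrumentation, so it shows only the interception order, not how the platform compiles scripts. GoodsService, ACTIVE_FAULTS, delay, fail, and withFaults are hypothetical names, and IllegalStateException stands in for Thrift's TException so the sketch has no external dependencies.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only: a dynamic proxy stands in for agent-based instrumentation.
class FaultInjectionSketch {

    // A fault action runs before the target method; it may delay or throw.
    public interface FaultAction {
        void run() throws Exception;
    }

    // Hypothetical service interface standing in for a real backend dependency.
    public interface GoodsService {
        String queryPrepayList(String hotelId);
    }

    // Method name -> active fault. In the platform these would be fetched from the server.
    public static final Map<String, FaultAction> ACTIVE_FAULTS = new ConcurrentHashMap<>();

    // Latency fault, analogous to a scripted Thread.sleep.
    public static FaultAction delay(long millis) {
        return () -> Thread.sleep(millis);
    }

    // Exception fault; an unchecked exception is used here in place of TException.
    public static FaultAction fail(String message) {
        return () -> { throw new IllegalStateException(message); };
    }

    // Wraps `target` so that any active fault for a method runs before the real call.
    public static <T> T withFaults(Class<T> api, T target) {
        Object proxy = Proxy.newProxyInstance(
            api.getClassLoader(),
            new Class<?>[] {api},
            (self, method, args) -> {
                FaultAction fault = ACTIVE_FAULTS.get(method.getName());
                if (fault != null) {
                    fault.run(); // injected behaviour executes first
                }
                try {
                    return method.invoke(target, args);
                } catch (InvocationTargetException e) {
                    throw e.getCause(); // surface the target's own exception unchanged
                }
            });
        return api.cast(proxy);
    }

    public static void main(String[] args) {
        GoodsService service = withFaults(GoodsService.class, id -> "prepay-list-for-" + id);
        System.out.println(service.queryPrepayList("h-1")); // no fault configured

        ACTIVE_FAULTS.put("queryPrepayList", delay(50)); // inject latency
        long start = System.nanoTime();
        service.queryPrepayList("h-1");
        System.out.println("with delay: " + (System.nanoTime() - start) / 1_000_000 + " ms");

        ACTIVE_FAULTS.put("queryPrepayList", fail("injected failure")); // inject an error
        try {
            service.queryPrepayList("h-1");
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }

        ACTIVE_FAULTS.remove("queryPrepayList"); // recover
        System.out.println(service.queryPrepayList("h-1"));
    }
}
```

Because the fault table is consulted on every call, removing an entry restores normal behaviour immediately. This mirrors the inject-then-recover workflow the platform exposes, although the real agent works at the bytecode level and does not require the target to be called through an interface.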

Server side stores user configurations and dispatches copied traffic to shadow clusters, applying amplification and fault injection logic.
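
One way to picture the server-side configuration store is a map from AppKey to the fault rules currently active for that cluster. The sketch below is an assumption about the shape of that data, not the platform's schema: FaultConfigStoreSketch, FaultRule, its two fields, the action strings, and the example AppKey are all hypothetical.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative sketch only: an in-memory stand-in for the server's configuration store.
class FaultConfigStoreSketch {

    // One user-configured fault: which method to hit and what to do (field names assumed).
    public static final class FaultRule {
        public final String targetMethod;
        public final String action;

        public FaultRule(String targetMethod, String action) {
            this.targetMethod = targetMethod;
            this.action = action;
        }
    }

    // AppKey (cluster identifier) -> rules currently injected for that cluster.
    private final Map<String, List<FaultRule>> rulesByAppKey = new ConcurrentHashMap<>();

    // Inject: register a rule so that agents in the cluster can pick it up.
    public void inject(String appKey, FaultRule rule) {
        rulesByAppKey.computeIfAbsent(appKey, k -> new CopyOnWriteArrayList<>()).add(rule);
    }

    // Recover: drop every rule for the cluster in one step.
    public void recover(String appKey) {
        rulesByAppKey.remove(appKey);
    }

    // What an agent would receive when it asks for its cluster's active rules.
    public List<FaultRule> rulesFor(String appKey) {
        return rulesByAppKey.getOrDefault(appKey, Collections.emptyList());
    }

    public static void main(String[] args) {
        FaultConfigStoreSketch store = new FaultConfigStoreSketch();
        String appKey = "com.example.hotel.shadow"; // hypothetical AppKey

        store.inject(appKey, new FaultRule("queryPrepayList", "SLEEP_200MS"));
        store.inject(appKey, new FaultRule("queryPrepayList", "THROW_TEXCEPTION"));
        System.out.println("active rules: " + store.rulesFor(appKey).size());

        store.recover(appKey);
        System.out.println("after recover: " + store.rulesFor(appKey).size());
    }
}
```

Keying rules by AppKey matches the article's statement that faults are injected and recovered at cluster level. A production store would also need persistence and per-rule recovery, which this sketch leaves out.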

Case study: Using the system, 5,000 calls to distributeGoodsService.queryPrepayList were replicated 5‑fold to a target cluster. Redis and Thrift faults were then injected into that cluster, which produced a measurable latency impact and made the recovery behavior observable.

Conclusion: The Fault‑Drill platform combines traffic replication and fault injection to improve system availability for hotel backend services, with future work on busy‑time traffic capture, response collection, and more complex fault scenarios.
