Fault Drill: Traffic Replication and Fault Injection Platform for Hotel Backend
The Fault‑Drill platform for hotel back‑end services combines real‑time traffic replication to shadow clusters with UI‑driven fault injection via a java‑agent, enabling developers to validate incident‑response plans, measure latency impacts, and reduce MTTR by testing normal and abnormal conditions on live traffic.
Background: High request volume and system complexity cause frequent failures. Lack of reliable incident response plans leads to increased MTTR.
Solution: Propose a regular fault‑drill mechanism and tools to validate plans, ensuring services behave correctly under both normal and abnormal conditions.
The overall plan includes capacity and performance assessment, plus fault‑drill exercises that map service dependencies and verify contingency plans.
Traffic Replication System: Copies live traffic to shadow clusters for stress testing and fault‑scenario validation. Core features: real‑time traffic copying, configurable sampling, method‑level control, low‑cost integration.
Key design steps:
Developers annotate methods with @Copy(attribute = CopyMethodAttribute.READ_METHOD, simplingRate = 1.0f) to enable copying.
Live traffic is asynchronously forwarded to a copy‑server.
Copy‑server determines target shadow cluster and amplification factor, then replicates traffic accordingly.
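The three design steps above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the platform's actual code: the source shows only how the @Copy annotation is used, so its declaration here is a guess, and CopySink, TrafficCopier, Amplifier, PrepayService, and CopyDemo are hypothetical names introduced for the example. The attribute name simplingRate is spelled as in the article.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

// Assumed declaration: the article shows only the annotation's usage.
enum CopyMethodAttribute { READ_METHOD }

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Copy {
    CopyMethodAttribute attribute();
    float simplingRate() default 1.0f; // spelling follows the article
}

// Hypothetical stand-in for the client that talks to the copy-server.
interface CopySink {
    void forward(String methodName, Object[] args);
}

final class TrafficCopier {
    private final ExecutorService pool = Executors.newSingleThreadExecutor();
    private final CopySink sink;

    TrafficCopier(CopySink sink) { this.sink = sink; }

    // Samples the call at the annotated rate and forwards it off the caller's thread.
    boolean maybeCopy(Method method, Object[] args) {
        Copy copy = method.getAnnotation(Copy.class);
        if (copy == null) return false;                       // method-level control
        if (ThreadLocalRandom.current().nextFloat() >= copy.simplingRate()) return false;
        pool.submit(() -> sink.forward(method.getName(), args)); // asynchronous forwarding
        return true;
    }

    // Waits briefly for queued copies to be delivered, then stops the worker.
    void shutdown() {
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

// Copy-server side: replay each copied request `factor` times on the shadow cluster.
final class Amplifier {
    static <T> List<T> amplify(T request, int factor) {
        return Collections.nCopies(factor, request);
    }
}

// Hypothetical service with one method opted in to copying.
final class PrepayService {
    @Copy(attribute = CopyMethodAttribute.READ_METHOD, simplingRate = 1.0f)
    public String queryPrepayList(String hotelId) { return "prepay:" + hotelId; }
}

final class CopyDemo {
    // Sends one annotated call through the copier and returns what was forwarded.
    static List<String> run() {
        List<String> forwarded = new CopyOnWriteArrayList<>();
        TrafficCopier copier =
            new TrafficCopier((name, args) -> forwarded.add(name + ":" + args[0]));
        try {
            Method m = PrepayService.class.getDeclaredMethod("queryPrepayList", String.class);
            copier.maybeCopy(m, new Object[] {"h1"});
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(e);
        } finally {
            copier.shutdown();
        }
        return forwarded;
    }
}
```

In a real deployment the call to maybeCopy would sit in an interceptor or RPC filter rather than in application code, which is presumably what keeps integration cost low; the sketch leaves that wiring out.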
Fault‑Drill System: Provides a UI‑driven platform to inject and recover from faults at the AppKey (cluster) level without requiring root privileges.
The client side loads a Java agent (javaagent), fetches scripted fault actions (e.g., Thread.sleep, throw TException), compiles them into a method, and executes that method before the target method runs.
Server side stores user configurations and dispatches copied traffic to shadow clusters, applying amplification and fault injection logic.
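The core idea on the client side, running a configured fault action before the target method and removing it again on recovery, can be illustrated without bytecode instrumentation. The sketch below swaps in a JDK dynamic proxy for the platform's javaagent, so it shows only the inject/recover behavior, not how the agent compiles scripts. FaultAction, FaultRegistry, FaultProxy, and PrepayQuery are hypothetical names, and IllegalStateException stands in for Thrift's TException, which is not available in the standard library.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// A fault action runs before the target method; it may sleep or throw.
interface FaultAction {
    void run() throws Exception;
}

// Holds the currently injected faults, keyed by method name.
final class FaultRegistry {
    private final Map<String, FaultAction> actions = new ConcurrentHashMap<>();

    void inject(String methodName, FaultAction action) { actions.put(methodName, action); }
    void recover(String methodName) { actions.remove(methodName); }
    FaultAction lookup(String methodName) { return actions.get(methodName); }
}

final class FaultProxy {
    // Wraps `target` so that any registered fault runs before each call.
    @SuppressWarnings("unchecked")
    static <T> T wrap(Class<T> iface, T target, FaultRegistry registry) {
        InvocationHandler handler = (proxy, method, args) -> {
            FaultAction action = registry.lookup(method.getName());
            if (action != null) {
                action.run(); // e.g. a delay or an exception
            }
            try {
                return method.invoke(target, args);
            } catch (InvocationTargetException e) {
                throw e.getCause(); // surface the target's own exception
            }
        };
        return (T) Proxy.newProxyInstance(
            iface.getClassLoader(), new Class<?>[] {iface}, handler);
    }
}

// Hypothetical service interface used for the demonstration.
interface PrepayQuery {
    String queryPrepayList(String hotelId);
}

final class FaultDemo {
    public static void main(String[] args) {
        FaultRegistry registry = new FaultRegistry();
        PrepayQuery service =
            FaultProxy.wrap(PrepayQuery.class, id -> "prepay:" + id, registry);

        // Inject a 200 ms delay, similar in spirit to a scripted Thread.sleep.
        registry.inject("queryPrepayList", () -> Thread.sleep(200));
        long start = System.nanoTime();
        service.queryPrepayList("h1");
        System.out.println("with fault (ms): " + (System.nanoTime() - start) / 1_000_000);

        // Recover: later calls run without the fault.
        registry.recover("queryPrepayList");
        System.out.println(service.queryPrepayList("h1"));
    }
}
```

A proxy only intercepts calls made through the wrapped interface, whereas an agent can instrument any loaded class without code changes, which is likely why the platform chose the agent approach.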
Case study: Using the system, 5,000 calls to distributeGoodsService.queryPrepayList were replicated 5‑fold to a target cluster, and subsequent Redis and Thrift faults were injected, demonstrating measurable latency impact and recovery behavior.
Conclusion: The Fault‑Drill platform combines traffic replication and fault injection to improve system availability for hotel backend services, with future work on busy‑time traffic capture, response collection, and more complex fault scenarios.
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.