Improving Ctrip's AB Experiment Splitter: Design, Performance Optimization, and Backend Architecture
The article details Ctrip's challenges with multiple AB testing splitters, presents performance gains after migrating to a new splitter, and explains the comprehensive redesign covering overall architecture, interface consolidation, SDK slimming, and a custom distributed cache backend to achieve higher throughput and lower latency.
Background: Ctrip has been an early adopter of AB testing, using multiple AB splitter interfaces across apps, mini‑programs, and online pages, leading to issues such as interface confusion, degraded response efficiency under high traffic, and tight coupling with experiment configuration tables.
Improvement results: After most traffic was migrated to the new AB splitter via the slbportal tool, QPS rose from 200.7 to 290.2 (about 45% higher) and P99.9 latency fell from 363.1 ms to 5.2 ms (roughly a 70-fold reduction).
Improvement plan includes four parts:
Overall design: The team weighed a service-based architecture but ultimately adopted a resident SDK, which distributes the splitting logic across departments and maximizes efficiency.
Consolidation: Reducing dozens of department-specific splitter endpoints to one or two unified interfaces, which simplifies development and maintenance.
SDK redesign: Transforming the “fat” SDK into a “thin” SDK that only holds a minimal wide‑table of essential experiment fields, moving heavy queries to the backend and introducing a CopyOnWrite cache for rapid rule updates.
Backend selection: Rejecting qconfig due to scalability limits, evaluating Redis, and finally implementing a custom distributed cache based on Apache Ignite with a snapshot service that provides read‑write separation, real‑time updates, and high availability.
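The thin SDK and its CopyOnWrite cache can be illustrated with a minimal sketch. It is not Ctrip's implementation: the class name ThinSplitterSdk, the rule fields enabled and treatment_percent, and the SHA-256 bucketing into 100 buckets are all assumptions made for illustration, since the summary does not describe the actual fields or hashing scheme. The sketch shows only the general idea: the SDK keeps a small local rule table, readers never take a lock, and an update copies the table, changes the copy, and swaps the reference.

```python
import hashlib
import threading
from types import MappingProxyType


class ThinSplitterSdk:
    """Hypothetical thin SDK: a small local rule table plus a local split function."""

    def __init__(self, rules):
        self._write_lock = threading.Lock()
        # Readers only ever see a complete, read-only table.
        self._rules = MappingProxyType(dict(rules))

    def apply_update(self, changed_rules):
        # Copy-on-write: copy the table, change the copy, then swap the reference.
        with self._write_lock:
            new_rules = dict(self._rules)
            new_rules.update(changed_rules)
            self._rules = MappingProxyType(new_rules)

    def assign(self, experiment_id, client_id):
        # Take one reference so a concurrent swap cannot affect this read.
        rules = self._rules
        rule = rules.get(experiment_id)
        if rule is None or not rule["enabled"]:
            return None
        # Assumed bucketing scheme: hash (experiment, client) into 100 buckets.
        key = f"{experiment_id}:{client_id}".encode("utf-8")
        digest = hashlib.sha256(key).digest()
        bucket = int.from_bytes(digest[:8], "big") % 100
        return "treatment" if bucket < rule["treatment_percent"] else "control"


# Usage with made-up experiment and client identifiers.
sdk = ThinSplitterSdk({"exp_demo": {"enabled": True, "treatment_percent": 50}})
first = sdk.assign("exp_demo", "client-123")
second = sdk.assign("exp_demo", "client-123")  # same inputs, same group
sdk.apply_update({"exp_demo": {"enabled": True, "treatment_percent": 100}})
after_update = sdk.assign("exp_demo", "client-123")
```

Because assignment depends only on the identifiers and the current rule, repeated calls give the same group until a rule update is swapped in, and a reader that started before the swap simply finishes against the old table.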
The backend architecture consists of an SOA service layer, the Ignite‑based distributed cache, and the experiment configuration database, with snapshot and real‑time update services ensuring consistency and low latency.
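The read-write separation that the snapshot service provides can be sketched in a few lines. This is a deliberately simplified stand-in, not the Ignite-based design: plain dictionaries replace the distributed cache and the database, no Apache Ignite API is used, and every class and function name here (ConfigDatabase, SnapshotService, SplitterService, on_config_change) is hypothetical. It shows only one way the three layers could relate: writes land in the configuration store, a snapshot is published from it, and the service layer reads nothing but published snapshots.

```python
import copy


class ConfigDatabase:
    """Stand-in for the experiment configuration database (source of truth)."""

    def __init__(self):
        self.rows = {}
        self.version = 0

    def save(self, experiment_id, row):
        self.rows[experiment_id] = row
        self.version += 1


class SnapshotService:
    """Publishes read-only copies so reads never touch the write path."""

    def __init__(self, database):
        self._database = database
        self._published = (0, {})

    def publish(self):
        # Deep copy so later database writes cannot alter a published snapshot.
        rows = copy.deepcopy(self._database.rows)
        self._published = (self._database.version, rows)

    def read(self, experiment_id):
        version, rows = self._published
        return version, rows.get(experiment_id)


class SplitterService:
    """Stand-in for the SOA service layer: it reads snapshots only."""

    def __init__(self, snapshots):
        self._snapshots = snapshots

    def lookup(self, experiment_id):
        return self._snapshots.read(experiment_id)


def on_config_change(database, snapshots, experiment_id, row):
    # Assumed real-time update path: persist the change, then publish at once.
    database.save(experiment_id, row)
    snapshots.publish()


db = ConfigDatabase()
snapshots = SnapshotService(db)
service = SplitterService(snapshots)

on_config_change(db, snapshots, "exp_demo", {"treatment_percent": 10})
published = service.lookup("exp_demo")

db.save("exp_demo", {"treatment_percent": 90})  # written but not yet published
unpublished = service.lookup("exp_demo")        # still served from version 1

snapshots.publish()
fresh = service.lookup("exp_demo")
```

In this sketch a reader always gets a complete, versioned view, and a write becomes visible only when a snapshot is published. How the real system replicates snapshots across nodes for high availability is not covered by the summary and is left out here.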
Post‑implementation observations note stable performance during disaster‑recovery drills, resolved issues with unique identifier control and snapshot service deployment, and remaining work such as real‑time cache monitoring and alerting.
Ctrip Technology
