How Ctrip Achieved High Availability for Service Mesh Across Multiple Data Centers
To meet large-scale production demands, Ctrip's Cloud Container team designed a multi-data-center Service Mesh architecture that isolates failures through active-active IDCs and multi-cluster deployments and that optimizes the xDS push mechanism, providing high availability, rapid fault recovery, and efficient rollout across hundreds of services.
Background
In recent years Chinese enterprises have rapidly adopted Kubernetes and Service Mesh as part of a cloud-native transformation. Starting in 2019, Ctrip deployed Istio Gateways in limited scenarios to accumulate Service Mesh experience. By mid-2020 the team, working with the framework department, had rolled Service Mesh out to production. It now connects hundreds of applications, and coverage is still expanding.
Service Mesh offers many advantages over traditional micro‑service frameworks, but without guaranteed availability large‑scale failures can erode confidence and damage the brand. Ctrip therefore invested heavily in availability engineering to avoid single points of failure and to align with existing high‑availability practices.
Service Mesh High‑Availability Architecture
IDC‑Level Disaster Recovery – Active‑Active in the Same City
Each internal application consists of multiple Groups, the smallest unit for publishing and traffic routing. A disaster‑recovery (DR) group typically contains two IDC sites. Application Groups are deployed across both sites; when one IDC fails, traffic is switched to the other. Critical applications have multi‑IDC deployments, databases use cross‑IDC primary‑secondary, and the traffic‑routing layer also supports cross‑IDC control.
Intra‑IDC HA – Multi‑Cluster Application Deployment
With large‑scale containerization most applications run on Kubernetes clusters. Instances of the same Group are spread across multiple clusters. Operators in each cluster register instance information with an external registry, ensuring that a control‑plane outage in a single cluster only affects that cluster’s instances.
The design isolates fault domains, cuts fault propagation, and ensures that failures at the IDC level or within a Kubernetes cluster do not cascade, providing a solid foundation for Service Mesh HA.
Service Mesh High‑Availability Design
Fault Scenarios
Data‑plane failure may cause request errors and, at large scale, lead to IDC‑level outages.
Control‑plane failure prevents configuration updates; instance IP changes and routing updates are delayed.
Because the control plane relies on Kubernetes, a Kubernetes control‑plane failure also impacts Service Mesh control.
Goal Definition
IDC‑level fault isolation: limit failures to a single IDC and enable seamless IDC switch‑over.
Support application‑agnostic multi‑cluster deployment, decoupling Service Mesh HA from specific deployment patterns.
Solution Design
Each IDC hosts an independent control plane. Cross-IDC traffic passes through a Gateway: each IDC imports the Gateways of remote IDCs through ServiceEntry resources, so that access to services outside the IDC is consolidated behind one entry point per IDC. Inside an IDC, the control plane runs on a dedicated "Primary" Kubernetes cluster, while application clusters act as "Remote" clusters that share the same control plane.
Availability Analysis
If an IDC fails, traffic is switched to another IDC, preserving the existing architecture.
When the data plane fails, Envoy’s FailOver shifts traffic to other IDC Gateways, minimizing impact.
If the Primary control plane fails, the data plane can still serve via other IDC Gateways; configuration updates may be delayed but service continuity is maintained.
Remote cluster failures affect only HPA and deployment, not the Service Mesh layer.
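The data-plane failover described above can be pictured as priority-based endpoint selection: healthy endpoints in the local IDC are preferred, and a remote IDC Gateway is used only when none remain. The following is a minimal illustrative sketch in Go, not Envoy's implementation (Envoy shifts traffic gradually based on the share of healthy hosts). The `Endpoint` type, the `pickEndpoints` function, and the addresses are hypothetical.

```go
package main

import "fmt"

// Endpoint is a hypothetical upstream target. Priority 0 means "local IDC";
// higher values stand for remote IDC Gateways that are used only for failover.
type Endpoint struct {
	Addr     string
	Priority int
	Healthy  bool
}

// pickEndpoints returns the healthy endpoints of the lowest priority level
// that still has at least one healthy member, or nil when nothing is healthy.
func pickEndpoints(eps []Endpoint) []Endpoint {
	best := -1
	for _, ep := range eps {
		if ep.Healthy && (best == -1 || ep.Priority < best) {
			best = ep.Priority
		}
	}
	if best == -1 {
		return nil
	}
	var out []Endpoint
	for _, ep := range eps {
		if ep.Healthy && ep.Priority == best {
			out = append(out, ep)
		}
	}
	return out
}

func main() {
	eps := []Endpoint{
		{Addr: "10.0.0.1:8080", Priority: 0, Healthy: false},
		{Addr: "10.0.0.2:8080", Priority: 0, Healthy: false},
		{Addr: "gateway.idc-b.example:15443", Priority: 1, Healthy: true},
	}
	// Both local endpoints are down, so the remote Gateway is selected.
	for _, ep := range pickEndpoints(eps) {
		fmt.Println("routing to", ep.Addr)
	}
}
```

Once a local endpoint becomes healthy again, the same function returns only local endpoints, which corresponds to traffic moving back from the remote Gateway.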
Fault Drills
Simulate an IDC‑wide service outage, verify data‑plane FailOver, and measure success rate and latency.
Analyze drill results for hidden issues such as routing loops across IDC Gateways.
Enhancing Service Mesh Intrinsic Availability
Scenarios & Goals
The control plane must handle a massive number of data-plane connections and push configurations quickly.
Rapid recovery from node or service failures.
Support fast canary releases and rollbacks for both control‑plane and data‑plane.
xDS Push Metrics
The end‑to‑end push latency is measured from the moment the control plane receives an event to the moment the data plane acknowledges it. This metric enables targeted optimization of the configuration‑delivery pipeline.
Reliable xDS Pushes
During rolling updates, instance IP changes may not be pushed before the instance is deleted, causing data‑plane access errors. Ctrip uses a WorkloadEntry finalizer that keeps the Pod alive until the control plane confirms the deletion has been propagated to the data plane.
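The finalizer idea can be sketched without any Kubernetes dependency: an object marked for deletion is only removed once its finalizer list is empty, and the finalizer is dropped only after the removal has been confirmed on the data plane. The types, the finalizer name, and the `removalAcked` flag below are hypothetical simplifications, not Ctrip's operator code.

```go
package main

import "fmt"

// pushFinalizer is a hypothetical finalizer name used for illustration.
const pushFinalizer = "example.com/xds-push-confirmed"

// workload is a simplified stand-in for a WorkloadEntry and its Pod.
type workload struct {
	Name       string
	Finalizers []string
	Deleting   bool // true once deletion has been requested
}

// canBeRemoved mirrors the Kubernetes rule that an object marked for
// deletion is only removed once its finalizer list is empty.
func canBeRemoved(w *workload) bool {
	return w.Deleting && len(w.Finalizers) == 0
}

// reconcile drops the finalizer only after the control plane reports that
// the endpoint removal has reached the data plane.
func reconcile(w *workload, removalAcked bool) {
	if !w.Deleting || !removalAcked {
		return
	}
	kept := w.Finalizers[:0]
	for _, f := range w.Finalizers {
		if f != pushFinalizer {
			kept = append(kept, f)
		}
	}
	w.Finalizers = kept
}

func main() {
	w := &workload{Name: "app-1", Finalizers: []string{pushFinalizer}}
	w.Deleting = true

	reconcile(w, false) // push not yet confirmed: the instance must stay
	fmt.Println("removable before ack:", canBeRemoved(w))

	reconcile(w, true) // push confirmed: the finalizer is released
	fmt.Println("removable after ack:", canBeRemoved(w))
}
```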
ServiceEntry/WorkloadEntry Event Handling Optimization
In namespaces with 5,000 ServiceEntry and 10,000 WorkloadEntry objects, the original maybeRefreshIndexes method performed up to 50 million label‑matching operations, leading to minute‑level processing delays.
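func (s *ServiceEntryStore) maybeRefreshIndexes() {
	// ...
	// Full rebuild: list every WorkloadEntry in every namespace.
	wles, err := s.store.List(gvk.WorkloadEntry, model.NamespaceAll) // 10000
	for _, wcfg := range wles {
		// ... (elided in the source; wle, the WorkloadEntry for wcfg, is presumably obtained here)
		// Only ServiceEntry objects in the same namespace are candidates.
		entries := seWithSelectorByNamespace[wcfg.Namespace]
		for _, se := range entries { // 5000
			workloadLabels := labels.Collection{wle.Labels}
			// Skip ServiceEntry objects whose selector does not match this workload.
			if !workloadLabels.IsSupersetOf(se.entry.WorkloadSelector.Labels) { // 5000 * 10000
				continue
			}
		}
	}
}

Switching to an incremental approach, which iterates only over the ServiceEntry objects in the namespace affected by an event, reduces the loop count by four orders of magnitude (on the order of 5,000 comparisons per event instead of 50 million) and brings processing time down to milliseconds.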
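The incremental approach can be sketched as follows: instead of rebuilding every index, a single WorkloadEntry event is matched only against the ServiceEntry selectors in its own namespace. This is an illustrative reconstruction, not the actual Istio change. The `labelSet`, `serviceEntry`, and `index` types and the `matchOne` method are hypothetical.

```go
package main

import "fmt"

// labelSet is a hypothetical stand-in for a set of Kubernetes-style labels.
type labelSet map[string]string

// isSupersetOf reports whether l contains every key/value pair in sel.
func (l labelSet) isSupersetOf(sel labelSet) bool {
	for k, v := range sel {
		if got, ok := l[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// serviceEntry keeps only the fields needed for selector matching.
type serviceEntry struct {
	Name     string
	Selector labelSet
}

// index maps a namespace to the ServiceEntry objects that carry a selector.
type index map[string][]serviceEntry

// matchOne handles one WorkloadEntry event. Its cost is bounded by the number
// of ServiceEntry objects in that namespace, not by the total object count.
func (idx index) matchOne(namespace string, workloadLabels labelSet) []string {
	var matched []string
	for _, se := range idx[namespace] {
		if workloadLabels.isSupersetOf(se.Selector) {
			matched = append(matched, se.Name)
		}
	}
	return matched
}

func main() {
	idx := index{
		"orders": {
			{Name: "orders-v1", Selector: labelSet{"app": "orders", "version": "v1"}},
			{Name: "orders-any", Selector: labelSet{"app": "orders"}},
		},
		"billing": {
			{Name: "billing", Selector: labelSet{"app": "billing"}},
		},
	}
	// Only the two ServiceEntry objects in "orders" are examined.
	event := labelSet{"app": "orders", "version": "v2"}
	fmt.Println(idx.matchOne("orders", event))
}
```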
Control‑Plane Cold Start Improvements
After the control plane becomes ready, pending events in internal queues can delay configuration delivery to the data plane. Ctrip now blocks the ready signal until the queue is drained, ensuring immediate processing of new events.
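The idea of holding back readiness until the start-up backlog is processed can be sketched with a small queue. This is a simplified, single-goroutine illustration under assumed names (`eventQueue`, `server`, `start`), not Istio's internal queue or readiness probe.

```go
package main

import (
	"fmt"
	"sync"
)

// eventQueue is a hypothetical stand-in for the control plane's internal queue.
type eventQueue struct {
	mu      sync.Mutex
	pending []func()
}

// push appends a task to the queue.
func (q *eventQueue) push(task func()) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.pending = append(q.pending, task)
}

// drain runs queued tasks until the queue is empty and returns how many ran.
func (q *eventQueue) drain() int {
	n := 0
	for {
		q.mu.Lock()
		if len(q.pending) == 0 {
			q.mu.Unlock()
			return n
		}
		task := q.pending[0]
		q.pending = q.pending[1:]
		q.mu.Unlock()
		task() // run outside the lock so tasks may enqueue follow-up work
		n++
	}
}

// server reports ready only after the start-up backlog has been processed.
type server struct {
	queue *eventQueue
	ready bool
}

// start drains the backlog first and flips the ready flag afterwards.
func (s *server) start() {
	s.queue.drain()
	s.ready = true
}

func main() {
	q := &eventQueue{}
	for i := 1; i <= 3; i++ {
		i := i
		q.push(func() { fmt.Println("processed backlog event", i) })
	}
	s := &server{queue: q}
	s.start()
	fmt.Println("ready:", s.ready)
}
```

Because `ready` is set only after `drain` returns, events that arrive after readiness do not wait behind the start-up backlog.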
A related PR was merged into the Istio project to add a DiscoveryNamespacesFilter that skips irrelevant namespaces during discovery: https://github.com/istio/istio/pull/36628
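func (c *Controller) SyncAll() error {
	c.beginSync.Store(true)
	var err *multierror.Error
	// Sync the discovery-namespace filter together with the other resources.
	err = multierror.Append(err, c.syncDiscoveryNamespaces())
	err = multierror.Append(err, c.syncSystemNamespace())
	err = multierror.Append(err, c.syncNodes())
	err = multierror.Append(err, c.syncServices())
	// multierror.Append always returns a non-nil *Error, so return
	// ErrorOrNil() to report nil when no step actually failed.
	return err.ErrorOrNil()
}

func (c *Controller) syncDiscoveryNamespaces() error {
	var err error
	// The filter is only synced when a namespace lister is configured.
	if c.nsLister != nil {
		err = c.opts.DiscoveryNamespacesFilter.SyncNamespaces()
	}
	return err
}

Canary and Gradual Release Strategy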
Deploy a separate canary control‑plane instance within the same cluster.
Adjust sidecar injection to route a subset of traffic to the canary control plane.
Automate validation against predefined test scenarios.
Gradually scale the canary control plane, monitor metrics, and shift traffic.
If issues arise, roll back by scaling down the canary and reverting traffic to the stable control plane.
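The staged rollout with rollback described in the steps above can be sketched as a loop over traffic shares with a health gate at each step. The step values, the `healthCheck` callback, and the `rollout` function are hypothetical; a real rollout would base the gate on the monitored metrics and the automated test scenarios.

```go
package main

import "fmt"

// healthCheck reports whether the canary control plane looks healthy while
// serving the given percentage of sidecars. It is a hypothetical hook.
type healthCheck func(percent int) bool

// rollout walks through the steps in order. It returns the share served by
// the canary at the end: the last step on success, or 0 after a rollback.
func rollout(steps []int, healthy healthCheck) (final int, rolledBack bool) {
	for _, p := range steps {
		if !healthy(p) {
			// Revert everything to the stable control plane.
			return 0, true
		}
		final = p
	}
	return final, false
}

func main() {
	steps := []int{5, 25, 50, 100}

	final, rolledBack := rollout(steps, func(int) bool { return true })
	fmt.Println("healthy canary:", final, rolledBack)

	// Simulate a regression that only shows up at 50% of sidecars.
	final, rolledBack = rollout(steps, func(p int) bool { return p < 50 })
	fmt.Println("failing canary:", final, rolledBack)
}
```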
Future Outlook
Service Mesh will continue to play a pivotal role in traffic management, offering strong extensibility and a unified model for heterogeneous, multi‑language systems. Ctrip plans to further invest in observability, improve xDS push performance, support larger‑scale deployments, and deepen collaboration with the open‑source community.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
