
How Ctrip Achieved High Availability for Service Mesh Across Multiple Data Centers

Facing large-scale production demands, Ctrip's Cloud Container team designed a multi-data-center Service Mesh architecture that isolates failures through active-active IDCs, multi-cluster deployments, and optimized xDS push mechanisms, delivering high availability, rapid fault recovery, and efficient rollout across hundreds of services.

ITPUB

Jan 11, 2022


Background

In recent years Chinese enterprises have rapidly adopted Kubernetes and Service Mesh as part of a broader cloud-native transformation. Ctrip began deploying Istio Gateways in limited scenarios in 2019 to build up Service Mesh experience. By mid-2020 the team, working with the framework department, had rolled Service Mesh out to production; hundreds of applications are now connected and coverage continues to expand.

Service Mesh offers many advantages over traditional micro‑service frameworks, but without guaranteed availability large‑scale failures can erode confidence and damage the brand. Ctrip therefore invested heavily in availability engineering to avoid single points of failure and to align with existing high‑availability practices.

Service Mesh High‑Availability Architecture

IDC‑Level Disaster Recovery – Active‑Active in the Same City

Each internal application consists of multiple Groups, the smallest unit for publishing and traffic routing. A disaster‑recovery (DR) group typically contains two IDC sites. Application Groups are deployed across both sites; when one IDC fails, traffic is switched to the other. Critical applications have multi‑IDC deployments, databases use cross‑IDC primary‑secondary, and the traffic‑routing layer also supports cross‑IDC control.

Intra‑IDC HA – Multi‑Cluster Application Deployment

With large‑scale containerization most applications run on Kubernetes clusters. Instances of the same Group are spread across multiple clusters. Operators in each cluster register instance information with an external registry, ensuring that a control‑plane outage in a single cluster only affects that cluster’s instances.

The design isolates fault domains, cuts fault propagation, and ensures that failures at the IDC level or within a Kubernetes cluster do not cascade, providing a solid foundation for Service Mesh HA.

Service Mesh High‑Availability Design

Fault Scenarios

Data‑plane failure may cause request errors and, at large scale, lead to IDC‑level outages.

Control‑plane failure prevents configuration updates; instance IP changes and routing updates are delayed.

Because the control plane relies on Kubernetes, a Kubernetes control‑plane failure also impacts Service Mesh control.

Goal Definition

IDC‑level fault isolation: limit failures to a single IDC and enable seamless IDC switch‑over.

Support application‑agnostic multi‑cluster deployment, decoupling Service Mesh HA from specific deployment patterns.

Solution Design

Each IDC hosts an independent control plane. Cross‑IDC traffic passes through a Gateway; ServiceEntry imports remote Gateways, consolidating external service entry per IDC. Inside an IDC, the control plane runs on a dedicated “Primary” Kubernetes cluster, while application clusters act as “Remote” clusters sharing the same control plane.

Service Mesh control‑plane isolation diagram

Availability Analysis

If an IDC fails, traffic is switched to another IDC, preserving the existing architecture.

When the data plane fails, Envoy’s FailOver shifts traffic to other IDC Gateways, minimizing impact.

If the Primary control plane fails, the data plane can still serve via other IDC Gateways; configuration updates may be delayed but service continuity is maintained.

Remote cluster failures affect only HPA and deployment, not the Service Mesh layer.
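
The data-plane FailOver case above can be illustrated with a minimal sketch. It assumes a simplified all-or-nothing rule (prefer healthy endpoints in the local IDC, otherwise fall back to healthy endpoints such as Gateways in other IDCs). Envoy's real failover is priority-based and shifts traffic gradually, so this is only an approximation of the idea; the `Endpoint` type, `pickEndpoints`, and all addresses are hypothetical.

```go
package main

import "fmt"

// Endpoint is a hypothetical, simplified view of an upstream target.
type Endpoint struct {
	Addr    string
	IDC     string
	Healthy bool
}

// pickEndpoints returns the healthy endpoints in the local IDC. If none are
// healthy, it falls back to healthy endpoints in any other IDC.
func pickEndpoints(localIDC string, all []Endpoint) []Endpoint {
	var local, remote []Endpoint
	for _, ep := range all {
		if !ep.Healthy {
			continue
		}
		if ep.IDC == localIDC {
			local = append(local, ep)
		} else {
			remote = append(remote, ep)
		}
	}
	if len(local) > 0 {
		return local
	}
	return remote
}

func main() {
	eps := []Endpoint{
		{Addr: "10.0.0.1:8080", IDC: "idc-a", Healthy: false},
		{Addr: "gw.idc-b.example:443", IDC: "idc-b", Healthy: true},
	}
	// The only local endpoint is unhealthy, so the remote Gateway is chosen.
	for _, ep := range pickEndpoints("idc-a", eps) {
		fmt.Println(ep.Addr)
	}
}
```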

Fault Drills

Simulate an IDC‑wide service outage, verify data‑plane FailOver, and measure success rate and latency.

Analyze drill results for hidden issues such as routing loops across IDC Gateways.

Enhancing Service Mesh Intrinsic Availability

Scenarios & Goals

The control plane must handle a massive number of data-plane connections and push configurations quickly.

Rapid recovery from node or service failures.

Support fast canary releases and rollbacks for both control‑plane and data‑plane.

xDS Push Metrics

The end‑to‑end push latency is measured from the moment the control plane receives an event to the moment the data plane acknowledges it. This metric enables targeted optimization of the configuration‑delivery pipeline.
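
One way to picture this metric is a small tracker keyed by configuration version. The sketch below is an assumption about how such a measurement could be wired, not Ctrip's or Istio's implementation; `PushTracker`, its methods, and the version string `v42` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// PushTracker remembers when the control plane first received the event that
// produced a config version, and reports the latency once a proxy ACKs it.
type PushTracker struct {
	mu       sync.Mutex
	received map[string]time.Time // config version -> event receive time
}

func NewPushTracker() *PushTracker {
	return &PushTracker{received: make(map[string]time.Time)}
}

// EventReceived starts the measurement. The first receipt for a version wins.
func (t *PushTracker) EventReceived(version string, at time.Time) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, ok := t.received[version]; !ok {
		t.received[version] = at
	}
}

// Acked returns the end-to-end latency for a version, or false if the
// version was never recorded.
func (t *PushTracker) Acked(version string, at time.Time) (time.Duration, bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	start, ok := t.received[version]
	if !ok {
		return 0, false
	}
	return at.Sub(start), true
}

// exampleLatency simulates an event that is acknowledged 150ms later.
func exampleLatency() time.Duration {
	t := NewPushTracker()
	start := time.Now()
	t.EventReceived("v42", start)
	d, _ := t.Acked("v42", start.Add(150*time.Millisecond))
	return d
}

func main() {
	fmt.Println("push latency:", exampleLatency())
}
```

In a real deployment the measured durations would typically feed a histogram so that tail latency, not just the average, is visible.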

Reliable xDS Pushes

During rolling updates, instance IP changes may not be pushed before the instance is deleted, causing data‑plane access errors. Ctrip uses a WorkloadEntry finalizer that keeps the Pod alive until the control plane confirms the deletion has been propagated to the data plane.
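
The finalizer idea can be sketched without any Kubernetes dependencies. The example assumes a reconcile loop that removes the finalizer only after a confirmation callback reports that the endpoint removal has reached the data plane. The finalizer name, the `WorkloadEntry` struct, and `pushConfirmed` are hypothetical stand-ins, since the source does not describe the actual mechanism in detail.

```go
package main

import "fmt"

// pushFinalizer is a hypothetical finalizer name used only in this sketch.
const pushFinalizer = "example.com/xds-push-confirmed"

// WorkloadEntry is a simplified stand-in for the real resource.
type WorkloadEntry struct {
	Name       string
	Finalizers []string
	Deleting   bool // stands in for a set deletionTimestamp
}

// reconcileDeletion removes the finalizer only once the control plane has
// confirmed the push. It returns true when no finalizers remain, that is,
// when the object (and the instance behind it) may actually be removed.
func reconcileDeletion(we *WorkloadEntry, pushConfirmed func(name string) bool) bool {
	if !we.Deleting {
		return false
	}
	if !pushConfirmed(we.Name) {
		return false // keep the finalizer and retry on a later reconcile
	}
	kept := we.Finalizers[:0]
	for _, f := range we.Finalizers {
		if f != pushFinalizer {
			kept = append(kept, f)
		}
	}
	we.Finalizers = kept
	return len(we.Finalizers) == 0
}

func main() {
	we := &WorkloadEntry{Name: "app-1", Finalizers: []string{pushFinalizer}, Deleting: true}
	fmt.Println(reconcileDeletion(we, func(string) bool { return false })) // push not confirmed yet
	fmt.Println(reconcileDeletion(we, func(string) bool { return true }))  // push confirmed
}
```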

ServiceEntry/WorkloadEntry Event Handling Optimization

In namespaces with 5,000 ServiceEntry and 10,000 WorkloadEntry objects, the original maybeRefreshIndexes method performed up to 50 million label‑matching operations, leading to minute‑level processing delays.

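// maybeRefreshIndexes rebuilds the ServiceEntry/WorkloadEntry index from scratch.
// Each refresh compares every WorkloadEntry against every selector-based
// ServiceEntry in the same namespace.
func (s *ServiceEntryStore) maybeRefreshIndexes() {
    // ...
    wles, err := s.store.List(gvk.WorkloadEntry, model.NamespaceAll) // ~10,000 objects
    // ... (error handling elided)
    for _, wcfg := range wles {
        // ... (wle, the WorkloadEntry spec carried by wcfg, is obtained here)
        entries := seWithSelectorByNamespace[wcfg.Namespace]
        for _, se := range entries { // ~5,000 objects
            workloadLabels := labels.Collection{wle.Labels}
            // Up to 5,000 * 10,000 = 50 million label comparisons per refresh.
            if !workloadLabels.IsSupersetOf(se.entry.WorkloadSelector.Labels) {
                continue
            }
            // ...
        }
    }
}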

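Switching to an incremental approach removes the outer loop: when a single WorkloadEntry changes, only the ServiceEntry objects in that WorkloadEntry's namespace are matched against it. In the scenario above that is about 5,000 comparisons per event instead of 50 million, a reduction of four orders of magnitude, which brings processing time down to milliseconds.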
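
The incremental approach can be sketched as follows. This is a simplified model under stated assumptions, not the Istio code: `Labels`, `Index`, `OnWorkloadEvent`, and `Has` are hypothetical names, and the `Comparisons` counter exists only to make the cost per event visible.

```go
package main

import "fmt"

// Labels is a simplified label set.
type Labels map[string]string

// isSuperset reports whether have contains every key/value pair in want.
func isSuperset(have, want Labels) bool {
	for k, v := range want {
		if got, ok := have[k]; !ok || got != v {
			return false
		}
	}
	return true
}

type ServiceEntry struct {
	Name     string
	Selector Labels
}

type Workload struct {
	Namespace string
	Name      string
	Labels    Labels
}

// Index tracks which workloads each ServiceEntry selects.
type Index struct {
	seByNamespace map[string][]ServiceEntry
	members       map[string]map[string]bool // "namespace/serviceEntry" -> workload names
	Comparisons   int                        // label comparisons performed so far
}

func NewIndex(seByNamespace map[string][]ServiceEntry) *Index {
	return &Index{seByNamespace: seByNamespace, members: map[string]map[string]bool{}}
}

// OnWorkloadEvent updates the index for one changed workload. Only the
// ServiceEntry objects in that workload's namespace are visited.
func (ix *Index) OnWorkloadEvent(w Workload) {
	for _, se := range ix.seByNamespace[w.Namespace] {
		ix.Comparisons++
		key := w.Namespace + "/" + se.Name
		if !isSuperset(w.Labels, se.Selector) {
			delete(ix.members[key], w.Name) // no longer selected, if it ever was
			continue
		}
		if ix.members[key] == nil {
			ix.members[key] = map[string]bool{}
		}
		ix.members[key][w.Name] = true
	}
}

// Has reports whether the ServiceEntry currently selects the workload.
func (ix *Index) Has(namespace, se, workload string) bool {
	return ix.members[namespace+"/"+se][workload]
}

func main() {
	ix := NewIndex(map[string][]ServiceEntry{
		"ns1": {{Name: "se-a", Selector: Labels{"app": "a"}}, {Name: "se-b", Selector: Labels{"app": "b"}}},
		"ns2": {{Name: "se-c", Selector: Labels{"app": "a"}}},
	})
	ix.OnWorkloadEvent(Workload{Namespace: "ns1", Name: "w1", Labels: Labels{"app": "a"}})
	// Two comparisons: only the two ServiceEntry objects in ns1 were visited.
	fmt.Println(ix.Comparisons, ix.Has("ns1", "se-a", "w1"), ix.Has("ns1", "se-b", "w1"))
}
```

The cost of one event is bounded by the number of ServiceEntry objects in a single namespace, regardless of how many WorkloadEntry objects exist in total.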

Control‑Plane Cold Start Improvements

After the control plane becomes ready, pending events in internal queues can delay configuration delivery to the data plane. Ctrip now blocks the ready signal until the queue is drained, ensuring immediate processing of new events.
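
A minimal sketch of this gating, assuming a single in-memory backlog and a readiness flag that latches once set. `Gate` and its methods are hypothetical names; a real control plane has several queues and caches to wait on.

```go
package main

import (
	"fmt"
	"sync"
)

// Gate holds back the ready signal until the initial sync has finished and
// the backlog of pending events has been fully processed.
type Gate struct {
	mu      sync.Mutex
	pending []string
	synced  bool // initial sync finished
	ready   bool // latched once the backlog has drained after sync
}

// Enqueue adds an event to the backlog.
func (g *Gate) Enqueue(ev string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.pending = append(g.pending, ev)
}

// MarkSynced records that the initial sync is complete.
func (g *Gate) MarkSynced() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.synced = true
}

// ProcessOne handles the oldest pending event and reports whether there was one.
func (g *Gate) ProcessOne() bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if len(g.pending) == 0 {
		return false
	}
	g.pending = g.pending[1:]
	return true
}

// Ready turns true only after sync is done and the backlog is empty. Once
// true it stays true, so normal traffic later does not flip readiness.
func (g *Gate) Ready() bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if !g.ready && g.synced && len(g.pending) == 0 {
		g.ready = true
	}
	return g.ready
}

func main() {
	g := &Gate{}
	g.Enqueue("add service a")
	g.Enqueue("add service b")
	g.MarkSynced()
	fmt.Println(g.Ready()) // backlog still holds two events
	for g.ProcessOne() {
	}
	fmt.Println(g.Ready()) // backlog drained
}
```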

A related PR, merged into the Istio project, adds a DiscoveryNamespacesFilter that skips irrelevant namespaces during discovery: https://github.com/istio/istio/pull/36628

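// SyncAll runs every initial sync step and aggregates their errors.
func (c *Controller) SyncAll() error {
    c.beginSync.Store(true)
    var err *multierror.Error
    // Namespaces are synced first so that later steps see the filtered set.
    err = multierror.Append(err, c.syncDiscoveryNamespaces())
    err = multierror.Append(err, c.syncSystemNamespace())
    err = multierror.Append(err, c.syncNodes())
    err = multierror.Append(err, c.syncServices())
    // Returning the *multierror.Error directly would yield a non-nil error
    // interface even when no step failed, so convert it with ErrorOrNil.
    return err.ErrorOrNil()
}

// syncDiscoveryNamespaces refreshes the namespace filter when a namespace
// lister is configured; otherwise it is a no-op.
func (c *Controller) syncDiscoveryNamespaces() error {
    var err error
    if c.nsLister != nil {
        err = c.opts.DiscoveryNamespacesFilter.SyncNamespaces()
    }
    return err
}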

Canary and Gradual Release Strategy

Deploy a separate canary control‑plane instance within the same cluster.

Adjust sidecar injection to route a subset of traffic to the canary control plane.

Automate validation against predefined test scenarios.

Gradually scale the canary control plane, monitor metrics, and shift traffic.

If issues arise, roll back by scaling down the canary and reverting traffic to the stable control plane.
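
The injection step in the strategy above can be sketched as a deterministic percentage split. The hash-bucket rule below is an assumption made for illustration; the source does not say how Ctrip selects which workloads attach to the canary control plane. `pickControlPlane` and the workload names are hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickControlPlane assigns a workload to the "canary" or "stable" control
// plane. The choice is deterministic: the same workload name and percentage
// always give the same answer, so restarts do not reshuffle assignments.
func pickControlPlane(workload string, canaryPercent uint32) string {
	h := fnv.New32a()
	_, _ = h.Write([]byte(workload)) // Write on an FNV hash never returns an error
	if h.Sum32()%100 < canaryPercent {
		return "canary"
	}
	return "stable"
}

func main() {
	workloads := []string{"order-service", "search-service", "payment-service"}
	for _, percent := range []uint32{0, 10, 100} {
		for _, w := range workloads {
			fmt.Printf("%3d%% %-16s -> %s\n", percent, w, pickControlPlane(w, percent))
		}
	}
}
```

Under this rule, rolling back amounts to setting the percentage to 0, which returns every workload to the stable control plane at its next injection.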

Future Outlook

Service Mesh will continue to play a pivotal role in traffic management, offering strong extensibility and a unified model for heterogeneous, multi‑language systems. Ctrip plans to further invest in observability, improve xDS push performance, support larger‑scale deployments, and deepen collaboration with the open‑source community.
