
High‑Availability Architecture and Performance Optimizations for Service Mesh at Ctrip

This article describes Ctrip's cloud‑native Service Mesh deployment, detailing its multi‑IDC high‑availability design, fault‑scenario analysis, xDS push metrics, event‑handling optimizations, cold‑start improvements, and progressive canary release strategies to ensure reliable, scalable service traffic management.

Ctrip Technology

Background – Over the past few years Ctrip has been rolling out Kubernetes and Service Mesh at large scale, initially adopting Istio Gateway in 2019 and expanding to hundreds of production applications by 2020. Ensuring high availability is critical because any major outage would erode confidence in the new technology.

High‑Availability Architecture

2.1 IDC‑level High Availability – Ctrip uses a dual‑city active‑active deployment where each application group spans two data centers (DR groups). Core services run across multiple IDC sites, databases are replicated in master‑slave mode, and traffic routing supports cross‑IDC control.

2.2 In‑IDC High Availability – Multi‑Cluster Deployment – Applications run on many Kubernetes clusters; instances of the same group are distributed across clusters. Each cluster registers its instances with a central service registry, so the failure of a single Kubernetes control plane only affects that cluster.

Service Mesh High‑Availability Design

Fault scenarios identified include data‑plane failures, control‑plane failures, and underlying Kubernetes control‑plane failures. Goals are to isolate failures at the data‑center level, support multi‑cluster deployments, and keep traffic flowing via Envoy fail‑over.

The solution isolates each data center with its own control plane and uses a gateway‑based entry point. The primary control plane runs in a dedicated Kubernetes cluster, while remote clusters share the same mesh control plane, allowing flexible multi‑cluster topologies.

Availability Analysis – When a data center fails, traffic is switched to another data center. Data‑plane failures trigger Envoy fail‑over to other gateways, minimizing impact. If the primary control plane fails, the data plane can still serve traffic via other control planes, and configuration updates are handled gracefully.

Fault Drills – Simulated outages verify that traffic fail‑over behaves as expected and that no routing loops occur.

Improving Service Mesh Reliability

To enhance observability, Ctrip defined xDS push latency metrics by measuring the time from event receipt on the control plane to acknowledgment from the data plane.
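The idea behind the metric can be sketched as follows, assuming the control plane tags each push with an xDS nonce and the proxy echoes that nonce back in its ACK. The `PushTracker` type and its method names are illustrative, not Istio's actual code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// PushTracker records when a config push left the control plane,
// keyed by xDS nonce, and computes latency when the ACK arrives.
type PushTracker struct {
	mu      sync.Mutex
	pending map[string]time.Time
}

func NewPushTracker() *PushTracker {
	return &PushTracker{pending: make(map[string]time.Time)}
}

// OnPush is called when the control plane sends a response with this nonce.
func (t *PushTracker) OnPush(nonce string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.pending[nonce] = time.Now()
}

// OnAck is called when the data plane acknowledges the nonce; it returns
// the push-to-ack latency, or ok=false for an unknown or stale nonce.
func (t *PushTracker) OnAck(nonce string) (time.Duration, bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	start, ok := t.pending[nonce]
	if !ok {
		return 0, false
	}
	delete(t.pending, nonce)
	return time.Since(start), true
}

func main() {
	tr := NewPushTracker()
	tr.OnPush("nonce-42")
	time.Sleep(5 * time.Millisecond)
	if d, ok := tr.OnAck("nonce-42"); ok {
		fmt.Printf("push latency: %v\n", d)
	}
}
```

In practice the observed latency would be exported as a histogram so that slow pushes to individual proxies become visible rather than being averaged away.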

Reliability of configuration pushes is ensured by using WorkloadEntry finalizers and a ledger that tracks generation numbers and nonces, guaranteeing that instance deletions are fully propagated before pods are terminated.
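One way to picture the ledger is as a per-generation record of which proxies have acknowledged a configuration change; the WorkloadEntry finalizer is dropped, and the pod allowed to terminate, only once the generation that removes its instance is fully acknowledged. This is a minimal sketch under that assumption — the `AckLedger` type is hypothetical, while Istio's actual ledger works with version hashes and nonces:

```go
package main

import "fmt"

// AckLedger tracks, per config generation, which proxies have ACKed it.
type AckLedger struct {
	proxies map[string]bool          // all proxies expected to ACK
	acked   map[int64]map[string]bool // generation -> proxies that ACKed
}

func NewAckLedger(proxies ...string) *AckLedger {
	l := &AckLedger{proxies: map[string]bool{}, acked: map[int64]map[string]bool{}}
	for _, p := range proxies {
		l.proxies[p] = true
	}
	return l
}

// Ack records that a proxy has acknowledged a generation.
func (l *AckLedger) Ack(gen int64, proxy string) {
	if l.acked[gen] == nil {
		l.acked[gen] = map[string]bool{}
	}
	l.acked[gen][proxy] = true
}

// FullyPropagated reports whether every proxy has ACKed the generation,
// i.e. whether it is safe to remove the WorkloadEntry finalizer.
func (l *AckLedger) FullyPropagated(gen int64) bool {
	for p := range l.proxies {
		if !l.acked[gen][p] {
			return false
		}
	}
	return true
}

func main() {
	l := NewAckLedger("envoy-a", "envoy-b")
	l.Ack(7, "envoy-a")
	fmt.Println(l.FullyPropagated(7)) // false: envoy-b has not ACKed yet
	l.Ack(7, "envoy-b")
	fmt.Println(l.FullyPropagated(7)) // true: safe to drop the finalizer
}
```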

Event‑Handling Performance Optimizations

Processing 5,000 ServiceEntry and 10,000 WorkloadEntry objects caused minute‑level delays due to a full‑scan algorithm. The optimized approach limits scans to the affected namespace, reducing the loop from 50 million checks to a few hundred, achieving millisecond‑level processing.

Code snippet illustrating the original full‑scan implementation:

func (s *ServiceEntryStore) maybeRefreshIndexes() {
    ...
    // list every WorkloadEntry in the mesh (~10,000 objects)
    wles, err := s.store.List(gvk.WorkloadEntry, model.NamespaceAll)
    for _, wcfg := range wles {
        ...
        // wle is the WorkloadEntry spec extracted from wcfg
        entries := seWithSelectorByNamespace[wcfg.Namespace]
        for _, se := range entries { // ~5,000 ServiceEntries
            workloadLabels := labels.Collection{wle.Labels}
            // this comparison runs up to 5,000 × 10,000 = 50 million times
            if !workloadLabels.IsSupersetOf(se.entry.WorkloadSelector.Labels) {
                continue
            }
        }
    }
}
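The optimized path can be sketched as follows, assuming the function and type names below (they are illustrative, not Istio's actual identifiers): when a single WorkloadEntry changes, only the ServiceEntries in its own namespace are matched, instead of rescanning the whole mesh on every event:

```go
package main

import "fmt"

// labelsSupersetOf reports whether the workload's labels satisfy a selector.
func labelsSupersetOf(workload, selector map[string]string) bool {
	for k, v := range selector {
		if workload[k] != v {
			return false
		}
	}
	return true
}

type serviceEntry struct {
	name     string
	selector map[string]string
}

// matchChangedWorkload scans only the changed entry's namespace: with a
// handful of ServiceEntries per namespace this is a few hundred label
// checks per event instead of 5,000 × 10,000 across the mesh.
func matchChangedWorkload(
	seByNamespace map[string][]serviceEntry,
	namespace string,
	workloadLabels map[string]string,
) []string {
	var matched []string
	for _, se := range seByNamespace[namespace] {
		if labelsSupersetOf(workloadLabels, se.selector) {
			matched = append(matched, se.name)
		}
	}
	return matched
}

func main() {
	index := map[string][]serviceEntry{
		"payments": {
			{name: "pay-v1", selector: map[string]string{"app": "pay"}},
			{name: "audit", selector: map[string]string{"app": "audit"}},
		},
	}
	// only "pay-v1" matches the changed workload's labels
	fmt.Println(matchChangedWorkload(index, "payments", map[string]string{"app": "pay"}))
}
```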

Further improvements replace the global lock with segmented, `sync.Map`-style per-shard locks and increase `MaxConcurrentReconciles` to enable concurrent event processing.

Cold‑start enhancements delay the control‑plane ready state until internal queues are drained, and a DiscoveryNamespacesFilter skips irrelevant namespaces, reducing start‑up latency.

Code snippet showing the added cold‑start sync method:

func (c *Controller) SyncAll() error {
    // mark that the initial sync has begun so readiness can be deferred
    c.beginSync.Store(true)
    var err *multierror.Error
    err = multierror.Append(err, c.syncDiscoveryNamespaces())
    err = multierror.Append(err, c.syncSystemNamespace())
    err = multierror.Append(err, c.syncNodes())
    err = multierror.Append(err, c.syncServices())
    // ErrorOrNil avoids returning a non-nil interface that wraps a nil *multierror.Error
    return err.ErrorOrNil()
}

func (c *Controller) syncDiscoveryNamespaces() error {
    var err error
    if c.nsLister != nil {
        err = c.opts.DiscoveryNamespacesFilter.SyncNamespaces()
    }
    return err
}

Canary Release Strategy – Deploy a separate canary control plane, gradually shift sidecars, validate automatically, then expand traffic while retaining the ability to roll back instantly by scaling down the new instances.
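The progressive rollout can be sketched as a gated loop, where step sizes, the migration hook, and the validation check are all illustrative assumptions rather than Ctrip's actual tooling:

```go
package main

import "fmt"

// rollout shifts sidecars to the canary control plane in increasing steps,
// gating each step on a validation check; any failure triggers an
// immediate rollback by shifting everything back off the canary.
func rollout(steps []int, migrate func(percent int) error, validate func() bool) error {
	for _, pct := range steps {
		if err := migrate(pct); err != nil {
			return fmt.Errorf("migrate to %d%% failed: %w", pct, err)
		}
		if !validate() {
			// roll back, e.g. by scaling the canary control plane to zero
			_ = migrate(0)
			return fmt.Errorf("validation failed at %d%%, rolled back", pct)
		}
	}
	return nil
}

func main() {
	current := 0
	err := rollout(
		[]int{1, 5, 25, 100},
		func(pct int) error { current = pct; return nil },
		func() bool { return true }, // e.g. compare error rates old vs. new
	)
	fmt.Println(err, current) // <nil> 100
}
```

Because rollback is just scaling down the new control-plane instances, an aborted step returns the mesh to its previous state without waiting for redeployment.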

Future Outlook – Ctrip will continue investing in observability, xDS performance, large‑scale deployments, and community collaboration to further improve Service Mesh reliability and scalability.
