
How We Slashed Istio xDS Latency from Minutes to Seconds at Scale

This article describes the Istio control-plane performance problems Ctrip encountered at large scale, and explains the systematic methodology and concrete optimizations (an O(n²) to O(n) patching algorithm, Merkle-tree based success-rate metrics, and startup improvements) that reduced xDS push latency from minutes to seconds while improving reliability.

ITPUB

Background

Ctrip had a mature micro‑service governance system based on an SDK model, but global expansion and hybrid‑multi‑cloud scenarios required a standardized, decoupled, and portable infrastructure. Cloud‑native Service Mesh was chosen as the solution, and the Cloud Container & Service team adopted Istio.

During large‑scale rollout, the Istio control plane exhibited severe performance problems: long xDS push latency, opaque results, incomplete monitoring, memory leaks, and other issues that blocked Service Mesh adoption.

Methodology and Goals

2.1 Mapping Core Paths and Scenarios

The control plane must promptly detect changes to Kubernetes resources and propagate them to the relevant Envoy proxies. Key scenarios include:

Whether xDS push latency satisfies requirements as change volume grows.

Whether configuration is delivered successfully to every Envoy, including Envoys connected to different control-plane nodes.

Startup time of the control plane during releases or failure‑restarts.

2.2 Defining SLOs

With 30 k ServiceEntry and 30 k WorkloadEntry objects, keep xDS push latency at P95 < 3 s and P99 < 5 s.

Make Istio configuration delivery success rate measurable.

Reduce service startup time to ≤ 5 min.
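The latency targets above can be checked mechanically against measured samples. The following is a minimal sketch, assuming push latencies are collected as a plain list of seconds. The nearest-rank percentile method and the names `percentile` and `meets_slo` are illustrative assumptions, not part of the original tooling.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

def meets_slo(latencies_s, p95_limit=3.0, p99_limit=5.0):
    """True when both latency targets from the SLO list hold for the given samples."""
    return (percentile(latencies_s, 95) < p95_limit
            and percentile(latencies_s, 99) < p99_limit)
```

Running such a check after every load-test round turns the SLO into a pass/fail signal rather than a number read off a dashboard.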

2.3 Test and Measurement Framework

Automated test tools were built to concurrently create, update, and delete resources such as WorkloadEntry and ServiceEntry. Monitoring records request volume, end-to-end push latency, and error rates, and query APIs make it possible to verify that configuration was delivered.
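The shape of such a load tool can be sketched in a few lines. This is only an illustration under stated assumptions: `FakeCluster` is a hypothetical in-memory stand-in for the Kubernetes API server, and `run_load` is an invented helper name. A real tool would issue requests through a Kubernetes client and measure end-to-end push latency rather than the local call time.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class FakeCluster:
    """Hypothetical in-memory stand-in for the Kubernetes API server."""
    def __init__(self):
        self._lock = threading.Lock()
        self.objects = {}  # (kind, name) -> number of writes seen

    def apply(self, op, kind, name):
        with self._lock:
            key = (kind, name)
            if op == "delete":
                self.objects.pop(key, None)
            else:  # "create" or "update"
                self.objects[key] = self.objects.get(key, 0) + 1

def run_load(cluster, ops, workers=8):
    """Run operations concurrently; return per-operation latencies and the error count."""
    def one(op):
        start = time.perf_counter()
        try:
            cluster.apply(*op)
            return time.perf_counter() - start, None
        except Exception as exc:  # count failures instead of aborting the run
            return time.perf_counter() - start, exc

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(one, ops))
    latencies = [latency for latency, _ in results]
    errors = sum(1 for _, exc in results if exc is not None)
    return latencies, errors
```

The returned latencies and error count correspond to the request volume, latency, and error-rate signals that the monitoring side records.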

Optimization Solutions and Core Implementations

3.1 Reducing xDS Push Latency

Problem analysis: CDS and RDS push latency grew sharply once more than 30 k EnvoyFilters and more than 5 k VirtualServices were present, with P99 approaching minutes. The root cause was an O(n²) algorithm used to apply EnvoyFilter patches.

Algorithm improvement: The nested loops were replaced with hash-map buckets keyed by service, subset, and port. Each cluster now looks up only the patches in its own bucket instead of scanning every EnvoyFilter, so for the common one-to-one EnvoyFilter-to-cluster relationship the patch operation becomes O(n).
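The difference between the two approaches can be illustrated with a simplified sketch. It assumes each patch targets exactly one (service, subset, port) key and is represented as a plain dictionary. The names `Key`, `patch_nested`, and `patch_bucketed` are hypothetical, and real EnvoyFilter matching is richer than an exact key comparison.

```python
from collections import defaultdict
from typing import NamedTuple

class Key(NamedTuple):
    service: str
    subset: str
    port: int

def patch_nested(clusters, patches):
    """Baseline: every cluster scans every patch, O(len(clusters) * len(patches))."""
    out = {}
    for cluster in clusters:
        merged = {}
        for key, patch in patches:
            if key == cluster:
                merged.update(patch)
        out[cluster] = merged
    return out

def patch_bucketed(clusters, patches):
    """Group patches by key once, then do one hash lookup per cluster."""
    buckets = defaultdict(list)
    for key, patch in patches:
        buckets[key].append(patch)
    out = {}
    for cluster in clusters:
        merged = {}
        for patch in buckets.get(cluster, ()):  # only this cluster's patches
            merged.update(patch)
        out[cluster] = merged
    return out
```

Both functions produce the same result. The bucketed version touches each patch once while building the buckets and once while applying it, so the total work is linear in the number of clusters plus patches.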

Other xDS components (RDS, LDS, EDS) can adopt the same bucket‑based approach. Additional optimizations include on‑demand push and gateway data splitting to reduce the number of VirtualServices.

3.2 Merkle‑Tree Based Success‑Rate Measurement

Each push event is linked to a Merkle-tree-derived ledgerVersion (the root hash). An API exposes the ledgerVersion that each Envoy has acknowledged, so the success rate can be computed as the fraction of Envoys that have acknowledged the latest version.
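The idea can be sketched as follows. This is a minimal illustration, assuming a plain binary Merkle tree over an ordered list of serialized configuration objects and a dictionary mapping each Envoy to its acknowledged version. The original does not describe the tree layout or the API shape, so `merkle_root` and `success_rate` are hypothetical names.

```python
import hashlib

def merkle_root(leaves):
    """Root hash over an ordered list of byte strings; an odd node is paired with itself."""
    if not leaves:
        return hashlib.sha256(b"").hexdigest()
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0].hex()

def success_rate(acked_versions, latest):
    """Fraction of proxies whose acknowledged version equals the latest version."""
    if not acked_versions:
        return 0.0
    ok = sum(1 for version in acked_versions.values() if version == latest)
    return ok / len(acked_versions)
```

Because any change to any object changes the root hash, a single string comparison per Envoy is enough to tell whether that Envoy has seen the latest configuration.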

This mechanism also enables safe pod deletion: a finalizer keeps the pod from being removed until all Envoys have acknowledged the change.

3.3 Optimizing Query Latency and Memory Usage

The original Merkle tree stored the hash of every node, which caused memory to grow and occasionally led to OOM. After the change, only nodes at heights divisible by four are persisted and a TTL cache expires stale hashes. Query latency dropped from >10 s to <20 ms and memory usage stabilized.
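Both ideas can be sketched in isolation. The example below assumes a complete binary tree over a power-of-two number of leaves, which is a simplification chosen for illustration. `SparseMerkle` and `TTLCache` are hypothetical names, and the original's actual tree layout and cache policy are not described in the text.

```python
import hashlib
import time

def _h(data):
    return hashlib.sha256(data).digest()

class SparseMerkle:
    """Keeps node hashes only at heights divisible by `stride` (leaves are height 0)."""
    def __init__(self, leaves, stride=4):
        n = len(leaves)
        if n == 0 or n & (n - 1):
            raise ValueError("leaf count must be a power of two")
        self.depth = n.bit_length() - 1
        self.stride = stride
        self.stored = {}  # height -> list of hashes kept at that height
        level = [_h(leaf) for leaf in leaves]
        for height in range(self.depth + 1):
            if height % stride == 0:
                self.stored[height] = level
            if height < self.depth:
                level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]

    def node(self, height, index):
        """Hash of a node, recomputed from the nearest stored level at or below it."""
        base = height - height % self.stride
        width = 1 << (height - base)  # stored nodes covered by this node
        level = self.stored[base][index * width:(index + 1) * width]
        while len(level) > 1:
            level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        return level[0]

    def root(self):
        return self.node(self.depth, 0)

class TTLCache:
    """Entries expire ttl_s seconds after insertion; expired entries are dropped on read."""
    def __init__(self, ttl_s, clock=time.monotonic):
        self.ttl_s, self.clock, self._data = ttl_s, clock, {}

    def put(self, key, value):
        self._data[key] = (value, self.clock() + self.ttl_s)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires = item
        if self.clock() >= expires:
            del self._data[key]
            return None
        return value
```

With 32 leaves and a stride of four, this sketch keeps 34 hashes instead of the 63 a full tree would hold, at the cost of recomputing at most a few levels per query. The trade-off is a small amount of extra hashing in exchange for bounded memory.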

3.4 Startup Time Reduction

Redesigning in‑memory data structures and optimizing map lookups significantly shortened control‑plane startup time, alleviating bottlenecks during releases and restarts.

Results and Outlook

4.1 Performance Improvements

CDS push latency (P99): reduced from >40 s to <5 s for 30 k ServiceEntry / 30 k WorkloadEntry.

RDS push latency (P99): reduced from >1.5 min to <5 s for the same scale.

API query latency: reduced from >10 s to <20 ms.

4.2 Future Work

Extend the Merkle‑tree measurement system with additional metadata (e.g., timestamps) to compute end‑to‑end latency.

Increase queue concurrency to further boost throughput.

Collaborate with the open‑source community to contribute the improvements upstream.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
