How We Slashed Istio xDS Latency from Minutes to Seconds at Scale
This article describes the Istio control-plane performance problems Ctrip faced at massive scale, and the systematic methodology and concrete optimizations (O(n²) to O(n) patching, Merkle-tree-based success-rate metrics, and startup improvements) that reduced xDS push latency from minutes to seconds while improving reliability.
Background
Ctrip had a mature micro‑service governance system based on an SDK model, but global expansion and hybrid‑multi‑cloud scenarios required a standardized, decoupled, and portable infrastructure. Cloud‑native Service Mesh was chosen as the solution, and the Cloud Container & Service team adopted Istio.
During large‑scale rollout, the Istio control plane exhibited severe performance problems: long xDS push latency, opaque results, incomplete monitoring, memory leaks, and other issues that blocked Service Mesh adoption.
Methodology and Goals
2.1 Mapping Core Paths and Scenarios
The control plane must promptly detect changes to Kubernetes resources and propagate them to the relevant Envoy proxies. Key scenarios include:
xDS push latency: whether it still meets requirements as change volume grows.
Configuration delivery: whether each change reaches every Envoy, including Envoys connected to different control-plane nodes.
Startup time: how long the control plane takes to come up during releases or failure restarts.
2.2 Defining SLOs
With 30 k ServiceEntry and 30 k WorkloadEntry objects, keep xDS push latency at P95 < 3 s and P99 < 5 s.
Make the Istio configuration delivery success rate measurable.
Reduce control-plane startup time to ≤ 5 min.
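The latency SLO above can be checked mechanically against a batch of measured push latencies. The following is a minimal sketch, assuming latencies are collected in seconds and using the nearest-rank percentile definition; the function names and default thresholds are illustrative and are not taken from Ctrip's tooling.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p in [0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer arithmetic avoids floating-point surprises in ceil(p/100 * n).
    rank = (p * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def meets_latency_slo(latencies_s, p95_limit_s=3.0, p99_limit_s=5.0):
    """True if P95 and P99 are both strictly below their limits."""
    return (percentile(latencies_s, 95) < p95_limit_s
            and percentile(latencies_s, 99) < p99_limit_s)


# 97 fast pushes and 3 slow ones: P95 = 0.5 s, P99 = 4.5 s.
samples = [0.5] * 97 + [4.0, 4.5, 6.0]
print(meets_latency_slo(samples))
```

A production setup would more likely read these percentiles from histogram metrics than from raw samples; the raw-sample form is used here only because it makes the definition explicit.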
2.3 Test and Measurement Framework
Automated test tools were built to concurrently create, update, and delete resources such as WorkloadEntry and ServiceEntry. Monitoring records request volume, end-to-end push latency, and error rates, and query APIs are provided to verify configuration delivery.
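The shape of such a churn tool can be sketched as follows. This is an illustration only: `FakeRegistry` is a hypothetical in-memory stand-in for the Kubernetes API, and the timing covers just the apply call. Measuring end-to-end push latency would additionally require observing when each Envoy acknowledges the change, which is not shown here.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class FakeRegistry:
    """Hypothetical in-memory stand-in for the Kubernetes API."""

    def __init__(self):
        self._lock = threading.Lock()
        self.objects = {}

    def apply(self, op, kind, name, spec=None):
        with self._lock:
            if op == "delete":
                self.objects.pop((kind, name), None)
            else:  # "create" and "update" both store the latest spec
                self.objects[(kind, name)] = spec


def run_churn(registry, operations, workers=8):
    """Apply operations concurrently; return (latencies_s, error_count)."""

    def timed(op):
        start = time.perf_counter()
        try:
            registry.apply(*op)
            return time.perf_counter() - start, None
        except Exception as exc:  # count failures instead of aborting the run
            return time.perf_counter() - start, exc

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(timed, operations))
    latencies = [latency for latency, _ in results]
    errors = sum(1 for _, err in results if err is not None)
    return latencies, errors


registry = FakeRegistry()
creates = [("create", "WorkloadEntry", f"we-{i}", {"address": f"10.0.0.{i}"})
           for i in range(100)]
latencies, errors = run_churn(registry, creates)
print(len(registry.objects), errors)
```

Operations on the same object name are not ordered relative to each other within one run, so a realistic tool would either keep names distinct per batch, as above, or serialize operations per object.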
Optimization Solutions and Core Implementations
3.1 Reducing xDS Push Latency
Problem analysis: CDS and RDS push latency grew sharply once more than 30 k EnvoyFilters and 5 k VirtualServices were present, with P99 reaching the order of minutes. The root cause was an O(n²) algorithm used to patch EnvoyFilters.
Algorithm improvement: The nested loops were replaced with hash-map buckets keyed by service, subset, and port, so each cluster looks up its matching EnvoyFilter patches directly instead of scanning all of them. This makes the patch operation O(n) in the common case where EnvoyFilters and clusters map one-to-one.
Other xDS components (RDS, LDS, EDS) can adopt the same bucket‑based approach. Additional optimizations include on‑demand push and gateway data splitting to reduce the number of VirtualServices.
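The difference between the two approaches can be shown with a small sketch. It assumes clusters and patches are plain dictionaries carrying `service`, `subset`, and `port` fields; these shapes and the function names are hypothetical and do not reflect Istio's actual Go types. Patches that match by wildcard rather than by exact key would need a separate fallback bucket, which is omitted here.

```python
from collections import defaultdict


def bucket_key(item):
    return (item["service"], item["subset"], item["port"])


def patch_naive(clusters, patches):
    """O(len(clusters) * len(patches)): every cluster scans every patch."""
    out = []
    for cluster in clusters:
        merged = dict(cluster)
        for p in patches:
            if bucket_key(p) == bucket_key(cluster):
                merged.update(p["patch"])
        out.append(merged)
    return out


def patch_bucketed(clusters, patches):
    """O(len(clusters) + len(patches)) when each key matches few patches."""
    buckets = defaultdict(list)
    for p in patches:  # one pass to index the patches
        buckets[bucket_key(p)].append(p["patch"])
    out = []
    for cluster in clusters:  # one pass with a direct lookup per cluster
        merged = dict(cluster)
        for patch in buckets.get(bucket_key(cluster), ()):
            merged.update(patch)
        out.append(merged)
    return out


clusters = [{"service": f"svc-{i}", "subset": "v1", "port": 80, "timeout": "1s"}
            for i in range(5)]
patches = [{"service": "svc-2", "subset": "v1", "port": 80,
            "patch": {"timeout": "5s"}}]
print(patch_bucketed(clusters, patches) == patch_naive(clusters, patches))
```

Both functions apply patches in their original order, so the bucketed version produces the same result as the nested loops; only the lookup cost changes.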
3.2 Merkle‑Tree Based Success‑Rate Measurement
Each push event is linked to a Merkle-tree-derived ledgerVersion (the root hash). An API exposes the ledgerVersion that each Envoy has acked, so the success rate can be computed as the fraction of Envoys whose acked version matches the latest one.
This mechanism also enables safe pod deletion: a finalizer holds the pod until all Envoys have acknowledged the change.
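The idea can be illustrated with a simple binary Merkle tree. This is a sketch under stated assumptions, not the ledger implementation used in production: the leaves are taken to be serialized configuration objects, SHA-256 is an arbitrary choice of hash, and `merkle_root` and `success_rate` are hypothetical names.

```python
import hashlib


def merkle_root(leaves):
    """Hex root hash over byte strings; an odd node is paired with itself."""
    if not leaves:
        return hashlib.sha256(b"").hexdigest()
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0].hex()


def success_rate(latest_version, acked_by_proxy):
    """Fraction of proxies whose acked version equals the latest version."""
    if not acked_by_proxy:
        return 1.0  # design choice: with no proxies, nothing is outstanding
    matching = sum(1 for v in acked_by_proxy.values() if v == latest_version)
    return matching / len(acked_by_proxy)


old = merkle_root([b"se-1", b"se-2", b"we-1"])
new = merkle_root([b"se-1", b"se-2", b"we-1-changed"])
acked = {"envoy-a": new, "envoy-b": new, "envoy-c": new, "envoy-d": old}
print(success_rate(new, acked))  # 0.75
```

Under the same assumptions, the deletion guard reduces to removing the finalizer only once `success_rate` reaches 1.0 for the version that contains the change.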
3.3 Optimizing Query Latency and Memory Usage
The original Merkle tree stored every node's hash, causing memory growth and occasional out-of-memory (OOM) failures. By persisting only nodes at heights divisible by four and using a TTL cache to expire stale hashes, query latency dropped from >10 s to <20 ms and memory usage stabilized.
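One way to read this optimization is sketched below. The article does not say how hashes at the skipped heights are recovered, so the sketch assumes they are recomputed on demand from the nearest stored level beneath them. Leaves are placed at height 0, the leaf count is assumed to be a power of two, and all names are hypothetical.

```python
import hashlib
import time


def sha(data):
    return hashlib.sha256(data).digest()


def build_levels(leaf_hashes):
    """All levels of the tree, leaves first (power-of-two leaf count)."""
    levels = [list(leaf_hashes)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([sha(prev[i] + prev[i + 1])
                       for i in range(0, len(prev), 2)])
    return levels


def keep_checkpoint_levels(levels, stride=4):
    """Keep only levels whose height is divisible by stride."""
    return {h: lvl for h, lvl in enumerate(levels) if h % stride == 0}


def node_hash(stored, height, index, stride=4):
    """Return a node hash, recomputing it if its level was not stored."""
    base = height - height % stride
    span = 1 << (height - base)  # stored nodes beneath the requested node
    nodes = stored[base][index * span:(index + 1) * span]
    while len(nodes) > 1:
        nodes = [sha(nodes[i] + nodes[i + 1])
                 for i in range(0, len(nodes), 2)]
    return nodes[0]


class TTLCache:
    """Cache whose entries expire ttl_s seconds after insertion."""

    def __init__(self, ttl_s, clock=time.monotonic):
        self._ttl = ttl_s
        self._clock = clock
        self._items = {}

    def put(self, key, value):
        self._items[key] = (value, self._clock() + self._ttl)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._items[key]  # drop the stale hash on access
            return None
        return value


leaves = [sha(str(i).encode()) for i in range(32)]
levels = build_levels(leaves)
stored = keep_checkpoint_levels(levels)
print(sum(len(v) for v in stored.values()), sum(len(v) for v in levels))
```

With 32 leaves the checkpointed form keeps 34 hashes instead of 63, and any skipped node can be rebuilt from at most 8 stored hashes beneath it. The injectable `clock` exists only to make expiry easy to test.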
3.4 Startup Time Reduction
Redesigning in‑memory data structures and optimizing map lookups significantly shortened control‑plane startup time, alleviating bottlenecks during releases and restarts.
Results and Outlook
4.1 Performance Improvements
CDS push latency (P99): reduced from >40 s to <5 s for 30 k ServiceEntry / 30 k WorkloadEntry.
RDS push latency (P99): reduced from >1.5 min to <5 s for the same scale.
API query latency: reduced from >10 s to <20 ms.
4.2 Future Work
Extend the Merkle‑tree measurement system with additional metadata (e.g., timestamps) to compute end‑to‑end latency.
Increase queue concurrency to further boost throughput.
Collaborate with the open‑source community to contribute the improvements upstream.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.