How to Upgrade Ztunnel in ASM Ambient Mode Without Traffic Disruption
This article explains the Ztunnel upgrade process in Alibaba Service Mesh Ambient mode, details the rolling and graceful‑shutdown mechanisms, presents performance test results with and without graceful shutdown, and offers best‑practice recommendations to minimize traffic interruption during upgrades.
Background
ASM 1.25 officially supports Ambient mode, which offers higher data‑plane forwarding performance and lower resource consumption while retaining most advanced mesh features. Ztunnel is a node‑level L4 proxy that handles all inbound and outbound pod traffic, performs mTLS encryption/decryption and identity verification, and forwards traffic to Waypoint when present.
Principle Explanation
The primary goal of a Ztunnel upgrade is to minimize traffic interruption. Because Ztunnel operates at layer 4, the state of an established TCP connection cannot simply be handed over to another process, and Ztunnel has no layer 7 channel through which to tell applications to reconnect. Within these constraints Ztunnel can:
Guarantee that any new connection succeeds at any time, never dropping a fresh connection.
Allow the old Ztunnel a grace period to continue processing existing connections.
Ztunnel Rolling Process
Start a new Ztunnel instance.
ASM CNI reports the status of all pods on the node to the new Ztunnel, which creates listeners for each pod and marks itself as “ready”.
Both old and new Ztunnel run concurrently; new connections are load‑balanced between them.
Kubernetes sends a SIGTERM to the old Ztunnel, which begins draining.
The old Ztunnel closes its listeners. Because the new Ztunnel is already listening by this point, at least one Ztunnel is accepting connections at every moment.
The old Ztunnel stops accepting new connections but continues handling existing ones.
After the drain timeout, the old Ztunnel force‑terminates any remaining connections.
The offline process then completes.
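The handoff described above can be sketched as a small state model. This is a toy illustration, not Ztunnel's implementation: the Ztunnel class, the rolling_upgrade function, and the connection names are hypothetical, and the steps are collapsed into simple flag changes. It only demonstrates the ordering property the process relies on, namely that some instance is listening after every step.

```python
from dataclasses import dataclass, field


@dataclass
class Ztunnel:
    """Toy model of one Ztunnel instance on a node (hypothetical, for illustration)."""
    name: str
    listening: bool = False
    connections: set = field(default_factory=set)


def rolling_upgrade(existing_connections, still_open_at_timeout):
    """Walk through the handoff and record which instances listen after each step.

    existing_connections: connections held by the old instance before the upgrade.
    still_open_at_timeout: the subset that has not finished when the drain timeout expires.
    """
    old = Ztunnel("old", listening=True, connections=set(existing_connections))
    new = Ztunnel("new")
    timeline = []

    def record(step):
        listeners = [z.name for z in (old, new) if z.listening]
        timeline.append((step, listeners))

    record("initial state")

    # The new instance starts, receives the pod list, opens listeners, reports ready.
    new.listening = True
    record("new instance ready")

    # SIGTERM: the old instance closes its listeners but keeps existing connections.
    old.listening = False
    record("old instance draining")

    # Connections that finish on their own during the drain window go away cleanly.
    old.connections &= set(still_open_at_timeout)

    # Drain timeout: whatever is left is force-terminated.
    interrupted = set(old.connections)
    old.connections.clear()
    record("old instance exited")

    return timeline, interrupted


timeline, interrupted = rolling_upgrade({"conn-a", "conn-b", "conn-c"}, {"conn-c"})
for step, listeners in timeline:
    print(f"{step}: listening = {listeners}")
print("force-terminated:", sorted(interrupted))
```

In this model only connections that outlive the drain window are cut, which is why the drain timeout and connection lifetimes matter in the sections that follow.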
Graceful Shutdown
Since Ztunnel works at L4, it cannot achieve the sophisticated graceful shutdown of HTTP/gRPC, but configuring an appropriate shutdown timeout can still significantly reduce the impact of restarts. Long‑lived TCP connections will inevitably be interrupted, yet most protocols allow configuring a maximum connection age to avoid excessive reliance on uninterrupted TCP streams.
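To make the interaction between the shutdown timeout and connection lifetime concrete, here is a minimal sketch. The function name and the sample lifetimes are made up for illustration, and the model assumes a connection is cut only if it is still open when the drain timeout expires.

```python
def interrupted_by_drain(remaining_lifetimes_s, drain_timeout_s):
    """Return the remaining lifetimes of connections still open when the drain ends.

    remaining_lifetimes_s: how long each existing connection would keep running
    (in seconds) from the moment the old instance starts draining.
    """
    return [t for t in remaining_lifetimes_s if t > drain_timeout_s]


# Hypothetical remaining lifetimes, in seconds, of connections on the old instance.
remaining = [0.5, 3.0, 45.0, 110.0, 600.0]

# No grace period: every existing connection is cut.
print(len(interrupted_by_drain(remaining, 0)))

# 120 s grace period: only the long-lived connection is cut.
print(interrupted_by_drain(remaining, 120))

# Capping connection age at 100 s (an assumed value) keeps every connection
# shorter than the drain window, so nothing is cut.
capped = [min(t, 100.0) for t in remaining]
print(interrupted_by_drain(capped, 120))
```

The last case is the reason a maximum connection age pairs well with a drain timeout: once every connection is guaranteed to end within the drain window, the forced termination at the end of the window has nothing left to cut.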
Simulation Test
The test scenario simulates a client accessing the ASM gateway, which forwards traffic to an httpbin service running in Ambient mode; all traffic to httpbin passes through Ztunnel. The client runs inside the cluster without Ambient enabled and uses a ClusterIP to reach the gateway, minimizing external network factors.
1. Without graceful shutdown
fortio load -qps 0 -c 1000 -t 60s -timeout 10s istio-ingressgateway.istio-system/status/418
Code 418 : 289965 (100.0 %)
Code 503 : 5 (0.0 %)
All done 289970 calls (plus 1000 warmup) 208.285 ms avg, 4700.8 qps
fortio load -qps 0 -c 100 -t 60s -timeout 10s istio-ingressgateway.istio-system/status/418
Code 418 : 297514 (100.0 %)
Code 503 : 4 (0.0 %)
All done 297518 calls (plus 100 warmup) 20.180 ms avg, 4946.8 qps
2. With graceful shutdown set to 120 s
To ensure the test covers the entire Ztunnel startup window, the test duration was extended from 60 s to 200 s.
fortio load -qps 0 -c 1000 -t 200s -timeout 10s istio-ingressgateway.istio-system/status/418
Code 418 : 961124 (100.0 %)
Code 503 : 2 (0.0 %)
All done 961126 calls (plus 1000 warmup) 208.484 ms avg, 4766.9 qps
fortio load -qps 0 -c 100 -t 200s -timeout 10s istio-ingressgateway.istio-system/status/418
Code 418 : 960527 (100.0 %)
Code 503 : 1 (0.0 %)
All done 960528 calls (plus 100 warmup) 20.825 ms avg, 4799.2 qps
Result Analysis
Because the number of 503 responses is very low (single digits), each test was run multiple times. Before enabling graceful shutdown, failed requests ranged from 4 to 6. After enabling graceful shutdown, failed requests dropped to 1‑2.
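For a sense of scale, the four runs listed above can be normalized to failed requests per million. The call and failure counts below are copied from the fortio output; the helper name is ours. Note that the graceful-shutdown runs lasted 200 s instead of 60 s, so they contain more requests per Ztunnel restart. The per-million figure therefore overstates the improvement somewhat, and the absolute 503 counts remain the more direct comparison.

```python
# (label, total calls, calls that returned 503), taken from the fortio output above.
runs = [
    ("no graceful shutdown, 1000 connections", 289970, 5),
    ("no graceful shutdown, 100 connections", 297518, 4),
    ("120 s graceful shutdown, 1000 connections", 961126, 2),
    ("120 s graceful shutdown, 100 connections", 960528, 1),
]


def failures_per_million(total_calls, failed_calls):
    """Scale a failure count to failures per one million requests."""
    return failed_calls / total_calls * 1_000_000


for label, total, failed in runs:
    rate = failures_per_million(total, failed)
    print(f"{label}: {failed} failed, {rate:.1f} per million requests")
```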
Even though the test is simple, configuring a graceful shutdown noticeably reduces 503 errors. The observations are:
Long‑lived TCP connections will inevitably be interrupted, but Ztunnel’s design ensures new connections can always be established, resulting in only a few affected requests under high concurrency.
For connections with limited duration, a reasonable graceful‑shutdown timeout lowers traffic disruption further; in our tests a 120 s timeout reduced failed requests from 4 to 6 per run to 1 or 2.
Best Practices
Proactively configure parameters such as maxConnectionAge or maxConnectionDuration in your services; even without a mesh, longer TCP connections increase interruption risk, and these settings work well with Ztunnel for lossless upgrades.
After enabling ASM Ambient mode, Ztunnel provides built‑in observability. Analyze Ztunnel logs to obtain the duration of each TCP connection and set the graceful‑shutdown timeout accordingly. The 120 s timeout used in this article was chosen from the connection durations observed in those logs.
In the test scenario, configuring maxConnectionDuration via a DestinationRule on the gateway could theoretically eliminate traffic interruption entirely.
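One way to turn observed connection durations into a graceful‑shutdown timeout is to take a high quantile and add some headroom. The sketch below assumes the durations have already been extracted from the Ztunnel logs into a list of seconds; the function name, the 0.99 quantile, and the 1.2 headroom factor are illustrative choices, not values from ASM or from the tests in this article.

```python
import math


def suggest_drain_timeout(durations_s, quantile=0.99, headroom=1.2):
    """Suggest a drain timeout in whole seconds from observed connection durations.

    Uses the nearest-rank quantile of the durations, multiplied by a headroom
    factor and rounded up. Both defaults are assumptions for illustration.
    """
    if not durations_s:
        raise ValueError("need at least one observed connection duration")
    ordered = sorted(durations_s)
    # Nearest-rank method: the smallest value with at least `quantile` of the data at or below it.
    rank = max(1, math.ceil(quantile * len(ordered)))
    return math.ceil(ordered[rank - 1] * headroom)


# Hypothetical durations (seconds) as they might be collected from connection logs.
observed = [0.2, 0.4, 1.5, 2.0, 3.1, 8.0, 12.5, 30.0, 61.0, 95.0]
print(suggest_drain_timeout(observed))
```

Connections longer than the chosen quantile will still be cut at the end of the drain window, so a timeout picked this way works best together with a maximum connection age on the client or gateway side.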
Conclusion
ASM 1.25’s Ambient mode is now stable and well‑integrated with Alibaba Cloud Container Service. By following the recommended upgrade and graceful‑shutdown practices, you can achieve near‑zero‑downtime upgrades for Ztunnel.
