vivo Kubernetes Cluster Zero-Downtime Upgrade from v1.10 to v1.17: Practices and Solutions
Vivo's internet team performed a zero-downtime, in-place upgrade of a 1,000-node Kubernetes cluster from v1.10 to v1.17. By analyzing changelogs, backporting fixes, adjusting kubelet hash validation, adding tolerations to existing pods, ensuring node labels were in place, and rolling binaries out in stages, they completed the process in roughly ten minutes.
This article provides a comprehensive guide to upgrading a Kubernetes cluster from v1.10 to v1.17 in a production environment without impacting running business applications.
Background: With Kubernetes releasing new versions every 3 months, enterprises face the challenge of balancing frequent updates with production stability. Vivo's internet team needed to upgrade their large-scale Kubernetes clusters from v1.10 to address performance bottlenecks, support CNCF projects like OpenKruise, improve resource utilization, and reduce version fragmentation.
Key Challenges and Solutions:
1. Upgrade Method: Two approaches exist - replacement upgrade and in-place upgrade. Vivo chose in-place upgrade for binary-deployed clusters due to shorter downtime and simpler operations.
2. Cross-Version Upgrade: Despite community recommendations for incremental upgrades, Vivo successfully performed direct v1.10 to v1.17 upgrade after thorough ChangeLog analysis and API compatibility testing.
3. Avoiding Container Restarts: A critical challenge was that kubelet's container hash calculation changed between v1.10 and v1.17, causing container restarts. The solution involved creating a local configuration file tracking old cluster version and startup time, allowing kubelet to skip hash validation for pods created by the old version.
4. Pod Unexpected Eviction: The TaintBasedEvictions feature, introduced in v1.13 and enabled by default in v1.16, could evict existing pods that lacked the corresponding tolerations. Solution: trigger an update on existing pods (by applying a label) so the DefaultTolerationSeconds admission controller injects the missing tolerations.
5. MatchNodeSelector Issue: Pods entered Failed state with MatchNodeSelector reason when kubelet restarted. Solution: Ensure nodes have required labels before upgrade.
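The check kubelet re-runs on restart is a simple label-subset match; a pre-upgrade audit can replay it to catch nodes that would fail pods. A minimal sketch (the `disktype` selector is a hypothetical example):

```go
package main

import "fmt"

// selectorMatches reports whether a node's labels satisfy a pod's
// nodeSelector: every selector key must be present with an equal value.
// If a required label is missing when kubelet restarts, its admission check
// fails and running pods are marked Failed with reason MatchNodeSelector.
func selectorMatches(nodeLabels, nodeSelector map[string]string) bool {
	for k, v := range nodeSelector {
		if nodeLabels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	selector := map[string]string{"disktype": "ssd"} // hypothetical pod nodeSelector
	okNode := map[string]string{"disktype": "ssd", "zone": "a"}
	badNode := map[string]string{"zone": "a"} // label missing: pod would fail on kubelet restart
	fmt.Println(selectorMatches(okNode, selector))
	fmt.Println(selectorMatches(badNode, selector))
}
```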
6. kube-apiserver Connection Issues: HTTP/2.0 connection bugs in golang net/http2 package caused kubelet connection failures. Solution: Backport the fix to v1.17 branch and use golang 1.15.15.
7. TCP Connection Count Increase: A code change caused kubelet to create multiple connections instead of reusing them. Solution: Partially reverted the problematic code change.
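The general failure mode is easy to reproduce outside Kubernetes: constructing a fresh `http.Transport` per request defeats Go's connection pooling, so every request dials a new TCP connection, while a shared transport reuses one idle connection. A self-contained sketch (not vivo's actual kubelet code, just the same bug class):

```go
package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// connectionsUsed issues sequential GETs against a throwaway test server and
// returns how many TCP connections the server accepted. With
// perRequestTransport=true each request gets a fresh http.Transport (the
// buggy pattern: the idle-connection pool is never reused); with false a
// single Transport is shared across requests.
func connectionsUsed(requests int, perRequestTransport bool) int64 {
	var newConns int64
	srv := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, "ok")
	}))
	srv.Config.ConnState = func(c net.Conn, s http.ConnState) {
		if s == http.StateNew {
			atomic.AddInt64(&newConns, 1)
		}
	}
	srv.Start()
	defer srv.Close()

	shared := &http.Client{Transport: &http.Transport{}}
	for i := 0; i < requests; i++ {
		client := shared
		if perRequestTransport {
			client = &http.Client{Transport: &http.Transport{}}
		}
		resp, err := client.Get(srv.URL)
		if err != nil {
			panic(err)
		}
		io.Copy(io.Discard, resp.Body) // drain so the connection returns to the pool
		resp.Body.Close()
	}
	return atomic.LoadInt64(&newConns)
}

func main() {
	fmt.Println("per-request transports:", connectionsUsed(3, true))  // 3 connections
	fmt.Println("shared transport:", connectionsUsed(3, false))       // 1 connection
}
```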
Upgrade Procedure:
1. Back up the cluster (binaries, config files, etcd)
2. Gray-release (canary) upgrade a subset of nodes to verify correctness
3. Pre-distribute the upgrade binaries to all nodes
4. Stop controllers, scheduler, and alerting
5. Update control-plane service configs and upgrade components
6. Update compute-node service configs and upgrade components
7. Add labels to nodes to trigger pod toleration additions
8. Re-enable controllers, scheduler, and alerting
9. Verify cluster health
The upgrade of a 1,000-node cluster took approximately 10 minutes using batched operations.
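The ten-minute figure is possible because compute nodes are upgraded in parallel waves rather than one by one. A minimal Go sketch of that batching logic (node names, batch size, and the per-node callback are hypothetical; in practice the callback would restart kubelet with the new binary):

```go
package main

import "fmt"

// upgradeInBatches splits nodes into fixed-size batches and upgrades each
// batch concurrently, waiting for the whole batch to finish before starting
// the next, so a bad binary never reaches the entire fleet at once.
func upgradeInBatches(nodes []string, batchSize int, upgrade func(node string)) {
	for start := 0; start < len(nodes); start += batchSize {
		end := start + batchSize
		if end > len(nodes) {
			end = len(nodes)
		}
		done := make(chan struct{})
		for _, n := range nodes[start:end] {
			go func(node string) {
				upgrade(node) // e.g. push config, restart kubelet
				done <- struct{}{}
			}(n)
		}
		for range nodes[start:end] {
			<-done // block until this batch completes
		}
	}
}

func main() {
	nodes := []string{"node-1", "node-2", "node-3", "node-4", "node-5"}
	upgradeInBatches(nodes, 2, func(n string) { fmt.Println("upgraded", n) })
}
```

Choosing the batch size trades total duration against blast radius: larger batches finish faster but expose more nodes to an undetected regression at once.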
vivo Internet Technology
Sharing vivo's hands-on internet technology practices and salon events, along with the latest industry news and conference highlights.