vivo Kubernetes Cluster Zero-Downtime Upgrade from v1.10 to v1.17: Practices and Solutions
Vivo's internet team performed a zero-downtime, in-place upgrade of a 1,000-node Kubernetes cluster from v1.10 to v1.17. By analyzing changelogs, backporting fixes, adjusting kubelet hash validation, adding tolerations to existing pods, ensuring node labels were in place, and rolling binaries out in stages, they completed the process in roughly ten minutes.
This article provides a comprehensive guide to upgrading a Kubernetes cluster from v1.10 to v1.17 in a production environment without impacting running business applications.
Background: With Kubernetes releasing new versions every 3 months, enterprises face the challenge of balancing frequent updates with production stability. Vivo's internet team needed to upgrade their large-scale Kubernetes clusters from v1.10 to address performance bottlenecks, support CNCF projects like OpenKruise, improve resource utilization, and reduce version fragmentation.
Key Challenges and Solutions:
1. Upgrade Method: Two approaches exist - replacement upgrade and in-place upgrade. Vivo chose in-place upgrade for binary-deployed clusters due to shorter downtime and simpler operations.
2. Cross-Version Upgrade: Despite community recommendations for incremental upgrades, Vivo successfully performed direct v1.10 to v1.17 upgrade after thorough ChangeLog analysis and API compatibility testing.
3. Avoiding Container Restarts: A critical challenge was that kubelet's container hash calculation changed between v1.10 and v1.17, causing container restarts. The solution involved creating a local configuration file tracking old cluster version and startup time, allowing kubelet to skip hash validation for pods created by the old version.
4. Pod Unexpected Eviction: The TaintBasedEvictions feature, introduced in v1.13 and enabled by default in v1.16, could evict existing pods that lacked the corresponding tolerations. Solution: trigger an update on existing pods (by applying a label) so the DefaultTolerationSeconds admission controller injects the missing tolerations.
5. MatchNodeSelector Issue: Pods entered Failed state with MatchNodeSelector reason when kubelet restarted. Solution: Ensure nodes have required labels before upgrade.
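The check kubelet re-runs on restart is a simple label-subset match; a pre-upgrade audit can replay it to catch nodes that would fail pods. A minimal sketch (the `disktype` selector is a hypothetical example):

```go
package main

import "fmt"

// selectorMatches reports whether a node's labels satisfy a pod's
// nodeSelector: every selector key must be present with an equal value.
// If a required label is missing when kubelet restarts, its admission check
// fails and running pods are marked Failed with reason MatchNodeSelector.
func selectorMatches(nodeLabels, nodeSelector map[string]string) bool {
	for k, v := range nodeSelector {
		if nodeLabels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	selector := map[string]string{"disktype": "ssd"} // hypothetical pod nodeSelector
	okNode := map[string]string{"disktype": "ssd", "zone": "a"}
	badNode := map[string]string{"zone": "a"} // label missing: pod would fail on kubelet restart
	fmt.Println(selectorMatches(okNode, selector))
	fmt.Println(selectorMatches(badNode, selector))
}
```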
6. kube-apiserver Connection Issues: HTTP/2.0 connection bugs in golang net/http2 package caused kubelet connection failures. Solution: Backport the fix to v1.17 branch and use golang 1.15.15.
7. TCP Connection Count Increase: A code change caused kubelet to create multiple connections instead of reusing them. Solution: Partially reverted the problematic code change.
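The general failure mode is easy to reproduce outside Kubernetes: constructing a fresh `http.Transport` per request defeats Go's connection pooling, so every request dials a new TCP connection, while a shared transport reuses one idle connection. A self-contained sketch (not vivo's actual kubelet code, just the same bug class):

```go
package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// connectionsUsed issues sequential GETs against a throwaway test server and
// returns how many TCP connections the server accepted. With
// perRequestTransport=true each request gets a fresh http.Transport (the
// buggy pattern: the idle-connection pool is never reused); with false a
// single Transport is shared across requests.
func connectionsUsed(requests int, perRequestTransport bool) int64 {
	var newConns int64
	srv := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, "ok")
	}))
	srv.Config.ConnState = func(c net.Conn, s http.ConnState) {
		if s == http.StateNew {
			atomic.AddInt64(&newConns, 1)
		}
	}
	srv.Start()
	defer srv.Close()

	shared := &http.Client{Transport: &http.Transport{}}
	for i := 0; i < requests; i++ {
		client := shared
		if perRequestTransport {
			client = &http.Client{Transport: &http.Transport{}}
		}
		resp, err := client.Get(srv.URL)
		if err != nil {
			panic(err)
		}
		io.Copy(io.Discard, resp.Body) // drain so the connection returns to the pool
		resp.Body.Close()
	}
	return atomic.LoadInt64(&newConns)
}

func main() {
	fmt.Println("per-request transports:", connectionsUsed(3, true))  // 3 connections
	fmt.Println("shared transport:", connectionsUsed(3, false))       // 1 connection
}
```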
Upgrade Procedure:
1. Back up the cluster (binaries, config files, etcd)
2. Gray-release (canary) upgrade a subset of nodes to verify correctness
3. Pre-distribute the upgrade binaries to all nodes
4. Stop controllers, scheduler, and alerting
5. Update control-plane service configs and upgrade components
6. Update compute-node service configs and upgrade components
7. Add labels to nodes to trigger pod toleration additions
8. Re-enable controllers, scheduler, and alerting
9. Verify cluster health
The upgrade of a 1,000-node cluster took approximately 10 minutes using batched operations.
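The ten-minute figure is possible because compute nodes are upgraded in parallel waves rather than one by one. A minimal Go sketch of that batching logic (node names, batch size, and the per-node callback are hypothetical; in practice the callback would restart kubelet with the new binary):

```go
package main

import "fmt"

// upgradeInBatches splits nodes into fixed-size batches and upgrades each
// batch concurrently, waiting for the whole batch to finish before starting
// the next, so a bad binary never reaches the entire fleet at once.
func upgradeInBatches(nodes []string, batchSize int, upgrade func(node string)) {
	for start := 0; start < len(nodes); start += batchSize {
		end := start + batchSize
		if end > len(nodes) {
			end = len(nodes)
		}
		done := make(chan struct{})
		for _, n := range nodes[start:end] {
			go func(node string) {
				upgrade(node) // e.g. push config, restart kubelet
				done <- struct{}{}
			}(n)
		}
		for range nodes[start:end] {
			<-done // block until this batch completes
		}
	}
}

func main() {
	nodes := []string{"node-1", "node-2", "node-3", "node-4", "node-5"}
	upgradeInBatches(nodes, 2, func(n string) { fmt.Println("upgraded", n) })
}
```

Choosing the batch size trades total duration against blast radius: larger batches finish faster but expose more nodes to an undetected regression at once.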
vivo Internet Technology
Sharing vivo's hands-on internet technology practices and salon events, along with the latest industry news and conference highlights.