
How Cloud Providers Achieve Near‑Zero Downtime: Hot Upgrades, Live Migration, and Auto‑Backup

This article summarizes a 2016 Global Operations conference talk that explains the challenges of cloud high‑availability, the classification of failures, and the practical solutions implemented by Kingsoft Cloud—including hot kernel upgrades, online migration, incremental disk formats, and automated backup—to keep service interruptions under a few hundred milliseconds.

Efficient Ops

Opening Remarks

The speaker introduces himself and humorously notes the obscurity of his surname.

What Does Cloud Operations Do?

Operations is demanding work, and even more so when the system being operated is the cloud platform itself. The goal is to raise the availability of the services that run on it.

Overview

The talk covers three main topics:

Challenges of cloud high‑availability

Requirements and goals for high‑availability

How Kingsoft Cloud addresses these challenges

1. Challenges of Cloud High‑Availability

Rapid Growth

Cloud services scale quickly, with new machines added every week. At that fleet size, even low per‑machine hardware and software failure rates can add up to incidents every day.

Equipment heterogeneity and hardware lifecycle (typically three‑year cycles) create additional reliability concerns.

Failures are analyzed along two dimensions: horizontal (hardware vs. software) and vertical (planned vs. unplanned), resulting in four intersecting categories graded from level 0 (most critical) to level 3.
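The two‑dimensional classification can be sketched as a small data model. This is only an illustration: the class names, the example incidents, and their levels are hypothetical, and since the talk summary does not say how levels map onto the four categories, the level is kept as an independent field here.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    HARDWARE = "hardware"
    SOFTWARE = "software"


class Timing(Enum):
    PLANNED = "planned"
    UNPLANNED = "unplanned"


@dataclass(frozen=True)
class Incident:
    description: str
    source: Source   # horizontal dimension
    timing: Timing   # vertical dimension
    level: int       # 0 = most critical ... 3 = least critical

    def __post_init__(self) -> None:
        if not 0 <= self.level <= 3:
            raise ValueError("level must be between 0 and 3")


def group_by_quadrant(incidents):
    """Group incidents by (source, timing), most critical first in each group."""
    groups = defaultdict(list)
    for incident in incidents:
        groups[(incident.source, incident.timing)].append(incident)
    for members in groups.values():
        members.sort(key=lambda item: item.level)
    return dict(groups)


# Illustrative incidents only; the levels are assumptions, not from the talk.
incidents = [
    Incident("disk failure on a host", Source.HARDWARE, Timing.UNPLANNED, 1),
    Incident("kernel security patch", Source.SOFTWARE, Timing.PLANNED, 3),
    Incident("host power loss", Source.HARDWARE, Timing.UNPLANNED, 0),
]
quadrants = group_by_quadrant(incidents)
```

Grouping by quadrant first and sorting by level second makes it easy to see which category produces the most critical incidents.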

2. Requirements and Goals for High‑Availability

Availability is formally measured against the SLA, but what users care about is the downtime they actually experience. The target is to keep unavailable time under 20 minutes per month, whether it is spent on many short incidents or a single longer one, which is consistent with an availability goal of 99.95%. Reaching it means reducing both the frequency and the duration of failures.
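The relationship between an availability percentage and a monthly downtime budget is simple arithmetic. The sketch below assumes a 30‑day month; the function names are illustrative.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability: float, days: int = 30) -> float:
    """Maximum downtime in minutes over `days` at the given availability."""
    return (1.0 - availability) * days * MINUTES_PER_DAY


def achieved_availability(downtime_minutes: float, days: int = 30) -> float:
    """Availability actually achieved given the observed downtime."""
    return 1.0 - downtime_minutes / (days * MINUTES_PER_DAY)


budget = downtime_budget_minutes(0.9995)
print(f"{budget:.1f} minutes")                 # 21.6 minutes
print(achieved_availability(20.0) > 0.9995)    # True
```

At 99.95% the budget is about 21.6 minutes per 30‑day month, so staying under 20 minutes leaves a small margin.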

3. How We Respond

Planned (Zero Impact)

We perform hot kernel upgrades and online migration without rebooting physical machines.

Online migration moves VMs to healthy hosts before a failure occurs.

Unplanned (Continuous Reduction)

We aim to convert some unplanned failures into planned ones and shorten the outage window. For shared storage we use auto‑failover; for local storage we employ auto‑backup to mitigate data loss.

Hot Kernel Upgrade

Community solutions such as Ksplice and kpatch exist, but functions that are called at very high frequency (CPU scheduling, KVM interrupt handling) are difficult to patch live.

We reduced the impact of these high‑frequency calls through custom optimizations, achieving zero kernel‑level failures in production.

Hypervisor Hot Upgrade

We combine hot upgrade with online migration, targeting a downtime of 300 ms (far below the typical 3 s promised by vendors).
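To show what a downtime target like 300 ms means in practice, here is a rough model of iterative pre‑copy migration, a common live‑migration technique. The talk summary does not confirm that Kingsoft Cloud uses pre‑copy, and the memory size, dirty rate, and bandwidth below are made‑up inputs, so treat this purely as an illustration.

```python
from typing import Optional, Tuple

GIB = 1024 ** 3
MIB = 1024 ** 2


def simulate_precopy(
    ram_bytes: float,
    dirty_rate: float,          # bytes dirtied per second by the guest
    bandwidth: float,           # bytes transferred per second
    downtime_budget_s: float = 0.3,
    max_rounds: int = 30,
) -> Optional[Tuple[int, float]]:
    """Return (live copy rounds, final pause in seconds), or None if the
    remaining data never becomes small enough to fit the downtime budget."""
    remaining = float(ram_bytes)
    for live_rounds in range(max_rounds + 1):
        pause_s = remaining / bandwidth
        if pause_s <= downtime_budget_s:
            # Small enough: pause the guest and send the rest in one go.
            return live_rounds, pause_s
        # Otherwise copy while the guest keeps running; it dirties more
        # memory during the copy, which becomes the next round's work.
        remaining = min(float(ram_bytes), dirty_rate * pause_s)
    return None


result = simulate_precopy(8 * GIB, 100 * MIB, 1 * GIB)
print(result)
```

With these inputs the model converges after two live rounds with a final pause well under 300 ms. If the guest dirties memory faster than the link can carry it, the function returns None, which is the case where a migration cannot meet its downtime target without further measures.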

Online Migration

There are two scenarios: migration with shared storage and migration with local storage.

For local disks we avoid a full copy by using an incremental disk format that records only the blocks that have changed, which cuts migration time by more than a factor of ten.
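The idea of recording only changed blocks can be sketched as follows. This is a minimal in‑memory model, not Kingsoft Cloud's actual disk format: the class `TrackedDisk`, the 4 KiB block size, and the helper `apply_delta` are all hypothetical.

```python
class TrackedDisk:
    """A toy block device that remembers which blocks were written."""

    def __init__(self, num_blocks: int, block_size: int = 4096) -> None:
        self.block_size = block_size
        self.blocks = [bytes(block_size) for _ in range(num_blocks)]
        self.dirty = set()  # indices of blocks changed since the last delta

    def write(self, index: int, data: bytes) -> None:
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block long")
        self.blocks[index] = data
        self.dirty.add(index)

    def take_delta(self) -> dict:
        """Return only the changed blocks and reset the tracking set."""
        delta = {index: self.blocks[index] for index in sorted(self.dirty)}
        self.dirty.clear()
        return delta


def apply_delta(target_blocks: list, delta: dict) -> None:
    """Bring a destination copy up to date using a delta."""
    for index, data in delta.items():
        target_blocks[index] = data


source = TrackedDisk(num_blocks=1000)
destination = list(source.blocks)        # destination already holds a base copy

source.write(7, b"\x01" * 4096)
source.write(42, b"\x02" * 4096)
source.write(7, b"\x03" * 4096)          # rewriting a block does not grow the delta

delta = source.take_delta()
apply_delta(destination, delta)
print(len(delta), destination == source.blocks)
```

Here only 2 of 1000 blocks cross the network, which is where the time saving comes from when the destination already has a base image.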

Incremental Disk & Auto Backup

Our custom incremental disk tracks changed data, allowing rapid backup and restore with less than 3% impact on running services.
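Backup and restore on top of change tracking can be pictured as a full backup plus an ordered chain of incremental backups. The sketch below assumes each incremental is a mapping from block index to block contents; the function name `restore` and the data layout are illustrative and not taken from the talk.

```python
from typing import Dict, List

Block = bytes
Delta = Dict[int, Block]


def restore(full: List[Block], deltas: List[Delta], upto: int) -> List[Block]:
    """Rebuild the disk image as of the first `upto` incremental backups."""
    if not 0 <= upto <= len(deltas):
        raise ValueError("upto must be between 0 and the number of deltas")
    image = list(full)                   # never modify the stored full backup
    for delta in deltas[:upto]:          # apply incrementals oldest first
        for index, data in delta.items():
            image[index] = data
    return image


full_backup = [b"a", b"b", b"c"]
incrementals = [
    {1: b"B"},             # first backup: only block 1 changed
    {1: b"X", 2: b"C"},    # second backup: blocks 1 and 2 changed
]

latest = restore(full_backup, incrementals, upto=2)
print(latest)  # [b'a', b'X', b'C']
```

Each scheduled backup only has to store the blocks written since the previous one, and any earlier point in the chain can be restored by stopping at a smaller `upto`.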

Q&A

Physical Machine Fault Detection

We build on open‑source monitoring tools, heavily customizing them to achieve platform‑wide high‑availability and handle massive real‑time data while reducing false positives.

Storage Choices

We use a self‑developed distributed storage system called KDFS instead of Ceph, because Ceph’s complexity and performance did not meet our SSD‑driven requirements.

Network Virtualization

Our offerings include EIP (elastic IP), VPC (virtual private cloud), and hybrid cloud networking, providing end‑to‑end connectivity for customers.

Tags: cloud computing, high availability, hot upgrade, live migration, auto backup, incremental disk
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
