How Cloud Providers Achieve Near‑Zero Downtime: Hot Upgrades, Live Migration, and Auto‑Backup
This article summarizes a 2016 Global Operations conference talk on cloud high‑availability. It covers the main challenges, how failures are classified, and the practical measures Kingsoft Cloud uses (hot kernel upgrades, online migration, an incremental disk format, and automated backup) to keep service interruptions under a few hundred milliseconds.
Opening Remarks
The speaker introduces himself and humorously notes the obscurity of his surname.
What Does Cloud Operations Do?
Operations is demanding work, and operating the cloud platform itself is more demanding still. The goal is to improve the availability of the service.
Overview
The talk covers three main topics:
Challenges of cloud high‑availability
Requirements and goals for high‑availability
How Kingsoft Cloud addresses these challenges
1. Challenges of Cloud High‑Availability
Rapid Growth
Cloud services scale quickly. New machines are added every week, and at that fleet size ordinary hardware and software failure rates can produce incidents every day.
Heterogeneous equipment and the hardware lifecycle (typically three years) raise additional reliability concerns.
Failures are analyzed along two axes: horizontal (hardware vs. software) and vertical (planned vs. unplanned). The axes intersect in four categories, and severity is graded from level 0 (most critical) to level 3.
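The two-axis classification can be sketched as a small lookup table. This is only an illustration: the talk summary does not say which category receives which level, so the mapping below is hypothetical, as are the names Source, Timing, SEVERITY, and classify.

```python
from enum import Enum


class Source(Enum):
    HARDWARE = "hardware"
    SOFTWARE = "software"


class Timing(Enum):
    PLANNED = "planned"
    UNPLANNED = "unplanned"


# HYPOTHETICAL mapping, for illustration only: the source does not state
# which quadrant maps to which level. Level 0 is the most critical.
SEVERITY = {
    (Source.HARDWARE, Timing.UNPLANNED): 0,
    (Source.SOFTWARE, Timing.UNPLANNED): 1,
    (Source.HARDWARE, Timing.PLANNED): 2,
    (Source.SOFTWARE, Timing.PLANNED): 3,
}


def classify(source: Source, timing: Timing) -> int:
    """Return the severity level for a failure in the given quadrant."""
    return SEVERITY[(source, timing)]


# Sort a batch of incidents so the most critical quadrant is handled first.
incidents = [
    ("kernel upgrade", Source.SOFTWARE, Timing.PLANNED),
    ("disk died", Source.HARDWARE, Timing.UNPLANNED),
    ("hypervisor crash", Source.SOFTWARE, Timing.UNPLANNED),
]
triaged = sorted(incidents, key=lambda item: classify(item[1], item[2]))
for name, source, timing in triaged:
    print(classify(source, timing), name)
```

Keeping the mapping in one table means the grading policy can change without touching the triage code.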
2. Requirements and Goals for High‑Availability
High‑availability is formally measured by the SLA, but what users care about is actual downtime. The target is to keep unavailable time under 20 minutes per month, whether it accrues as many short incidents or as a single longer one, which corresponds to roughly 99.95% availability. Beyond that, the aim is to keep reducing both the frequency and the duration of failures.
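The relationship between an availability percentage and a monthly downtime budget is simple arithmetic. The sketch below assumes a 30-day month; the function names are illustrative and not from the talk.

```python
def downtime_budget_minutes(availability: float, days: int = 30) -> float:
    """Minutes of downtime allowed by `availability` over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (1.0 - availability)


def availability_for_downtime(downtime_minutes: float, days: int = 30) -> float:
    """Availability (0..1) implied by `downtime_minutes` over `days` days."""
    total_minutes = days * 24 * 60
    return 1.0 - downtime_minutes / total_minutes


budget = downtime_budget_minutes(0.9995)
implied = availability_for_downtime(20)
print(f"99.95% allows {budget:.1f} minutes of downtime per 30-day month")
print(f"20 minutes of downtime implies {implied * 100:.3f}% availability")
```

Under this assumption, 99.95% allows 21.6 minutes per month, so a 20-minute target sits just inside that budget.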
3. How We Respond
Planned (Zero Impact)
We perform hot kernel upgrades and online migration without rebooting physical machines.
Online migration moves VMs to healthy hosts before a failure occurs.
Unplanned (Continuous Reduction)
We aim to convert some unplanned failures into planned ones and shorten the outage window. For shared storage we use auto‑failover; for local storage we employ auto‑backup to mitigate data loss.
Hot Kernel Upgrade
Community solutions such as ksplice and kpatch exist, but functions that are called at very high frequency (CPU scheduling, KVM interrupt handling) are difficult to patch live.
Custom optimizations reduced the impact of patching these high‑frequency code paths, and we have seen zero kernel‑level failures in production.
Hypervisor Hot Upgrade
We combine hot upgrade with online migration, targeting a downtime of 300 ms (far below the typical 3 s promised by vendors).
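To see what a 300 ms target implies, here is a simplified model of iterative pre-copy migration, a common live-migration technique. The summary does not say which algorithm Kingsoft Cloud uses, so the technique, the function name, and all parameters are assumptions. The model copies memory while the guest keeps running, resends whatever was dirtied in the meantime, and pauses the guest only when the remainder can be sent within the target.

```python
def precopy_downtime_ms(memory_mb: float,
                        dirty_rate_mb_s: float,
                        bandwidth_mb_s: float,
                        target_ms: float = 300.0,
                        max_rounds: int = 30) -> tuple[int, float]:
    """Return (pre-copy rounds, estimated pause in ms) for an idealized model.

    Assumes a constant dirty rate and constant bandwidth, which real
    workloads do not have; this is a back-of-the-envelope estimate only.
    """
    remaining_mb = memory_mb
    rounds = 0
    while rounds < max_rounds:
        pause_ms = remaining_mb / bandwidth_mb_s * 1000.0
        if pause_ms <= target_ms:
            break  # the rest can be sent while the guest is paused
        # Copy the remaining data while the guest runs; pages dirtied
        # during this copy have to be sent again in the next round.
        copy_seconds = remaining_mb / bandwidth_mb_s
        remaining_mb = dirty_rate_mb_s * copy_seconds
        rounds += 1
    return rounds, remaining_mb / bandwidth_mb_s * 1000.0


rounds, pause = precopy_downtime_ms(memory_mb=4096,
                                    dirty_rate_mb_s=100,
                                    bandwidth_mb_s=1000)
print(f"{rounds} pre-copy rounds, estimated pause {pause:.2f} ms")
```

If the dirty rate approaches the bandwidth, the loop stops at max_rounds and the returned pause exceeds the target, which the caller should check.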
Online Migration
There are two scenarios: migration with shared storage and migration with local storage.
For local disks we avoid a full copy by using an incremental disk format that records only changed blocks, which cuts migration time by more than a factor of ten.
Incremental Disk & Auto Backup
Our custom incremental disk tracks changed data, allowing rapid backup and restore with less than 3% impact on running services.
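The idea of recording only changed blocks can be sketched with an in-memory toy. This is not Kingsoft Cloud's disk format, which the summary does not describe; the class, the block size, and the restore helper are all hypothetical.

```python
BLOCK_SIZE = 4096  # assumed block size, for illustration only


class IncrementalDisk:
    """Toy block device that remembers which blocks changed."""

    def __init__(self, num_blocks: int):
        self.blocks = [bytes(BLOCK_SIZE)] * num_blocks
        self.dirty = set()

    def write(self, index: int, data: bytes) -> None:
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block")
        self.blocks[index] = data
        self.dirty.add(index)

    def take_increment(self) -> dict:
        """Return blocks changed since the last call and reset tracking."""
        delta = {i: self.blocks[i] for i in sorted(self.dirty)}
        self.dirty.clear()
        return delta


def restore(num_blocks: int, increments: list) -> list:
    """Rebuild disk contents by replaying increments in order."""
    blocks = [bytes(BLOCK_SIZE)] * num_blocks
    for delta in increments:
        for index, data in delta.items():
            blocks[index] = data
    return blocks


disk = IncrementalDisk(num_blocks=1024)
disk.write(3, b"a" * BLOCK_SIZE)
disk.write(7, b"b" * BLOCK_SIZE)
first = disk.take_increment()   # contains blocks 3 and 7
disk.write(7, b"c" * BLOCK_SIZE)
second = disk.take_increment()  # contains only block 7
restored = restore(1024, [first, second])
print(len(first), len(second), restored == disk.blocks)
```

Each backup or migration pass transfers only the blocks in the increment, which is why the cost scales with the amount of change rather than with the disk size.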
Q&A
Physical Machine Fault Detection
We build on open‑source monitoring tools, heavily customizing them to achieve platform‑wide high‑availability and handle massive real‑time data while reducing false positives.
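One common way to reduce false positives in host fault detection is to require several failed probes within a recent window before declaring a machine down. The summary does not describe Kingsoft Cloud's actual rules, so the class name, window size, and threshold below are assumptions.

```python
from collections import deque


class HostHealth:
    """Declare a host failed only after repeated probe failures."""

    def __init__(self, window: int = 5, threshold: int = 3):
        self.results = deque(maxlen=window)  # keeps the latest `window` probes
        self.threshold = threshold

    def record(self, ok: bool) -> bool:
        """Record one probe result; return True if the host counts as failed."""
        self.results.append(ok)
        failures = sum(1 for result in self.results if not result)
        return failures >= self.threshold


# A single dropped probe does not trigger an alert.
flaky = HostHealth()
blip = [flaky.record(ok) for ok in (True, False, True, True)]

# Three failures within the last five probes do.
dying = HostHealth()
alerts = [dying.record(ok) for ok in (True, False, True, False, False)]
print(blip, alerts)
```

A larger window or threshold lowers the false-positive rate at the cost of slower detection, so the values have to be tuned against the outage window the platform can tolerate.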
Storage Choices
We use a self‑developed distributed storage system called KDFS instead of Ceph, because Ceph’s complexity and performance did not meet our SSD‑driven requirements.
Network Virtualization
Our solutions include EIP, VPC, and hybrid cloud networking, providing end‑to‑end connectivity for customers.