Achieving Zero‑Downtime VM Live Migration in 360 VPC Overlay
This article explains the network interruption issues of VM live migration in 360's VPC overlay (V1), analyzes the root causes, and presents a V2 redesign that eliminates downtime through pre‑loaded forwarding policies, traffic redirection, and streamlined component collaboration.
Optimization Background
VM live migration involves compute, storage, and network aspects; on the network side the question is whether business traffic is interrupted. In 360's VPC overlay (V1), live migration caused a 15-30 second network outage. That made live migration unsuitable for routine operations, and without it resources became severely fragmented.
Problem Causes
Nova's live migration changes the VM's PORT HOST, affecting whether Neutron L2 Agent brings the PORT online.
Neutron L2 Agent installs flow tables as a post‑operation; in overlay mode the many flow‑table policies cause delays.
V1 relies on Neutron L3 Agent for north‑south traffic; asynchronous multi‑agent processes cannot synchronize L2/L3 policies, disrupting north‑south flow.
Gateway forwarding nodes sync policies every 10 seconds; during this period traffic may still be directed to the source node, causing loss after migration.
Neutron broadcasts full‑mesh FDB asynchronously, so east‑west traffic may be blocked while FDB updates.
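The effect of the 10-second gateway sync can be estimated with a small calculation. The sketch below is illustrative only: it assumes syncs fire at fixed multiples of the period on the same clock as the migration timestamp, and the function name `stale_window` is hypothetical, not part of the 360 VPC code.

```python
import math


def stale_window(migration_done_at: float, sync_period: float = 10.0) -> float:
    """Seconds during which a gateway node still forwards to the source node.

    Assumption: the gateway syncs at t = 0, period, 2*period, ... and only
    learns about the new host at the first sync at or after the migration.
    """
    next_sync = math.ceil(migration_done_at / sync_period) * sync_period
    return next_sync - migration_done_at


if __name__ == "__main__":
    # A migration finishing 2.5 s after a sync waits 7.5 s for the next one.
    print(stale_window(12.5))
```

Under this model the stale window ranges from zero to just under one full period, which accounts for a large part of the observed outage on its own.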
Solution Comparison
The following compares the V1 (pre‑optimization) and V2 (post‑optimization) approaches.
V1 architecture diagram:
In V1, OpenStack Neutron's native L3 Agent routes north-south and east-west traffic, with custom gateway services handling floating IP, SNAT, and CCN. The live-migration flow suffers multiple interruption points:
After Libvirt completes migration, Nova Compute updates the VM's PORT host, but the VPC control plane applies its own updates only afterwards, so traffic stays directed to the source node in the meantime and is lost.
Port‑up events on the destination node trigger delayed L2/L3 policy loading, extending downtime.
L3 Agent router policy changes pause Layer 3 traffic.
Asynchronous FDB broadcasts may leave some nodes without updated east‑west flow entries.
If the migrated PORT is the first of its VPC on the destination, FDB anomalies block north‑south traffic.
Gateway agents apply Etcd watch updates on a 10‑second cycle, during which traffic still points to the old node.
These factors cause prolonged network disconnection during migration.
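The asynchronous FDB problem in the list above can be pictured with a toy model. This is a simplified sketch, not Neutron's data model: each node holds a map from VM MAC to the node it believes hosts that VM, and the names `Node` and `nodes_with_stale_entry` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    name: str
    # VM MAC address -> name of the node this FDB entry tunnels traffic to
    fdb: Dict[str, str] = field(default_factory=dict)


def nodes_with_stale_entry(nodes: List[Node], mac: str, current_host: str) -> List[str]:
    """Return the nodes whose FDB does not yet point the MAC at current_host."""
    return [n.name for n in nodes if n.fdb.get(mac) != current_host]


if __name__ == "__main__":
    mac = "52:54:00:12:34:56"
    nodes = [Node(name, {mac: "src"}) for name in ("node-a", "node-b", "node-c")]
    # The VM has moved to "dst", but so far only node-a received the update.
    nodes[0].fdb[mac] = "dst"
    print(nodes_with_stale_entry(nodes, mac, "dst"))
```

Every node still listed sends east-west traffic for that VM to the old host until its update arrives.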
V2 Optimization
V2 removes the Neutron L3 Agent and lets the vSwitch handle all Layer 3 routing. Neutron Agent computes and distributes L2/L3 policies, simplifying the control plane. Three key enhancements are applied to the live-migration process:
Pre‑load port‑independent policies: During Nova’s pre‑live‑migration phase, the target node’s VPC Agent pre‑loads forwarding policies via an API call, ensuring traffic can be kept on‑path or redirected before migration starts.
Source‑node traffic redirection: While the VM runs on the source node, traffic is mirrored to the destination node; after migration, the destination node intercepts QEMU RARP and, after a short delay, removes the redirection.
Trigger‑based port‑specific policy loading: Once the destination creates the virtual NIC, the Agent loads port‑related forwarding rules, completing the network path.
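The three steps above can be sketched as a small event tracker. This is a minimal illustration under stated assumptions, not the VPC Agent's implementation: the class `MigrationTracker`, its method names, and the event strings are hypothetical, the redirection is assumed to be installed during pre-live-migration, and the short delay before removing it is left out. The one external fact used is that RARP frames carry EtherType 0x8035; VLAN-tagged frames are not handled.

```python
import struct
from typing import List, Optional

RARP_ETHERTYPE = 0x8035  # EtherType assigned to Reverse ARP


def rarp_source_mac(frame: bytes) -> Optional[str]:
    """Return the source MAC of an untagged RARP Ethernet frame, else None."""
    if len(frame) < 14:
        return None
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != RARP_ETHERTYPE:
        return None
    return frame[6:12].hex(":")


class MigrationTracker:
    """Records the order of network actions for one migrating VM (hypothetical)."""

    def __init__(self, vm_mac: str) -> None:
        self.vm_mac = vm_mac
        self.events: List[str] = []
        self.redirect_active = False

    def pre_live_migration(self) -> None:
        # Step 1: the destination pre-loads policies that do not depend on the port.
        self.events.append("preload_port_independent_policies")
        # Step 2 (assumed timing): the source starts steering traffic to the destination.
        self.redirect_active = True
        self.events.append("source_redirect_on")

    def vnic_created(self) -> None:
        # Step 3: port-specific rules are loaded once the virtual NIC exists.
        self.events.append("load_port_specific_policies")

    def frame_seen_on_destination(self, frame: bytes) -> bool:
        # A RARP from the VM signals that it now runs on the destination.
        # The real flow waits a short delay here; this sketch removes at once.
        if self.redirect_active and rarp_source_mac(frame) == self.vm_mac:
            self.redirect_active = False
            self.events.append("source_redirect_off")
            return True
        return False
```

A caller would feed frames captured on the destination port into `frame_seen_on_destination` and stop once it returns `True`.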
Benefits
Dynamic resource consolidation raises cluster utilization to 70-90%.
Lower physical node cost, since workloads can be consolidated and idle servers powered down.
Improved daily operations: OS, QEMU, and OVS upgrades can be carried out via live migration, and fault recovery is faster.
Future Outlook
Zero‑downtime live migration exemplifies flexible traffic orchestration and scheduling in virtual networks, supporting use cases such as traffic mirroring, inter‑VPC connectivity, and cross‑IDC communication. Future work will enhance traffic orchestration capabilities to meet diverse cloud‑native networking demands.
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.