
How Distributed DHCP Boosts VM Creation Speed in 360 Cloud

This article explains the challenges of centralized DHCP in 360’s OpenStack‑based virtual network, analyzes performance and reliability issues, and presents a distributed DHCP redesign that moves DHCP processing to the compute node, reducing latency, improving stability, and cutting operational costs.

360 Zhihui Cloud Developer

Optimization Background

360’s virtual network is built on the OpenStack Neutron component. The native Neutron architecture includes Neutron Server, L2 Agent, L3 Agent, Metadata Agent, and a centralized DHCP Agent. In this environment, VM creation suffered from long provisioning times, slow or unstable IP allocation via DHCP, and Cloud‑Init failures, prompting a need to improve DHCP stability and overall performance.

Problem Causes

VM creation latency: Both the PORT-activation and DHCP tasks are asynchronous; a delay in either extends the time before Nova receives the PORT-up event. About 90% of latency anomalies were traced to the DHCP task.

DHCP acquisition issues: DHCP relies on broadcast packets that must traverse the compute node and the DHCP Agent node, a multi‑step asynchronous path that is prone to instability.

Cloud‑Init failures: Metadata service access depends on the DHCP path; instability in DHCP leads to hostname retrieval failures, affecting services that heavily rely on the hostname.

Solution Comparison

The following sections describe the DHCP service before (V1) and after (V2) the redesign.

V1 DHCP topology (centralized).


The V1 design reuses the OpenStack Neutron DHCP Agent with DNSMasq. The DHCP Agent receives RPC calls from Neutron Server, builds network namespaces, TAP interfaces, and configures DNSMasq for each PORT. The L2 Agent updates forwarding rules, and the overall VM creation flow involves multiple asynchronous steps across Neutron Server, DHCP Agent, and L2 Agent.

Nova Compute calls Neutron API to create a PORT; Neutron creates L2 and DHCP tasks and waits for the vif‑plug event.

Neutron Server sends a message to the DHCP Agent, which generates DHCP configuration and triggers DNSMasq; upon completion, the DHCP task is marked done.

The compute node’s L2 Agent detects the PORT activation, pulls PORT info, and updates the vSwitch.

Neutron Server, after confirming all tasks are complete, notifies Nova to continue VM boot.
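The gating logic in the four steps above can be sketched in a few lines. This is an illustrative model, not Neutron code: task names and delays are hypothetical, but it shows why VM boot is held up by the slowest asynchronous task, which in V1 was usually the DHCP one.

```python
import asyncio

# Hypothetical sketch of the V1 provisioning flow: Neutron must wait for
# BOTH asynchronous tasks (L2 wiring and DHCP setup) before telling Nova
# to continue the VM boot. Names and sleep times are illustrative only.

async def l2_task(port_id: str) -> str:
    await asyncio.sleep(0.01)          # L2 Agent updates vSwitch forwarding rules
    return f"{port_id}: l2 done"

async def dhcp_task(port_id: str) -> str:
    await asyncio.sleep(0.05)          # DHCP Agent writes DNSMasq config (the slow path)
    return f"{port_id}: dhcp done"

async def provision_port(port_id: str) -> str:
    # VM boot is gated on the slowest task; per the article, ~90% of
    # latency anomalies were traced to the DHCP task.
    await asyncio.gather(l2_task(port_id), dhcp_task(port_id))
    return f"{port_id}: vif-plug event sent to Nova"

print(asyncio.run(provision_port("port-1")))
```

Because the two tasks are joined with a gather, removing or localizing the slow DHCP task (as V2 does) directly shortens the wait before the vif-plug event.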

DHCP packet flow in V1:

VM sends DHCP Discover broadcast; the compute node’s vSwitch forwards it to all DHCP Agents via VXLAN tunnels.

DHCP Agent receives the broadcast, DNSMasq matches the MAC to a configuration, and sends a DHCP Offer back to the VM’s node.

VM replies with DHCP Request; DNSMasq sends an ACK.

VM configures IP, default route, and hostname from the ACK.
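The server side of the exchange above can be modeled as a small state machine. This is a minimal sketch following the RFC 2131 Discover-Offer-Request-ACK message names; the lease table and handler are illustrative, not DNSMasq code.

```python
# Minimal state-machine sketch of the DHCP DORA exchange described above.
# Message names follow RFC 2131; the handler is illustrative, not DNSMasq.

LEASES = {"fa:16:3e:00:00:01": "10.0.0.5"}    # MAC -> IP, from the DHCP config

def handle(msg_type: str, mac: str) -> str:
    ip = LEASES.get(mac)
    if ip is None:
        return "IGNORE"                        # unknown MAC: no Offer is sent
    if msg_type == "DISCOVER":
        return f"OFFER {ip}"                   # server proposes a lease
    if msg_type == "REQUEST":
        return f"ACK {ip}"                     # server confirms; VM applies IP/route/hostname
    return "IGNORE"

mac = "fa:16:3e:00:00:01"
print(handle("DISCOVER", mac))    # prints: OFFER 10.0.0.5
print(handle("REQUEST", mac))     # prints: ACK 10.0.0.5
```

In V1 every one of these messages crosses VXLAN tunnels between the compute node and a DHCP Agent node, which is where the instability enters.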

Risk points in V1 include DHCP Agent overload, reliance on VXLAN tunnel mesh, and asynchronous FDB updates, all of which can cause timeouts, packet loss, and Cloud‑Init failures.

V2 Distributed Design

The V2 redesign removes the Neutron DHCP Agent. DHCP traffic is handled locally on each compute node by an embedded DHCP module within the Neutron L2 Agent, eliminating the need for cross‑node broadcast and reducing the service chain.

V2 DHCP topology

Key changes:

The Neutron L2 Agent loads DHCP forwarding rules on the local vSwitch and forwards DHCP packets to its built‑in DHCP module.

The DHCP module processes the full DHCP handshake (Discover‑Offer‑Request‑ACK) using cached network data (MTU, CIDR, routes, etc.).

Neutron Server no longer manages DHCP tasks.

Metadata traffic for Cloud‑Init no longer traverses the DHCP path, reducing dependency on DHCP stability.
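The key changes above amount to a cache-and-answer pattern: Neutron Server pushes per-port network data to each compute node, and the L2 Agent's DHCP module replies locally. The sketch below is hypothetical (class and field names are ours, chosen to match the cached data the article lists: MTU, CIDR, routes), not the actual module.

```python
# Sketch of the V2 idea: the compute node's L2 Agent caches per-port network
# data and answers DHCP locally, so no packet leaves the node. The class is
# hypothetical; field names follow the cached data named in the article.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PortInfo:
    ip: str
    cidr: str
    mtu: int
    routes: list = field(default_factory=list)

class LocalDhcpModule:
    def __init__(self):
        self.cache = {}                        # MAC -> PortInfo, pushed by Neutron Server

    def sync_port(self, mac: str, info: PortInfo) -> None:
        self.cache[mac] = info

    def reply(self, mac: str) -> Optional[dict]:
        info = self.cache.get(mac)
        if info is None:
            return None                        # unknown port: drop the request
        # Full DORA is answered from cache: no VXLAN hop, no FDB-sync wait.
        return {"ip": info.ip, "cidr": info.cidr, "mtu": info.mtu, "routes": info.routes}

agent = LocalDhcpModule()
agent.sync_port("fa:16:3e:00:00:01", PortInfo("10.0.0.5", "10.0.0.0/24", 1450))
print(agent.reply("fa:16:3e:00:00:01")["ip"])    # prints: 10.0.0.5
```

The fault domain of this design is exactly what the article claims for V2: a failure affects only the ports cached on that one node.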

Benefits of the V2 architecture:

Simplified service components: DHCP is now a local module within the L2 Agent, reducing pressure on centralized agents and cutting intra‑cluster broadcast traffic.

Streamlined and stable link: DHCP packets stay within the compute node, removing VXLAN tunnel and FDB synchronization risks.

Aggregated network resource operations: Port‑related tasks are consolidated, avoiding scattered asynchronous management.

Related Gains

VM DHCP stability reaches near‑100% across all clusters, ensuring reliable IP allocation during creation, boot, and cold migration.

Reduced DHCP fault domain: Failures now affect only VMs on the impacted compute node, unlike V1 where a DHCP Agent outage could block IP allocation for many VMs.

Shortened VM creation time: The previous 5‑minute vif‑plug timeout no longer occurs, as task aggregation eliminates long‑running asynchronous waits.

Lower maintenance cost: Development no longer needs to maintain separate DHCP Agent code; operations focus shifts away from DHCP‑specific CI/CD and incident handling.
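The 5-minute window mentioned above corresponds to Nova's default wait for Neutron's network-vif-plugged event. A typical nova.conf fragment controlling it looks like this (the values shown are Nova's defaults, not taken from the article):

```ini
[DEFAULT]
# Seconds Nova waits for Neutron's network-vif-plugged event before
# timing out; the 300 s default is the 5-minute wait discussed above.
vif_plugging_timeout = 300
# Whether a vif-plug timeout fails the instance build.
vif_plugging_is_fatal = true
```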

Future Outlook

As cloud infrastructure evolves, virtual networks must shift from traditional L2/L3 models to traffic-centric designs. DHCP, DNS, and similar services will continue moving toward cloud-native implementations, with traffic pools and operator-level orchestration replacing static layer-2/3 routing and enabling rapid feature expansion for both VMs and containers.

Tags: performance optimization, distributed architecture, cloud networking, OpenStack, DHCP
Written by

360 Zhihui Cloud Developer

360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.
