Mastering SR‑IOV: From Basics to Advanced KVM Integration and Migration
This guide explains SR‑IOV fundamentals, PF/VF roles, VEB and VEPA virtual switches, multi‑channel extensions, and provides step‑by‑step instructions for enabling VFs, attaching them to KVM guests, configuring NUMA affinity, bonding, and handling hot‑migration challenges.
SR‑IOV Overview
Traditional I/O virtualization relies on the VMM to trap and emulate VM I/O, which becomes a performance bottleneck when many VMs share a device. Intel VT‑d hardware‑assisted I/O virtualization allows PCIe passthrough without VMM involvement, but each passthrough device is dedicated to a single VM, and physical devices are limited. SR‑IOV (Single‑Root I/O Virtualization) was introduced to virtualize one PCIe device into multiple virtual functions (VFs) that can be assigned to VMs individually.
PF and VF
PF (Physical Function): Manages resources and the VF lifecycle, allocating MAC addresses and queues. The OS or VMM configures VFs through the PF.
VF (Virtual Function): A lightweight virtual channel containing only I/O functions, with its own PCIe configuration space, MAC address, and queues, sharing the PF's physical resources.
SR‑IOV requires two core features:
BAR address mapping : Maps VF PCIe BAR to PF PCIe BAR for resource access.
Virtual I/O queues : Maps VF I/O requests to shared or dedicated queues in the PF.
By default, VFs are disabled; enabling them creates virtual PCIe configuration spaces accessed via registers.
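The SR‑IOV capability that carries these settings (Total VFs, VF offset/stride, and the VF BARs) can be inspected from the host with lspci; a quick sketch, where the PCI address 81:00.0 is illustrative:

```shell
# Dump the SR-IOV extended capability of a PF (address is an example)
lspci -vvv -s 81:00.0 | grep -A 9 'Single Root I/O Virtualization'
# Typical fields include Initial VFs, Total VFs, VF offset/stride,
# and the VF BARs backing each virtual function's configuration space.
```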
SR‑IOV VEB
Virtual Ethernet Bridge (VEB) provides hardware‑implemented Layer‑2 switching. The PF manages VEB configuration, connecting PF and all VFs; forwarding is based on MAC and VLAN IDs.
Ingress: Frames matching a VF’s MAC/VLAN are delivered to that VF; otherwise they go to the PF or are broadcast.
Egress: Frames whose destination MAC does not match any internal port are sent out the physical uplink; broadcasts are forwarded within the VLAN.
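Since the VEB forwards on MAC and VLAN ID, those per‑VF entries are programmed through the PF with `ip link`; a hedged sketch, where the interface name, VF index, MAC, and VLAN are illustrative:

```shell
# Program the VEB's forwarding entry for VF 0 on PF enp129s0f0
ip link set enp129s0f0 vf 0 mac fa:16:3e:00:00:01  # frames to this MAC reach VF 0
ip link set enp129s0f0 vf 0 vlan 190               # tag/untag VLAN 190 for VF 0
ip link set enp129s0f0 vf 0 spoofchk on            # drop frames with a forged source MAC
```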
SR‑IOV VEPA
VEPA (Virtual Ethernet Port Aggregator) addresses limitations of VEB such as lack of traffic visibility and control. Two main issues arise: intra‑host traffic bypasses monitoring points, and outbound traffic lacks identifiable tags.
Solutions include forcing VM traffic through a collection point and adding identifiable tags. Two major approaches exist:
Cisco and VMware promote VN‑Tag (802.1Qbh, later standardized as 802.1BR Bridge Port Extension), which requires new hardware.
HP, Juniper, IBM, Qlogic, Brocade promote VEPA (802.1Qbg EVB) using existing equipment at lower cost.
VEPA forces VM traffic to the upstream TOR switch, enabling hairpin forwarding for intra‑host communication.
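Linux can reproduce this behavior in software: a macvlan device in `vepa` mode sends all traffic, including traffic between VMs on the same host, to the upstream switch, which must support hairpin (reflective‑relay) forwarding. A sketch with illustrative interface names:

```shell
# Create a macvlan port in VEPA mode on top of the uplink enp129s0f0
ip link add link enp129s0f0 name vepa0 type macvlan mode vepa
ip link set vepa0 up
# Intra-host frames now travel out to the TOR switch and hairpin back
```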
SR‑IOV Multi‑Channel
Based on QinQ (802.1ad) and the S‑TAG, Multi‑Channel extends VEPA to support multiple logical channels (VEB, VEPA, Director IO) on a single NIC, allowing flexible deployment according to security, performance, and manageability requirements.
Each logical channel is isolated and identified by an additional S‑TAG and VLAN‑ID.
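The S‑TAG used by Multi‑Channel is the outer 802.1ad tag of QinQ. The same double tagging can be reproduced on Linux for testing; the device names and channel IDs below are illustrative:

```shell
# Outer service tag (S-TAG, EtherType 0x88a8) identifying the logical channel
ip link add link eth0 name eth0.100 type vlan protocol 802.1ad id 100
# Inner customer tag (C-TAG) carried inside that channel
ip link add link eth0.100 name eth0.100.20 type vlan protocol 802.1Q id 20
```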
SR‑IOV OvS
SR‑IOV lacks a native SDN control plane, requiring extra components (e.g., Neutron sriov‑agent). With SmartNICs and DPUs, SR‑IOV can be combined with OVS Fastpath, using VFs as virtual channels for programmable, high‑performance networking.
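On SmartNICs that expose VF representors, this combination is typically wired up by switching the PF's embedded switch to switchdev mode and enabling OVS hardware offload; a hedged sketch (the PCI address is illustrative, and support varies by NIC and driver):

```shell
# Switch the embedded switch from legacy VEB mode to switchdev,
# which creates a representor netdev per VF for the control plane
devlink dev eswitch set pci/0000:81:00.0 mode switchdev
# Let OVS push matched flows down into the NIC's fastpath
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
# Restart the OVS daemon afterwards for the setting to take effect
```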
SR‑IOV Practical Usage
Enable SR‑IOV VFs
Step 1. Ensure SR‑IOV and VT‑d are enabled in the BIOS.
Step 2. Enable the IOMMU in Linux (e.g., add intel_iommu=on to the kernel parameters).
```shell
...
linux16 /boot/vmlinuz-3.10.0-862.11.6.rt56.819.el7.x86_64 root=LABEL=img-rootfs ro console=tty0 console=ttyS0,115200n8 crashkernel=auto rhgb quiet intel_iommu=on iommu=pt isolcpus=2-3,8-9 nohz=on nohz_full=2-3,8-9 rcu_nocbs=2-3,8-9 intel_pstate=disable nosoftlockup default_hugepagesz=1G hugepagesz=1G hugepages=16 LANG=en_US.UTF-8
...
```

Step 3. Create VFs via the PCI SYS interface.
```shell
cat /etc/sysconfig/network-scripts/ifcfg-enp129s0f0
DEVICE="enp129s0f0"
BOOTPROTO="dhcp"
ONBOOT="yes"
TYPE="Ethernet"

cat /etc/sysconfig/network-scripts/ifcfg-enp129s0f1
DEVICE="enp129s0f1"
BOOTPROTO="dhcp"
ONBOOT="yes"
TYPE="Ethernet"

echo 16 > /sys/class/net/enp129s0f0/device/sriov_numvfs
echo 16 > /sys/class/net/enp129s0f1/device/sriov_numvfs
```

Step 4. Verify VFs are created and up.
```shell
lspci | grep Ethernet
03:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
...
81:10.7 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
```

Step 5. Persist VF creation on reboot.
```shell
echo "echo '7' > /sys/class/net/eth3/device/sriov_numvfs" >> /etc/rc.local
```

Attach VF to a KVM VM
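Before QEMU can claim a VF, it must be bound to vfio-pci rather than the NIC's VF driver; a minimal sketch using the sysfs driver_override interface, where the VF address 0000:81:10.0 is illustrative:

```shell
modprobe vfio-pci
# Detach the VF from its current driver (e.g. ixgbevf), if one is bound
echo 0000:81:10.0 > /sys/bus/pci/devices/0000:81:10.0/driver/unbind
# Hand the VF to vfio-pci
echo vfio-pci > /sys/bus/pci/devices/0000:81:10.0/driver_override
echo 0000:81:10.0 > /sys/bus/pci/drivers/vfio-pci/bind
```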
Use -device vfio-pci,host=<vf pci bus addr> in the QEMU command line.
```shell
qemu-system-x86_64 -enable-kvm -drive file=<vm img>,if=virtio -cpu host -smp 16 -m 16G \
  -name <vm name> -device vfio-pci,host=<vf1> -device vfio-pci,host=<vf2> -vnc :1 -net none
```

Alternatively, attach via libvirt XML:
```xml
<interface type='hostdev' managed='yes'>
  <source>
    <address type='pci' domain='0x0000' bus='0x81' slot='0x10' function='0x2'/>
  </source>
</interface>
```

Attach live:
```shell
virsh attach-device VM1 /tmp/new-device.xml --live --config
```

NUMA Affinity
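The goal is to keep the VM's vCPUs (and ideally its memory) on the same NUMA node as the NIC; the checks below locate that node, and pinning is then applied with virsh. A sketch where the VM name, CPU IDs, and node number are illustrative:

```shell
# Pin vCPU 0 of VM1 to host CPUs 8-9 (assumed to sit on the NIC's node 1)
virsh vcpupin VM1 0 8-9 --live --config
# Keep guest memory on the same node
virsh numatune VM1 --nodeset 1 --live --config
```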
Check NUMA node of the NIC:
```shell
cat /sys/class/net/enp129s0f0/device/numa_node
1
```

Check VM vCPU pinning:
```shell
virsh vcpupin VM1_uuid
```

VF Network Configuration
Configure MAC, VLAN, and promiscuous mode per VF:
```shell
ip l | grep 5e:9c
    vf 14 MAC fa:16:3e:90:5e:9c, vlan 19, spoof checking on, link-state auto, trust on, query_rss off
```

VLAN ID appears in the VM’s XML:
```xml
<interface type='hostdev' managed='yes'>
  <mac address='fa:aa:aa:aa:aa:aa'/>
  <driver name='kvm'/>
  <source>
    <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x7'/>
  </source>
  <vlan>
    <tag id='190'/>
  </vlan>
</interface>
```

VF Bonding
Bond two VFs (from separate PFs) inside a VM by configuring identical MAC addresses and using standard Linux bonding scripts.
```shell
BONDING_MASTER=yes
BOOTPROTO=none
DEVICE=bond0
ONBOOT=yes
TYPE=Bond

DEVICE=ens4
MASTER=bond0
ONBOOT=yes
SLAVE=yes
TYPE=Ethernet

DEVICE=ens5
MASTER=bond0
ONBOOT=yes
SLAVE=yes
TYPE=Ethernet
```

SR‑IOV VM Hot‑Migration Issues
Passing a VF through to a guest limits live migration because the IOMMU address translation tables (GPA↔HPA) and device state are lost during migration. After migration, the guest must bring the interface up again manually, causing a network interruption.
Workaround: add a normal (OvS) or indirect (macvtap) port, bond it with the SR‑IOV port inside the guest, migrate, then bring the SR‑IOV port up and remove the extra port.
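The workaround can be sketched as a sequence of virsh operations; the VM name, device XML paths, and destination host are illustrative:

```shell
# 1. Hot-plug a migratable port (e.g. an OvS or macvtap interface)
virsh attach-device VM1 /tmp/ovs-port.xml --live
# 2. Inside the guest: enslave it and the VF to an active-backup bond,
#    then fail over to the migratable port
# 3. Detach the VF so no passthrough device blocks migration
virsh detach-device VM1 /tmp/vf-port.xml --live
# 4. Migrate
virsh migrate --live VM1 qemu+ssh://dest-host/system
# 5. On the destination: re-attach a VF and remove the extra port
virsh attach-device VM1 /tmp/vf-port.xml --live
```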