How Exposing Hypervisor CPUID Doubles VM IOPS – Inside CPU Idle Strategies
An extensive performance investigation reveals that exposing the Hypervisor CPUID in virtual machines triggers a CPU idle policy shift from HLT to BusyPoll, halving VMEXIT wake-up latency and doubling sequential read/write IOPS. The article includes detailed kernel analysis, perf data, code modifications, and practical optimization recommendations.
Background
When testing the PoleFS shared file system on virtual machines, a counter‑intuitive phenomenon was observed: two identical VMs showed a two‑fold difference in sequential read/write IOPS solely based on whether the Hypervisor CPUID was exposed. All other factors—VM I/O path, network, storage backend—remained unchanged, suggesting the bottleneck resides in the VM's default behavior.
Research Findings
Full call‑chain tracing and KVM‑side verification identified the root cause as the CPU idle strategy, not the storage subsystem. The critical path is:
Hypervisor CPUID → haltpoll governor enabled → Busy‑Poll replaces HLT → VMEXIT dramatically reduced → IPI wake‑up latency drops → IOPS increase
This chain demonstrates that CPU scheduling policies in virtualized environments can impact performance far more than the I/O path itself.
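Whether this chain is active in a given guest can be checked from sysfs and /proc. The following is a diagnostic sketch, not part of the original investigation; the paths are those of recent kernels and may differ by version:

```shell
# Which cpuidle driver/governor did the guest pick?
cat /sys/devices/system/cpu/cpuidle/current_driver    # "haltpoll" when the chain engages
cat /sys/devices/system/cpu/cpuidle/current_governor  # likewise "haltpoll"

# Is a hypervisor advertised to the guest at all?
grep -m1 -o hypervisor /proc/cpuinfo
```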
Problem Localization
Hotspot Analysis
Perf data for the two scenarios showed that without Hypervisor CPUID exposure the raw_spin_unlock_irqrestore function dominates the hotspot profile, while with exposure its contribution drops sharply, indicating that the lock itself is unchanged but the CPU's waiting behavior differs.
Kernel Chain
The investigation traced the issue to kvm_para_available(), which checks whether the CPU exposes Hypervisor features. If true, Linux enables the haltpoll cpuidle driver and governor.
The hotspot itself originates in the generic wake-up path, where the wait-queue lock is released via spin_unlock_irqrestore:
static void __wake_up_common_lock(struct wait_queue_head *wq_head, unsigned int mode,
                        int nr_exclusive, int wake_flags, void *key)
{
        unsigned long flags;
        wait_queue_entry_t bookmark;

        bookmark.flags = 0;
        bookmark.private = NULL;
        bookmark.func = NULL;
        INIT_LIST_HEAD(&bookmark.entry);

        do {
                spin_lock_irqsave(&wq_head->lock, flags);
                nr_exclusive = __wake_up_common(wq_head, mode, nr_exclusive,
                                                wake_flags, key, &bookmark);
                spin_unlock_irqrestore(&wq_head->lock, flags);
        } while (bookmark.flags & WQ_FLAG_BOOKMARK);
}
#define raw_spin_unlock_irqrestore(lock, flags)                 \
        do {                                                    \
                typecheck(unsigned long, flags);                \
                _raw_spin_unlock_irqrestore(lock, flags);       \
        } while (0)
The core references of kvm_para_available are located in drivers/cpuidle/cpuidle‑haltpoll.c:113 and drivers/cpuidle/governors/haltpoll.c:143:
bool kvm_para_available(void)
{
        return kvm_cpuid_base() != 0;
}

static inline uint32_t kvm_cpuid_base(void)
{
        static int kvm_cpuid_base = -1;

        if (kvm_cpuid_base == -1)
                kvm_cpuid_base = __kvm_cpuid_base();

        return kvm_cpuid_base;
}
static int __init haltpoll_init(void)
{
        struct cpuidle_driver *drv = &haltpoll_driver;

        if (boot_option_idle_override != IDLE_NO_OVERRIDE)
                return -ENODEV;

        cpuidle_poll_state_init(drv);

        if (!kvm_para_available() || !haltpoll_want())
                return -ENODEV;

        /* driver registration omitted */
        return 0;
}
static struct cpuidle_governor haltpoll_governor = {
        .name    = "haltpoll",
        .rating  = 9,
        .enable  = haltpoll_enable_device,
        .select  = haltpoll_select,
        .reflect = haltpoll_reflect,
};

static int __init init_haltpoll(void)
{
        if (kvm_para_available())
                return cpuidle_register_governor(&haltpoll_governor);

        return 0;
}
HLT vs. BusyPoll
HLT mode: The VCPU executes HLT, causing a VMEXIT, returning control to the host. When the lock is released, the host injects an interrupt, triggering a VMENTRY. This chain involves multiple privilege switches, adding noticeable latency.
BusyPoll mode: With haltpoll enabled, the VCPU stays runnable and actively polls the IPI pending bit, avoiding frequent VMEXIT/VMENTRY transitions. To prevent endless spinning, KVM leverages the hardware's PLE (Pause‑Loop Exiting) mechanism; only after the spin exceeds a threshold does a VMEXIT occur. This design trades controlled CPU consumption for deterministic low latency.
Experimental Validation
To eliminate randomness, an aggressive experiment forced haltpoll activation on a VM without Hypervisor CPUID exposure by modifying the two kernel locations:
// cpuidle‑haltpoll.c:113
- if (!kvm_para_available() || !haltpoll_want())
+ if (kvm_para_available() || !haltpoll_want())
// haltpoll.c:143
- if (kvm_para_available())
+ if (!kvm_para_available())
After recompiling the kernel, fio tests on PoleFS showed IOPS comparable to the Hypervisor‑exposed case, and perf data confirmed the shift in hotspot distribution.
Related Logic
The two files cpuidle‑haltpoll.c and haltpoll.c decide whether the CPU idle path uses HLT or BusyPoll. In HLT mode, a VCPU VMEXIT leads to an IPI‑triggered wake‑up, incurring additional latency. In BusyPoll mode, the VCPU remains in a spin state; KVM's PLE limits the spin duration before a VMEXIT, resulting in much lower latency.
Perf data without Hypervisor CPUID shows HLT accounting for 22.59% of VMEXIT events, while with CPUID exposure HLT drops to 14.67% and PAUSE_INSTRUCTION (BusyPoll) rises, confirming the latency reduction mechanism.
Optimization Recommendations
Enable Hypervisor CPUID exposure by default for high‑IOPS VMs, databases, and low‑latency services.
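On QEMU/KVM, whether the guest sees the paravirtual CPUID leaves is a front-end setting. The snippet below is a sketch of the relevant knobs, not taken from the original test setup; flag spellings and the sysfs path vary across QEMU and kernel versions:

```shell
# Default: the KVM signature (CPUID leaf 0x40000000) is exposed; haltpoll can engage.
qemu-system-x86_64 -enable-kvm -cpu host ...

# kvm=off hides the KVM signature; the guest falls back to HLT idling.
qemu-system-x86_64 -enable-kvm -cpu host,kvm=off ...

# Inside the guest, the haltpoll window is tunable (nanoseconds):
cat /sys/module/haltpoll/parameters/guest_halt_poll_ns
```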
Establish a VMEXIT monitoring framework on the host to analyze and optimize hypervisor‑level exit reasons.
These findings highlight that performance bottlenecks increasingly reside at the guest‑hypervisor scheduling boundary, and future optimizations must consider invisible layers such as CPU idle policies and VMEXIT handling.
This article has been distilled and summarized from source material, then republished for learning and reference.
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.
