
How Exposing Hypervisor CPUID Doubles VM IOPS – Inside CPU Idle Strategies

An extensive performance investigation reveals that exposing the Hypervisor CPUID to a virtual machine shifts its CPU idle policy from HLT to busy polling (haltpoll), sharply reducing VMEXITs and wake-up latency and doubling sequential read/write IOPS. The write-up covers the kernel analysis, perf data, code modifications, and practical optimization recommendations.

360 Zhihui Cloud Developer

Background

When testing the PoleFS shared file system on virtual machines, a counter‑intuitive phenomenon was observed: two identical VMs showed a two‑fold difference in sequential read/write IOPS solely based on whether the Hypervisor CPUID was exposed. All other factors—VM I/O path, network, storage backend—remained unchanged, suggesting the bottleneck resides in the VM's default behavior.

Research Findings

Full call‑chain tracing and KVM‑side verification identified the root cause as the CPU idle strategy, not the storage subsystem. The critical path is:

Hypervisor CPUID → haltpoll governor enabled → BusyPoll replaces HLT → VMEXITs dramatically reduced → IPI wake-up latency drops → IOPS increase

This chain demonstrates that CPU scheduling policies in virtualized environments can impact performance far more than the I/O path itself.
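
For context, the Hypervisor CPUID at the root of this chain is ordinary guest-visible CPUID state. The following is a minimal user-space sketch (an illustration for this article, assuming x86 and GCC/Clang's <cpuid.h>; it is not code from the original investigation) of what the guest sees:

#include <stdio.h>
#include <string.h>
#include <cpuid.h>

int main(void) {
    unsigned int eax, ebx, ecx, edx;
    char sig[13] = {0};

    /* CPUID leaf 1, ECX bit 31 is the "hypervisor present" bit. */
    __cpuid(1, eax, ebx, ecx, edx);
    if (!(ecx & (1u << 31))) {
        printf("no hypervisor CPUID exposed; the guest idles like bare metal\n");
        return 0;
    }

    /* Leaf 0x40000000 carries the hypervisor signature in EBX/ECX/EDX
     * ("KVMKVMKVM\0\0\0" on KVM), which is what kvm_para_available()
     * ultimately looks for. */
    __cpuid(0x40000000, eax, ebx, ecx, edx);
    memcpy(sig + 0, &ebx, 4);
    memcpy(sig + 4, &ecx, 4);
    memcpy(sig + 8, &edx, 4);
    printf("hypervisor signature: %s (max hypervisor leaf 0x%x)\n", sig, eax);
    return 0;
}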

Problem Localization

Hotspot Analysis

Perf profiles of the two scenarios show that without Hypervisor CPUID exposure, raw_spin_unlock_irqrestore dominates the hotspot list, while with exposure its share drops sharply. The lock itself is unchanged; what differs is how the CPU waits.

Kernel Chain

Walking the hotspot upward leads to the generic wake-up path, __wake_up_common_lock(), shown below. From there the investigation traced the behavioral difference to kvm_para_available(), which checks whether the CPU exposes Hypervisor features; when it returns true, Linux enables the haltpoll cpuidle driver and governor.

static void __wake_up_common_lock(struct wait_queue_head *wq_head, unsigned int mode,
    int nr_exclusive, int wake_flags, void *key) {
    unsigned long flags;
    wait_queue_entry_t bookmark;
    bookmark.flags = 0;
    bookmark.private = NULL;
    bookmark.func = NULL;
    INIT_LIST_HEAD(&bookmark.entry);
    do {
        spin_lock_irqsave(&wq_head->lock, flags);
        nr_exclusive = __wake_up_common(wq_head, mode, nr_exclusive,
                        wake_flags, key, &bookmark);
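        /* The unlock below is where perf attributes the hotspot when the
         * Hypervisor CPUID is hidden (HLT-based idle); the lock itself is
         * identical in both scenarios, only the waiter's behavior changes. */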
        spin_unlock_irqrestore(&wq_head->lock, flags);
    } while (bookmark.flags & WQ_FLAG_BOOKMARK);
}

#define raw_spin_unlock_irqrestore(lock, flags) \
    do { \
        typecheck(unsigned long, flags); \
        _raw_spin_unlock_irqrestore(lock, flags); \
    } while (0)

The key callers of kvm_para_available() sit in drivers/cpuidle/cpuidle-haltpoll.c:113 and drivers/cpuidle/governors/haltpoll.c:143:

bool kvm_para_available(void) {
    return kvm_cpuid_base() != 0;
}
static inline uint32_t kvm_cpuid_base(void) {
    static int kvm_cpuid_base = -1;
    if (kvm_cpuid_base == -1)
        kvm_cpuid_base = __kvm_cpuid_base();
    return kvm_cpuid_base;
}
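/* drivers/cpuidle/cpuidle-haltpoll.c: without the Hypervisor CPUID,
 * kvm_para_available() returns false, registration bails out with -ENODEV,
 * and the guest keeps the default HLT-based idle path. */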
static int __init haltpoll_init(void) {
    struct cpuidle_driver *drv = &haltpoll_driver;
    if (boot_option_idle_override != IDLE_NO_OVERRIDE)
        return -ENODEV;
    cpuidle_poll_state_init(drv);
    if (!kvm_para_available() || !haltpoll_want())
        return -ENODEV;
    /* driver registration omitted */
    return 0;
}
static struct cpuidle_governor haltpoll_governor = {
    .name = "haltpoll",
    .rating = 9,
    .enable = haltpoll_enable_device,
    .select = haltpoll_select,
    .reflect = haltpoll_reflect,
};
static int __init init_haltpoll(void) {
    if (kvm_para_available())
        return cpuidle_register_governor(&haltpoll_governor);
    return 0;
}
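
haltpoll_init() also consults haltpoll_want(), which the excerpt above omits. In recent kernels its logic is roughly the following (a paraphrase, not a verbatim listing): the driver loads only if the host additionally advertises the KVM_HINTS_REALTIME hint or the cpuidle_haltpoll.force module parameter is set.

static bool haltpoll_want(void) {
    /* KVM_HINTS_REALTIME: the host hints that VCPUs are not preempted for
     * long stretches, making guest-side busy polling worthwhile; "force" is
     * a module parameter that overrides the hint. */
    return kvm_para_has_hint(KVM_HINTS_REALTIME) || force;
}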

HLT vs. BusyPoll

HLT mode: The VCPU executes HLT, causing a VMEXIT and returning control to the host. When the event it is waiting for arrives (for example, the lock is released), the host injects an interrupt and the VCPU re-enters via VMENTRY. This chain involves multiple privilege switches and adds noticeable latency.

BusyPoll mode: With haltpoll enabled, the VCPU stays runnable and actively polls the IPI pending bit, avoiding frequent VMEXIT/VMENTRY. To prevent endless spinning, KVM introduces PLE (Pause Loop Exiting); only after a threshold does a VMEXIT occur. This design trades controlled CPU consumption for deterministic low latency.
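
To make the contrast concrete, here is a simplified sketch of the two idle entries (illustrative code in kernel style, modeled on the native "sti; hlt" path and on the kernel's poll_idle() loop; not verbatim kernel code):

/* HLT path: the hlt instruction traps to the host (VMEXIT); the VCPU is
 * descheduled until the host injects the wake-up interrupt (VMENTRY). */
static void idle_hlt(void) {
    asm volatile("sti; hlt" ::: "memory");
}
/* Busy-poll path, modeled on poll_idle(): the VCPU stays in guest mode and
 * spins until work arrives or the governor's poll window expires. cpu_relax()
 * issues PAUSE, which only forces a VMEXIT once KVM's PLE threshold is hit. */
static void idle_poll(u64 poll_limit_ns) {
    u64 start = local_clock();
    while (!need_resched()) {
        cpu_relax();                            /* PAUSE, stays in guest mode */
        if (local_clock() - start > poll_limit_ns)
            break;                              /* give up, fall back to HLT */
    }
}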

Experimental Validation

To eliminate randomness, an aggressive experiment forced haltpoll activation on a VM without Hypervisor CPUID exposure by modifying the two kernel locations:

// cpuidle-haltpoll.c:113
- if (!kvm_para_available() || !haltpoll_want())
+ if (kvm_para_available() || !haltpoll_want())

// haltpoll.c:143
- if (kvm_para_available())
+ if (!kvm_para_available())

After recompiling the kernel, fio tests on PoleFS showed IOPS comparable to the Hypervisor‑exposed case, and perf data confirmed the shift in hotspot distribution.

Related Logic

The two files, cpuidle-haltpoll.c and haltpoll.c, decide whether the CPU idle path uses HLT or BusyPoll. In HLT mode, a VCPU VMEXIT leads to an IPI-triggered wake-up, incurring additional latency. In BusyPoll mode, the VCPU remains in a spin state; KVM's PLE limits the spin duration before a VMEXIT, resulting in much lower latency.
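
The haltpoll governor also adapts how long each VCPU polls before falling back to HLT. Roughly (a paraphrase of the grow/shrink logic in haltpoll.c and its guest_halt_poll_* module parameters; not a verbatim listing):

static void adjust_poll_limit(struct cpuidle_device *dev, u64 block_ns) {
    if (block_ns > dev->poll_limit_ns && block_ns <= guest_halt_poll_ns) {
        /* Wake-up arrived just after polling stopped: grow the window. */
        u64 val = dev->poll_limit_ns * guest_halt_poll_grow;
        dev->poll_limit_ns = min(val, (u64)guest_halt_poll_ns);
    } else if (block_ns > guest_halt_poll_ns && guest_halt_poll_allow_shrink) {
        /* The CPU stayed idle far longer than the window: the polling was
         * wasted, so shrink it back. */
        dev->poll_limit_ns /= guest_halt_poll_shrink;
    }
}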

Perf data without Hypervisor CPUID shows HLT accounting for 22.59% of VMEXIT events, while with CPUID exposure HLT drops to 14.67% and PAUSE_INSTRUCTION (BusyPoll) rises, confirming the latency reduction mechanism.

Optimization Recommendations

Enable Hypervisor CPUID exposure by default for high‑IOPS VMs, databases, and low‑latency services.

Establish a VMEXIT monitoring framework on the host to analyze and optimize hypervisor‑level exit reasons.

These findings highlight that performance bottlenecks increasingly reside at the guest‑hypervisor scheduling boundary, and future optimizations must consider invisible layers such as CPU idle policies and VMEXIT handling.
