Deep Hardware‑Software Integration to Eliminate NVMe IO Jitter During Double 11
This case study explains how a combination of kernel‑level NVMe driver congestion control, LVM adjustments, and SSD over‑provisioning was used to suppress severe IO bandwidth drops and jitter, ensuring smooth transaction processing for a high‑traffic Double 11 event.
The stability of Double 11 rests on rock-solid systems, both software and hardware; only when both are reliable can a smooth user experience be guaranteed. This article presents a real case of deep hardware-software optimization applied during Double 11.
Business Challenge
A certain service experienced IO jitter when using a specific server model equipped with a new NVMe SSD, causing business performance instability. The service required not only low latency and high IOPS from the SSD but also stable, non‑fluctuating IO metrics across the entire stack.
The TPS (transactions per second) curve showed dramatic drops over time, indicating the need to smooth both TPS and latency to maintain QoS.
Root Cause of IO Jitter
To reproduce the jitter, the team used the FIO tool for stress testing and added file‑deletion actions, which revealed a more severe issue: an abrupt drop of IO bandwidth to zero, representing a hidden risk for the business.
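The article does not publish the exact test parameters; a reproduction along these lines might use an fio job such as the one below (the device path, block size, and queue depth are assumptions), combined with a loop that periodically writes and then deletes a large file on the same filesystem to trigger discard and journal activity.

```ini
; hypothetical fio job -- filename, bs, and iodepth are illustrative
[global]
ioengine=libaio
direct=1
bs=4k
iodepth=128
time_based
runtime=7200

[nvme-stress]
filename=/mnt/nvme/testfile
rw=randwrite
size=100%
```

Running `iostat -x 1` on the target device alongside the job is enough to observe the bandwidth column collapsing to zero when the issue reproduces.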
The performance data after deleting a full‑disk file showed a clear degradation in the first two hours.
IOSTAT data captured during the bandwidth drop further illustrated the problem.
To pinpoint the issue, the team needed to determine the system state when NVMe bandwidth fell to zero: whether the NVMe driver timed out, got stuck in a kernel location, or something else. They examined three aspects: NVMe SSD state, LVM state, and the FIO test process state.
A custom kernel module was written to capture state information from the NVMe device, LVM device, and FIO process at the moment of the bandwidth drop.
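The module's source is not shown in the article. As a rough user-space approximation of the same idea, the block layer's in-flight request counters can be polled from sysfs (`/sys/block/<dev>/inflight` holds two columns: reads and writes currently in flight). A minimal sketch, assuming a Linux host and the example device name `nvme0n1`:

```python
from pathlib import Path


def parse_inflight(text: str) -> tuple[int, int]:
    """Parse the (reads, writes) in-flight counters from a sysfs 'inflight' file."""
    reads, writes = text.split()
    return int(reads), int(writes)


def sample_inflight(dev: str = "nvme0n1") -> tuple[int, int]:
    """Read the current in-flight request counts for a block device.

    The device name is an example; requires a Linux host with sysfs.
    """
    return parse_inflight(Path(f"/sys/block/{dev}/inflight").read_text())


# Parser applied to a captured sample line:
print(parse_inflight(" 3      128\n"))  # (3, 128)
```

Sampling this at the moment bandwidth drops to zero shows whether requests are stuck in flight (counters pinned high) or whether submission has stalled upstream (counters at zero), which mirrors the distinction the team needed to make.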
Analysis of the captured data revealed that both the NVMe driver and the LVM driver bypass Linux's generic block layer, which is where IO congestion control normally lives, and neither implements congestion control of its own. The request queue could therefore grow without bound, mixing journal/metadata writes with ordinary data writes.
With journal write latency made unpredictable by that queue growth, subsequent IO writes blocked behind the journal, producing the observed bandwidth oscillations.
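To see why an unbounded queue makes journal latency unpredictable, consider a toy FIFO model (illustrative numbers only, not measurements from the case): a journal write must wait for every request queued ahead of it, so without a depth limit its latency scales with the backlog, while a congestion-controlled queue caps that wait.

```python
def journal_wait(queue_depth_limit, pending_data, service_us=50):
    """Time (in microseconds) a journal write waits behind data writes
    in a FIFO device queue.

    With no congestion control (queue_depth_limit=None) the journal request
    queues behind every pending data write; with a limit, at most
    `queue_depth_limit` requests can sit ahead of it. `service_us` is an
    assumed per-request service time.
    """
    ahead = pending_data if queue_depth_limit is None else min(pending_data, queue_depth_limit)
    return ahead * service_us


# Unbounded queue: 100,000 data writes queued ahead of the journal write.
uncontrolled = journal_wait(None, 100_000)   # 5,000,000 us, i.e. a 5 s stall
# Queue capped at 128 in-flight requests by congestion control.
controlled = journal_wait(128, 100_000)      # 6,400 us
print(uncontrolled, controlled)
```

The model is deliberately simplistic, but it captures the mechanism: once the journal write's latency depends on an unbounded backlog, every write serialized behind the journal inherits that variance, which surfaces as bandwidth jitter.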
Deep Hardware‑Software Integration
Implementing IO congestion control in the NVMe driver addressed the software‑induced jitter, but the SSD’s internal garbage collection (GC) still caused severe IO spikes under heavy load. Therefore, merely optimizing the driver was insufficient for Double 11 QoS requirements.
After identifying GC as the hardware source of jitter, the team increased the SSD’s over‑provisioning (OP) space. Subsequent tests showed smoother TPS and response‑time curves.
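Over-provisioning is commonly expressed as the ratio of reserved NAND capacity to user-visible capacity; enlarging it gives the SSD's flash translation layer more spare blocks, so garbage collection interferes less with foreground IO. A quick calculation with illustrative capacities (not the actual drive's figures):

```python
def op_ratio(physical_gib: float, usable_gib: float) -> float:
    """Over-provisioning ratio: spare capacity relative to user-visible capacity."""
    return (physical_gib - usable_gib) / usable_gib


# Factory OP on a hypothetical drive with 4096 GiB of NAND exposing 3840 GiB:
print(round(op_ratio(4096, 3840) * 100, 1))  # 6.7 (%)
# After shrinking the usable capacity to 3200 GiB to enlarge OP:
print(round(op_ratio(4096, 3200) * 100, 1))  # 28.0 (%)
```

In practice the extra OP is obtained by leaving part of the drive unpartitioned or by resizing the namespace with vendor tooling; the exact mechanism depends on the SSD model.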
By combining NVMe driver congestion control with increased SSD OP, the team achieved a "silky‑smooth" IO path that underpinned a rock‑solid infrastructure for Double 11.
The team commits to continuing deep hardware‑software co‑optimization to further improve infrastructure availability and reliability.
Alibaba Cloud Infrastructure