How Alibaba Accelerated Tengine Gzip with Intel QAT: A Performance Case Study
This article examines how Alibaba's Tengine access layer tackled the CPU bottleneck of Gzip compression by adopting hardware acceleration with Intel QAT cards, detailing the analysis, implementation challenges, performance gains, and operational safeguards that resulted in up to 15% CPU savings and reduced system load.
Background
General‑purpose CPUs are reaching the limits of Moore's Law while machine‑learning and web services grow exponentially, prompting Alibaba to explore hardware acceleration for its Tengine access layer. Gzip compression consumes 15‑20% of CPU in Tengine, making hardware offload essential for performance and cost.
Analysis and Research
Hardware acceleration replaces software algorithms with dedicated hardware, offering higher efficiency. Two main approaches are considered:
FPGA – field‑programmable gate array, customizable for specific algorithms (e.g., smart NICs).
ASIC – application‑specific integrated circuit, such as Intel QAT cards that accelerate SSL, compression, and decompression.
Comparative tables (omitted) show the trade‑offs. Alibaba evaluated three solutions:
Intel QAT Card
QAT (QuickAssist Technology) accelerates asymmetric cryptography (RSA, ECDH, ECDSA, DH, DSA) and provides a zlib compression shim that is compatible with existing code, so adopting it requires minimal changes.
Intelligent NIC
An intelligent NIC (INIC) offers two modes: (a) the host calls an API and receives compressed data back; (b) the host sends uncompressed packets and the NIC compresses and re-packs them. Both require significant integration effort.
FPGA Card
FPGA demands a full redesign of the zlib algorithm and driver, leading to high development cost.
After comparison, the QAT ASIC was selected for Tengine Gzip offload.
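To make the "minimal changes" argument concrete: an application that compresses through zlib only ever touches a small streaming interface (initialize, feed chunks, flush). The sketch below uses Python's standard zlib module as a stand-in for that interface. It is not Tengine's code, and the name gzip_stream is purely illustrative. A shim that preserves this call pattern can, in principle, redirect the work to hardware without the caller changing.

```python
import zlib

def gzip_stream(chunks, level=6):
    """Compress an iterable of byte chunks into a single gzip stream."""
    # wbits = 16 + MAX_WBITS asks zlib for the gzip container format.
    comp = zlib.compressobj(level, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
    for chunk in chunks:
        out = comp.compress(chunk)
        if out:  # zlib may buffer input and emit nothing yet
            yield out
    yield comp.flush()  # emit remaining data and the gzip trailer

body = [b"hello " * 100, b"world " * 100]
compressed = b"".join(gzip_stream(body))
restored = zlib.decompress(compressed, 16 + zlib.MAX_WBITS)
```

The caller sees only compressobj, compress, and flush; where the bytes are actually compressed is hidden behind that boundary, which is the property the shim approach relies on.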
Implementation
The QAT driver uses Userspace I/O (UIO) with most logic in user space, simplifying debugging and avoiding kernel floating‑point limitations. SR‑IOV enables sharing the PCIe device across up to 32 VMs. The acceleration chain links Zlib Shim, QAT user‑space API, and the QAT driver, minimizing impact on upper‑level services.
Key challenges addressed:
The initial driver spent significant CPU time in kernel mode (ioctl calls and memory allocation). Memory management was moved to the out-of-tree (OOT) driver's user-space memory manager, USDM (User Space DMA-able Memory), which allocates from a huge-page pool.
The number of open, ioctl, and futex system calls spiked after acceleration was enabled; the driver and shim were tuned to reduce these calls.
Reloading workers could exhaust the limited QAT instance pool (64 instances). An updated driver increased the pool to 256 instances and added automatic fallback to software compression.
Huge-page memory leaks in the shim caused QAT-related core dumps; correcting the lifecycle of the inflate/deflate calls eliminated the leaks.
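Two of the items above (allocating from a preallocated pool, and fixing leaks by tightening the init/end lifecycle) follow a common pattern that can be sketched briefly. The code below is an assumption-laden illustration, not the USDM allocator or the shim's real logic: BufferPool and borrow are hypothetical names, and ordinary Python bytearrays stand in for huge-page, DMA-capable memory.

```python
from collections import deque
from contextlib import contextmanager

class BufferPool:
    """Fixed set of preallocated buffers that are reused, not reallocated."""

    def __init__(self, count, size):
        # All memory is reserved once, up front (hypothetical stand-in
        # for a huge-page pool).
        self._free = deque(bytearray(size) for _ in range(count))

    @property
    def available(self):
        return len(self._free)

    @contextmanager
    def borrow(self):
        if not self._free:
            raise MemoryError("buffer pool exhausted")
        buf = self._free.popleft()
        try:
            yield buf
        finally:
            # The buffer goes back to the pool even if the caller raises,
            # so an error path cannot leak it.
            self._free.append(buf)

pool = BufferPool(count=4, size=64 * 1024)
try:
    with pool.borrow() as buf:
        buf[:5] = b"hello"
        raise RuntimeError("simulated failure mid-request")
except RuntimeError:
    pass
```

After the simulated failure the pool is back at full capacity, which is the invariant a correct init/end pairing is meant to guarantee.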
Operational safeguards include automatic detection of QAT availability, deployment of dual binaries (software vs. hardware), and runtime fallback to software compression when resources are insufficient.
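The runtime fallback can be sketched as a bounded instance pool with a software path behind it. This is a minimal illustration under stated assumptions: InstancePool, hardware_gzip, and compress are hypothetical names, and hardware_gzip simply calls software zlib so that the example runs on any machine. A real deployment would submit the request to the accelerator at that point.

```python
import zlib

class InstancePool:
    """Hypothetical stand-in for a fixed pool of accelerator instances."""

    def __init__(self, size):
        self._free = size

    def try_acquire(self):
        if self._free == 0:
            return False
        self._free -= 1
        return True

    def release(self):
        self._free += 1

def gzip_software(data):
    comp = zlib.compressobj(6, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
    return comp.compress(data) + comp.flush()

def hardware_gzip(data):
    # Placeholder only: software zlib keeps the sketch runnable anywhere.
    return gzip_software(data)

def compress(data, pool):
    """Return (path, payload), using software when no instance is free."""
    if pool.try_acquire():
        try:
            return "hardware", hardware_gzip(data)
        finally:
            pool.release()  # always hand the instance back
    return "software", gzip_software(data)

pool = InstancePool(size=1)
first_path, first_payload = compress(b"x" * 1000, pool)
```

Because the fallback produces the same gzip output format, clients cannot tell which path served them; exhaustion of the pool degrades CPU efficiency rather than availability.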
Performance Results
Test environment: Intel Xeon E5‑2650 v2 (32 cores), Zlib 1.2.8, QAT driver intel‑qatOOT40052.
With QAT enabled, average CPU usage dropped from ~48% to ~41% (about 7 percentage points), system load fell from 14.22 to 12.09, and the Gzip hot-spot functions largely disappeared from the profile, indicating near-complete offload.
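The savings can be expressed in relative terms directly from the reported figures. The short calculation below uses only the numbers quoted above; the helper name relative_reduction is illustrative.

```python
def relative_reduction(before, after):
    """Fraction by which a metric fell, relative to its starting value."""
    return (before - after) / before

cpu_saving = relative_reduction(48.0, 41.0)     # average CPU usage, percent
load_saving = relative_reduction(14.22, 12.09)  # system load

print(f"CPU: {cpu_saving:.1%}, load: {load_saving:.1%}")
# prints "CPU: 14.6%, load: 15.0%"
```

Both metrics fall by roughly 15% relative to their baselines, in line with the summary figure given at the top of the article.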
Conclusion
The collaboration between Alibaba's Tair & Tengine teams and Intel delivered a robust hardware‑accelerated Gzip solution that improves performance, reduces CPU consumption, and lays groundwork for future SSL + Gzip integration, filling a gap in the industry’s access‑layer acceleration landscape.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.