Boosting TLS Performance with Intel QAT and a Custom Keyless Architecture
This article details how Xiaohongshu's infrastructure team built a keyless architecture that offloads CPU‑intensive TLS private‑key signing to Intel QAT hardware. The result is higher HTTPS throughput and lower server cost, along with lessons that apply to other high‑traffic TLS offload scenarios.
The article introduces Xiaohongshu's in‑house keyless architecture in three parts: Intel QAT hardware selection and performance tuning, async support in rustls, and a high‑performance keyserver implementation. The solution now handles the public‑facing traffic of the company's self‑built data centers (IDCs), substantially increasing HTTPS processing capacity while reducing server resource costs.
2.1 QAT Introduction
Intel CPUs embed various accelerators. One of them, QuickAssist Technology (QAT), can significantly speed up network processing by offloading compression and both symmetric and asymmetric cryptography. Performance varies across CPU families and QAT generations.
2.2 Hardware Selection
A comparison of the 4516Y (MCC, 2 × QAT) and 6554S (XCC, 8 × QAT) CPUs shows that more QAT units do not always mean better cost‑performance. The single‑die 4516Y delivers higher encryption throughput per QAT unit than the multi‑die 6554S, even though the latter has more QAT engines.
QAT Engine provides both hardware and software acceleration; it automatically falls back to software when hardware is saturated. Selecting the optimal CPU model requires balancing QAT resource ratios, purchase price, and overall system throughput.
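The fallback behavior can be illustrated with a small model. The sketch below is not QAT Engine code: it assumes a fixed number of hardware request slots (`hw_capacity`, a hypothetical parameter) and routes a request to software whenever every slot is in use. All type and function names are invented for illustration.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Which path handles a request (hypothetical names for illustration).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Path {
    Hardware,
    Software,
}

/// Tracks in-flight hardware requests against a fixed capacity.
pub struct Dispatcher {
    in_flight: AtomicUsize,
    hw_capacity: usize, // assumed slot count; not a figure from the article
}

impl Dispatcher {
    pub fn new(hw_capacity: usize) -> Self {
        Self { in_flight: AtomicUsize::new(0), hw_capacity }
    }

    /// Reserve a hardware slot if one is free, otherwise choose software.
    pub fn acquire(&self) -> Path {
        let mut cur = self.in_flight.load(Ordering::Relaxed);
        loop {
            if cur >= self.hw_capacity {
                return Path::Software; // hardware saturated
            }
            // Retry the reservation if another thread changed the counter.
            match self.in_flight.compare_exchange_weak(
                cur,
                cur + 1,
                Ordering::AcqRel,
                Ordering::Relaxed,
            ) {
                Ok(_) => return Path::Hardware,
                Err(actual) => cur = actual,
            }
        }
    }

    /// Give the slot back once a hardware request completes.
    pub fn release(&self, path: Path) {
        if path == Path::Hardware {
            self.in_flight.fetch_sub(1, Ordering::AcqRel);
        }
    }
}
```

A caller would invoke `acquire()` before each operation and `release()` with the returned path on completion. The lock‑free counter keeps the decision cheap when many threads dispatch concurrently.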
2.3 Performance Tuning
Key tuning options include enabling both HW and SW acceleration in QAT Engine, disabling the default global memory lock to improve multi‑core scalability, adjusting the QAT driver’s ServicesEnabled configuration to drop unused modes, and enabling debug logging for easier troubleshooting.
4 Keyless Architecture Overview
The architecture consists of two parts: keyclient and keyserver.
keyclient implements asynchronous asymmetric crypto using:
Keyless protocol for communication with the keyserver.
Asynchronous TLS support via a modified rustls, enabling QUIC‑TLS and TCP‑TLS offload.
keyserver provides a high‑performance user‑space network service:
Keyless protocol handling.
Asynchronous task scheduling to fully utilize CPU parallelism.
Encryption/decryption offload to Intel QAT, reducing CPU cost.
4.1 Keyless Protocol
The protocol follows Cloudflare’s format, enabling communication between keyclient and keyserver.
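The article does not reproduce the wire format. As a rough illustration, the sketch below assumes the layout of Cloudflare's published keyless protocol: an 8‑byte header (major version, minor version, 2‑byte body length, 4‑byte request ID) followed by tag‑length‑value items. The tag values are recalled from that description and should be treated as assumptions. Whether Xiaohongshu's implementation matches it byte for byte is not stated in the source.

```rust
// Tag values recalled from Cloudflare's public protocol description;
// treat them as assumptions rather than a verified specification.
pub const TAG_OPCODE: u8 = 0x11;
pub const TAG_PAYLOAD: u8 = 0x12;

/// One keyless message: a request ID plus a list of (tag, data) items.
#[derive(Debug, PartialEq, Eq)]
pub struct Frame {
    pub id: u32,
    pub items: Vec<(u8, Vec<u8>)>,
}

impl Frame {
    /// Header: major(1) minor(1) body_len(2, big-endian) id(4, big-endian),
    /// then each item as tag(1) len(2, big-endian) data.
    /// Returns None if an item or the whole body exceeds the 16-bit length.
    pub fn encode(&self) -> Option<Vec<u8>> {
        let mut body = Vec::new();
        for (tag, data) in &self.items {
            if data.len() > u16::MAX as usize {
                return None;
            }
            body.push(*tag);
            body.extend_from_slice(&(data.len() as u16).to_be_bytes());
            body.extend_from_slice(data);
        }
        if body.len() > u16::MAX as usize {
            return None;
        }
        let mut out = vec![1u8, 0u8]; // version 1.0 (assumed)
        out.extend_from_slice(&(body.len() as u16).to_be_bytes());
        out.extend_from_slice(&self.id.to_be_bytes());
        out.extend_from_slice(&body);
        Some(out)
    }

    /// Parse one frame; returns None on truncated or malformed input.
    /// The version bytes are not validated in this sketch.
    pub fn decode(buf: &[u8]) -> Option<Frame> {
        if buf.len() < 8 {
            return None;
        }
        let len = u16::from_be_bytes([buf[2], buf[3]]) as usize;
        let id = u32::from_be_bytes([buf[4], buf[5], buf[6], buf[7]]);
        let mut body = buf.get(8..8 + len)?;
        let mut items = Vec::new();
        while !body.is_empty() {
            if body.len() < 3 {
                return None;
            }
            let n = u16::from_be_bytes([body[1], body[2]]) as usize;
            let data = body.get(3..3 + n)?;
            items.push((body[0], data.to_vec()));
            body = &body[3 + n..];
        }
        Some(Frame { id, items })
    }
}
```

The request ID in the header is what lets a client multiplex many outstanding signing requests over one connection and match each response to its request.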
4.2 keyclient Details
Implemented in Rust, keyclient modifies the rustls library to provide an asynchronous TLS mode, supporting both QUIC‑TLS and TCP‑TLS offload, with remote and local fallback capabilities.
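The remote‑with‑local‑fallback behavior might look roughly like the following. This is a synchronous sketch for brevity, while the real keyclient is asynchronous. The `Signer` trait, `FallbackSigner`, and `FnSigner` are hypothetical names and not part of rustls or of the team's code.

```rust
/// Minimal signing interface (hypothetical; the real keyclient is async).
pub trait Signer {
    fn sign(&self, digest: &[u8]) -> Result<Vec<u8>, String>;
}

/// Tries the remote keyserver first, then a local key if that call fails.
pub struct FallbackSigner<R: Signer, L: Signer> {
    pub remote: R,
    pub local: L,
}

impl<R: Signer, L: Signer> Signer for FallbackSigner<R, L> {
    fn sign(&self, digest: &[u8]) -> Result<Vec<u8>, String> {
        match self.remote.sign(digest) {
            Ok(sig) => Ok(sig),
            // Any remote error (timeout, reset, ...) triggers the fallback.
            Err(_) => self.local.sign(digest),
        }
    }
}

/// Adapter that lets a closure act as a signer, handy for experiments.
pub struct FnSigner<F>(F);

impl<F: Fn(&[u8]) -> Result<Vec<u8>, String>> FnSigner<F> {
    pub fn new(f: F) -> Self {
        FnSigner(f)
    }
}

impl<F: Fn(&[u8]) -> Result<Vec<u8>, String>> Signer for FnSigner<F> {
    fn sign(&self, digest: &[u8]) -> Result<Vec<u8>, String> {
        (self.0)(digest)
    }
}
```

Because `FallbackSigner` itself implements `Signer`, the TLS layer needs no knowledge of which path produced a signature. A production version would likely also bound the remote call with a timeout so that a slow keyserver cannot stall handshakes.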
4.3 keyserver Details
The keyserver stack includes:
Multi‑threaded epoll for receiving RPC messages and handling QAT callbacks.
OpenSSL async jobs (ASYNC_start_job, ASYNC_pause_job) to represent individual QAT operations.
Notification mechanisms via eventfd or callbacks, propagating completion from the QAT device back to the user‑space application.
The async framework in libcrypto abstracts much of the QAT interaction, allowing the keyserver to focus on high‑throughput request handling.
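The pause‑and‑resume flow can be modeled without OpenSSL or QAT. In the sketch below, a job submits its request and pauses, a stand‑in device thread completes it, and the event loop resumes the job when the completion arrives. Every name here is hypothetical: the channel plays the role of the eventfd notification, and `drive` only mirrors the pause/finish outcomes of `ASYNC_start_job` rather than calling it.

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

/// Outcome of driving a job once (modeled on pause/finish, not OpenSSL itself).
pub enum Step {
    Paused,
    Finished(Vec<u8>),
}

/// One in-flight signing operation.
pub struct Job {
    pub id: u32,
    digest: Vec<u8>,
    submitted: bool,
    result: Option<Vec<u8>>,
}

impl Job {
    pub fn new(id: u32, digest: Vec<u8>) -> Self {
        Job { id, digest, submitted: false, result: None }
    }

    /// The first call submits to the device and pauses; a call made after
    /// the completion has been stored finishes the job.
    pub fn drive(&mut self, device: &mpsc::Sender<(u32, Vec<u8>)>) -> Step {
        if let Some(sig) = self.result.take() {
            return Step::Finished(sig);
        }
        if !self.submitted {
            device
                .send((self.id, self.digest.clone()))
                .expect("device thread alive");
            self.submitted = true;
        }
        Step::Paused
    }
}

/// Stand-in for the accelerator: reverses the input on another thread.
pub fn spawn_fake_device() -> (mpsc::Sender<(u32, Vec<u8>)>, mpsc::Receiver<(u32, Vec<u8>)>) {
    let (req_tx, req_rx) = mpsc::channel::<(u32, Vec<u8>)>();
    let (done_tx, done_rx) = mpsc::channel();
    thread::spawn(move || {
        for (id, mut data) in req_rx {
            data.reverse();
            if done_tx.send((id, data)).is_err() {
                break;
            }
        }
    });
    (req_tx, done_rx)
}

/// Event loop: start every job, park the paused ones, resume on completion.
pub fn run_all(jobs: Vec<Job>) -> HashMap<u32, Vec<u8>> {
    let (device, completions) = spawn_fake_device();
    let mut parked = HashMap::new();
    let mut done = HashMap::new();
    for mut job in jobs {
        match job.drive(&device) {
            Step::Paused => {
                parked.insert(job.id, job);
            }
            Step::Finished(sig) => {
                done.insert(job.id, sig);
            }
        }
    }
    while !parked.is_empty() {
        // The blocking recv stands in for epoll waking on an eventfd.
        let (id, sig) = completions.recv().expect("device thread alive");
        if let Some(mut job) = parked.remove(&id) {
            job.result = Some(sig);
            if let Step::Finished(sig) = job.drive(&device) {
                done.insert(id, sig);
            }
        }
    }
    done
}
```

The point of the pattern is that the thread running the event loop never blocks on an individual operation: all jobs are submitted up front, and each is resumed only when its completion is signaled.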
Performance Results
A single keyserver node can process over 300,000 sign operations per second, using both hardware and software acceleration across two fully loaded QAT devices and 32 physical CPU cores. Compared with a pure‑software rustls TLS stack, the forwarding cost is more than five times lower.
Future Work
Planned improvements include UDP‑based keyless communication to cut TCP overhead, further resource pooling in the keyserver, and support for QUIC‑TLS scenarios. The team also intends to open‑source both the async‑enabled rustls library and the QAT‑enabled keyserver.
Xiaohongshu Tech REDtech
Official account of the Xiaohongshu tech team, sharing tech innovations and problem insights, advancing together.
