Boosting TLS Performance with Intel QAT and a Custom Keyless Architecture

This article details how Xiaohongshu's infrastructure team built a keyless architecture that offloads CPU‑intensive TLS private‑key signing to Intel QAT hardware. The result is substantially higher HTTPS throughput and lower server cost, along with lessons that apply to other high‑traffic TLS offload deployments.

Xiaohongshu Tech REDtech

The article covers Xiaohongshu's in‑house keyless architecture end to end: Intel QAT hardware selection and performance tuning, asynchronous support in rustls, and a high‑performance keyserver implementation. The solution now serves the public‑facing traffic of the company's self‑built IDCs, substantially increasing HTTPS processing capacity while reducing server resource costs.

2.1 QAT Introduction

Intel CPUs embed various accelerators. One of them, QAT (QuickAssist Technology), can significantly speed up network processing by accelerating compression as well as symmetric and asymmetric cryptography. Performance varies across CPU families and QAT generations.

2.2 Hardware Selection

A comparison of the 4516Y (MCC, 2 × QAT) and 6554S (XCC, 8 × QAT) CPUs shows that more QAT units do not always mean better cost‑performance. The 4516Y's single‑die design delivers higher encryption throughput per QAT unit than the multi‑die 6554S, even though the latter has more QAT engines.

QAT Engine provides both hardware and software acceleration; it automatically falls back to software when hardware is saturated. Selecting the optimal CPU model requires balancing QAT resource ratios, purchase price, and overall system throughput.
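The hardware-first, software-fallback behaviour can be pictured as a dispatcher with a bounded hardware queue. The sketch below is a toy model only: `HybridDispatcher` and `hw_capacity` are hypothetical names, and the real QAT Engine decides when to fall back using its own internal logic.

```python
from dataclasses import dataclass


@dataclass
class HybridDispatcher:
    """Toy model of hardware-first dispatch with software fallback.

    hw_capacity is a hypothetical limit on in-flight hardware requests;
    it is not a real QAT Engine setting.
    """
    hw_capacity: int
    hw_inflight: int = 0
    hw_count: int = 0
    sw_count: int = 0

    def submit(self) -> str:
        """Route one request and return the path that took it."""
        if self.hw_inflight < self.hw_capacity:
            self.hw_inflight += 1
            self.hw_count += 1
            return "hw"
        # Hardware queue is full: run the operation on the CPU instead.
        self.sw_count += 1
        return "sw"

    def complete_hw(self) -> None:
        """Mark one hardware request as finished, freeing a slot."""
        if self.hw_inflight > 0:
            self.hw_inflight -= 1


if __name__ == "__main__":
    d = HybridDispatcher(hw_capacity=2)
    print([d.submit() for _ in range(4)])  # first two go to hw, the rest to sw
    d.complete_hw()
    print(d.submit())  # the freed slot is used by hw again
```

In this model the CPU only absorbs the overflow, which is one way to read why the ratio of QAT units to cores matters when choosing a CPU model.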

2.3 Performance Tuning

Key tuning options include enabling both HW and SW acceleration in QAT Engine, disabling the default global memory lock to improve multi‑core scalability, adjusting the QAT driver’s ServicesEnabled configuration to drop unused modes, and enabling debug logging for easier troubleshooting.

4 Keyless Architecture Overview

The architecture consists of two parts: keyclient and keyserver.

keyclient implements asynchronous asymmetric crypto using:

Keyless protocol for communication with the keyserver.

Asynchronous TLS support in rustls, enabling QUIC‑TLS and TCP‑TLS offload.

keyserver provides a high‑performance user‑space network service:

Keyless protocol handling.

Asynchronous task scheduling to fully utilize CPU parallelism.

Encryption/decryption offload to Intel QAT, reducing CPU cost.

4.1 Keyless Protocol

The protocol follows Cloudflare’s format, enabling communication between keyclient and keyserver.
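As a rough illustration of what such a frame might look like, the sketch below assumes the layout commonly described for Cloudflare's keyless protocol: an 8-byte header (major version, minor version, body length, request ID) followed by tag-length-value items. The tag constants and the opcode byte used here are assumptions for illustration and should be checked against the published specification; the article does not show the team's actual encoder.

```python
import struct

HEADER = struct.Struct(">BBHI")     # major, minor, body length, request id
ITEM_HEADER = struct.Struct(">BH")  # tag, data length

# Tag values are assumptions for illustration; confirm against the spec.
TAG_OPCODE = 0x11
TAG_PAYLOAD = 0x12


def encode_message(request_id: int, items: list[tuple[int, bytes]]) -> bytes:
    """Serialize items as tag-length-value records behind a fixed header."""
    body = b"".join(ITEM_HEADER.pack(tag, len(data)) + data for tag, data in items)
    return HEADER.pack(1, 0, len(body), request_id) + body


def decode_message(frame: bytes) -> tuple[int, list[tuple[int, bytes]]]:
    """Parse one frame and return (request_id, items)."""
    major, _minor, length, request_id = HEADER.unpack_from(frame, 0)
    if major != 1:
        raise ValueError("unsupported major version")
    body = frame[HEADER.size:HEADER.size + length]
    if len(body) != length:
        raise ValueError("truncated frame")
    items, offset = [], 0
    while offset < len(body):
        if len(body) - offset < ITEM_HEADER.size:
            raise ValueError("truncated item header")
        tag, size = ITEM_HEADER.unpack_from(body, offset)
        offset += ITEM_HEADER.size
        data = body[offset:offset + size]
        if len(data) != size:
            raise ValueError("truncated item")
        items.append((tag, data))
        offset += size
    return request_id, items


if __name__ == "__main__":
    # The opcode byte below is a placeholder, not a real operation code.
    frame = encode_message(7, [(TAG_OPCODE, b"\x05"), (TAG_PAYLOAD, b"digest")])
    print(decode_message(frame))
```

The request ID in the header is what lets a client keep many signing requests in flight on one connection and match each response to its handshake.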

4.2 keyclient Details

Implemented in Rust, keyclient modifies the rustls library to provide an asynchronous TLS mode, supporting both QUIC‑TLS and TCP‑TLS offload, with remote and local fallback capabilities.
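The remote-then-local fallback can be sketched in a few lines. The real keyclient is written in Rust on top of a modified rustls; the Python below only illustrates the control flow, and `sign_with_fallback` and the `Signer` type are hypothetical names, not part of the team's code.

```python
from typing import Callable, Optional

# A signer takes a digest and returns a signature.
Signer = Callable[[bytes], bytes]


def sign_with_fallback(
    digest: bytes, remote: Signer, local: Optional[Signer]
) -> tuple[bytes, str]:
    """Try the keyserver first; fall back to a local key if it is unreachable.

    Returns the signature together with the path that produced it.
    """
    try:
        return remote(digest), "remote"
    except (ConnectionError, TimeoutError) as exc:
        if local is None:
            raise RuntimeError(
                "remote signing failed and no local key is available"
            ) from exc
        # Degrade to CPU signing so the handshake can still complete.
        return local(digest), "local"
```

Only connection and timeout failures trigger the fallback in this sketch; other errors propagate, on the assumption that they indicate a bug rather than an unavailable keyserver.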

4.3 keyserver Details

The keyserver stack includes:

Multi‑threaded epoll for receiving RPC messages and handling QAT callbacks.

OpenSSL async jobs (ASYNC_start_job, ASYNC_pause_job) to represent individual QAT operations.

Notification mechanisms via eventfd or callbacks, propagating completion from the QAT device back to the user‑space application.

The async framework in libcrypto abstracts much of the QAT interaction, allowing the keyserver to focus on high‑throughput request handling.
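The pause-and-resume pattern behind async jobs can be modelled with Python generators. This is an analogy rather than OpenSSL code: `yield` stands in for ASYNC_pause_job, resuming the generator stands in for re-entering the job once the device signals completion, and a deque stands in for the QAT request ring and its notifications. All function names and the reversed-bytes "signature" are placeholders.

```python
from collections import deque
from typing import Deque, Dict, Generator, List, Tuple

Job = Generator[bytes, bytes, bytes]


def sign_job(digest: bytes) -> Job:
    """One request: submit to the 'device', pause, then finish with the result."""
    # Yielding hands the digest to the device and suspends this job,
    # much as a paused async job returns control to the event loop.
    signature = yield digest
    return b"resp:" + signature


def run_event_loop(digests: List[bytes]) -> Dict[int, bytes]:
    paused: Dict[int, Job] = {}
    device_queue: Deque[Tuple[int, bytes]] = deque()  # stands in for the QAT ring
    results: Dict[int, bytes] = {}

    # Phase 1: start every job; each runs until its first pause.
    for job_id, digest in enumerate(digests):
        job = sign_job(digest)
        device_queue.append((job_id, next(job)))
        paused[job_id] = job

    # Phase 2: drain completions, resuming the matching job each time.
    while device_queue:
        job_id, submitted = device_queue.popleft()
        fake_signature = submitted[::-1]  # placeholder for the hardware result
        try:
            paused.pop(job_id).send(fake_signature)
        except StopIteration as done:
            results[job_id] = done.value
    return results


if __name__ == "__main__":
    print(run_event_loop([b"abc", b"xyz"]))
```

The point of the pattern is that a thread never blocks on the device: it starts many jobs, parks them, and resumes each one only when its completion arrives.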

Performance Results

A single keyserver node can process over 300,000 sign operations per second, using both hardware and software acceleration across two fully loaded QAT devices and 32 physical CPU cores. Compared with a pure‑software rustls TLS stack, the forwarding cost is more than five times lower.
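For a sense of scale, the headline figures can be turned into per-core and relative-cost numbers. The calculation below uses only the values stated above; the per-core figure is a simple average over the 32 cores, even though part of the work runs on the QAT devices rather than on the CPU.

```python
def per_core_rate(total_ops_per_sec: float, cores: int) -> float:
    """Average sign operations per second attributed to each core."""
    return total_ops_per_sec / cores


def relative_cost(reduction_factor: float) -> float:
    """Remaining cost as a fraction of the baseline after an N-fold reduction."""
    return 1.0 / reduction_factor


if __name__ == "__main__":
    print(per_core_rate(300_000, 32))  # lower bound, since the total is "over 300,000"
    print(relative_cost(5))            # upper bound, since the reduction is "more than" fivefold
```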

Future Work

Planned improvements include UDP‑based keyless communication to cut TCP overhead, further resource pooling in the keyserver, and support for QUIC‑TLS scenarios. The team also intends to open‑source the async‑enabled rustls library and the QAT‑enabled keyserver.
