
GPU-Accelerated Inference Optimization for Large-Scale Machine Learning at Xiaohongshu

Xiaohongshu transformed its recommendation, advertising, and search inference pipeline by migrating to GPU‑centric hardware, deploying a custom TensorFlow‑Core Lambda service, and applying system‑level, virtualization, and compute‑level optimizations—including NUMA binding, kernel fusion, dynamic scaling, and FP16 quantization—achieving roughly 30× compute capacity growth, over 10% user‑metric gains, and more than 50% cluster‑resource savings.

Xiaohongshu Tech REDtech

In recent years, the computational and parameter demands of machine‑learning models for video, image, text, and recommendation/search have far outpaced the growth of CPU performance. This has tied progress on large models closely to advances in GPU compute.

Many companies, including Xiaohongshu, have begun to migrate their machine‑learning workloads to GPU‑centric solutions to improve inference performance and efficiency. The migration faces challenges such as smooth transition to heterogeneous hardware, integration with business‑specific online architectures, and cost‑effective scaling.

At Xiaohongshu, the recommendation, advertising and search services are unified under a central inference platform. Model sizes have grown dramatically: the main ranking model’s historical behavior length increased by ~100×, and FLOPs per request grew 30‑fold, while memory traffic rose ~5‑fold.

Model characteristics: The 2022 main recommendation model contains massive sparse features totaling up to 1 TB, while the dense part is kept under 10 GB to fit in GPU memory. A single user interaction can trigger ~40 B FLOPs, with latency targets under 300 ms (excluding feature lookup).

Inference framework evolution: Until 2020, TensorFlow Serving was used. It was then replaced by a custom Lambda Service built on TensorFlow Core, which eliminates an unnecessary TensorProto → CTensor copy and exposes plug‑in optimization points (TRT, BLADE, TVM, etc.). The framework also handles feature extraction and plans edge‑side storage to reduce remote data fetches.

Hardware selection: Xiaohongshu purchases machines from cloud vendors, so decisions depend on the available CPU, GPU, bandwidth, and NUMA characteristics. GPU‑CPU balance, interconnect latency, and memory bandwidth are all weighed.

System‑level optimizations:

Interrupt isolation – separate GPU interrupts from other devices.

Kernel version upgrades for stability and driver compatibility.

Instruction passthrough – direct GPU instruction execution.
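On Linux, the interrupt isolation described above is typically done by pinning each IRQ to a chosen CPU set via `/proc/irq/<n>/smp_affinity`. A minimal sketch, assuming a Linux host with root access; the IRQ number 130 in the usage comment is purely illustrative, not a real GPU IRQ from the article:

```python
def cpu_mask(cpus):
    """Build the hex bitmask string that /proc/irq/<n>/smp_affinity
    expects from a set of CPU indices (e.g. {0, 1, 8} -> "103")."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")

def isolate_irq(irq, cpus):
    """Pin one IRQ to the given CPUs (requires root; Linux only)."""
    with open(f"/proc/irq/{irq}/smp_affinity", "w") as f:
        f.write(cpu_mask(cpus))

# Example: steer a (hypothetical) GPU IRQ 130 onto CPUs 0-3,
# keeping it off the cores that serve inference threads.
# isolate_irq(130, {0, 1, 2, 3})
```

Reserving a few cores for device interrupts keeps IRQ handling from preempting latency-sensitive inference threads on the remaining cores.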

Virtualization & container tuning:

Bind each pod to a specific NUMA node to improve CPU‑GPU data transfer.

CPU NUMA affinity to keep memory accesses local, reducing latency.

Keep CPU utilization below 70 %, which brought latency down from 200 ms to 150 ms.
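The NUMA binding above can be sketched at the process level. On Linux, each node's CPUs are listed under `/sys/devices/system/node/node<N>/cpulist`; a process can then pin itself to that set so its memory accesses stay node-local. This is a simplified sketch (in production the binding is typically enforced at the pod/cgroup level, which is not shown here):

```python
import os

def parse_cpulist(spec):
    """Parse a Linux cpulist string such as "0-3,32-35" into a set of CPU ids."""
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

def bind_to_numa_node(node):
    """Pin the current process to the CPUs of one NUMA node (Linux only)."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpus = parse_cpulist(f.read().strip())
    os.sched_setaffinity(0, cpus)  # memory allocations now favor this node

# Example: bind_to_numa_node(0)
```

Pinning CPU threads to the node closest to the GPU's PCIe root avoids cross-socket hops on every host-to-device transfer.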

Image compilation : Different CPUs support different instruction sets. For an Alibaba Cloud instance with Intel(R) Xeon(R) Platinum 8163 + 2 A10 GPUs, the build flags were tuned to leverage AVX‑512 and related extensions, yielding ~10 % CPU throughput improvement.

# Intel(R) Xeon(R) Platinum 8163 for ali intel
build:intel --copt=-march=skylake-avx512 --copt=-mmmx --copt=-mno-3dnow --copt=-msse
build:intel --copt=-msse2 --copt=-msse3 --copt=-mssse3 --copt=-mno-sse4a --copt=-mcx16
build:intel --copt=-msahf --copt=-mmovbe --copt=-maes --copt=-mno-sha --copt=-mpclmul
build:intel --copt=-mpopcnt --copt=-mabm --copt=-mno-lwp --copt=-mfma --copt=-mno-fma4
build:intel --copt=-mno-xop --copt=-mbmi --copt=-mno-sgx --copt=-mbmi2 --copt=-mno-pconfig
build:intel --copt=-mno-wbnoinvd --copt=-mno-tbm --copt=-mavx --copt=-mavx2 --copt=-msse4.2
build:intel --copt=-msse4.1 --copt=-mlzcnt --copt=-mrtm --copt=-mhle --copt=-mrdrnd --copt=-mf16c
build:intel --copt=-mfsgsbase --copt=-mrdseed --copt=-mprfchw --copt=-madx --copt=-mfxsr
build:intel --copt=-mxsave --copt=-mxsaveopt --copt=-mavx512f --copt=-mno-avx512er
build:intel --copt=-mavx512cd --copt=-mno-avx512pf --copt=-mno-prefetchwt1
build:intel --copt=-mno-clflushopt --copt=-mxsavec --copt=-mxsaves
build:intel --copt=-mavx512dq --copt=-mavx512bw --copt=-mavx512vl --copt=-mno-avx512ifma
build:intel --copt=-mno-avx512vbmi --copt=-mno-avx5124fmaps --copt=-mno-avx5124vnniw
build:intel --copt=-mno-clwb --copt=-mno-mwaitx --copt=-mno-clzero --copt=-mno-pku
build:intel --copt=-mno-rdpid --copt=-mno-gfni --copt=-mno-shstk --copt=-mno-avx512vbmi2
build:intel --copt=-mavx512vnni --copt=-mno-vaes --copt=-mno-vpclmulqdq --copt=-mno-avx512bitalg
build:intel --copt=-mno-movdiri --copt=-mno-movdir64b --copt=-mtune=skylake-avx512

Compute‑level optimizations:

Memory page‑fault reduction using jemalloc and transparent huge pages.

Custom lambda data structures to avoid fragmentation.

Bypassing TensorFlow Serving serialization to cut latency by >10 % in ranking workloads.

Multi‑stream and multi‑context support, eliminating mutex bottlenecks and raising GPU utilization above 90 %.

CUDA MPS for kernel multiplexing.

Operator/kernel fusion (both hand‑written and compiler‑generated) to better exploit CPU caches and GPU shared memory.
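The payoff of the kernel fusion mentioned above is mostly memory traffic: unfused ops each read and write a full intermediate buffer, while a fused kernel keeps intermediates in registers or shared memory. A pure-Python illustration of the idea (not actual CUDA; the scale/bias/ReLU chain is a made-up example):

```python
def unfused(xs, a, b):
    """Three separate passes: each materializes a full intermediate buffer,
    the way three un-fused kernels would round-trip through device memory."""
    t1 = [a * x for x in xs]          # kernel 1: scale
    t2 = [t + b for t in t1]          # kernel 2: bias
    return [max(t, 0.0) for t in t2]  # kernel 3: ReLU

def fused(xs, a, b):
    """One pass: scale, bias, and ReLU per element, no intermediates."""
    return [max(a * x + b, 0.0) for x in xs]
```

Both produce identical results; the fused form does one read and one write per element instead of three of each, which is exactly what hand-written or compiler-generated fusion buys on a GPU.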

Avoiding compute waste:

Pre‑computing heavy user‑side calculations and moving them to the recall stage.

Graph freeze to convert variable ops to constants, reducing GPU usage by ~12 %.

Batch‑level merging of identical user computations.

Splitting CPU/GPU ops to keep data on the GPU and minimize transfers.

BatchNorm & MLP merging to reduce kernel launches.
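The batch-level merging of identical user computations listed above can be sketched as a dedupe-then-scatter step: compute once per distinct key in the batch, then fan the result back out to every request that shares it. A hypothetical sketch (the key and compute function are stand-ins, not the production interface):

```python
def dedup_batch(requests, compute):
    """Run `compute` once per distinct key and fan results back out.

    `requests` is a list of hashable keys (e.g. user-feature signatures
    within one batch); identical keys share a single computation."""
    cache = {}
    out = []
    for key in requests:
        if key not in cache:
            cache[key] = compute(key)
        out.append(cache[key])
    return out
```

When the same user appears many times in one ranking batch (once per candidate item), the heavy user-side tower runs once instead of once per candidate.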

Dynamic compute scaling: Automatic degradation based on real‑time load keeps resource usage high while preventing overload; it is applied across ranking, search, and other core services.
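One common shape for such load-based degradation is a tier selector: map real-time utilization to a degradation level, and let each level turn off progressively more compute. A minimal sketch with illustrative thresholds (not Xiaohongshu's production values):

```python
def degrade_level(utilization, thresholds=(0.6, 0.8, 0.9)):
    """Map real-time resource utilization (0.0-1.0) to a degradation level:
    0 = full model; higher levels run progressively cheaper computation
    (e.g. shorter behavior sequences, skipped auxiliary towers)."""
    level = 0
    for t in thresholds:
        if utilization >= t:
            level += 1
    return level
```

Because the selector is evaluated per request against live metrics, the cluster can be run close to its utilization ceiling while traffic spikes shed compute instead of shedding requests.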

Hardware upgrades: Switching from T4 to A10 GPUs (~1.5× the performance) and to newer CPUs further boosted throughput.

Graph optimization using BladeDISC (an MLIR‑based dynamic‑shape compiler) added ~20 % QPS in single‑node inference tests.

Precision tuning: FP16 quantization in MLP layers reduced GPU usage by ~13 %; both white‑box (manual layer selection) and black‑box (threshold‑based) approaches were evaluated.
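The precision trade-off behind FP16 can be seen by round-tripping values through IEEE-754 half precision, which Python's `struct` module supports via the `e` format. An illustrative sketch (this shows the numeric effect only, not the production quantizer):

```python
import struct

def to_fp16(x):
    """Round-trip a Python float through IEEE-754 half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# FP16 keeps roughly 3 decimal digits of precision and tops out at 65504,
# so well-scaled MLP weights and activations survive the cast with only
# small rounding error, e.g. to_fp16(0.1) is close to but not exactly 0.1.
```

That bounded rounding error is why the white-box approach hand-picks which layers to cast while the black-box approach checks an output-difference threshold before accepting the quantized graph.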

From 2021 to the end of 2022, these combined efforts increased inference compute capacity by ~30×, improved key user metrics by >10 %, and saved >50 % of cluster resources. The case demonstrates a systematic, business‑driven AI engineering path that balances innovation, cost, and sustainability.

Tags: deep learning, Large Models, GPU optimization, hardware acceleration, Machine Learning Inference, system performance