GPU‑Driven Model Service and Optimization Practices in Xiaohongshu's Search Scenario
This article details Xiaohongshu's end‑to‑end GPU‑centric transformation for search‑related machine‑learning models, covering model characteristics, training and inference frameworks, system‑level GPU and CPU optimizations, multi‑card and compilation techniques, and future directions for scaling large sparse and dense models.
The presentation begins with an overview of the rapid growth of machine‑learning workloads in video, image, text, and search domains, highlighting the mismatch between CPU scaling and the computational demands of large models, which motivates a full GPU migration for Xiaohongshu's recommendation, advertising, and search models.
Background information describes Xiaohongshu's application landscape—homepage recommendation and search pages—including the use of CTR, CVR, and relevance models, and quantifies the increase in FLOPs and parameter counts from 2021 to 2022.
In the model‑service section, the authors discuss the characteristics of sparse versus dense models, the evolution of training and inference frameworks from TensorFlow Serving to a custom Lambda Service, and the adoption of PyTorch‑based stacks for CV/NLP models.
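To make the sparse‑versus‑dense distinction concrete, here is a minimal pure‑Python sketch of a CTR‑style model (names and numbers are illustrative, not Xiaohongshu's code): sparse categorical features hit large embedding tables and are memory‑bound, while the dense tower is matrix math and GPU‑friendly.

```python
import math

EMBED_DIM = 4

# Toy embedding tables, one per sparse feature field. In production these
# tables hold billions of rows and dominate memory, not compute.
embedding_tables = {
    "user_id": {7: [0.1] * EMBED_DIM, 42: [0.2] * EMBED_DIM},
    "item_id": {3: [0.3] * EMBED_DIM},
}

def lookup(field, feature_id):
    """Sparse part: fetch (or default) an embedding vector by id."""
    return embedding_tables[field].get(feature_id, [0.0] * EMBED_DIM)

def dense_tower(x, weights, bias):
    """Dense part: one linear layer + sigmoid, standing in for the MLP."""
    z = sum(w * v for w, v in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def predict_ctr(user_id, item_id):
    # Concatenate sparse embeddings, then run the dense tower.
    x = lookup("user_id", user_id) + lookup("item_id", item_id)
    weights = [0.5] * len(x)
    return dense_tower(x, weights, bias=0.0)
```

In a real serving stack the lookups run against a distributed parameter server or GPU‑resident hash table, while the dense tower maps directly onto GPU kernels.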
The GPU optimization practices are divided into system, compute, and training optimizations. System optimizations address physical‑machine tuning, interrupt isolation, kernel version upgrades, and instruction pass‑through, achieving 1‑2% performance gains.
Compute optimizations tackle multi‑card scheduling, NUMA binding, compilation‑level instruction selection, kernel fusion, redundant‑computation elimination, and hardware upgrades, collectively delivering up to 10% or more performance improvements.
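Two of the compute optimizations named above, kernel fusion and redundant‑computation elimination, can be sketched in a few lines (illustrative only; the article's actual kernels are GPU code, where fusion mainly saves memory traffic on intermediates):

```python
def scale_then_add_unfused(xs, scale, bias):
    # Two passes: the intermediate list is written out and read back,
    # analogous to two GPU kernel launches with a round trip through memory.
    scaled = [x * scale for x in xs]
    return [s + bias for s in scaled]

def scale_then_add_fused(xs, scale, bias):
    # One fused pass: the intermediate never materializes.
    return [x * scale + bias for x in xs]

def expensive(x):
    # Stand-in for a costly shared subexpression.
    return x ** 2

def two_outputs(x):
    # Redundant-computation elimination: compute the shared term once
    # instead of re-evaluating it inside each consumer.
    shared = expensive(x)
    return shared + 1, shared * 2
```

Both variants return identical results; the fused form simply does less memory movement, which is typically the bottleneck for elementwise GPU kernels.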
Training optimizations focus on data layout transformation (row‑to‑column), prefetching, asynchronous gradient updates, and pipeline enhancements to reduce CPU bottlenecks and improve GPU utilization.
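The row‑to‑column transformation and prefetching steps above can be sketched as follows (function and field names are our own, for illustration): columnar layout lets feature extraction read each field contiguously, and a bounded background queue keeps the GPU consumer from waiting on data preparation.

```python
import queue
import threading

def rows_to_columns(rows):
    """Row-to-column layout: one contiguous list per feature field."""
    return {key: [row[key] for row in rows] for key in rows[0]}

def prefetch(batches, depth=2):
    """Yield batches while a background thread prepares the next ones."""
    q = queue.Queue(maxsize=depth)
    SENTINEL = object()

    def producer():
        for batch in batches:
            q.put(batch)          # blocks when `depth` batches are queued
        q.put(SENTINEL)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = q.get()
        if batch is SENTINEL:
            break
        yield batch

rows = [
    {"user_id": 42, "item_id": 3, "label": 1.0},
    {"user_id": 7,  "item_id": 3, "label": 0.0},
]
columns = rows_to_columns(rows)
```

In the real pipeline the producer side would run feature extraction on CPU while the consumer side feeds the GPU, overlapping the two stages.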
Future work outlines plans for scaling sparse large‑model training with HPC‑style single‑machine setups, multi‑GPU high‑speed interconnects, and continued hardware upgrades, as well as inference enhancements such as hash‑based caching, model compression, and a drag‑and‑drop machine‑learning platform.
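The hash‑based caching idea for inference can be sketched like this (the key scheme and LRU eviction policy here are our assumptions, not details from the talk): identical feature payloads hash to the same key, so repeated requests skip the model entirely.

```python
import hashlib
from collections import OrderedDict

class HashCache:
    """Cache inference results keyed by a hash of the request features."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self._store = OrderedDict()

    @staticmethod
    def key(features):
        # Canonicalize the feature dict so equal payloads hash identically.
        payload = repr(sorted(features.items())).encode()
        return hashlib.md5(payload).hexdigest()

    def get_or_compute(self, features, model_fn):
        k = self.key(features)
        if k in self._store:
            self._store.move_to_end(k)       # LRU: mark as recently used
            return self._store[k]
        score = model_fn(features)
        self._store[k] = score
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used
        return score
```

In practice the cache would also need a TTL or model‑version component in the key, since scores go stale when the model is retrained.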
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.