GPU‑Driven Model Service and Optimization Practices in Xiaohongshu's Search Scenario
This article details Xiaohongshu's end‑to‑end GPU‑centric transformation for search‑related machine‑learning models, covering model characteristics, training and inference frameworks, system‑level GPU and CPU optimizations, multi‑card and compilation techniques, and future directions for scaling large sparse and dense models.
The presentation begins with an overview of the rapid growth of machine‑learning workloads in video, image, text, and search domains, highlighting the mismatch between CPU scaling and the computational demands of large models, which motivates a full GPU migration for Xiaohongshu's recommendation, advertising, and search models.
Background information describes Xiaohongshu's application landscape—homepage recommendation and search pages—including the use of CTR, CVR, and relevance models, and quantifies the increase in FLOPs and parameter counts from 2021 to 2022.
In the model‑service section, the authors discuss the characteristics of sparse versus dense models, the evolution of training and inference frameworks from TensorFlow Serving to a custom Lambda Service, and the adoption of PyTorch‑based stacks for CV/NLP models.
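To make the sparse‑versus‑dense distinction concrete, here is a minimal pure‑Python sketch of a CTR‑style model (names and numbers are illustrative, not Xiaohongshu's code): sparse categorical features hit large embedding tables and are memory‑bound, while the dense tower is matrix math and GPU‑friendly.

```python
import math

EMBED_DIM = 4

# Toy embedding tables, one per sparse feature field. In production these
# tables hold billions of rows and dominate memory, not compute.
embedding_tables = {
    "user_id": {7: [0.1] * EMBED_DIM, 42: [0.2] * EMBED_DIM},
    "item_id": {3: [0.3] * EMBED_DIM},
}

def lookup(field, feature_id):
    """Sparse part: fetch (or default) an embedding vector by id."""
    return embedding_tables[field].get(feature_id, [0.0] * EMBED_DIM)

def dense_tower(x, weights, bias):
    """Dense part: one linear layer + sigmoid, standing in for the MLP."""
    z = sum(w * v for w, v in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def predict_ctr(user_id, item_id):
    # Concatenate sparse embeddings, then run the dense tower.
    x = lookup("user_id", user_id) + lookup("item_id", item_id)
    weights = [0.5] * len(x)
    return dense_tower(x, weights, bias=0.0)
```

In a real serving stack the lookups run against a distributed parameter server or GPU‑resident hash table, while the dense tower maps directly onto GPU kernels.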
The GPU optimization practices are divided into system, compute, and training optimizations. System optimizations address physical‑machine tuning, interrupt isolation, kernel version upgrades, and instruction pass‑through, achieving 1‑2% performance gains.
Compute optimizations tackle multi‑card scheduling, NUMA binding, compilation‑level instruction selection, kernel fusion, redundant‑computation elimination, and hardware upgrades, collectively delivering up to 10% or more performance improvements.
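Two of the compute optimizations named above, kernel fusion and redundant‑computation elimination, can be sketched in a few lines (illustrative only; the article's actual kernels are GPU code, where fusion mainly saves memory traffic on intermediates):

```python
def scale_then_add_unfused(xs, scale, bias):
    # Two passes: the intermediate list is written out and read back,
    # analogous to two GPU kernel launches with a round trip through memory.
    scaled = [x * scale for x in xs]
    return [s + bias for s in scaled]

def scale_then_add_fused(xs, scale, bias):
    # One fused pass: the intermediate never materializes.
    return [x * scale + bias for x in xs]

def expensive(x):
    # Stand-in for a costly shared subexpression.
    return x ** 2

def two_outputs(x):
    # Redundant-computation elimination: compute the shared term once
    # instead of re-evaluating it inside each consumer.
    shared = expensive(x)
    return shared + 1, shared * 2
```

Both variants return identical results; the fused form simply does less memory movement, which is typically the bottleneck for elementwise GPU kernels.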
Training optimizations focus on data layout transformation (row‑to‑column), prefetching, asynchronous gradient updates, and pipeline enhancements to reduce CPU bottlenecks and improve GPU utilization.
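The row‑to‑column transformation and prefetching steps above can be sketched as follows (function and field names are our own, for illustration): columnar layout lets feature extraction read each field contiguously, and a bounded background queue keeps the GPU consumer from waiting on data preparation.

```python
import queue
import threading

def rows_to_columns(rows):
    """Row-to-column layout: one contiguous list per feature field."""
    return {key: [row[key] for row in rows] for key in rows[0]}

def prefetch(batches, depth=2):
    """Yield batches while a background thread prepares the next ones."""
    q = queue.Queue(maxsize=depth)
    SENTINEL = object()

    def producer():
        for batch in batches:
            q.put(batch)          # blocks when `depth` batches are queued
        q.put(SENTINEL)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = q.get()
        if batch is SENTINEL:
            break
        yield batch

rows = [
    {"user_id": 42, "item_id": 3, "label": 1.0},
    {"user_id": 7,  "item_id": 3, "label": 0.0},
]
columns = rows_to_columns(rows)
```

In the real pipeline the producer side would run feature extraction on CPU while the consumer side feeds the GPU, overlapping the two stages.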
Future work outlines plans for scaling sparse large‑model training with HPC‑style single‑machine setups, multi‑GPU high‑speed interconnects, and continued hardware upgrades, as well as inference enhancements such as hash‑based caching, model compression, and a drag‑and‑drop machine‑learning platform.
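The hash‑based caching idea for inference can be sketched like this (the key scheme and LRU eviction policy here are our assumptions, not details from the talk): identical feature payloads hash to the same key, so repeated requests skip the model entirely.

```python
import hashlib
from collections import OrderedDict

class HashCache:
    """Cache inference results keyed by a hash of the request features."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self._store = OrderedDict()

    @staticmethod
    def key(features):
        # Canonicalize the feature dict so equal payloads hash identically.
        payload = repr(sorted(features.items())).encode()
        return hashlib.md5(payload).hexdigest()

    def get_or_compute(self, features, model_fn):
        k = self.key(features)
        if k in self._store:
            self._store.move_to_end(k)       # LRU: mark as recently used
            return self._store[k]
        score = model_fn(features)
        self._store[k] = score
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used
        return score
```

In practice the cache would also need a TTL or model‑version component in the key, since scores go stale when the model is retrained.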
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.