
CUTLASS Extreme Performance Optimization and Its Application in Alibaba's Recommendation System

This GTC talk presents Alibaba Cloud's heterogeneous computing platform, introduces the Open Deep Learning API (ODLA), and details how CUTLASS‑based operator fusion dramatically accelerates the attention and MLP layers of large‑scale recommendation models, achieving multi‑fold performance gains in production.

Alibaba Cloud Infrastructure

Alibaba Cloud’s Zhenduan heterogeneous computing acceleration platform introduces the industry‑first Open Deep Learning API (ODLA), a unified interface for deep‑learning hardware that enables smooth migration of applications across heterogeneous resources. The platform’s innovations in heterogeneous compilation, architecture‑aware sparsity, and full‑stack auto‑tuning have delivered record‑breaking MLPerf results, processing 1.078 million images per second in 2021 and securing multiple inference performance firsts in subsequent MLPerf releases.

The upcoming GTC conference features a recommended talk titled “CUTLASS Extreme Performance Optimization and Its Application in Alibaba’s Recommendation System” (session code SE51305) scheduled for March 22, 4:30 PM–5:00 PM, presented by senior engineer Dong Jiying from Alibaba Cloud’s Infrastructure Server R&D division.

The abstract explains that large‑scale click‑through‑rate (CTR) and conversion‑rate (CVR) prediction models consist of embedding, attention, and MLP layers. In TensorFlow, the attention and MLP layers become performance bottlenecks due to the sheer number of operators and costly computations.
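To make the bottleneck concrete, here is a minimal NumPy sketch of the attention‑then‑MLP path described above. All shapes and names are illustrative assumptions, not Alibaba's actual model; the point is that even this toy version decomposes into many small operators (dot products, softmax, activations) around a handful of GEMMs.

```python
import numpy as np

def attention_mlp(queries, keys, values, w1, w2):
    """Toy attention + MLP path of a CTR model (illustrative only).

    queries: (B, D), keys/values: (B, T, D), w1: (D, H), w2: (H, 1)
    """
    # Attention: score each behavior item against the candidate item.
    scores = np.einsum("bd,btd->bt", queries, keys)           # (B, T)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)             # softmax
    pooled = np.einsum("bt,btd->bd", weights, values)         # (B, D)

    # MLP: two back-to-back GEMMs with a ReLU in between.
    hidden = np.maximum(pooled @ w1, 0.0)                     # (B, H)
    return hidden @ w2                                        # (B, 1)

rng = np.random.default_rng(0)
B, T, D, H = 4, 8, 16, 32
out = attention_mlp(rng.normal(size=(B, D)),
                    rng.normal(size=(B, T, D)),
                    rng.normal(size=(B, T, D)),
                    rng.normal(size=(D, H)),
                    rng.normal(size=(H, 1)))
print(out.shape)  # (4, 1)
```

In a framework graph, each of these lines typically becomes a separate kernel launch, which is what the fusion work in the talk targets.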

By leveraging NVIDIA’s open‑source CUTLASS framework for high‑performance general matrix multiplication (GEMM), the team fuses operators connected to GEMM into a single kernel, allowing the entire attention module to be collapsed into one operator. Similarly, back‑to‑back GEMMs in the MLP are merged vertically, and horizontally linked GEMMs are combined into a batch GEMM, yielding substantial speedups. These optimizations have been deployed on Alibaba’s prediction engine platform, dramatically improving inference performance and better utilizing hardware capabilities.
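The horizontal merge mentioned above can be sketched in NumPy (shapes are hypothetical): two GEMMs that share an input and have identical problem sizes can be folded into one batched GEMM, producing the same results while requiring only a single kernel launch on a GPU. CUTLASS exposes this pattern as a batched GEMM; the vertical back‑to‑back fusion additionally keeps intermediate results on chip, which this host‑side sketch cannot show.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(64, 128))      # shared activation feeding two branches
w_a = rng.normal(size=(128, 256))   # branch A weights
w_b = rng.normal(size=(128, 256))   # branch B weights

# Unfused: two separate GEMM launches.
out_a, out_b = x @ w_a, x @ w_b

# Fused: one batched GEMM over the stacked weights -- one launch
# covering both branches instead of two.
batched = np.stack([w_a, w_b])      # (2, 128, 256)
fused = x[None] @ batched           # broadcast x over the batch dim

# The batched result matches the two independent GEMMs.
assert np.allclose(fused[0], out_a) and np.allclose(fused[1], out_b)
```

The same shape constraint applies on the GPU: horizontally linked GEMMs must agree in their M/N/K dimensions before they can be batched this way.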

To attend the talk, users can search the GTC website for the session code SE51305 and click “Add to Schedule” or the star icon in the upper‑right corner to bookmark the session. The official GTC site is https://www.nvidia.cn/gtc-global/.

A note on the Zhenduan platform highlights its vODLA compute‑pooling technology, which virtualizes compute resources (vXPU) and provides intelligent slicing and scheduling, delivering a single‑node scale‑up experience without complex network configuration or increased power and cooling costs. The vODLA pool runs on standard servers with high‑performance networking and supports flexible multi‑accelerator interconnects beyond the typical 8‑GPU configuration.


Tags: Performance Optimization, Deep Learning, Recommendation Systems, GPU Computing, Heterogeneous Computing, CUTLASS