
SHARK: Efficient Embedding Compression for Large-Scale Recommendation Models

The paper introduces SHARK, a two‑component framework that uses a fast Taylor‑expanded permutation method to prune embedding tables and a frequency‑aware quantization scheme to apply mixed‑precision to embeddings, achieving up to 70% memory reduction and 30% QPS improvement in industrial short‑video and e‑commerce recommendation systems.

Kuaishou Tech

Building effective recommendation models is crucial for online services such as e‑commerce and short‑video platforms, but the embedding layer often dominates storage, reaching terabyte scale and hindering further iteration.

To reduce resource consumption while preserving model performance, the authors propose SHARK, which consists of two key components: (1) Fast‑Permutation (F‑Permutation), a Taylor‑expansion approximation of the permutation importance metric that dramatically lowers the computational cost of feature‑field evaluation, enabling aggressive pruning of embedding tables; and (2) Frequency‑Quantization (F‑Quantization), a novel quantization strategy that assigns different low‑precision formats (e.g., int8, fp16) to embeddings based on their access frequency and importance scores.
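To make the F‑Permutation idea concrete, the sketch below (not the paper's implementation; function and variable names are hypothetical) shows how a first‑order Taylor expansion turns permutation importance into a cheap dot product. Rather than re‑running the model once per feature field with that field's values shuffled, the loss change is approximated as the gradient with respect to the field's embeddings times the change in those embeddings, so all fields can be scored from the gradients of a single backward pass.

```python
import numpy as np

rng = np.random.default_rng(0)

def taylor_permutation_importance(embeddings, grads):
    """First-order Taylor sketch of permutation importance.

    embeddings: dict field -> (batch, dim) embedding lookups
    grads:      dict field -> (batch, dim) dL/d(embedding) from one backward pass

    Exact permutation importance needs one extra forward pass per field.
    The Taylor approximation L(e_perm) - L(e) ~= g . (e_perm - e) reuses
    gradients already computed for training, so every field is scored
    in parallel from the same batch.
    """
    scores = {}
    for field, e in embeddings.items():
        g = grads[field]
        perm = rng.permutation(e.shape[0])
        e_perm = e[perm]  # shuffle this field's embeddings across the batch
        # mean absolute first-order loss change when the field is permuted
        scores[field] = float(np.abs((g * (e_perm - e)).sum(axis=1)).mean())
    return scores
```

Fields with scores near zero contribute little to the loss and are candidates for pruning; the exact thresholding policy is left to the training pipeline.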

The paper details the challenges of (a) needing a lightweight yet accurate feature‑importance estimator without adding extra parameters, and (b) handling the higher quantization error of frequently accessed embeddings. F‑Permutation evaluates each feature field by measuring performance drop after permuting its values, and the Taylor‑expanded approximation allows parallel evaluation across all fields. F‑Quantization scores embeddings, applies time‑decayed priority weights, and selects precision levels accordingly.
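The frequency‑and‑decay scoring described above can be sketched as follows. This is a minimal illustration, not the paper's exact scheme: the class name, decay constant, and precision thresholds are all assumptions. Each embedding row keeps a time‑decayed access score, and hot rows retain a higher‑precision format while cold rows drop to int8.

```python
import numpy as np

class FrequencyPrecisionRouter:
    """Sketch of frequency-aware mixed-precision assignment.

    Scores decay exponentially over time, so a row must be accessed
    recently and often to stay "hot"; stale rows fade toward int8.
    """

    def __init__(self, num_rows, decay=0.9, hot_threshold=5.0):
        self.scores = np.zeros(num_rows)
        self.decay = decay              # time-decayed priority weight
        self.hot_threshold = hot_threshold

    def record_batch(self, row_ids):
        # decay old scores, then credit rows touched in this batch
        self.scores *= self.decay
        np.add.at(self.scores, row_ids, 1.0)

    def precision(self, row_id):
        # frequently accessed embeddings keep fp16; the rest use int8
        return "fp16" if self.scores[row_id] >= self.hot_threshold else "int8"
```

In a real system the precision decision would also fold in an importance score (e.g., from F‑Permutation) and more than two precision levels; the two‑way split here only illustrates the routing logic.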

Extensive experiments on public benchmarks and Kuaishou’s industrial datasets answer three questions: (i) F‑Permutation outperforms prior feature‑selection methods; (ii) F‑Quantization surpasses existing quantization techniques; and (iii) the two methods are compatible, jointly reducing memory usage to about 30% of the baseline while improving AUC on tasks such as Like, Click, and Follow.

Online A/B tests on short‑video, e‑commerce, and advertising recommendation models show that SHARK compresses the embedding layer by 70% without degrading average watch time, and increases queries per second by 30%, saving thousands of machines.

In summary, SHARK provides a lightweight, effective solution for embedding‑layer compression, combining fast permutation‑based pruning with frequency‑aware mixed‑precision quantization, and demonstrates significant storage and latency gains in real‑world large‑scale recommendation systems.

Tags: efficiency, recommendation, quantization, embedding compression, model pruning, large-scale AI
Written by

Kuaishou Tech

Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
