
Beyond Dual‑Tower: Advanced Distillation and Interaction Techniques for Recommendation Systems

This article reviews recent advances that enhance dual‑tower recommendation models by injecting interaction information through knowledge‑distillation strategies and interaction‑enhanced architectures. It summarizes PFD, ENDX, TRMD, VIRT, Distilled‑DualEncoder, ERNIE‑Search, ColBERT, IntTower and MVKE.

NewBeeNLP

Feb 12, 2024

Optimization Overview

The dual‑tower (bi‑encoder) architecture is widely adopted in large‑scale recommendation and text matching because the two towers encode users (or queries) and items independently, so item embeddings can be precomputed and scoring reduces to a fast inner product. Classic industrial systems include Microsoft DSSM, Google YouTubeDNN, and Airbnb’s personalized user embeddings. The price of that independence is that user and item features never interact before the final score. As gains from scaling pure dual‑tower models diminish, recent research focuses on closing the gap to interaction‑rich models through knowledge distillation and interaction‑enhanced designs.
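As a minimal sketch of why dual‑tower inference is cheap, the snippet below ranks items by inner product against embeddings that are assumed to have been computed offline. The names item_index, user_vec and top_k are hypothetical and the vectors are toy values, not outputs of any real model.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k(user_vec, item_index, k=2):
    """Rank precomputed item embeddings by inner product with the user vector."""
    scored = [(item_id, dot(user_vec, vec)) for item_id, vec in item_index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical item-tower outputs, computed once and stored in an index.
item_index = {
    "item_a": [1.0, 0.0],
    "item_b": [0.5, 0.5],
    "item_c": [0.0, 1.0],
}
user_vec = [0.9, 0.1]  # hypothetical user-tower output, computed per request

ranked = top_k(user_vec, item_index)
```

Only the user tower runs at request time; in production the exhaustive loop would typically be replaced by an approximate nearest‑neighbor index.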

Knowledge Distillation Techniques

PFD (Privileged Features Distillation)

During offline training, a teacher model receives privileged features that are unavailable at inference time, such as recent click‑category statistics and post‑click dwell time. The teacher and student share the embeddings of the regular (non‑privileged) features and are trained jointly. Once the teacher’s predictions have stabilized, an auxiliary distillation loss (e.g., KL‑divergence) is added that pulls the student’s output distribution toward the teacher’s predictions, while the student continues to optimize the original supervised loss.
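The student objective can be sketched for a single binary (click / no‑click) example as follows. This is an illustration under assumptions, not the paper's exact formulation: the function names are hypothetical, the distillation term is a Bernoulli KL‑divergence, and distill_weight stands in for whatever schedule keeps the term switched off until the teacher is stable.

```python
import math


def _clip(p, eps=1e-7):
    """Keep probabilities away from 0 and 1 so the logarithms stay finite."""
    return min(max(p, eps), 1.0 - eps)


def bce(label, prob):
    """Binary cross-entropy between a 0/1 label and a predicted probability."""
    prob = _clip(prob)
    return -(label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob))


def bernoulli_kl(teacher_prob, student_prob):
    """KL(teacher || student) for two Bernoulli click probabilities."""
    t, s = _clip(teacher_prob), _clip(student_prob)
    return t * math.log(t / s) + (1.0 - t) * math.log((1.0 - t) / (1.0 - s))


def pfd_student_loss(label, student_prob, teacher_prob, distill_weight):
    """Supervised loss plus a weighted term pulling the student toward the teacher.

    distill_weight is assumed to be 0 early in training and raised later.
    """
    hard = bce(label, student_prob)
    soft = bernoulli_kl(teacher_prob, student_prob)
    return hard + distill_weight * soft
```

With distill_weight set to 0 the student trains on labels alone; raising it later adds the teacher's signal without changing the supervised term.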

ENDX

The teacher generates query and answer embeddings of the same dimensionality as the student’s, providing richer supervision than logits alone. Because the teacher and student differ structurally, ENDX does not match hidden states directly. Instead it distills logits and aligns the two vector spaces with a Geometry Alignment Mechanism (GAM):

Model similarity between two vectors as a conditional probability; higher similarity yields larger probability.

For a batch, compute the probability distribution of similarities for teacher and student separately.

Measure distribution closeness with KL‑divergence.

Combine four pairwise probabilities, P(answer|query), P(answer|answer), P(query|answer) and P(query|query), into a single auxiliary loss.
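The four GAM steps can be sketched roughly as below. Several details are assumptions made for illustration: similarity is a plain inner product, the conditional probability is a softmax over the batch, self‑pairs are kept in the answer|answer and query|query distributions, and the four terms are averaged with equal weight. The function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_distributions(anchors, candidates):
    """Row i holds P(candidate_j | anchor_i): a softmax over in-batch similarities."""
    return [softmax([dot(a, c) for c in candidates]) for a in anchors]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def gam_loss(teacher_q, teacher_a, student_q, student_a):
    """Mean KL(teacher || student) over the four pairings listed above."""
    pairings = [
        (teacher_q, teacher_a, student_q, student_a),  # P(answer | query)
        (teacher_a, teacher_a, student_a, student_a),  # P(answer | answer)
        (teacher_a, teacher_q, student_a, student_q),  # P(query | answer)
        (teacher_q, teacher_q, student_q, student_q),  # P(query | query)
    ]
    total, rows = 0.0, 0
    for t_anchor, t_cand, s_anchor, s_cand in pairings:
        t_dist = similarity_distributions(t_anchor, t_cand)
        s_dist = similarity_distributions(s_anchor, s_cand)
        for t_row, s_row in zip(t_dist, s_dist):
            total += kl(t_row, s_row)
            rows += 1
    return total / rows
```

Because only similarity distributions are compared, the student's space is free to differ from the teacher's by any transformation that preserves the in‑batch geometry.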

TRMD (Two‑Ranker Multi‑Teacher Distillation)

Two frozen teachers are used: a cross‑encoder (provides CLS‑level representations) and a bi‑encoder such as ColBERT (provides all token‑level representations, denoted REP). The student learns to mimic both teachers: it aligns its CLS output with the cross‑encoder and its token‑level outputs with the bi‑encoder. Final predictions are obtained by summing the scores from the two teacher‑derived heads.
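A rough sketch of the two alignment terms and the summed score follows. The choice of mean squared error, the equal weighting of the two terms, and all function names are assumptions for illustration; the paragraph above only states that the student mimics both teachers and that the two scores are summed.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def trmd_distill_loss(student_cls, teacher_cls, student_tokens, teacher_tokens):
    """CLS alignment with the cross-encoder plus token alignment with the bi-encoder."""
    cls_term = mse(student_cls, teacher_cls)
    token_term = sum(
        mse(s, t) for s, t in zip(student_tokens, teacher_tokens)
    ) / len(teacher_tokens)
    return cls_term + token_term


def trmd_score(cls_head_score, token_head_score):
    """Final relevance: the sum of the scores from the two student heads."""
    return cls_head_score + token_head_score
```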

VIRT (Virtual Interaction)

The teacher’s transformer encoder yields query‑ and document‑side Q, K, V matrices for every token, encoding cross‑attention information. VIRT distills the teacher’s cross‑attention matrices into the student via auxiliary losses that penalize the L2 distance between corresponding attention scores. Additionally, an attention‑based weighting is applied to the student’s final‑layer representations to inject the missing interaction signals.
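The attention‑distillation term can be sketched as below. It assumes the student builds its "virtual" cross‑attention from independently encoded query and document token vectors using scaled dot‑product attention, with no separate query/key projections; the teacher attention blocks are taken as given inputs. The function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys):
    """Scaled dot-product attention weights from each query token to each key token."""
    scale = math.sqrt(len(keys[0]))
    return [softmax([dot(q, k) / scale for k in keys]) for q in queries]


def l2_distance(a, b):
    """L2 (Frobenius) distance between two same-shaped matrices."""
    return math.sqrt(
        sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    )


def virt_loss(teacher_q2d, teacher_d2q, student_query_tokens, student_doc_tokens):
    """Gap between the teacher's cross-attention and the student's virtual one."""
    s_q2d = cross_attention(student_query_tokens, student_doc_tokens)
    s_d2q = cross_attention(student_doc_tokens, student_query_tokens)
    return l2_distance(teacher_q2d, s_q2d) + l2_distance(teacher_d2q, s_d2q)
```

The virtual attention is computed only during training, so the student's inference path stays a plain dual encoder.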

Distilled‑DualEncoder

Similar to VIRT, but replaces the L2 auxiliary loss with KL‑divergence and adds a soft‑label distillation term that aligns the student’s prediction distribution with the teacher’s.
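The two KL‑based terms might look like the sketch below. The temperature parameter, the row‑wise averaging of the attention term, and the function names are assumptions added for illustration.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def attention_kl(teacher_attn, student_attn):
    """Mean row-wise KL between teacher and student attention distributions."""
    rows = list(zip(teacher_attn, student_attn))
    return sum(kl(t, s) for t, s in rows) / len(rows)


def soft_label_kl(teacher_logits, student_logits, temperature=1.0):
    """KL between temperature-softened teacher and student predictions."""
    t = softmax([z / temperature for z in teacher_logits])
    s = softmax([z / temperature for z in student_logits])
    return kl(t, s)
```

Unlike an L2 penalty, the KL term treats each attention row as a probability distribution, so it weights mismatches by how much mass the teacher places on each token.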

ERNIE‑Search

Proposes a cascade distillation paradigm:

Distill a cross‑encoder into a late‑interaction model (ColBERT) to obtain a richer intermediate teacher.

Distill the intermediate late‑interaction model into the final dual‑encoder student.

During the final stage, combine the original supervised loss with four auxiliary losses (logit alignment, representation alignment, attention‑level distillation, and KL‑divergence between probability distributions).

The resulting student remains a pure bi‑encoder, requiring only a single inner‑product at inference.

Interaction‑Enhanced Architectures

ColBERT

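Introduces late interaction: query and document tokens are encoded independently, then each query token is compared with every document token. For each query token the maximum similarity over the document tokens is kept (MaxSim), and the sum of these maxima forms the final relevance score. The scoring cost grows as len(query) × len(document), but document token embeddings can still be precomputed offline, so the approach retains much of a bi‑encoder’s efficiency while capturing fine‑grained token interactions.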
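In the ColBERT paper the late‑interaction score is MaxSim: for every query token, take the best‑matching document token, then sum over query tokens. A minimal sketch with toy token vectors follows; the vectors are made up, and a real system would use normalized BERT token embeddings.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_tokens, doc_tokens):
    """Sum over query tokens of the highest similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy token embeddings (hypothetical values).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]

score = maxsim_score(query_tokens, doc_tokens)  # 1.0 + 2.0 = 3.0
```

The nested loop makes the len(query) × len(document) cost explicit, while doc_tokens can be computed and stored ahead of time.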

IntTower

The item‑side final hidden vector interacts with multiple user‑side hidden layers. The interaction proceeds in three steps:

Map user and item hidden layers into M sub‑spaces (producing M vectors per layer).

Compute the inner product between each user sub‑space vector and each item sub‑space vector, taking the maximum per user layer.

Sum the maxima across all user layers to obtain the final score.
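One possible reading of the three steps is sketched below. The inputs are assumed to be already projected into sub‑spaces. The within‑layer reduction is an assumption: each user sub‑vector keeps its best‑matching item sub‑vector and those maxima are added up for the layer, which is then summed across layers. The paper's exact reduction and any learned layer weighting may differ.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def layer_score(user_subvecs, item_subvecs):
    """Per user layer: best item match for each user sub-vector, then summed."""
    return sum(max(dot(u, v) for v in item_subvecs) for u in user_subvecs)


def inttower_score(user_layers, item_subvecs):
    """Sum the per-layer scores across all user-side hidden layers."""
    return sum(layer_score(layer, item_subvecs) for layer in user_layers)


# Two user layers, each projected into M = 2 sub-space vectors (toy values).
user_layers = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 0.0]],
]
# Final item layer projected into M = 2 sub-space vectors (toy values).
item_subvecs = [[1.0, 1.0], [0.0, 3.0]]

score = inttower_score(user_layers, item_subvecs)  # (1 + 3) + (2 + 0) = 6.0
```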

An auxiliary self‑supervised loss L_cir is constructed from positive‑negative sample pairs to further refine similarity modeling (InfoNCE style).
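For reference, a generic InfoNCE loss for one positive pair and a set of negatives can be sketched as follows. This is the standard form of the loss rather than IntTower's exact L_cir; the temperature default and the function name are assumptions.

```python
import math


def info_nce(pos_score, neg_scores, temperature=1.0):
    """-log( exp(s+/t) / (exp(s+/t) + sum_j exp(s_j-/t)) ), computed stably."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[0]
```

The loss shrinks as the positive pair's similarity rises above the negatives', which is the behavior the auxiliary term is meant to encourage.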

MVKE (Mixture of Virtual‑Kernel Experts)

The user side generates multiple Virtual‑Kernel Experts (VK‑Experts), each representing a distinct interest. An item‑side tag vector acts as a query that attends over the user‑side VK‑Expert outputs, producing an item‑related user representation. The final relevance score is the inner product between this representation and the item embedding. With K VK‑Experts, the interaction costs K attention scores plus one final inner product, i.e., K + 1 inner products.
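The scoring path described above can be sketched as below. It assumes the attention weights are a softmax over inner products between the tag vector and each expert output, and that the expert outputs double as keys and values; the function names and toy vectors are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mvke_score(tag_vec, expert_outputs, item_vec):
    """Attend over expert outputs with the tag vector, then score against the item."""
    # One inner product per VK-Expert output.
    weights = softmax([dot(tag_vec, e) for e in expert_outputs])
    dim = len(expert_outputs[0])
    # Item-related user representation: attention-weighted sum of expert outputs.
    user_repr = [
        sum(w * e[i] for w, e in zip(weights, expert_outputs)) for i in range(dim)
    ]
    # One final inner product with the item embedding.
    return dot(user_repr, item_vec)


expert_outputs = [[1.0, 0.0], [0.0, 1.0]]  # toy outputs of two VK-Experts
score = mvke_score([0.0, 0.0], expert_outputs, [2.0, 4.0])  # uniform weights
```

With a zero tag vector both experts get weight 0.5, so the user representation is their average; a tag aligned with one expert shifts the score toward that expert's view of the user.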

References

Huang et al., “Learning Deep Structured Semantic Models for Web Search Using Clickthrough Data,” CIKM 2013 (DSSM).

Covington et al., “Deep Neural Networks for YouTube Recommendations,” RecSys 2016 (YouTubeDNN).

Grbovic et al., “Real‑time Personalization Using Embeddings for Search Ranking at Airbnb,” KDD 2018.

Xu et al., “Privileged Features Distillation at Taobao Recommendations,” KDD 2020 (PFD).

Wang et al., “Enhancing Dual‑Encoders with Question and Answer Cross‑Embeddings for Answer Retrieval,” arXiv 2022 (ENDX).

Choi et al., “Improving Bi‑Encoder Document Ranking Models with Two Rankers and Multi‑Teacher Distillation,” SIGIR 2021 (TRMD).

Li et al., “VIRT: Improving Representation‑Based Models for Text Matching through Virtual Interaction,” arXiv 2021.

Wang et al., “Distilled Dual‑Encoder Model for Vision‑Language Understanding,” arXiv 2021.

Lu et al., “ERNIE‑Search: Bridging Cross‑Encoder with Dual‑Encoder via Self On‑the‑fly Distillation for Dense Passage Retrieval,” arXiv 2022.

Khattab and Zaharia, “ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT,” SIGIR 2020.

IntTower: “The Next Generation of Two‑Tower Model for Pre‑Ranking System.”

Xu et al., “Mixture of Virtual‑Kernel Experts for Multi‑Objective User Profile Modeling,” KDD 2022 (MVKE).
