HyFormer: Unified Sequence Modeling and Feature Interaction for Recommendations

HyFormer is a hybrid Transformer framework introduced by ByteDance's TikTok search team that unifies sequence modeling and feature interaction in a single alternating optimization process. By fusing the two stages, it improves representation power and scaling efficiency for ultra‑long user behavior sequences and high‑dimensional heterogeneous features, and delivers significant offline and online gains.


Background

Traditional ranking models use a two‑stage pipeline: a sequence model (e.g., DIN, LONGER) first compresses user behavior sequences into embeddings, then a feature‑interaction module (DCN, RankMixer) combines these embeddings with heterogeneous non‑temporal features. This late‑fusion design limits expressive power and scaling efficiency for ultra‑long sequences and high‑dimensional features.

Method

HyFormer introduces a hybrid Transformer that jointly performs sequence modeling and feature interaction through alternating Query Generation, Query Decoding, and Query Boosting modules.

Query Generation

All non‑sequence features (NS Tokens) are grouped by semantics. Each sequence type (e.g., search, feed, interaction) is pooled into a single token (Seq Token). NS Tokens are concatenated with the pooled Seq Tokens and passed through a lightweight MLP to produce an initial global query embedding Q₀ that contains full contextual information.
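
Below is a minimal PyTorch-style sketch of this step, assuming mean pooling per sequence and a flatten-then-MLP compression; the class name `QueryGeneration`, the tensor layouts, and the hidden dimensions are illustrative assumptions rather than the published implementation.

```python
import torch
import torch.nn as nn

class QueryGeneration(nn.Module):
    """Builds the initial global query Q0 from grouped non-sequence (NS) tokens
    and pooled summaries of each behavior sequence (illustrative sketch)."""

    def __init__(self, num_ns_tokens: int, num_seqs: int, d_model: int):
        super().__init__()
        in_dim = (num_ns_tokens + num_seqs) * d_model
        # Lightweight MLP that compresses all contextual tokens into one query.
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 4 * d_model),
            nn.SiLU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, ns_tokens, seqs):
        # ns_tokens: [B, num_ns_tokens, d_model] grouped non-sequence features
        # seqs: list of [B, len_i, d_model], one tensor per sequence type
        seq_tokens = [s.mean(dim=1, keepdim=True) for s in seqs]  # pool each sequence
        context = torch.cat([ns_tokens] + seq_tokens, dim=1)       # [B, T, d_model]
        q0 = self.mlp(context.flatten(start_dim=1))                # [B, d_model]
        return q0.unsqueeze(1)                                     # [B, 1, d_model]
```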

Query Decoding

At each layer, the query Q is refined via cross‑attention, with the sequence elements serving as keys and values. Three encoding options for the sequence are supported:

Full Transformer encoding – self‑attention over the entire sequence.

LONGER‑style efficient encoding – prepend a small set of additional tokens and use cross‑attention.

Decoder‑style lightweight encoding – direct linear projection (e.g., SwiGLU) to obtain K/V.

Cross‑attention updates the query: Q' = CrossAttn(Q, K, V). The updated query now carries sequence‑aware information and is passed to the next module.
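
A hedged sketch of one decoding step is shown below, using the decoder-style lightweight option (a SwiGLU-like gated projection for K/V) and standard multi-head cross-attention; swapping in the full-Transformer or LONGER-style encoders would only change how K and V are produced. The module names and the residual/LayerNorm placement are assumptions, not the paper's exact design.

```python
import torch.nn as nn
import torch.nn.functional as F

class QueryDecoding(nn.Module):
    """One cross-attention refinement step: the global query attends over a
    behavior sequence. Uses the decoder-style lightweight encoding; the other
    two encoding options would replace the K/V projection (illustrative sketch)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        # SwiGLU-like gated projection producing K and V from the raw sequence.
        self.kv_gate = nn.Linear(d_model, 2 * d_model)
        self.kv_up = nn.Linear(d_model, 2 * d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, q, seq):
        # q:   [B, 1, d_model]  current global query
        # seq: [B, L, d_model]  one behavior sequence
        kv = F.silu(self.kv_gate(seq)) * self.kv_up(seq)   # [B, L, 2*d_model]
        k, v = kv.chunk(2, dim=-1)
        attended, _ = self.cross_attn(q, k, v)             # Q' = CrossAttn(Q, K, V)
        return self.norm(q + attended)                     # residual keeps prior context
```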

Query Boosting

The decoded query Q' is concatenated with all NS Tokens. An MLP‑Mixer‑based block performs a mix‑up followed by per‑token feed‑forward (FFN) or Sparse Mixture‑of‑Experts (SMoE) layers. Residual connections preserve the original semantics while incrementally enriching the representation, effectively fusing decoded sequence information with heterogeneous features.

HyFormer Layer

Each HyFormer layer consists of a Query Decoding block followed by a Query Boosting block. Stacking L layers yields a sequence of queries Q⁰,…,Qᴸ, where each layer iteratively refines the global token using the latest cross‑attention output. The final query representation is fed to a downstream MLP for prediction.
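
Putting the pieces together, a hedged sketch of one layer and the stacked model (reusing the `QueryGeneration`, `QueryDecoding`, and `QueryBoosting` sketches above) might look like the following; how the query visits each sequence type inside a layer and the shape of the prediction head are assumptions for illustration.

```python
import torch.nn as nn

class HyFormerLayer(nn.Module):
    """One layer: Query Decoding over each sequence, then Query Boosting with NS tokens."""

    def __init__(self, num_ns_tokens: int, d_model: int, n_heads: int):
        super().__init__()
        self.decode = QueryDecoding(d_model, n_heads)
        self.boost = QueryBoosting(num_ns_tokens + 1, d_model)

    def forward(self, q, seqs, ns_tokens):
        for seq in seqs:                      # refine the query on each sequence type
            q = self.decode(q, seq)
        q, ns_tokens = self.boost(q, ns_tokens)
        return q, ns_tokens


class HyFormer(nn.Module):
    """Stack of L HyFormer layers producing the final query for the prediction MLP."""

    def __init__(self, num_layers, num_ns_tokens, num_seqs, d_model, n_heads):
        super().__init__()
        self.query_gen = QueryGeneration(num_ns_tokens, num_seqs, d_model)
        self.layers = nn.ModuleList(
            HyFormerLayer(num_ns_tokens, d_model, n_heads) for _ in range(num_layers)
        )
        self.head = nn.Sequential(nn.Linear(d_model, d_model), nn.SiLU(), nn.Linear(d_model, 1))

    def forward(self, ns_tokens, seqs):
        q = self.query_gen(ns_tokens, seqs)       # Q0
        for layer in self.layers:                 # Q1 ... QL
            q, ns_tokens = layer(q, seqs, ns_tokens)
        return self.head(q.squeeze(1))            # downstream prediction
```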

Experiments

Effectiveness

Experiments were conducted on TikTok search with three sequence types (a long behavior sequence of up to 3,000 items, a top‑50 search sequence, and a top‑50 feed sequence) on a 64‑GPU cluster. HyFormer outperformed two‑stage baselines (separate sequence modeling plus feature crossing) and other unified models, achieving higher ranking metrics at comparable FLOPs.

Ablation Study

Removing any of the three query components—global non‑sequence features, pooled sequence token, or target‑item information—degraded performance. Excluding the global token from the Query Boosting block also reduced results. Merging multiple sequences into a single stream hurt effectiveness, confirming the benefit of independent multi‑sequence modeling.

Scaling Analysis

With LONGER + RankMixer as the baseline, HyFormer maintained its advantage as model parameters and compute were scaled up, and across varying sequence lengths.

Online A/B Test

In the production TikTok search scenario, HyFormer increased average watch time per user, completion rate, and query‑switch rate.

Conclusion

HyFormer is the first industrial framework that unifies sequence modeling and feature interaction through alternating updates. By iteratively decoding long sequences with a global token and enriching that token via Mixer‑based feature fusion, the architecture provides a scalable solution for ultra‑long sequence representation and heterogeneous feature fusion.

Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
