HyFormer: Unified Sequence Modeling and Feature Interaction for Recommendations
HyFormer is a novel hybrid Transformer framework from ByteDance's TikTok search team. It integrates sequence modeling and feature interaction into a unified alternating optimization process, improving representation power and scaling efficiency for ultra-long user behavior sequences and high-dimensional heterogeneous features, and delivers significant offline and online performance gains.
Background
Traditional ranking models use a two‑stage pipeline: a sequence model (e.g., DIN, LONGER) first compresses user behavior sequences into embeddings, then a feature‑interaction module (DCN, RankMixer) combines these embeddings with heterogeneous non‑temporal features. This late‑fusion design limits expressive power and scaling efficiency for ultra‑long sequences and high‑dimensional features.
Method
HyFormer introduces a hybrid Transformer that jointly performs sequence modeling and feature interaction through alternating Query Generation, Query Decoding, and Query Boosting modules.
Query Generation
All non‑sequence features (NS Tokens) are grouped by semantics. Each sequence type (e.g., search, feed, interaction) is pooled into a single token (Seq Token). NS Tokens are concatenated with the pooled Seq Tokens and passed through a lightweight MLP to produce an initial global query embedding Q₀ that contains full contextual information.
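A minimal NumPy sketch of this step, under assumed shapes (token count, embedding width, and the two-layer ReLU MLP are all illustrative, not the paper's actual configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32                      # embedding width (illustrative)
n_ns, n_seq_types = 6, 3    # non-sequence tokens; pooled sequence tokens

# Hypothetical inputs: grouped NS Tokens and one pooled Seq Token per
# sequence type (e.g., search, feed, interaction).
ns_tokens = rng.normal(size=(n_ns, d))
seq_tokens = rng.normal(size=(n_seq_types, d))

# Concatenate all tokens, then pass through a lightweight MLP to produce
# the initial global query embedding Q0.
x = np.concatenate([ns_tokens, seq_tokens], axis=0).reshape(-1)  # flat (9*d,)
w1 = rng.normal(size=(x.size, 64)) * 0.05
w2 = rng.normal(size=(64, d)) * 0.05
q0 = np.maximum(x @ w1, 0.0) @ w2   # ReLU MLP -> Q0, shape (d,)
```

The point is simply that Q₀ is a function of *all* contextual inputs at once, so later decoding layers start from a globally informed query rather than a blank token.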
Query Decoding
For each layer, the query Q is refined by cross‑attention with sequence elements represented as key/value pairs. Three encoding options for the sequence are supported:
Full Transformer encoding – self‑attention over the entire sequence.
LONGER‑style efficient encoding – prepend a small set of additional tokens and use cross‑attention.
Decoder‑style lightweight encoding – direct linear projection (e.g., SwiGLU) to obtain K/V.
Cross‑attention updates the query: Q' = CrossAttn(Q, K, V). The updated query now carries sequence‑aware information and is passed to the next module.
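The decoding update can be sketched as single-head cross-attention in NumPy; the decoder-style option is shown with a plain linear K/V projection standing in for SwiGLU (all names and shapes here are illustrative assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attn(q, k, v):
    """Single-head cross-attention: the global query attends over sequence K/V."""
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (1, seq_len)
    return softmax(scores) @ v                # (1, d)

rng = np.random.default_rng(0)
d, seq_len = 32, 128
q = rng.normal(size=(1, d))            # global query token Q
seq = rng.normal(size=(seq_len, d))    # user behavior-sequence embeddings

# Decoder-style lightweight encoding: direct linear projections of the
# sequence yield K and V (plain linear here in place of SwiGLU for brevity).
wk = rng.normal(size=(d, d)) * 0.1
wv = rng.normal(size=(d, d)) * 0.1
q_new = cross_attn(q, seq @ wk, seq @ wv)   # Q' = CrossAttn(Q, K, V)
```

Note that only the single query row attends over the sequence, so the attention cost is linear in sequence length, which is what makes this option attractive for ultra-long sequences.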
Query Boosting
The decoded query Q' is concatenated with all NS Tokens. An MLP‑Mixer‑based block performs a mix‑up followed by per‑token feed‑forward (FFN) or Sparse Mixture‑of‑Experts (SMoE) layers. Residual connections preserve the original semantics while incrementally enriching the representation, effectively fusing decoded sequence information with heterogeneous features.
HyFormer Layer
Each HyFormer layer consists of a Query Decoding block followed by a Query Boosting block. Stacking L layers yields a sequence of queries Q⁰,…,Qᴸ, where each layer iteratively refines the global token using the latest cross‑attention output. The final query representation is fed to a downstream MLP for prediction.
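The alternating structure above can be sketched end to end; everything here (shapes, per-layer random weights, the simplified decode/boost bodies) is an illustrative assumption, not the production implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, seq_len, n_ns, num_layers = 32, 128, 6, 4

def decode(q, seq, wk, wv):
    """Query Decoding: single-head cross-attention of q over sequence K/V."""
    s = (q @ (seq @ wk).T) / np.sqrt(d)
    a = np.exp(s - s.max())
    a /= a.sum()
    return a @ (seq @ wv)

def boost(q, ns, w1, w2):
    """Query Boosting: residual FFN over [q; NS Tokens]; return updated q."""
    t = np.vstack([q, ns])
    t = t + np.maximum(t @ w1, 0.0) @ w2
    return t[:1]

q = rng.normal(size=(1, d))        # initial global query Q^0
seq = rng.normal(size=(seq_len, d))
ns = rng.normal(size=(n_ns, d))

for _ in range(num_layers):        # Q^0 -> Q^1 -> ... -> Q^L
    wk, wv = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
    w1, w2 = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
    q = decode(q, seq, wk, wv)
    q = boost(q, ns, w1, w2)

# q is the final query representation, fed to a downstream prediction MLP.
```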
Experiments
Effectiveness
Evaluated on TikTok search with three sequence types (a long behavior sequence of up to 3,000 items, the top‑50 search sequence, and the top‑50 feed sequence) on a 64‑GPU cluster. HyFormer outperformed two‑stage baselines (separate sequence modeling + feature crossing) and other unified models, achieving higher ranking metrics with comparable FLOPs.
Ablation Study
Removing any of the three query components—global non‑sequence features, pooled sequence token, or target‑item information—degraded performance. Excluding the global token from the Query Boosting block also reduced results. Merging multiple sequences into a single stream hurt effectiveness, confirming the benefit of independent multi‑sequence modeling.
Scaling Analysis
Using LONGER + RankMixer as the base, HyFormer maintained superiority as model parameters and compute increased, and across varying sequence lengths.
Online A/B Test
In the production TikTok search scenario, HyFormer increased average watch time per user, completion rate, and query‑switch rate.
Conclusion
HyFormer is the first industrial framework that unifies sequence modeling and feature interaction through alternating updates. By iteratively decoding long sequences with a global token and enriching that token via Mixer‑based feature fusion, the architecture provides a scalable solution for ultra‑long sequence representation and heterogeneous feature fusion.
DataFunTalk