
Efficient Target Attention (ETA) for Long-Term User Behavior Modeling in Click‑Through Rate Prediction

Efficient Target Attention (ETA) introduces a low‑cost hash‑based attention operator that enables end‑to‑end modeling of ultra‑long user behavior sequences for CTR prediction, achieving significant online CTR, GMV, and QPS improvements in Alibaba’s Taobao feed recommendation system.

DataFunTalk

Abstract: Users' interests can be divided into instant, recent, and long‑term interests. While instant and recent interests have been widely modeled, this work focuses on long‑term behavior modeling. Existing two‑stage pipelines suffer from inconsistency and high computational cost. We propose Efficient Target Attention (ETA), a hash‑based attention operator that reduces computation while enabling end‑to‑end modeling of long sequences.

1. Target Attention and User Sequence Modeling

Target Attention extracts user interest by attending to historical behaviors conditioned on a candidate item and is the core of many industrial recommendation models (e.g., DIN, DIEN, BST). However, its O(B·L·D) complexity (B: number of candidate items, L: sequence length, D: embedding dimension) makes it impractical for ultra‑long sequences.
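To make the cost concrete, here is a minimal NumPy sketch of the operator (illustrative only, not the production DIN/DIEN code; the shapes and function name are assumptions):

```python
import numpy as np

def target_attention(target, behaviors):
    """Dot-product target attention: the candidate item acts as the query,
    historical behavior embeddings act as keys and values.

    target:    (D,)   candidate-item embedding
    behaviors: (L, D) user behavior sequence
    returns:   (D,)   interest vector conditioned on this candidate
    """
    scores = behaviors @ target              # (L,) one dot product per behavior
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax over the sequence
    return weights @ behaviors               # weighted sum of behavior values

# Scoring B candidates against a length-L sequence repeats this B times,
# which is exactly the O(B * L * D) cost that blocks ultra-long sequences.
rng = np.random.default_rng(0)
D, L = 8, 100
out = target_attention(rng.normal(size=D), rng.normal(size=(L, D)))
```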

Two‑stage methods first retrieve a top‑K sub‑sequence and then apply attention, but they suffer from a gap between retrieval and prediction objectives.

2. Rethinking Target Attention

We redesign the attention operator so that the same Target Attention can be applied directly to long sequences in an end‑to‑end fashion, eliminating the retrieval‑prediction inconsistency.

3. Existing Retrieval Methods and Problems

Current approaches include Approximate Nearest Neighbor (ANN) search (e.g., SIM Soft) and structured attribute inverted indexes (e.g., SIM Hard, UBR4CTR). ANN suffers from a task gap, while inverted indexes rely on handcrafted structures and are less general.

4. Efficient Target Attention Net (ETA‑Net)

4.1 Multi‑Round LSH Attention & SimHash

We apply locality‑sensitive hashing (LSH) with binary hash buckets to generate SimHash signatures for the queries (Q) and keys (K). Running multiple LSH rounds reduces the error introduced by hash collisions, so attention scores can be approximated with fast binary Hamming‑distance comparisons instead of costly floating‑point dot products.
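A minimal sketch of the signature step (an assumed m‑bit SimHash via random hyperplanes, shared by Q and K; not the exact production kernel):

```python
import numpy as np

def simhash(x, planes):
    """m-bit SimHash signature: the sign pattern of x projected onto
    m random hyperplanes (the same planes are used for all Q and K)."""
    return (planes @ x > 0).astype(np.uint8)

def hamming(a, b):
    """Binary similarity: number of differing signature bits."""
    return int(np.count_nonzero(a != b))

rng = np.random.default_rng(0)
D, m = 16, 32                    # embedding dim, hash bits (multi-round LSH)
planes = rng.normal(size=(m, D))

q = rng.normal(size=D)
k_near = q + 0.05 * rng.normal(size=D)   # behavior similar to the query
k_far = rng.normal(size=D)               # unrelated behavior

# Similar vectors agree on most bits, so Hamming distance tracks angular
# similarity without any floating-point dot product at serving time.
d_near = hamming(simhash(q, planes), simhash(k_near, planes))
d_far = hamming(simhash(q, planes), simhash(k_far, planes))
```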

4.2 ETA Architecture

ETA‑Net consists of two parts: an ETA module that models ultra‑long user behavior with SimHash‑based attention, and a BaseModel that processes the remaining features (user, item, context). The BaseModel concatenates short‑term behavior representations, ETA outputs, and the other features, then feeds the result into an MLP for CTR prediction.

The hash functions operate on Q and K only, leaving the value vectors unchanged.
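Putting the pieces together, the attention step of the ETA module might look like the following (an illustrative sketch under the same random‑hyperplane assumption as above: hash Q and K, select behaviors by Hamming distance, then attend with the original float values):

```python
import numpy as np

def eta_attention(target, behaviors, planes, top_k):
    """ETA-style attention sketch (illustrative, not the production code):
    1) SimHash the target and all behaviors with shared hyperplanes,
    2) keep the top_k behaviors closest in Hamming distance,
    3) run exact softmax attention on that small subset only.
    The value vectors stay as untouched float embeddings."""
    sig_q = planes @ target > 0                       # (m,) query signature
    sig_k = behaviors @ planes.T > 0                  # (L, m) key signatures
    dist = np.count_nonzero(sig_k != sig_q, axis=1)   # Hamming distances
    idx = np.argsort(dist)[:top_k]                    # cheap candidate selection
    subset = behaviors[idx]                           # (top_k, D)
    scores = subset @ target                          # exact attention on subset
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ subset

rng = np.random.default_rng(1)
D, L, m = 16, 1000, 64
planes = rng.normal(size=(m, D))
interest = eta_attention(rng.normal(size=D),
                         rng.normal(size=(L, D)), planes, top_k=48)
```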

5. End‑to‑End Learning and Complexity Analysis

Unlike two‑stage pipelines, ETA‑Net is trained end‑to‑end; SimHash has no trainable parameters, so its signatures evolve with the attention weights during training, ensuring consistency between retrieval and prediction.

Complexity comparison (Table 1) shows that ETA replaces the O(B·L·D) attention term with much cheaper binary Hamming‑distance computations, and the optimized ETA+ variant removes the online SimHash computation entirely by pre‑computing signatures.
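A back‑of‑envelope version of that comparison, with hypothetical sizes (the exact figures are in the paper's Table 1):

```python
# Hypothetical per-request sizes: B candidates, length-L sequence,
# D-dim embeddings, 64 hash bits packed into one machine word.
B, L, D = 500, 10_000, 64

full_attention_ops = B * L * D   # float multiply-adds for exact Q.K scores
eta_hamming_ops = B * L          # one XOR + popcount per (query, key) pair
                                 # once signatures are packed into integers

# The dominant term drops by a factor of D (here 64x); ETA+ additionally
# removes the online SimHash cost by pre-computing item signatures offline.
speedup = full_attention_ops // eta_hamming_ops
```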

6. Industrial Implementation

Training runs on Alibaba's AOP platform (CPU clusters). Optimizations such as operator fusion and custom GPU kernels (for XOR, TopK, and Gather) improve latency and memory bandwidth.

Online deployment on the RTP service handles >60 k QPS daily, peaking at >120 k QPS during Double‑11.

SimHash signatures are pre‑computed offline and stored as Int64 fingerprints, keeping the additional online storage overhead to roughly 6%.
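A sketch of the fingerprint trick (plain Python ints standing in for the Int64 column; assumes at most 64 hash bits per signature):

```python
def pack_signature(bits):
    """Pack up to 64 SimHash bits into a single 64-bit integer fingerprint,
    so each item's signature is stored as one Int64 instead of m values."""
    value = 0
    for b in bits:
        value = (value << 1) | int(b)
    return value

def hamming_int64(a, b):
    """Hamming distance between two packed fingerprints via XOR + popcount,
    the kind of bit-level operation the custom GPU kernels accelerate."""
    return bin(a ^ b).count("1")

sig_a = pack_signature([1, 0, 1, 1] + [0] * 60)
sig_b = pack_signature([1, 1, 1, 0] + [0] * 60)  # differs in 2 bit positions
dist = hamming_int64(sig_a, sig_b)
```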

GPU‑specific kernel rewrites further accelerate inference.

7. Experiments

We evaluate ETA against several baselines (Avg‑Pooling DNN, DIN, DIN‑Long, SIM, UBR4CTR, and their time‑aware variants) on both public and production datasets. ETA consistently achieves the highest AUC/CTR gains (e.g., +0.1% AUC on public data, +0.34%–0.43% over SIM/UBR4CTR on production data).

Business impact on Taobao feed recommendation: CTR +1.8%, IPV +2.8%, GMV +3.1% compared with the previous ranking model.

8. Conclusion

We present ETA, the first end‑to‑end, cost‑effective solution for modeling ultra‑long user behavior in large‑scale CTR tasks. Deployed as the main ranking model for Taobao’s homepage feed, ETA processes millions of requests per day with substantial revenue uplift.

References

[1] Zhou G, Zhu X, Song C, et al. Deep interest network for click‑through rate prediction. KDD 2018.

[2] Zhou G, Mou N, Fan Y, et al. Deep interest evolution network for click‑through rate prediction. AAAI 2019.

[3] Chen Q, Zhao H, Li W, et al. Behavior sequence transformer for e‑commerce recommendation in Alibaba. 2019.

[4] Pi Q, Bian W, Zhou G, et al. Practice on long sequential user behavior modeling for click‑through rate prediction. KDD 2019.

[5] Pi Q, Zhou G, Zhang Y, et al. Search‑based user interest modeling with lifelong sequential behavior data for CTR prediction. CIKM 2020.

[6] Qin J, Zhang W, Wu X, et al. User behavior retrieval for CTR prediction. SIGIR 2020.

[7] Charikar M S. Similarity estimation techniques from rounding algorithms. STOC 2002.

[8] Manku G S, Jain A, Das Sarma A. Detecting near‑duplicates for web crawling. WWW 2007.

CTR prediction, Recommendation systems, hashing, attention mechanism, Long sequence modeling
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
