How Auto Risk Transforms Behavior Sequence Data with Unsupervised Pre‑Training
This article introduces Auto Risk, a deep‑learning risk model for behavior‑sequence data that leverages unsupervised pre‑training with proxy tasks, details its convolution‑attention encoder, demonstrates significant gains across multiple business scenarios, and highlights its strong small‑sample and analogy capabilities.
1. Background
Behavior-sequence data such as shopping logs or risk events are common inputs for recommendation and risk control. Before they can be used for sequence classification, they must be converted into feature vectors.
Traditional methods rely on handcrafted trigger and accumulation features fed to GBDT classifiers, while recent approaches use RNN, CNN, or Attention networks to directly process raw sequences, reducing manual feature engineering.
2. Pre-training
Inspired by NLP pre‑training (ELMo, GPT, BERT, ERNIE), Auto Risk adopts proxy tasks on massive unlabeled data to learn generic high‑level representations, which can be fine‑tuned for downstream tasks.
Key pre‑training conditions include:
Proxy task accumulation: Proxy tasks have progressed from simple CBOW/Skip-Gram to Masked-LM and NSP, with rising task difficulty that helps capture more abstract knowledge.
Deep networks: Larger capacity models (e.g., ResNet, SBBB) extract richer features.
Attention: Provides memory, alignment, and global view, crucial for sequence data.
CNN rise: Parallelizable, stackable convolutions combined with attention form efficient Transformers.
3. Problem Analysis
Existing NLP‑style models are supervised and suffer from label scarcity; behavior sequences have multiple fields, heterogeneous modalities, long lengths, and no natural sentence boundaries, making direct BERT adaptation inefficient.
Unlabeled data is abundant, and learning generic features from it can dramatically improve data utilization and complement handcrafted features.
4. Model Design
4.1 Proxy Tasks
Two proxy tasks are defined:
Masked Language Model at the event level – mask a token at time t and predict it, encouraging the model to capture local context.
Quick Thought at the sequence level – split a sequence into two sub‑sequences, encode each with a Siamese network, and predict whether a pair originates from the same source.
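The two proxy tasks can be sketched as sample builders. This is a minimal illustration, not Auto Risk's actual pipeline: the names `MASK`, `make_masked_example`, and `make_quick_thought_pairs` are hypothetical, and the midpoint split and random negative sampling are assumptions, since the article does not say how sub-sequences are cut or how negatives are drawn.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token for a hidden event


def make_masked_example(events, rng):
    """Event-level task: hide one event, return (masked sequence, position, target)."""
    t = rng.randrange(len(events))
    masked = list(events)
    target = masked[t]
    masked[t] = MASK
    return masked, t, target


def make_quick_thought_pairs(sequences, rng):
    """Sequence-level task: build (left, right, label) triples.

    Each sequence is split at its midpoint (an assumption). A pair of halves
    from the same sequence gets label 1; a left half paired with the right
    half of a randomly chosen other sequence gets label 0.
    Requires at least two sequences.
    """
    halves = [(s[: len(s) // 2], s[len(s) // 2:]) for s in sequences]
    pairs = []
    for i, (left, right) in enumerate(halves):
        pairs.append((left, right, 1))
        j = rng.choice([k for k in range(len(halves)) if k != i])
        pairs.append((left, halves[j][1], 0))
    return pairs
```

In a setup like this, the masked examples would train the encoder to predict `target` from its neighbors, while the labeled pairs would train the Siamese encoders to score same-source halves higher than mismatched ones.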
4.2 Network Structure
The encoder combines embedding, convolution, and attention layers:
Embedding: Field‑wise embeddings (event type, time, amount, channel, product name) are merged via addition or concatenation; textual fields are aggregated by convolution or averaging.
Convolution layer: Captures local context, assumed to be the primary feature for risk scenarios.
Attention layer: Captures global context, providing an additional view.
One convolution + one attention form a block; multiple blocks are stacked with ResNet‑style shortcuts, analogous to stacking Transformer layers.
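The layering described above can be illustrated with a small pure-Python sketch. It rests on several simplifying assumptions that the article does not confirm: fields are merged by addition, the convolution is a per-channel 1D convolution with zero padding, the attention has no learned projections, and a residual shortcut wraps each sub-layer. All function names are hypothetical.

```python
import math


def embed_event(event, tables, dim):
    """Merge field-wise embeddings by element-wise addition."""
    vec = [0.0] * dim
    for field, value in event.items():
        for i, w in enumerate(tables[field][value]):
            vec[i] += w
    return vec


def local_conv(x, kernels):
    """Per-channel 1D convolution with zero padding (local context).

    x: T vectors of size D; kernels: D lists of K weights, K odd.
    """
    T, D, K = len(x), len(x[0]), len(kernels[0])
    half = K // 2
    return [[sum(kernels[d][k] * x[t + k - half][d]
                 for k in range(K) if 0 <= t + k - half < T)
             for d in range(D)]
            for t in range(T)]


def self_attention(x):
    """Scaled dot-product self-attention without projections (global context)."""
    D = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(D) for key in x]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[d] for w, v in zip(weights, x))
                    for d in range(D)])
    return out


def add(x, y):
    """Element-wise sum of two T x D sequences (the residual shortcut)."""
    return [[a + b for a, b in zip(r, s)] for r, s in zip(x, y)]


def block(x, kernels):
    """One convolution followed by one attention, each with a shortcut."""
    h = add(x, local_conv(x, kernels))
    return add(h, self_attention(h))


def encoder(x, kernels_per_block):
    """Stack several blocks, one kernel set per block."""
    for kernels in kernels_per_block:
        x = block(x, kernels)
    return x
```

The output keeps the input's shape (T vectors of size D), which is what makes it possible to stack blocks freely in this sketch.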
4.2.1 Convolution Improvements
Standard convolutions suffer from vanishing gradients and high computational cost. Auto Risk replaces them with Gated Conv and Depthwise Separable Conv, cutting parameters from roughly 320k to 60k (about 20% of the original) for D=256, K=5, with an even smaller fraction for larger kernels, which also accelerates convergence.
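The parameter figures can be checked with a short calculation. The sketch below assumes a 1D convolution with D input and D output channels, ignores biases, and ignores any extra weights the gating branch would add; the function names are hypothetical.

```python
def standard_conv_params(d, k):
    """Weights of a standard 1D convolution: every output channel sees all inputs."""
    return d * d * k


def depthwise_separable_params(d, k):
    """Depthwise part (one K-tap filter per channel) plus pointwise 1x1 mixing."""
    return d * k + d * d


d, k = 256, 5
full = standard_conv_params(d, k)
sep = depthwise_separable_params(d, k)
print(full, sep, round(sep / full, 3))  # 327680 66816 0.204
```

Under these assumptions the ratio is about 1/D + 1/K, so it shrinks as the kernel grows, in line with the claim that larger kernels benefit even more.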
4.2.2 Attention Improvements
Self-attention's memory cost grows as O(N²) in the sequence length N. Fixed-Size or Block Attention reduces it to O(2NK) or O(NK), enabling training on sequences up to length 4000 with three improved attention layers on a single GPU.
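To make the savings concrete, the sketch below counts attention-score entries under one plausible reading of the two variants: block attention restricts each position to its own block of K positions, and fixed-size attention restricts it to a window of K positions on each side. The article does not define either variant precisely, so these definitions, the function names, and the value K=100 are assumptions.

```python
def full_attention_entries(n):
    """Every position attends to every position."""
    return n * n


def block_attention_entries(n, k):
    """Positions attend only within their own block of size k."""
    full_blocks, rest = divmod(n, k)
    return full_blocks * k * k + rest * rest


def window_attention_entries(n, k):
    """Positions attend to at most k neighbors on each side, plus themselves."""
    return sum(min(i + k, n - 1) - max(i - k, 0) + 1 for i in range(n))


def block_mask(n, k):
    """Boolean mask: True where position i may attend to position j."""
    return [[i // k == j // k for j in range(n)] for i in range(n)]


n, k = 4000, 100  # k is a hypothetical block/window size
print(full_attention_entries(n))       # 16000000
print(block_attention_entries(n, k))   # 400000
print(window_attention_entries(n, k))  # 793900
```

With these numbers the block variant stores N·K entries and the window variant stays just under 2·N·K, compared with N² for full self-attention.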
4.3 Training
With the optimized encoder, Auto Risk trains three-layer encoders on sequences of length 4000 using a single GPU, processing each batch 2–3× faster than a Transformer and converging in less than a day on tens of millions of samples.
5. Application Effects
5.1 Business Gains
Adding Auto Risk vectors to state‑of‑the‑art handcrafted features improves AUC by 3–6 points in risk control tasks. Fine‑tuning the pre‑trained model yields further gains, confirming the benefit of unsupervised pre‑training.
5.2 Multi‑Scenario Results
The same pre‑trained model, without any task‑specific features or fine‑tuning, achieves high AUC (up to 0.9) on unrelated tasks such as gender and age prediction using only a linear classifier, demonstrating strong generalization.
5.3 Small‑Sample Learning
Pre-training provides a solid foundation, allowing models to reach superior performance with limited labeled data, which helps in cold-start or expensive-label scenarios. In a "buy-now-pay-later" use case, Auto Risk outperforms supervised networks even with only 40k labeled samples.
5.4 Sequence Analogy
Analogous to word‑embedding analogies, Auto Risk vectors capture semantic relationships. Experiments such as “Taobao credit‑payment – Taobao balance = External credit‑payment – External balance” retrieve expected counterparts, and numeric fields preserve magnitude differences.
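The analogy test can be sketched as vector arithmetic followed by nearest-neighbor retrieval. The vectors below are hand-made toy values chosen so that the relation holds exactly; they are not Auto Risk outputs, and the names `cosine` and `analogy` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(a, b, d, candidates):
    """Return the candidate name closest to a - b + d."""
    query = [x - y + z for x, y, z in zip(a, b, d)]
    return max(candidates, key=lambda name: cosine(query, candidates[name]))


# Toy axes (illustrative only): [credit payment, balance payment, Taobao, external]
vectors = {
    "taobao_credit": [1.0, 0.0, 1.0, 0.0],
    "taobao_balance": [0.0, 1.0, 1.0, 0.0],
    "external_credit": [1.0, 0.0, 0.0, 1.0],
    "external_balance": [0.0, 1.0, 0.0, 1.0],
}

result = analogy(vectors["taobao_credit"], vectors["taobao_balance"],
                 vectors["external_balance"], vectors)
print(result)  # external_credit
```

Rearranging "A - B = C - D" as "C = A - B + D" is what turns the analogy into a retrieval query; with real sequence vectors the match would be approximate rather than exact.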
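A=[create trade - Taobao physical-goods escrow, Huabei payment - Taobao physical-goods escrow, ...]
B=[create trade - Taobao physical-goods escrow, balance payment - Taobao physical-goods escrow, ...]
C=[Huabei payment - off-site instant transfer, ...]
D=[app login, balance payment - off-site instant transfer, ...]
6. Conclusion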
Auto Risk introduces an unsupervised pre‑training framework for behavior‑sequence data, designing proxy tasks and a convolution‑attention encoder tailored to the data’s characteristics. Deployed in real business, it yields significant AUC improvements, strong cross‑scenario generalization, and notable benefits for small‑sample learning, with future work extending to more data sources and proxy tasks.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.