Auto Risk: Pretraining Deep Models on Unlabeled Behavior Sequences
This article introduces Auto Risk, a deep‑learning framework for behavior sequences. It uses unsupervised pre‑training with proxy tasks to learn general‑purpose feature representations from massive unlabeled data. In risk‑control scenarios it improves AUC, generalizes across multiple scenes, and supports small‑sample learning.
1. Introduction
Behavior sequence data such as shopping logs or risk‑control events are abundant in many business scenarios, yet labeled data are scarce. To address this, we propose Auto Risk, a deep‑learning risk algorithm that learns generic feature representations from unlabeled data via proxy tasks, similar to NLP pre‑training models like BERT, but tailored for behavior sequences.
2. Background
Traditional methods rely on handcrafted trigger and accumulation features followed by GBDT classifiers. Recent advances use RNN, CNN, or Attention networks to directly ingest raw sequences, eliminating manual feature engineering. Our earlier Detail Risk framework applied such ideas, reducing manual work and improving model performance.
3. Limitations of Supervised Approaches
Supervised models suffer when label samples are limited, and multi‑task learning introduces complex trade‑offs. Meanwhile, massive unlabeled data remain underutilized.
4. Pre‑training Concept
Pre‑training leverages abundant unlabeled data and proxy tasks to learn high‑level features, which can be fine‑tuned for downstream tasks. Inspired by the success of ELMo, GPT, BERT, and ERNIE in NLP, we adapt the idea to behavior sequences, which differ significantly from text.
5. Proxy Tasks
Masked Language Model (MLM) at the event level: Randomly mask a token at time t and require the model to predict it, encouraging the network to capture local context.
Quick Thought (QT) at the sequence level: Split a sequence into two sub‑sequences, encode each with a Siamese network, and predict whether a pair originates from the same original sequence, encouraging global semantic understanding.
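The two proxy tasks can be sketched as plain data-preparation steps. This is a minimal illustration, not the authors' pipeline: the `[MASK]` placeholder, the half-way split point, the negative-sampling rule, and the event names are all assumptions made for the example.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token for a hidden event


def mask_one_event(sequence, rng):
    """Event-level task: hide one event and return (masked_sequence, position, target)."""
    t = rng.randrange(len(sequence))
    masked = list(sequence)
    target = masked[t]
    masked[t] = MASK
    return masked, t, target


def make_qt_pairs(sequences, rng):
    """Sequence-level task: split each sequence in half (an assumed split point) and
    label a pair 1 if both halves come from the same original sequence, else 0."""
    halves = [(s[: len(s) // 2], s[len(s) // 2:]) for s in sequences]
    pairs = []
    for i, (left, right) in enumerate(halves):
        pairs.append((left, right, 1))  # positive: same source sequence
        j = rng.choice([k for k in range(len(halves)) if k != i])
        pairs.append((left, halves[j][1], 0))  # negative: second half of another sequence
    return pairs


rng = random.Random(0)
# Made-up event names, for illustration only
user_a = ["login", "create_order", "pay_balance", "confirm_receipt"]
user_b = ["login", "change_password", "pay_credit", "refund"]

masked, t, target = mask_one_event(user_a, rng)
pairs = make_qt_pairs([user_a, user_b], rng)
```

In training, the encoder would be asked to recover `target` from `masked`, and the Siamese encoder would be asked to predict the 0/1 label of each pair.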
6. Model Architecture
The core is an Encoder built from stacked Convolution‑Attention blocks using ResNet‑style shortcuts.
Embedding Layer: All fields (event type, timestamp, amount, channel, product name, etc.) are embedded; textual fields are aggregated via convolution or averaging.
Convolution Layer: Captures local context, assumed to be the primary feature in risk‑control scenarios.
Attention Layer: Captures global context as a secondary feature.
Block: One Convolution + one Attention layer, stacked multiple times.
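The block structure described above can be sketched in a few lines of pure Python. This is a toy under stated assumptions: the convolution uses one fixed kernel shared across channels, the attention has no learned projections, and normalization and gating are omitted because the article does not specify them. All function names are illustrative.

```python
import math


def add(x, y):
    """Element-wise sum of two sequences of vectors (the ResNet-style shortcut)."""
    return [[a + b for a, b in zip(u, v)] for u, v in zip(x, y)]


def local_conv(x, kernel):
    """Toy 1-D convolution with zero padding; assumes an odd-length kernel."""
    n, d, half = len(x), len(x[0]), len(kernel) // 2
    out = []
    for t in range(n):
        row = [0.0] * d
        for k, w in enumerate(kernel):
            s = t + k - half
            if 0 <= s < n:
                for c in range(d):
                    row[c] += w * x[s][c]
        out.append(row)
    return out


def self_attention(x):
    """Toy scaled dot-product self-attention over the whole sequence."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, x)) / z for c in range(d)])
    return out


def block(x, kernel):
    x = add(x, local_conv(x, kernel))  # convolution sub-layer: local context
    x = add(x, self_attention(x))      # attention sub-layer: global context
    return x


def encoder(x, kernel, num_blocks=3):
    for _ in range(num_blocks):
        x = block(x, kernel)
    return x


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]  # 4 events, 2-d embeddings
h = encoder(x, kernel=[0.25, 0.5, 0.25])
```

The output `h` keeps the input shape (one vector per event), which is what allows blocks to be stacked.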
6.1 Convolution Improvements
Standard convolutions become inefficient when stacked deep. We replace them with Gated Convolution and Depthwise Separable Convolution, reducing parameters and computation (e.g., from 320k to 60k for D=256, K=5) while improving convergence.
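The parameter saving can be checked with simple arithmetic. The sketch below assumes D input and D output channels, ignores biases, and does not count the extra weights a gate would add. The article does not say how its figures were counted, so the results only roughly match the 320k and 60k quoted above.

```python
def standard_conv_params(d, k):
    """Weights in a 1-D convolution with d input and d output channels (bias ignored)."""
    return k * d * d


def depthwise_separable_params(d, k):
    """One k-tap depthwise filter per channel plus a 1x1 pointwise mixing layer."""
    return k * d + d * d


D, K = 256, 5
standard = standard_conv_params(D, K)
separable = depthwise_separable_params(D, K)
ratio = standard / separable
print(standard, separable, round(ratio, 1))
```

With D=256 and K=5 this gives 327,680 weights for the standard convolution and 66,816 for the depthwise separable one, a reduction of roughly 5×.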
6.2 Attention Improvements
Self‑Attention’s O(N²) memory in the sequence length N becomes prohibitive for long sequences (N > 1000). We adopt Fixed‑Size or Block Attention, which reduces memory to O(2NK) or O(NK) for an attention span K, enabling 3‑layer encoders to train on sequences of length 4000 on a single GPU.
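The memory argument can be made concrete by counting how many attention scores each scheme stores. The sketch below rests on one reading of the two variants, which the article does not spell out: "fixed-size" is taken to mean a window of K neighbors on each side, and "block" to mean non-overlapping blocks of K positions. The value K=50 is chosen only for illustration.

```python
def block_attention_spans(n, k):
    """Each position attends only within its own non-overlapping block of size k."""
    return [(t // k * k, min(n, t // k * k + k)) for t in range(n)]


def fixed_window_spans(n, k):
    """Each position attends to at most k neighbors on each side, plus itself."""
    return [(max(0, t - k), min(n, t + k + 1)) for t in range(n)]


def score_count(spans):
    """Number of attention scores that must be stored for the given spans."""
    return sum(end - start for start, end in spans)


N, K = 4000, 50  # K is an illustrative choice, not a value from the article
full = N * N
blocked = score_count(block_attention_spans(N, K))
windowed = score_count(fixed_window_spans(N, K))
print(full, blocked, windowed)
```

Full attention needs N² scores, while the two restricted variants grow linearly in N for a fixed K, which is what makes length-4000 sequences feasible.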
7. Training Efficiency
A three‑layer Encoder can train on sequences of length 4000, whereas Transformer‑based BERT struggles beyond 1000.
Batch training is 2–3× faster and convergence takes fewer steps; a full‑scale dataset can be trained within a day on a single GPU.
8. Application Results
8.1 Business Gains
Adding Auto Risk vectors to existing handcrafted features improves AUC by 3–6 points across risk‑control tasks. Fine‑tuning the encoder further boosts performance, mirroring BERT’s behavior.
8.2 Multi‑Scene Effectiveness
Because pre‑training does not use task‑specific labels, the learned representations transfer to unrelated tasks such as gender or age prediction, achieving AUC up to 0.9 with a simple logistic regression classifier.
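Using frozen sequence vectors with a simple classifier can be sketched as follows. This is an illustrative pure-Python logistic regression trained by gradient descent. The 2-d vectors and labels are made up and stand in for encoder outputs; they are not data from the article.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(vectors, labels, lr=0.5, epochs=200):
    """Fit weights and bias by plain gradient descent on the log loss."""
    w = [0.0] * len(vectors[0])
    b = 0.0
    n = len(vectors)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(vectors, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Probability of the positive class for one vector."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Made-up vectors standing in for frozen encoder outputs (hypothetical data)
vectors = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.9], [0.1, 0.7]]
labels = [1, 1, 0, 0]
w, b = train_logreg(vectors, labels)
```

Only `w` and `b` are learned here; the vectors themselves stay fixed, which is the sense in which the representation transfers without task-specific training of the encoder.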
8.3 Small‑Sample Learning
Pre‑trained models excel when labeled data are scarce; in a “buy‑now‑pay‑later” scenario, Auto Risk + fine‑tuning outperforms a fully supervised network even with 40k labeled samples.
9. Sequence Analogy
We test A‑B=C‑D analogies in the embedding space. For example, swapping payment methods or amounts yields consistent vector arithmetic, demonstrating that the learned space captures high‑level semantics.
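One way to score such an analogy is to compare the offset A−B with the offset C−D. The sketch below uses cosine similarity, which is an assumption: the article does not state which measure was used. The 3-d vectors are made up and stand in for encoder outputs.

```python
import math


def sub(u, v):
    """Element-wise difference of two vectors."""
    return [a - b for a, b in zip(u, v)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def analogy_score(a, b, c, d):
    """Cosine similarity between the offsets A-B and C-D; near 1.0 means the analogy holds."""
    return cosine(sub(a, b), sub(c, d))


# Made-up vectors; in practice these would be encoder outputs for sequences A, B, C, D
a, b = [1.0, 2.0, 0.0], [0.0, 2.0, 0.0]
c, d = [1.5, 0.0, 1.0], [0.5, 0.0, 1.0]
score = analogy_score(a, b, c, d)
```

A score close to 1.0 indicates that changing the same attribute moves the embedding in the same direction regardless of the rest of the sequence.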
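Payment methods:
A=[create transaction - Taobao physical-goods escrow, Huabei payment - Taobao physical-goods escrow, ...]
B=[create transaction - Taobao physical-goods escrow, balance payment - Taobao physical-goods escrow, ...]
C=[Huabei payment - off-site instant transfer, ...]
D=[app login, balance payment - off-site instant transfer, ...]
Similar experiments on amount fields and product names confirm the model’s ability to encode numeric buckets and textual patterns.
Amount fields:
A=[\N,\N,10000.0,10000.0,...]
B=[\N,\N,10.0,10.0,...]
C=[\N,\N,\N,8000.0,...]
D=[\N,1.0,\N,1.0,...]
Product names:
A=["Didi Express - Driver Zhou",...]
B=["Tencent Q coins 100 yuan qq",...]
C=["Didi Express - Driver Feng",...]
D=["Tencent 1000 QQ coins 1",...]
10. Conclusion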
Auto Risk demonstrates that unsupervised pre‑training on massive behavior‑sequence data can produce universal high‑level features, alleviating label scarcity, improving risk‑control performance, and generalizing across multiple scenarios. Future work includes extending to more data sources and exploring additional proxy tasks.
Alibaba Cloud Developer
