DeepString: Alibaba's Anti‑Fraud Platform Using Large Models for Real‑Time Traffic Detection
Alibaba's anti-fraud platform DeepString uses large unsupervised models to detect abnormal traffic in real time across multiple advertising products. It combines a foundation model, responsible for event mining and anomaly measurement, with an alignment model for online filtering, reducing reliance on manual labeling and domain expertise.
Abstract
The Alibaba Mama risk‑control team built a next‑generation anti‑fraud platform (DeepString on Alimama Defense Force, DNA) to protect dozens of product lines across both on‑site and off‑site advertising. Its core is the DeepString algorithm framework, which leverages large models to learn the natural regularities of the business in an unsupervised manner, reducing reliance on domain experience and enabling rapid iteration for new risk scenarios.
1. Background
1.1 Advertising fraud overview
Alibaba Mama operates a massive commercial marketing middle‑platform serving millions of advertisers with billions of dollars in spend. Abnormal traffic accounts for roughly 24% of total flow, making the platform a prime target for black‑ and gray‑market attacks.
The risk‑control team safeguards multiple businesses (Alimama, Xianyu, Feizhu, Youku) and protects hundreds of billions of RMB from fraudulent traffic.
1.2 Problem definition
Rapidly evolving fraud patterns and the high cost of manual labeling make traditional experience‑driven governance inefficient. The workflow is abstracted into two stages, "growth" and "maintenance", and the algorithm is tasked with improving both effectiveness and efficiency.
1.3 DNA platform overview
DNA comprises more than ten applications (tiered filtering, multi‑entity mining, unsupervised sample library, model training, reporting, analysis). Its core DeepString framework uses large models to capture business regularities without supervision, achieving stronger recall for evolving risks while reducing dependence on domain expertise.
2. DeepString algorithm framework
Inspired by string theory, DeepString explores high‑dimensional sub‑spaces where abnormal traffic resides, progressively increasing confidence as independent abnormal dimensions are discovered.
2.1 Framework sketch
The pipeline consists of three serial modules: (1) a foundation model shared across all business lines that learns universal patterns from massive, long‑term data; (2) an offline event‑mining module that heuristically searches for high‑impact events; (3) an alignment model that integrates business‑specific signals for online streaming filtering.
2.2 Basic assumptions
Assumption 1: Fraud always increases traffic volume. Assumption 2: The distribution of fraudulent traffic deviates from normal patterns. These guide heuristic searches for events that exhibit significant volume spikes.
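A minimal sketch of how these two assumptions could drive a heuristic search: under Assumption 1, an event whose observed volume deviates far above its historical baseline is a spike candidate. The function names, the Poisson z-score, and the cutoff value below are illustrative assumptions, not the platform's actual implementation.

```python
import math

def volume_spike_score(observed: int, baseline_mean: float) -> float:
    """Poisson z-score of an event's observed count against its
    historical baseline; a large positive score signals the kind
    of volume spike that Assumption 1 says fraud must produce."""
    if baseline_mean <= 0:
        raise ValueError("baseline_mean must be positive")
    return (observed - baseline_mean) / math.sqrt(baseline_mean)

def is_spike(observed: int, baseline_mean: float, threshold: float = 3.0) -> bool:
    """Flag events exceeding their baseline by more than `threshold`
    standard deviations (a hypothetical cutoff)."""
    return volume_spike_score(observed, baseline_mean) > threshold
```

Assumption 2 (distributional deviation) would then be checked on the flagged events, as described in Section 4.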
2.3 Offline & online usage
Offline, the system defines suitable data structures and performs heuristic searches to discover events with sufficient independent volume increase. Online, a lightweight, stateless ensemble function updates per‑round to maintain high‑frequency adaptation.
3. Foundation model – Event mining
3.1 What is event mining?
Event mining heuristically identifies sub‑spaces where fraudulent traffic is most pronounced, even without labeled data. It determines the upper bound of recall for the DeepString pipeline.
3.2 Comparison with traditional ML
Unlike supervised models that memorize fixed fraud probabilities, event mining focuses on discovering new, high‑impact dimensions, reducing reliance on handcrafted features.
3.3 Mining methods
For tabular data, parallel trees (e.g., XGBoost) generate multiple independent event spaces. For sequential or graph data, sequence models produce embeddings that are projected into discrete event groups.
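The core idea, that each tree's leaves partition the feature space into discrete sub‑spaces, and parallel trees yield multiple independent partitions, can be sketched in a few lines. The random split lists below are a stand‑in for real trained trees (XGBoost or otherwise); all names and the bucketing scheme are illustrative assumptions.

```python
import random

def leaf_id(record, splits):
    """Route a record through a list of (feature_index, threshold)
    splits; the bit pattern of the left/right decisions is the leaf
    id, i.e. the discrete sub-space (event) the record falls into."""
    lid = 0
    for feat, thr in splits:
        lid = (lid << 1) | (1 if record[feat] > thr else 0)
    return lid

def mine_event_spaces(records, n_trees=3, depth=2, seed=0):
    """Build several independent random split lists (stand-ins for
    the parallel trees in the text) and group record indices by
    (tree, leaf) -- each group is one candidate event sub-space."""
    rng = random.Random(seed)
    n_feats = len(records[0])
    events = {}
    for t in range(n_trees):
        splits = [(rng.randrange(n_feats), rng.uniform(0, 1))
                  for _ in range(depth)]
        for i, rec in enumerate(records):
            events.setdefault((t, leaf_id(rec, splits)), []).append(i)
    return events
```

For sequence or graph data, the same grouping step would apply after projecting the learned embeddings into discrete buckets.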
4. Foundation model – Event anomaly measurement
4.1 Estimating normal event probability
By learning spatio‑temporal business regularities, the system estimates the baseline probability of each event, enabling deviation detection.
4.2 Computing deviation
Various statistical tools (binomial test, Chernoff bound, t‑test, explained variance) quantify the gap between observed and expected frequencies, feeding into downstream scoring.
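As one concrete example of the tools listed above, the Chernoff bound gives an upper bound on the probability of seeing an event frequency as extreme as the observed one under the estimated baseline: for X ~ Binomial(n, p), P(X >= k) <= exp(-n · D(k/n || p)), where D is the Bernoulli KL divergence. The sketch below assumes this textbook form; it is not the platform's actual scoring code.

```python
import math

def kl_bernoulli(q: float, p: float) -> float:
    """KL divergence D(Bernoulli(q) || Bernoulli(p))."""
    if q >= 1.0:
        return math.log(1.0 / p)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def chernoff_upper_tail(k: int, n: int, p: float) -> float:
    """Chernoff upper bound on P(X >= k) for X ~ Binomial(n, p).
    A tiny bound means the observed event frequency k/n deviates
    far above its expected (baseline) probability p."""
    q = k / n
    if q <= p:
        return 1.0  # no upward deviation; bound is vacuous
    return math.exp(-n * kl_bernoulli(q, p))
```

A vanishingly small bound for an event is strong evidence of deviation, which can then feed into downstream scoring.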
5. Alignment model – Integration
5.1 Ensemble approach
Multiple independent random variables from different modules are combined using non‑negative linear models, with variance reduction achieved by increasing ensemble size.
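The variance-reduction argument is standard: for independent scores combined with normalized non‑negative weights, Var(Σ wᵢsᵢ) = Σ wᵢ²·Var(sᵢ), so an equal-weight ensemble of m unit-variance signals has variance 1/m. The sketch below illustrates just that identity; the function names are assumptions for illustration.

```python
def ensemble_score(scores, weights):
    """Non-negative linear combination of independent module scores,
    with weights normalized to sum to 1."""
    assert all(w >= 0 for w in weights), "weights must be non-negative"
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, scores)) / total

def ensemble_variance(variances, weights):
    """Variance of the combined score when inputs are independent:
    Var(sum w_i * s_i) = sum w_i^2 * Var(s_i), weights normalized."""
    total = sum(weights)
    return sum((w / total) ** 2 * v for w, v in zip(weights, variances))
```

With four equal-weight, unit-variance signals, the combined variance drops to 0.25, i.e. 1/m, which is why adding independent modules tightens the ensemble.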
5.2 Bias calibration
A calibration module maps confidence scores to a uniform distribution, ensuring consistent thresholds across updates and preventing drift during traffic spikes.
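One common way to map scores to a uniform distribution is a rank (empirical-CDF) transform against a reference sample, so that a fixed threshold keeps flagging roughly the same share of traffic across model updates. The class below is a minimal sketch of that technique under this assumption, not the platform's calibration module.

```python
import bisect

class UniformCalibrator:
    """Map raw confidence scores to [0, 1] via the empirical CDF of
    a reference sample. After calibration, a fixed cutoff (e.g. 0.99)
    selects a stable fraction of traffic even as the raw score
    distribution drifts between model updates."""

    def __init__(self, reference_scores):
        self.sorted_ref = sorted(reference_scores)

    def calibrate(self, score: float) -> float:
        # Fraction of reference scores at or below this score.
        rank = bisect.bisect_right(self.sorted_ref, score)
        return rank / len(self.sorted_ref)
```

Refitting the calibrator on each update's reference window is what keeps thresholds consistent during traffic spikes.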
6. Conclusion and outlook
DNA’s DeepString framework demonstrates a sustainable, large‑model‑driven solution for advertising anti‑fraud. Future work includes fully eliminating manual experience, improving heterogeneous graph search, and enhancing cross‑scenario knowledge sharing.
Alimama Tech
Official Alimama tech channel, showcasing all of Alimama's technical innovations.