DeepString: Alibaba's Anti‑Fraud Platform Using Large Models for Real‑Time Traffic Detection

Alibaba's anti-fraud platform DeepString uses large unsupervised models to detect abnormal traffic in real time across multiple advertising products. It combines a foundation model for event mining and anomaly measurement with an alignment model for online filtering, which reduces reliance on manual labeling and domain expertise.

Alimama Tech

Abstract

The Alimama risk-control team built a next-generation anti-fraud platform (DeepString on Alimama Defense Force, DNA) to protect dozens of product lines across both on-site and off-site advertising. Its core is the DeepString algorithm framework, which uses large models to learn the natural regularities of the business without supervision, reducing reliance on domain experience and enabling rapid iteration for new risk scenarios.

1. Background

1.1 Advertising fraud overview

Alimama operates a massive commercial marketing middle platform serving millions of advertisers with billions of dollars in spend. Abnormal traffic accounts for roughly 24% of total traffic, and the platform is a prime target for black-market and gray-market attacks.

The risk‑control team safeguards multiple businesses (Alimama, Xianyu, Feizhu, Youku) and protects hundreds of billions of RMB from fraudulent traffic.

1.2 Problem definition

Rapidly evolving fraud patterns and the high cost of manual labeling make traditional experience-driven governance inefficient. The workflow is abstracted into two stages, “growth” and “maintenance”, and the algorithms are tasked with improving both effectiveness and efficiency.

1.3 DNA platform overview

DNA comprises more than ten applications (tiered filtering, multi‑entity mining, unsupervised sample library, model training, reporting, analysis). Its core DeepString framework uses large models to capture business regularities without supervision, achieving stronger recall for evolving risks while reducing dependence on domain expertise.

2. DeepString algorithm framework

Inspired by string theory, DeepString explores high‑dimensional sub‑spaces where abnormal traffic resides, progressively increasing confidence as independent abnormal dimensions are discovered.
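
As a rough illustration of why each additional independent abnormal dimension raises confidence: if normal traffic would look this extreme in a single dimension only with some small probability, and the dimensions are truly independent, then the probability of it looking this extreme in all of them at once is the product of those probabilities. The sketch below assumes per-dimension tail probabilities are already available; the function name is hypothetical and is not part of DeepString.

```python
import math

def joint_false_alarm_probability(tail_probabilities):
    """Probability that normal traffic is this extreme in every dimension at once.

    Assumes the dimensions are statistically independent, so the
    per-dimension tail probabilities simply multiply.
    """
    probs = list(tail_probabilities)
    if any(not 0.0 <= p <= 1.0 for p in probs):
        raise ValueError("tail probabilities must lie in [0, 1]")
    return math.prod(probs)

# One weakly abnormal dimension is unconvincing on its own;
# three independent ones make an innocent explanation far less likely.
one_dimension = joint_false_alarm_probability([0.05])
three_dimensions = joint_false_alarm_probability([0.05, 0.05, 0.05])
```

The independence assumption does the real work here: correlated dimensions would make the product overstate the confidence.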

2.1 Framework sketch

The pipeline consists of three serial modules: (1) a foundation model shared across all business lines that learns universal patterns from massive, long‑term data; (2) an offline event‑mining module that heuristically searches for high‑impact events; (3) an alignment model that integrates business‑specific signals for online streaming filtering.

2.2 Basic assumptions

Assumption 1: Fraud always increases traffic volume. Assumption 2: The distribution of fraudulent traffic deviates from normal patterns. Together, these assumptions guide the heuristic search toward events that exhibit significant volume spikes.

2.3 Offline & online usage

Offline, the system defines suitable data structures and performs heuristic searches to discover events with a sufficiently large, independent increase in volume. Online, a lightweight, stateless ensemble function is updated every round so that filtering adapts at high frequency.

3. Foundation model – Event mining

3.1 What is event mining?

Event mining heuristically identifies sub‑spaces where fraudulent traffic is most pronounced, even without labeled data. It determines the upper bound of recall for the DeepString pipeline.

3.2 Comparison with traditional ML

Unlike supervised models that memorize fixed fraud probabilities, event mining focuses on discovering new, high‑impact dimensions, reducing reliance on handcrafted features.

3.3 Mining methods

For tabular data, parallel trees (e.g., XGBoost) generate multiple independent event spaces. For sequential or graph data, sequence models produce embeddings that are projected into discrete event groups.
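
The tabular case can be sketched as follows: each tree maps a record to a leaf, and the pair (tree index, leaf id) is treated as one discrete event, so several trees yield several event spaces. This is a minimal stand-in, not the platform's implementation. It uses randomly drawn oblivious trees in pure Python rather than trained XGBoost trees, and all function names are hypothetical.

```python
import random

def build_random_tree(n_features, depth, rng, lo=0.0, hi=1.0):
    """Return one (feature_index, threshold) split per level.

    This is an oblivious tree: every node on a level shares the same
    split, so a leaf id is just the bit string of the split outcomes.
    Splits are random here; a trained model would learn them from data.
    """
    return [(rng.randrange(n_features), rng.uniform(lo, hi)) for _ in range(depth)]

def leaf_id(tree, record):
    """Walk the tree and encode the path as an integer in [0, 2**depth)."""
    leaf = 0
    for feature, threshold in tree:
        leaf = (leaf << 1) | int(record[feature] > threshold)
    return leaf

def events_for(trees, record):
    """Map a record to one discrete event per tree: (tree_index, leaf_id)."""
    return [(index, leaf_id(tree, record)) for index, tree in enumerate(trees)]

rng = random.Random(42)
trees = [build_random_tree(n_features=3, depth=4, rng=rng) for _ in range(5)]
record_events = events_for(trees, [0.2, 0.7, 0.4])
```

Counting how often each (tree, leaf) pair occurs per time window then gives the event frequencies that the anomaly measurement step works on.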

4. Foundation model – Event anomaly measurement

4.1 Estimating normal event probability

By learning spatio‑temporal business regularities, the system estimates the baseline probability of each event, enabling deviation detection.
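
The source does not say how the baseline is learned. As a deliberately simple stand-in for a learned model, the sketch below estimates each event's normal share of traffic per hour of day from historical counts, with additive smoothing so that unseen events keep a non-zero probability. The hour-of-day bucketing, the smoothing constant `alpha`, and the function name are assumptions made for illustration.

```python
from collections import Counter

def baseline_probabilities(history, n_events, alpha=1.0):
    """Estimate P(event | hour) from past traffic.

    history:  list of (hour_of_day, event_id) pairs, event_id in [0, n_events)
    alpha:    additive (Laplace) smoothing constant
    returns:  {hour: [p_0, ..., p_{n_events - 1}]}, each list summing to 1
    """
    pair_counts = Counter(history)
    hour_totals = Counter(hour for hour, _ in history)
    table = {}
    for hour, total in hour_totals.items():
        denominator = total + alpha * n_events
        table[hour] = [
            (pair_counts[(hour, event)] + alpha) / denominator
            for event in range(n_events)
        ]
    return table

history = [(9, 0), (9, 0), (9, 1), (21, 1)]
baseline = baseline_probabilities(history, n_events=2)
```

Observed frequencies in a new window can then be compared against `baseline[hour][event]` to detect deviation.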

4.2 Computing deviation

Various statistical tools (binomial test, Chernoff bound, t‑test, explained variance) quantify the gap between observed and expected frequencies, feeding into downstream scoring.
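
Two of the tools named above can be sketched in a few lines, assuming an event is expected with baseline probability `p` and is observed `k` times in `n` requests. The exact one-sided binomial tail gives the probability of seeing at least `k` occurrences under normal traffic, and the Chernoff-Hoeffding bound gives a cheap upper bound on the same quantity. The function names are hypothetical, and how DeepString actually turns these values into scores is not described in the source.

```python
import math

def binomial_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p): a one-sided binomial test."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return sum(
        math.comb(n, i) * p**i * (1 - p) ** (n - i)
        for i in range(k, n + 1)
    )

def chernoff_upper_bound(k, n, p):
    """Chernoff-Hoeffding bound: P(X >= k) <= exp(-n * KL(k/n || p)).

    The bound is only informative when the observed rate k/n exceeds p;
    otherwise the trivial bound 1.0 is returned.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    q = k / n
    if q <= p:
        return 1.0
    if q == 1.0:
        return p**n  # KL(1 || p) = log(1 / p)
    kl = q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))
    return math.exp(-n * kl)

# An event expected in 10% of requests shows up in 30 of 100.
exact_tail = binomial_upper_tail(30, 100, 0.1)
bound = chernoff_upper_bound(30, 100, 0.1)
```

The exact tail costs time linear in `n`, while the bound is constant-time, which is one plausible reason to keep both available.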

5. Alignment model – Integration

5.1 Ensemble approach

The scores produced by different modules are treated as independent random variables and combined with a non-negative linear model. Enlarging the ensemble reduces the variance of the combined score.
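
A minimal sketch of such a combination, assuming each module emits one numeric score per request: the combiner is a pure function of scores and weights, which keeps it stateless, and it rejects negative weights so that no module can vote against the others. The small simulation illustrates the variance claim for equal weights on independent unit-variance scores, where the variance of the average falls roughly as 1/n. Names and the Gaussian noise model are assumptions for illustration only.

```python
import random
import statistics

def combine(scores, weights):
    """Non-negative linear combination of per-module anomaly scores."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    return sum(w * s for w, s in zip(weights, scores))

def equal_weight_variance(n_members, n_trials=2000, seed=0):
    """Empirical variance of an equally weighted ensemble of independent
    unit-variance scores, estimated by simulation."""
    rng = random.Random(seed)
    weights = [1.0 / n_members] * n_members
    outputs = [
        combine([rng.gauss(0.0, 1.0) for _ in range(n_members)], weights)
        for _ in range(n_trials)
    ]
    return statistics.pvariance(outputs)

single = equal_weight_variance(1)
sixteen = equal_weight_variance(16)
```

With sixteen members the simulated variance comes out at a small fraction of the single-member value, in line with the 1/n behaviour expected for independent scores.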

5.2 Bias calibration

A calibration module maps confidence scores to a uniform distribution, ensuring consistent thresholds across updates and preventing drift during traffic spikes.
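
One common way to map scores onto a uniform distribution is an empirical CDF (rank) transform fitted on a reference window of scores. The sketch below assumes that approach; the source does not state which method the calibration module uses, and the class name is hypothetical.

```python
import bisect

class UniformCalibrator:
    """Map raw scores to [0, 1] via the empirical CDF of reference scores.

    If new scores follow the same distribution as the reference window,
    the calibrated values are approximately uniform, so a fixed threshold
    such as 0.99 always flags roughly the top 1% of traffic.
    """

    def __init__(self, reference_scores):
        if not reference_scores:
            raise ValueError("need at least one reference score")
        self._sorted = sorted(reference_scores)

    def __call__(self, score):
        # Fraction of reference scores that are <= the given score.
        rank = bisect.bisect_right(self._sorted, score)
        return rank / len(self._sorted)

calibrate = UniformCalibrator([10, 20, 30, 40])
calibrated = calibrate(25)
```

Because the threshold is applied to the calibrated value rather than the raw score, swapping in an updated ensemble only requires refitting the calibrator on a recent reference window.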

6. Conclusion and outlook

DNA's DeepString framework demonstrates a sustainable, large-model-driven solution for advertising anti-fraud. Future work includes fully eliminating the dependence on manual experience, improving heterogeneous graph search, and enhancing cross-scenario knowledge sharing.
