Designing Machine Learning Models for Fraud Detection: Sampling, Feature Engineering, and Evaluation
This article explains how Airbnb's Trust & Safety team builds machine‑learning models to detect fraudulent behavior, covering problem definition, role‑based sampling, feature design techniques such as normalization and CP‑coding, and the trade‑offs between precision and recall in model evaluation.
Airbnb’s Trust & Safety team focuses on protecting users and the company from various risks, especially fraud, by building machine‑learning models that identify risky behavior such as refund abuse.
This article outlines the thought process behind constructing such models, using a fictional scenario of predicting whether a character is a "villain" to illustrate the steps.
What Are We Trying to Predict?
The first step in building any model is to define the prediction target clearly. Here the target is a binary label for each character: villain (the positive class, the one the model is meant to catch) or good (the negative class), based on the character's actions over time.
How to Simulate Scoring?
The training set must reflect a character’s behavior across multiple time periods, resulting in a dataset where each row represents a character‑period pair, as shown in the accompanying diagram.
Time intervals are not required to be continuous; only moments with significant events are considered.
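A minimal sketch of how such a character‑period table might be assembled is shown here. The event log, the field names, and the choice of a running total as the feature are all assumptions made for illustration; the source does not specify them. The idea is that each row only uses information available up to that period, which mimics what the model would see at scoring time.

```python
from collections import defaultdict

# Hypothetical event log: (character, period, soldiers_recruited).
# Periods are not consecutive; only periods with an event appear.
events = [
    ("char_a", 1, 100),
    ("char_a", 4, 250),
    ("char_b", 2, 50),
    ("char_b", 3, 20),
    ("char_b", 7, 30),
]


def build_character_period_rows(events):
    """Return one row per (character, period) with a running total as the feature."""
    per_character = defaultdict(lambda: defaultdict(int))
    for character, period, soldiers in events:
        per_character[character][period] += soldiers

    rows = []
    for character, by_period in per_character.items():
        soldiers_to_date = 0
        for period in sorted(by_period):
            # Only events up to and including this period contribute.
            soldiers_to_date += by_period[period]
            rows.append(
                {
                    "character": character,
                    "period": period,
                    "soldiers_to_date": soldiers_to_date,
                }
            )
    return rows


rows = build_character_period_rows(events)
for row in rows:
    print(row)
```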
Sampling
Standard row‑based down‑sampling can split a character’s observations between training and validation sets, leading to incomplete character representations.
To avoid this, the team adopts role‑based sampling, that is, sampling at the character level rather than the row level, so that all periods for a given character stay together in either the training or the validation split.
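One way to keep a character's rows together is to split the list of characters first and then assign rows by character. The sketch below uses only the standard library; the function name, the `validation_fraction` parameter, and the sample rows are hypothetical and not taken from the source.

```python
import random


def split_by_character(rows, validation_fraction=0.25, seed=0):
    """Split rows so that every character lands entirely in one split."""
    characters = sorted({row["character"] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(characters)

    n_validation = max(1, round(len(characters) * validation_fraction))
    validation_characters = set(characters[:n_validation])

    train = [r for r in rows if r["character"] not in validation_characters]
    validation = [r for r in rows if r["character"] in validation_characters]
    return train, validation


# Made-up character-period rows for illustration.
rows = [
    {"character": "char_a", "period": 1},
    {"character": "char_a", "period": 4},
    {"character": "char_b", "period": 2},
    {"character": "char_c", "period": 3},
    {"character": "char_c", "period": 5},
    {"character": "char_d", "period": 6},
]

train, validation = split_by_character(rows)
train_characters = {r["character"] for r in train}
validation_characters = {r["character"] for r in validation}
print(sorted(train_characters), sorted(validation_characters))
```

The same idea applies to down‑sampling: dropping whole characters instead of individual rows leaves each remaining character's history intact.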
Feature Design
Effective feature engineering starts with a deep understanding of the data. Examples include feature normalization and handling of categorical variables.
Normalization can be illustrated by scaling soldier counts by the number of years a character has ruled, providing more comparable features.
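As a small illustration of this kind of normalization, the sketch below divides a raw soldier count by years ruled. The field names, the numbers, and the choice to return 0.0 when a character has not ruled for any time are assumptions for the example, not details from the source.

```python
def soldiers_per_year(soldiers, years_ruled):
    """Scale a raw soldier count by the length of the character's rule."""
    if years_ruled <= 0:
        # Assumption: no time ruled means no meaningful rate.
        return 0.0
    return soldiers / years_ruled


# Made-up values: the long-ruling character has more soldiers in total,
# but the short-ruling one is building an army twice as fast.
long_rule = soldiers_per_year(soldiers=10000, years_ruled=20)
short_rule = soldiers_per_year(soldiers=3000, years_ruled=3)
print(long_rule, short_rule)
```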
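For categorical features, one‑hot encoding is common, but Conditional Probability coding (CP‑coding) often works better for high‑cardinality categories. It replaces each level with a single number: the probability of the target label given that level, estimated from the training data.
Levels with very few observations give noisy estimates. To reduce that noise, the estimate can be smoothed, for example by taking a weighted average of the level's own probability and the global probability.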
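The sketch below shows one possible form of smoothed CP‑coding, where each level's estimate is pulled toward the global rate. The `prior_weight` parameter, the function names, and the data are assumptions for illustration; the source does not give a specific smoothing formula. Here a label of 1 marks the class being detected.

```python
from collections import Counter


def fit_cp_coding(levels, labels, prior_weight=2.0):
    """Map each level to a smoothed estimate of P(label == 1 | level).

    The estimate is a weighted average of the level's own rate and the
    global rate; prior_weight acts like a number of pseudo-observations.
    """
    global_rate = sum(labels) / len(labels)
    totals = Counter(levels)
    positives = Counter(level for level, y in zip(levels, labels) if y == 1)
    mapping = {
        level: (positives[level] + prior_weight * global_rate) / (n + prior_weight)
        for level, n in totals.items()
    }
    return mapping, global_rate


# Made-up training data: level "B" appears only once.
levels = ["A"] * 4 + ["B"] + ["C"] * 5
labels = [1, 1, 1, 0] + [1] + [0, 0, 0, 0, 1]

mapping, global_rate = fit_cp_coding(levels, labels)


def encode(level):
    """Encode a level, falling back to the global rate for unseen levels."""
    return mapping.get(level, global_rate)


for level in ["A", "B", "C", "unseen"]:
    print(level, round(encode(level), 3))
```

Without smoothing, the single observation of level "B" would be encoded as 1.0; with it, the value moves toward the global rate of 0.5. The mapping should be fitted on training rows only, since it is computed from the labels.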
Model Performance Evaluation
When evaluating the model, it is important to consider the imbalance between positive and negative characters. The data is organized as [character*period] rows, but evaluation should be performed at the character level.
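Precision, TP/(TP+FP), is the share of predicted positives that are truly positive; recall, TP/(TP+FN), is the share of actual positives that the model catches. Both follow from the confusion‑matrix counts of true positives (TP), false positives (FP), and false negatives (FN).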
For fraud detection, a higher recall is usually preferred, even at the cost of some precision.
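A rough sketch of character‑level evaluation follows. It treats the villain label as the class being detected and assumes that a character is flagged if any of its period rows scores at or above the threshold; that aggregation rule, the scores, and the names are all hypothetical, chosen only to show how lowering the threshold trades precision for recall.

```python
def character_level_metrics(rows, threshold):
    """Aggregate row scores per character, then compute precision and recall."""
    max_score = {}
    is_villain = {}
    for character, score, label in rows:
        # Assumption: a character's score is the maximum over its periods.
        max_score[character] = max(score, max_score.get(character, 0.0))
        is_villain[character] = label

    flagged = {c for c, s in max_score.items() if s >= threshold}
    tp = sum(1 for c in flagged if is_villain[c])
    fp = sum(1 for c in flagged if not is_villain[c])
    fn = sum(1 for c in is_villain if is_villain[c] and c not in flagged)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up (character, model score, is_villain) rows, one per character-period.
rows = [
    ("villain_1", 0.90, True),
    ("villain_1", 0.40, True),
    ("villain_2", 0.60, True),
    ("villain_3", 0.30, True),
    ("villain_3", 0.45, True),
    ("hero_1", 0.10, False),
    ("hero_1", 0.20, False),
    ("hero_2", 0.50, False),
    ("hero_3", 0.05, False),
]

strict = character_level_metrics(rows, threshold=0.55)
lenient = character_level_metrics(rows, threshold=0.40)
print("threshold 0.55:", strict)
print("threshold 0.40:", lenient)
```

On this toy data the stricter threshold flags only true villains but misses one of the three, while the lower threshold catches all three at the cost of one false alarm.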
Conclusion
Building a good model requires understanding the data context, creating meaningful features, and carefully handling sampling and evaluation; there is no one‑size‑fits‑all solution.
