
Few-Shot Learning, Data Augmentation, and Multi‑Task Learning for Safety Modeling in Ride‑Hailing Platforms

This article presents Didi's exploration of few‑shot learning, data augmentation, semi‑supervised self‑training, and multi‑task learning to address the scarcity of labeled samples in safety and governance scenarios, demonstrating practical solutions and performance gains across several risk‑detection tasks.

DataFunTalk

In ride‑hailing platforms such as Didi, passenger and driver safety is the bottom line of the business, yet the lack of accurate, large‑scale labeled samples limits model performance. To overcome this, Didi applied few‑shot learning across its governance and safety domains, building a systematic solution.

Related Work

Few‑shot learning studies how to solve machine‑learning tasks with limited supervised data, often overlapping with semi‑supervised learning, which leverages abundant unlabeled data.

Following the taxonomy of Wang et al., few‑shot techniques are categorized into three groups: data‑level methods (using prior knowledge for data augmentation), model‑level methods (reducing hypothesis space via prior knowledge), and algorithm‑level methods (improving parameter search strategies).

Data Augmentation

Data augmentation is a low‑complexity, widely used approach. For images, common operations include flipping, rotation, and scaling; for text, synonym replacement, random insertion, random swap, and random deletion are applied. With augmentation, a model trained on only 50% of the original samples can match the accuracy of a model trained on the full dataset.
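The text operations above can be sketched in a few lines. This is a minimal illustration, not Didi's production pipeline: the tiny synonym table stands in for a real thesaurus, and the ride‑hailing words in it are assumptions for the example.

```python
import random

# Toy synonym table; a real system would use a thesaurus or embeddings.
SYNONYMS = {
    "ride": ["trip", "journey"],
    "driver": ["chauffeur"],
    "late": ["delayed"],
}

def synonym_replace(tokens, n=1, rng=random):
    """Replace up to n tokens that have entries in the synonym table."""
    out = list(tokens)
    candidates = [i for i, t in enumerate(out) if t in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out

def random_swap(tokens, rng=random):
    """Swap two random positions in the sentence."""
    out = list(tokens)
    if len(out) >= 2:
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_delete(tokens, p=0.1, rng=random):
    """Drop each token with probability p, keeping at least one token."""
    out = [t for t in tokens if rng.random() > p]
    return out or [rng.choice(tokens)]

def augment(sentence, n_aug=4, seed=0):
    """Generate n_aug noisy variants of one labeled sentence."""
    rng = random.Random(seed)
    tokens = sentence.split()
    ops = [synonym_replace, random_swap,
           lambda t, rng=rng: random_delete(t, rng=rng)]
    return [" ".join(rng.choice(ops)(tokens, rng=rng)) for _ in range(n_aug)]
```

Each labeled sentence yields several perturbed variants that keep the original label, which is where the effective sample-size gain comes from.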

Semi‑supervised learning, such as self‑training, further expands the dataset with weakly supervised labels. The self‑training pipeline involves training an initial model on a few labeled samples, predicting pseudo‑labels for unlabeled data, selecting high‑confidence predictions, and iteratively retraining.
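The loop above can be sketched generically. The nearest‑centroid base learner and the 1‑D toy data below are stand‑ins chosen for brevity, not the models Didi actually uses; only the train → pseudo‑label → filter → retrain loop is the point.

```python
# Minimal self-training loop; NearestCentroid is a stand-in base learner.
class NearestCentroid:
    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            pts = [x for x, l in zip(X, y) if l == label]
            self.centroids[label] = sum(pts) / len(pts)
        return self

    def predict_proba(self, X):
        # Confidence from relative distance to the centroids (binary case).
        probs = []
        for x in X:
            d = {l: abs(x - c) for l, c in self.centroids.items()}
            total = sum(d.values()) or 1e-9
            probs.append({l: 1 - dist / total for l, dist in d.items()})
        return probs

def self_train(X_lab, y_lab, X_unlab, threshold=0.9, rounds=3):
    """Iteratively absorb high-confidence pseudo-labels into the train set."""
    X, y = list(X_lab), list(y_lab)
    pool = list(X_unlab)
    for _ in range(rounds):
        model = NearestCentroid().fit(X, y)
        keep = []
        for x, p in zip(pool, model.predict_proba(pool)):
            label, conf = max(p.items(), key=lambda kv: kv[1])
            if conf >= threshold:       # keep only confident pseudo-labels
                X.append(x); y.append(label)
            else:
                keep.append(x)
        if len(keep) == len(pool):      # nothing new absorbed: stop early
            break
        pool = keep
    return NearestCentroid().fit(X, y), len(X) - len(X_lab)
```

The confidence threshold is the key knob: too low and label noise accumulates across rounds, too high and the unlabeled pool is never used.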

Model Techniques

Model‑level methods reduce the parameter search space. Multi‑task learning (MTL) shares embeddings and hidden layers between a primary task (e.g., fault detection) and an auxiliary task (e.g., complaint classification), effectively increasing effective sample size.
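A hard‑parameter‑sharing setup like this can be sketched in NumPy: one shared hidden layer feeds a primary fault head and an auxiliary complaint head, and the two losses are combined. The layer sizes, the three complaint classes, and the weight `alpha` are illustrative assumptions, not Didi's production architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_HID = 8, 4
W_shared = rng.normal(size=(D_IN, D_HID))   # layer shared by both tasks
W_fault = rng.normal(size=(D_HID, 1))       # primary head: fault score
W_complaint = rng.normal(size=(D_HID, 3))   # auxiliary head: 3 complaint types

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    h = np.tanh(x @ W_shared)               # shared representation
    fault = sigmoid(h @ W_fault)
    complaint = np.exp(h @ W_complaint)
    complaint /= complaint.sum(axis=1, keepdims=True)   # row-wise softmax
    return fault, complaint

def mtl_loss(x, y_fault, y_complaint, alpha=0.3):
    fault, complaint = forward(x)
    # Binary cross-entropy for the rare primary task.
    bce = -np.mean(y_fault * np.log(fault) + (1 - y_fault) * np.log(1 - fault))
    # Cross-entropy for the abundant auxiliary task, down-weighted by alpha.
    ce = -np.mean(np.log(complaint[np.arange(len(x)), y_complaint]))
    return bce + alpha * ce
```

Gradients from the abundant auxiliary task flow into `W_shared`, so the rare primary task effectively trains its representation on far more data than it has labels for.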

Business Applications

Sexual harassment order detection: By inserting key resistance phrases into ASR transcripts, a data‑augmented dataset of 120k positive samples was created, improving a TextCNN model's performance.

Fee‑complaint driver responsibility: Self‑training expanded the training set, reducing log‑loss by over 20% compared to a baseline XGBoost model.

Route‑detour interception: An MTL model sharing embeddings between a large‑scale complaint task and a rare fault task achieved higher ROC‑AUC and better precision‑recall at operational thresholds than the production XGBoost model.

Fine‑Tuning with Weak Samples

For severe traffic accidents, a pre‑training stage on abundant minor‑injury cases followed by fine‑tuning on rare severe cases improved recall at low impact‑area thresholds for both DNN and XGBoost models.
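The two‑stage recipe can be sketched with a logistic model on synthetic data: pre‑train on the abundant "minor" task, then continue training the same weights on a handful of "severe" samples with a smaller learning rate. The data, feature dimensions, and hyperparameters below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def train_logistic(X, y, w=None, lr=0.1, epochs=200):
    """Full-batch gradient descent on binary cross-entropy."""
    w = np.zeros(X.shape[1]) if w is None else w.copy()
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

# Abundant "minor-injury" task: 200 labeled samples.
X_minor = rng.normal(size=(200, 3))
y_minor = (X_minor @ np.array([1.0, -1.0, 0.5]) > 0).astype(float)

# Rare "severe" task: only 10 labels, with a slightly shifted boundary.
X_severe = rng.normal(size=(10, 3))
y_severe = (X_severe @ np.array([1.2, -1.0, 0.3]) > 0).astype(float)

w_pre = train_logistic(X_minor, y_minor)                     # pre-train
w_ft = train_logistic(X_severe, y_severe, w=w_pre, lr=0.01)  # fine-tune

acc_pre = ((1 / (1 + np.exp(-X_minor @ w_pre)) > 0.5) == y_minor).mean()
acc = ((1 / (1 + np.exp(-X_severe @ w_ft)) > 0.5) == y_severe).mean()
```

The small fine‑tuning learning rate is what keeps the severe‑case model close to the representation learned from the abundant minor cases, rather than overfitting ten samples from scratch.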

Conclusion

By combining few‑shot learning, data augmentation, semi‑supervised self‑training, and multi‑task learning, Didi reduced labeling costs and achieved measurable performance gains across multiple safety‑critical scenarios, offering practical guidance for other organizations facing limited labeled data.

Tags: data augmentation, AI, multi-task learning, few-shot learning, semi-supervised learning, ride-hailing safety
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
