
FedMix: Boosting Vertical Federated Learning with Data Mixture

This paper introduces FedMix, a method that enhances vertical federated learning by mixing aligned and unaligned data, theoretically demonstrating the value of unaligned data and empirically achieving over 10% ROI improvement and significant AUC gains while keeping computational and communication overhead low.

Tencent Advertising Technology

Vertical Federated Learning (VFL) partitions the features of the same samples across multiple parties, requiring Private Set Intersection (PSI) to align samples before training; this alignment step discards the large fraction of samples that appear on only one side. The authors theoretically show that this unaligned data still carries valuable information and propose FedMix, which mixes aligned and unaligned data during training to improve model performance while keeping communication and computation costs low.

Modern privacy regulations such as GDPR create data silos, making it essential to develop methods that preserve privacy while enabling collaborative analytics. VFL addresses this need but suffers from limited usable data because only aligned samples and their labels can be used, while the majority of unaligned data is wasted.

Through a theoretical analysis based on distribution‑shift and Wasserstein distance bounds, the paper proves that incorporating unaligned data reduces the upper bound on model error, indicating that unaligned data can indeed boost VFL performance.
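The bound the analysis relies on can be illustrated with a standard Wasserstein-based form from domain adaptation (the paper's exact statement and constants may differ; this is only a sketch of the argument's shape):

```latex
% Illustrative Wasserstein-based error bound (generic domain-adaptation form;
% the paper's precise theorem may use different terms and constants).
% For a K-Lipschitz hypothesis h, source distribution D_S (the data actually
% trained on) and target distribution D_T (the full data distribution):
\epsilon_T(h) \;\le\; \epsilon_S(h) \;+\; 2K \cdot W_1(\mathcal{D}_S, \mathcal{D}_T) \;+\; \lambda
```

Under a bound of this shape, mixing in unaligned data moves the effective training distribution closer to the full data distribution, shrinking the Wasserstein term and hence the upper bound on target error.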

The proposed FedMix framework consists of two key components: a Data Mixer and Data Seasoning. The Data Mixer applies a mixup‑style augmentation, combining each aligned sample with randomly selected unaligned samples using a weight drawn from a Beta distribution. Two mixing strategies are explored: OA (one unaligned sample per mix) and FT (multiple unaligned samples). Data Seasoning handles unlabeled unaligned data by semi‑supervised pseudo‑labeling, mixing each unlabeled sample with an aligned one and using the second‑ranked inference label as its pseudo‑label.
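The Data Mixer's OA strategy can be sketched as a mixup-style blend of one aligned sample with one randomly drawn unaligned sample, weighted by a Beta-distributed coefficient. This is an illustrative sketch only: the function name, parameter names, and the choice of `alpha` are assumptions, not taken from the paper.

```python
import numpy as np

def mix_aligned_unaligned(x_aligned, x_unaligned_pool, alpha=0.2, rng=None):
    """Mixup-style OA augmentation sketch: blend one aligned sample with a
    randomly selected unaligned sample using a weight drawn from Beta(alpha, alpha).
    (Names and alpha default are illustrative assumptions, not the paper's.)"""
    rng = rng if rng is not None else np.random.default_rng()
    lam = rng.beta(alpha, alpha)                # mixing weight in [0, 1]
    idx = rng.integers(len(x_unaligned_pool))   # pick one unaligned sample (OA)
    mixed = lam * x_aligned + (1.0 - lam) * x_unaligned_pool[idx]
    return mixed, lam
```

The FT strategy would extend this by averaging over several unaligned samples instead of one; the returned weight `lam` would then also be used to mix the corresponding labels or losses, as in standard mixup training.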

Complexity analysis shows that FedMix adds only minor overhead for sample selection and forward propagation; communication and computation remain comparable to vanilla VFL because the number of participating parties is limited and the extra operations are lightweight.

Extensive experiments on four datasets (DefaultCredit, Criteo, FashionMNIST, CIFAR‑10) against three baselines (Vanilla, FedDA, VFLFS) demonstrate that FedMix consistently achieves higher AUC and lower training time. It also proves robust under varying feature‑distribution imbalances and different proportions of unaligned or unlabeled data.

Ablation studies compare OA versus FT and evaluate the individual contributions of Data Mixer and Data Seasoning. Both components improve performance independently, with the Data Mixer providing the primary gain; Data Seasoning adds further benefit when unlabeled data is abundant.

In conclusion, FedMix effectively leverages unaligned data in VFL scenarios, delivering significant accuracy improvements without incurring substantial additional costs. The method has been deployed in Tencent’s advertising federated learning projects, achieving more than a 2% increase in model AUC.

privacy · Federated Learning · data mixture · vertical federated learning
Written by

Tencent Advertising Technology

Official hub of Tencent Advertising Technology, sharing the team's latest cutting-edge achievements and advertising technology applications.
