How WASP Generates High‑Quality DP Synthetic Data with Multi‑Model Collaboration

WASP is a privacy‑preserving framework that fuses multiple pretrained language models through a weighted Top‑Q voting scheme to synthesize differentially private data. It markedly improves downstream task performance even when only a few hundred private samples are available, and it scales to federated settings.

AsiaInfo Technology: New Tech Exploration

Introduction

Rapid advances in large language models (LLMs) and small task‑specific models (STMs) rely on abundant high‑quality training data, yet real‑world data are often scarce and contain sensitive user information, raising serious privacy concerns.

Challenges of Existing DP Data Synthesis

Private sample scarcity: Current Private Evolution (PE) methods need thousands of private samples; in practice only hundreds may be available, leading to noisy guidance.

Synthetic data noise: PE methods still generate low‑quality samples that hurt downstream models.

PLM selection risk: Different pretrained language models (PLMs) perform variably across tasks, but existing PE works focus on a single PLM.

WASP Framework Overview

WASP addresses the above challenges with a weighted multi‑PLM fusion and contrastive DP data synthesis pipeline. The core ideas are:

Extend PE’s Top‑1 voting to a Top‑Q decaying‑weight voting to improve private distribution estimation.

Use the voting results to separate high‑ and low‑quality synthetic samples, constructing contrastive prompts that guide generation toward high‑quality regions.

Dynamically weight each PLM based on its similarity to private data, allowing stronger PLMs to contribute more samples.

The code is publicly available at https://github.com/Lindalydia/WASP.

Theoretical Foundations

Differential Privacy (DP): Two datasets D and D′ are adjacent if they differ in a single record. A mechanism M satisfies (ε, δ)‑DP when, for any adjacent D, D′ and any output set E, Pr[M(D) ∈ E] ≤ e^ε · Pr[M(D′) ∈ E] + δ. Post‑processing incurs no additional privacy loss.

Gaussian Mechanism: Adding Gaussian noise 𝒩(0, σ²) to a statistic with sensitivity Δ achieves (ε, δ)‑DP when σ = Δ·√(2·ln(1.25/δ))/ε.
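To make the calibration concrete, here is a minimal sketch of the Gaussian mechanism in Python. The helper names (`gaussian_sigma`, `gaussian_mechanism`) are illustrative, not from the WASP codebase; the formula is the standard one stated above.

```python
import math
import random

def gaussian_sigma(epsilon: float, delta: float, sensitivity: float) -> float:
    """Noise scale for the classic (eps, delta) Gaussian mechanism."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

def gaussian_mechanism(value: float, epsilon: float, delta: float,
                       sensitivity: float) -> float:
    """Release `value` with Gaussian noise calibrated to its sensitivity."""
    sigma = gaussian_sigma(epsilon, delta, sensitivity)
    return value + random.gauss(0.0, sigma)
```

Note how the noise scale grows as ε or δ shrink: with the paper's ε = 4.0 and δ = 1e‑5, a sensitivity‑1 statistic gets σ ≈ 1.21.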

Problem Definition

Given a small private dataset 𝔅 = { (z_j, u_j) }_{j=1}^M, WASP aims to generate a DP‑compliant synthetic dataset 𝔇 = { (x_i, y_i) }_{i=1}^N by orchestrating K black‑box PLM APIs. The synthetic data are then used to train a small task‑specific model (STM) m, whose performance is evaluated on an untouched real test set 𝔄.

Methodology

1. Weighted Parallel Data Generation

In each of T iterations, each PLM 𝒫_k generates N_k = ⌊(N/T)·w_k⌋ samples. The first iteration uses a zero‑shot prompt; subsequent iterations use contrastive prompts built from the selected high‑ and low‑quality samples.
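The per‑round allocation can be sketched in a few lines; the helper name below is hypothetical, but the arithmetic follows the N_k = ⌊(N/T)·w_k⌋ rule above (weights assumed to sum to 1):

```python
def per_round_allocation(total_n: int, rounds: int,
                         weights: list[float]) -> list[int]:
    """N_k = floor((N / T) * w_k) samples for PLM k in one round.

    `weights` are the current PLM importance weights w_k; stronger
    PLMs (larger w_k) are asked for more samples.
    """
    per_round = total_n / rounds
    return [int(per_round * w) for w in weights]
```

With the paper's setting of N = 6,000 and T = 5, three PLMs weighted (0.5, 0.25, 0.25) would contribute 600, 300, and 300 samples per round.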

2. Differential‑Privacy Top‑Q Voting

For each private sample (z_j, u_j), compute the ℓ₂ distance to every synthetic sample x_i of the same label: d(z_j, x_i) = ‖φ(z_j) − φ(x_i)‖₂. Select the Q nearest and Q farthest synthetic samples, and apply exponentially decaying weights (1, ½, …, ½^{Q−1}) to update the nearest (Hⁿ) and farthest (Hᶠ) histograms. Finally, add Gaussian noise 𝒩(0, σ²) with σ = 4·√(2·ln(1.25/δ_{iter}))·√(T−1)/ε to each histogram to preserve DP.
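The voting step above can be sketched as follows. This is an illustrative NumPy implementation under the stated definitions, not the official WASP code; `private_emb` and `synth_emb` stand in for the embedded samples φ(z_j) and φ(x_i) of one label.

```python
import numpy as np

def top_q_vote(private_emb, synth_emb, q, sigma, rng=None):
    """Decaying-weight Top-Q voting over synthetic samples (a sketch).

    private_emb: (M, d) embeddings of private samples of one label.
    synth_emb:   (N, d) embeddings of synthetic samples of that label.
    Returns noisy nearest/farthest histograms H_n, H_f of length N.
    """
    rng = rng or np.random.default_rng()
    n = synth_emb.shape[0]
    h_near = np.zeros(n)
    h_far = np.zeros(n)
    decay = 0.5 ** np.arange(q)           # weights 1, 1/2, ..., 1/2^(Q-1)
    for z in private_emb:
        dist = np.linalg.norm(synth_emb - z, axis=1)  # l2 distances
        order = np.argsort(dist)
        h_near[order[:q]] += decay        # Q nearest get decaying votes
        h_far[order[::-1][:q]] += decay   # Q farthest get decaying votes
    # Gaussian noise calibrated as in the text preserves DP
    h_near += rng.normal(0.0, sigma, size=n)
    h_far += rng.normal(0.0, sigma, size=n)
    return h_near, h_far
```

Because each private sample casts at most Q bounded votes per histogram, the sensitivity of the release is bounded, which is what makes the Gaussian noise calibration above valid.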

3. PLM Importance Weighting

Update each PLM’s weight based on the average similarity of its generated samples to private data:

w_k = ( Σ_{(x_i, y_i) ∈ 𝔇_k} s_i ) / ( |𝔇_k| / |𝔇| ), where s_i = Hⁿ[i] / Σ_{i′} Hⁿ[i′]
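In code, the re-weighting is a straightforward aggregation over the noisy nearest-vote histogram. This sketch assumes an `owner` array mapping each synthetic sample to the PLM that generated it (a bookkeeping detail not spelled out in the text), and clips negative noisy counts to zero before normalizing:

```python
import numpy as np

def plm_weights(h_near, owner, num_plms):
    """Re-weight PLMs from the noisy nearest-vote histogram (a sketch).

    h_near: length-N noisy nearest histogram over all synthetic samples.
    owner:  length-N array, owner[i] = index k of the PLM that made x_i.
    Implements w_k = (sum of s_i over D_k) / (|D_k| / |D|), with
    s_i = H_n[i] / sum(H_n), then renormalizes the weights to sum to 1.
    """
    h = np.clip(np.asarray(h_near, dtype=float), 0.0, None)
    s = h / h.sum()                       # per-sample similarity scores
    n = len(s)
    w = np.zeros(num_plms)
    for k in range(num_plms):
        mask = (np.asarray(owner) == k)
        share = mask.sum() / n            # |D_k| / |D|
        w[k] = s[mask].sum() / share if share > 0 else 0.0
    return w / w.sum()
```

Dividing by the PLM's share of samples makes w_k a "similarity mass per sample quota", so a PLM whose samples attract more than their proportional share of nearest votes is weighted up.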

4. Contrastive Cross‑PLM Context Learning

Construct a contrastive prompt 𝒯(·) that includes:

Analysis of differences between high‑ and low‑quality samples.

Constraints ensuring new samples are closer to high‑quality examples and farther from low‑quality ones.

Diversity encouragement to generate varied expressions.

Randomly sample 50 % of high‑ and low‑quality examples for each label to build the final context.
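The prompt construction described above might look like the following. All wording in the template is illustrative (the paper's exact prompt text is not given here); only the structure of the three ingredients and the 50% subsampling come from the text.

```python
import random

def build_contrastive_prompt(label, good, bad, frac=0.5, seed=None):
    """Assemble a contrastive in-context prompt (an illustrative sketch).

    `good`/`bad` are high-/low-quality synthetic texts for `label`;
    a random `frac` of each (0.5, as in the text) enters the context.
    """
    rng = random.Random(seed)
    g = rng.sample(good, max(1, int(len(good) * frac)))
    b = rng.sample(bad, max(1, int(len(bad) * frac)))
    lines = [f"Task: write a new '{label}' sample.",
             "High-quality examples (stay close to these):"]
    lines += [f"  + {t}" for t in g]
    lines += ["Low-quality examples (avoid these patterns):"]
    lines += [f"  - {t}" for t in b]
    lines += ["Analyze what separates the two groups, then generate a",
              "new, diverse sample closer to the high-quality ones."]
    return "\n".join(lines)
```

The same prompt template is shared across PLMs, which is what makes the context learning "cross‑PLM": every model sees high‑ and low‑quality examples drawn from the whole pool, not just its own generations.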

Experimental Setup

Models: Open‑source PLMs – GPT‑2‑xl, Llama‑2‑7b‑chat, Vicuna‑7b‑v1.5, OPT‑6.7b, ChatGLM3‑6b‑base, Flan‑T5‑xl; closed‑source PLMs – GPT‑3.5‑turbo‑instruct, GPT‑4‑turbo‑preview, GPT‑4o. STM – a BERT‑base‑uncased fine‑tuned classifier. Embedding model – sentence‑t5‑base.

Datasets : Six NLP tasks – IMDb, Yelp‑Category, Yelp‑Rating, OpenReview‑Category, OpenReview‑Rating, Banking77.

Baselines: Aug‑PE (single‑PLM PE), PrE‑Text (federated PE), OnlyPrivate (private‑only training), FuseGen (zero‑shot multi‑PLM), DP‑SGD+Gen (generation from a DP‑fine‑tuned PLM).

Implementation Details: Private sample count M = 100 (single‑site) or M = 300 (federated, L = 10 clients). Total synthetic samples N = 6,000 generated over T = 5 rounds. ε = 4.0, δ_{iter} = 1e‑5 unless otherwise noted.

Results

Single‑Site Scenario

WASP outperforms all baselines on every task; for example, OpenReview‑Rating accuracy improves by 1.68% over the best Aug‑PE result. It also remains robust when an unsuitable PLM is included, demonstrating PLM‑agnostic behavior, and achieves lower FID scores (i.e., a closer distributional match to the private data).

Federated Scenario

Compared with the federated baseline PrE‑Text, WASP consistently achieves higher accuracy across tasks, confirming its scalability to multi‑party settings.

Computation & Communication

Runtime is comparable to PE baselines, indicating no significant overhead.

Communication adds only L vectors of dimension N, a negligible increase.

Ablation Studies

Both contrastive context learning and dynamic PLM weighting contribute positively to downstream performance.

Increasing Q improves results up to a saturation point (Q≈8).

Performance degrades gracefully as ε becomes smaller (stricter privacy); at ε = 8.0 it approaches the non‑private regime.

Conclusion and Future Work

WASP introduces a novel DP synthetic data generation framework that leverages weighted multi‑PLM collaboration to overcome private sample scarcity, offering high efficiency, PLM‑agnosticism, scalability, and strong performance on challenging tasks.

Future directions include finer‑grained sample‑level weighting and extending the approach to non‑classification tasks such as generation and sequence labeling.

Tags: large language models, federated learning, privacy‑preserving AI, differential privacy, synthetic data, multi‑model fusion
Written by

AsiaInfo Technology: New Tech Exploration

AsiaInfo's cutting‑edge ICT viewpoints and industry insights, featuring its latest technology and product case studies.
