How WASP Generates High‑Quality DP Synthetic Data with Multi‑Model Collaboration
WASP is a privacy‑preserving framework that fuses multiple pretrained language models through a weighted Top‑Q voting scheme to synthesize differentially private data. It markedly improves downstream task performance even when only a few hundred private samples are available, and it scales to federated settings.
Introduction
Rapid advances in large language models (LLMs) and task‑specific models (STMs) rely on abundant high‑quality training data, yet real‑world data are often scarce and contain sensitive user information, raising serious privacy concerns.
Challenges of Existing DP Data Synthesis
Privacy sample scarcity: Current privacy‑enhanced (PE) methods need thousands of private samples; in practice only hundreds may be available, leading to noisy guidance.
Synthetic data noise: PE methods still generate low‑quality samples that hurt downstream models.
PLM selection risk: Different pretrained language models (PLMs) perform variably across tasks, but existing PE works focus on a single PLM.
WASP Framework Overview
WASP addresses the above challenges with a weighted multi‑PLM fusion and contrastive DP data synthesis pipeline. The core ideas are:
Extend PE’s Top‑1 voting to a Top‑Q decaying‑weight voting to improve private distribution estimation.
Use the voting results to separate high‑ and low‑quality synthetic samples, constructing contrastive prompts that guide generation toward high‑quality regions.
Dynamically weight each PLM based on its similarity to private data, allowing stronger PLMs to contribute more samples.
The code is publicly available at https://github.com/Lindalydia/WASP.
Theoretical Foundations
Differential Privacy (DP): Two datasets D and D′ are adjacent if they differ in a single record. A mechanism M satisfies (ε, δ)‑DP if, for any adjacent D, D′ and any output set E, Pr[M(D) ∈ E] ≤ e^{ε}·Pr[M(D′) ∈ E] + δ. Post‑processing incurs no additional privacy loss.
Gaussian Mechanism: Adding Gaussian noise 𝒩(0, σ²) to a statistic with sensitivity Δ achieves (ε, δ)‑DP when σ = Δ·√(2·ln(1.25/δ))/ε.
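As a quick sanity check on the formula above, here is a minimal sketch of the Gaussian mechanism; the function names are illustrative, not from the WASP codebase.

```python
import math
import random

def gaussian_sigma(sensitivity: float, epsilon: float, delta: float) -> float:
    """Noise scale for the Gaussian mechanism: sigma = Delta * sqrt(2*ln(1.25/delta)) / epsilon."""
    return sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon

def gaussian_mechanism(value: float, sensitivity: float, epsilon: float, delta: float) -> float:
    """Release a scalar statistic with (epsilon, delta)-DP by adding calibrated Gaussian noise."""
    return value + random.gauss(0.0, gaussian_sigma(sensitivity, epsilon, delta))
```

With Δ = 1, ε = 1, δ = 1e‑5, the noise scale comes out to roughly 4.84, which matches the σ = 4·√(2·ln(1.25/δ_iter))·√(T−1)/ε scale used later once the sensitivity‑4 voting and T−1 iterations are folded in.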
Problem Definition
Given a small private dataset 𝔅 = { (z_j, u_j) }_{j=1}^M, WASP aims to generate a DP‑compliant synthetic dataset 𝔇 = { (x_i, y_i) }_{i=1}^N by orchestrating K black‑box PLM APIs. The synthetic data are then used to train a small task‑specific model (STM) m, whose performance is evaluated on an untouched real test set 𝔄.
Methodology
1. Weighted Parallel Data Generation
In each of T iterations, each PLM 𝒫_k generates N_k = ⌊(N/T)·w_k⌋ samples. The first iteration uses a zero‑sample prompt; subsequent iterations use contrastive prompts built from selected high‑ and low‑quality samples.
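The per‑PLM budget N_k = ⌊(N/T)·w_k⌋ can be sketched as follows; the function name is illustrative, and the weights w_k are assumed to sum to roughly 1.

```python
import math

def allocate_samples(total_n: int, rounds: int, weights: list[float]) -> list[int]:
    """Per-round generation budget N_k = floor((N/T) * w_k) for each PLM."""
    per_round = total_n / rounds
    return [math.floor(per_round * w) for w in weights]
```

For example, with N = 6,000 samples over T = 5 rounds and three PLMs weighted 0.5, 0.25, 0.25, each round allocates 600, 300, and 300 samples respectively, so stronger PLMs contribute proportionally more data.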
2. Differential‑Privacy Top‑Q Voting
For each private sample (z_j, u_j), compute the ℓ₂ distance to each synthetic sample of the same label: d(z_j, x_i) = ‖φ(z_j) − φ(x_i)‖₂, where φ is the embedding model. Select the Q nearest and Q farthest synthetic samples, then apply exponentially decaying weights (1, ½, …, ½^{Q−1}) to update the nearest histogram Hⁿ and the farthest histogram Hᶠ. Finally, add Gaussian noise 𝒩(0, σ²) with σ = 4·√(2·ln(1.25/δ_iter))·√(T−1)/ε to both histograms to preserve DP.
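One voting round can be sketched in NumPy as below, assuming the φ‑embeddings of private and synthetic samples are precomputed; the function name and array shapes are illustrative rather than taken from the WASP implementation.

```python
import numpy as np

def top_q_voting(private_emb, synth_emb, q, epsilon, delta_iter, rounds):
    """DP Top-Q voting sketch: private_emb has shape (M, d), synth_emb (N, d).

    Each private sample casts decaying votes 1, 1/2, ..., 1/2^(Q-1) for its Q
    nearest and Q farthest synthetic samples; Gaussian noise is then added to
    both histograms to preserve DP.
    """
    n = synth_emb.shape[0]
    h_near = np.zeros(n)
    h_far = np.zeros(n)
    decay = 0.5 ** np.arange(q)                        # weights 1, 1/2, ..., 1/2^(Q-1)
    for z in private_emb:
        dist = np.linalg.norm(synth_emb - z, axis=1)   # l2 distance in embedding space
        order = np.argsort(dist)
        h_near[order[:q]] += decay                     # Q nearest synthetic samples
        h_far[order[::-1][:q]] += decay                # Q farthest synthetic samples
    # Noise scale from the text: sigma = 4*sqrt(2*ln(1.25/delta_iter))*sqrt(T-1)/epsilon
    sigma = 4 * np.sqrt(2 * np.log(1.25 / delta_iter)) * np.sqrt(rounds - 1) / epsilon
    rng = np.random.default_rng()
    return h_near + rng.normal(0.0, sigma, n), h_far + rng.normal(0.0, sigma, n)
```

The noisy Hⁿ and Hᶠ are then post‑processed freely (sample selection, PLM weighting) without further privacy cost, since post‑processing preserves DP.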
3. PLM Importance Weighting
Update each PLM’s weight based on the average similarity of its generated samples to private data:
w_k = ( Σ_{(x_i,y_i)∈𝔇_k} s_i ) / ( |𝔇_k| / |𝔇| ), where s_i = Hⁿ[i] / Σ_{i'} Hⁿ[i'] is the normalized vote score of synthetic sample x_i.
4. Contrastive Cross‑PLM Context Learning
Construct a contrastive prompt 𝒯(·) that includes:
Analysis of differences between high‑ and low‑quality samples.
Constraints ensuring new samples are closer to high‑quality examples and farther from low‑quality ones.
Diversity encouragement to generate varied expressions.
Randomly sample 50 % of high‑ and low‑quality examples for each label to build the final context.
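The prompt‑building steps above can be sketched as follows; the template wording is illustrative and not the paper's exact prompt, and the function name is a hypothetical helper.

```python
import random

def build_contrastive_prompt(high_quality, low_quality, label, sample_ratio=0.5):
    """Build a contrastive prompt from a random 50% subsample of the
    high- and low-quality examples selected by Top-Q voting for one label."""
    highs = random.sample(high_quality, max(1, int(len(high_quality) * sample_ratio)))
    lows = random.sample(low_quality, max(1, int(len(low_quality) * sample_ratio)))
    return (
        f"Task: write a new '{label}' sample.\n"
        "Good examples (stay close to these in style and content):\n"
        + "\n".join(f"- {s}" for s in highs)
        + "\nBad examples (avoid resembling these):\n"
        + "\n".join(f"- {s}" for s in lows)
        + "\nMake the new sample diverse in wording."
    )
```

Resampling 50% of the examples on each call keeps the context fresh across generation rounds, which supports the diversity goal stated above.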
Experimental Setup
Models : Open‑source PLMs – GPT‑2‑xl, Llama‑2‑7b‑chat, Vicuna‑7b‑v1.5, OPT‑6.7b, ChatGLM3‑6b‑base, Flan‑T5‑xl; Closed‑source PLMs – GPT‑3.5‑turbo‑instruct, GPT‑4‑turbo‑preview, GPT‑4o. STM – BERT‑base‑uncased fine‑tuned classifier. Embedding model – sentence‑t5‑base.
Datasets : Six NLP tasks – IMDb, Yelp‑Category, Yelp‑Rating, OpenReview‑Category, OpenReview‑Rating, Banking77.
Baselines : Aug‑PE (single‑PLM PE), Pre‑Text (federated PE), OnlyPrivate (private‑only training), FuseGen (zero‑sample multi‑PLM), DP‑SGD+Gen (DP‑fine‑tuned PLM generation).
Implementation Details : Private sample count M=100 (single‑site) or M=300 (federated, L=10). Total synthetic samples N=6,000 generated over T=5 rounds. ε=4.0, δ_{iter}=1e‑5 unless otherwise noted.
Results
Single‑Site Scenario
WASP outperforms all baselines on every task; for example, OpenReview‑Rating accuracy improves by 1.68 % over the best Aug‑PE result. It also remains robust when an unsuitable PLM is included, demonstrating PLM‑agnostic behavior, and achieves lower FID scores between synthetic and private data.
Federated Scenario
Compared with the federated baseline Pre‑Text, WASP consistently achieves higher accuracy across tasks, confirming its scalability to multi‑party settings.
Computation & Communication
Runtime is comparable to PE baselines, indicating no significant overhead.
Communication adds only L vectors of dimension N, a negligible increase.
Ablation Studies
Both contrastive context learning and dynamic PLM weighting contribute positively to downstream performance.
Increasing Q improves results up to a saturation point (Q≈8).
Performance degrades gracefully as the privacy budget ε tightens; at the looser budget ε=8.0, accuracy approaches the non‑private regime.
Conclusion and Future Work
WASP introduces a novel DP synthetic data generation framework that leverages weighted multi‑PLM collaboration to overcome private sample scarcity, offering high efficiency, PLM‑agnosticism, scalability, and strong performance on challenging tasks.
Future directions include finer‑grained sample‑level weighting and extending the approach to non‑classification tasks such as generation and sequence labeling.
AsiaInfo Technology: New Tech Exploration
AsiaInfo's cutting‑edge ICT viewpoints and industry insights, featuring its latest technology and product case studies.