How to Detect Test Set Contamination in Black‑Box Language Models
Researchers propose a black-box method for exposing test-set leakage in large language models: compare the log-probability a model assigns to benchmark data in its original ordering against shuffled orderings, estimate significance with Monte-Carlo sampling and a sharded likelihood-comparison test, and apply the procedure to several public models, including Mistral-7B.
Background
Large language models (LLMs) are pretrained on massive web corpora. During pretraining they can memorize entire benchmark test sets, which makes reported performance hard to trust: a contaminated model may appear to beat stronger baselines yet fail on genuinely unseen evaluations. Because the pretraining data and model weights are usually proprietary, detecting such test-set leakage must be done without direct access to the training corpus.
Goal
The paper proposes a black‑box procedure that can prove test‑set contamination in an LLM without requiring the original training data or the model parameters.
Core Idea
If a test example is randomly shuffled, a model that has memorized the original ordering will assign the answer a markedly lower log-probability, log(p), under the shuffled version than under the original. The size of this drop is used as a statistical signal of leakage.
Single‑example Test
Using the BoolQ benchmark as an illustration, the authors reorder the contents of a single test example. If the shuffled version causes a substantial drop in the answer's log-probability relative to the original ordering, the model has likely memorized that specific example rather than demonstrated genuine understanding.
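To make the comparison concrete, here is a minimal sketch of the shuffle-and-compare step, assuming local access to a small Hugging Face causal LM ("gpt2" is used purely as a stand-in) and a toy BoolQ-style string; the sequence_logprob helper and the example text are illustrative and not taken from the paper's code.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sequence_logprob(text):
    # total log-probability the model assigns to `text` (summed over predicted tokens)
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    n_predicted = ids.shape[1] - 1  # out.loss is the mean NLL over shifted tokens
    return -out.loss.item() * n_predicted

original = "Question: is the sky blue? Answer: yes"
shuffled = "Answer: yes Question: is the sky blue?"

gap = sequence_logprob(original) - sequence_logprob(shuffled)
print(f"log-prob gap (original - shuffled): {gap:.2f}")

A large positive gap across many examples is the kind of signal that the dataset-level statistic below aggregates.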
Dataset‑level Statistic
For a dataset X, let ℓ_orig be the model's log-probability of X in its original (canonical) ordering and let ℓ_perm be its log-probability of a random permutation of X. Define p = Pr[ℓ_perm ≥ ℓ_orig] over all possible permutations; under the null hypothesis of no contamination, every ordering is equally plausible and p is not unusually small. If p falls below a pre-specified significance level α, the test set is inferred to be contaminated.
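In code, the dataset-level statistic reduces to a corrected rank of ℓ_orig among sampled permutation log-probabilities. The snippet below is a sketch that assumes the log-probabilities have already been computed (for example with a helper like sequence_logprob above); the function name and toy numbers are illustrative.

import numpy as np

def permutation_pvalue(logp_orig, logp_perms):
    # estimate p = Pr[ℓ_perm >= ℓ_orig]; the +1 terms are the usual finite-sample correction
    logp_perms = np.asarray(logp_perms)
    return (1 + np.sum(logp_perms >= logp_orig)) / (1 + len(logp_perms))

# toy values: the canonical ordering is more likely than every sampled shuffle
print(permutation_pvalue(-1200.0, [-1250.0, -1248.0, -1261.0, -1255.0]))  # 0.2

With n sampled permutations the smallest achievable estimate is 1/(n + 1), which is why the Monte-Carlo procedure below samples tens of permutations per shard rather than a handful.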
Monte‑Carlo Estimation
Enumerating all permutations is infeasible for realistic datasets. The authors approximate p by Monte‑Carlo sampling with a finite‑sample correction:
Split the dataset X into m shards.
For each shard, draw a modest number of random permutations (e.g., 30–100).
Compute the log‑probability for the original ordering and for each sampled permutation.
Estimate the shard-level proportion of permutations whose log-probability is ≥ that of the original ordering.
Average the shard‑level proportions to obtain the overall estimate of p.
This estimator runs in linear time with respect to the dataset size, but the constant factor depends on the number of sampled permutations per shard.
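A minimal sketch of this shard-averaged Monte-Carlo estimator is given below, assuming a caller-supplied shard_logprob(examples) function that scores a list of examples, in the given order, under the model; the shard count, permutation budget, and helper name are illustrative defaults rather than values from the paper.

import random
import numpy as np

def estimate_p(examples, shard_logprob, num_shards=20, perms_per_shard=50, seed=0):
    # Monte-Carlo estimate of p = Pr[ℓ_perm >= ℓ_orig], averaged over shards
    rng = random.Random(seed)
    shards = [examples[i::num_shards] for i in range(num_shards)]
    shard_props = []
    for shard in shards:
        logp_orig = shard_logprob(shard)
        hits = 0
        for _ in range(perms_per_shard):
            perm = shard[:]
            rng.shuffle(perm)
            if shard_logprob(perm) >= logp_orig:
                hits += 1
        # per-shard proportion with a +1 finite-sample correction
        shard_props.append((1 + hits) / (1 + perms_per_shard))
    return float(np.mean(shard_props))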
Sharded Likelihood Comparison Test
To improve statistical power and computational efficiency, the paper introduces a “Sharded Likelihood Comparison Test”:
Divide the dataset into r contiguous shards.
Within each shard, draw several random permutations of the examples and compute the mean log-probability over the shuffled orderings.
For each shard, compute the difference Δ_i = ℓ_orig,i − mean(ℓ_perm,i).
Perform a one-sample, one-sided t-test on the collection {Δ_i} to assess whether the mean difference is significantly greater than zero.
A significant positive mean indicates that the original ordering yields higher likelihood than random permutations, which is evidence of test‑set leakage.
Pseudocode
import numpy as np
from scipy.stats import t as tdist

# `model(shard)` returns the log-probability of the shard's examples in the given
# order; `shuffle(shard)` returns a randomly permuted copy of the shard. Both are
# placeholders for the caller's own implementations.
diffs = []
for shard in shards:
    # log-probability of the original (canonical) order
    orig_logp = model(shard)
    # sample K random permutations of the shard
    perm_logps = []
    for _ in range(K):
        perm = shuffle(shard)
        perm_logps.append(model(perm))
    perm_logp_mean = np.mean(perm_logps)
    # canonical-minus-permuted log-probability gap for this shard
    diffs.append(orig_logp - perm_logp_mean)

# one-sample, one-sided t-test across shards (H0: mean difference <= 0)
t_stat = np.mean(diffs) / np.std(diffs, ddof=1) * np.sqrt(len(diffs))
pval = 1 - tdist.cdf(t_stat, df=len(diffs) - 1)
print(f"p-value = {pval}")

Experimental Evaluation
The authors applied the method to several publicly released LLMs. Most models showed no statistically significant leakage on benchmarks such as BoolQ, ARC‑Easy, and others. Mistral‑7B exhibited a notable deviation on the ARC‑Easy benchmark, suggesting possible contamination. Full per‑model results are provided in the paper’s appendix.
References
arXiv preprint: https://arxiv.org/abs/2310.17623
Code repository: https://github.com/tatsu-lab/test_set_contamination