How to Detect Test Set Contamination in Black‑Box Language Models

Researchers propose a black‑box method to expose test‑set leakage in large language models by comparing log‑probability shifts when test items are shuffled, using Monte‑Carlo estimation and a sharded likelihood test, and demonstrate its effectiveness on several models including Mistral‑7B.

NewBeeNLP

Background

Large language models (LLMs) are pretrained on massive web corpora. During pretraining they can memorize entire benchmark test sets, which makes reported performance hard to trust: a model may appear to beat stronger baselines on a leaked benchmark yet fail on truly unseen evaluations. Because the pretraining data and model weights are usually proprietary, detecting such test‑set leakage must be done without direct access to the training corpus.

Goal

The paper proposes a black‑box procedure that can prove test‑set contamination in an LLM without access to the pretraining corpus or the model weights.

Core Idea

A test set is exchangeable: the order of its examples carries no information, so a model that has never seen it should, on average, assign the same log‑probability to any ordering. A model that has memorized the canonical ordering will instead show a large drop in log‑probability when the examples are shuffled, and the magnitude of this drop is used as a statistical signal of leakage.

Illustration of test‑set leakage detection concept

Single‑example Test

Using the BoolQ benchmark as an illustration, the authors reorder the items of a single test example. If the shuffled version shows a substantial drop in the answer's log‑probability compared with the original ordering, the model has likely memorized that specific example rather than demonstrating genuine understanding.
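To see why shuffling is informative, consider a toy stand‑in for a memorized model (the bigram scoring rule, the item names, and the 0.9/epsilon probabilities below are invented purely for illustration, not the paper's method):

```python
import itertools
import math

# Toy "memorized" model: it has seen these five test items in exactly this
# order, so its adjacent-pair (bigram) statistics encode the ordering.
items = ["q1", "q2", "q3", "q4", "q5"]
seen_bigrams = set(zip(items, items[1:]))

def log_prob(sequence, eps=1e-3):
    # Hypothetical scoring rule: high probability for memorized transitions,
    # a small epsilon for everything else.
    return sum(math.log(0.9 if pair in seen_bigrams else eps)
               for pair in zip(sequence, sequence[1:]))

orig = log_prob(items)
best_shuffled = max(log_prob(list(p))
                    for p in itertools.permutations(items) if list(p) != items)
print(orig, best_shuffled)  # the original order beats every reordering
```

In a real model the gap would come from verbatim memorization of the test file during pretraining; here it is wired in by construction.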

Dataset‑level Statistic

For a dataset X, let ℓ_orig be the model's log‑probability of the original ordering and let ℓ_perm be the log‑probability of a random permutation. Define the p‑value p = Pr[ℓ_perm ≥ ℓ_orig] over all possible permutations: a memorized test set makes ℓ_orig unusually high, so p is small. If p falls below a pre‑specified significance level α, the test set is inferred to be contaminated.
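A minimal sketch of this permutation test, assuming only a generic `loglik` scoring function (the function name, the toy inversion‑counting stand‑in, and the +1 finite‑sample correction are conventions of this illustration, not the paper's exact estimator):

```python
import random

def permutation_p_value(loglik, dataset, num_perm=200, seed=0):
    # Estimates p = Pr[loglik(permutation) >= loglik(original ordering)].
    rng = random.Random(seed)
    orig = loglik(dataset)
    hits = sum(1 for _ in range(num_perm)
               if loglik(rng.sample(dataset, len(dataset))) >= orig)
    # +1 keeps the estimate a valid p-value under the null for finite samples.
    return (hits + 1) / (num_perm + 1)

# Order-sensitive stand-in for a model: score is minus the number of
# out-of-order pairs relative to the "memorized" ascending ordering.
def toy_loglik(seq):
    return -sum(1 for i in range(len(seq))
                for j in range(i + 1, len(seq)) if seq[i] > seq[j])

print(permutation_p_value(toy_loglik, list(range(8))))     # small p: "contaminated"
print(permutation_p_value(lambda s: 0.0, list(range(8))))  # prints 1.0: no ordering signal
```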

Monte‑Carlo Estimation

Enumerating all permutations is infeasible for realistic datasets. The authors approximate p by Monte‑Carlo sampling with a finite‑sample correction:

Split the dataset X into m shards.

For each shard, draw a modest number of random permutations (e.g., 30–100).

Compute the log‑probability for the original ordering and for each sampled permutation.

Estimate the shard‑level proportion of permutations with log‑probability ≥ the original.

Average the shard‑level proportions to obtain the overall estimate of p.

This estimator runs in linear time with respect to the dataset size, but the constant factor depends on the number of sampled permutations per shard.
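The steps above can be sketched as follows (a simplified sketch: the equal‑size contiguous shards, the `loglik` callback, and the +1 correction are assumptions of this illustration):

```python
import random

def sharded_p_estimate(loglik, dataset, num_shards=4, perms_per_shard=50, seed=0):
    # Average of per-shard permutation-test proportions.
    rng = random.Random(seed)
    size = len(dataset) // num_shards
    shards = [dataset[i * size:(i + 1) * size] for i in range(num_shards)]
    props = []
    for shard in shards:
        orig = loglik(shard)
        hits = sum(1 for _ in range(perms_per_shard)
                   if loglik(rng.sample(shard, len(shard))) >= orig)
        props.append((hits + 1) / (perms_per_shard + 1))
    return sum(props) / len(props)

def toy_loglik(seq):
    # Stand-in model that prefers the "memorized" ascending order.
    return -sum(1 for i in range(len(seq))
                for j in range(i + 1, len(seq)) if seq[i] > seq[j])

print(sharded_p_estimate(toy_loglik, list(range(32))))  # small estimate of p
```

Total work is one original‑order pass plus `num_shards * perms_per_shard` permutation passes, which is the linear‑time behavior described above.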

Sharded Likelihood Comparison Test

To improve statistical power and computational efficiency, the paper introduces a “Sharded Likelihood Comparison Test”:

Divide the dataset into r contiguous shards.

Within each shard, draw several random permutations of the examples and compute the mean log‑probability over the shuffled orderings.

For each shard compute the difference Δ_i = ℓ_orig,i – mean(ℓ_perm,i).

Perform a one‑sample t‑test on the collection {Δ_i} to assess whether the mean difference is significantly greater than zero.

A significant positive mean indicates that the original ordering yields higher likelihood than random permutations, which is evidence of test‑set leakage.

Sharded likelihood comparison workflow

Pseudocode

import numpy as np
from scipy.stats import t as tdist

# `model` returns the log-probability of a shard in the given order;
# `shuffle` returns a randomly permuted copy; `shards` and K are assumed defined.
diffs = []
for shard in shards:
    # log-probability of the shard in its original order
    orig_logp = model(shard)
    # sample K random permutations of the shard
    perm_logps = []
    for k in range(K):
        perm = shuffle(shard)
        perm_logps.append(model(perm))
    perm_logp_mean = np.mean(perm_logps)
    diffs.append(orig_logp - perm_logp_mean)
# one-sample t-test across shards: is the mean difference greater than zero?
t_stat = np.mean(diffs) / np.std(diffs, ddof=1) * np.sqrt(len(diffs))
pval = 1 - tdist.cdf(t_stat, df=len(diffs) - 1)
print(f"p-value = {pval}")
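For a runnable, stdlib‑only variant of the t‑test step, the Student‑t tail can be approximated with a normal tail (adequate once there are 15+ shards; `scipy.stats.t` gives exact tails). The per‑shard scores here are synthetic numbers invented for the demo:

```python
import math
import random

def shard_t_test(orig_scores, perm_mean_scores):
    # Per-shard differences: original-order log-prob minus mean permuted log-prob.
    diffs = [o - p for o, p in zip(orig_scores, perm_mean_scores)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    t_stat = mean / math.sqrt(var / n)
    # One-sided upper tail, normal approximation to the t distribution.
    pval = 0.5 * math.erfc(t_stat / math.sqrt(2))
    return t_stat, pval

# Synthetic demo: a "contaminated" model scores the original order about
# 0.5 nats higher per shard than permuted orders.
random.seed(1)
orig = [0.5 + random.gauss(0, 0.1) for _ in range(20)]
perm = [random.gauss(0, 0.1) for _ in range(20)]
t_stat, pval = shard_t_test(orig, perm)
print(t_stat, pval)  # large t statistic, tiny p-value
```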

Experimental Evaluation

The authors applied the method to several publicly released LLMs. Most models showed no statistically significant leakage on benchmarks such as BoolQ, ARC‑Easy, and others. Mistral‑7B exhibited a notable deviation on the ARC‑Easy benchmark, suggesting possible contamination. Full per‑model results are provided in the paper’s appendix.

Experimental results table

References

arXiv preprint: https://arxiv.org/abs/2310.17623

Code repository: https://github.com/tatsu-lab/test_set_contamination
