How to Detect Test Set Leakage in Black‑Box Language Models
The ICLR 2024 paper introduces a black-box method for detecting test-set contamination in large language models by comparing the log-probability of a test set in its original order against shuffled orderings. It proposes a scalable sharded likelihood test and demonstrates its effectiveness on several open-source models, revealing a potential leak in Mistral-7B.
Background
Large language models (LLMs) are pretrained on massive internet data, which can cause them to memorize benchmark test sets. This memorization inflates reported performance because the model has effectively seen the test data during training, making it hard to assess true capabilities.
Goal
The ICLR 2024 paper “Proving Test Set Contamination in Black‑Box Language Models” proposes a method to detect such leakage without access to the pre‑training corpus or model weights.
Basic permutation test
The test shuffles the order of the test examples and compares the model's log-probability of the original ordering with that of shuffled orderings. If the original ordering is systematically more likely, the model must rely on memorized information about the published order of the test set.
Formally, for a dataset X compute L(X), the log-probability of X under the original ordering, and compare it to the distribution of log-probabilities over random permutations of X. Let p be the proportion of permutations whose log-probability is at least L(X). If p < α (a significance threshold), the dataset is considered contaminated.
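As a minimal sketch of this definition on a toy dataset, the exact test can be written as below; logprob_fn is a hypothetical placeholder (not from the paper's code) for a call that returns the model's log-probability of the examples concatenated in a given order.

import itertools

def permutation_test_pvalue(examples, logprob_fn):
    # Exact permutation test: enumerate every ordering of the dataset.
    # Only feasible for very small |X|, since there are |X|! permutations.
    canonical = logprob_fn(examples)
    perm_logprobs = [logprob_fn(list(p)) for p in itertools.permutations(examples)]
    # p-value: fraction of orderings at least as likely as the published order
    return sum(lp >= canonical for lp in perm_logprobs) / len(perm_logprobs)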
Scalable approximation
Enumerating all permutations is infeasible. The authors instead estimate p by Monte-Carlo sampling of random permutations, applying a finite-sample correction so the estimate remains a valid p-value. Because each sampled permutation still requires a full forward pass over X, they further split X into shards and aggregate shard-wise statistics, reducing the computation to roughly linear time in |X|, though it remains costly for very large datasets.
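A minimal sketch of the Monte-Carlo estimate with the usual add-one finite-sample correction, again assuming the hypothetical logprob_fn placeholder for the model call:

import random

def mc_permutation_pvalue(examples, logprob_fn, num_permutations=1000, seed=0):
    # Monte-Carlo estimate of the permutation-test p-value.
    # The +1 in numerator and denominator is the finite-sample correction,
    # which keeps the estimate a valid p-value even with few sampled permutations.
    rng = random.Random(seed)
    canonical = logprob_fn(examples)
    hits = 0
    for _ in range(num_permutations):
        perm = list(examples)
        rng.shuffle(perm)
        hits += logprob_fn(perm) >= canonical
    return (hits + 1) / (num_permutations + 1)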
Sharded Likelihood Comparison Test (SLCT)
SLCT improves efficiency by dividing X into r contiguous shards. Within each shard, the examples are randomly permuted several times, and the difference between the log-probability of the original order and the mean log-probability of the shuffled orders is recorded. A one-sided t-test on these shard-wise differences then yields a p-value for contamination.
Core algorithm:
import numpy as np
from scipy.stats import t as tdist

# `shards`, `model`, `shuffle`, and `num_permutations` are placeholders for the
# dataset shards, the model's log-probability scoring call, a random-permutation
# helper, and the number of shuffles per shard.
diffs = []
for shard in shards:
    # log-probability of the shard in its canonical (published) order
    canonical_logprob = model(shard)
    # log-probabilities of several random permutations of the shard
    shuffled_logprobs = np.array([model(shuffle(shard)) for _ in range(num_permutations)])
    diffs.append(canonical_logprob - shuffled_logprobs.mean())

# one-sided t-test across shards: is the canonical order consistently more likely?
diffs = np.array(diffs)
z = np.mean(diffs) / np.std(diffs) * np.sqrt(len(diffs))
pval = 1 - tdist.cdf(z, df=len(diffs) - 1)
print(f"{pval=}")

Experimental findings
The authors evaluated several open-source LLMs. Most showed no evidence of contamination; however, Mistral-7B produced a suspicious result on the ARC-Easy benchmark, suggesting possible test-set leakage. Full results are provided in the paper's appendix.
Resources
Paper: https://arxiv.org/abs/2310.17623
Code: https://github.com/tatsu-lab/test_set_contamination
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
