How to Detect Test Set Leakage in Black‑Box Language Models

The ICLR 2024 paper introduces a black‑box method for detecting test‑set leakage in large language models by comparing log‑probabilities of original and shuffled test orders, proposes a scalable sharded likelihood test, and demonstrates its effectiveness on several open‑source models, revealing a potential leak in Mistral‑7B.


Background

Large language models (LLMs) are pretrained on massive internet data, which can cause them to memorize benchmark test sets. This memorization inflates reported performance because the model has effectively seen the test data during training, making it hard to assess true capabilities.

Goal

The ICLR 2024 paper “Proving Test Set Contamination in Black‑Box Language Models” proposes a method to detect such leakage without access to the pre‑training corpus or model weights.

Basic permutation test

The test shuffles the order of test examples and compares the model’s log‑probability for the original ordering versus the shuffled order. A large change indicates reliance on memorized information.

Formally, for a dataset X compute the log‑probability L(X) under the original order and compare it to the distribution of log‑probabilities over many random permutations of X. Let p be the proportion of permutations whose log‑probability is at least L(X); a memorizing model assigns unusually high likelihood to the canonical order, so p will be small. If p < α (a significance threshold) the dataset is flagged as contaminated.
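The permutation test above can be sketched in a few lines. This is a minimal illustration, not the paper's code: logprob is a placeholder for the model's log‑probability of a given example ordering, and the toy test below scores orderings by counting inversions.

```python
import random

def permutation_test_pvalue(dataset, logprob, num_perms=1000, seed=0):
    """Monte-Carlo estimate of p: the fraction of random orderings
    whose log-probability is at least that of the original order."""
    rng = random.Random(seed)
    observed = logprob(dataset)
    exceed = 0
    for _ in range(num_perms):
        perm = dataset[:]
        rng.shuffle(perm)
        if logprob(perm) >= observed:
            exceed += 1
    # +1 correction keeps the estimate a valid p-value even when exceed == 0
    return (1 + exceed) / (1 + num_perms)
```

With a stand‑in logprob that prefers sorted order (mimicking a model that memorized the canonical ordering), the sorted dataset yields a very small p, as expected for contamination.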

Scalable approximation

Enumerating all permutations is infeasible. The authors instead approximate p by Monte‑Carlo sampling with a finite‑sample correction: split X into shards, run the permutation test on each shard, and average the shard‑wise p estimates. This reduces the computation to time linear in |X|, though it remains costly for very large datasets.
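The shard‑and‑average idea can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: logprob stands in for the model's log‑probability of an ordering, and shards are taken as contiguous slices.

```python
import numpy as np

def sharded_pvalues(dataset, logprob, num_shards=10, num_perms=200, seed=0):
    """Split the dataset into contiguous shards, run a Monte-Carlo
    permutation test on each shard, and average the shard-wise
    p-value estimates."""
    rng = np.random.default_rng(seed)
    shard_size = len(dataset) // num_shards
    pvals = []
    for s in range(num_shards):
        shard = dataset[s * shard_size:(s + 1) * shard_size]
        observed = logprob(shard)
        # Count shuffled orderings at least as likely as the original.
        exceed = sum(
            logprob([shard[i] for i in rng.permutation(len(shard))]) >= observed
            for _ in range(num_perms)
        )
        # +1 correction keeps each shard estimate a valid p-value.
        pvals.append((1 + exceed) / (1 + num_perms))
    return float(np.mean(pvals))
```

Each shard needs only num_perms likelihood evaluations, so the total cost grows linearly with the dataset size rather than with the number of permutations of the full set.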

Sharded Likelihood Comparison Test (SLCT)

SLCT improves efficiency by dividing X into contiguous shards. Within each shard the examples are randomly permuted several times, and the difference between the log‑probability of the original order and the mean log‑probability over the shuffled orders is computed. A one‑sided t‑test on these shard‑wise differences then yields a p‑value for contamination.

Core algorithm (a sketch rather than the paper's exact code; model_logprob, shuffle, shards, and num_shuffles are placeholders for scoring an ordering, permuting a shard, the shard list, and the number of shuffles):

import numpy as np
from scipy.stats import t as tdist

# For each shard, compare the canonical ordering's log-probability
# with the mean log-probability over random shuffles of that shard.
diffs = []
for shard in shards:
    canonical_logprob = model_logprob(shard)
    shuffled_logprobs = [model_logprob(shuffle(shard))
                         for _ in range(num_shuffles)]
    diffs.append(canonical_logprob - np.mean(shuffled_logprobs))

# One-sided t-test across shards: a significantly positive mean
# difference means the canonical order is consistently more likely.
diffs = np.asarray(diffs)
z = np.mean(diffs) / np.std(diffs) * np.sqrt(len(diffs))
pval = 1 - tdist.cdf(z, df=len(diffs) - 1)
print(f"{pval=}")

Experimental findings

The authors evaluated several open‑source LLMs. Most behaved honestly; however, Mistral‑7B showed suspicious results on the ARC‑Easy benchmark, suggesting possible test‑set leakage. Full results are provided in the paper’s appendix.

Resources

Paper: https://arxiv.org/abs/2310.17623

Code: https://github.com/tatsu-lab/test_set_contamination


Tags: LLM evaluation, language model security, shuffled likelihood test, test-set contamination
Written by Baobao Algorithm Notes, author of the BaiMian large model, offering technology and industry insights.