
Scaling Laws for Dense Retrieval: Empirical Study of Model Size, Training Data, and Annotation Quality

The award-winning study shows that dense retrieval performance follows precise power-law scaling with model size, training-data quantity, and annotation quality. It introduces contrast entropy as an evaluation metric, validates a joint scaling formula on MS MARCO and T2Ranking, and uses cost models to guide budget-optimal resource allocation.

Xiaohongshu Tech REDtech

At SIGIR 2024, the Xiaohongshu team together with Tsinghua University's Information Retrieval Lab received the Best Paper Award for their work "Scaling Laws for Dense Retrieval," marking the first time a mainland Chinese institution has won this prize in the conference's 47‑year history. The paper investigates whether the empirical scaling laws observed for large language models also apply to dense vector retrieval systems.

The authors review classic statistical laws such as Zipf's Law and Heaps' Law, noting how these have historically influenced the design of statistical retrieval models like BM25 and the estimation of inverted-index sizes.

The study poses three research questions (RQ1‑RQ3): (RQ1) how model size influences dense retrieval performance; (RQ2) how the scale of manually annotated training data affects performance; and (RQ3) how the quality of annotation data impacts the scaling behavior.

Because traditional metrics such as NDCG are discrete and rank-based, making them ill-suited to capturing continuous performance trends, the authors introduce contrast entropy as a new evaluation metric, sampling 256 negative documents per query-document pair.
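As a sketch of how such a metric can be computed (the function name and the numerically stable log-softmax details here are illustrative, not the paper's exact implementation):

```python
import numpy as np

def contrast_entropy(pos_score, neg_scores):
    """Cross-entropy of the positive document against sampled negatives.

    pos_score:  similarity of the query with its relevant document
    neg_scores: similarities with sampled negative documents (e.g. 256)
    """
    scores = np.concatenate(([pos_score], neg_scores))
    # log-softmax of the positive score, computed stably
    m = np.max(scores)
    log_z = m + np.log(np.sum(np.exp(scores - m)))
    return -(pos_score - log_z)

# A query whose positive clearly outranks the negatives has low entropy.
rng = np.random.default_rng(0)
low = contrast_entropy(10.0, rng.normal(0.0, 1.0, size=256))
high = contrast_entropy(0.0, rng.normal(0.0, 1.0, size=256))
assert low < high
```

Unlike NDCG, this quantity varies smoothly as the retriever's score distribution improves, which is what makes curve fitting against model and data size possible.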

Experiments are conducted on two large-scale datasets: MS MARCO (English) and T2Ranking (Chinese). Model sizes range from BERT-tiny (0.5 M non-embedding parameters) to BERT-base (82 M) for English, and from small to large ERNIE variants for Chinese. Least-squares fitting yields scaling-law curves with R² > 0.99, demonstrating that retrieval performance follows a power-law relationship with model size.
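The fitting procedure can be illustrated in a few lines. The data below are synthetic points placed exactly on a power-law curve of the common form L(N) = (A / N)^α (the paper fits its own coefficients to measured contrast-entropy values), so a least-squares fit in log space recovers the generating constants:

```python
import numpy as np

# Hypothetical (model size in millions, contrast entropy) pairs lying
# exactly on a power-law curve L(N) = (A / N)**alpha.
sizes = np.array([0.5, 11.0, 29.0, 82.0])   # non-embedding params, M
losses = (120.0 / sizes) ** 0.3             # synthetic, for illustration

# In log space the law is linear: log L = alpha*log A - alpha*log N,
# so an ordinary least-squares line fit recovers both constants.
slope, intercept = np.polyfit(np.log(sizes), np.log(losses), 1)
alpha = -slope
A = np.exp(intercept / alpha)
# alpha ≈ 0.3 and A ≈ 120, the values used to generate the points
```

On real measurements the fit is not exact, and the R² of this log-space regression is the goodness-of-fit number the paper reports.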

To examine data-size effects, the authors fix BERT-base and vary the amount of training data D. The fitted curves again show a strong power-law relationship, indicating that performance improves predictably as more annotated data become available.

Three annotation strategies are compared: (1) Inverse Cloze Task (ICT) as low‑quality pseudo‑queries; (2) supervised generation using docT5query; and (3) large‑language‑model (LLM) generated queries. All exhibit power‑law scaling, but ICT has the shallowest slope, docT5query outperforms LLM‑generated data, and human‑annotated data achieve the best performance.

A joint scaling law that combines model size N and data size D is proposed and validated with held‑out points, confirming its predictive reliability.
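A sketch of what evaluating such a joint law looks like, assuming a Chinchilla-style additive form with made-up coefficients (the paper fits its own formula and constants on MS MARCO and T2Ranking):

```python
# Hypothetical coefficients, for illustration only. The assumed form is
#   L(N, D) = ((A / N)**(alpha / beta) + B / D) ** beta
# where N is model size (millions of params) and D is labeled examples.
A, B, alpha, beta = 120.0, 4.0e5, 0.30, 0.25

def predicted_entropy(n_params_m, n_examples):
    return ((A / n_params_m) ** (alpha / beta) + B / n_examples) ** beta

# Larger models and more data both push predicted entropy down, and the
# formula interpolates smoothly between the two single-variable laws.
assert predicted_entropy(82.0, 5e5) < predicted_entropy(0.5, 5e5)
assert predicted_entropy(82.0, 5e5) < predicted_entropy(82.0, 5e4)
```

Validation in the paper works by fitting the constants on a subset of (N, D) configurations and checking predictions against held-out ones.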

Using the derived cost model—comprising annotation cost (~$0.6 per sample), training cost per parameter, and inference cost per parameter—the authors illustrate how scaling laws can guide optimal resource allocation under budget constraints. When inference cost is ignored, budgets of $150 k favor models >10 B parameters; when inference cost is included, optimal model sizes shrink dramatically, emphasizing the importance of inference efficiency.
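The budget-allocation logic can be sketched as a grid search over model size and data size under a cost constraint. All constants below are illustrative, not the paper's fitted values; the point is only that adding a serving-cost term shifts the optimum toward smaller models:

```python
import numpy as np

# Hypothetical cost model (illustrative only): annotation per sample,
# training cost per parameter per example, inference cost per million
# parameters per served query.
ANNOT = 0.6        # $ per labeled example
TRAIN = 2e-9       # $ per parameter per training example
INFER = 2e-6       # $ per million parameters per served query
QUERIES = 1e8      # expected serving volume

# Made-up joint scaling-law coefficients (Chinchilla-style form).
A, B, alpha, beta = 120.0, 4.0e5, 0.30, 0.25

def loss(n_m, d):
    return ((A / n_m) ** (alpha / beta) + B / d) ** beta

def best_model_size(budget, include_inference):
    """Grid-search the (model size, data size) pair minimizing predicted
    contrast entropy subject to total cost <= budget."""
    grid_n = np.logspace(0, 5, 60)   # model size, millions of params
    grid_d = np.logspace(4, 8, 60)   # labeled training examples
    best_n, best_l = None, np.inf
    for n in grid_n:
        for d in grid_d:
            cost = ANNOT * d + TRAIN * (n * 1e6) * d
            if include_inference:
                cost += INFER * n * QUERIES
            if cost <= budget and loss(n, d) < best_l:
                best_n, best_l = n, loss(n, d)
    return best_n

# Accounting for serving cost favors a smaller model at the same budget.
assert best_model_size(150_000, True) <= best_model_size(150_000, False)
```

Swapping in the paper's fitted constants and real unit costs turns this toy search into the budget-planning tool the authors describe.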

Overall, the work provides a systematic framework for predicting dense retrieval performance, optimizing experimental budgets, and highlighting the need for higher‑quality LLM‑generated annotations.

information retrieval, scaling laws, annotation quality, contrast entropy, dense retrieval, model size, training data
Written by

Xiaohongshu Tech REDtech

Official account of the Xiaohongshu tech team, sharing tech innovations and problem insights, advancing together.
