Recent Advances in Sparse and Dense Retrieval for Search Engines
The article surveys recent academic advances in both sparse inverted-index and dense semantic retrieval for large-scale search. It highlights key papers on decision frameworks, benchmarks, sparse lexical models, dual encoders, and hybrid techniques, and discusses open challenges such as the limited capacity of single-vector representations, along with multi-view and hybrid remedies.
The article reviews the latest academic progress on the two major retrieval pipelines used in large‑scale search: traditional sparse inverted‑index retrieval and dense semantic retrieval. With the rise of deep learning, dense retrieval has achieved significant gains, while sparse retrieval remains attractive for its exact matching, indexing efficiency, and interpretability.
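The core difference between the two pipelines can be sketched in a few lines of Python. This is a toy illustration with made-up term weights and embeddings, not any production system: sparse retrieval scores only the terms a query and document share, while dense retrieval compares full fixed-size vectors.

```python
import math

# Sparse retrieval: query and document are bags of weighted terms;
# the score is a dot product over the (few) terms they share.
def sparse_score(query_terms, doc_terms):
    return sum(w * doc_terms.get(t, 0.0) for t, w in query_terms.items())

# Dense retrieval: query and document are fixed-size vectors from an
# encoder; the score is cosine similarity over every dimension.
def dense_score(query_vec, doc_vec):
    dot = sum(q * d for q, d in zip(query_vec, doc_vec))
    norm = (math.sqrt(sum(q * q for q in query_vec))
            * math.sqrt(sum(d * d for d in doc_vec)))
    return dot / norm if norm else 0.0

# Sparse matching is exact: "bert" only matches documents containing "bert",
# which is why sparse retrieval keeps its interpretability advantage.
q_sparse = {"bert": 2.0, "pretraining": 1.0}
d_sparse = {"bert": 1.5, "language": 0.5}
score = sparse_score(q_sparse, d_sparse)  # 2.0 * 1.5 = 3.0
```

Dense matching, by contrast, is soft: semantically related texts can score high even with zero term overlap, which is the source of both its gains and its exact-match weakness discussed below.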
Key papers discussed include:
Are We There Yet? A Decision Framework for Replacing Term‑Based Retrieval with Dense Retrieval Systems – proposes a comprehensive evaluation framework (effectiveness, cost, robustness) and shows dense retrieval can replace term‑based methods when vectorization cost is acceptable.
BEIR: A Heterogeneous Benchmark for Zero‑Shot Evaluation of Information Retrieval Models – aggregates diverse IR datasets to benchmark retrieval models in zero‑shot settings.
SPLADE: Sparse Lexical and Expansion Model for First‑Stage Ranking – learns sparse lexical representations with term weighting and expansion, preserving the advantages of inverted indexes.
SpaDE: Improving Sparse Representations using a Dual Document Encoder – introduces a dual‑encoder architecture for separate query and document modeling, with joint training and FLOPs regularization.
LexMAE: Lexicon‑Bottlenecked Pretraining for Large‑Scale Retrieval – replaces the standard MLM head with a lexicon bottleneck to produce more informative sparse vectors.
Salient Phrase Aware Dense Retrieval – distills knowledge from a sparse teacher into a dense retriever to improve lexical matching.
LED: Lexicon‑Enlightened Dense Retriever – combines lexical hard‑negative sampling with dense training to enhance term‑level matching.
Multi‑View Document Representation Learning for Open‑Domain Dense Retrieval – learns multiple document embeddings via clustering and attention to capture diverse query intents.
Learning Diverse Document Representations with Deep Query Interactions – generates multiple pseudo‑queries per document (doc2query) and uses them for training and re‑ranking.
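To make the SPLADE idea above concrete, here is a minimal sketch of its log-saturated term weighting: each token position assigns a weight of log(1 + ReLU(logit)) to vocabulary terms, and max pooling over positions yields one sparse vector per text. The logits below are made up for illustration; in the actual model they come from an MLM head over the full vocabulary.

```python
import math

def splade_vector(token_logits):
    """token_logits: one {vocab_term: logit} dict per token position.
    Returns a sparse {term: weight} vector via log-saturation + max pooling."""
    vec = {}
    for logits in token_logits:
        for term, logit in logits.items():
            w = math.log(1.0 + max(0.0, logit))  # log(1 + ReLU(logit))
            if w > vec.get(term, 0.0):
                vec[term] = w  # max pooling over token positions
    return vec

# Two token positions; note the expansion term "nlp", which need not
# appear in the surface text -- this is SPLADE's learned term expansion.
doc = splade_vector([
    {"bert": 3.0, "nlp": 1.2},
    {"pretraining": 2.0, "bert": 0.5},
])

# Scoring stays a sparse dot product, so a standard inverted index applies.
query = {"bert": 1.0, "nlp": 1.0}
score = sum(w * doc.get(t, 0.0) for t, w in query.items())
```

Because the output is an ordinary sparse term-weight vector, the existing inverted-index machinery (posting lists, top-k pruning) is preserved, which is the advantage the paper emphasizes.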
The article also outlines remaining challenges for dense retrieval: the limited expressive capacity of a single vector representation, difficulty modeling exact lexical matches, and the open question of how to make multi-representation methods both effective and efficient. Proposed remedies include multi-view encoders, hard-negative mining, and hybrid sparse-dense models.
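One common hybrid baseline simply interpolates the two scores per document after min-max normalization. The sketch below assumes hypothetical score dictionaries and a tuning weight `alpha`; real systems vary in how they normalize and combine.

```python
def minmax(scores):
    """Rescale a {doc: score} dict to [0, 1] so the two score scales are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # guard against all-equal scores
    return {d: (s - lo) / span for d, s in scores.items()}

def hybrid_rank(sparse_scores, dense_scores, alpha=0.5):
    """Rank documents by alpha * dense + (1 - alpha) * sparse (normalized)."""
    s, d = minmax(sparse_scores), minmax(dense_scores)
    docs = set(s) | set(d)
    fused = {doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0)
             for doc in docs}
    return sorted(fused, key=fused.get, reverse=True)

ranking = hybrid_rank(
    {"d1": 12.0, "d2": 8.0, "d3": 2.0},  # e.g. BM25 scores
    {"d1": 0.3, "d2": 0.9, "d3": 0.7},   # e.g. cosine similarities
)
```

Sweeping `alpha` on a validation set is the usual way to trade off lexical precision against semantic recall, which is exactly the complementarity the surveyed hybrid papers exploit.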
In summary, both sparse and dense retrieval benefit from advances in large‑scale pre‑training, and ongoing research aims to combine their complementary strengths for more accurate and efficient search systems.
Baidu Geek Talk