Recent Advances in Sparse and Dense Retrieval for Search Engines
The article surveys recent academic advances in both sparse inverted-index and dense semantic retrieval for large-scale search. It highlights key papers on decision frameworks, benchmarks, sparse lexical models, dual encoders, and hybrid techniques, and discusses open challenges such as the limited capacity of single-vector representations, along with multi-view and hybrid remedies.
The article reviews the latest academic progress on the two major retrieval pipelines used in large‑scale search: traditional sparse inverted‑index retrieval and dense semantic retrieval. With the rise of deep learning, dense retrieval has achieved significant gains, while sparse retrieval remains attractive for its exact matching, indexing efficiency, and interpretability.
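The core difference between the two pipelines can be sketched in a few lines of Python. This is a toy illustration with made-up term weights and embeddings, not any production system: sparse retrieval scores only the terms a query and document share, while dense retrieval compares full fixed-size vectors.

```python
import math

# Sparse retrieval: query and document are bags of weighted terms;
# the score is a dot product over the (few) terms they share.
def sparse_score(query_terms, doc_terms):
    return sum(w * doc_terms.get(t, 0.0) for t, w in query_terms.items())

# Dense retrieval: query and document are fixed-size vectors from an
# encoder; the score is cosine similarity over every dimension.
def dense_score(query_vec, doc_vec):
    dot = sum(q * d for q, d in zip(query_vec, doc_vec))
    norm = (math.sqrt(sum(q * q for q in query_vec))
            * math.sqrt(sum(d * d for d in doc_vec)))
    return dot / norm if norm else 0.0

# Sparse matching is exact: "bert" only matches documents containing "bert",
# which is why sparse retrieval keeps its interpretability advantage.
q_sparse = {"bert": 2.0, "pretraining": 1.0}
d_sparse = {"bert": 1.5, "language": 0.5}
score = sparse_score(q_sparse, d_sparse)  # 2.0 * 1.5 = 3.0
```

Dense matching, by contrast, is soft: semantically related texts can score high even with zero term overlap, which is the source of both its gains and its exact-match weakness discussed below.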
Key papers discussed include:
Are We There Yet? A Decision Framework for Replacing Term‑Based Retrieval with Dense Retrieval Systems – proposes a comprehensive evaluation framework (effectiveness, cost, robustness) and shows dense retrieval can replace term‑based methods when vectorization cost is acceptable.
BEIR: A Heterogeneous Benchmark for Zero‑Shot Evaluation of Information Retrieval Models – aggregates diverse IR datasets to benchmark retrieval models in zero‑shot settings.
SPLADE: Sparse Lexical and Expansion Model for First‑Stage Ranking – learns sparse lexical representations with term weighting and expansion, preserving the advantages of inverted indexes.
SpaDE: Improving Sparse Representations using a Dual Document Encoder – introduces a dual‑encoder architecture for separate query and document modeling, with joint training and FLOPs regularization.
LexMAE: Lexicon‑Bottlenecked Pretraining for Large‑Scale Retrieval – replaces the standard MLM head with a lexicon bottleneck to produce more informative sparse vectors.
Salient Phrase Aware Dense Retrieval – distills knowledge from a sparse teacher into a dense retriever to improve lexical matching.
LED: Lexicon‑Enlightened Dense Retriever – combines lexical hard‑negative sampling with dense training to enhance term‑level matching.
Multi‑View Document Representation Learning for Open‑Domain Dense Retrieval – learns multiple document embeddings via clustering and attention to capture diverse query intents.
Learning Diverse Document Representations with Deep Query Interactions – generates multiple pseudo‑queries per document (doc2query) and uses them for training and re‑ranking.
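To make the SPLADE idea above concrete, here is a minimal sketch of its log-saturated term weighting: each token position assigns a weight of log(1 + ReLU(logit)) to vocabulary terms, and max pooling over positions yields one sparse vector per text. The logits below are made up for illustration; in the actual model they come from an MLM head over the full vocabulary.

```python
import math

def splade_vector(token_logits):
    """token_logits: one {vocab_term: logit} dict per token position.
    Returns a sparse {term: weight} vector via log-saturation + max pooling."""
    vec = {}
    for logits in token_logits:
        for term, logit in logits.items():
            w = math.log(1.0 + max(0.0, logit))  # log(1 + ReLU(logit))
            if w > vec.get(term, 0.0):
                vec[term] = w  # max pooling over token positions
    return vec

# Two token positions; note the expansion term "nlp", which need not
# appear in the surface text -- this is SPLADE's learned term expansion.
doc = splade_vector([
    {"bert": 3.0, "nlp": 1.2},
    {"pretraining": 2.0, "bert": 0.5},
])

# Scoring stays a sparse dot product, so a standard inverted index applies.
query = {"bert": 1.0, "nlp": 1.0}
score = sum(w * doc.get(t, 0.0) for t, w in query.items())
```

Because the output is an ordinary sparse term-weight vector, the existing inverted-index machinery (posting lists, top-k pruning) is preserved, which is the advantage the paper emphasizes.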
The article also outlines remaining challenges for dense retrieval: the limited expressive capacity of a single vector representation, difficulty modeling exact lexical matches, and the open question of how to make multi-representation methods both effective and efficient. Proposed remedies include multi-view encoders, hard-negative mining, and hybrid sparse-dense models.
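One common hybrid baseline simply interpolates the two scores per document after min-max normalization. The sketch below assumes hypothetical score dictionaries and a tuning weight `alpha`; real systems vary in how they normalize and combine.

```python
def minmax(scores):
    """Rescale a {doc: score} dict to [0, 1] so the two score scales are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # guard against all-equal scores
    return {d: (s - lo) / span for d, s in scores.items()}

def hybrid_rank(sparse_scores, dense_scores, alpha=0.5):
    """Rank documents by alpha * dense + (1 - alpha) * sparse (normalized)."""
    s, d = minmax(sparse_scores), minmax(dense_scores)
    docs = set(s) | set(d)
    fused = {doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0)
             for doc in docs}
    return sorted(fused, key=fused.get, reverse=True)

ranking = hybrid_rank(
    {"d1": 12.0, "d2": 8.0, "d3": 2.0},  # e.g. BM25 scores
    {"d1": 0.3, "d2": 0.9, "d3": 0.7},   # e.g. cosine similarities
)
```

Sweeping `alpha` on a validation set is the usual way to trade off lexical precision against semantic recall, which is exactly the complementarity the surveyed hybrid papers exploit.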
In summary, both sparse and dense retrieval benefit from advances in large‑scale pre‑training, and ongoing research aims to combine their complementary strengths for more accurate and efficient search systems.
Baidu Geek Talk