Predicting Weak Retrieval Without an LLM: Low‑Cost Signals and Gateways
Most retrieval systems apply the same pipeline to every query, but this one‑size‑fits‑all approach fails on hard queries and wastes compute on easy ones; the article defines cheap, no‑LLM signals to predict weak retrieval, evaluates them across three corpora, and proposes a gating method to upgrade only the weak cases.
Why a Uniform Retrieval Pipeline Fails
Most retrieval systems apply the same pipeline to every query. A single retrieval pass cannot handle difficult queries, while re‑ranking or rewriting easy queries wastes compute. When retrieval fails silently—relevant documents never appear in the top results—the system still returns an answer based on whatever it retrieved, with no indication of failure.
Defining “Weak Retrieval”
Weak retrieval is a binary label defined on the window, i.e., the top‑k results actually consumed by a downstream model or shown to a user. A retrieval is weak if any required piece of evidence is missing from this window, even if that evidence appears further down the ranking.
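As a minimal sketch of this label, assuming each query comes with a set of required evidence IDs and the retriever returns a ranked list of document IDs (the function name and argument names here are illustrative, not from the source):

```python
def is_weak(retrieved_ids, required_ids, k):
    """Label a retrieval as weak if any required evidence ID
    is missing from the top-k window."""
    window = set(retrieved_ids[:k])          # only what the consumer actually sees
    return not (set(required_ids) <= window)  # weak unless every required ID is inside


# Evidence "d" is retrieved, but at rank 4, outside a window of k=3
print(is_weak(["a", "b", "c", "d"], {"a", "d"}, k=3))
```

Under this definition a document sitting just outside the window counts the same as one that was never retrieved.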
Cheap Signals
A signal is a numeric value computed from the retriever’s output to predict weak retrieval. It must (1) distinguish good from weak retrieval on annotated data and (2) be cheap enough to compute for every query. “Cheap” excludes adding an LLM call on the hot path; the signals read only the retriever’s results or at most issue one extra Qdrant query.
Signal Catalog
Height – signal max_score: reads the top‑1 fused score; low values mark weak retrieval.
Dispersion – signal dense_variance: variance of raw dense cosine scores; low values (compressed scores) mark weak retrieval.
Coverage – signal evidence_coverage: counts query entities appearing in the top‑k texts; low values mark weak retrieval.
Divergence (dense vs sparse) – signal retriever_divergence: 1 – Jaccard overlap between dense and sparse top‑k IDs; high values (disagreement) mark weak retrieval.
Agreement (across dense models) – signal dense_agreement: average Jaccard overlap of top‑k IDs from independent dense models; low values (disagreement) mark weak retrieval.
Intuition Behind Effective Signals
Height and Dispersion: a confident retriever spreads top scores apart; a lost retriever compresses them.
Coverage: whether named entities from the query appear in the retrieved texts.
Agreement: when two dense models return different documents, one is usually off‑track, especially for jargon or out‑of‑vocabulary terms.
Dispersion only carries information when the score magnitudes are retained; read it from raw dense cosine values or from a fused score that preserves magnitude (e.g., DBSF).
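To illustrate why magnitude matters, the sketch below compares raw scores with a purely rank-based score. Reciprocal rank fusion (RRF) is used here only as an example of a fusion that depends on rank alone; it is not named in the text above, the constant k=60 is a commonly used default assumed for illustration, and the two rankings are made-up values.

```python
import statistics


def rrf_scores(ranking, k=60):
    # Contribution of one ranked list under RRF: a function of rank only,
    # so the original score magnitudes are discarded
    return [1.0 / (k + rank) for rank, _ in enumerate(ranking, start=1)]


def raw_variance(ranking):
    return statistics.pvariance([score for _, score in ranking])


# Illustrative (doc_id, cosine score) lists, not measured data
confident = [("d1", 0.91), ("d2", 0.62), ("d3", 0.40)]  # scores spread apart
lost = [("d7", 0.52), ("d8", 0.51), ("d9", 0.50)]       # scores compressed

print(raw_variance(confident) > raw_variance(lost))  # raw scores tell the two apart
print(statistics.pvariance(rrf_scores(confident))
      == statistics.pvariance(rrf_scores(lost)))     # rank-only scores do not
```

Any two lists of the same length get identical rank-only scores, so their variance cannot separate a confident retrieval from a lost one.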
Computing the Signals
Dispersion is the population variance of the dense scores:
import statistics

def dense_variance(dense_ranking):
    # dense_ranking: list of (doc_id, score) pairs
    # variance of raw dense cosine scores
    return statistics.pvariance([score for _, score in dense_ranking])

Agreement signals are based on top‑k ID overlap:
from itertools import combinations

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a or b else 1.0

def retriever_divergence(dense_ids, sparse_ids):
    return 1 - jaccard(dense_ids, sparse_ids)  # high = they disagree

def dense_agreement(rankings):
    # each ranking is a top‑k ID list from a dense model; needs at least two rankings
    scores = [jaccard(a, b) for a, b in combinations(rankings, 2)]
    return sum(scores) / len(scores)  # low = they disagree

dense_agreement requires running an extra dense model per query, but it remains far cheaper than an LLM judge.
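The two remaining catalog entries, max_score and evidence_coverage, could be sketched along the same lines. This assumes the fused ranking is a non-empty list of (doc_id, score) pairs and that query entities have already been extracted by some upstream step. The case-insensitive substring match and the normalization to a fraction are simplifications chosen for illustration; the catalog only says the signal counts entities.

```python
def max_score(fused_ranking):
    # fused_ranking: non-empty list of (doc_id, fused_score) pairs
    return max(score for _, score in fused_ranking)  # low = weak


def evidence_coverage(query_entities, top_k_texts):
    # Fraction of query entities that appear somewhere in the top-k texts.
    # Assumption: entities arrive as plain strings; matching is a naive
    # case-insensitive substring test.
    if not query_entities:
        return 1.0  # nothing to cover
    haystack = " ".join(top_k_texts).lower()
    found = sum(1 for entity in query_entities if entity.lower() in haystack)
    return found / len(query_entities)  # low = weak


texts = ["Marie Curie was a physicist.", "She moved to Paris."]
print(evidence_coverage(["Marie Curie", "Warsaw"], texts))
```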
Signal Effectiveness Depends on Failure Mode
Each signal’s ability to separate good from weak retrieval is measured with AUC (0.5 = random, 1.0 = perfect) on three annotated corpora:
MuSiQue (multi‑hop QA)
BEIR NFCorpus (medical terminology)
BEIR SciFact (scientific claims)
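For intuition about what the AUC number measures, it can be read as the probability that a randomly chosen weak query gets a higher signal value than a randomly chosen good one, with ties counting half. The pure-Python sketch below computes it that way; the function name is illustrative and both classes are assumed to be present in the labels.

```python
def pairwise_auc(values, weak_labels):
    # Share of (weak, good) pairs in which the weak query has the higher value
    weak = [v for v, y in zip(values, weak_labels) if y]
    good = [v for v, y in zip(values, weak_labels) if not y]
    wins = 0.0
    for w in weak:
        for g in good:
            if w > g:
                wins += 1.0
            elif w == g:
                wins += 0.5  # ties count half
    return wins / (len(weak) * len(good))


# A signal that is always higher on weak queries separates perfectly
print(pairwise_auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
```

A signal that is always lower on weak queries scores 0.0 under this reading, which is just as informative as 1.0 once its direction is flipped.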
No single signal dominates all three corpora; usefulness is tied to the failure mode:
Vocabulary mismatch (NFCorpus): jargon and OOV terms cause divergence between dense and sparse rankings and between independent dense models. Agreement signals rise from random to 0.73‑0.76, matching dispersion.
Ranking precision issue (SciFact): the correct document is retrieved but ranked low. Dispersion and height capture the problem (AUC 0.75‑0.76); agreement adds nothing.
Reachability issue (MuSiQue): the answer requires a hop the query cannot express, so the first retrieval looks confident even when wrong. The best cheap signal reaches only 0.73 AUC; agreement drops to random. No cheap signal reliably captures the missing hop, so the remedy is query decomposition rather than a better gate.
Because the three benchmarks do not yield a universal winner, the deliverable is a method, not a default configuration: measure discriminative power on your own data, keep the signals that separate, and drop those that do not (evidence_coverage, for example, proved ineffective on all three corpora and is dropped). Also apply a ceiling principle: when failure is due to reachability rather than embedding confusion, cheap signals cannot detect it.
Finding Your Own Signals
On a calibration split, score each candidate signal with a separation metric:
from sklearn.metrics import roc_auc_score

def separation(values, weak_labels):
    # AUC of the signal against the weak/good labels
    auc = roc_auc_score(weak_labels, values)
    return max(auc, 1 - auc)  # discriminative power, direction‑agnostic

Two rules guide selection:
Retain signals with separation above a threshold (0.65 is a reasonable start).
Discard signals highly correlated (absolute correlation > 0.85) with a stronger one, as redundancy adds cost without information.
Set the threshold on the calibration split and reserve a test split for final evaluation. Let the failure mode steer you: corpora dense with jargon benefit from agreement signals; precision‑deficient corpora benefit from dispersion and height. Confirm with data, because the winner is always corpus‑specific.
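The two selection rules can be sketched as a greedy pass over signals sorted by separation. This is one possible reading, not the article's implementation: the function names are illustrative, Pearson correlation is assumed as the correlation measure, and separations is assumed to be a precomputed mapping from signal name to its separation score on the calibration split.

```python
from math import sqrt


def pearson(x, y):
    # Plain Pearson correlation; returns 0.0 if either series is constant
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def select_signals(separations, values, min_sep=0.65, max_corr=0.85):
    # separations: {name: separation score}; values: {name: per-query signal values}
    kept = []
    for name in sorted(separations, key=separations.get, reverse=True):
        if separations[name] < min_sep:
            continue  # rule 1: too little discriminative power
        if any(abs(pearson(values[name], values[k])) > max_corr for k in kept):
            continue  # rule 2: redundant with a stronger signal already kept
        kept.append(name)
    return kept
```

Because stronger signals are visited first, a redundant pair always resolves in favor of the one with higher separation.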
Turning Signals into Gates
Thresholds convert signals into decisions. Orient every signal so that lower values indicate weaker retrieval (e.g., negate retriever_divergence, which rises when retrieval is weak):
def retrieval_is_weak(result, signal, floor):
    # signal: function of the result, already adjusted so that low = weak
    # floor: threshold chosen on the calibration split
    return signal(result) < floor

The floor is a trade‑off:
Raising it captures more weak queries but upgrades more often.
Lowering it reduces upgrades but misses more weak cases.
A solid default maximizes capture rate minus false‑positive rate (Youden index) on the calibration set. If missing a weak retrieval is costly, set the threshold to achieve high recall (e.g., 90%). When two signals are both strong and nearly independent, either can trigger an upgrade.
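Both ways of choosing the floor can be sketched without any library. The helper names are illustrative; the sketch assumes the signal is already oriented so that low means weak, that a query is flagged when its value is strictly below the floor, and that the calibration split contains both weak and good queries.

```python
def rates(values, weak_labels, floor):
    # Capture rate: share of weak queries flagged; false-positive rate: share of good ones flagged
    weak = [v for v, y in zip(values, weak_labels) if y]
    good = [v for v, y in zip(values, weak_labels) if not y]
    capture = sum(v < floor for v in weak) / len(weak)
    false_pos = sum(v < floor for v in good) / len(good)
    return capture, false_pos


def youden_floor(values, weak_labels):
    # Floor maximizing capture rate minus false-positive rate (Youden index)
    candidates = sorted(set(values)) + [float("inf")]

    def youden(floor):
        capture, false_pos = rates(values, weak_labels, floor)
        return capture - false_pos

    return max(candidates, key=youden)


def recall_floor(values, weak_labels, target=0.90):
    # Lowest candidate floor whose capture rate reaches the target recall;
    # the infinite candidate flags everything, so a floor is always returned
    candidates = sorted(set(values)) + [float("inf")]
    for floor in candidates:
        if rates(values, weak_labels, floor)[0] >= target:
            return floor
```

The recall-driven floor is never lower than needed to hit the target, so it accepts more unnecessary upgrades in exchange for fewer missed weak retrievals.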
Acting on Weak Retrieval
Upgrading can mean re‑ranking, rewriting, decomposing, relaxing filters, or handing off to a human. The gate decides only whether to upgrade; the specific remedy is independent of the gate. An upgrade cannot fabricate evidence that does not exist; when nothing can be found, the correct action is to abstain rather than answer from thin evidence.
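The separation between gate and remedy, plus the abstain rule, might look like the sketch below. Everything here is a hypothetical interface: retrieve and upgrade stand for whatever retrieval call and remedy the system uses, and gates is a list of (signal function, floor) pairs where any single gate firing triggers the upgrade.

```python
def any_gate_fires(result, gates):
    # gates: list of (signal_fn, floor); signals are oriented so low = weak
    return any(signal(result) < floor for signal, floor in gates)


def answer_or_abstain(query, retrieve, upgrade, gates):
    result = retrieve(query)
    if any_gate_fires(result, gates):
        # The gate only decides that an upgrade is needed, not which one
        result = upgrade(query, result)
        if any_gate_fires(result, gates):
            return None  # still weak after upgrading: abstain
    return result
```

Re-checking the gate after the upgrade is one simple way to decide when to abstain; a stricter system could use a separate, higher floor for that second check.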
Adapting the Recipe to Your Corpus
Define the window and the evidence each query needs.
Label weak retrievals on a calibration split.
Benchmark cheap signals; keep those with discriminative power, discard redundant ones.
Threshold the winning signals to create a gate that upgrades only when triggered.
Retrieval quality can be observed from cheap statistics, but only when those statistics can see the failure type.
Related Work
This work sits alongside corrective and adaptive retrieval, differing in the source of the decision: a cheap statistic on already‑retrieved results, requiring no extra model calls or training.
Query Performance Prediction (QPP) surveys many post‑retrieval predictors; the cheapest ones’ transferability and ceiling are the focus here.
CRAG trains a retrieval evaluator; Self‑RAG fine‑tunes a generator to critique its own context—both deploy trained models where this article uses free statistics.
Adaptive‑RAG routes queries by complexity before retrieval; this article argues the gate should be based on obtained evidence, not query shape.
Full‑context work uses LLM judges to ask “Is this enough?”—the opposite of the cheap‑signal approach.
The complete self‑correcting retrieval loop and its evaluation tools are described in the self‑correcting retrieval workshop; the upgrade components reference late‑interaction models, hybrid retrieval, and query decomposition.
