Why Rerank Beats Simple Retrieval in RAG: Practical Tips & Code

This article explains the limitations of Bi‑Encoder retrieval, introduces Cross‑Encoder rerankers, shows how a cascade of recall‑rerank‑generation improves answer quality, and provides concrete code, threshold‑filtering strategies, and domain‑specific fine‑tuning techniques for industrial RAG systems.

Wu Shixiong's Large Model Academy

1. Retrieval Quantity Does Not Equal Quality: Limits of Bi‑Encoder

Bi‑Encoder encodes the query and each document independently with the same encoder, producing vectors that are compared by cosine similarity or dot product. This "dual‑tower" design enables fast offline indexing and millisecond‑level nearest‑neighbor search, but it cannot assess whether a document actually answers the query because there is no interaction between query and document during encoding.
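
For contrast, a minimal Bi‑Encoder sketch, assuming the sentence‑transformers library; the embedding model name is an illustrative choice, not one prescribed by this article:

from sentence_transformers import SentenceTransformer, util

# Illustrative model choice; any bi-encoder embedding model works the same way
encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")

query = "What is the waiting period for critical illness?"
docs = [
    "The waiting period for critical illness is 180 days.",
    "During the waiting period, only the premium paid is refunded.",
]

# Query and documents are encoded independently; they never see each other
query_vec = encoder.encode(query, normalize_embeddings=True)
doc_vecs = encoder.encode(docs, normalize_embeddings=True)

# Relevance is reduced to a single cosine similarity per document
similarities = util.cos_sim(query_vec, doc_vecs)
print(similarities)  # both scores tend to be high because the surface wording overlaps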

In a financial‑insurance use case with 5,000 contract documents, a query like "What is the waiting period for critical illness?" may retrieve 20 passages that all contain the term "waiting period". Only one passage provides the correct answer (e.g., "The waiting period is 180 days"); the others are semantically similar but irrelevant, illustrating why high similarity scores do not guarantee answer relevance.

2. How Cross‑Encoder Rerankers Work

Cross‑Encoder concatenates the query and a candidate document, feeds the pair into a Transformer, and lets every token attend to every other token. The model outputs a relevance score between 0 and 1, effectively judging whether the document truly answers the query.

The main advantage is token‑level interaction, but the downside is that each (query, document) pair must be processed online, making it unsuitable for full‑corpus retrieval. Therefore, Cross‑Encoder is used only in the reranking stage for a small set of candidates (e.g., the top‑20 retrieved by Bi‑Encoder).

Commonly used reranker models include:

BGE‑Reranker series (BAAI/bge‑reranker‑v2‑m3, strong Chinese performance)

Cohere Rerank API (closed‑source, high effectiveness)

JinaAI Reranker
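
The snippet below loads BGE‑Reranker‑v2‑m3 and scores each (query, document) pair: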

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load BGE‑Reranker model
model_name = "BAAI/bge-reranker-v2-m3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def rerank(query: str, documents: list[str], top_k: int = 5) -> list[dict]:
    """Use Cross‑Encoder to rerank candidate documents.
    Args:
        query: user question
        documents: list of documents returned by Bi‑Encoder
        top_k: number of documents to keep after reranking
    Returns:
        List of documents sorted by relevance score (desc), each with text and score.
    """
    pairs = [[query, doc] for doc in documents]
    inputs = tokenizer(
        pairs,
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors="pt"
    )
    with torch.no_grad():
        scores = model(**inputs).logits.squeeze(-1)
        scores = torch.sigmoid(scores).tolist()
    scored_docs = [{"text": doc, "score": score} for doc, score in zip(documents, scores)]
    scored_docs.sort(key=lambda x: x["score"], reverse=True)
    return scored_docs[:top_k]

The code pairs each document with the query, runs the Cross‑Encoder, converts logits to probabilities, sorts by score, and returns the top‑k results.

3. Cascaded Recall‑Rerank‑Generation Architecture

The optimal pipeline combines the speed of Bi‑Encoder with the precision of Cross‑Encoder:

Bi‑Encoder quickly recalls the top‑20 candidate documents from the full corpus (milliseconds).

Cross‑Encoder reranks those 20 candidates and selects the top‑5 (tens of milliseconds).

The selected five high‑quality chunks are fed to the LLM for answer generation.
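
As a concrete illustration of the last step, here is a minimal sketch of assembling the reranked chunks into a grounded prompt; the prompt wording and the LLM call are hypothetical placeholders, not part of the project code:

def build_prompt(query: str, docs: list[dict]) -> str:
    """Assemble the reranked chunks into a context-grounded prompt for the LLM."""
    context = "\n\n".join(f"[{i + 1}] {doc['text']}" for i, doc in enumerate(docs))
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

# top_docs = rerank(query, recalled_texts, top_k=5)
# answer = llm_client.generate(build_prompt(query, top_docs))  # hypothetical LLM call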

Running this cascade on the insurance project reduced noise in the chunks sent to the LLM from 42 % to 11 %, lowered the hallucination rate from 18.3 % to 6.4 %, and raised average retrieval latency from 12 ms to just 58 ms, an acceptable trade‑off for production systems.

4. Threshold Filtering: "Better None Than Bad"

After reranking, each document receives a relevance score (0‑1). If all scores are low, even the top‑5 may be noise, leading the LLM to hallucinate. Therefore, an absolute threshold (e.g., 0.5) should be applied: documents below the threshold are discarded, even if that leaves fewer than the desired number of chunks.

If no document passes the threshold, the system should inform the user that no relevant knowledge was found instead of forcing a fabricated answer.

def filter_by_threshold(
    scored_docs: list[dict],
    threshold: float = 0.5
) -> list[dict]:
    """Discard documents with a relevance score below the threshold.
    Args:
        scored_docs: output of rerank(), each with "text" and "score"
        threshold: minimum score to keep a document
    Returns:
        Filtered list of documents (may be empty).
    """
    filtered = [doc for doc in scored_docs if doc["score"] >= threshold]
    return filtered


def rag_retrieve(
    query: str,
    vector_store,
    reranker_model,
    top_k_recall: int = 20,
    top_k_rerank: int = 5,
    threshold: float = 0.5
) -> list[dict]:
    """Full RAG pipeline: vector recall → rerank → threshold filter.
    Returns the final list of documents to give to the LLM.
    """
    recall_results = vector_store.similarity_search(query, k=top_k_recall)
    recall_texts = [doc.page_content for doc in recall_results]
    reranked = reranker_model.rerank(query, recall_texts, top_k=top_k_rerank)
    final_docs = filter_by_threshold(reranked, threshold=threshold)
    return final_docs
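
A usage sketch of the pipeline, including the "better none than bad" fallback when nothing clears the threshold; here vector_store and reranker_model stand for whatever store and reranker wrapper the function above assumes, and the LLM call is a hypothetical placeholder:

query = "What is the waiting period for critical illness?"
final_docs = rag_retrieve(query, vector_store, reranker_model, threshold=0.5)

if not final_docs:
    # Nothing passed the threshold: inform the user instead of letting the LLM guess
    answer = "Sorry, no relevant clause was found in the knowledge base for this question."
else:
    context = "\n\n".join(doc["text"] for doc in final_docs)
    # answer = llm_client.generate(...)  # hypothetical LLM call with query + context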

The optimal threshold was determined offline by sweeping values (e.g., 0.3‑0.8) on a labeled QA set and selecting the point with the highest F1; for the insurance project the best threshold was 0.52, raising F1 from 0.74 to 0.81.
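
A minimal sketch of such a sweep; the labeled_set structure and the document‑level notion of relevance are assumptions for illustration, and the article's own evaluation set is not reproduced here:

def sweep_threshold(labeled_set: list[dict], scored_runs: list[list[dict]],
                    thresholds: list[float] | None = None) -> float:
    """Pick the threshold that maximizes document-level F1 on a labeled QA set.

    labeled_set[i]["relevant"] is the set of gold passage texts for query i;
    scored_runs[i] is the reranked output of rerank() for that query.
    """
    if thresholds is None:
        thresholds = [t / 100 for t in range(30, 81, 2)]  # sweep 0.30 .. 0.80
    best_t, best_f1 = thresholds[0], -1.0
    for t in thresholds:
        tp = fp = fn = 0
        for gold, run in zip(labeled_set, scored_runs):
            kept = {d["text"] for d in run if d["score"] >= t}
            relevant = set(gold["relevant"])
            tp += len(kept & relevant)
            fp += len(kept - relevant)
            fn += len(relevant - kept)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t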

5. Domain‑Specific Fine‑Tuning of Rerankers

Generic rerankers perform well on open‑domain data but often miss domain‑specific terminology. In the insurance scenario, queries about the minor‑illness benefit were poorly matched by the generic model because the user's phrasing differs from the wording used in the contracts.

Fine‑tuning involves creating triplets (query, positive_doc, negative_doc). Hard negatives are documents that are semantically close to the query yet do not contain the answer—exactly the type of errors Bi‑Encoder makes.

import json

def prepare_finetune_data(
    qa_pairs: list[dict],
    vector_store,
    reranker_model,
    output_path: str
) -> None:
    """Generate training triples for reranker fine‑tuning.
    Each triple contains a query, a correct document, and a hard negative.
    """
    training_examples = []
    for qa in qa_pairs:
        query = qa["query"]
        positive_doc = qa["positive_doc"]
        recall_candidates = vector_store.similarity_search(query, k=50)
        hard_negatives = []
        for candidate in recall_candidates:
            if candidate.page_content == positive_doc:
                continue
            if not contains_answer(candidate.page_content, qa["answer"]):
                hard_negatives.append(candidate.page_content)
        if not hard_negatives:
            continue
        hardest_negative = hard_negatives[0]
        training_examples.append({
            "query": query,
            "pos": [positive_doc],
            "neg": [hardest_negative]
        })
    with open(output_path, "w", encoding="utf-8") as f:
        for ex in training_examples:
            f.write(json.dumps(ex, ensure_ascii=False) + "\n")
    print(f"Generated {len(training_examples)} fine‑tuning examples to {output_path}")

def contains_answer(doc_text: str, answer: str) -> bool:
    """Simple heuristic: check if any answer keyword appears in the document.
    Real projects may use more sophisticated semantic matching.
    """
    answer_keywords = answer.split()
    return any(kw in doc_text for kw in answer_keywords if len(kw) > 1)
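
The JSONL written above follows the query/pos/neg layout commonly used for BGE reranker fine‑tuning. As one possible training path, here is a minimal sketch with the sentence‑transformers CrossEncoder API; the file path and hyperparameters are placeholders, not the project's actual settings:

import json

from torch.utils.data import DataLoader
from sentence_transformers import InputExample
from sentence_transformers.cross_encoder import CrossEncoder

# Load the triples written by prepare_finetune_data()
samples = []
with open("reranker_train.jsonl", encoding="utf-8") as f:  # hypothetical path
    for line in f:
        ex = json.loads(line)
        # One positive and one hard-negative pair per training triple
        samples.append(InputExample(texts=[ex["query"], ex["pos"][0]], label=1.0))
        samples.append(InputExample(texts=[ex["query"], ex["neg"][0]], label=0.0))

model = CrossEncoder("BAAI/bge-reranker-v2-m3", num_labels=1, max_length=512)
train_dataloader = DataLoader(samples, shuffle=True, batch_size=8)

# Placeholder hyperparameters; tune on a held-out validation set
model.fit(
    train_dataloader=train_dataloader,
    epochs=2,
    warmup_steps=50,
    output_path="bge-reranker-insurance",
)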

Fine‑tuning BGE‑Reranker‑v2‑m3 on 500 insurance‑domain triples raised Precision@5 from 0.71 to 0.86, Recall@5 from 0.68 to 0.82, NDCG@5 from 0.74 to 0.88, and reduced hallucination rate from 9.1 % to 4.8 %.

6. How to Answer RAG Rerank Questions in Interviews

Structure the answer in four layers (total 2‑3 minutes):

Layer 1 (≈20 s): Explain the inherent limitation of Bi‑Encoder—no query‑document interaction, so high semantic similarity does not guarantee answer relevance.

Layer 2 (≈1 min): Describe Cross‑Encoder mechanics, the cascade architecture, and cite the project’s hallucination‑rate drop (18.3 % → 6.4 %).

Layer 3 (≈1 min): Discuss threshold filtering, how the optimal threshold (0.52) was found, and its impact on F1.

Layer 4 (≈30 s): Mention domain‑specific fine‑tuning (500‑1000 triples) and the resulting Precision@5 improvement (0.71 → 0.86), which demonstrates deeper understanding.

Conclusion

The core RAG pipeline is "recall‑rerank‑generate". Many engineers focus on vector‑search tuning while neglecting the rerank stage, which directly determines the quality of the context fed to the LLM. Proper reranking, threshold filtering, and domain‑specific fine‑tuning are essential practices for building reliable, industrial‑grade RAG systems.

Tags: RAG, rerank, AI retrieval, Cross-Encoder, Bi-Encoder, domain fine-tuning, threshold filtering
Written by

Wu Shixiong's Large Model Academy

We continuously share large‑model know‑how covering LLMs, RAG, fine‑tuning, and deployment, helping career‑switchers, autumn‑recruitment candidates, and anyone seeking a stable large‑model role go from zero to job offer.
