Boost Your RAG Pipeline with Cohere and BGE Rerank Models
This guide explains why post‑retrieval reranking is essential for Retrieval‑Augmented Generation, compares the commercial Cohere Rerank service with the open‑source bge‑reranker‑large model, and provides step‑by‑step code for integrating both into LlamaIndex pipelines, including a custom TEI‑based processor.
Why Rerank After Retrieval?
In Retrieval‑Augmented Generation (RAG) pipelines, a post‑retrieval reranking step reorders the retrieved chunks (or nodes) so that the most semantically relevant ones appear first. This improves the quality of the final LLM answer and allows a smaller top_k, which reduces context length. Reranking is especially useful when the primary index is non‑semantic, when multiple heterogeneous retrieval paths are combined, or when vector similarity alone yields sub‑optimal ordering due to model, language, or domain mismatches.
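The two‑stage pattern can be illustrated with a toy sketch. Everything here is invented for illustration: `vector_retrieve` stands in for vector search (using naive keyword overlap instead of embeddings) and `cross_score` stands in for a cross‑encoder reranker.

```python
# Toy two-stage pipeline: a cheap first-stage retriever over-fetches
# candidates, then a (stand-in) reranker rescores and truncates them.

corpus = [
    "ERNIE Bot is Baidu's large language model.",
    "The weather in Beijing is sunny today.",
    "ERNIE Bot performs well on logical reasoning benchmarks.",
]

def vector_retrieve(query: str, top_k: int) -> list:
    # Stand-in for vector search: naive keyword-overlap score.
    scored = sorted(corpus, key=lambda d: -len(set(query.split()) & set(d.split())))
    return scored[:top_k]

def cross_score(query: str, doc: str) -> float:
    # Stand-in for a cross-encoder relevance score in [0, 1].
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def retrieve_then_rerank(query: str, top_k: int = 3, top_n: int = 1) -> list:
    candidates = vector_retrieve(query, top_k)       # stage 1: recall
    reranked = sorted(candidates,                    # stage 2: precision
                      key=lambda d: cross_score(query, d), reverse=True)
    return reranked[:top_n]

print(retrieve_then_rerank("ERNIE Bot logical reasoning"))
```

The point of the pattern: stage 1 optimizes recall with a large top_k, stage 2 optimizes precision and trims the context down to top_n.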
Cohere Rerank Model (Online)
Cohere provides a closed‑source rerank‑multilingual‑v3.0 model that scores a list of texts against a query. The model can be accessed via the Cohere API and is already wrapped in LangChain and LlamaIndex.
from llama_index.postprocessor.cohere_rerank import CohereRerank
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
# Build a vector index
docs = SimpleDirectoryReader(input_files=["../../data/yiyan.txt"]).load_data()
nodes = SentenceSplitter(chunk_size=100, chunk_overlap=0).get_nodes_from_documents(docs)
vector_index = VectorStoreIndex(nodes)
retriever = vector_index.as_retriever(similarity_top_k=5)
# Retrieve without rerank (print_nodes is a user-defined helper that
# prints each node's score and text)
nodes = retriever.retrieve("百度文心一言的逻辑推理能力怎么样?")
print('=== before rerank ===')
print_nodes(nodes)
# Apply Cohere Rerank
cohere_rerank = CohereRerank(model='rerank-multilingual-v3.0', api_key='YOUR_API_KEY', top_n=2)
rerank_nodes = cohere_rerank.postprocess_nodes(nodes, query_str='百度文心一言的逻辑推理能力怎么样?')
print('=== after rerank ===')
print_nodes(rerank_nodes)

For Chinese queries the multilingual rerank‑multilingual‑v3.0 model must be used.
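Cohere's rerank endpoint returns a relevance score per document, normalized to [0, 1]. Besides keeping a fixed top_n, a common pattern is to also drop chunks below a score threshold. A minimal sketch with invented scores (the chunk texts, scores, and threshold value are all placeholders):

```python
# Invented (text, relevance_score) pairs standing in for reranker output,
# already sorted by descending score as rerank APIs return them.
reranked = [("chunk A", 0.92), ("chunk B", 0.41), ("chunk C", 0.07)]

SCORE_THRESHOLD = 0.2  # tune per corpus; there is no universal cutoff
top_n = 2

# Keep at most top_n chunks whose score clears the threshold.
kept = [(text, score) for text, score in reranked if score >= SCORE_THRESHOLD][:top_n]
print(kept)  # [('chunk A', 0.92), ('chunk B', 0.41)]
```

This prevents low-relevance chunks from being padded into the context just because top_n slots were available.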
Using Cohere Directly
import cohere
co = cohere.Client(api_key='YOUR_API_KEY')
query = "你的问题..."
docs = ["相关文档1...", "相关文档2...", "相关文档3..."]
results = co.rerank(
    model="rerank-multilingual-v3.0",
    query=query,
    documents=docs,
    top_n=3,  # request at most as many results as documents supplied
    return_documents=True,
)

bge‑reranker‑large Model (Local)
The open‑source bge‑reranker‑large model from BAAI can be served locally with HuggingFace Text‑Embeddings‑Inference (TEI). After installing TEI, start the service:
model=BAAI/bge-reranker-large
text-embeddings-router --model-id $model --port 8080

TEI exposes an HTTP service at http://localhost:8080 with a /rerank endpoint.
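Before wiring TEI into a pipeline, it helps to smoke‑test the endpoint directly. Below is a minimal sketch using only the Python standard library; the request body (`query`, `texts`, `truncate`) and the response shape (a list of `{"index", "score"}` objects) match what the custom post‑processor in the next section relies on. The function name `tei_rerank` and the sample texts are our own.

```python
import json
from urllib import request, error

def tei_rerank(query: str, texts: list, url: str = "http://localhost:8080"):
    """POST to TEI's /rerank endpoint; returns a list of
    {"index": int, "score": float} dicts, or None if unreachable."""
    body = json.dumps({"query": query, "texts": texts, "truncate": False}).encode("utf-8")
    req = request.Request(
        f"{url}/rerank", data=body, headers={"Content-Type": "application/json"}
    )
    try:
        with request.urlopen(req, timeout=5) as resp:
            return json.load(resp)
    except (error.URLError, OSError):
        return None

print(tei_rerank("什么是文心一言?", ["文心一言是百度的大模型。", "今天天气不错。"]))
```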
Custom LlamaIndex Postprocessor for TEI
The following class implements a LlamaIndex post‑processor that calls the TEI /rerank endpoint.
import requests
from typing import List, Optional
from llama_index.core.bridge.pydantic import Field
from llama_index.core.postprocessor.types import BaseNodePostprocessor
from llama_index.core.schema import NodeWithScore, QueryBundle
class BgeRerank(BaseNodePostprocessor):
    url: str = Field(description="Rerank server URL.")
    top_n: int = Field(description="Number of top nodes to return.")

    def __init__(self, top_n: int, url: str):
        super().__init__(url=url, top_n=top_n)

    def rerank(self, query: str, texts: List[str]):
        """Call the TEI /rerank endpoint and return its JSON response."""
        endpoint = f"{self.url}/rerank"
        body = {"query": query, "texts": texts, "truncate": False}
        resp = requests.post(endpoint, json=body)
        if resp.status_code != 200:
            raise RuntimeError(f"Failed to rerank: {resp.status_code} {resp.text}")
        return resp.json()

    def _postprocess_nodes(
        self,
        nodes: List[NodeWithScore],
        query_bundle: Optional[QueryBundle] = None,
    ) -> List[NodeWithScore]:
        if query_bundle is None:
            raise ValueError("Missing query bundle.")
        if not nodes:
            return []
        texts = [node.text for node in nodes]
        # TEI returns results sorted by descending score; each entry carries
        # the index of the original text and its relevance score.
        results = self.rerank(query=query_bundle.query_str, texts=texts)
        new_nodes = []
        for result in results[: self.top_n]:
            new_node = NodeWithScore(node=nodes[result["index"]].node, score=result["score"])
            new_nodes.append(new_node)
        return new_nodes

Example Usage with LlamaIndex
# Assume `nodes` were retrieved as in the Cohere example
custom_rerank = BgeRerank(url="http://localhost:8080", top_n=2)
rerank_nodes = custom_rerank.postprocess_nodes(nodes, query_str="百度文心一言的逻辑推理能力怎么样?")
print_nodes(rerank_nodes)

Installation of TEI (Linux/macOS example)
Install Rust (required for TEI):
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Clone the TEI repository:
git clone https://github.com/huggingface/text-embeddings-inference.git

Build and install the router (use metal for Apple Silicon or mkl for Intel):
cd text-embeddings-inference
cargo install --path router -F metal

Start the service (on Linux you may need libssl-dev and gcc installed):
model=BAAI/bge-reranker-large
text-embeddings-router --model-id $model --port 8080

Conclusion
Both the commercial Cohere Rerank service and the open‑source bge‑reranker‑large model can substantially improve the relevance ordering of retrieved chunks, leading to higher‑quality LLM outputs. The open‑source option, deployed via TEI, enables on‑premise usage suitable for private or enterprise environments while keeping resource requirements modest.
AI Large Model Application Practice
Focused on deep research and development of large-model applications. Authors of "RAG Application Development and Optimization Based on Large Models" and "MCP Principles Unveiled and Development Guide". Primarily B2B, with B2C as a supplement.
