Boost Your RAG Pipeline with Cohere and BGE Rerank Models

This guide explains why post‑retrieval reranking is essential for Retrieval‑Augmented Generation, compares the commercial Cohere Rerank service with the open‑source bge‑reranker‑large model, and provides step‑by‑step code for integrating both into LlamaIndex pipelines, including a custom TEI‑based post‑processor.

Why Rerank After Retrieval?

In Retrieval‑Augmented Generation (RAG) pipelines, a post‑retrieval reranking step reorders the retrieved chunks (nodes) so that the most semantically relevant ones come first. This improves the quality of the final LLM answer and lets you keep only a small top_n of chunks after retrieval, reducing context length. Reranking is especially useful when:

- the primary index is non‑semantic (for example, keyword‑based), so the first‑stage ordering is weak;
- multiple heterogeneous retrieval paths are combined and their scores are not directly comparable;
- vector similarity alone yields sub‑optimal ordering due to embedding‑model, language, or domain mismatches.
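Conceptually the pattern is simple: over‑retrieve, re‑score, truncate. The sketch below is purely illustrative; retrieve_then_rerank, retriever, and reranker are hypothetical placeholders, not library APIs.

def retrieve_then_rerank(query, retriever, reranker, top_k=20, top_n=3):
    candidates = retriever(query, top_k)    # first stage: broad recall (e.g. vector search)
    reranked = reranker(query, candidates)  # second stage: precise re-scoring (cross-encoder)
    return reranked[:top_n]                 # keep only the best few for the LLM context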

Cohere Rerank Model (Online)

Cohere offers a closed‑source rerank‑multilingual‑v3.0 model that scores a list of texts against a query. It is accessed through the Cohere API and already has ready‑made wrappers in both LangChain and LlamaIndex; the LlamaIndex integration is used below.

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.postprocessor.cohere_rerank import CohereRerank

# Simple helper to show each node's score and text
def print_nodes(nodes):
    for node in nodes:
        print(f"[{node.score}] {node.text}")

# Build a vector index over the sample document
docs = SimpleDirectoryReader(input_files=["../../data/yiyan.txt"]).load_data()
nodes = SentenceSplitter(chunk_size=100, chunk_overlap=0).get_nodes_from_documents(docs)
vector_index = VectorStoreIndex(nodes)
retriever = vector_index.as_retriever(similarity_top_k=5)

# Retrieve without reranking
query = "百度文心一言的逻辑推理能力怎么样?"  # "How good is Baidu ERNIE Bot at logical reasoning?"
nodes = retriever.retrieve(query)
print('=== before rerank ===')
print_nodes(nodes)

# Rerank the retrieved nodes and keep the top 2
cohere_rerank = CohereRerank(model='rerank-multilingual-v3.0', api_key='YOUR_API_KEY', top_n=2)
rerank_nodes = cohere_rerank.postprocess_nodes(nodes, query_str=query)
print('=== after rerank ===')
print_nodes(rerank_nodes)

Note that for Chinese (or any non‑English) queries, the multilingual model rerank-multilingual-v3.0 must be used; Cohere's English‑only rerank models will not score Chinese text reliably.
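The reranker can also be attached directly to a query engine, so every query goes through retrieval and reranking before the LLM sees the context. A minimal sketch, reusing vector_index and cohere_rerank from above and assuming an LLM is configured (e.g. via Settings.llm):

# Attach the reranker as a node post-processor on the query engine
query_engine = vector_index.as_query_engine(
    similarity_top_k=5,                   # over-retrieve 5 candidates...
    node_postprocessors=[cohere_rerank],  # ...then keep the reranked top 2
)
response = query_engine.query("百度文心一言的逻辑推理能力怎么样?")
print(response)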

Using Cohere Directly

import cohere

co = cohere.Client(api_key='YOUR_API_KEY')
query = "你的问题..."  # "Your question..."
docs = ["相关文档1...", "相关文档2...", "相关文档3..."]  # "Relevant document 1/2/3..."
results = co.rerank(
    model="rerank-multilingual-v3.0",
    query=query,
    documents=docs,
    top_n=3,                # at most len(docs) results are returned
    return_documents=True,  # include the document text in each result
)
# Each result carries the index of the original document and a relevance score
for r in results.results:
    print(r.index, r.relevance_score)

bge‑reranker‑large Model (Local)

The open‑source bge‑reranker‑large model from BAAI can be served locally with Hugging Face's Text Embeddings Inference (TEI). After installing TEI (see the installation steps at the end of this article), start the service:

model=BAAI/bge-reranker-large
text-embeddings-router --model-id $model --port 8080

TEI is a Rust‑based server that exposes an HTTP API at http://localhost:8080, including a /rerank endpoint that scores a list of texts against a query.
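A quick way to check that the service is up is to post a toy request directly (a minimal sketch using requests; the endpoint returns a JSON array of {"index", "score"} objects sorted by descending score):

import requests

resp = requests.post(
    "http://localhost:8080/rerank",
    json={
        "query": "What is deep learning?",
        "texts": [
            "Deep learning is a branch of machine learning.",
            "Cheese is made from milk.",
        ],
    },
)
print(resp.json())  # e.g. [{"index": 0, "score": 0.99}, {"index": 1, "score": 0.01}]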

Custom LlamaIndex Postprocessor for TEI

The following class implements a LlamaIndex post‑processor that calls the TEI /rerank endpoint.

import requests
from typing import List, Optional

from llama_index.core.bridge.pydantic import Field
from llama_index.core.postprocessor.types import BaseNodePostprocessor
from llama_index.core.schema import NodeWithScore, QueryBundle

class BgeRerank(BaseNodePostprocessor):
    """Node post-processor that reranks nodes via a TEI /rerank endpoint."""

    url: str = Field(description="Rerank server URL.")
    top_n: int = Field(description="Number of top nodes to return.")

    def __init__(self, top_n: int, url: str):
        super().__init__(url=url, top_n=top_n)

    def rerank(self, query: str, texts: List[str]):
        """Call the TEI /rerank API; returns [{"index": i, "score": s}, ...]
        sorted by descending score."""
        endpoint = f"{self.url}/rerank"
        body = {"query": query, "texts": texts, "truncate": False}
        resp = requests.post(endpoint, json=body, timeout=30)
        if resp.status_code != 200:
            raise RuntimeError(f"Failed to rerank ({resp.status_code}): {resp.text}")
        return resp.json()

    def _postprocess_nodes(
        self,
        nodes: List[NodeWithScore],
        query_bundle: Optional[QueryBundle] = None,
    ) -> List[NodeWithScore]:
        if query_bundle is None:
            raise ValueError("Missing query bundle.")
        if not nodes:
            return []
        texts = [node.text for node in nodes]
        results = self.rerank(query=query_bundle.query_str, texts=texts)
        # TEI returns results sorted by score, so the first top_n are the best
        new_nodes = []
        for result in results[: self.top_n]:
            idx = result["index"]    # position of the text in the input list
            score = result["score"]  # relevance score from the cross-encoder
            new_nodes.append(NodeWithScore(node=nodes[idx].node, score=score))
        return new_nodes

Example Usage with LlamaIndex

# Assume `nodes` were retrieved as in the Cohere example
custom_rerank = BgeRerank(url="http://localhost:8080", top_n=2)
rerank_nodes = custom_rerank.postprocess_nodes(nodes, query_str="百度文心一言的逻辑推理能力怎么样?")
print_nodes(rerank_nodes)
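
Like CohereRerank, this custom post‑processor can also be passed to a query engine via node_postprocessors=[custom_rerank] in as_query_engine, so reranking happens transparently on every query.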

Installation of TEI (Linux/macOS example)

Install Rust (required for TEI):

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Clone the TEI repository:

git clone https://github.com/huggingface/text-embeddings-inference.git

Build and install the router (use metal for Apple Silicon or mkl for Intel):

cd text-embeddings-inference
cargo install --path router -F metal

Start the service (Linux may need libssl-dev and gcc):

model=BAAI/bge-reranker-large
text-embeddings-router --model-id $model --port 8080

Conclusion

Both the commercial Cohere Rerank service and the open‑source bge‑reranker‑large model can substantially improve the relevance ordering of retrieved chunks, leading to higher‑quality LLM outputs. The open‑source option, deployed via TEI, enables on‑premise usage suitable for private or enterprise environments while keeping resource requirements modest.
