How to Build a Production-Ready RAG System with Qwen3 Embedding and Reranker Models
This guide shows how to use Alibaba's new Qwen3-Embedding and Qwen3-Reranker models to build a two‑stage Retrieval‑Augmented Generation pipeline with Milvus. It covers environment setup, data ingestion, vector indexing, reranking, and LLM‑driven answer generation, ending with a pipeline that handles multilingual queries well.
Introduction
Alibaba recently released two new models in the Qwen3 family: Qwen3-Embedding and Qwen3-Reranker, each available in 0.6B, 4B, and 8B sizes. Built on the Qwen3 base, they support 119 languages, covering both natural and programming languages.
Key Performance Highlights
Qwen3-Embedding‑8B scores 70.58 on the MTEB multilingual benchmark, surpassing BGE, E5, and even Google Gemini.
Qwen3-Reranker‑8B achieves 69.02 on multilingual ranking tasks and 77.45 on Chinese, making it a top open‑source reranker.
Both models map Chinese queries and English documents into the same semantic space, which makes them a natural fit for global search or customer‑service scenarios.
These results show that the models are not only competitive among open‑source options but also match or exceed mainstream commercial APIs, making them ready for production in RAG, cross‑language search, and code‑search systems.
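To make the shared‑semantic‑space idea concrete, here is a toy sketch of cosine similarity. The vectors are made up for illustration and are not real model outputs; with Qwen3‑Embedding, a Chinese query and its English equivalent would similarly land close together while unrelated text scores lower.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Made-up 4-dim vectors standing in for real embeddings:
en_query = [0.8, 0.1, 0.5, 0.2]     # "How is data stored?"
zh_query = [0.79, 0.12, 0.48, 0.2]  # Chinese phrasing of the same question
unrelated = [0.1, 0.9, 0.05, 0.4]   # an unrelated sentence

print(cosine(en_query, zh_query))   # close to 1.0
print(cosine(en_query, unrelated))  # noticeably lower
```

Retrieval in the tutorial below relies on exactly this property: nearest neighbors in embedding space are treated as the most relevant documents, regardless of language.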
RAG Tutorial (Qwen3-Embedding‑0.6B + Qwen3‑Reranker‑0.6B)
Environment Preparation
<code>!pip install --upgrade pymilvus openai requests tqdm sentence-transformers transformers</code>
This tutorial requires transformers>=4.51.0 and sentence-transformers>=2.7.0.
Set your OpenAI API key as an environment variable for the LLM.
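If the key is supplied externally (for example, exported in your shell) rather than hard-coded, a fail-fast check avoids opaque authentication errors on the first API call. The require_api_key helper below is a hypothetical sketch of ours, not part of the OpenAI SDK:

```python
import os

def require_api_key(name="OPENAI_API_KEY"):
    """Return the named environment variable, or fail loudly if unset."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the pipeline")
    return key
```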
<code>import os
os.environ["OPENAI_API_KEY"] = "sk-************"</code>
Data Preparation
Use the Milvus documentation FAQ as a private knowledge source.
<code>!wget https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/milvus_docs_2.4.x_en.zip
!unzip -q milvus_docs_2.4.x_en.zip -d milvus_docs</code>
Load all markdown files and split them by "#" headings.
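Splitting on "# " is a deliberately crude chunker: it also splits on deeper headings ("## ", "### ") and leaves stray "#" characters at chunk boundaries, which is acceptable for coarse FAQ chunks. A quick illustration on a made-up snippet:

```python
# Toy snippet standing in for a real FAQ markdown file
sample = "# Intro\nMilvus is a vector database.\n## FAQ\n### How is data stored?\nDetails here."

chunks = sample.split("# ")
# The leading "#" yields an empty first chunk, and the "##"/"###"
# headings are split as well, leaving stray "#" characters behind.
print(len(chunks))  # 4 chunks, the first one empty
```

For a production system you would likely use a markdown-aware splitter instead, but the simple version keeps the tutorial focused on retrieval.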
<code>from glob import glob

text_lines = []
for file_path in glob("milvus_docs/en/faq/*.md", recursive=True):
    with open(file_path, "r") as f:
        file_text = f.read()
    text_lines += file_text.split("# ")
</code>
Load LLM and Embedding Models
<code>from openai import OpenAI
from sentence_transformers import SentenceTransformer
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

openai_client = OpenAI()
embedding_model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

reranker_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Reranker-0.6B", padding_side='left')
reranker_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-Reranker-0.6B").eval()

# The reranker answers "yes"/"no"; we need those token ids to read its logits
token_false_id = reranker_tokenizer.convert_tokens_to_ids("no")
token_true_id = reranker_tokenizer.convert_tokens_to_ids("yes")
max_reranker_length = 8192

prefix = "<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be \"yes\" or \"no\".<|im_end|>\n<|im_start|>user\n"
suffix = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"
prefix_tokens = reranker_tokenizer.encode(prefix, add_special_tokens=False)
suffix_tokens = reranker_tokenizer.encode(suffix, add_special_tokens=False)
</code>
Utility Functions
<code>def emb_text(text, is_query=False):
    """Generate text embeddings with Qwen3-Embedding-0.6B.

    Qwen3 embedding models are instruction-aware, so queries are
    encoded with the dedicated "query" prompt while documents are not.
    """
    if is_query:
        embeddings = embedding_model.encode([text], prompt_name="query")
    else:
        embeddings = embedding_model.encode([text])
    return embeddings[0].tolist()

def format_instruction(instruction, query, doc):
    if instruction is None:
        instruction = 'Given a web search query, retrieve relevant passages that answer the query'
    return f"<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {doc}"

def process_inputs(pairs):
    # Tokenize without padding first, reserving room for the fixed prefix/suffix
    inputs = reranker_tokenizer(
        pairs, padding=False, truncation='longest_first',
        return_attention_mask=False,
        max_length=max_reranker_length - len(prefix_tokens) - len(suffix_tokens)
    )
    for i, ele in enumerate(inputs['input_ids']):
        inputs['input_ids'][i] = prefix_tokens + ele + suffix_tokens
    inputs = reranker_tokenizer.pad(inputs, padding=True, return_tensors="pt", max_length=max_reranker_length)
    for key in inputs:
        inputs[key] = inputs[key].to(reranker_model.device)
    return inputs

@torch.no_grad()
def compute_logits(inputs, **kwargs):
    # Relevance score = P("yes") from the softmax over the yes/no logits
    batch_scores = reranker_model(**inputs).logits[:, -1, :]
    true_vector = batch_scores[:, token_true_id]
    false_vector = batch_scores[:, token_false_id]
    batch_scores = torch.stack([false_vector, true_vector], dim=1)
    batch_scores = torch.nn.functional.log_softmax(batch_scores, dim=1)
    scores = batch_scores[:, 1].exp().tolist()
    return scores

def rerank_documents(query, documents, task_instruction=None):
    if task_instruction is None:
        task_instruction = 'Given a web search query, retrieve relevant passages that answer the query'
    pairs = [format_instruction(task_instruction, query, doc) for doc in documents]
    inputs = process_inputs(pairs)
    scores = compute_logits(inputs)
    doc_scores = list(zip(documents, scores))
    doc_scores.sort(key=lambda x: x[1], reverse=True)
    return doc_scores
</code>
Milvus Collection Setup
<code>from pymilvus import MilvusClient

milvus_client = MilvusClient(uri="./milvus_demo.db")
collection_name = "my_rag_collection"

# Drop any stale collection so the tutorial is safely re-runnable
if milvus_client.has_collection(collection_name):
    milvus_client.drop_collection(collection_name)

# Probe the embedding dimension with a test sentence
test_embedding = emb_text("This is a test")
embedding_dim = len(test_embedding)

milvus_client.create_collection(
    collection_name=collection_name,
    dimension=embedding_dim,
    metric_type="IP",  # inner product similarity
    consistency_level="Strong",
)
</code>
Insert Data
<code>from tqdm import tqdm

data = []
for i, line in enumerate(tqdm(text_lines, desc="Creating embeddings")):
    data.append({"id": i, "vector": emb_text(line), "text": line})

milvus_client.insert(collection_name=collection_name, data=data)
</code>
Retrieval and Reranking
<code>question = "How is data stored in milvus?"

# Stage 1: vector search for the top 10 candidates
search_res = milvus_client.search(
    collection_name=collection_name,
    data=[emb_text(question, is_query=True)],
    limit=10,
    search_params={"metric_type": "IP", "params": {}},
    output_fields=["text"],
)
candidate_docs = [res["entity"]["text"] for res in search_res[0]]

# Stage 2: rerank the candidates and keep the top 3
print("Reranking documents...")
reranked_docs = rerank_documents(question, candidate_docs)
top_reranked_docs = reranked_docs[:3]
print(f"Selected top {len(top_reranked_docs)} documents after reranking")
</code>
Generate RAG Response with OpenAI GPT‑4o
<code>context = "\n".join([doc for doc, _ in top_reranked_docs])

SYSTEM_PROMPT = """Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided."""

USER_PROMPT = f"""Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>{context}</context>
<question>{question}</question>"""

response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT},
    ],
)
print(response.choices[0].message.content)
</code>
Conclusion
The Qwen3‑Embedding and Qwen3‑Reranker models deliver strong multilingual performance while remaining lightweight enough for local deployment. Their combination enables an efficient two‑stage retrieval‑augmented generation pipeline that balances speed, accuracy, and cost, making it suitable for small‑to‑medium enterprises and individual developers.
Instant Consumer Technology Team