How to Build a Production-Ready RAG System with Qwen3 Embedding and Reranker Models
This guide shows how to use Alibaba's new Qwen3-Embedding and Qwen3-Reranker models to build a two‑stage Retrieval‑Augmented Generation pipeline with Milvus. It covers environment setup, data ingestion, vector indexing, reranking, and LLM‑driven answer generation, ending with a pipeline that handles multilingual queries well.
Introduction
Alibaba recently released two new models in the Qwen3 family: Qwen3-Embedding and Qwen3-Reranker, each available in 0.6B, 4B, and 8B sizes. Built on the Qwen3 base, they support 119 languages, covering both natural and programming languages.
Key Performance Highlights
Qwen3-Embedding‑8B scores 70.58 on the MTEB multilingual benchmark, surpassing BGE, E5, and even Google Gemini.
Qwen3-Reranker‑8B achieves 69.02 on multilingual ranking tasks and 77.45 on Chinese, making it a top open‑source reranker.
Both models map Chinese queries and English documents into the same semantic space, which makes them a natural fit for global search or customer‑service scenarios.
These results show that the models are not only competitive among open‑source options but also match or exceed mainstream commercial APIs, making them ready for production in RAG, cross‑language search, and code‑search systems.
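To make the shared‑semantic‑space idea concrete, here is a toy sketch of cosine similarity. The vectors are made up for illustration and are not real model outputs; with Qwen3‑Embedding, a Chinese query and its English equivalent would similarly land close together while unrelated text scores lower.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Made-up 4-dim vectors standing in for real embeddings:
en_query = [0.8, 0.1, 0.5, 0.2]     # "How is data stored?"
zh_query = [0.79, 0.12, 0.48, 0.2]  # Chinese phrasing of the same question
unrelated = [0.1, 0.9, 0.05, 0.4]   # an unrelated sentence

print(cosine(en_query, zh_query))   # close to 1.0
print(cosine(en_query, unrelated))  # noticeably lower
```

Retrieval in the tutorial below relies on exactly this property: nearest neighbors in embedding space are treated as the most relevant documents, regardless of language.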
RAG Tutorial (Qwen3-Embedding‑0.6B + Qwen3‑Reranker‑0.6B)
Environment Preparation
<code>!pip install --upgrade pymilvus openai requests tqdm sentence-transformers transformers</code>
This tutorial requires transformers>=4.51.0 and sentence-transformers>=2.7.0.
Set your OpenAI API key as an environment variable for the LLM.
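If the key is supplied externally (for example, exported in your shell) rather than hard-coded, a fail-fast check avoids opaque authentication errors on the first API call. The require_api_key helper below is a hypothetical sketch of ours, not part of the OpenAI SDK:

```python
import os

def require_api_key(name="OPENAI_API_KEY"):
    """Return the named environment variable, or fail loudly if unset."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the pipeline")
    return key
```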
<code>import os
os.environ["OPENAI_API_KEY"] = "sk-************"</code>
Data Preparation
Use the Milvus documentation FAQ as a private knowledge source.
<code>!wget https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/milvus_docs_2.4.x_en.zip
!unzip -q milvus_docs_2.4.x_en.zip -d milvus_docs</code>
Load all markdown files and split them by "#" headings.
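Splitting on "# " is a deliberately crude chunker: it also splits on deeper headings ("## ", "### ") and leaves stray "#" characters at chunk boundaries, which is acceptable for coarse FAQ chunks. A quick illustration on a made-up snippet:

```python
# Toy snippet standing in for a real FAQ markdown file
sample = "# Intro\nMilvus is a vector database.\n## FAQ\n### How is data stored?\nDetails here."

chunks = sample.split("# ")
# The leading "#" yields an empty first chunk, and the "##"/"###"
# headings are split as well, leaving stray "#" characters behind.
print(len(chunks))  # 4 chunks, the first one empty
```

For a production system you would likely use a markdown-aware splitter instead, but the simple version keeps the tutorial focused on retrieval.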
<code>from glob import glob

text_lines = []
for file_path in glob("milvus_docs/en/faq/*.md", recursive=True):
    with open(file_path, "r") as f:
        file_text = f.read()
    text_lines += file_text.split("# ")
</code>
Load LLM and Embedding Models
<code>from openai import OpenAI
from sentence_transformers import SentenceTransformer
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

openai_client = OpenAI()
embedding_model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

reranker_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Reranker-0.6B", padding_side='left')
reranker_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-Reranker-0.6B").eval()

# The reranker answers "yes"/"no"; we need those token ids to read its logits
token_false_id = reranker_tokenizer.convert_tokens_to_ids("no")
token_true_id = reranker_tokenizer.convert_tokens_to_ids("yes")
max_reranker_length = 8192

prefix = "<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be \"yes\" or \"no\".<|im_end|>\n<|im_start|>user\n"
suffix = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"
prefix_tokens = reranker_tokenizer.encode(prefix, add_special_tokens=False)
suffix_tokens = reranker_tokenizer.encode(suffix, add_special_tokens=False)
</code>
Utility Functions
<code>def emb_text(text, is_query=False):
    """Generate text embeddings with Qwen3-Embedding-0.6B.

    Qwen3 embedding models are instruction-aware, so queries are
    encoded with the dedicated "query" prompt while documents are not.
    """
    if is_query:
        embeddings = embedding_model.encode([text], prompt_name="query")
    else:
        embeddings = embedding_model.encode([text])
    return embeddings[0].tolist()

def format_instruction(instruction, query, doc):
    if instruction is None:
        instruction = 'Given a web search query, retrieve relevant passages that answer the query'
    return f"<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {doc}"

def process_inputs(pairs):
    # Tokenize without padding first, reserving room for the fixed prefix/suffix
    inputs = reranker_tokenizer(
        pairs, padding=False, truncation='longest_first',
        return_attention_mask=False,
        max_length=max_reranker_length - len(prefix_tokens) - len(suffix_tokens)
    )
    for i, ele in enumerate(inputs['input_ids']):
        inputs['input_ids'][i] = prefix_tokens + ele + suffix_tokens
    inputs = reranker_tokenizer.pad(inputs, padding=True, return_tensors="pt", max_length=max_reranker_length)
    for key in inputs:
        inputs[key] = inputs[key].to(reranker_model.device)
    return inputs

@torch.no_grad()
def compute_logits(inputs, **kwargs):
    # Relevance score = P("yes") from the softmax over the yes/no logits
    batch_scores = reranker_model(**inputs).logits[:, -1, :]
    true_vector = batch_scores[:, token_true_id]
    false_vector = batch_scores[:, token_false_id]
    batch_scores = torch.stack([false_vector, true_vector], dim=1)
    batch_scores = torch.nn.functional.log_softmax(batch_scores, dim=1)
    scores = batch_scores[:, 1].exp().tolist()
    return scores

def rerank_documents(query, documents, task_instruction=None):
    if task_instruction is None:
        task_instruction = 'Given a web search query, retrieve relevant passages that answer the query'
    pairs = [format_instruction(task_instruction, query, doc) for doc in documents]
    inputs = process_inputs(pairs)
    scores = compute_logits(inputs)
    doc_scores = list(zip(documents, scores))
    doc_scores.sort(key=lambda x: x[1], reverse=True)
    return doc_scores
</code>
Milvus Collection Setup
<code>from pymilvus import MilvusClient

milvus_client = MilvusClient(uri="./milvus_demo.db")
collection_name = "my_rag_collection"

# Drop any stale collection so the tutorial is safely re-runnable
if milvus_client.has_collection(collection_name):
    milvus_client.drop_collection(collection_name)

# Probe the embedding dimension with a test sentence
test_embedding = emb_text("This is a test")
embedding_dim = len(test_embedding)

milvus_client.create_collection(
    collection_name=collection_name,
    dimension=embedding_dim,
    metric_type="IP",  # inner product similarity
    consistency_level="Strong",
)
</code>
Insert Data
<code>from tqdm import tqdm

data = []
for i, line in enumerate(tqdm(text_lines, desc="Creating embeddings")):
    data.append({"id": i, "vector": emb_text(line), "text": line})

milvus_client.insert(collection_name=collection_name, data=data)
</code>
Retrieval and Reranking
<code>question = "How is data stored in milvus?"

# Stage 1: vector search for the top 10 candidates
search_res = milvus_client.search(
    collection_name=collection_name,
    data=[emb_text(question, is_query=True)],
    limit=10,
    search_params={"metric_type": "IP", "params": {}},
    output_fields=["text"],
)
candidate_docs = [res["entity"]["text"] for res in search_res[0]]

# Stage 2: rerank the candidates and keep the top 3
print("Reranking documents...")
reranked_docs = rerank_documents(question, candidate_docs)
top_reranked_docs = reranked_docs[:3]
print(f"Selected top {len(top_reranked_docs)} documents after reranking")
</code>
Generate RAG Response with OpenAI GPT‑4o
<code>context = "\n".join([doc for doc, _ in top_reranked_docs])

SYSTEM_PROMPT = """Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided."""

USER_PROMPT = f"""Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>{context}</context>
<question>{question}</question>"""

response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT},
    ],
)
print(response.choices[0].message.content)
</code>
Conclusion
The Qwen3‑Embedding and Qwen3‑Reranker models deliver strong multilingual performance while remaining lightweight enough for local deployment. Their combination enables an efficient two‑stage retrieval‑augmented generation pipeline that balances speed, accuracy, and cost, making it suitable for small‑to‑medium enterprises and individual developers.
Instant Consumer Technology Team