How to Build a Rust-Powered Retrieval‑Augmented Generation (RAG) System from Scratch
This article explains how to construct a Retrieval‑Augmented Generation pipeline in Rust, covering knowledge‑base creation with Qdrant, model loading and embedding generation using candle, and integrating a Rust‑based inference service to answer queries with up‑to‑date external data.
Introduction
Retrieval‑Augmented Generation (RAG) enhances large language models (LLMs) by retrieving relevant information from external sources and combining it with model prompts, improving accuracy and keeping knowledge up‑to‑date without retraining.
This guide shows how to build a complete RAG system in pure Rust, combining LangChain-style components, Qdrant as the vector store, and the candle framework for model inference.
Knowledge Base Construction
The knowledge base pairs an embedding model with a vector database. Qdrant, a pure-Rust vector store, is chosen because it integrates cleanly with the rest of an all-Rust stack.
The most critical step is generating embeddings:
1. Load the model.
2. Tokenize the text.
3. Obtain embeddings from the model.
Model Loading
The following code loads a BERT model and its tokenizer:
use anyhow::{Error as E, Result};
use candle_core::{Device, Tensor};
use candle_nn::VarBuilder;
use candle_transformers::models::bert::{BertModel, Config, HiddenAct, DTYPE};
use hf_hub::{api::tokio::ApiBuilder, Repo, RepoType};
use tokenizers::Tokenizer;

async fn build_model_and_tokenizer(model_config: &ConfigModel) -> Result<(BertModel, Tokenizer)> {
    let device = Device::new_cuda(0)?;
    let repo = Repo::with_revision(
        model_config.model_id.clone(),
        RepoType::Model,
        model_config.revision.clone(),
    );
    // Fetch config, tokenizer, and weights from the HuggingFace Hub (or a mirror).
    let (config_filename, tokenizer_filename, weights_filename) = {
        let api = ApiBuilder::new().build()?;
        let api = api.repo(repo);
        let config = api.get("config.json").await?;
        let tokenizer = api.get("tokenizer.json").await?;
        let weights = if model_config.use_pth {
            api.get("pytorch_model.bin").await?
        } else {
            api.get("model.safetensors").await?
        };
        (config, tokenizer, weights)
    };
    let config = std::fs::read_to_string(config_filename)?;
    let mut config: Config = serde_json::from_str(&config)?;
    let tokenizer = Tokenizer::from_file(tokenizer_filename).map_err(E::msg)?;
    let vb = if model_config.use_pth {
        VarBuilder::from_pth(&weights_filename, DTYPE, &device)?
    } else {
        unsafe { VarBuilder::from_mmaped_safetensors(&[weights_filename], DTYPE, &device)? }
    };
    if model_config.approximate_gelu {
        config.hidden_act = HiddenAct::GeluApproximate;
    }
    let model = BertModel::load(vb, &config)?;
    Ok((model, tokenizer))
}

To avoid repeated loading, a static OnceCell holds the model and tokenizer:
use std::sync::Arc;
use tokio::sync::OnceCell;

pub static GLOBAL_EMBEDDING_MODEL: OnceCell<Arc<(BertModel, Tokenizer)>> = OnceCell::const_new();

pub async fn init_model_and_tokenizer() -> Arc<(BertModel, Tokenizer)> {
    let config = get_config().unwrap();
    let (m, t) = build_model_and_tokenizer(&config.model).await.unwrap();
    Arc::new((m, t))
}

During application startup the model is loaded once.
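The GLOBAL_RUNTIME referenced below is not defined in the article; a plausible definition, assuming once_cell (already used for the OpenAI client later), is:

use once_cell::sync::Lazy;
use tokio::runtime::Runtime;

// Hypothetical definition of the global Tokio runtime used at startup.
pub static GLOBAL_RUNTIME: Lazy<Runtime> = Lazy::new(|| Runtime::new().unwrap());

Startup then initializes the OnceCell on that runtime: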
GLOBAL_RUNTIME.block_on(async {
    log::info!("global runtime start!");
    GLOBAL_EMBEDDING_MODEL.get_or_init(init_model_and_tokenizer).await;
});

Embedding Generation
pub async fn embedding_sentence(content: &str) -> Result<Vec<Vec<f32>>> {
    let m_t = GLOBAL_EMBEDDING_MODEL.get().unwrap();
    let tokens = m_t.1.encode(content, true).map_err(E::msg)?.get_ids().to_vec();
    let token_ids = Tensor::new(&tokens[..], &m_t.0.device)?.unsqueeze(0)?;
    let token_type_ids = token_ids.zeros_like()?;
    let sequence_output = m_t.0.forward(&token_ids, &token_type_ids)?;
    // Mean-pool the per-token embeddings into one sentence vector.
    let (_n_sentence, n_tokens, _hidden_size) = sequence_output.dims3()?;
    let embeddings = (sequence_output.sum(1)? / (n_tokens as f64))?;
    let embeddings = normalize_l2(&embeddings)?;
    // The model runs in f32, and Qdrant expects f32 vectors.
    let encodings = embeddings.to_vec2()?;
    Ok(encodings)
}

pub fn normalize_l2(v: &Tensor) -> Result<Tensor> {
    // Divide each row by its L2 norm so dot-product search behaves like cosine similarity.
    Ok(v.broadcast_div(&v.sqr()?.sum_keepdim(1)?.sqrt()?)?)
}
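As a quick sanity check, the width of the returned vector must match the dimension configured for the Qdrant collection later on. A hypothetical smoke test, assuming the globals above have been initialized:

// Hypothetical smoke test: embed one sentence and inspect the vector width.
pub async fn embedding_smoke_test() -> Result<()> {
    let vectors = embedding_sentence("How is cloud service billing calculated?").await?;
    // This length must equal the size passed to VectorParamsBuilder in main (1024 below).
    println!("embedding dim = {}", vectors[0].len());
    Ok(())
}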
Data Ingestion
Documents are defined as JSON files:
{
    "content": "# Service Billing ...",
    "title": "Service Billing Explanation",
    "product": "CVM",
    "url": "https://docs.jdcloud.com/..."
}

The ingestion pipeline reads each file, deserializes it into a Doc, builds a payload, computes an embedding, and upserts points into Qdrant in batches of 100.
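The Doc type and the QdrantClient wrapper that the following code relies on are not shown in the original. Minimal definitions inferred from usage (the forwarding methods are assumptions) might look like:

use qdrant_client::qdrant::CreateCollection;
use qdrant_client::Qdrant;
use serde::Deserialize;

// Hypothetical document type matching the JSON layout above.
#[derive(Deserialize)]
pub struct Doc {
    pub content: String,
    pub title: String,
    pub product: String,
    pub url: String,
}

// Hypothetical thin wrapper around the Qdrant client; load_dir below is a method on it.
pub struct QdrantClient {
    pub client: Qdrant,
}

impl QdrantClient {
    // Forward collection management calls to the inner client, as main() expects.
    pub async fn collection_exists(&self, name: &str) -> anyhow::Result<bool> {
        Ok(self.client.collection_exists(name).await?)
    }
    pub async fn create_collection(&self, req: impl Into<CreateCollection>) -> anyhow::Result<()> {
        self.client.create_collection(req).await?;
        Ok(())
    }
}

With those definitions in place, the ingestion routine is: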
pub async fn load_dir(&self, path: &str, collection_name: &str) {
    let mut points = vec![];
    for entry in WalkDir::new(path).into_iter().filter_map(Result::ok) {
        if let Some(p) = entry.path().to_str() {
            let id = Uuid::new_v4();
            let content = match fs::read_to_string(p) {
                Ok(c) => c,
                Err(_) => continue,
            };
            let doc: Doc = from_str(&content).unwrap();
            // Embed the document body before it is moved into the payload.
            let vector_contents = embedding_sentence(&doc.content).await.unwrap();
            let mut payload = Payload::new();
            payload.insert("content", doc.content);
            payload.insert("title", doc.title);
            payload.insert("product", doc.product);
            payload.insert("url", doc.url);
            let ps = PointStruct::new(id.to_string(), vector_contents[0].clone(), payload);
            points.push(ps);
            // Flush to Qdrant every 100 points.
            if points.len() == 100 {
                self.client
                    .upsert_points(UpsertPointsBuilder::new(collection_name, points.clone()).wait(true))
                    .await
                    .unwrap();
                points.clear();
                println!("batch finish");
            }
        }
    }
    // Upsert any remaining points.
    if !points.is_empty() {
        self.client
            .upsert_points(UpsertPointsBuilder::new(collection_name, points).wait(true))
            .await
            .unwrap();
    }
}

Inference Service
The inference server runs mistral.rs with a Qwen model. Because direct HuggingFace access is blocked in China, the model is downloaded via a mirror:

HF_ENDPOINT="https://hf-mirror.com" huggingface-cli download --repo-type model --resume-download Qwen/Qwen2-7B --local-dir /root/Qwen2-7B

Start the server:
git clone https://github.com/EricLBuehler/mistral.rs
cd mistral.rs
cargo run --bin mistralrs-server --features cuda -- --port 3333 plain -m /root/Qwen2-7B -a qwen2

The server exposes an OpenAI-compatible API, which the Rust client calls:
pub static GLOBAL_OPENAI_CLIENT: Lazy<Arc<OpenAIClient>> = Lazy::new(|| {
    let mut client = OpenAIClient::new_with_endpoint("http://10.0.0.7:3333/v1".to_string(), "EMPTY".to_string());
    client.timeout = Some(30);
    Arc::new(client)
});

pub async fn inference(content: &str, max_len: i64) -> Result<String> {
    // The model name is left empty here; the endpoint serves a single model.
    let req = ChatCompletionRequest::new(
        "".to_string(),
        vec![ChatCompletionMessage {
            role: MessageRole::user,
            content: Content::Text(content.to_string()),
            ..Default::default()
        }],
    )
    .max_tokens(max_len);
    let cr = GLOBAL_OPENAI_CLIENT.chat_completion(req).await?;
    Ok(cr.choices[0].message.content.clone())
}

Answer Generation
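The retriever helper that answer calls is not shown in the original. A plausible sketch against Qdrant's search API, assuming the default_collection created in main and the embedding function from earlier:

use qdrant_client::qdrant::{SearchPointsBuilder, SearchResponse};
use qdrant_client::Qdrant;

// Hypothetical retriever: embed the question, then fetch the top-k hits with payloads.
pub async fn retriever(question: &str, limit: u64) -> Result<SearchResponse> {
    let vectors = embedding_sentence(question).await?;
    // For brevity a client is built per call; a real service would reuse one.
    let client = Qdrant::from_url("http://localhost:6334").build()?;
    let resp = client
        .search_points(
            SearchPointsBuilder::new("default_collection", vectors[0].clone(), limit)
                .with_payload(true),
        )
        .await?;
    Ok(resp)
}

With retrieval in place, answer stitches the hits into a prompt and forwards it to the inference service: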
pub async fn answer(question: &str, max_len: i64) -> Result<String> {
    // Retrieve the best-matching document for the question.
    let retrieved = retriever(question, 1).await?;
    let mut context = String::new();
    for sp in retrieved.result {
        let payload = sp.payload;
        context.push_str(&payload["product"].to_string());
        context.push_str(&payload["title"].to_string());
        context.push_str(&payload["content"].to_string());
        context.push('\n');
    }
    let prompt = format!(
        "You are a cloud-technology expert. Use the retrieved context to answer the question in Chinese.\nQuestion: {}\nContext: {}",
        question, context
    );
    let req = ChatCompletionRequest::new(
        "".to_string(),
        vec![ChatCompletionMessage {
            role: MessageRole::user,
            content: Content::Text(prompt),
            ..Default::default()
        }],
    )
    .max_tokens(max_len);
    let cr = GLOBAL_OPENAI_CLIENT.chat_completion(req).await?;
    Ok(cr.choices[0].message.content.clone())
}

Putting It All Together
#[tokio::main]
async fn main() {
    // Load the embedding model and tokenizer once, as defined earlier.
    GLOBAL_EMBEDDING_MODEL.get_or_init(init_model_and_tokenizer).await;
    let collection_name = "default_collection";
    let qdrant = Qdrant::from_url("http://localhost:6334").build().unwrap();
    let qdrant_client = QdrantClient { client: qdrant };
    if !qdrant_client.collection_exists(collection_name).await.unwrap() {
        // The vector size must match the embedding model's output dimension;
        // Dot distance on L2-normalized vectors is equivalent to cosine similarity.
        qdrant_client
            .create_collection(
                CreateCollectionBuilder::new(collection_name)
                    .vectors_config(VectorParamsBuilder::new(1024, Distance::Dot)),
            )
            .await
            .unwrap();
    }
    qdrant_client.load_dir("/root/jd_docs", collection_name).await;
    println!("{:?}", qdrant_client.client.health_check().await);
}
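With the collection populated, the pipeline can be exercised end to end, e.g. by appending a hypothetical call to answer at the end of main:

// Hypothetical end-to-end query once ingestion has finished.
let reply = answer("How is CVM service billing calculated?", 1024).await.unwrap();
println!("{}", reply);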
Observations & Tips
Embedding with the candle framework consumes slightly less GPU memory than comparable alternatives. For inference, the Qwen-1.5-1.8B-Chat model runs efficiently under vLLM, while the larger Qwen2-7B model can exceed available GPU memory under vLLM but runs well with mistral.rs. When operating behind the Great Firewall, set HF_ENDPOINT to a mirror such as https://hf-mirror.com.