
How to Build a Retrieval‑Augmented Generation QA Assistant for an Open Platform

This article details a step‑by‑step design of a RAG‑based intelligent Q&A assistant for the DeWu Open Platform, covering background, RAG fundamentals, system architecture, technology selection, prompt engineering with CO‑STAR, data preprocessing, vector store setup, LangChain.js implementation, similarity search, runnable chaining, debugging, and future prospects.

Architect

Jan 27, 2025

Background

The DeWu Open Platform provides API documentation, solution guides, and permission packages for merchants, ISVs, and internal applications. Its built‑in search only matches API paths and names, so users have to hop between pages to find answers, which frustrates them and increases the load on human support.

Retrieval‑Augmented Generation (RAG)

RAG improves large language models (LLMs) by retrieving relevant external knowledge before generation. The model answers as if it were sitting an “open‑book exam”: it consults the retrieved material instead of relying on memory alone, which reduces hallucinations and makes answers more accurate and practical.

Core Components

External Knowledge Base – the source documents; quality directly impacts answer accuracy.

Embedding Model – converts documents and queries into dense vector embeddings.

Vector Database – stores embeddings for fast similarity search (e.g., Milvus).

Retriever – returns the top‑K most similar document chunks for a query.

Generator (LLM) – combines the query with retrieved context to produce a response.

Prompt Engineering – structures the input to the LLM (CO‑STAR template used).
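
To make the retriever's role concrete, here is a minimal sketch of top‑K selection by cosine similarity over an in‑memory array. The function names (`cosineSimilarity`, `topK`), the chunk ids, and the toy three‑dimensional vectors are illustrative assumptions; in the system described here, Milvus runs the similarity search over OpenAI embeddings, and its configured metric may differ.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i += 1) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid division by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k chunks whose embeddings are most similar to the query embedding.
function topK(queryEmbedding, chunks, k) {
  return chunks
    .map((chunk) => ({ ...chunk, score: cosineSimilarity(queryEmbedding, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy data: hypothetical 3-dimensional embeddings, not real model output.
const chunks = [
  { id: 'order-api', embedding: [1, 0, 0] },
  { id: 'refund-api', embedding: [0, 1, 0] },
  { id: 'order-errors', embedding: [0.9, 0.1, 0] },
];
const results = topK([1, 0, 0], chunks, 2);
console.log(results.map((r) => r.id)); // [ 'order-api', 'order-errors' ]
```

A dedicated vector database replaces the linear scan above with an index, which is what keeps search fast as the knowledge base grows.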

RAG Workflow

Encode the user query into an embedding.

Perform semantic similarity search over the vector store.

Pass the retrieved chunks and the original query to the LLM.

Parse the LLM output (JSON) to extract the final answer and any reference URLs.
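
The last step can be sketched as a defensive parser. The article does not show the output schema, so the field names `answer` and `references` below are assumptions chosen for illustration, as is the decision to return an empty result instead of throwing on malformed JSON.

```javascript
// Parse the model's raw JSON text and pull out the answer plus reference URLs.
// The field names `answer` and `references` are assumed for illustration.
function parseAssistantOutput(rawText) {
  const empty = { answer: null, references: [] };
  let parsed;
  try {
    parsed = JSON.parse(rawText);
  } catch (err) {
    return empty; // malformed JSON: fall back to an empty result
  }
  if (parsed === null || typeof parsed !== 'object') return empty;

  const answer = typeof parsed.answer === 'string' ? parsed.answer : null;
  // Keep only string entries that look like http(s) URLs.
  const references = Array.isArray(parsed.references)
    ? parsed.references.filter((url) => typeof url === 'string' && /^https?:\/\//.test(url))
    : [];
  return { answer, references };
}

// Hypothetical model output; the URL is a placeholder.
const parsedOk = parseAssistantOutput(
  '{"answer":"Use the order query API.","references":["https://example.com/docs/order", 42]}'
);
const parsedBad = parseAssistantOutput('not json');
```

Filtering the reference list guards against the model emitting non‑URL values in that field.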

Technology Selection

LLM: GPT‑4o‑mini (OpenAI) – https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/

Embedding Model: OpenAI Embeddings – https://platform.openai.com/docs/guides/embeddings

Vector DB: Milvus – https://milvus.io/

Framework: LangChain.js (Runnable design) – https://js.langchain.com/v0.2/docs/introduction/

Accuracy Considerations

Two safeguards are applied before the LLM sees a query:

Filter out non‑platform questions using a JSON‑Schema‑driven classifier.

Detect and mitigate contexts that could cause the model to generate unrelated answers.

User Query Structuring

Queries are first classified into three categories defined by a JSON schema: api_call, general, and unknown. The classifier is driven by a prompt such as:

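# CONTEXT
The DeWu Open Platform is a platform containing API documentation and solution documents that helps merchants use DeWu services.
# OBJECTIVE
Classify each customer question into one of the fixed categories; accept only questions related to Open Platform APIs.
# RESPONSE
Return an object that conforms to the following JSON Schema.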
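
The schema that accompanies this prompt is not reproduced in the article. As a rough sketch, assuming a single `category` field, the classifier output could be checked as follows. Only the three category names come from the article; the schema shape and the `normalizeClassification` helper are hypothetical.

```javascript
// Hypothetical JSON Schema for the classifier output. Only the three
// category names come from the article; the rest is assumed.
const classificationSchema = {
  type: 'object',
  properties: {
    category: { type: 'string', enum: ['api_call', 'general', 'unknown'] },
  },
  required: ['category'],
  additionalProperties: false,
};

// Minimal hand-rolled check against the schema above. A real project would
// more likely rely on a schema validation library.
function normalizeClassification(raw) {
  const allowed = classificationSchema.properties.category.enum;
  if (raw && typeof raw === 'object' && allowed.includes(raw.category)) {
    return { category: raw.category };
  }
  // Anything unexpected is downgraded to `unknown` so it can be filtered out.
  return { category: 'unknown' };
}

const valid = normalizeClassification({ category: 'api_call' });
const invalid = normalizeClassification({ category: 'weather' });
```

Downgrading unexpected output to `unknown` means a misbehaving classifier fails closed instead of letting off‑topic questions through.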

CO‑STAR Prompt Structure

The CO‑STAR framework (Context, Objective, Style, Tone, Audience, Response) is used to craft precise prompts. Example:

# CONTEXT
...platform description...
# OBJECTIVE
Answer the user’s question using the provided context.
# STYLE
Concise and clear.
# TONE
Professional and helpful.
# AUDIENCE
DeWu platform developers.
# RESPONSE
Return a JSON object matching the defined schema.
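
One way to keep CO‑STAR prompts consistent is to assemble them from named sections. The helper below is a sketch of that idea; `buildCoStarPrompt` and `COSTAR_SECTIONS` are hypothetical names, not part of LangChain.js or the original project.

```javascript
// Section order follows the CO-STAR acronym.
const COSTAR_SECTIONS = ['context', 'objective', 'style', 'tone', 'audience', 'response'];

// Assemble a prompt with one "# SECTION" header per non-empty section.
function buildCoStarPrompt(sections) {
  return COSTAR_SECTIONS
    .filter((name) => typeof sections[name] === 'string' && sections[name].trim() !== '')
    .map((name) => `# ${name.toUpperCase()}\n${sections[name].trim()}`)
    .join('\n');
}

const prompt = buildCoStarPrompt({
  context: 'The DeWu Open Platform hosts API documentation and solution guides.',
  objective: 'Answer the user\'s question using the provided context.',
  style: 'Concise and clear.',
  tone: 'Professional and helpful.',
  audience: 'DeWu platform developers.',
  response: 'Return a JSON object matching the defined schema.',
});
console.log(prompt);
```

Fixing the section order in one place makes it harder for individual prompts to drift from the template.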

Data Pre‑processing & Vector Store Preparation

Steps:

Select high‑quality API documentation as the knowledge base.

Clean and normalize the raw HTML/Markdown files.

Chunk the documents. For API docs a structure‑aware split (by headings and tables) is preferred; each chunk is ~128 bytes.

Store metadata (serviceId, apiId, apiName, apiUrl, pageUrl) in front‑matter for later link extraction.
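
The chunking and front‑matter steps can be sketched as below. This simplified version splits on Markdown headings only (table‑aware splitting is omitted), and the sample document, metadata values, and function names are all hypothetical.

```javascript
// Split a Markdown document into one chunk per heading section.
// Lines that appear before the first heading are ignored in this sketch.
function splitByHeadings(markdown) {
  const sections = [];
  let current = null;
  for (const line of markdown.split('\n')) {
    const match = /^(#{1,6})\s+(.*)$/.exec(line);
    if (match) {
      if (current) sections.push(current);
      current = { title: match[2].trim(), lines: [] };
    } else if (current) {
      current.lines.push(line);
    }
  }
  if (current) sections.push(current);
  return sections.map((s) => ({ title: s.title, body: s.lines.join('\n').trim() }));
}

// Prefix a chunk with front matter so every chunk carries its own metadata.
function withFrontMatter(chunk, meta) {
  const header = Object.entries(meta).map(([key, value]) => `${key}: ${value}`).join('\n');
  return `---\n${header}\n---\n# ${chunk.title}\n${chunk.body}`;
}

// Hypothetical API page content and metadata.
const sampleDoc = [
  '# Request parameters',
  'order_no: string, required',
  '',
  '# Error codes',
  '40001: invalid signature',
].join('\n');
const docChunks = splitByHeadings(sampleDoc);
const firstChunkText = withFrontMatter(docChunks[0], {
  apiName: 'Example API',
  pageUrl: 'https://example.com/docs/example-api',
});
```

Repeating the metadata in every chunk costs a few extra characters but lets any retrieved chunk be traced back to its page.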

Embedding Generation (JavaScript)

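import { Document } from '@langchain/core/documents';
import { flattenDeep } from 'lodash'; // assumed source of flattenDeep

// Handlebars template: the front matter carries the API metadata,
// the body holds one section of the API page.
const hbsTemplate = `---
Service ID (serviceId): {{ service.id }}
API ID (apiId): {{ apiId }}
API name (apiName): {{ apiName }}
API URL (apiUrl): {{ apiUrl }}
Page URL (pageUrl): {{ pageUrl }}
---
# {{ title }}
{{ paragraph }}`;

// `baseTemplate` and the section helpers (`requestHeader`, `requestUrl`, ...)
// are project functions not shown in the article; `baseTemplate` presumably
// compiles `hbsTemplate` with the shared metadata.
export const processIntoEmbeddings = (data) => {
  const template = baseTemplate(data);
  // One rendered text (or nested list of texts) per section of the API page.
  const texts = [
    template(requestHeader(data)),
    template(requestUrl(data)),
    template(publicRequestParam(data)),
    template(requestParam(data)),
    template(responseParam(data)),
    template(errorCodes(data)),
    template(authPackage(data)),
  ];
  // Flatten first, then drop empty entries, so nested empties are removed too.
  return flattenDeep(texts).filter(Boolean).map((content) => new Document({
    metadata: {
      serviceId: data.service.id,
      apiId: data.apiId,
      apiName: data.apiName,
      apiUrl: data.apiUrl,
      pageUrl: data.pageUrl,
    },
    pageContent: content,
  }));
};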

Similarity Search (Milvus)

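import { OpenAIEmbeddings } from '@langchain/openai';
import { Milvus } from '@langchain/community/vectorstores/milvus';
import { RunnableSequence } from '@langchain/core/runnables';

// Builds a runnable that maps { question } to the most similar document chunks.
// `getLLMConfig` is a project helper not shown in the article.
export const $getContext = async () => {
  const embeddings = new OpenAIEmbeddings(getLLMConfig().OpenAIEmbeddingsConfig);
  const vectorStore = await Milvus.fromExistingCollection(embeddings, { collectionName: 'open_rag' });
  return RunnableSequence.from([
    (input) => input.question, // the retriever expects the raw query string
    vectorStore.asRetriever(5), // top‑5 similar chunks
  ]);
};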
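
The retriever returns a list of documents, while the prompt ultimately needs text. A sketch of one way to bridge the two is shown below, assuming each document has the `{ pageContent, metadata }` shape produced during embedding generation. The helper names and sample data are hypothetical.

```javascript
// Turn retrieved documents into a single numbered context string for the prompt.
function formatContext(docs) {
  return docs.map((doc, i) => `[${i + 1}] ${doc.pageContent}`).join('\n\n');
}

// Collect unique page URLs so the final answer can link to its sources.
function collectPageUrls(docs) {
  const urls = docs.map((doc) => doc.metadata && doc.metadata.pageUrl).filter(Boolean);
  return [...new Set(urls)];
}

// Hypothetical retrieval result; URLs are placeholders.
const retrieved = [
  { pageContent: 'Request parameters ...', metadata: { pageUrl: 'https://example.com/docs/a' } },
  { pageContent: 'Error codes ...', metadata: { pageUrl: 'https://example.com/docs/a' } },
  { pageContent: 'Auth package ...', metadata: { pageUrl: 'https://example.com/docs/b' } },
];
const contextText = formatContext(retrieved);
const pageUrls = collectPageUrls(retrieved);
```

Several chunks often come from the same API page, so deduplicating URLs avoids listing one reference multiple times.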

Answer Generation Runnable

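import { ChatOpenAI } from '@langchain/openai';
import { RunnableMap, RunnableSequence } from '@langchain/core/runnables';
import { JsonOutputParser } from '@langchain/core/output_parsers';

// `getLLMConfig`, `$getPrompt`, `getStringifiedJsonSchema`, and `zOutputSchema`
// are project helpers not shown in the article.
export const $getOpenRag = async () => {
  // Ask the model for JSON output so it can be parsed reliably.
  const $model = new ChatOpenAI(getLLMConfig().ChatOpenAIConfig).bind({
    response_format: { type: 'json_object' },
  });
  const chain = RunnableSequence.from([
    // Build every variable the prompt template needs.
    RunnableMap.from({
      context: await $getContext(),
      structuredInput: (i) => i.structuredInput,
      question: (i) => i.question,
      // Typo fixed (was `strcuturedOutputSchema`); the key must match the
      // placeholder name used in the prompt template.
      structuredOutputSchema: () => getStringifiedJsonSchema(zOutputSchema),
    }),
    $getPrompt().bind({}),
    $model,
    new JsonOutputParser(),
  ]);
  return chain;
};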

Full Pipeline

The end‑to‑end chain combines query structuring, context retrieval, prompt formatting, LLM inference, and JSON parsing:

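import { RunnablePassthrough, RunnableSequence } from '@langchain/core/runnables';

// `structure` is assumed to hold the classifier result from the query
// structuring step; `question` is the user's raw question.
const mainChain = RunnableSequence.from([
  RunnablePassthrough.assign({
    structuredInput: () => structure,
  }),
  await $getOpenRag(),
]);
const response = await mainChain.invoke({ question });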
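
Before invoking the chain, the classification result can act as a gate so that filtered questions never reach the LLM. The sketch below is one possible shape for that gate. Treating `general` as answerable is an assumption, since the article only states that non‑platform questions are filtered out, and the fallback message is hypothetical.

```javascript
// Categories allowed through to the RAG chain. Treating `general` as
// answerable is an assumption, not something the article specifies.
const ANSWERABLE_CATEGORIES = new Set(['api_call', 'general']);

// Hypothetical fallback text shown when a question is filtered out.
const FALLBACK_MESSAGE = 'Sorry, I can only answer questions about the Open Platform.';

// Decide whether a classified question should be sent to the RAG chain.
function routeQuestion(classification) {
  const category = classification && classification.category;
  if (ANSWERABLE_CATEGORIES.has(category)) {
    return { route: 'rag' };
  }
  return { route: 'fallback', message: FALLBACK_MESSAGE };
}

const apiRoute = routeQuestion({ category: 'api_call' });
const unknownRoute = routeQuestion({ category: 'unknown' });
```

Only when the route is `rag` would the caller go on to invoke the main chain; otherwise it can return the fallback message directly and skip retrieval and inference.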

Future Outlook

RAG reduces hallucinations and provides up‑to‑date answers without retraining the LLM. Deploying the assistant internally can further lower support costs and generate usage analytics for product improvement.
