How to Build a Multimodal Embedding RAG with Cohere and LlamaIndex
This guide explains how to overcome the limitations of text‑only embeddings for enterprise AI search by using a multimodal embedding model to index and retrieve both text and images, and walks through the full workflow, code examples, and performance benefits.
In enterprise AI applications, large volumes of data exist in complex multimodal forms. This makes building an enterprise‑grade AI search or Retrieval‑Augmented Generation (RAG) system challenging, especially when it comes to indexing and retrieving image data.
Limitations of Text‑Only Embeddings
Traditional RAG pipelines often rely on text embeddings alone, first converting images to text with a vision‑language model (VLM). This approach adds a cumbersome indexing step, raises computational cost, loses visual detail that the caption fails to capture, and rules out mixed‑modal retrieval such as image‑to‑image search.
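For contrast, here is a minimal sketch of that caption‑then‑embed pipeline, assuming an OpenAI‑compatible VLM and the Cohere Python SDK; the model names, file path, and function name are illustrative only:

# Caption-then-embed: the indirect pipeline described above (illustrative sketch).
import base64
import os
import cohere
from openai import OpenAI

def caption_then_embed(image_path: str):
    with open(image_path, "rb") as f:
        data_uri = "data:image/png;base64," + base64.b64encode(f.read()).decode()

    # Step 1: ask a VLM to describe the image, which adds latency and cost per image.
    vlm = OpenAI()
    caption = vlm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "Describe this image for search indexing."},
            {"type": "image_url", "image_url": {"url": data_uri}},
        ]}],
    ).choices[0].message.content

    # Step 2: embed the caption as plain text; any visual detail missing from the
    # caption is lost, and image-to-image search is impossible.
    co = cohere.Client(api_key=os.environ["COHERE_API_KEY"])
    return co.embed(texts=[caption], model="embed-multilingual-v3.0",
                    input_type="search_document").embeddings[0]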
Multimodal Embedding Models
A multimodal embedding model (e.g., Cohere’s embed‑multilingual‑v3.0) can instead generate embeddings for both text and images directly, storing them in a single shared vector space. This simplifies integration, reduces cost, and improves retrieval accuracy across modalities.
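As a quick illustration, the sketch below embeds a text query and an image with the same LlamaIndex wrapper the walkthrough configures later. It assumes the installed CohereEmbedding implements LlamaIndex’s multimodal embedding interface (the indexing step relies on this when it passes the model as image_embed_model); the query text and file name are illustrative.

# Direct multimodal embedding: text and images land in one shared vector space.
import os
from llama_index.embeddings.cohere import CohereEmbedding

embed_model = CohereEmbedding(
    api_key=os.environ["COHERE_API_KEY"],
    model_name="embed-multilingual-v3.0",
)

# Text and image vectors share the same dimensionality, so they can live in a
# single Qdrant collection and be compared against each other at query time.
text_vec = embed_model.get_text_embedding("a phone with a foldable screen")
image_vec = embed_model.get_image_embedding("foldable_phone.png")  # illustrative file
print(len(text_vec), len(image_vec))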
Step‑by‑Step Multimodal RAG Example
The example demonstrates building a multimodal RAG pipeline with the following components (an environment‑setup sketch follows the list):
Document parsing and image extraction using LlamaParse.
Embedding generation with Cohere’s multimodal model.
Vector storage in Qdrant.
Index creation and retrieval via LlamaIndex.
Final query answering using a multimodal LLM (e.g., gpt‑4o‑mini).
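Before the walkthrough, a brief environment‑setup sketch. The package names below match the imports used in the code; the environment‑variable names for the API keys are assumptions, so adapt them to your own setup.

# Install the packages used throughout the walkthrough (pin versions as needed):
# pip install llama-index llama-parse llama-index-embeddings-cohere \
#     llama-index-vector-stores-qdrant llama-index-multi-modal-llms-openai qdrant-client
import os

# API keys referenced by the snippets below; the variable names are assumptions.
LLAMA_CLOUD_API_KEY = os.environ["LLAMA_CLOUD_API_KEY"]  # LlamaParse (LlamaCloud)
COHERE_API_KEY = os.environ["COHERE_API_KEY"]            # Cohere embeddings
# OPENAI_API_KEY must also be set for the gpt-4o-mini generation step at the end.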
1. Parse Documents and Extract Media
import os
import qdrant_client
from llama_parse import LlamaParse
from llama_index.core import SimpleDirectoryReader, Settings, StorageContext
from llama_index.core.schema import TextNode
from llama_index.embeddings.cohere import CohereEmbedding
from llama_index.vector_stores.qdrant import QdrantVectorStore
FILE_NAME = "xiaomi_products.docx"
IMAGES_DOWNLOAD_PATH = "parsed_data"
def parse_doc():
    # Parse the source document with LlamaParse, keeping page structure as markdown.
    parser = LlamaParse(api_key=LLAMA_CLOUD_API_KEY, language="ch_sim", result_type="markdown")
    json_objs = parser.get_json_result(FILE_NAME)
    json_list = json_objs[0]["pages"]
    text_nodes = [TextNode(text=page["text"], metadata={"page": page["page"]}) for page in json_list]

    # Write all extracted text to a single file for later indexing.
    texts = [node.text for node in text_nodes]
    all_text = "\n".join(texts)
    os.makedirs(IMAGES_DOWNLOAD_PATH, exist_ok=True)
    with open(f"{IMAGES_DOWNLOAD_PATH}/extracted_texts.txt", "w", encoding="utf-8") as file:
        file.write(all_text)

    # Download every image embedded in the document alongside the text.
    parser.get_images(json_objs, download_path=IMAGES_DOWNLOAD_PATH)

parse_doc()
# Use Cohere's multimodal model as the global embedding model for text and images.
Settings.embed_model = CohereEmbedding(api_key=COHERE_API_KEY, model_name="embed-multilingual-v3.0")
# Load the extracted text file and downloaded images as LlamaIndex documents.
documents = SimpleDirectoryReader("parsed_data/", required_exts=[".jpg", ".png", ".txt"], exclude_hidden=False).load_data()
# Use one local Qdrant collection as both the text store and the image store.
client = qdrant_client.QdrantClient(path="furniture_db")
vector_store = QdrantVectorStore(client=client, collection_name="mycollection")
storage_context = StorageContext.from_defaults(vector_store=vector_store, image_store=vector_store)
2. Build the Multimodal Index
from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.core.node_parser import TokenTextSplitter

# Build a single index over both the text documents and the image documents.
index = MultiModalVectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    transformations=[TokenTextSplitter(separator="------------------------", chunk_size=300)],
    image_embed_model=Settings.embed_model,
)
print("Index creation complete.")
3. Test Multimodal Retrieval
from llama_index.core.schema import ImageNode
from llama_index.core.response.notebook_utils import display_source_node

retriever_engine = index.as_retriever(similarity_top_k=3, image_similarity_top_k=2)
query = "一款可折叠屏幕的手机"  # "A phone with a foldable screen"
retrieval_results = retriever_engine.retrieve(query)

# Separate image nodes from text nodes and display each kind appropriately.
retrieved_image = []
for res_node in retrieval_results:
    if isinstance(res_node.node, ImageNode):
        retrieved_image.append(res_node.node.metadata["file_path"])
    else:
        display_source_node(res_node, source_length=500)
display_images(retrieved_image)
The above code retrieves the images relevant to a textual query and displays them, confirming that the multimodal index supports text‑to‑image search.
4. Image‑to‑Image Retrieval
query_image = "router.png"
retrieval_results = retriever_engine.image_to_image_retrieve(query_image)
# Display the retrieved images the same way as in the text-to-image example.
display_images([res.node.metadata["file_path"] for res in retrieval_results if isinstance(res.node, ImageNode)])
5. RAG Generation with Multimodal LLM
from llama_index.core.prompts import PromptTemplate
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
# QA prompt for the generation step (in Chinese, matching the source document).
# Rough translation: "Here is the context information: {context_str}. Answer the
# question using only the provided context and input images (no prior knowledge).
# Answer in the format 'Result: [answer based on the context]'. My question: {query_str}"
qa_tmpl_str = (
    "以下是上下文信息:\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "请仅基于提供的上下文和输入的图片(不要使用先验知识)回答问题。\n"
    "请按以下格式回答:\n"
    "Result: [基于上下文的回答]\n"
    "---------------------\n"
    "我的问题: {query_str}\n"
)
qa_tmpl = PromptTemplate(qa_tmpl_str)
multimodal_llm = OpenAIMultiModal(model="gpt-4o-mini", temperature=0.0, max_tokens=1024)
query_engine = index.as_query_engine(
    llm=multimodal_llm,
    text_qa_template=qa_tmpl,
    similarity_top_k=5,
    image_similarity_top_k=2,
)
result = query_engine.query("介绍小米的高性能游戏笔记本")  # "Introduce Xiaomi's high-performance gaming laptops"
print(result)
display_images([result.metadata["image_nodes"][0].metadata["file_path"]])
The final query engine feeds both the retrieved text and the retrieved images to a multimodal LLM, producing answers that incorporate visual context.
Key Benefits
Unified vector space for multiple modalities simplifies integration and reduces storage overhead.
Supports mixed‑modal retrieval: text‑to‑text, text‑to‑image, and image‑to‑image.
Higher performance and lower cost compared to separate VLM‑based pipelines.
Improves relevance of retrieved context, leading to better RAG responses.
Conclusion
By pairing a mature multimodal embedding model such as Cohere’s embed‑multilingual‑v3.0 with LlamaIndex and Qdrant, enterprises can efficiently index and retrieve both textual and visual assets in a single vector store. This dramatically simplifies the development of AI search and RAG applications while extending them to mixed‑modal scenarios.
AI Large Model Application Practice
Focused on deep research and development of large-model applications. Authors of "RAG Application Development and Optimization Based on Large Models" and "MCP Principles Unveiled and Development Guide". Primarily B2B, with B2C as a supplement.