How to Build a Multimodal Embedding RAG with Cohere and LlamaIndex
This guide explains how to overcome the limitations of text‑only embeddings for enterprise AI search by using a multimodal embedding model to index and retrieve both text and images, and walks through the full workflow, code examples, and performance benefits.
In enterprise AI applications, large volumes of data exist in complex multimodal forms. This makes building an enterprise‑grade AI search or Retrieval‑Augmented Generation (RAG) system challenging, especially when it comes to indexing and retrieving image data.
Limitations of Text‑Only Embeddings
Traditional RAG pipelines often rely on text embeddings alone, first converting images to text with a vision‑language model (VLM). This approach adds a cumbersome indexing step, raises computational cost, loses visual detail that the caption fails to capture, and rules out mixed‑modal retrieval such as image‑to‑image search.
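For contrast, here is a minimal sketch of that caption‑then‑embed pipeline, assuming an OpenAI‑compatible VLM and the Cohere Python SDK; the model names, file path, and function name are illustrative only:

# Caption-then-embed: the indirect pipeline described above (illustrative sketch).
import base64
import os
import cohere
from openai import OpenAI

def caption_then_embed(image_path: str):
    with open(image_path, "rb") as f:
        data_uri = "data:image/png;base64," + base64.b64encode(f.read()).decode()

    # Step 1: ask a VLM to describe the image, which adds latency and cost per image.
    vlm = OpenAI()
    caption = vlm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "Describe this image for search indexing."},
            {"type": "image_url", "image_url": {"url": data_uri}},
        ]}],
    ).choices[0].message.content

    # Step 2: embed the caption as plain text; any visual detail missing from the
    # caption is lost, and image-to-image search is impossible.
    co = cohere.Client(api_key=os.environ["COHERE_API_KEY"])
    return co.embed(texts=[caption], model="embed-multilingual-v3.0",
                    input_type="search_document").embeddings[0]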
Multimodal Embedding Models
A multimodal embedding model (e.g., Cohere’s embed‑multilingual‑v3.0) can instead generate embeddings for both text and images directly, storing them in a single shared vector space. This simplifies integration, reduces cost, and improves retrieval accuracy across modalities.
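As a quick illustration, the sketch below embeds a text query and an image with the same LlamaIndex wrapper the walkthrough configures later. It assumes the installed CohereEmbedding implements LlamaIndex’s multimodal embedding interface (the indexing step relies on this when it passes the model as image_embed_model); the query text and file name are illustrative.

# Direct multimodal embedding: text and images land in one shared vector space.
import os
from llama_index.embeddings.cohere import CohereEmbedding

embed_model = CohereEmbedding(
    api_key=os.environ["COHERE_API_KEY"],
    model_name="embed-multilingual-v3.0",
)

# Text and image vectors share the same dimensionality, so they can live in a
# single Qdrant collection and be compared against each other at query time.
text_vec = embed_model.get_text_embedding("a phone with a foldable screen")
image_vec = embed_model.get_image_embedding("foldable_phone.png")  # illustrative file
print(len(text_vec), len(image_vec))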
Step‑by‑Step Multimodal RAG Example
The example demonstrates building a multimodal RAG pipeline with the following components (an environment‑setup sketch follows the list):
Document parsing and image extraction using LlamaParse.
Embedding generation with Cohere’s multimodal model.
Vector storage in Qdrant.
Index creation and retrieval via LlamaIndex.
Final query answering using a multimodal LLM (e.g., gpt‑4o‑mini).
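Before the walkthrough, a brief environment‑setup sketch. The package names below match the imports used in the code; the environment‑variable names for the API keys are assumptions, so adapt them to your own setup.

# Install the packages used throughout the walkthrough (pin versions as needed):
# pip install llama-index llama-parse llama-index-embeddings-cohere \
#     llama-index-vector-stores-qdrant llama-index-multi-modal-llms-openai qdrant-client
import os

# API keys referenced by the snippets below; the variable names are assumptions.
LLAMA_CLOUD_API_KEY = os.environ["LLAMA_CLOUD_API_KEY"]  # LlamaParse (LlamaCloud)
COHERE_API_KEY = os.environ["COHERE_API_KEY"]            # Cohere embeddings
# OPENAI_API_KEY must also be set for the gpt-4o-mini generation step at the end.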
1. Parse Documents and Extract Media
import os
import qdrant_client
from llama_parse import LlamaParse
from llama_index.core import SimpleDirectoryReader, Settings, StorageContext
from llama_index.core.schema import TextNode
from llama_index.embeddings.cohere import CohereEmbedding
from llama_index.vector_stores.qdrant import QdrantVectorStore
FILE_NAME = "xiaomi_products.docx"
IMAGES_DOWNLOAD_PATH = "parsed_data"
def parse_doc():
    # Parse the source document with LlamaParse, keeping page structure as markdown.
    parser = LlamaParse(api_key=LLAMA_CLOUD_API_KEY, language="ch_sim", result_type="markdown")
    json_objs = parser.get_json_result(FILE_NAME)
    json_list = json_objs[0]["pages"]
    text_nodes = [TextNode(text=page["text"], metadata={"page": page["page"]}) for page in json_list]

    # Write all extracted text to a single file for later indexing.
    texts = [node.text for node in text_nodes]
    all_text = "\n".join(texts)
    os.makedirs(IMAGES_DOWNLOAD_PATH, exist_ok=True)
    with open(f"{IMAGES_DOWNLOAD_PATH}/extracted_texts.txt", "w", encoding="utf-8") as file:
        file.write(all_text)

    # Download every image embedded in the document alongside the text.
    parser.get_images(json_objs, download_path=IMAGES_DOWNLOAD_PATH)

parse_doc()
# Use Cohere's multimodal model as the global embedding model for text and images.
Settings.embed_model = CohereEmbedding(api_key=COHERE_API_KEY, model_name="embed-multilingual-v3.0")
# Load the extracted text file and downloaded images as LlamaIndex documents.
documents = SimpleDirectoryReader("parsed_data/", required_exts=[".jpg", ".png", ".txt"], exclude_hidden=False).load_data()
# Use one local Qdrant collection as both the text store and the image store.
client = qdrant_client.QdrantClient(path="furniture_db")
vector_store = QdrantVectorStore(client=client, collection_name="mycollection")
storage_context = StorageContext.from_defaults(vector_store=vector_store, image_store=vector_store)
2. Build the Multimodal Index
from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.core.node_parser import TokenTextSplitter

# Build a single index over both the text documents and the image documents.
index = MultiModalVectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    transformations=[TokenTextSplitter(separator="------------------------", chunk_size=300)],
    image_embed_model=Settings.embed_model,
)
print("Index creation complete.")
3. Test Multimodal Retrieval
from llama_index.core.schema import ImageNode
from llama_index.core.response.notebook_utils import display_source_node

retriever_engine = index.as_retriever(similarity_top_k=3, image_similarity_top_k=2)
query = "一款可折叠屏幕的手机"  # "A phone with a foldable screen"
retrieval_results = retriever_engine.retrieve(query)

# Separate image nodes from text nodes and display each kind appropriately.
retrieved_image = []
for res_node in retrieval_results:
    if isinstance(res_node.node, ImageNode):
        retrieved_image.append(res_node.node.metadata["file_path"])
    else:
        display_source_node(res_node, source_length=500)
display_images(retrieved_image)
The above code retrieves the images relevant to a textual query and displays them, confirming that the multimodal index supports text‑to‑image search.
4. Image‑to‑Image Retrieval
query_image = "router.png"
retrieval_results = retriever_engine.image_to_image_retrieve(query_image)
# Display the retrieved images the same way as in the text-to-image example.
display_images([res.node.metadata["file_path"] for res in retrieval_results if isinstance(res.node, ImageNode)])
5. RAG Generation with Multimodal LLM
from llama_index.core.prompts import PromptTemplate
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
# QA prompt for the generation step (in Chinese, matching the source document).
# Rough translation: "Here is the context information: {context_str}. Answer the
# question using only the provided context and input images (no prior knowledge).
# Answer in the format 'Result: [answer based on the context]'. My question: {query_str}"
qa_tmpl_str = (
    "以下是上下文信息:\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "请仅基于提供的上下文和输入的图片(不要使用先验知识)回答问题。\n"
    "请按以下格式回答:\n"
    "Result: [基于上下文的回答]\n"
    "---------------------\n"
    "我的问题: {query_str}\n"
)
qa_tmpl = PromptTemplate(qa_tmpl_str)
multimodal_llm = OpenAIMultiModal(model="gpt-4o-mini", temperature=0.0, max_tokens=1024)
query_engine = index.as_query_engine(
    llm=multimodal_llm,
    text_qa_template=qa_tmpl,
    similarity_top_k=5,
    image_similarity_top_k=2,
)
result = query_engine.query("介绍小米的高性能游戏笔记本")  # "Introduce Xiaomi's high-performance gaming laptops"
print(result)
display_images([result.metadata["image_nodes"][0].metadata["file_path"]])
The final query engine feeds both the retrieved text and the retrieved images to a multimodal LLM, producing answers that incorporate visual context.
Key Benefits
Unified vector space for multiple modalities simplifies integration and reduces storage overhead.
Supports mixed‑modal retrieval: text‑to‑text, text‑to‑image, and image‑to‑image.
Higher performance and lower cost compared to separate VLM‑based pipelines.
Improves relevance of retrieved context, leading to better RAG responses.
Conclusion
By pairing a mature multimodal embedding model such as Cohere’s embed‑multilingual‑v3.0 with LlamaIndex and Qdrant, enterprises can efficiently index and retrieve both textual and visual assets in a single vector store. This dramatically simplifies the development of AI search and RAG applications while extending them to mixed‑modal scenarios.
AI Large Model Application Practice
Focused on deep research and development of large-model applications. Authors of "RAG Application Development and Optimization Based on Large Models" and "MCP Principles Unveiled and Development Guide". Primarily B2B, with B2C as a supplement.