Implementing Image Analysis and Audio Transcription in a Multimodal RAG System with LangChain 1.0

This tutorial extends a LangChain 1.0 multimodal RAG project by adding end‑to‑end image analysis and audio transcription features using Qwen3‑Omni, detailing data structures, utility classes, API changes, and Postman testing procedures.


Overview

This article builds on the previous LangChain 1.0 multimodal RAG tutorial, adding image analysis and audio transcription capabilities. By leveraging the multimodal strengths of Qwen3‑Omni, the system can now understand and respond to both visual and auditory inputs in a unified RAG pipeline.

Traditional Image Analysis Pipeline

The author first outlines the conventional multi‑stage image processing workflow, which includes:

Image preprocessing: resizing, color correction, and noise filtering using OpenCV or PIL.

Text extraction: OCR models such as MonkeyOCR or DeepSeek‑OCR extract the textual content.

Semantic understanding: the image is fed to a multimodal large model (e.g., the Qwen‑VL series) for scene and object analysis.

While technically mature, this approach suffers from pipeline complexity and error accumulation across stages, which prompted the shift to an end‑to‑end solution; a sketch of the staged flow appears below.
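
For concreteness, here is a minimal sketch of that staged flow. Only the OpenCV preprocessing calls are real; the OCR and scene-description helpers are hypothetical placeholders standing in for the models the article names (MonkeyOCR/DeepSeek‑OCR and Qwen‑VL):

import cv2

def run_ocr(img) -> str:
    # Placeholder for stage 2: an OCR model such as MonkeyOCR or DeepSeek-OCR
    raise NotImplementedError

def describe_scene(img, ocr_text: str) -> str:
    # Placeholder for stage 3: a multimodal LLM such as the Qwen-VL series
    raise NotImplementedError

def analyze_image_staged(path: str) -> dict:
    # Stage 1: preprocessing with OpenCV (resize + denoise)
    img = cv2.imread(path)
    img = cv2.resize(img, (1024, 1024))
    img = cv2.fastNlMeansDenoisingColored(img)
    # Stages 2 and 3: each hop can fail or degrade independently,
    # which is the error accumulation the article refers to
    text = run_ocr(img)
    description = describe_scene(img, text)
    return {"ocr_text": text, "description": description}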

End‑to‑End Image Analysis Implementation

The core data structures (ContentBlock and MessageRequest) remain unchanged, storing images as base64 strings within the content field.

from typing import Any, Dict, List, Optional

from pydantic import BaseModel, Field

class ContentBlock(BaseModel):
    type: str = Field(description="Content type: text, image, audio")
    content: Optional[str] = Field(default=None, description="Content data")

class MessageRequest(BaseModel):
    content_blocks: List[ContentBlock] = Field(default=[], description="Content blocks")
    history: List[Dict[str, Any]] = Field(default=[], description="Conversation history")
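
For illustration, a request carrying a text block plus an inline base64 image might be built like this (the truncated data URL is a placeholder, not real image data):

request = MessageRequest(
    content_blocks=[
        ContentBlock(type="text", content="What is in this picture?"),
        ContentBlock(type="image", content="data:image/png;base64,iVBORw0..."),
    ],
    history=[],
)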

A new ImageProcessor utility in utils.py handles MIME‑type detection and base64 encoding:

import base64
from fastapi import UploadFile, HTTPException

class ImageProcessor:
    """Image processing utility class"""

    @staticmethod
    def image_to_base64(image_file: UploadFile) -> str:
        try:
            contents = image_file.file.read()
            base64_encoded = base64.b64encode(contents).decode('utf-8')
            return base64_encoded
        except Exception as e:
            raise HTTPException(status_code=500, detail=f"Image encoding failed: {str(e)}")

    @staticmethod
    def get_image_mime_type(filename: str) -> str:
        extension = filename.split('.')[-1].lower()
        mime_types = {
            'jpg': 'image/jpeg',
            'jpeg': 'image/jpeg',
            'png': 'image/png',
            'gif': 'image/gif',
            'bmp': 'image/bmp',
            'webp': 'image/webp',
        }
        return mime_types.get(extension, 'image/jpeg')
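
A quick way to sanity‑check these helpers, using only the eight‑byte PNG signature as a stand‑in payload:

from io import BytesIO
from fastapi import UploadFile

fake_png = BytesIO(bytes.fromhex("89504e470d0a1a0a"))  # PNG signature bytes only
upload = UploadFile(file=fake_png, filename="logo.png")

mime = ImageProcessor.get_image_mime_type(upload.filename)  # 'image/png'
b64 = ImageProcessor.image_to_base64(upload)
data_url = f"data:{mime};base64,{b64}"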

The message‑building function is updated to embed the image as an image_url payload:

from langchain_core.messages import HumanMessage

def create_multimodal_message(request: MessageRequest, image_file: UploadFile) -> HumanMessage:
    """Create a multimodal message"""
    message_content = []
    if image_file:
        processor = ImageProcessor()
        mime_type = processor.get_image_mime_type(image_file.filename)
        base64_image = processor.image_to_base64(image_file)
        message_content.append({
            "type": "image_url",
            "image_url": {"url": f"data:{mime_type};base64,{base64_image}"}
        })
    for block in request.content_blocks:
        if block.type == "text":
            message_content.append({"type": "text", "text": block.content})
        elif block.type == "image" and block.content and block.content.startswith("data:image"):
            message_content.append({"type": "image_url", "image_url": {"url": block.content}})
    return HumanMessage(content=message_content)
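
Called with no upload and a single text block, the function produces the plain text shape (illustrative values):

req = MessageRequest(content_blocks=[ContentBlock(type="text", content="Describe the logo")])
msg = create_multimodal_message(req, image_file=None)
print(msg.content)  # [{'type': 'text', 'text': 'Describe the logo'}]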

The history conversion function now includes a system prompt that describes the assistant’s multimodal abilities and processes image blocks accordingly.

from langchain_core.messages import AIMessage, BaseMessage, HumanMessage, SystemMessage

def convert_history_to_messages(history: List[Dict[str, Any]]) -> List[BaseMessage]:
    """Convert conversation history to LangChain messages, supporting multimodal content"""
    messages = []
    system_prompt = """
    You are a multimodal RAG assistant capable of:
    1. Conversational interaction.
    2. Image recognition and analysis (OCR, object detection, scene understanding).
    Follow the guidelines to combine uploaded images with user queries.
    """
    messages.append(SystemMessage(content=system_prompt))
    for msg in history:
        if msg["role"] == "user":
            message_content = []
            for block in msg.get("content_blocks", []):
                if block.get("type") == "text":
                    message_content.append({"type": "text", "text": block.get("content", "")})
                elif block.get("type") == "image" and block.get("content", "").startswith("data:image"):
                    message_content.append({"type": "image_url", "image_url": {"url": block["content"]}})
            messages.append(HumanMessage(content=message_content))
        elif msg["role"] == "assistant":
            messages.append(AIMessage(content=msg.get("content", "")))
    return messages
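
An illustrative round trip: a two‑turn history becomes a system message plus one message per turn:

history = [
    {"role": "user", "content_blocks": [{"type": "text", "content": "Hi there"}]},
    {"role": "assistant", "content": "Hello! How can I help?"},
]
messages = convert_history_to_messages(history)
# -> [SystemMessage(...), HumanMessage(...), AIMessage(...)]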

The FastAPI endpoint is modified to accept multipart/form-data with an image file, parse JSON strings for content_blocks and history, and return a streaming response:

import json

from fastapi import FastAPI, File, Form, HTTPException, UploadFile
from fastapi.responses import StreamingResponse

# `app` is the FastAPI instance created earlier in the original project
@app.post("/api/chat/stream")
async def chat_stream(
        image_file: UploadFile = File(...),
        content_blocks: str = Form(default="[]"),
        history: str = Form(default="[]")):
    """Streaming chat endpoint supporting multimodal inputs"""
    try:
        content_blocks_data = json.loads(content_blocks)
        history_data = json.loads(history)
        request_data = MessageRequest(content_blocks=content_blocks_data, history=history_data)
        messages = convert_history_to_messages(request_data.history)
        current_message = create_multimodal_message(request_data, image_file)
        messages.append(current_message)
        return StreamingResponse(
            generate_streaming_response(messages),
            media_type="text/event-stream",
            headers={"Cache-Control": "no-cache", "Connection": "keep-alive", "Content-Type": "text/event-stream"}
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
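
The article does not show generate_streaming_response itself; here is a minimal sketch, assuming llm is a LangChain chat model bound to a Qwen3‑Omni‑compatible endpoint and that the client consumes server‑sent events:

# Assumed model handle; exact wiring depends on the original project, e.g.
# llm = ChatOpenAI(model="qwen3-omni", base_url=..., api_key=...)

async def generate_streaming_response(messages):
    # Stream model output chunk by chunk in SSE format
    async for chunk in llm.astream(messages):
        if chunk.content:
            yield f"data: {json.dumps({'content': chunk.content}, ensure_ascii=False)}\n\n"
    yield "data: [DONE]\n\n"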

(Figure: Postman request headers)

Testing the image analysis feature with Postman involves setting the Content-Type header to multipart/form-data, uploading a Gemini 3.0 logo image, and providing a JSON content_blocks payload such as [{"type": "text", "content": "请分析这张图片"}] ("please analyze this image"). The response confirms that the system correctly identifies the logo, describes its visual elements, and returns a full image analysis.
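
Outside Postman, the same request can be issued from Python; this sketch assumes the server runs on localhost:8000 and that the requests package is installed:

import json
import requests

resp = requests.post(
    "http://localhost:8000/api/chat/stream",
    files={"image_file": ("logo.png", open("logo.png", "rb"), "image/png")},
    data={
        "content_blocks": json.dumps([{"type": "text", "content": "请分析这张图片"}], ensure_ascii=False),
        "history": "[]",
    },
    stream=True,
)
for line in resp.iter_lines():
    if line:
        print(line.decode("utf-8"))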

Audio Transcription Implementation

The audio workflow mirrors the image pipeline: audio files are base64‑encoded and stored in the content field. A new AudioProcessor validates MIME types, enforces a 10 MB size limit, and performs base64 conversion.

import base64

from fastapi import UploadFile, HTTPException

class AudioProcessor:
    """Audio processing utility class"""

    @staticmethod
    def audio_to_base64(audio_file: UploadFile) -> str:
        try:
            if not AudioProcessor.is_valid_audio_type(audio_file.content_type, audio_file.filename):
                raise HTTPException(status_code=400, detail="Unsupported audio format. Supported: MP3, WAV, OGG, M4A, FLAC")
            contents = audio_file.file.read()
            max_size = 10 * 1024 * 1024  # 10 MB upload ceiling
            if len(contents) > max_size:
                raise HTTPException(status_code=400, detail=f"Audio file too large, max {max_size // 1024 // 1024} MB")
            return base64.b64encode(contents).decode('utf-8')
        except HTTPException:
            raise
        except Exception as e:
            raise HTTPException(status_code=500, detail=f"Audio encoding failed: {str(e)}")

    @staticmethod
    def get_audio_mime_type(filename: str) -> str:
        extension = filename.split('.')[-1].lower()
        # Cover every format named in the error message above
        mime_types = {
            'mp3': 'audio/mpeg',
            'wav': 'audio/wav',
            'ogg': 'audio/ogg',
            'm4a': 'audio/mp4',
            'flac': 'audio/flac',
        }
        return mime_types.get(extension, 'audio/mpeg')

    @staticmethod
    def is_valid_audio_type(content_type: str, filename: str) -> bool:
        supported_mimes = {'audio/mpeg', 'audio/wav', 'audio/ogg', 'audio/mp4', 'audio/flac', 'audio/x-flac'}
        if content_type and content_type in supported_mimes:
            return True
        extension = filename.split('.')[-1].lower()
        return extension in {'mp3', 'wav', 'ogg', 'm4a', 'flac'}
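
A small illustration of the validation path, using a deliberately unsupported extension:

from io import BytesIO
from fastapi import HTTPException, UploadFile

bad = UploadFile(file=BytesIO(b"\x00"), filename="note.aac")
try:
    AudioProcessor.audio_to_base64(bad)
except HTTPException as exc:
    print(exc.status_code, exc.detail)  # 400 Unsupported audio format. ...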

The create_multimodal_message function is extended to accept an optional audio_file and embed it as an audio_url payload:

def create_multimodal_message(request: MessageRequest, image_file: UploadFile | None, audio_file: UploadFile | None) -> HumanMessage:
    """Create a multimodal message supporting image and audio"""
    message_content = []
    if image_file:
        # image handling (same as before)
        ...
    if audio_file:
        processor = AudioProcessor()
        mime_type = processor.get_audio_mime_type(audio_file.filename)
        base64_audio = processor.audio_to_base64(audio_file)
        message_content.append({
            "type": "audio_url",
            "audio_url": {"url": f"data:{mime_type};base64,{base64_audio}"}
        })
    # handle text and existing image blocks
    ...
    return HumanMessage(content=message_content)

The history conversion function is also updated to recognize audio blocks and generate appropriate audio_url entries.

elif block.get("type") == "audio":
    audio_data = block.get("content", "")
    if audio_data.startswith("data:audio"):
        message_content.append({"type": "audio_url", "audio_url": {"url": audio_data}})
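
An assumed user turn in the history carrying a previously uploaded clip would then look like this (the data URL is truncated for display; the text block asks the model to transcribe the audio):

history_entry = {
    "role": "user",
    "content_blocks": [
        {"type": "audio", "content": "data:audio/wav;base64,UklGR..."},
        {"type": "text", "content": "请解析音频内容"},
    ],
}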

The FastAPI endpoint now accepts both image_file and audio_file parameters:

@app.post("/api/chat/stream")
async def chat_stream(
        image_file: UploadFile | None = File(None),
        content_blocks: str = Form(default="[]"),
        history: str = Form(default="[]"),
        audio_file: UploadFile | None = File(None)):
    # same processing flow, calling create_multimodal_message with both files
    ...

Testing the audio transcription feature via Postman involves uploading a short audio clip containing the phrase "你好" ("hello"), setting Content-Type: multipart/form-data, and providing content_blocks such as [{"type": "text", "content": "请解析音频内容"}] ("please transcribe the audio content"). The system returns three occurrences of "你好", confirming correct speech‑to‑text conversion.

(Figure: Audio transcription result)

Conclusion

By integrating Qwen3‑Omni’s full‑modality capabilities, the author demonstrates a concise, end‑to‑end solution for image analysis and audio transcription within a multimodal RAG system. This approach eliminates the complexity of traditional multi‑stage pipelines and showcases a practical path toward future multimodal AI applications.

Tags: Image Analysis, LangChain, FastAPI, Multimodal RAG, Base64, Audio Transcription, Qwen3-Omni
Written by Fun with Large Models

Master's graduate from Beijing Institute of Technology with four papers in top journals, formerly a developer at ByteDance and Alibaba, now researching large models at a major state‑owned enterprise. Committed to sharing concise, practical AI large‑model development experience, in the belief that large models will become as essential as the PC. Let's start experimenting now!