How to Build Multimodal Prompts with LangChain: A Step‑by‑Step Guide

Learn how LangChain enables multimodal interactions: preparing inputs, constructing prompts, invoking models such as GPT-4o, and processing responses. A complete example demonstrates image question answering, with a code walkthrough, environment setup, and key considerations for API keys and image URLs.

BirdNest Tech Talk

LangChain provides the ability to interact with multimodal models, allowing developers to handle and generate data that includes text, images, audio, and other modalities, which makes it possible to build richer, context‑aware applications.

Core Concepts

Multimodal Prompts: Combine different data types (e.g., text and images) into a single prompt sent to a multimodal model.

Multimodal Models: Models that can understand and generate multiple modalities of data.

Typical Use Cases

Image Description & Q&A: Provide an image and a textual question, and receive a description or answer related to the image.

Visual Content Understanding: Analyze an image and extract key information.

Multimodal Chatbots: Build bots that understand both textual and visual user inputs and respond in a multimodal way.

How It Works in LangChain

Interacting with a multimodal model in LangChain generally follows these steps:

Prepare Multimodal Input: Organize data such as plain-text strings and Base64-encoded images into a format the model can understand.

Build a Multimodal Prompt: Use LangChain's prompt-template utilities to embed the multimodal input into a single prompt.

Call the Multimodal Model: Send the constructed prompt to a multimodal LLM or chat model.

Process the Multimodal Output: Receive and parse the model's response, which may contain text, images, or other modalities.
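Before any LangChain objects are involved, the first step can be sketched with plain dictionaries: a content list that mixes typed text and image parts. The question and URL below are placeholders, not values from a real application:

```python
# Minimal sketch of the multimodal input format: a list of typed content
# parts mixing text and an image reference.
question = "What does this picture depict? Please describe in detail."
image_url = "https://example.com/picture.png"  # hypothetical public URL

multimodal_content = [
    {"type": "text", "text": question},
    {"type": "image_url", "image_url": {"url": image_url}},
]

# This exact list is what later gets wrapped as HumanMessage(content=...).
print([part["type"] for part in multimodal_content])
```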

Example 1: Multimodal Prompt (example_1_multimodal_prompt.py)

This example shows how to build and use a multimodal prompt that combines text and an image and sends it to a model that supports multimodal input.

Key Components

ChatOpenAI: Initializes a multimodal chat model (e.g., gpt-4o).

SystemMessage: Defines the model's role and behavior.

HumanMessage: Constructs a message that contains multimodal content.

type: "text": Holds textual content.

type: "image_url": Holds an image URL; the image_url field can be a publicly accessible link.

Execution Flow

Load Environment Variables: Read OPENAI_API_KEY from a .env file.

Initialize the Model: Create a ChatOpenAI instance with the API key and specify a multimodal model such as gpt-4o (the code below uses qianfan-vl-70b as an example).

Prepare Multimodal Messages: Build a list containing a SystemMessage that sets the assistant's role and a HumanMessage whose content field is a list mixing a text item and an image_url item.

The text asks, "What does this picture depict? Please describe in detail."

The image URL points to a public picture (the example uses a Baidu logo URL).

Invoke the Model: Call chat_model.invoke(messages) to send the multimodal prompt.

Handle the Response: Print the textual part of the model's response.

If an exception occurs (e.g., invalid_model), the script prints an error message and reminds the user to verify the API key and model access.

Important Considerations

API Key & Model Access: Running the example requires a valid OpenAI API key and permission to use the chosen multimodal model (e.g., gpt-4o). Errors such as invalid_model indicate the model name is wrong or your account lacks access to it.

Image URL: The demo uses a publicly reachable image URL. In production you can replace it with another URL or embed a local image as a Base64 data URL in the image_url field.
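For the local-image case, a small helper can turn raw image bytes into a Base64 data URL that fits the same image_url field. This is a sketch: the logo.png file name in the comment is hypothetical, and the demo uses the 8-byte PNG signature as stand-in data so the snippet runs without any file on disk:

```python
import base64

def image_bytes_to_data_url(data: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a Base64 data URL for an image_url field."""
    b64 = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{b64}"

# In a real application you would read the bytes from disk, e.g.:
#   with open("logo.png", "rb") as f:
#       data = f.read()
# Here we use the PNG magic-number bytes as stand-in data.
data = b"\x89PNG\r\n\x1a\n"
data_url = image_bytes_to_data_url(data)
print(data_url)  # data:image/png;base64,iVBORw0KGgo=
```

The resulting string can be placed directly in `{"type": "image_url", "image_url": {"url": data_url}}`, so no public hosting is needed.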

import os
from dotenv import load_dotenv
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI

# Load environment variables
load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")
if not openai_api_key:
    raise ValueError("OPENAI_API_KEY not found in environment variables. Please set it in a .env file.")

# Initialize the chat model (example uses qianfan-vl-70b; replace with gpt-4o as needed)
chat_model = ChatOpenAI(openai_api_key=openai_api_key, model="qianfan-vl-70b")

def run_multimodal_prompt_example():
    """Demonstrate a multimodal prompt (text + image)."""
    print("--- Running multimodal prompt example ---")
    image_url = "https://www.baidu.com/img/PCtm_d9c8750bed0b3c7d089fa7d55720d6cf.png"
    messages = [
        SystemMessage(content="You are an image-analysis assistant that can understand image content and answer related questions."),
        HumanMessage(content=[
            {"type": "text", "text": "What does this picture depict? Please describe in detail."},
            {"type": "image_url", "image_url": {"url": image_url}}
        ])
    ]
    try:
        response = chat_model.invoke(messages)
        print("Model response:")
        print(response.content)
    except Exception as e:
        print(f"Error calling model: {e}")
        print("Make sure your OpenAI API key is valid and the model supports multimodal input (e.g., gpt‑4o).")

if __name__ == "__main__":
    run_multimodal_prompt_example()

Running the script produces output similar to:

--- Running multimodal prompt example ---
Model response:
This image shows the famous Chinese search engine Baidu’s logo. The logo consists of two parts: on the left, the English word "Baidu" in red, and on the right, a blue bear paw icon containing the white characters "du". The design is simple and the red‑blue contrast highlights Baidu’s brand identity as a leading Chinese internet company.

References

How to: pass multimodal data directly to models – https://python.langchain.com/docs/how_to/multimodal_data

How to: use multimodal prompts – https://python.langchain.com/docs/how_to/multimodal_prompts

Tags: Python, LLM, prompt engineering, LangChain, OpenAI, multimodal
Written by BirdNest Tech Talk

Author of the rpcx microservice framework, original book author, and chair of Baidu's Go CMC committee.