Unlocking LLaVA: A Hands‑On Guide to the Open‑Source Visual Language Model

This article introduces LLaVA, an open‑source large language and vision assistant that replicates some of GPT‑4V's capabilities. It explains the model's architecture, training process, and key features, then gives step‑by‑step instructions for using the web demo, running the model locally via Ollama or Hugging Face, and building a simple Gradio chatbot, with code examples throughout.


What Is LLaVA?

LLaVA (Large Language and Vision Assistant) is an open‑source generative AI model that mimics some of GPT‑4’s image‑dialogue abilities. Users can add images to a chat, discuss their content, or use images to convey ideas and context.

Key Advantages

Achieves better results than other open‑source alternatives while using a simpler architecture and far less training data.

Trains faster at lower cost, and runs on consumer‑grade hardware (as little as 8 GB of RAM and 4 GB of disk space).

Online Demo

The easiest way to try LLaVA is through the authors’ web interface. Users upload an image on the left and ask questions in the chat window. For example, uploading a fridge photo yields recipe suggestions based on detected ingredients.

Running LLaVA Locally

LLaVA can be installed via Ollama or Mozilla’s llamafile, allowing it to run on CPUs of typical consumer machines. It even works on a Raspberry Pi.
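If you prefer to script the local model from Python, the official ollama client package can talk to a running Ollama server. The snippet below is a minimal sketch rather than an official recipe: it assumes the model has already been pulled (for example with ollama pull llava), that the Ollama server is running, and that fridge.jpg is a placeholder path for a local image.

import ollama

# Ask the locally served LLaVA model about a local image
# (fridge.jpg is a hypothetical placeholder path)
response = ollama.chat(
    model="llava",
    messages=[{
        "role": "user",
        "content": "What ingredients can you see in this photo?",
        "images": ["fridge.jpg"],
    }],
)
print(response["message"]["content"])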

Architecture Overview

LLaVA combines a language model (based on Vicuna/LLaMA‑2) with a visual encoder (CLIP‑ViT‑L/14). The visual encoder converts images to token embeddings that are inserted as soft prompts into the language model. A projection layer aligns dimensions between the two models.
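To make the alignment step concrete, here is a toy PyTorch sketch of the projection idea. The shapes follow CLIP‑ViT‑L/14 at 336‑pixel resolution (1024‑dimensional features over a 24×24 patch grid) and a 7B LLaMA‑class model (4096‑dimensional embeddings); the weights are random stand‑ins, and later LLaVA versions replace the single linear layer with a small MLP.

import torch
import torch.nn as nn

# Toy illustration: project CLIP patch features into the LLM's
# embedding space so they can be spliced into the prompt as
# "soft" image tokens. Weights are random; only shapes are meaningful.
clip_dim, llm_dim = 1024, 4096
projection = nn.Linear(clip_dim, llm_dim)

patch_features = torch.randn(1, 576, clip_dim)   # 576 patches = 24 x 24 grid at 336 px
image_tokens = projection(patch_features)        # (1, 576, 4096), ready to prepend to text embeddings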

Training Process

The training consists of two stages. In stage 1, only the projection layer is fine‑tuned using ~600 k image‑caption pairs from the CC3M dataset while keeping the visual encoder and LLM frozen. In stage 2, both the projection layer and the LLM are fine‑tuned on 158 k language‑image instruction data generated by GPT‑4. The entire training takes about one day on eight A100 GPUs.
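In PyTorch terms, stage 1 amounts to freezing everything except the projection layer. The sketch below is illustrative only, with toy stand‑in modules and made‑up learning rates; it is not LLaVA's actual training code.

import torch
import torch.nn as nn

# Toy stand-ins for the real components (illustrative only)
vision_encoder = nn.Linear(1024, 1024)  # stands in for CLIP-ViT-L/14
projection = nn.Linear(1024, 4096)      # the layer trained in stage 1
llm = nn.Linear(4096, 32000)            # stands in for Vicuna/LLaMA-2

# Stage 1: freeze the visual encoder and the LLM
for module in (vision_encoder, llm):
    for param in module.parameters():
        param.requires_grad = False

# Only the projection layer receives gradient updates
optimizer = torch.optim.AdamW(projection.parameters(), lr=1e-3)

# Stage 2 unfreezes the LLM and trains it together with the projection:
# optimizer = torch.optim.AdamW(
#     list(projection.parameters()) + list(llm.parameters()), lr=2e-5)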

Programming with LLaVA

LLaVA models are available through Hugging Face's transformers library. Below is a snippet that loads the 7B variant with 4‑bit quantization, small enough to run in a Colab notebook.

from transformers import pipeline, BitsAndBytesConfig
import torch

# Quantize the model to 4 bits so it fits in a modest GPU's memory
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

model_id = "llava-hf/llava-1.5-7b-hf"
pipe = pipeline("image-to-text", model=model_id,
                model_kwargs={"quantization_config": quantization_config})

Load an image with Pillow and query the model:

import requests
from PIL import Image

# Download a sample image
image_url = "https://cdn.pixabay.com/photo/2018/01/29/14/13/italy-3116211_960_720.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

# LLaVA-1.5 expects the USER/ASSISTANT template, with <image>
# marking where the image tokens are inserted
prompt = "USER: <image>\nDescribe this picture\nASSISTANT:"
outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})
print(outputs[0]["generated_text"])
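As an alternative to hand‑writing the USER/ASSISTANT template, recent versions of transformers can build the prompt from the processor's chat template. This is a hedged sketch; it assumes a transformers release new enough that the llava-hf processor ships a chat template.

from transformers import AutoProcessor

# Build the same prompt from the model's bundled chat template
processor = AutoProcessor.from_pretrained(model_id)
conversation = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this picture"},
    ]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)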

Building a Simple Chatbot with Gradio

The chatbot UI consists of a Gradio Image widget and a ChatInterface. The interface calls a function that rebuilds the prompt from the conversation history and passes it to the LLaVA pipeline; note that this callback must be defined before the Blocks layout that references it.

import gradio as gr

def update_conversation(new_message, history, image):
    if image is None:
        return "Please upload an image first using the widget on the left"
    # Rebuild the full USER/ASSISTANT prompt from the chat history
    prompt = "USER: <image>\n"
    for user, assistant in history:
        prompt += f"{user}\nASSISTANT: {assistant}\n"
    prompt += new_message + "\nASSISTANT:"
    outputs = pipe(image, prompt=prompt,
                   generate_kwargs={"max_new_tokens": 200, "do_sample": True,
                                    "temperature": 0.7})[0]["generated_text"]
    # Return only the newly generated reply, stripping the echoed prompt
    return outputs[len(prompt) - 6:]

with gr.Blocks() as demo:
    with gr.Row():
        image = gr.Image(type='pil', interactive=True)
        gr.ChatInterface(update_conversation, additional_inputs=[image])

demo.launch(debug=True)

After launching, a web interface appears where users can upload images and converse with LLaVA.

References

LLaVA GitHub repository

LLaVA paper (arXiv)

HuggingFace LLaVA model docs

Ollama‑WebUI project

Tags: multimodal AI, Transformers, open-source, visual language model, Gradio, LLaVA
Written by

21CTO

21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
