Unlocking LLaVA: A Hands‑On Guide to the Open‑Source Visual Language Model
This article introduces LLaVA, an open‑source large language‑and‑vision assistant that reproduces some of GPT‑4V's capabilities. It explains the model's architecture, training process, and key features, then provides step‑by‑step instructions for trying the web demo, running the model locally via Ollama or Hugging Face, and building a simple Gradio chatbot, with code examples throughout.
What Is LLaVA?
LLaVA (Large Language and Vision Assistant) is an open‑source generative AI model that mimics some of GPT‑4’s image‑dialogue abilities. Users can add images to a chat, discuss their content, or use images to convey ideas and context.
Key Advantages
Improves on other open‑source solutions while using a simpler architecture and less training data.
Faster training, lower cost, and can run on consumer‑grade hardware (as little as 8 GB RAM and 4 GB disk).
Online Demo
The easiest way to try LLaVA is through the authors’ web interface. Users upload an image on the left and ask questions in the chat window. For example, uploading a fridge photo yields recipe suggestions based on detected ingredients.
Running LLaVA Locally
LLaVA can be installed via Ollama or Mozilla’s llamafile, allowing it to run on CPUs of typical consumer machines. It even works on a Raspberry Pi.
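With Ollama installed, pulling and querying the model takes two commands. The image path below is illustrative; substitute any local image file.

```shell
# Download the default LLaVA model (the 7B variant)
ollama pull llava

# Ask a question about a local image by including its path in the prompt
ollama run llava "Describe this picture: ./fridge.jpg"
```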
Architecture Overview
LLaVA combines a language model (based on Vicuna/LLaMA‑2) with a visual encoder (CLIP‑ViT‑L/14). The visual encoder converts images to token embeddings that are inserted as soft prompts into the language model. A projection layer aligns dimensions between the two models.
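To make the data flow concrete, here is a minimal pure‑Python sketch of the projection step. The dimensions and the random weight matrix are illustrative stand‑ins: the real projector maps CLIP's 1024‑dimensional patch features into the LLM's 4096‑dimensional embedding space, and its weights are learned during training.

```python
import random

VISION_DIM = 8    # stands in for CLIP's 1024-dim patch features
LLM_DIM = 16      # stands in for the LLM's 4096-dim embedding space
NUM_PATCHES = 3   # stands in for the number of image patch tokens

# A projection matrix W (random here; learned in the real model)
W = [[random.random() for _ in range(LLM_DIM)] for _ in range(VISION_DIM)]

def project(patch_features):
    """Map each visual patch embedding into the LLM's embedding space."""
    return [
        [sum(f[i] * W[i][j] for i in range(VISION_DIM)) for j in range(LLM_DIM)]
        for f in patch_features
    ]

# Fake visual-encoder output: one embedding per image patch
patches = [[random.random() for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]
soft_prompt = project(patches)

# One projected token per patch, each now sized for the LLM
print(len(soft_prompt), len(soft_prompt[0]))  # 3 16
```

The projected tokens are then concatenated with the text token embeddings and fed to the language model as if they were ordinary prompt tokens.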
Training Process
The training consists of two stages. In stage 1, only the projection layer is trained, using ~600K image‑caption pairs from the CC3M dataset, while the visual encoder and the LLM stay frozen. In stage 2, both the projection layer and the LLM are fine‑tuned on 158K language‑image instruction‑following samples generated with GPT‑4. The entire training takes about one day on eight A100 GPUs.
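The division of labor between the two stages can be summarized as a small table of which parameter groups are trainable. The group names below are illustrative, not the repository's actual module names:

```python
# Which parameter groups receive gradient updates in each stage
# (names are illustrative; the frozen/trainable split follows the paper)
STAGE_1 = {"vision_encoder": False, "projection": True, "llm": False}
STAGE_2 = {"vision_encoder": False, "projection": True, "llm": True}

def trainable(stage):
    """Return the sorted names of the trainable parameter groups."""
    return sorted(name for name, is_trainable in stage.items() if is_trainable)

print(trainable(STAGE_1))  # ['projection']
print(trainable(STAGE_2))  # ['llm', 'projection']
```

Note that the visual encoder stays frozen in both stages; only the cheap projection layer is trained from scratch, which is part of why training is so inexpensive.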
Programming with LLaVA
LLaVA models are available in the transformers library. Below is a snippet that loads the 7B variant with 4‑bit quantization, suitable for a Colab notebook.
from transformers import pipeline, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

model_id = "llava-hf/llava-1.5-7b-hf"
pipe = pipeline("image-to-text", model=model_id,
                model_kwargs={"quantization_config": quantization_config})
Load an image with Pillow and query the model:
import requests
from PIL import Image
image_url = "https://cdn.pixabay.com/photo/2018/01/29/14/13/italy-3116211_960_720.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)
prompt = "USER: <image>\nDescribe this picture\nASSISTANT:"
outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})
print(outputs[0]["generated_text"])
Building a Simple Chatbot with Gradio
The chatbot UI consists of a Gradio Image widget and a ChatInterface. The interface calls a function that builds a prompt from the conversation history and passes it to the LLaVA pipeline.
import gradio as gr

def update_conversation(new_message, history, image):
    if image is None:
        return "Please upload an image first using the widget on the left"
    # Rebuild the full conversation in LLaVA's prompt format
    prompt = "USER: <image>\n"
    for user, assistant in history:
        prompt += f"{user}\nASSISTANT: {assistant}\n"
    prompt += new_message + "\nASSISTANT:"
    outputs = pipe(image, prompt=prompt,
                   generate_kwargs={"max_new_tokens": 200,
                                    "do_sample": True,
                                    "temperature": 0.7})[0]["generated_text"]
    # Return only the new reply: skip the echoed prompt
    # (the offset compensates for the <image> placeholder)
    return outputs[len(prompt) - 6:]

with gr.Blocks() as demo:
    with gr.Row():
        image = gr.Image(type='pil', interactive=True)
        gr.ChatInterface(update_conversation, additional_inputs=[image])

demo.launch(debug=True)
After launching, a web interface appears where users can upload images and converse with LLaVA.