Step-by-Step Guide to Structured Output in Local Vision Language Models with Pydantic
This article walks through the challenges of prompting small vision language models, demonstrates a conventional JSON‑based prompt, then shows how to define Pydantic models, embed their JSON schema into prompts, run inference with Qwen2.5‑VL, and visualize the structured results.
When building applications on small language models, prompt design is crucial: verbose prompts are often needed to coax the model into the desired format, and such prompts tend to accumulate redundant instructions like "return JSON" that complicate maintenance.
A conventional implementation typically uses a hand‑crafted prompt that asks the model to output a JSON object containing an object_list with fields such as name, description, x, and y. An example prompt and expected JSON response are shown in the code block below.
messages = [
    {
        "role": "user",
        "content": """Identify all objects in the image. Your response should return a JSON object with an "object_list" array. Each item must contain "name", "description", "x", and "y" fields. Enclose your JSON response within a ```json``` code block. **Do not** include any additional text, explanations, or markdown outside of the JSON structure.

### Example:
**User:** Describe the objects present in the following image.
**Assistant Response:**
```json
{
    "object_list": [
        {"name": "Bookshelf", "description": "stores books", "x": 13.8, "y": 59.6},
        {"name": "Laptop", "description": "An open laptop on the desk", "x": 34.3, "y": 38.2}
    ]
}
```""",
    }
]

Maintaining such prompts becomes cumbersome. By using Pydantic, the response structure can be defined once and reused.
Introducing Pydantic
from pydantic import BaseModel, ConfigDict, Field

class Object(BaseModel):
    name: str
    description: str = Field(..., description="short description")
    x: float = Field(..., description="x coordinate of the object")
    y: float = Field(..., description="y coordinate of the object")

    model_config = ConfigDict(
        json_schema_extra={
            "example": [{"name": "object 1", "description": "object 1 description", "x": 13.3, "y": 59.6}]
        }
    )

class ObjectList(BaseModel):
    object_list: list[Object]

The prompt can now embed the generated JSON schema directly:
import json

messages = [
    {
        "role": "user",
        "content": f"""Identify the main 3 objects in the image. Return the correct JSON response within a ```json``` code block, not the JSON schema itself.
JSON schema: {json.dumps(ObjectList.model_json_schema())}. Also use the provided example to format your JSON.""",
    }
]

Using the schema ensures the model output conforms to the predefined structure, enabling reliable validation and downstream processing.
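Once the model replies, the JSON payload can be pulled out of the code block and validated against the schema. The helper below is a minimal sketch rather than code from the original article; the extract_object_list name and the regular expression are illustrative:

import re

def extract_object_list(response_text: str) -> ObjectList:
    """Extract the JSON inside a ```json ...``` block and validate it against ObjectList."""
    # Take the contents of the first ```json fenced block; fall back to the raw text.
    match = re.search(r"```json\s*(.*?)\s*```", response_text, re.DOTALL)
    payload = match.group(1) if match else response_text
    # model_validate_json raises pydantic.ValidationError when the structure does not match.
    return ObjectList.model_validate_json(payload)

Because failures surface as pydantic.ValidationError, malformed outputs can be caught and handled explicitly instead of propagating silently.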
Setup and test image
Install required libraries and prepare a test image test.jpg:
pip install -U einops torch torchvision matplotlib ollama transformers

The code below also imports pydantic and the qwen_vl_utils helper package, so install those as well if they are not already present.

Model inference
Reference HuggingFace inference code is wrapped in a QwenCaller class that loads Qwen2.5‑VL‑7B‑Instruct, prepares the chat template, processes vision information, and generates the output.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

class QwenCaller:
    def __init__(self, model_path="Qwen/Qwen2.5-VL-7B-Instruct"):
        # Load the model with automatic dtype selection and device placement.
        self.model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
            model_path, torch_dtype="auto", device_map="auto"
        )
        self.processor = AutoProcessor.from_pretrained(model_path)

    def call(self, query, image_path):
        messages = [{"role": "user", "content": [{"type": "image", "image": image_path}, {"type": "text", "text": query}]}]
        # Render the chat template and extract the vision inputs.
        text = self.processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        image_inputs, video_inputs = process_vision_info(messages)
        inputs = self.processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt")
        inputs = inputs.to(self.model.device)
        generated_ids = self.model.generate(**inputs, max_new_tokens=1000)
        # Drop the prompt tokens so only the newly generated text is decoded.
        generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
        output_text = self.processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)
        return output_text

Calling the model:
if __name__ == "__main__":
    vlm_caller = QwenCaller()
    query = "Identify the main 3 objects in the image. Return the correct JSON response within a ```json``` code block, not the JSON schema itself."
    query += f"\nJSON schema: {json.dumps(ObjectList.model_json_schema())}. Also use the provided example to format your JSON."
    img_path = "test.jpg"
    answer = vlm_caller.call(query, img_path)
    print("answer is:", answer)

The printed answer is a JSON object matching the ObjectList schema.
Result visualization
A helper function plot_locations takes an ObjectList instance and the original image, converts the model's coordinates to pixel positions, and draws labeled points on the image using matplotlib.
import matplotlib.pyplot as plt
from PIL import Image

def plot_locations(objects: ObjectList, image_path: str, point_size=10, font_size=12):
    image = Image.open(image_path)
    img_width, img_height = image.size
    # Assuming the model returns coordinates as percentages of the image size
    # (as in the prompt example), scale them to pixel positions before plotting.
    x_coords = [obj.x / 100 * img_width for obj in objects.object_list]
    y_coords = [obj.y / 100 * img_height for obj in objects.object_list]
    labels = [obj.name for obj in objects.object_list]
    plt.figure(figsize=(10, 8))
    plt.imshow(image)
    plt.axis("off")
    for i, (x, y, label) in enumerate(zip(x_coords, y_coords, labels), start=1):
        label_with_number = f"{i}: {label}"
        plt.plot(x, y, "o", color="red", markersize=point_size)
        plt.annotate(label_with_number, (x, y), xytext=(0, 10), textcoords="offset points", ha="center", color="red", fontsize=font_size)
    plt.show()

Running this function on the model's output produces a plot with the detected objects marked.
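Continuing the illustrative example above, the validated result feeds straight into the plotting helper:

# Visualize the validated objects on the original test image.
plot_locations(objects, img_path)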
Conclusion
By defining response structures with Pydantic and injecting the generated JSON schema into prompts, developers can obtain concise, maintainable, and validated outputs from local VLMs. This approach simplifies building RAG pipelines, error handling, and retry mechanisms while remaining model‑agnostic.
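As one sketch of such a retry mechanism (the call_with_retries function and its parameters are illustrative assumptions, not part of the original article), a validation failure can simply trigger another model call:

from pydantic import ValidationError

def call_with_retries(vlm_caller: QwenCaller, query: str, image_path: str, max_retries: int = 3) -> ObjectList:
    # Illustrative retry loop: re-query the model until the output validates.
    last_error = None
    for _ in range(max_retries):
        answer = vlm_caller.call(query, image_path)
        try:
            return extract_object_list(answer[0])
        except ValidationError as err:
            last_error = err  # malformed or schema-violating output; try again
    raise RuntimeError(f"no valid response after {max_retries} attempts") from last_error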