Step-by-Step Guide to Structured Output in Local Vision Language Models with Pydantic
This article walks through the challenges of prompting small vision language models, demonstrates a conventional JSON‑based prompt, then shows how to define Pydantic models, embed their JSON schema into prompts, run inference with Qwen2.5‑VL, and visualize the structured results.
When building applications on small language models, prompt design is crucial: verbose prompts are often needed to coax the model into the desired format, and such prompts tend to accumulate redundant instructions like "return JSON" that complicate maintenance.
A conventional implementation typically uses a hand‑crafted prompt that asks the model to output a JSON object containing an object_list with fields such as name, description, x, and y. An example prompt and expected JSON response are shown in the code block below.
messages = [
    {
        "role": "user",
        "content": """Identify all objects in the image. Your response should return a JSON object with an "object_list" array. Each item must contain "name", "description", "x", and "y" fields. Enclose your JSON response within a ```json``` code block. **Do not** include any additional text, explanations, or markdown outside of the JSON structure.

### Example:
**User:** Describe the objects present in the following image.
**Assistant Response:**
```json
{
    "object_list": [
        {"name": "Bookshelf", "description": "stores books", "x": 13.8, "y": 59.6},
        {"name": "Laptop", "description": "An open laptop on the desk", "x": 34.3, "y": 38.2}
    ]
}
```""",
    }
]

Maintaining such prompts becomes cumbersome. By using Pydantic, the response structure can be defined once and reused.
Introducing Pydantic
from pydantic import BaseModel, ConfigDict, Field

class Object(BaseModel):
    name: str
    description: str = Field(..., description="short description")
    x: float = Field(..., description="x coordinate of the object")
    y: float = Field(..., description="y coordinate of the object")

    model_config = ConfigDict(
        json_schema_extra={
            "example": [{"name": "object 1", "description": "object 1 description", "x": 13.3, "y": 59.6}]
        }
    )

class ObjectList(BaseModel):
    object_list: list[Object]

The prompt can now embed the generated JSON schema directly:
import json

messages = [
    {
        "role": "user",
        "content": f"""Identify the main 3 objects in the image. Return the correct JSON response within a ```json``` code block, not the JSON schema itself.
JSON schema: {json.dumps(ObjectList.model_json_schema())}. Also use the provided example to format your JSON.""",
    }
]

Using the schema ensures the model output conforms to the predefined structure, enabling reliable validation and downstream processing.
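Once the model replies, the JSON payload can be pulled out of the code block and validated against the schema. The helper below is a minimal sketch rather than code from the original article; the extract_object_list name and the regular expression are illustrative:

import re

def extract_object_list(response_text: str) -> ObjectList:
    """Extract the JSON inside a ```json ...``` block and validate it against ObjectList."""
    # Take the contents of the first ```json fenced block; fall back to the raw text.
    match = re.search(r"```json\s*(.*?)\s*```", response_text, re.DOTALL)
    payload = match.group(1) if match else response_text
    # model_validate_json raises pydantic.ValidationError when the structure does not match.
    return ObjectList.model_validate_json(payload)

Because failures surface as pydantic.ValidationError, malformed outputs can be caught and handled explicitly instead of propagating silently.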
Setup and test image
Install required libraries and prepare a test image test.jpg:
pip install -U einops torch torchvision matplotlib ollama transformers

The code below also imports pydantic and the qwen_vl_utils helper package, so install those as well if they are not already present.

Model inference
Reference HuggingFace inference code is wrapped in a QwenCaller class that loads Qwen2.5‑VL‑7B‑Instruct, prepares the chat template, processes vision information, and generates the output.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

class QwenCaller:
    def __init__(self, model_path="Qwen/Qwen2.5-VL-7B-Instruct"):
        # Load the model with automatic dtype selection and device placement.
        self.model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
            model_path, torch_dtype="auto", device_map="auto"
        )
        self.processor = AutoProcessor.from_pretrained(model_path)

    def call(self, query, image_path):
        messages = [{"role": "user", "content": [{"type": "image", "image": image_path}, {"type": "text", "text": query}]}]
        # Render the chat template and extract the vision inputs.
        text = self.processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        image_inputs, video_inputs = process_vision_info(messages)
        inputs = self.processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt")
        inputs = inputs.to(self.model.device)
        generated_ids = self.model.generate(**inputs, max_new_tokens=1000)
        # Drop the prompt tokens so only the newly generated text is decoded.
        generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
        output_text = self.processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)
        return output_text

Calling the model:
if __name__ == "__main__":
    vlm_caller = QwenCaller()
    query = "Identify the main 3 objects in the image. Return the correct JSON response within a ```json``` code block, not the JSON schema itself."
    query += f"\nJSON schema: {json.dumps(ObjectList.model_json_schema())}. Also use the provided example to format your JSON."
    img_path = "test.jpg"
    answer = vlm_caller.call(query, img_path)
    print("answer is:", answer)

The printed answer is a JSON object matching the ObjectList schema.
Result visualization
A helper function plot_locations takes an ObjectList instance and the original image, converts the model's coordinates to pixel positions, and draws labeled points on the image using matplotlib.
import matplotlib.pyplot as plt
from PIL import Image

def plot_locations(objects: ObjectList, image_path: str, point_size=10, font_size=12):
    image = Image.open(image_path)
    img_width, img_height = image.size
    # Assuming the model returns coordinates as percentages of the image size
    # (as in the prompt example), scale them to pixel positions before plotting.
    x_coords = [obj.x / 100 * img_width for obj in objects.object_list]
    y_coords = [obj.y / 100 * img_height for obj in objects.object_list]
    labels = [obj.name for obj in objects.object_list]
    plt.figure(figsize=(10, 8))
    plt.imshow(image)
    plt.axis("off")
    for i, (x, y, label) in enumerate(zip(x_coords, y_coords, labels), start=1):
        label_with_number = f"{i}: {label}"
        plt.plot(x, y, "o", color="red", markersize=point_size)
        plt.annotate(label_with_number, (x, y), xytext=(0, 10), textcoords="offset points", ha="center", color="red", fontsize=font_size)
    plt.show()

Running this function on the model's output produces a plot with the detected objects marked.
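Continuing the illustrative example above, the validated result feeds straight into the plotting helper:

# Visualize the validated objects on the original test image.
plot_locations(objects, img_path)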
Conclusion
By defining response structures with Pydantic and injecting the generated JSON schema into prompts, developers can obtain concise, maintainable, and validated outputs from local VLMs. This approach simplifies building RAG pipelines, error handling, and retry mechanisms while remaining model‑agnostic.
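As one sketch of such a retry mechanism (the call_with_retries function and its parameters are illustrative assumptions, not part of the original article), a validation failure can simply trigger another model call:

from pydantic import ValidationError

def call_with_retries(vlm_caller: QwenCaller, query: str, image_path: str, max_retries: int = 3) -> ObjectList:
    # Illustrative retry loop: re-query the model until the output validates.
    last_error = None
    for _ in range(max_retries):
        answer = vlm_caller.call(query, image_path)
        try:
            return extract_object_list(answer[0])
        except ValidationError as err:
            last_error = err  # malformed or schema-violating output; try again
    raise RuntimeError(f"no valid response after {max_retries} attempts") from last_error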