Extract Structured Vehicle Data from Images with Pydantic and GPT‑4 Vision
This tutorial shows how to build a LangChain pipeline that uses GPT‑4 Vision to read vehicle images, defines a Pydantic schema for vehicle type, license plate, make, model, and color, and returns the results as structured JSON for both single‑image and batch inference.
Problem
Goal: extract vehicle type, license plate, make, model, and color from checkpoint camera images. Traditional computer‑vision methods struggle with pattern variations, and supervised deep‑learning pipelines require multiple specialized models and large labeled datasets. Multimodal large language models such as GPT‑4 Vision can perform zero‑shot extraction, but the raw output must be converted to a structured format.
Tech Stack
GPT‑4 Vision – OpenAI multimodal model accessed via the OpenAI API.
LangChain – Framework for chaining image loading, prompt creation, model invocation, and output parsing.
Pydantic – Python library used to define the structured output schema.
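A minimal environment for the code in this post can be installed with pip; the package split below assumes a recent LangChain release with the separate langchain-openai integration, so adjust it to your setup:

    pip install langchain langchain-core langchain-openai pydantic pandas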
Dataset
Vehicle images are taken from the Kaggle "Car Number Plate" dataset (Apache‑2.0). Link: https://www.kaggle.com/datasets/alihassanml/car-number-plate
Pipeline Components
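The snippets below assume the following imports; the module paths reflect the langchain_core / langchain_openai package split and may need adjusting for other LangChain versions:

    import base64
    import json

    import pandas as pd
    from pydantic import BaseModel, Field
    from langchain_core.messages import HumanMessage, SystemMessage
    from langchain_core.output_parsers import JsonOutputParser
    from langchain_core.runnables import chain
    from langchain_openai import ChatOpenAI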
Image loading
    def image_encoding(inputs):
        """Load an image from disk and return it as a base64-encoded string."""
        with open(inputs["image_path"], "rb") as image_file:
            image_base64 = base64.b64encode(image_file.read()).decode("utf-8")
        return {"image": image_base64}

Output schema
    class Vehicle(BaseModel):
        Type: str = Field(..., examples=["Car", "Truck", "Motorcycle", "Bus"], description="Return the type of the vehicle.")
        License: str = Field(..., description="Return the license plate number of the vehicle.")
        Make: str = Field(..., examples=["Toyota", "Honda", "Ford", "Suzuki"], description="Return the Make of the vehicle.")
        Model: str = Field(..., examples=["Corolla", "Civic", "F-150"], description="Return the Model of the vehicle.")
        Color: str = Field(..., examples=["Red", "Blue", "Black", "White"], description="Return the color of the vehicle.")

Parser
    parser = JsonOutputParser(pydantic_object=Vehicle)
    instructions = parser.get_format_instructions()

Prompt generation
    @chain
    def prompt(inputs):
        """Build the multimodal prompt from the base64-encoded image."""
        messages = [
            SystemMessage(content="You are an AI assistant whose job is to inspect an image and provide the desired information from the image. If the desired field is not clear or not well detected, return none for this field. Do not try to guess."),
            HumanMessage(content=[
                {"type": "text", "text": "Examine the main vehicle type, make, model, license plate number and color."},
                {"type": "text", "text": instructions},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{inputs['image']}", "detail": "low"}},
            ]),
        ]
        return messages

MLLM component
    model = ChatOpenAI(model="gpt-4-vision-preview", temperature=0, max_tokens=1024)

Pipeline assembly
All components are linked with the | operator to build a LangChain pipeline.
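A minimal assembly sketch, assuming the component names defined above; LangChain coerces the plain image_encoding function into a runnable when it is piped into the decorated prompt step:

    pipeline = image_encoding | prompt | model | parser

The resulting pipeline object is what the invoke and batch calls below operate on.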
Single‑image inference
output = pipeline.invoke({"image_path": img_path})
    json_output = json.dumps(output)
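For a typical checkpoint image, the parsed output follows the keys of the Vehicle schema. The values below are illustrative only, not the result of an actual run:

    {"Type": "Car", "License": "ABC-1234", "Make": "Toyota", "Model": "Corolla", "Color": "White"}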
Batch inference

    # Prepare a list of dictionaries with image paths
    batch_input = [{"image_path": path} for path in image_paths]

    # Perform batch inference
    output = pipeline.batch(batch_input)
    df = pd.DataFrame(output)
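To keep the batch results for later inspection, the DataFrame can simply be written to disk; the file name here is a placeholder:

    df.to_csv("vehicle_extractions.csv", index=False)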
Observations

GPT‑4 Vision reliably returns values for vehicle type, license plate, make, model, and color; fields that cannot be determined come back as None. The pipeline is modular, so the GPT‑4 Vision step can be swapped for another multimodal LLM, as sketched below. Users should remain aware of the hallucinations inherent to large language models.
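As one example of that modularity, only the model step changes when pointing the pipeline at a different OpenAI multimodal model; the model name below is an assumption and should be replaced with whatever your account can access:

    # Swap in a different multimodal model; every other pipeline component is reused as-is
    model = ChatOpenAI(model="gpt-4o", temperature=0, max_tokens=1024)
    pipeline = image_encoding | prompt | model | parser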