How to Build an Open‑Set Object Detection Workflow: A Comprehensive Guide

This article presents a step‑by‑step agentic object detection pipeline that combines open‑vocabulary detectors such as Grounding‑DINO with vision‑language models (GPT‑4o, o1) for concept extraction, critique, refinement, and validation, complete with code snippets, design rationale, and real‑world examples.

Overall Design

Concept reasoning: a vision‑language model (VLM) parses the user request to extract target concepts; if none are explicit, the VLM identifies salient objects.

Initial detection: the extracted concepts are fed to an open‑vocabulary detector (Grounding‑DINO or OWL‑ViT) to produce bounding boxes.

Interactive annotation: each detected box is marked on the image with a numbered arrow so the VLM can refer to it unambiguously.

Query correction: the annotated image, original request, and detected concepts are sent to a VLM (e.g., OpenAI o1) which uses chain‑of‑thought reasoning to abstract overly specific labels (e.g., "teacup poodle" → "dog").

Refined detection: the refined concept list is re‑run through the detector.

Final filtering: a second VLM pass validates each numbered box against the user intent, returning a JSON map of valid detections.
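Putting these stages together, the control flow looks roughly like the condensed sketch below. It is illustrative only: it assumes a vlm_tool and detector constructed as in the implementation sections that follow, and it mirrors the full run() method shown later in the article.

concepts = vlm_tool.extract_objects_from_request(image_path, user_request)      # 1. concept reasoning
detected, labeled_path = detector._run_detector(image_path, concepts)           # 2–3. detection + numbered arrows
refined = detector._critique_and_refine_query(user_request, concepts,
                                              labeled_path, detected)           # 4. query correction
if refined and set(refined) != set(concepts):
    detected, labeled_path = detector._run_detector(image_path, refined)        # 5. refined detection
valid = detector._validate_bboxes_with_llm(user_request, labeled_path)          # 6. final filtering
final = [obj for obj in detected if str(obj[0]) in valid]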

VLM Tool Implementation

import base64

def encode_image(image_path):
    """Encodes an image file into a base64‑encoded string.
    Args:
        image_path (str): Path to the image file.
    Returns:
        str: Base64‑encoded string of the image content.
        None: If an error occurs during encoding.
    """
    try:
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode("utf-8")
    except Exception as e:
        print(f"Error encoding image: {e}")
        return None
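For reference, a quick usage sketch of this helper (the file name is a placeholder); the resulting data URL is what the chat messages below embed:

b64 = encode_image("example.jpg")  # placeholder path
if b64:
    data_url = f"data:image/jpeg;base64,{b64}"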
import base64, json
from openai import OpenAI

class VLMTool:
    def __init__(self, api_key):
        self.client = OpenAI(api_key=api_key)

    def chat_completion(self, messages, model="o1", max_tokens=300, temperature=0.1, response_format=None):
        if model in ["gpt-4o", "gpt-4o-mini"]:
            response = self.client.chat.completions.create(
                model=model,
                messages=messages,
                max_tokens=max_tokens,
                temperature=temperature,
                response_format=response_format if response_format else {"type": "text"}
            )
        elif model == "o1":
            response = self.client.chat.completions.create(
                model=model,
                messages=messages,
                response_format=response_format if response_format else {"type": "text"}
            )
        else:
            raise NotImplementedError("This model is not supported")
        return response.choices[0].message.content

    def extract_objects_from_request(self, image_path, user_text, model="gpt-4o"):
        base64_image = encode_image(image_path)
        if not base64_image:
            return []
        prompt = (
            "You are an AI vision assistant that extracts objects to be identified from a user's request. "
            "If the user wants to detect or segment all objects, return a comma‑separated list of visible objects. "
            "If the user specifies particular objects, extract only those. "
            "Respond ONLY with the list, nothing else."
        )
        messages = [
            {"role": "system", "content": prompt},
            {"role": "user", "content": [
                {"type": "text", "text": user_text},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}", "detail": "high"}}
            ]}
        ]
        result = self.chat_completion(messages, model=model)
        if result:
            return [obj.strip().lower() for obj in result.split(",") if obj.strip()]
        return []
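A minimal usage sketch of the concept-extraction step; the API key, image path, and the exact returned list are placeholders:

vlm_tool = VLMTool(api_key="YOUR_OPENAI_API_KEY")        # placeholder key
concepts = vlm_tool.extract_objects_from_request(
    "street_scene.jpg",                                  # placeholder image
    "Find all the dogs and bicycles in this picture")
print(concepts)  # e.g. ["dogs", "bicycles"], depending on the VLM's answer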

Object Detection Tool

import json

import torch
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection


class ObjectDetectionTool:
    """Performs object detection using Grounding‑DINO or OWL‑ViT, with optional VLM‑driven critique."""
    def __init__(self, model_id, device, vlm_tool, confidence_threshold=0.2,
                 concept_detection_model="gpt-4o", initial_critique_model="o1",
                 final_critique_model="gpt-4o"):
        self.model_id = model_id
        self.processor = AutoProcessor.from_pretrained(model_id)
        self.model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)
        self.device = device
        self.vlm_tool = vlm_tool
        self.confidence_threshold = confidence_threshold
        self.concept_detection_model = concept_detection_model
        self.initial_critique_model = initial_critique_model
        self.final_critique_model = final_critique_model
        self.last_detection_bboxes = []
        self.last_filtered_objects = []
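The class above (and _run_detector below) also relies on an INV_MODEL_TYPES lookup that is defined elsewhere in the linked repository. A plausible reconstruction, assuming standard Hugging Face model IDs, is:

# Assumed mapping from model ID to detector family (reconstructed; see the repo for the original)
INV_MODEL_TYPES = {
    "IDEA-Research/grounding-dino-tiny": "grounding_dino",
    "google/owlvit-base-patch32": "owlvit",
}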

Running the Detector

def _run_detector(self, image_path, query_list):
    """Low‑level routine to run the detection model on `query_list`.
    Returns (detected_objects, labeled_image_path).
    Each entry in `detected_objects` is (num, label_id, [x1, y1, x2, y2])."""
    from PIL import Image
    # Format queries for the model
    if INV_MODEL_TYPES[self.model_id] == "owlvit":
        formatted_queries = [f"An image of {q}" for q in query_list]
    elif INV_MODEL_TYPES[self.model_id] == "grounding_dino":
        formatted_queries = " ".join([f"{q}." for q in set(query_list)])
    else:
        raise NotImplementedError("Model not supported")
    # Load image
    img = Image.open(image_path).convert("RGB")
    inputs = self.processor(text=formatted_queries, images=img, return_tensors="pt", padding=True, truncation=True).to(self.device)
    self.model.eval()
    with torch.no_grad():
        outputs = self.model(**inputs)
    # Post‑process bounding boxes
    if INV_MODEL_TYPES[self.model_id] == "grounding_dino":
        results = self.processor.post_process_grounded_object_detection(
            outputs, inputs.input_ids, box_threshold=0.4, text_threshold=0.3, target_sizes=[img.size[::-1]])
        boxes = results[0]["boxes"]
        scores = results[0]["scores"]
        labels = results[0]["labels"]
    elif INV_MODEL_TYPES[self.model_id] == "owlvit":
        logits = torch.max(outputs["logits"][0], dim=-1)
        scores = torch.sigmoid(logits.values).cpu().numpy()
        labels = logits.indices.cpu().numpy()
        boxes = outputs["pred_boxes"][0].cpu().numpy()
    else:
        raise NotImplementedError("Model not supported")
    detected_objects = []
    idx = 1
    for score, box, label_idx in zip(scores, boxes, labels):
        if score < self.confidence_threshold:
            continue
        detected_objects.append((idx, label_idx, box.tolist()))
        idx += 1
    # Draw numbers for VLM reference
    labeled_image_path = draw_arrows_and_numbers(image_path, detected_objects)
    return detected_objects, labeled_image_path

Arrow‑Number Annotation

import cv2


def draw_arrows_and_numbers(image_path, detected_objects):
    """Draw arrows and numbers on an image to label detected objects.
    Returns the path to the saved annotated image.
    """
    img = cv2.imread(image_path)
    font = cv2.FONT_HERSHEY_SIMPLEX
    top, bottom, left, right = 50, 50, 50, 50
    img = cv2.copyMakeBorder(img, top, bottom, left, right, cv2.BORDER_CONSTANT, value=[255, 255, 255])
    height, width, _ = img.shape
    for num, _, box in detected_objects:
        x1, y1, x2, y2 = map(int, box)
        cx, cy = (x1 + x2) // 2, (y1 + y2) // 2
        # Adjust for padding
        x1 += left; y1 += top; x2 += left; y2 += top; cx += left; cy += top
        distances = {'top': cy, 'bottom': height - cy, 'left': cx, 'right': width - cx}
        direction = min(distances, key=distances.get)
        if direction == 'top':
            arrow_end = (cx, top)
            text_pos = (cx - 10, top - 10)
        elif direction == 'bottom':
            arrow_end = (cx, height - bottom)
            text_pos = (cx - 10, height - 5)
        elif direction == 'left':
            arrow_end = (left, cy)
            text_pos = (left - 30, cy + 5)
        else:
            arrow_end = (width - right, cy)
            text_pos = (width - 30, cy + 5)
        color = (0, 0, 0)
        cv2.arrowedLine(img, (cx, cy), arrow_end, color, 2, tipLength=0.3)
        overlay = img.copy()
        cv2.rectangle(overlay, (text_pos[0] - 5, text_pos[1] - 20), (text_pos[0] + 30, text_pos[1] + 5), (0, 0, 0), -1)
        cv2.addWeighted(overlay, 0.5, img, 0.5, 0, img)
        cv2.putText(img, str(num), text_pos, font, 0.8, color, 2)
    labeled_path = "labeled_objects_optimized.jpg"
    cv2.imwrite(labeled_path, img)
    return labeled_path

Critique and Refinement

def _critique_and_refine_query(self, user_request, original_concepts, labeled_image_path, objects_detected, model="o1"):
    """Ask the VLM to refine detection queries if needed.
    Returns a new list of objects or the original list.
    """
    base64_img = encode_image(labeled_image_path)
    refine_messages = [
        {"role": "system", "content": "You are an AI system that refines detection queries. ..."},
        {"role": "user", "content": [
            {"type": "text", "text": f"User's request: {user_request}
Original Concepts: {original_concepts}"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_img}", "detail": "high"}}
        ]}
    ]
    response = self.vlm_tool.chat_completion(refine_messages, model=model, response_format={"type": "json_object"})
    refined = json.loads(response).get("refined_list", "").split(",")
    refined = [r.strip().lower() for r in refined if r.strip()]
    return refined if refined else []
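For clarity, the response shape this parsing assumes (the actual schema is dictated by the abbreviated system prompt above) would be something like:

example_response = '{"refined_list": "dog, cup"}'   # illustrative response only
refined = [r.strip().lower()
           for r in json.loads(example_response)["refined_list"].split(",") if r.strip()]
# -> ["dog", "cup"]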

Bounding‑Box Validation with VLM

def _validate_bboxes_with_llm(self, user_request, labeled_image_path, model="o1"):
    """Pass the annotated image to the VLM to filter bounding boxes based on the user request.
    Returns a dict of valid box numbers.
    """
    base64_img = encode_image(labeled_image_path)
    messages = [
        {"role": "system", "content": "You are an AI reviewing an object detection output. All detected objects have been marked with numbered arrows. Identify which objects satisfy the user's request and return a JSON map of valid numbers."},
        {"role": "user", "content": [
            {"type": "text", "text": f"User's request: {user_request}"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_img}", "detail": "high"}}
        ]}
    ]
    response = self.vlm_tool.chat_completion(messages, model=model, response_format={"type": "json_object"})
    try:
        data = json.loads(response)
        return data.get("valid_numbers", {})
    except json.JSONDecodeError:
        return {}
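Likewise, the validation step assumes a response that maps each kept arrow number to a label, roughly as follows; boxes whose numbers are missing from this map are dropped in run() below:

example_response = '{"valid_numbers": {"1": "iphone", "3": "ipod"}}'  # illustrative response only
valid = json.loads(example_response)["valid_numbers"]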

Full Pipeline

def run(self, image_path, user_request, do_critique=True):
    """Execute the agentic object detection workflow.
    1. Extract objects from the user request via VLM.
    2. Detect bounding boxes.
    3. Optional critique to abstract overly specific concepts.
    4. Re‑run detection with refined concepts if they differ.
    5. Final VLM validation of numbered boxes.
    6. Return the final annotated image and a summary string.
    """
    # Step 1
    objects = self.vlm_tool.extract_objects_from_request(image_path, user_request, model=self.concept_detection_model)
    if not objects:
        return None, "⚠️ No objects to detect or invalid request."
    # Step 2
    detected, labeled_path = self._run_detector(image_path, objects)
    # Steps 3–4 – optional critique and re-detection with refined concepts
    if do_critique:
        current_labels = ",".join(set(str(lbl) for _, lbl, _ in detected))
        refined = self._critique_and_refine_query(user_request, current_labels, labeled_path, current_labels, model=self.initial_critique_model)
        if refined and set(refined) != set(objects):
            detected, labeled_path = self._run_detector(image_path, refined)
            if not detected:
                return None, "No objects found after refinement."
    # Step 5 – final validation
    valid_numbers = self._validate_bboxes_with_llm(user_request, labeled_path, model=self.final_critique_model)
    if valid_numbers:
        filtered = [(n, valid_numbers[str(n)], box) for (n, _, box) in detected if str(n) in valid_numbers]
    else:
        filtered = detected
    self.last_detection_bboxes = [box for _, _, box in filtered]
    self.last_filtered_objects = filtered
    final_img = draw_bounding_boxes(image_path, filtered)
    final_text = f"🔍 Validated objects: {', '.join(set(str(lbl) for _, lbl, _ in filtered))}"
    return final_img, final_text
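run() calls a draw_bounding_boxes helper that the article does not show (it lives in the linked repository). A minimal OpenCV sketch compatible with the (number, label, box) tuples used above might look like this:

import cv2

def draw_bounding_boxes(image_path, objects, out_path="final_detections.jpg"):
    """Minimal assumed implementation: draw labelled boxes and save the result."""
    img = cv2.imread(image_path)
    for num, label, box in objects:
        x1, y1, x2, y2 = map(int, box)
        cv2.rectangle(img, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(img, f"{num}: {label}", (x1, max(y1 - 10, 0)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    cv2.imwrite(out_path, img)
    return out_path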

Experimental Demonstrations

Two Gradio demos illustrate the workflow.

Demo 1: The user asks to detect an iPod and an iPhone. The pipeline extracts the concepts, runs Grounding‑DINO, optionally refines overly specific terms, and validates each numbered box, yielding precise detections.

Demo 2: The user requests only coffee cups bearing the "Coffee Chat" latte art. The VLM filters out the unrelated cup by interpreting the textual content on the objects.
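As a concrete invocation sketch (the image path, API key, and model ID are placeholders), the whole pipeline can be exercised in a few lines:

import torch

vlm_tool = VLMTool(api_key="YOUR_OPENAI_API_KEY")
detector = ObjectDetectionTool(
    model_id="IDEA-Research/grounding-dino-tiny",            # assumed Grounding-DINO checkpoint
    device="cuda" if torch.cuda.is_available() else "cpu",
    vlm_tool=vlm_tool,
)
final_image, summary = detector.run(
    "coffee_cups.jpg",
    'Detect only the coffee cups with "Coffee Chat" latte art',
)
print(summary)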

Conclusion and Limitations

The agentic object detection framework shows how VLM‑driven reasoning can augment open‑vocabulary detectors, providing concept abstraction, batch‑style verification via numbered arrows, and text‑aware filtering without retraining the underlying detector. The approach is bounded by the detector’s vocabulary; if a target class is absent from the detector’s knowledge, VLM‑only refinement cannot recover it. Emerging multimodal models (e.g., GPT‑4V, Gemini) that directly output bounding boxes may eventually subsume the separate detector‑VLM architecture.

Source code:

https://github.com/anand-subu/blog_resources/blob/main/agentic_object_detection/models/vlm_tool.py

https://github.com/anand-subu/blog_resources/blob/main/agentic_object_detection/models/object_detection_tool.py

Tags: Python, Grounding DINO, Pipeline, visual language model, VLM, open-vocabulary detection, agentic object detection
Written by AI Algorithm Path, a public account focused on deep learning, computer vision, and autonomous driving perception algorithms, covering visual CV, neural networks, pattern recognition, related hardware and software configurations, and open-source projects.