How YOLO-Count Enables Precise Object Counting in Text-to-Image Generation
YOLO-Count introduces a fully differentiable, open‑vocabulary object counting model that guides text‑to‑image generators to produce the exact number of objects specified in prompts, achieving state‑of‑the‑art performance on both generic counting and controlled image synthesis tasks.
Background
Precise control of object quantity in text‑to‑image (T2I) generation is required for many applications, but current diffusion models (e.g., Stable Diffusion XL) often ignore numeric cues in prompts. Traditional counting pipelines (object detectors or density‑map regression) are either non‑differentiable or biased by object size, making them unsuitable as a guidance signal for end‑to‑end T2I training.
YOLO‑Count Overview
YOLO‑Count is a fully differentiable counting module built on the YOLO‑World architecture. It can be attached to any T2I model and trained jointly so that counting errors are back‑propagated to the diffusion network, enabling the generator to satisfy explicit count constraints.
Cardinality Map
The model predicts, for each grid cell (determined by the backbone stride), a scalar c ∈ [0, 1] that estimates the probability that the cell contains an object. The total count N̂ is obtained by summing all cell values:

N̂ = Σ_{i=1}^{H·W} c_i

Because each object contributes roughly one unit of total mass regardless of its spatial extent, the representation is unbiased with respect to scale and density. During training, the ground-truth count N is regressed with an L1 loss L_cnt = |N̂ − N|, optionally combined with a binary cross-entropy loss on per-cell occupancy when segmentation masks are available.
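As a concrete sketch (not the authors' released code), the cardinality-map count and its L1 loss reduce to a few lines of PyTorch; `predicted_count` and `count_loss` are illustrative helper names:

```python
import torch

def predicted_count(cardinality_map: torch.Tensor) -> torch.Tensor:
    """Sum the per-cell values c_i into the scalar count N-hat.

    cardinality_map: (B, H, W) tensor with entries in [0, 1].
    The sum is differentiable, so gradients flow back to every cell.
    """
    return cardinality_map.flatten(start_dim=1).sum(dim=1)

def count_loss(cardinality_map: torch.Tensor, gt_count: torch.Tensor) -> torch.Tensor:
    """L1 regression |N-hat - N| of the summed count against the ground truth."""
    return (predicted_count(cardinality_map) - gt_count).abs().mean()
```

Because each object contributes about one unit of total mass no matter how many cells it spans, the summed count stays insensitive to object size, unlike density maps.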
Differentiable Open‑Vocabulary Counting
YOLO‑Count inherits the CLIP‑based text encoder of YOLO‑World, allowing arbitrary class names to be queried at inference time. The entire forward pass (image → feature map → cardinality map) is differentiable, so the counting loss can be propagated through the T2I model’s conditioning pathway. In practice the loss is added to the diffusion model’s denoising objective with a weighting factor λ_cnt.
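One way to picture the open-vocabulary query (a hypothetical sketch under assumed shapes, not YOLO-World's actual head): compare each cell's feature vector with the CLIP text embedding of the queried class name and squash the similarity into [0, 1]:

```python
import torch
import torch.nn.functional as F

def cardinality_from_text(features: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Hypothetical open-vocabulary head: cosine similarity between each
    spatial feature vector and a class text embedding, squashed to [0, 1].

    features: (B, C, H, W) image feature map; text_emb: (C,) text embedding.
    Returns a (B, H, W) cardinality map that is differentiable end to end.
    """
    sim = torch.einsum('bchw,c->bhw',
                       F.normalize(features, dim=1),   # unit-norm per cell
                       F.normalize(text_emb, dim=0))   # unit-norm text vector
    return torch.sigmoid(sim)
```

Since every step is a standard differentiable op, the counting loss computed on this map can flow back through the T2I model's conditioning pathway exactly as the text describes.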
Hybrid Strong‑Weak Supervision
Training data consists of:
Strong annotations: images with pixel-level segmentation masks, providing exact per-pixel occupancy labels for a binary cross-entropy term.
Weak annotations: point clicks or only the total object count, for which only the L1 count loss is applied.
This mixed scheme dramatically expands the usable dataset (e.g., COCO‑Stuff masks + OpenImages count labels) while preserving high counting accuracy.
Integration with Text‑to‑Image Models
During T2I training, a prompt is encoded by the text encoder of the diffusion model. The generated latent image is fed to YOLO‑Count, which produces N̂ for the queried class. The counting loss L_cnt is combined with the standard diffusion loss L_diff:

L_total = L_diff + λ_cnt·L_cnt

Back‑propagation updates both the diffusion network and the YOLO‑Count parameters, allowing the generator to learn to respect numeric constraints without any post‑hoc correction.
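In code, the combined objective is a one-line weighted sum; the sketch below uses an illustrative default for λ_cnt (the paper's actual value is not stated here):

```python
import torch

def total_loss(l_diff: torch.Tensor,
               cardinality_map: torch.Tensor,
               gt_count: torch.Tensor,
               lam_cnt: float = 0.1) -> torch.Tensor:
    """Combine the denoising objective with the counting penalty:
    L_total = L_diff + lam_cnt * |N-hat - N|.
    lam_cnt = 0.1 is an illustrative default, not a value from the paper.
    """
    l_cnt = (cardinality_map.flatten(start_dim=1).sum(dim=1) - gt_count).abs().mean()
    return l_diff + lam_cnt * l_cnt
```

Calling `total_loss(...).backward()` sends the count error through the counting module and into whatever produced the cardinality map, which is the mechanism that lets the generator learn numeric constraints end to end.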
Experimental Evaluation
Quantity‑Controlled Generation
On a benchmark of prompts such as “5 apples” and “3 cars”, YOLO‑Count‑guided Stable Diffusion XL reduces the absolute count error from >2 objects (baseline) to <0.3 objects on average, achieving >90 % exact‑count success on both seen and unseen categories. Qualitative examples show consistent generation of the requested number of objects while preserving visual fidelity (FID comparable to the baseline).
Generic Object Counting
YOLO‑Count is evaluated on standard counting datasets (e.g., FSC‑147, CARPK). It attains mean absolute error (MAE) of 1.8 on FSC‑147 and 2.1 on CARPK, surpassing previous state‑of‑the‑art methods such as Count‑CNN and DM‑Count. The open‑vocabulary capability is demonstrated by counting novel categories (e.g., “zebras”) without additional fine‑tuning.
Key Contributions
Introduces a differentiable cardinality‑map representation that eliminates size‑related counting bias.
Leverages YOLO‑World’s open‑vocabulary detection to count arbitrary classes.
Provides a plug‑and‑play module that can be jointly optimized with any diffusion‑based T2I model, enabling precise numeric control.
Employs a hybrid strong‑weak supervision strategy to reduce annotation cost while maintaining state‑of‑the‑art counting performance.
Resources
Paper: YOLO‑Count: Differentiable Object Counting for Text‑to‑Image Generation (arXiv https://arxiv.org/pdf/2508.00728v1)
Code and pretrained models are released at https://github.com/your-repo/YOLO-Count.