YOLOv12 Unveiled: Boosted Performance and Speed for Real‑Time Detection
YOLOv12 introduces an attention-centric architecture built around a lightweight area attention module (A2) and the R-ELAN aggregation network, delivering consistent mAP gains and lower latency across the N, S, M, L, and X model scales while surpassing previous YOLO versions and other real-time detectors.
Key Innovations
Attention-centric backbone: Replaces the conventional CNN-only backbone with an attention-driven design. By moving the primary modeling capacity to multi-head attention, the architecture avoids the representational limits of the pure convolutional stacks that dominate earlier YOLO versions.
Efficient area attention module (A2): Divides the H×W input feature map into K equal areas (default K=4) along a single spatial dimension, which requires only a reshape and no explicit window partitioning. Attention is computed independently within each area, cutting the attention cost to roughly 1/K of global attention while each area still attends over a large spatial extent, preserving a sizeable receptive field.
Residual Efficient Layer Aggregation Network (R-ELAN): Introduces a block-level residual shortcut with a learnable scaling factor (default 0.01). Feature aggregation first passes through a transition layer that aligns the channel dimension, then concatenates the attention-enhanced features, forming a bottleneck that cuts compute and memory usage.
Optimized basic attention block: Adjusts the MLP expansion ratio to 2 for the N/S/M scales and 1.2 for the larger scales. Replaces the standard nn.Linear + LayerNorm with nn.Conv2d + BatchNorm to exploit the efficiency of convolutions. Positional encoding is removed; instead, a 7×7 depth-wise separable convolution (the "position perceiver") supplies positional cues, keeping the block light enough for real-time YOLO pipelines.
Implementation Details
Area attention module: For a feature map of shape H×W×C, reshape it into K areas of size (H/K)×W×C (or H×(W/K)×C), dividing along one spatial axis. The attention operation is then applied independently within each area, cutting the quadratic cost by roughly a factor of K. No explicit window partitioning or padding is required.
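A minimal PyTorch sketch of this reshape-based split is below. The class name AreaAttention, the height-axis split, and the use of nn.MultiheadAttention (standing in for the optimized attention kernel the official code uses) are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class AreaAttention(nn.Module):
    """Sketch of area attention: split the feature map into K equal
    areas along the height axis via a reshape, then run self-attention
    independently inside each area. Illustrative, not the official code."""

    def __init__(self, dim: int, num_heads: int = 8, num_areas: int = 4):
        super().__init__()
        self.num_areas = num_areas
        # stand-in for the paper's optimized attention implementation
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        K = self.num_areas
        assert H % K == 0, "H must be divisible by the number of areas"
        # (B, C, H, W) -> (B*K, (H/K)*W, C): one token sequence per area
        t = (x.reshape(B, C, K, H // K, W)
              .permute(0, 2, 3, 4, 1)
              .reshape(B * K, (H // K) * W, C))
        out, _ = self.attn(t, t, t)  # self-attention within each area only
        # invert the reshape back to (B, C, H, W)
        out = (out.reshape(B, K, H // K, W, C)
                  .permute(0, 4, 1, 2, 3)
                  .reshape(B, C, H, W))
        return self.proj(out)
```

With K=4, each attention call sees only a quarter of the tokens, so the quadratic term shrinks by roughly 4× while the spatial layout inside each area is untouched.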
R-ELAN: Adds a residual connection from the block input to its output, scaled by a learnable factor α initialized to ≈0.01. The connection follows the layer-scale pattern but is tuned for the attention branch. A transition layer (a 1×1 convolution that aligns the channel width) first produces an intermediate feature map; the outputs of the subsequent attention-enhanced blocks are then concatenated, yielding a compact bottleneck.
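The sketch below illustrates this structure under stated assumptions: plain conv blocks stand in for the attention-enhanced inner modules, and the names, depth, and hidden width are placeholders rather than the official configuration.

```python
import torch
import torch.nn as nn

class RELANBlock(nn.Module):
    """Sketch of an R-ELAN-style block: a 1x1 transition conv, a chain of
    inner blocks whose outputs are concatenated (bottleneck aggregation),
    and a block-level residual shortcut with a learnable scale (init 0.01).
    Illustrative, not the official implementation."""

    def __init__(self, dim: int, depth: int = 2):
        super().__init__()
        hidden = dim // 2
        self.transition = nn.Conv2d(dim, hidden, kernel_size=1)  # align channels
        # placeholder inner modules (the real blocks are attention-enhanced)
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(hidden, hidden, 3, padding=1),
                          nn.BatchNorm2d(hidden), nn.SiLU())
            for _ in range(depth)
        ])
        # fuse the concatenated features back to the input width
        self.fuse = nn.Conv2d(hidden * (depth + 1), dim, kernel_size=1)
        # layer-scale style learnable residual scaling, initialized to 0.01
        self.alpha = nn.Parameter(torch.full((dim, 1, 1), 0.01))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.transition(x)
        feats = [y]
        for blk in self.blocks:
            y = blk(y)
            feats.append(y)
        out = self.fuse(torch.cat(feats, dim=1))
        return x + self.alpha * out  # scaled block-level residual shortcut
```

The small initial α lets the block start out close to an identity mapping, which is the usual motivation for the layer-scale pattern.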
Architecture refinements: Sets the MLP expansion ratio to 2 for the N/S/M variants and 1.2 for the larger ones. Substitutes nn.Linear + LayerNorm with nn.Conv2d + BatchNorm to reduce memory traffic. Removes explicit sinusoidal or learned positional encodings; instead, a 7×7 depth-wise separable convolution (the "position perceiver") after the attention injects spatial context.
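A minimal sketch of these refinements follows. The class name is hypothetical, a 3×3 conv stands in for the attention operator so the snippet is self-contained, and the residual wiring is an assumption; only the conv/BN MLP, the reduced ratio, and the 7×7 depth-wise separable position perceiver come from the description above.

```python
import torch
import torch.nn as nn

class RefinedAttnBlock(nn.Module):
    """Sketch of the refined attention block: 1x1 nn.Conv2d + BatchNorm in
    place of nn.Linear + LayerNorm, a 7x7 depth-wise separable conv as the
    position perceiver (no positional encoding), and an MLP ratio of
    2 (N/S/M) or 1.2 (L/X). Illustrative, not the official code."""

    def __init__(self, dim: int, mlp_ratio: float = 2.0):
        super().__init__()
        # stand-in for the (area) attention operator
        self.attn = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        # position perceiver: 7x7 depth-wise separable convolution
        self.pos = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim),  # depth-wise
            nn.Conv2d(dim, dim, kernel_size=1),                         # point-wise
        )
        hidden = int(dim * mlp_ratio)  # 2x for N/S/M, 1.2x for L/X
        # MLP built from 1x1 convs + BatchNorm instead of Linear + LayerNorm
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, hidden, kernel_size=1),
            nn.BatchNorm2d(hidden),
            nn.SiLU(),
            nn.Conv2d(hidden, dim, kernel_size=1),
            nn.BatchNorm2d(dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.pos(self.attn(x))  # attention output plus positional cues
        return x + self.mlp(x)          # conv/BN MLP with reduced ratio
```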
Experimental Results
N-scale (YOLOv12-N): Achieves mAP improvements of +3.6 % over YOLOv6-3.0-N, +3.3 % over YOLOv8-N, +2.1 % over YOLOv10-N, and +1.2 % over YOLOv11-N, while keeping FLOPs and parameter count comparable. Inference latency is 1.64 ms per image.
S-scale (YOLOv12-S): Runs at 2.61 ms per image with 21.4 GFLOPs and 9.3 M parameters, reaching 48.0 mAP: +3.0 % over YOLOv8-S, +1.2 % over YOLOv9-S, +1.7 % over YOLOv10-S, and +1.1 % over YOLOv11-S, with similar or lower compute. Compared with the end-to-end detectors RT-DETR-R18 and RT-DETRv2-R18, YOLOv12-S delivers comparable accuracy at higher speed with fewer parameters.
M-scale (YOLOv12-M): With 67.5 GFLOPs and 20.2 M parameters, it attains 52.5 mAP at 4.86 ms per image, surpassing Gold-YOLO-M, YOLOv8-M, YOLOv9-M, YOLOv10-M, YOLOv11-M, and RT-DETR-R34 / RT-DETRv2-R34.
L-scale (YOLOv12-L): Uses 31.4 fewer GFLOPs than YOLOv10-L, improves mAP by +0.4 % over YOLOv11-L, and outperforms RT-DETR-R50 / RT-DETRv2-R50 with 34.6 % fewer FLOPs, 37.1 % fewer parameters, and lower latency.
X-scale (YOLOv12-X): Beats YOLOv10-X and YOLOv11-X by +0.8 % and +0.6 % mAP respectively, while keeping speed, FLOPs, and parameter count on par. It also outperforms RT-DETR-R101 / RT-DETRv2-R101 with 23.4 % fewer FLOPs and 22.2 % fewer parameters.
FP32 precision effect: When models are saved and evaluated in full-precision FP32, an additional ~0.2 % mAP gain is observed, yielding reported mAP values of 53.9 % for the L scale and 55.4 % for the X scale.
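As a rough illustration of this kind of evaluation setup, the sketch below assumes the ultralytics-style API that the YOLOv12 repository builds on; the checkpoint filename is illustrative, and this is not the paper's exact evaluation script.

```python
# Hedged sketch: full-precision (FP32) COCO validation, assuming an
# ultralytics-style API; "yolov12x.pt" is an illustrative checkpoint name.
from ultralytics import YOLO

model = YOLO("yolov12x.pt")                        # load X-scale weights
metrics = model.val(data="coco.yaml", half=False)  # half=False keeps FP32 eval
print(metrics.box.map)                             # COCO mAP50-95
```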
References
Paper: https://arxiv.org/abs/2502.12524
Code repository: https://github.com/sunsmarterjie/yolov12