How EFSI‑DETR Achieves 188 FPS and Boosts Small‑Object Detection Accuracy by 5.8%

The article dissects EFSI‑DETR, a UAV small‑object detector that combines simulated frequency processing with dynamic semantic enhancement to overcome pixel scarcity, static fusion, and ignored frequency cues, delivering 188 FPS and a 5.8% APₛ gain on VisDrone while remaining lightweight.

Why Small‑Object Detection on UAVs Is Hard

When a drone monitors power lines, targets such as insulators or birds may occupy only a few to dozens of pixels, causing models to miss them or mistake them for noise. On the VisDrone benchmark, even the strongest YOLOv12‑X reaches only 17.9% APₛ for small objects, meaning more than 80% of tiny instances are invisible to the detector.

The authors identify three root causes:

Pixel scarcity: aggressive down‑sampling in CNN backbones erodes the already limited pixel information of tiny objects.

Static multi‑scale fusion: methods like FPN or PANet use fixed fusion weights that cannot adapt to the highly variable drone imagery.

Neglected frequency cues: high‑frequency texture and edge information, crucial for tiny objects, is ignored because most pipelines operate solely in the spatial domain.

Attempts to insert explicit FFT‑based frequency transforms introduce three severe drawbacks—kernel incompatibility, massive memory traffic, and poor deployment compatibility—making real‑time inference impossible.

Core Architecture: The Three‑Module "Troika" Design

EFSI‑DETR solves the above issues with three tightly coupled modules:

Dynamic Frequency‑Spatial Fusion Network (DyFusNet)

Efficient Semantic Feature Concentrator (ESFC)

Fine‑Grained Feature Retention (FFR) strategy

DyFusNet – Simulated Frequency Processing Without FFT

Instead of a costly FFT, DyFusNet learns spatial operators that mimic low‑, mid‑, and high‑frequency filters. The Dynamic Multi‑Resolution Spectral Decomposition (DMSD) module splits the input feature map into three parallel paths:

Low‑frequency path: an AvgPool layer acts as a low‑pass filter, capturing smooth global structure.

Mid‑frequency path: an Identity mapping preserves the original feature untouched.

High‑frequency path: a depth‑wise convolution (Conv_dw) serves as a high‑pass filter, extracting edges and fine textures.

The three outputs are fused by a lightweight attention network (global average pooling + two‑layer MLP) that generates dynamic weights conditioned on the input content. Consequently, regions with dense, complex textures receive higher weight on the high‑frequency path, while smooth sky or water areas rely more on the low‑frequency path—an adaptability impossible for static filters or FFT pipelines.
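To make the idea concrete, here is a minimal PyTorch sketch of a DMSD‑style block following the description above; the kernel sizes, the MLP reduction ratio, and the scalar per‑path weights are illustrative assumptions rather than the paper's exact configuration.

```python
import torch.nn as nn

class DMSD(nn.Module):
    """Sketch of Dynamic Multi-Resolution Spectral Decomposition.

    Three parallel paths emulate low-, mid-, and high-frequency filters;
    a tiny gate (GAP + two-layer MLP) produces content-conditioned weights
    that fuse them. Hyper-parameters are illustrative, not the paper's.
    """

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.low = nn.AvgPool2d(3, stride=1, padding=1)          # low-pass surrogate
        self.mid = nn.Identity()                                  # untouched features
        self.high = nn.Conv2d(channels, channels, 3,
                              padding=1, groups=channels)         # depth-wise high-pass
        self.gate = nn.Sequential(                                # GAP + two-layer MLP
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, 3), nn.Softmax(dim=1),
        )

    def forward(self, x):
        w = self.gate(x)                                          # (B, 3) dynamic weights
        paths = [self.low(x), self.mid(x), self.high(x)]
        return sum(w[:, i].view(-1, 1, 1, 1) * p for i, p in enumerate(paths))
```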

Spatial‑Frequency Collaborative Modulation (SFCM)

After DMSD, features enter the SFCM module, which first aggregates spatial context via parallel depth‑wise convolutions of varying receptive fields, then applies a channel‑attention mechanism to re‑weight channels based on global statistics. This enhances task‑relevant channels and suppresses noisy ones, crucial for the high background‑object confusion in drone footage.
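Under the same caveat, a rough sketch of the SFCM layout could look like the following; the kernel sizes of the parallel depth‑wise convolutions and the squeeze‑and‑excitation‑style channel gate are assumptions for illustration.

```python
import torch.nn as nn

class SFCM(nn.Module):
    """Sketch of Spatial-Frequency Collaborative Modulation.

    Parallel depth-wise convolutions with different receptive fields gather
    spatial context, then a channel gate built from global statistics
    re-weights channels to suppress noisy ones.
    """

    def __init__(self, channels, kernel_sizes=(3, 5, 7), reduction=4):
        super().__init__()
        self.spatial = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        )
        self.fuse = nn.Conv2d(channels, channels, 1)              # merge multi-scale context
        self.channel_gate = nn.Sequential(                        # channel attention from GAP
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        ctx = self.fuse(sum(conv(x) for conv in self.spatial))    # aggregated spatial context
        return ctx * self.channel_gate(ctx)                       # re-weight channels
```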

Overall, DyFusNet = DMSD + SFCM. By routing part of the channels through the frequency pipeline and keeping the rest unchanged, the design injects frequency priors while keeping computational overhead low.
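Reusing the DMSD and SFCM sketches above, the split‑route‑merge wiring might look roughly as follows; the 50/50 channel split is an assumption, not a value reported in the paper.

```python
import torch
import torch.nn as nn

class DyFusBlock(nn.Module):
    """Sketch of DyFusNet's partial-channel routing through DMSD + SFCM."""

    def __init__(self, channels, ratio=0.5):
        super().__init__()
        self.freq_ch = int(channels * ratio)                      # channels sent to the frequency branch
        self.branch = nn.Sequential(DMSD(self.freq_ch), SFCM(self.freq_ch))

    def forward(self, x):
        x_freq, x_rest = torch.split(
            x, [self.freq_ch, x.size(1) - self.freq_ch], dim=1)
        return torch.cat([self.branch(x_freq), x_rest], dim=1)    # merge processed and untouched channels
```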

ESFC – Lightweight Semantic Expert

ESFC extracts high‑level semantics efficiently. Its centerpiece is the Dynamic Expert Convolution (DEConv), which maintains three expert kernels (the optimal number found in experiments). A tiny attention gate selects and blends these experts per input, allowing the network to decide which convolution is most effective for the current feature slice.
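A minimal sketch of the DEConv idea is shown below, assuming identically shaped 3×3 experts and a softmax gate; the paper's experts may differ in structure.

```python
import torch
import torch.nn as nn

class DEConv(nn.Module):
    """Sketch of Dynamic Expert Convolution with three expert kernels.

    A tiny gate produces per-input mixing weights over the experts and the
    expert outputs are blended accordingly.
    """

    def __init__(self, in_ch, out_ch, num_experts=3):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=1) for _ in range(num_experts)
        )
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, num_experts), nn.Softmax(dim=1),
        )

    def forward(self, x):
        w = self.gate(x)                                           # (B, E) expert weights
        outs = torch.stack([e(x) for e in self.experts], dim=1)    # (B, E, C, H, W)
        return (w.view(*w.shape, 1, 1, 1) * outs).sum(dim=1)       # weighted blend of experts
```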

To offset the extra cost, ESFC incorporates an Efficient Ghost Block (EGBlock): a cheap convolution first generates a subset of features, then a depth‑wise convolution "ghost‑produces" additional maps, dramatically expanding representational capacity with minimal parameters.
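The EGBlock follows the familiar Ghost‑module pattern; a minimal sketch, assuming a 50/50 split between cheaply generated and "ghosted" channels:

```python
import torch
import torch.nn as nn

class EGBlock(nn.Module):
    """Sketch of the Efficient Ghost Block.

    A cheap 1x1 convolution produces part of the output channels, then a
    depth-wise convolution 'ghosts' the remaining channels from them.
    """

    def __init__(self, in_ch, out_ch):
        super().__init__()
        primary_ch = out_ch // 2
        self.primary = nn.Sequential(                              # cheap primary features
            nn.Conv2d(in_ch, primary_ch, 1, bias=False),
            nn.BatchNorm2d(primary_ch), nn.ReLU(inplace=True),
        )
        self.ghost = nn.Sequential(                                # depth-wise ghost features
            nn.Conv2d(primary_ch, out_ch - primary_ch, 3, padding=1,
                      groups=primary_ch, bias=False),
            nn.BatchNorm2d(out_ch - primary_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.ghost(y)], dim=1)
```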

Finally, the Dual‑Domain Guided Aggregation (DGA) module computes attention maps in both channel and spatial dimensions, finely modulating features so that the most discriminative regions receive stronger emphasis.
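The exact composition of DGA is not spelled out in this summary, so the sketch below assumes a CBAM‑like layout in which channel attention is applied first and a 7×7 spatial attention map second.

```python
import torch
import torch.nn as nn

class DGA(nn.Module):
    """Sketch of Dual-Domain Guided Aggregation (channel + spatial attention)."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.channel_att = nn.Sequential(                          # channel attention from GAP
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
        self.spatial_att = nn.Sequential(                          # spatial attention over mean/max maps
            nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid(),
        )

    def forward(self, x):
        x = x * self.channel_att(x)
        stats = torch.cat([x.mean(dim=1, keepdim=True),
                           x.max(dim=1, keepdim=True).values], dim=1)
        return x * self.spatial_att(stats)
```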

FFR Strategy – Preserving Pixel‑Level Detail

The backbone of RT‑DETR downsamples heavily, discarding fine details needed for tiny objects. FFR directly injects shallow, high‑resolution features (e.g., from early stages of the backbone) into the hybrid encoder and discards the deepest, coarsest feature maps in the decoder, relying instead on intermediate layers that retain spatial detail.

This design ensures that pixel‑level cues survive all the way to the detection head, addressing the core limitation of conventional DETR‑style detectors.
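As a rough illustration of the FFR feature selection, here is a sketch over generic pyramid levels; the level names (P2–P5) and the exact keep/drop choice are placeholders, not the paper's configuration.

```python
# Sketch of the FFR selection idea: keep a shallow, high-resolution level and
# the intermediate levels, and drop the deepest, coarsest one before the
# hybrid encoder. Level names P2..P5 are assumptions for illustration.
def select_ffr_features(backbone_feats):
    """backbone_feats: dict such as {'P2': t2, 'P3': t3, 'P4': t4, 'P5': t5}."""
    keep = ['P2', 'P3', 'P4']        # shallow detail + intermediate semantics
    return [backbone_feats[k] for k in keep if k in backbone_feats]
```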

Experimental Validation

Extensive tests on the VisDrone and CODrone UAV datasets confirm the approach.

SOTA Comparison on VisDrone

EFSI‑DETR achieves 33.1% overall AP, surpassing the previous best YOLOv12‑X by 5.0%. More strikingly, small‑object APₛ reaches 24.8%, a 6.9% absolute (≈40% relative) improvement over YOLOv12‑X’s 17.9%.

Inference runs at 188 FPS (5.3 ms per image) with 27.3 M parameters, faster and lighter than the comparable RemDet‑L (7.1 ms, 35.3 M parameters) while delivering higher accuracy.

On CODrone, EFSI‑DETR also leads in AP, AP₅₀, and APₛ across all YOLO and DETR families, demonstrating strong generalisation.

Ablation Studies

Using RT‑DETR‑R18 as a baseline, adding the FFR strategy alone lifts AP by 4.4% and APₛ by 4.9%, confirming the critical role of shallow detail preservation.

Appending DyFusNet brings an extra 1.4% AP gain, and ESFC adds a further 0.4% AP while reducing the model size by 1.5 M parameters.

Varying the number of DEConv experts shows that three experts strike the best trade‑off between performance and efficiency; more experts cause redundancy, fewer hurt representation power.

Two FFR designs were compared: keeping high‑level features versus discarding them. Removing the high‑level branch improves speed but degrades accuracy, overturning the naive belief that “more features always help” and highlighting that precise fine‑grained detail outweighs coarse semantics for tiny objects.

Objective Evaluation and Outlook

The method shows limited gains on large‑object APₗ, which the authors attribute to the architecture’s focus on preserving fine detail rather than modeling extensive context. This trade‑off is acceptable for UAV scenarios where tiny targets dominate.

Key takeaways include:

Prioritising deployment‑friendly designs over flashy FFT tricks.

Protecting pixel‑level information with the FFR strategy.

Employing dynamic, content‑aware weighting throughout the network.

Future work may explore even more efficient adaptive multi‑scale fusion mechanisms that further balance speed and accuracy across object scales.

Conclusion

EFSI‑DETR demonstrates a clear path to high‑performance, real‑time small‑object detection: dynamic frequency‑spatial fusion extracts discriminative features, the efficient semantic module refines high‑level cues, and the fine‑grained retention strategy safeguards essential pixel detail. On VisDrone, the model lifts small‑object AP by 5.8% while running at 188 FPS, offering a powerful tool for UAV security, inspection, and agricultural monitoring.

Figures: EFSI‑DETR core architecture diagram · DyFusNet overall architecture · VisDrone SOTA comparison table · CODrone performance table · ablation study results · qualitative ablation visualization · challenge vs. solution overview.

Reference: EFSI‑DETR: Efficient Frequency‑Semantic Integration for Real‑Time Small Object Detection in UAV Imagery.

Tags: small object detection · real-time inference · DETR · dynamic fusion · frequency domain processing · UAV vision
Written by AIWalker: focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.
