How HGO‑YOLO Achieves 87.4% Accuracy at 56 FPS with Only 4.6 MB Parameters

This paper presents HGO‑YOLO, a lightweight real‑time anomaly‑behavior detector that integrates HGNetv2 and GhostConv into YOLOv8, achieving 87.4% mAP with just 4.6 MB of parameters and 56 FPS on CPU, and validates its performance across multiple datasets and hardware platforms.


Introduction

Accurate and real‑time object detection is crucial for anomaly‑behavior monitoring, especially on hardware‑constrained devices. The authors propose HGO‑YOLO, which embeds the HGNetv2 backbone into YOLOv8 and introduces a lightweight detection head (OptiConvDetect) to balance precision and speed.

Related Work

Prior work improves multi‑scale detection via feature‑pyramid networks (e.g., PANet, QAFPN) and lightweight backbones such as MobileNet and ShuffleNetv2. However, these methods often increase computational cost or reduce representational capacity. GhostConv, introduced by Han et al. [16], reduces parameters by generating a portion of the feature maps ("ghost" features) with cheap linear operations rather than full convolutions.

Method

YOLOv8 Overview

YOLOv8 consists of a Backbone (C2f blocks with SPPF), an FPN/PAN‑style Neck, and a decoupled Head that separates the classification and regression branches, using Distribution Focal Loss (DFL) for box regression.
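To make the regression branch concrete, here is a minimal sketch of Distribution Focal Loss as used in YOLOv8‑style heads: each box edge is predicted as a discrete distribution over `reg_max + 1` bins, and the loss pushes probability mass onto the two integer bins surrounding the continuous target. The function name and shapes are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def distribution_focal_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Sketch of Distribution Focal Loss (DFL).

    `logits` has shape (N, reg_max + 1): a discrete distribution over bins
    for one box edge. `target` holds continuous edge offsets in [0, reg_max].
    DFL is a weighted cross-entropy on the two integer bins around the target.
    """
    left = target.long()                      # lower bin index
    right = left + 1                          # upper bin index
    w_left = right.float() - target           # weight toward the lower bin
    w_right = target - left.float()           # weight toward the upper bin
    loss = (F.cross_entropy(logits, left, reduction="none") * w_left
            + F.cross_entropy(logits, right, reduction="none") * w_right)
    return loss.mean()

logits = torch.randn(4, 17)                   # reg_max = 16, as in YOLOv8
target = torch.tensor([3.2, 7.9, 0.5, 15.0])
print(distribution_focal_loss(logits, target).item())
```

A target of exactly 3.2 puts 0.8 of the weight on bin 3 and 0.2 on bin 4, so the predicted distribution is encouraged to be sharp around the true edge.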

HGO‑YOLO Architecture

HGO‑YOLO replaces the standard YOLOv8 backbone with HGNetv2, which stacks HGBlocks of varying filter sizes to capture hierarchical multi‑scale features. GhostConv replaces the standard Conv in selected layers, cutting computation from 8.9 GFLOPs to 4.3 GFLOPs while preserving accuracy. The detection head, OptiConvDetect, shares a single PConv layer across the three scale heads, followed by a 1×1 Conv (Convpt) for each branch, drastically reducing parameters.

Key Components

HGNetv2: hierarchical backbone that aggregates multi‑scale features through stacked HGBlocks, enhancing robustness.

GhostConv: generates intrinsic feature maps with a standard convolution, then applies cheap linear operations to produce the remaining "ghost" features; computational cost drops to roughly 1⁄s of a standard Conv.
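The GhostConv idea can be sketched in a few lines of PyTorch for the common s = 2 case: half the output channels come from a regular convolution, and the other half are produced from those by a cheap depthwise convolution. The 5×5 depthwise kernel below follows the common open-source pattern; the exact layer choices inside HGO‑YOLO are an assumption here.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Sketch of GhostConv (Han et al.): half the output channels come from a
    standard conv ("intrinsic" features), the other half from a cheap
    depthwise conv applied to them ("ghost" features). Here s = 2."""

    def __init__(self, c_in: int, c_out: int, k: int = 1, stride: int = 1):
        super().__init__()
        c_hidden = c_out // 2
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_hidden, k, stride, k // 2, bias=False),
            nn.BatchNorm2d(c_hidden), nn.SiLU())
        self.cheap = nn.Sequential(  # depthwise 5x5: the "cheap linear op"
            nn.Conv2d(c_hidden, c_hidden, 5, 1, 2, groups=c_hidden, bias=False),
            nn.BatchNorm2d(c_hidden), nn.SiLU())

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

x = torch.randn(1, 64, 32, 32)
print(GhostConv(64, 128)(x).shape)  # torch.Size([1, 128, 32, 32])
```

Because the depthwise convolution has far fewer multiply-adds than a full convolution over all channels, the module approaches half the cost of a standard Conv with the same output width.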

OptiConvDetect: parameter‑shared detection head that decouples classification and regression while reusing a single PConv, achieving lower GFLOPs and higher FPS.
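The parameter-sharing idea can be illustrated with a hypothetical sketch: one partial convolution (PConv, in the FasterNet style, where the 3×3 kernel touches only a fraction of the channels) is reused across all three detection scales, followed by per-task 1×1 convolutions. The class names, the 1/4 channel ratio, and the assumption that all scales share the same channel width are illustrative choices, not the paper's exact design.

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial convolution (FasterNet-style): a 3x3 conv touches only the
    first c*ratio channels; the rest pass through untouched."""
    def __init__(self, c: int, ratio: float = 0.25):
        super().__init__()
        self.c_conv = int(c * ratio)
        self.conv = nn.Conv2d(self.c_conv, self.c_conv, 3, 1, 1, bias=False)

    def forward(self, x):
        x1, x2 = x[:, :self.c_conv], x[:, self.c_conv:]
        return torch.cat([self.conv(x1), x2], dim=1)

class SharedHead(nn.Module):
    """Hypothetical sketch of a parameter-shared decoupled head: one PConv
    reused for every scale, then per-task 1x1 convs (cls / box)."""
    def __init__(self, c: int, num_classes: int, reg_max: int = 16):
        super().__init__()
        self.shared = PConv(c)                         # shared across all scales
        self.cls = nn.Conv2d(c, num_classes, 1)        # classification branch
        self.box = nn.Conv2d(c, 4 * (reg_max + 1), 1)  # DFL regression branch

    def forward(self, feats):
        outs = []
        for f in feats:                                # e.g. P3, P4, P5 maps
            h = self.shared(f)
            outs.append((self.cls(h), self.box(h)))
        return outs

feats = [torch.randn(1, 64, s, s) for s in (80, 40, 20)]
cls_out, box_out = SharedHead(64, num_classes=4)(feats)[0]
print(cls_out.shape, box_out.shape)
```

Sharing the heavy 3×3 stage across scales means its parameters are counted once rather than three times, which is the mechanism behind the head's parameter savings.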

Experiments

Datasets

Six public datasets covering fall, fight, and smoking scenarios were merged, yielding 10,201 images (4,065 fall, 3,224 fight, 2,912 smoke). Data were split 8:1:1 for training, validation, and testing.
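An 8:1:1 split of 10,201 images works out to 8,160 training, 1,020 validation, and 1,021 test images (the remainder lands in the test set). A minimal, self-contained way to produce such a split; the function name and seed are illustrative:

```python
import random

def split_indices(n: int, ratios=(0.8, 0.1, 0.1), seed: int = 0):
    """Shuffle n sample indices and split them 8:1:1 (train/val/test).
    Any rounding remainder falls into the test split."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train, val, test = split_indices(10201)
print(len(train), len(val), len(test))  # 8160 1020 1021
```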

Setup

All models (YOLOv5–YOLOv11, RT‑DETR, and HGO‑YOLO variants) were implemented in PyTorch and evaluated on an Intel Xeon Silver 4310 CPU and an NVIDIA A100 80 GB GPU under Ubuntu 20.04.2. Training used 200 epochs and batch size 32, with hyper‑parameters taken from the original repositories.

Metrics

mAP, FPS, and GFLOPs were used. mAP measures average precision across classes at a given IoU threshold; FPS measures inference throughput; GFLOPs indicate computational cost per forward pass.
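The IoU threshold underlying mAP is a simple box-overlap ratio. A minimal reference implementation for axis-aligned boxes in `(x1, y1, x2, y2)` form:

```python
def iou(a, b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection top-left
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])   # intersection bottom-right
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7 ≈ 0.1429
```

A detection counts as a true positive when its IoU with a ground-truth box exceeds the chosen threshold (commonly 0.5); mAP then averages precision over recall levels and classes.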

Ablation Study

Five configurations were compared (Table 1): H‑YOLO (HGNetv2 only), HG‑YOLO (HGNetv2 + GhostConv), O‑YOLO (OptiConvDetect only), HO‑YOLO (HGNetv2 + OptiConvDetect), and HGO‑YOLO (all three). HGO‑YOLO achieved the best trade‑off: 56 FPS, 87.4% mAP, with per‑class mAP of 85.1 (fall), 93.1 (fight), 75.4 (smoke), 96.0 (person).

Scale Analysis

Across model sizes (YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, YOLOv8x), HGO‑YOLO consistently improved mAP by 2.8‑3.7 points and reduced GFLOPs, demonstrating scalability.

Loss Function Comparison

Four IoU‑based losses were evaluated: DIoU, CIoU, MPDIoU, and Inner‑CIoU. MPDIoU yielded the highest mAP (value omitted in source) and was adopted for HGO‑YOLO [24].
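For context, MPDIoU penalizes the squared distances between the two boxes' top-left and bottom-right corner points, normalized by the input image's squared diagonal. The sketch below follows the published MPDIoU formulation; the function signature and epsilon are illustrative, and the exact variant used in HGO‑YOLO may differ.

```python
import torch

def mpdiou_loss(pred: torch.Tensor, gt: torch.Tensor,
                img_w: int, img_h: int) -> torch.Tensor:
    """Sketch of MPDIoU loss: IoU minus the squared top-left and bottom-right
    corner distances, each normalized by (img_w^2 + img_h^2).
    Boxes are (x1, y1, x2, y2) tensors."""
    ix1 = torch.max(pred[..., 0], gt[..., 0])
    iy1 = torch.max(pred[..., 1], gt[..., 1])
    ix2 = torch.min(pred[..., 2], gt[..., 2])
    iy2 = torch.min(pred[..., 3], gt[..., 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_g = (gt[..., 2] - gt[..., 0]) * (gt[..., 3] - gt[..., 1])
    iou = inter / (area_p + area_g - inter + 1e-7)
    d1 = (pred[..., 0] - gt[..., 0]) ** 2 + (pred[..., 1] - gt[..., 1]) ** 2
    d2 = (pred[..., 2] - gt[..., 2]) ** 2 + (pred[..., 3] - gt[..., 3]) ** 2
    norm = img_w ** 2 + img_h ** 2
    return 1.0 - (iou - d1 / norm - d2 / norm)

pred = torch.tensor([10., 10., 50., 50.])
gt = torch.tensor([10., 10., 50., 50.])
print(mpdiou_loss(pred, gt, 640, 640).item())  # ~0.0 for a perfect match
```

Because both corner terms vanish only when the boxes coincide exactly, MPDIoU distinguishes predictions that share the same IoU but differ in position or aspect ratio.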

Device Tests

Real‑time inference on Raspberry Pi 4 and NVIDIA devices showed HGO‑YOLO outperforming YOLOv8, reaching 33.33 FPS on the NVIDIA platform.

Visualization

Qualitative comparisons (Figures 7‑9) illustrate that HGO‑YOLO produces higher confidence detections for falls, fights, and smoke, even under low‑light or occlusion, while reducing false positives.

Conclusion

HGO‑YOLO delivers superior accuracy (87.4% mAP) and speed (56 FPS) with a tiny 4.6 MB footprint, making it suitable for edge devices. Limitations include reduced performance on very small smoke targets and potential degradation on extremely low‑resource hardware. Future work will focus on further optimization for diverse deployment scenarios.

Tags: computer vision, object detection, Edge AI, anomaly detection, Lightweight Models, YOLO
Written by

AIWalker

Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.
