From Theory to Practice: Reproducing YOLOv1 – A Step‑by‑Step Guide for Traditional Programmers

This article provides a comprehensive, hands‑on walkthrough of YOLOv1—from its single‑stage detection principles and core architectural questions to a full PyTorch implementation, training pipeline, common pitfalls, and a live camera demo—targeted at developers transitioning into AI.

xkx's Tech General Store

Inspired by the book "YOLO Object Detection," the author, a traditional software developer moving into AI, documents a complete end‑to‑end recreation of YOLOv1, linking theory with practical experiments on a GPU cloud service.

Core Framework and Key Questions

YOLOv1 introduced an end‑to‑end single‑stage detector that replaces the classic two‑step "region proposal + classification" pipeline with a direct mapping from image to detection results using convolutional and fully‑connected layers. The author clarifies the following fundamental points:

Why YOLOv1 formulates detection as a regression problem: it predicts bounding‑box coordinates and class probabilities for each of the 7×7 grid cells, treating box parameters as continuous values to be regressed.

The role of the 24 convolutional layers: they act as a generic visual feature extractor, progressively capturing edges, textures, and object structures, ultimately producing a 7×7×1024 feature map.

The purpose of the fully-connected layers: they flatten the feature map and learn a mapping from the extracted features to the 7×7×30 output tensor. Each grid cell's 30 values are 2 predicted boxes × (4 box coordinates + 1 confidence score) plus 20 class probabilities.

Input preprocessing: ground‑truth boxes are converted to grid‑relative offsets so the model can learn to predict them.

Forward vs. backward propagation: forward passes use only the input image, while backward passes require labels to compute loss.

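When NMS is applied: only during inference, as a post-processing step that suppresses duplicate detections. It is omitted during training because the loss is computed directly between the raw per-cell predictions and the labels, and discarding predictions would remove them from the gradient computation.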
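To make the inference-only NMS step concrete, here is a minimal greedy NMS sketch in plain Python. It is not taken from the author's repository: the corner box format, the function names, and the 0.5 IoU threshold are all assumptions chosen for illustration, and a real detector would normally run this per class.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed corner format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two corner-format boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: return the indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two near-duplicate detections and one separate detection.
boxes = [(10, 10, 110, 110), (12, 12, 112, 112), (200, 200, 260, 260)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

The sorting and filtering here involve no differentiable operations, which is one way to see why the step belongs after the network rather than inside the training loss.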

Engineering the YOLOv1 Implementation

The code repository is https://gitee.com/crystonesc/pytorch_yolov1-main. The author adopts an improved architecture that replaces the original 24 convolutions with a ResNet‑18 backbone and adds a Spatial Pyramid Pooling (SPP) neck, following the rationale that ResNet offers stronger feature extraction and SPP provides multi‑scale context without a large computational overhead.

Project structure (core directories):

backbone/ : ResNet-18 implementation, outputting a 512-channel feature map (C5).

models/ : basic convolution blocks, SPP module, YOLOv1 model definition, loss computation.

data/ : VOC/COCO dataset wrappers with augmentation.

Root files: train.py (training entry), eval.py (evaluation), detector.py (inference).

Model assembly steps:

Backbone : ResNet‑18 extracts stride‑32 features (e.g., 14×14×512 for a 448×448 input).

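Neck : SPP applies three stride-1, padded max-poolings (kernel sizes 5×5, 9×9, 13×13) to the 14×14×512 map, each preserving the 14×14 spatial size. Concatenating the three pooled maps with the input gives 2048 channels, and a 1×1 convolution then reduces them back to 512, adding multi-scale context.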

Detection head : a series of 1×1 reduction and 3×3 feature‑extraction convolutions culminate in a final 1×1 convolution that predicts per‑grid confidence, 20 class probabilities, and 4 box parameters (tx, ty, tw, th). No anchors are used, preserving the original YOLOv1 style.

Grid coordinate mapping : during inference, a pre-generated matrix of cell indices (e.g., 14×14 (x, y) pairs) is added to the predicted center offsets, and the sum is multiplied by the stride (32) to recover box coordinates at the scale of the input image.
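The assembly steps above can be traced with a little shape arithmetic and a decode function. The sketch below is plain Python, not the repository's code. It assumes the center offsets pass through a sigmoid and that width and height are predicted as the log of the size in input-image pixels; the text does not confirm either detail, and all function names are hypothetical.

```python
import math

INPUT_SIZE = 448   # example input size from the text
STRIDE = 32        # stride of the ResNet-18 C5 feature map
NUM_CLASSES = 20   # VOC classes


def pooled_size(n: int, kernel: int, stride: int = 1) -> int:
    """Output length of a max-pool that pads by kernel // 2 on each side."""
    padding = kernel // 2
    return (n + 2 * padding - kernel) // stride + 1


grid_size = INPUT_SIZE // STRIDE                             # backbone output side
spp_sizes = [pooled_size(grid_size, k) for k in (5, 9, 13)]  # each pooling keeps the size
spp_channels = 512 * (1 + len(spp_sizes))                    # input plus three pooled copies
head_channels = 1 + NUM_CLASSES + 4                          # confidence + classes + box


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(tx, ty, tw, th, col, row, stride=STRIDE):
    """Map one cell's raw outputs to (x1, y1, x2, y2) in input-image pixels.

    Assumption: sigmoid on the center offsets, exp on the log-scaled size.
    """
    cx = (sigmoid(tx) + col) * stride
    cy = (sigmoid(ty) + row) * stride
    w, h = math.exp(tw), math.exp(th)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


# A 64×32 box centered in cell (col=3, row=4).
box = decode_box(0.0, 0.0, math.log(64), math.log(32), col=3, row=4)
print(grid_size, spp_sizes, spp_channels, head_channels, box)
```

Under these assumptions the head emits 25 channels per cell, and a zero offset places the box center in the middle of its cell.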

Training pipeline:

Label generation converts each ground-truth box into a grid-cell assignment, a cell-relative center offset, and log-scaled width and height, so that each grid cell is assigned at most one object.
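A minimal sketch of this label generation, in plain Python, might look as follows. The function names are hypothetical, and two details are assumptions rather than facts from the text: boxes are given in input-image pixels, and when two objects fall into the same cell the later one overwrites the earlier one.

```python
import math
from typing import Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in input-image pixels


def encode_box(box: Box, stride: int = 32) -> Optional[tuple]:
    """Return (col, row, tx, ty, tw, th) for one box, or None if it is degenerate."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    if w <= 0 or h <= 0:
        return None
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    col, row = int(cx // stride), int(cy // stride)   # cell containing the center
    tx, ty = cx / stride - col, cy / stride - row     # offset inside the cell, in [0, 1)
    return col, row, tx, ty, math.log(w), math.log(h)  # log-scaled size


def build_targets(boxes: List[Box], labels: List[int], stride: int = 32) -> Dict:
    """Map (row, col) to the target of the single object assigned to that cell."""
    targets: Dict[Tuple[int, int], dict] = {}
    for box, label in zip(boxes, labels):
        encoded = encode_box(box, stride)
        if encoded is None:
            continue
        col, row, tx, ty, tw, th = encoded
        # Assumption: a later box in the same cell replaces the earlier one.
        targets[(row, col)] = {"label": label, "txty": (tx, ty), "twth": (tw, th)}
    return targets


boxes = [(80, 128, 144, 160), (100, 130, 130, 150), (300, 300, 400, 420)]
labels = [1, 2, 3]
targets = build_targets(boxes, labels)
print(sorted(targets))
```

The first two boxes share a cell, so only one of them survives in the target map. This is the one-object-per-cell limitation that makes dense scenes hard for this style of detector.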

Loss is computed in three parts:

Confidence loss (object vs. no‑object) with weighting.

Class loss, calculated only for grid cells containing objects.

Box loss using BCE + MSE, with a higher weight for small objects so that their errors are not overwhelmed by those of large objects.
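The three loss parts can be sketched for a single grid cell in plain Python. This is an illustration under stated assumptions, not the repository's loss. The specific choices here are guesses: squared error for confidence, cross-entropy for the class, BCE on the center offsets, MSE on the log-scaled size, the weights 1.0 and 0.5, and the small-object factor `2 - w*h / image_area`. They are consistent with the description, but the text does not spell them out.

```python
import math

EPS = 1e-7


def bce(p: float, t: float) -> float:
    """Binary cross-entropy between a probability p and a target t in [0, 1]."""
    p = min(max(p, EPS), 1.0 - EPS)
    return -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))


def box_scale_weight(w: float, h: float, input_size: int = 448) -> float:
    """Hypothetical weight that grows as the box gets smaller (range 1 to 2)."""
    return 2.0 - (w * h) / (input_size * input_size)


def cell_loss(pred: dict, target: dict = None,
              obj_weight: float = 1.0, noobj_weight: float = 0.5) -> float:
    """Loss for one cell; target is None when the cell contains no object."""
    if target is None:
        # No-object cell: only the confidence is penalized, pushed toward 0.
        return noobj_weight * pred["conf"] ** 2
    conf_loss = obj_weight * (pred["conf"] - 1.0) ** 2
    cls_loss = -math.log(max(pred["class_probs"][target["label"]], EPS))
    txty_loss = sum(bce(p, t) for p, t in zip(pred["txty"], target["txty"]))
    twth_loss = sum((p - t) ** 2 for p, t in zip(pred["twth"], target["twth"]))
    return conf_loss + cls_loss + target["weight"] * (txty_loss + twth_loss)


target = {"label": 1, "txty": (0.5, 0.5),
          "twth": (math.log(64), math.log(32)),
          "weight": box_scale_weight(64, 32)}
good = {"conf": 0.9, "class_probs": [0.05, 0.9, 0.05],
        "txty": (0.5, 0.5), "twth": target["twth"]}
bad = {"conf": 0.2, "class_probs": [0.8, 0.1, 0.1],
       "txty": (0.9, 0.1), "twth": (math.log(200), math.log(200))}
print(cell_loss(good, target) < cell_loss(bad, target))
```

Note that BCE with a soft target such as 0.5 does not reach zero even for a perfect prediction, so the total loss has a nonzero floor. Only the ordering between better and worse predictions matters for training.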

Practical Pitfalls

The author encountered compatibility issues among Python 3.10, PyTorch, and CUDA versions, which were resolved by adapting the environment.
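When chasing this kind of mismatch, it can help to print the relevant versions before changing anything. The helper below is a hypothetical sketch, not part of the author's repository. It reports the Python version and, if PyTorch can be imported, the PyTorch version, the CUDA version PyTorch was built against, and whether a CUDA device is usable.

```python
import platform


def describe_environment() -> dict:
    """Collect version facts that usually matter for Python/PyTorch/CUDA mismatches."""
    info = {
        "python": platform.python_version(),
        "torch": None,
        "cuda_build": None,
        "cuda_available": False,
    }
    try:
        import torch  # optional: the report still works without PyTorch
    except (ImportError, OSError):
        return info
    info["torch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda           # None for CPU-only builds
    info["cuda_available"] = torch.cuda.is_available()
    return info


env = describe_environment()
for key, value in env.items():
    print(f"{key}: {value}")
```

A `cuda_build` value that is set while `cuda_available` is false typically points at a driver or runtime problem rather than at the Python packages.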

Final Demo

A camera_detect.py script uses OpenCV to capture webcam frames and run real‑time detection. The demo, trained on VOC with limited epochs, can detect simple objects but struggles with dense, small, or distant targets.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.