
From AlexNet to ResNeXt: Key Milestones in CNN Evolution

This article traces the evolution of convolutional neural networks from the pioneering AlexNet through VGG, Inception, ResNet, Inception‑v4, Inception‑ResNet and ResNeXt, highlighting architectural innovations, performance gains, and the underlying biological inspirations that shaped modern deep learning models.

Hulu Beijing

Introduction

How can machines learn to see the world? Biological visual cognition offers clues: the studies of Hubel and Wiesel, recognized with the 1981 Nobel Prize, showed that living organisms process visual stimuli through successive layers of cells, building hierarchical representations. Eight years later, Yann LeCun introduced the first convolutional neural network (CNN) prototype, and over the following three decades CNNs became the cornerstone of computer vision, with the 2012 AlexNet breakthrough sparking the deep‑learning explosion.

Question

Please summarize the main developments of convolutional neural networks from AlexNet to ResNeXt.

Answer

AlexNet

AlexNet was the first CNN to deliver a breakthrough on large‑scale image classification. It stacked convolution and pooling layers, used the ReLU activation, and introduced Local Response Normalization (LRN), dropout, and simple data augmentation. Grouped convolution let the model be split across two GPUs to work around memory limits, and it reached a top‑5 error of 15.3% on ILSVRC 2012.
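The parameter saving from grouped convolution follows from a simple count: each output channel only connects to the input channels in its own group. A minimal sketch (the `conv_params` helper and the 96→256, 5×5 layer shape, loosely modeled on AlexNet's second convolution, are illustrative assumptions):

```python
def conv_params(in_ch, out_ch, k, groups=1):
    """Weights + biases of a conv layer with square k x k kernels."""
    assert in_ch % groups == 0 and out_ch % groups == 0
    return out_ch * (in_ch // groups) * k * k + out_ch

# Ungrouped 96 -> 256 5x5 conv vs. the same layer split into 2 groups
# (i.e., two independent 48 -> 128 convs, one per GPU, as in AlexNet).
ungrouped = conv_params(96, 256, 5, groups=1)
grouped   = conv_params(96, 256, 5, groups=2)
```

With two groups the weight count is halved, which is why the split fit on the GPUs of the time.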

VGG

Building on AlexNet, the VGG series replaced large kernels with stacks of 3×3 convolutions and 2×2 pooling, deepening the network to up to 19 weight layers and widening the feature maps. This architecture achieved a top‑5 error of 6.8% on the ILSVRC 2014 classification task.
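The appeal of stacked 3×3 kernels is that n stacked 3×3 convolutions cover the same receptive field as one (2n+1)×(2n+1) convolution while using fewer weights. A quick check (the channel count C is an illustrative assumption):

```python
def stacked_receptive_field(k, n):
    """Receptive field of n stacked k x k convolutions with stride 1."""
    return n * (k - 1) + 1

C = 64  # illustrative channel count
params_one_5x5 = C * C * 5 * 5       # single 5x5 conv
params_two_3x3 = 2 * C * C * 3 * 3   # two stacked 3x3 convs, same coverage
```

Two 3×3 layers see a 5×5 field with roughly 28% fewer parameters, and the extra nonlinearity between them adds representational power.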

Inception (GoogLeNet‑v1)

Inception‑v1 (GoogLeNet) introduced the Inception module, which splits a wide convolution into parallel branches of 1×1, 3×3, and 5×5 kernels plus a pooling branch, effectively approximating a sparse connection pattern. It also added a bottleneck 1×1 convolution before the larger kernels to reduce computation. The network achieved a top‑5 error of 6.67% on ILSVRC 2014.
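The computational effect of the 1×1 bottleneck can be seen by counting multiply‑accumulate operations. This sketch uses illustrative shapes (the 28×28 map and the 256→64→256 channel counts are assumptions, not GoogLeNet's exact layer sizes):

```python
def conv_macs(h, w, in_ch, out_ch, k):
    """Multiply-accumulates of a k x k conv producing an h x w output map."""
    return h * w * out_ch * in_ch * k * k

H = W = 28
# Direct 5x5 conv on 256 channels vs. 1x1 reduction to 64 channels first.
direct     = conv_macs(H, W, 256, 256, 5)
bottleneck = conv_macs(H, W, 256, 64, 1) + conv_macs(H, W, 64, 256, 5)
```

Here the bottleneck version costs under a third of the direct convolution, which is what makes the wide parallel branches affordable.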

Inception‑v3

Inception‑v3 factorized large kernels into multiple smaller ones (e.g., a 5×5 convolution became two 3×3 convolutions) and introduced asymmetric 1×k followed by k×1 convolutions, reducing parameters while preserving the receptive field. It also enlarged the input size to 299×299 and used label smoothing and model ensembles, achieving a top‑5 error of 3.5% (with multi‑crop, multi‑model ensembling) on the ImageNet benchmark.
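Both factorizations trade one big kernel for cheaper ones with the same coverage; the savings are pure arithmetic (the channel count C below is an illustrative assumption):

```python
C = 192  # illustrative channel count

# Factorization 1: one 5x5 conv vs. two stacked 3x3 convs.
full_5x5 = C * C * 5 * 5
two_3x3  = 2 * C * C * 3 * 3

# Factorization 2: one 7x7 conv vs. asymmetric 1x7 followed by 7x1.
full_7x7 = C * C * 7 * 7
asym_7   = C * C * 1 * 7 + C * C * 7 * 1
```

The 5×5 → two‑3×3 split saves about 28% of the weights; the asymmetric 1×7 + 7×1 split cuts the 7×7 cost by more than two thirds.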

ResNet

ResNet addressed the degradation problem—where deeper plain networks suffer higher training and test errors—by introducing shortcut (skip) connections. These shortcuts shorten the gradient back‑propagation path, mitigating vanishing gradients and allowing networks to scale to hundreds or even thousands of layers. ResNet‑152 achieved a top‑5 error of 4.49% on ImageNet (3.57% with an ensemble, winning ILSVRC 2015).
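A minimal NumPy sketch (not the paper's implementation; fully connected `w1`/`w2` stand in for the convolutions) shows the key property of the identity shortcut: when the residual branch outputs zero, the block is exactly the identity, so adding layers can never make it harder to represent a shallower solution:

```python
import numpy as np

def residual_block(x, w1, w2):
    """y = F(x) + x, where F(x) = w2 @ relu(w1 @ x)."""
    h = np.maximum(w1 @ x, 0.0)  # residual branch with ReLU
    return w2 @ h + x            # identity shortcut added back

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
# Zero weights => F(x) = 0 => the block reduces to the identity mapping.
y = residual_block(x, np.zeros((8, 8)), np.zeros((8, 8)))
```

The shortcut also gives gradients a direct additive path back to earlier layers, which is what keeps very deep stacks trainable.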

Inception‑v4 and Inception‑ResNet

Inception‑v4 refined the stem and added modular blocks (Inception‑A/B/C, Reduction‑A/B). Inception‑ResNet combined Inception modules with residual connections, accelerating training. Inception‑v4 reached a top‑5 error of 3.8%, while Inception‑ResNet‑v1/v2 achieved 4.3%/3.7% (3.1% with a three‑model ensemble).

ResNeXt

ResNeXt refined the residual block by widening the bottleneck (reducing its channel‑reduction ratio) and replacing the middle 3×3 convolution with a grouped convolution, improving accuracy at essentially the same parameter count while simplifying hyper‑parameter choices. At similar computational cost it lowered the top‑5 error by roughly 0.5% compared to ResNet.

Additional Notes

Other notable classification models such as SENet, DPN, PolyNet, and NASNet have emerged in recent years, but are omitted here for brevity.

References

Krizhevsky A., Sutskever I., Hinton G.E. "ImageNet Classification with Deep Convolutional Neural Networks", 2012.

Simonyan K., Zisserman A. “Very Deep Convolutional Networks for Large‑Scale Image Recognition”, 2015.

Szegedy C. et al. “Going deeper with convolutions”, 2015.

Ioffe S., Szegedy C. “Batch normalization: Accelerating deep network training by reducing internal covariate shift”, 2015.

Szegedy C. et al. “Rethinking the inception architecture for computer vision”, 2016.

He K. et al. “Deep residual learning for image recognition”, 2016.

Szegedy C. et al. “Inception‑v4, inception‑resnet and the impact of residual connections on learning”, 2017.

Xie S. et al. “Aggregated residual transformations for deep neural networks”, 2017.
