
Understanding Residual Networks: Ideas, Mechanisms, Variants, and Insights

This article reviews the concept of residual networks, explains their working principle and data‑flow interpretation, discusses why they improve deep models, analyzes path‑length effects on gradients, and surveys various residual block designs and practical takeaways.

360 Tech Engineering

Deep residual networks (ResNets) have enabled neural networks to scale from tens of layers to hundreds and even thousands, achieving top performance in many competitions. This article examines the motivation behind ResNet, its core mechanism, and how residual blocks are constructed.

What Is ResNet?

Instead of learning a direct mapping H(x), ResNet learns a residual function F(x) = H(x) − x and adds the input back, i.e., output = F(x) + x. The residual unit typically pairs an identity skip connection (which passes the input through unchanged) with a nonlinear transformation branch.
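The formula output = F(x) + x can be sketched in a few lines of numpy. This is a minimal toy, assuming a two-layer residual branch with made-up weight matrices W1 and W2 (no convolutions or batch norm):

```python
import numpy as np

def residual_unit(x, W1, W2):
    """One residual unit: output = F(x) + x, where F is a small
    two-layer branch with a ReLU nonlinearity in between."""
    f = W2 @ np.maximum(W1 @ x, 0.0)  # residual branch F(x)
    return f + x                      # identity skip adds the input back

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W1 = rng.standard_normal((4, 4))
W2 = rng.standard_normal((4, 4))
out = residual_unit(x, W1, W2)
```

Note that if the residual branch outputs zero (e.g., all-zero weights), the unit reduces exactly to the identity mapping, which is what makes the identity easy to represent.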

The right-hand side of the figure shows the identity mapping (the skip connection). Stacking many such blocks yields a residual network.

What Inspired the Residual Idea?

Experiments on plain networks show that simply stacking more layers can degrade training accuracy. In principle, a deeper network should do no worse than a shallower one, since the extra layers could simply learn the identity mapping. That they fail to do so suggests the difficulty lies in learning the identity itself; adding an explicit identity skip connection makes the identity trivial to represent and leaves the layers to learn only the residual.

Why does it work?

Two explanations exist: (1) a data‑flow view where the network has two parallel routes—an identity path and a residual (non‑linear) path; (2) an ensemble‑by‑construction view, where the network behaves like a collection of many shallow sub‑networks.

The data-flow perspective shows two streams: the direct skip connection and the nonlinear residual stream. The ensemble view explains why removing some residual blocks has little impact on test performance: the remaining paths still form a strong ensemble.
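The ensemble view can be made concrete by "unraveling" a stack of residual blocks into all the paths through it. A toy sketch with two blocks, using made-up linear branches f1 and f2 (the exact equality below relies on the branches being linear; with real nonlinear branches the network still contains 2^n paths, but they interact):

```python
import numpy as np

def f1(x):  # first residual branch (toy linear map)
    return 0.5 * x

def f2(x):  # second residual branch (toy linear map)
    return 0.25 * x

x = np.array([1.0, 2.0])

# Sequential form: two stacked residual blocks
h1 = x + f1(x)
out = h1 + f2(h1)

# "Unraveled" form: sum over the 2^2 = 4 paths; each path either
# passes through a block's residual branch or skips it entirely
unraveled = x + f1(x) + f2(x) + f2(f1(x))
```

The sequential output and the path sum coincide, which is the sense in which an n-block residual network behaves like an implicit ensemble of 2^n sub-networks.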

Analyzing path lengths shows that the number of residual blocks traversed along each path follows a binomial distribution. Short paths (those passing through few residual blocks) contribute the majority of the gradient during training, while very long paths still suffer from vanishing gradients.
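The binomial path-length argument can be checked numerically. A sketch using stdlib `math.comb`, with an assumed block count of 54 and an assumed per-block gradient attenuation factor of 0.5 (both are illustrative numbers, not values from the papers):

```python
from math import comb

n = 54        # number of residual blocks (assumed)
decay = 0.5   # assumed per-block gradient attenuation

# Each path either enters or skips each of the n residual branches,
# so the number of paths traversing exactly k blocks is C(n, k):
# path lengths follow a binomial distribution.
counts = [comb(n, k) for k in range(n + 1)]

# Weight each path length by an exponentially decaying gradient
# magnitude: long paths are numerous but contribute vanishing gradient.
contrib = [comb(n, k) * decay**k for k in range(n + 1)]
total = sum(contrib)
short = sum(contrib[: n // 3]) / total  # share from the shortest third

print("most common path length:", counts.index(max(counts)))
print("gradient-weighted peak length:", contrib.index(max(contrib)))
```

Under these assumptions the raw path count peaks at length n/2, but the gradient-weighted distribution peaks at noticeably shorter paths, matching the claim that short paths carry most of the gradient.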

Experiments that randomly drop a subset of residual blocks confirm that only a small set of effective (mostly short) paths is needed for good performance.
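Dropping blocks is cheap to emulate, because a dropped residual block degenerates to the identity rather than severing the network. A toy sketch with assumed scalar `tanh` branches (not the experimental setup from the papers):

```python
import numpy as np

rng = np.random.default_rng(1)
weights = rng.standard_normal(10)  # 10 toy residual blocks (assumed)

def branch(x, w):
    return np.tanh(w * x)  # toy nonlinear residual branch

def forward(x, keep):
    # keep[i] == False drops block i: only its skip connection remains
    for w, k in zip(weights, keep):
        if k:
            x = x + branch(x, w)
    return x

x = 0.7
full = forward(x, [True] * 10)          # all blocks active
pruned = forward(x, [True] * 8 + [False] * 2)  # two blocks dropped
```

With every block dropped the network collapses to the pure identity, `forward(x, [False] * 10) == x`; in a plain (non-residual) stack, removing a layer would instead break the computation entirely.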

Variants of the Residual Module

Different block designs affect performance. For example, moving batch normalization and ReLU before the weight layers (the "pre-activation" design) instead of after them improves results by keeping the identity path free of nonlinearities, and other architectural tweaks can further boost accuracy.
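The post- vs pre-activation difference can be sketched minimally in numpy, ignoring batch norm and using a single assumed weight matrix per block:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def post_activation_block(x, W):
    """Original ordering (simplified): the ReLU sits after the
    addition, so the skip path is not a pure identity."""
    return relu(W @ x + x)

def pre_activation_block(x, W):
    """Pre-activation ordering: ReLU before the weight layer; the
    addition output passes to the next block completely unchanged."""
    return W @ relu(x) + x

rng = np.random.default_rng(2)
x = rng.standard_normal(3)
W = rng.standard_normal((3, 3))
y_post = post_activation_block(x, W)
y_pre = pre_activation_block(x, W)
```

With zero weights, the pre-activation block reduces exactly to the identity, while the post-activation block still applies a ReLU to the skip path; this clean identity path is the intuition behind the improvement.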

Thoughts and Summary

1. Compared with dropout, which achieves an ensemble effect by training a randomly sampled sub-network at each step, ResNet builds its ensemble directly into the architecture.

2. Random forests likewise achieve ensemble effects through both structure (many independent trees) and training-data sampling (bootstrap aggregation).

3. Residual networks are not truly deep in the traditional sense; they are ensembles of relatively shallow networks, and gradient vanishing remains an open issue for long paths.

References

1. He, Zhang, Ren, Sun. Deep Residual Learning for Image Recognition. CVPR 2016.

2. He, Zhang, Ren, Sun. Identity Mappings in Deep Residual Networks. ECCV 2016.

3. Veit, Wilber, Belongie. Residual Networks Behave Like Ensembles of Relatively Shallow Networks. NeurIPS 2016.

Tags: deep learning, neural networks, gradient, ensemble, model architecture, ResNet
Written by

360 Tech Engineering

Official tech channel of 360, building the most professional technology aggregation platform for the brand.
