Can ResNet Still Beat Transformers? A Deep Dive into Modern Training Tricks
This article reviews recent research and official PyTorch blog updates that modify ResNet architectures and training tricks, compares their performance against EfficientNet, ConvNeXt, and Vision Transformers using extensive ImageNet benchmarks, and provides both literature‑based and local evaluation results to assess whether classic CNNs remain competitive.
Introduction
ResNet has been a cornerstone of image classification for more than six years. Recent research revisits its architecture and training pipeline, asking whether modern convolutional networks can still compete with Vision Transformers (ViT) that dominate many benchmarks.
ResNet‑RS: Tricks All‑in‑One
ResNet‑RS (2021) augments a standard ResNet‑200 (or shallower variants) with three groups of modifications:
Training tricks: cosine learning‑rate decay with warm‑up, 600‑epoch schedule, RandAugment (or TrivialAugment), Stochastic Depth, Label Smoothing, Exponential Moving Average (EMA) of weights, higher weight‑decay, and the removal of Dropout.
Regularization tweaks: stronger RandAugment, careful weight‑decay tuning, EMA.
Architecture tweaks: Squeeze‑and‑Excitation (SE) modules, ResNet‑D style down‑sampling (average‑pool + 1×1 convolution), minor stem changes.
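As a concrete illustration of the architecture tweaks, here is a minimal PyTorch sketch of a Squeeze‑and‑Excitation block and a ResNet‑D style downsampling shortcut. The module names and default arguments are illustrative; this is not ResNet‑RS's reference implementation.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: re-weight channels with a learned gate."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc1 = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.fc2 = nn.Conv2d(channels // reduction, channels, kernel_size=1)

    def forward(self, x):
        s = x.mean(dim=(2, 3), keepdim=True)   # squeeze: global average pool
        s = torch.relu(self.fc1(s))            # excitation: bottleneck MLP
        s = torch.sigmoid(self.fc2(s))
        return x * s                           # channel-wise re-scaling

def resnet_d_downsample(in_ch: int, out_ch: int, stride: int = 2) -> nn.Sequential:
    """ResNet-D shortcut: average-pool first, then a 1x1 convolution,
    instead of a strided 1x1 convolution that skips 3/4 of the activations."""
    return nn.Sequential(
        nn.AvgPool2d(kernel_size=stride, stride=stride),
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_ch),
    )
```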
These combined changes raise ImageNet‑1k top‑1 accuracy from the vanilla ResNet‑50 baseline (~76 %) to about 80.4 % while keeping inference speed competitive.
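The training‑side tricks listed above (cosine decay with warm‑up, label smoothing, stochastic depth, weight EMA) can be wired up with standard PyTorch and timm utilities. The sketch below is an outline under assumed settings, not the paper's exact recipe: the model name, learning rate, weight decay, EMA decay, and schedule lengths are placeholders.

```python
import torch
import timm
from timm.utils import ModelEmaV2

# Stochastic depth is exposed through timm's drop_path_rate argument (illustrative value).
model = timm.create_model("resnetrs50", pretrained=False, drop_path_rate=0.1).cuda()

optimizer = torch.optim.SGD(model.parameters(), lr=0.5, momentum=0.9,
                            weight_decay=4e-5)            # placeholder hyperparameters

warmup_epochs, total_epochs = 5, 600                      # illustrative schedule lengths
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.01,
                                           total_iters=warmup_epochs)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer,
                                                    T_max=total_epochs - warmup_epochs)
scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, [warmup, cosine],
                                                  milestones=[warmup_epochs])

criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
ema = ModelEmaV2(model, decay=0.9998)                     # EMA copy of the weights

# Inside the training loop:
#   loss = criterion(model(images), targets)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
#   ema.update(model)        # evaluate with ema.module for the EMA accuracy
# scheduler.step() is called once per epoch.
```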
TorchVision 0.11 Update
In November 2021 the TorchVision library (v0.11) incorporated the same set of tricks used in the timm repository. The updated training pipeline consists of:
SGD with warm‑up cosine annealing
TrivialAugment (a tuning‑free variant of RandAugment)
600‑epoch schedule
Random Erasing, Mixup, CutMix, Label Smoothing
EMA of model weights
Cross‑validated weight‑decay selection
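The regularization suite above can be assembled from off‑the‑shelf TorchVision and timm components. The following is a sketch under assumed hyperparameters (mixup/cutmix alphas, erasing probability), not the exact values of the official recipe:

```python
import torch
from torchvision import transforms
from timm.data import Mixup

# Per-sample augmentations applied in the training data loader.
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.TrivialAugmentWide(),      # tuning-free AutoAugment-style policy
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    transforms.RandomErasing(p=0.1),      # applied on the normalized tensor
])

# Per-batch Mixup/CutMix; label smoothing is folded into the soft targets.
mixup_fn = Mixup(mixup_alpha=0.2, cutmix_alpha=1.0,
                 label_smoothing=0.1, num_classes=1000)

# Inside the training loop:
#   images, targets = mixup_fn(images, targets)   # mixed images, soft labels
#   loss = torch.nn.functional.cross_entropy(model(images), targets)
```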
Evaluated at 224×224 resolution, the model reaches 80.4 % top‑1; increasing the input size to 232×232 adds roughly 0.2 % (≈80.67 %).
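In more recent TorchVision releases (0.13 and later) the retrained ResNet‑50 weights are published as the IMAGENET1K_V2 entry of the weights enum, and the bundled preprocessing already performs the larger resize before the 224×224 center crop. A minimal evaluation sketch (the image path is a placeholder):

```python
import torch
from PIL import Image
from torchvision.models import resnet50, ResNet50_Weights

# IMAGENET1K_V2 = weights trained with the improved recipe; V1 keeps the original ones.
weights = ResNet50_Weights.IMAGENET1K_V2
model = resnet50(weights=weights).eval()

preprocess = weights.transforms()                 # resize / center-crop / normalize preset

img = Image.open("example.jpg").convert("RGB")    # placeholder image path
with torch.inference_mode():
    batch = preprocess(img).unsqueeze(0)
    probs = model(batch).softmax(dim=-1)
    print(weights.meta["categories"][probs.argmax().item()])
```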
ConvNeXt vs Swin‑Transformer
The ConvNeXt paper (2022) deliberately mirrors design choices of the Swin‑Transformer:
Patchify stem (a 4×4, stride‑4 convolution) plus separate 2×2, stride‑2 convolutional down‑sampling layers between stages
GELU activations and large (7×7) depthwise convolution kernels
LayerNorm and LayerScale (from CaiT)
AdamW optimizer with cosine learning‑rate schedule
Regularization suite similar to that of ResNet‑RS (Mixup, CutMix, RandAugment, Stochastic Depth, EMA)
ConvNeXt is trained for 300 epochs, the same schedule length as Swin‑Transformer and far shorter than the 600‑epoch recipes above. When matched for parameters and FLOPs, ConvNeXt‑T achieves 82.1 % top‑1, surpassing Swin‑T (81.3 %) and other ViT‑style models, and it also shows stronger transfer performance on downstream tasks.
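To ground this list of design choices, here is a simplified PyTorch sketch of a ConvNeXt‑style block: a 7×7 depthwise convolution, LayerNorm, an inverted‑bottleneck MLP with GELU, LayerScale, and a residual connection (stochastic depth omitted for brevity). It follows the published design but is not the reference implementation.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Simplified ConvNeXt-style block (no stochastic depth)."""
    def __init__(self, dim: int, layer_scale_init: float = 1e-6):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)              # normalizes over the channel dim
        self.pwconv1 = nn.Linear(dim, 4 * dim)     # inverted-bottleneck expansion
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)
        self.gamma = nn.Parameter(layer_scale_init * torch.ones(dim))  # LayerScale

    def forward(self, x):                          # x: (N, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                  # (N, H, W, C) for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = self.gamma * x
        x = x.permute(0, 3, 1, 2)
        return shortcut + x                        # residual connection
```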
Benchmark Summary (literature)
Normalized ImageNet‑1k results (no extra data) reported in recent papers are:
ResNet‑RS‑50: 78.8 % top‑1, 4.9 k img/s
ResNet‑RS‑101: 80.3 % top‑1, 3.0 k img/s
ResNet‑50 (timm‑A1): 80.4 % top‑1, 2.0 k img/s
ConvNeXt‑T: 82.1 % top‑1, 0.9 k img/s
Swin‑T: 81.3 % top‑1, 0.9 k img/s
EfficientNet‑B0: 77.1 % top‑1, 2.9 k img/s
These numbers illustrate that modern CNNs equipped with advanced tricks can match or exceed transformer‑based models while often retaining higher throughput.
Local Evaluation (author’s own tests)
Tests were run on a workstation with an AMD Ryzen 5 2600, an RTX 2070, CUDA 11.4, and PyTorch 1.10 (Windows 10). All models used pretrained weights from the timm library and were evaluated with 224×224 input, at batch size 32 for inference and 256 for training.
EfficientNet‑V2‑B0 – 1691 img/s inference, 78.4 % top‑1
EfficientNet‑B0 – 1293 img/s, 76.8 % top‑1
ResNet‑50 – 621 img/s, 80.4 % top‑1
ResNet‑50d – 577 img/s, 80.5 % top‑1
ResNet‑RS‑50 – 514 img/s, 79.9 % top‑1
SEResNeXt‑50 (32×4d) – 413 img/s, 81.3 % top‑1
ConvNeXt‑T (optimized) – 386 img/s, 82.1 % top‑1
ConvNeXt‑T (original repo) – 326 img/s, 82.1 % top‑1
ResNeSt‑50 – 305 img/s, 81.0 % top‑1
Swin‑T – 301 img/s, 81.4 % top‑1
A scatter plot (speed vs. accuracy, bubble size = memory) visualizes these trade‑offs.
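Throughput numbers of this kind can be reproduced with a simple timing loop along the following lines: a sketch assuming timm pretrained models, a CUDA device, and the inference batch size of 32 mentioned above; it is not the exact benchmarking script behind the table.

```python
import time
import torch
import timm

def throughput(model_name: str, batch_size: int = 32, iters: int = 50) -> float:
    """Rough images/second for one model on fixed-size random input."""
    model = timm.create_model(model_name, pretrained=True).eval().cuda()
    x = torch.randn(batch_size, 3, 224, 224, device="cuda")
    with torch.inference_mode():
        for _ in range(10):                # warm-up iterations
            model(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()
    return iters * batch_size / (time.perf_counter() - start)

# Model names follow timm's registry (illustrative subset).
for name in ["resnet50", "convnext_tiny", "swin_tiny_patch4_window7_224"]:
    print(name, f"{throughput(name):.0f} img/s")
```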
Conclusion
Even in the era of Vision Transformers, well‑engineered CNNs such as ResNet‑RS and ConvNeXt deliver state‑of‑the‑art ImageNet accuracy with comparable or superior inference speed. Modern training tricks (cosine learning‑rate schedules, RandAugment/TrivialAugment, EMA, label smoothing, Mixup/CutMix) have become the de facto standard for high‑performance ImageNet training. When these tricks are applied uniformly, the architectural gap between CNNs and Transformers narrows, and the competition between the two families is likely to continue.
