How Tiny Perturbations Can Fool 95% Accurate Image Classifiers

Despite achieving roughly 95% top‑5 accuracy on ImageNet, popular models such as ResNet, VGG, and EfficientNet can be misled by adversarial examples crafted with FGSM, exposing a fundamental vulnerability of deep learning and motivating robust defense strategies.


Adversarial Examples: A Hidden Threat

State‑of‑the‑art image classifiers such as ResNet, VGG, and EfficientNet routinely achieve over 90% top‑5 accuracy on ImageNet, yet a minuscule, carefully crafted perturbation can cause them to output completely wrong labels with high confidence. This reveals a fundamental security weakness in deep learning systems.

Adversarial patch on a French bulldog image causing VGG to predict 'football'

What Are Adversarial Samples?

An adversarial sample is an input deliberately altered to deceive a model. Unlike random noise, the perturbation is computed via optimization so that it remains imperceptible to humans while maximizing the model’s prediction error.

FGSM: Fast Gradient Sign Method

Mathematical Principle

Given a classifier f, an input x and its true label y, FGSM computes the perturbation η = ε·sign(∇_x L(f(x), y)) where ε controls the perturbation magnitude. The adversarial image is x_adv = x + η, clipped to the valid pixel range.
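
As a minimal sketch, the formula maps directly onto a few lines of PyTorch; the function name fgsm_perturb and its arguments are illustrative, not part of any library:

import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, epsilon):
    """One-step FGSM: x_adv = clip(x + eps * sign(grad_x L(f(x), y)), 0, 1)."""
    x = x.clone().detach().requires_grad_(True)   # track gradients w.r.t. the input pixels
    loss = F.cross_entropy(model(x), y)           # L(f(x), y)
    loss.backward()                               # populates x.grad
    x_adv = x + epsilon * x.grad.sign()           # step of size eps along the gradient sign
    return x_adv.clamp(0, 1).detach()             # clip to the valid pixel range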

Why FGSM Works

Deep networks behave almost linearly in a small neighbourhood around a data point, so the gradient points directly toward the most damaging direction. A tiny step along this direction can cause a large shift in the high‑dimensional output space, something random noise cannot achieve.

FGSM perturbation on a panda image causing misclassification as a gibbon

Python Walkthrough: Building Your First Adversarial Example

The following steps use PyTorch and a pretrained ResNet‑50 model.

pip install torch torchvision matplotlib numpy pillow

import torch
import torch.nn.functional as F
import torchvision.models as models
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

Step 1 – Load the classifier

model = models.resnet50(pretrained=True)  # newer torchvision prefers weights=models.ResNet50_Weights.DEFAULT
model.eval()  # inference mode: disables dropout and uses running batch-norm statistics

Step 2 – Prepare the image

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),  # scales pixel values to [0, 1]
])
# Note: the usual ImageNet mean/std normalization is omitted here so that pixels stay
# in [0, 1] and can be clamped directly after the attack; accuracy suffers slightly.
img = Image.open("your_image.jpg").convert("RGB")
x = transform(img).unsqueeze(0)  # add a batch dimension: shape (1, 3, 224, 224)
x.requires_grad = True  # we need gradients with respect to the input pixels

Step 3 – Original prediction

logits = model(x)            # forward pass
pred = logits.argmax(dim=1)  # index of the most likely ImageNet class
print(f"Original prediction: {pred.item()}")

Step 4 – FGSM attack

label = pred                            # attack the model's own prediction (untargeted FGSM)
loss = F.cross_entropy(logits, label)
loss.backward()                         # gradient of the loss w.r.t. the input pixels
epsilon = 0.01                          # perturbation budget
perturbation = epsilon * x.grad.sign()
x_adv = torch.clamp(x + perturbation, 0, 1)  # keep pixels in the valid range

Step 5 – Verify adversarial prediction

logits_adv = model(x_adv)
pred_adv = logits_adv.argmax(dim=1)
print(f"Adversarial prediction: {pred_adv.item()}")

Step 6 – Visualize

def show_adversarial_attack(original, adversarial, perturbation):
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    axes[0].imshow(original)
    axes[0].set_title("Original Image")
    axes[0].axis("off")
    axes[1].imshow(adversarial)
    axes[1].set_title("Adversarial Image")
    axes[1].axis("off")
    axes[2].imshow(perturbation, cmap="gray")
    axes[2].set_title("Noise Pattern (10x Amplified)")
    axes[2].axis("off")
    plt.tight_layout()
    plt.show()

orig_np = x.detach().squeeze().permute(1, 2, 0).numpy()
adv_np = x_adv.detach().squeeze().permute(1, 2, 0).numpy()
# Amplify the difference 10x and re-center around mid-gray so the pattern is visible when plotted
noise_np = np.clip((adv_np - orig_np) * 10 + 0.5, 0, 1)
show_adversarial_attack(orig_np, adv_np, noise_np)
Amplified noise pattern after FGSM
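
One practical caveat: writing the adversarial image back to disk quantizes each pixel to 8 bits, which can partially erase a perturbation this small (the same effect that bit‑depth reduction exploits as a defense below). A sketch of saving it with a lossless format:

adv_uint8 = (np.clip(adv_np, 0, 1) * 255).round().astype(np.uint8)  # quantize to 8-bit RGB
Image.fromarray(adv_uint8).save("adversarial.png")  # PNG is lossless; JPEG would distort the noise further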

Why Neural Networks Are So Fragile

High‑dimensional geometry: A 224×224 RGB image lives in a 150,528‑dimensional space; tiny changes in each dimension accumulate to a large overall distance (the short calculation after this list makes this concrete).

Local linearity: Despite nonlinear activations, networks behave almost linearly in a small neighbourhood, making gradient‑based attacks highly effective.

Non‑generalizable features: Models rely on statistical shortcuts that humans do not perceive; adversarial perturbations exploit these “shortcut” features.
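
To make the first point concrete, a per‑pixel change of ε = 0.01 is invisible to the eye, yet its accumulated Euclidean length over 150,528 dimensions is anything but tiny:

import numpy as np

dims = 224 * 224 * 3   # 150,528 pixel values
epsilon = 0.01         # per-dimension perturbation
print(f"L2 norm of the perturbation: {np.sqrt(dims) * epsilon:.2f}")  # ≈ 3.88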

Limitations of This Demonstration

FGSM is a single‑step, relatively weak attack; stronger iterative methods such as PGD or Carlini‑Wagner exist. The code assumes a white‑box scenario where the attacker has full access to model weights and gradients, which is not always realistic. Physical adversarial patches and black‑box attacks introduce additional challenges.
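
For comparison, here is a minimal, illustrative sketch of the iterative PGD attack mentioned above, assuming the same unnormalized [0, 1] pixel setup as the walkthrough (the function and its default values are not from any library):

import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, epsilon=0.03, alpha=0.007, steps=10):
    """Projected Gradient Descent: repeated FGSM-style steps projected back into an eps-ball."""
    x = x.clone().detach()
    x_adv = x.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]               # gradient w.r.t. the current iterate
        x_adv = x_adv.detach() + alpha * grad.sign()             # small signed-gradient step
        x_adv = torch.max(torch.min(x_adv, x + epsilon), x - epsilon)  # project into the L-inf ball around x
        x_adv = x_adv.clamp(0, 1)                                # stay in the valid pixel range
    return x_adv.detach()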

Defensive Strategies

Adversarial training: Incorporate adversarial examples into the training set to improve robustness (a minimal training‑step sketch follows at the end of this section).

Input preprocessing: Apply JPEG compression, random resizing, or bit‑depth reduction to disrupt perturbations.

Ensemble methods: Combine predictions from multiple models or add randomness to increase attack difficulty.

Certified defenses: Techniques such as randomized smoothing provide provable robustness within a perturbation radius.

Detection mechanisms: Train separate detectors to flag potential adversarial inputs.

Each defense incurs trade‑offs among accuracy, computational cost, and generalization.
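
As an illustration of the first strategy, adversarial training can be sketched as a single training step that mixes clean and FGSM‑perturbed batches, reusing the fgsm_perturb helper sketched earlier; model, optimizer, x, and y are placeholders for your own training loop:

def adversarial_training_step(model, x, y, optimizer, epsilon=0.03):
    """One step of adversarial training on a 50/50 mix of clean and perturbed examples."""
    x_adv = fgsm_perturb(model, x, y, epsilon)   # craft attacks against the current weights
    optimizer.zero_grad()                        # clear gradients accumulated while crafting x_adv
    loss = 0.5 * F.cross_entropy(model(x), y) + 0.5 * F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()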

Conclusion

Adversarial examples reveal a fundamental gap between statistical optimization and human perception: deep learning excels at pattern matching but lacks true semantic understanding. Robustness must become a first‑class engineering requirement alongside accuracy, fairness, and efficiency, especially as AI systems are deployed in safety‑critical domains.

image classification · PyTorch · adversarial examples · FGSM · deep learning security
Written by Data Party THU, the official platform of the Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.