Five Quick Tricks to Supercharge Your Neural Network Training
This article presents five concise, widely applicable techniques—adversarial training with FGM, exponential moving average (EMA), test‑time augmentation (TTA), pseudo‑label learning, and special‑sample handling via nearest‑neighbor retrieval—to reliably improve model performance with minimal code changes.
Background
The author needed short, effective methods to boost neural‑network scores for a paper submission, preferring lightweight tricks that require little code and no extensive hyper‑parameter tuning.
1. Adversarial Training (FGM)
Adversarial training (here via FGM, the Fast Gradient Method) adds a small gradient-based perturbation to the embedding weights after the normal backward pass, then runs a second forward and backward pass on the perturbed model so the adversarial gradients accumulate with the clean ones. The following plug‑and‑play code demonstrates the workflow:
# Initialize
fgm = FGM(model)
for batch_input, batch_label in data:
    # Normal training
    loss = model(batch_input, batch_label)
    loss.backward()  # compute normal gradients on the clean input
    # Adversarial training
    fgm.attack()  # add perturbation to the embeddings
    loss_adv = model(batch_input, batch_label)
    loss_adv.backward()  # accumulate gradients on the perturbed input
    fgm.restore()  # restore the original embeddings
    optimizer.step()
    model.zero_grad()

The concrete implementation of FGM is:
import torch

class FGM():
    def __init__(self, model):
        self.model = model
        self.backup = {}

    def attack(self, epsilon=1., emb_name='emb.'):
        # emb_name should match the name of the embedding parameter in your model
        for name, param in self.model.named_parameters():
            if param.requires_grad and emb_name in name:
                self.backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0 and not torch.isnan(norm):
                    r_at = epsilon * param.grad / norm
                    param.data.add_(r_at)

    def restore(self, emb_name='emb.'):
        for name, param in self.model.named_parameters():
            if param.requires_grad and emb_name in name:
                assert name in self.backup
                param.data = self.backup[name]
        self.backup = {}

2. Exponential Moving Average (EMA)
EMA keeps a shadow copy of the model parameters, updated every training step with a decay factor, which yields a smoothed version of the weights. Evaluating with the shadow weights, especially in the later stages of training, often improves stability and final performance. The code below shows a minimal EMA utility:
# Initialize
ema = EMA(model, 0.999)
ema.register()

# During training: update the shadow weights after each optimizer step
def train():
    optimizer.step()
    ema.update()

# During evaluation: swap in the shadow weights, then restore afterwards
def evaluate():
    ema.apply_shadow()
    # evaluate model
    ema.restore()

class EMA():
    def __init__(self, model, decay):
        self.model = model
        self.decay = decay
        self.shadow = {}
        self.backup = {}

    def register(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                self.shadow[name] = param.data.clone()

    def update(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                assert name in self.shadow
                new_average = (1.0 - self.decay) * param.data + self.decay * self.shadow[name]
                self.shadow[name] = new_average.clone()

    def apply_shadow(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                assert name in self.shadow
                self.backup[name] = param.data
                param.data = self.shadow[name]

    def restore(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                assert name in self.backup
                param.data = self.backup[name]
        self.backup = {}

3. Test‑Time Augmentation (TTA)
During inference, generate several reasonable augmentations of each test sample, obtain predictions for each augmented version, and average the results. This simple strategy often yields a modest boost without changing the training pipeline.
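The averaging step above can be sketched in a few lines. The `predict_fn` and `hflip` helpers below are hypothetical stand-ins for a real model and a real augmentation (numpy is used for brevity; with PyTorch the idea is identical):

```python
import numpy as np

def tta_predict(predict_fn, x, augmentations):
    # Average predictions over the original input and each augmented copy
    preds = [predict_fn(x)]
    for aug in augmentations:
        preds.append(predict_fn(aug(x)))
    return np.mean(preds, axis=0)

# Toy example: a fake "model" and a horizontal flip for a 2x3 "image"
predict_fn = lambda img: np.array([img.mean(), img.max()])
hflip = lambda img: img[:, ::-1]
image = np.arange(6, dtype=float).reshape(2, 3)
avg = tta_predict(predict_fn, image, [hflip])
```

Only label-preserving augmentations should be used here; averaging raw probabilities (before argmax) is the usual choice.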
4. Pseudo‑Label Learning
Use a trained model to generate predictions on unlabeled or test data, treat those predictions as “pseudo‑labels,” and then continue training on the combined labeled and pseudo‑labeled data. Typically only high‑confidence predictions are kept, and care must be taken to avoid label leakage when the pseudo‑labels come from the test set.
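A common safeguard, sketched below under the assumption that only high-confidence predictions should become pseudo-labels (the threshold value is illustrative, not from the source):

```python
import numpy as np

def select_pseudo_labels(probs, threshold=0.95):
    # Keep only unlabeled samples whose top predicted probability
    # clears the confidence threshold; return their indices and labels
    confidence = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    keep = confidence >= threshold
    return np.where(keep)[0], labels[keep]

# Toy predicted probabilities for three unlabeled samples, two classes
probs = np.array([[0.98, 0.02],
                  [0.60, 0.40],
                  [0.10, 0.90]])
idx, pseudo = select_pseudo_labels(probs, threshold=0.85)
```

The selected samples are then concatenated with the labeled set for another round of training.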
5. Special‑Sample Handling
For rare, long‑tail, or low‑confidence samples, convert the classification task into a retrieval problem: represent each sample with a vector, then find its nearest neighbors in a feature bank. An ICLR 2020 survey (https://arxiv.org/abs/1910.09217) provides a comprehensive overview of such methods.
Conclusion
All five tricks are lightweight, broadly applicable, and have been shown to improve performance across many datasets. They can be combined or used individually depending on the constraints of a given project.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.