Five Quick Tricks to Supercharge Your Neural Network Training
This article presents five concise, widely applicable techniques—adversarial training with FGM, exponential moving average (EMA), test‑time augmentation (TTA), pseudo‑label learning, and special‑sample handling via nearest‑neighbor retrieval—to reliably improve model performance with minimal code changes.
Background
The author needed short, effective methods to boost neural‑network scores for a paper submission, preferring lightweight tricks that require little code and no extensive hyper‑parameter tuning.
1. Adversarial Training (FGM)
Adversarial training (here via FGM, the Fast Gradient Method) adds a small gradient-based perturbation to the embedding weights after the normal backward pass, then runs a second forward and backward pass on the perturbed model so the adversarial gradients accumulate with the clean ones. The following plug‑and‑play code demonstrates the workflow:
# Initialize
fgm = FGM(model)
for batch_input, batch_label in data:
    # Normal training
    loss = model(batch_input, batch_label)
    loss.backward()  # compute normal gradients on the clean input
    # Adversarial training
    fgm.attack()  # add perturbation to the embeddings
    loss_adv = model(batch_input, batch_label)
    loss_adv.backward()  # accumulate gradients on the perturbed input
    fgm.restore()  # restore the original embeddings
    optimizer.step()
    model.zero_grad()

The concrete implementation of FGM is:
import torch

class FGM():
    def __init__(self, model):
        self.model = model
        self.backup = {}

    def attack(self, epsilon=1., emb_name='emb.'):
        # emb_name should match the name of the embedding parameter in your model
        for name, param in self.model.named_parameters():
            if param.requires_grad and emb_name in name:
                self.backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0 and not torch.isnan(norm):
                    r_at = epsilon * param.grad / norm
                    param.data.add_(r_at)

    def restore(self, emb_name='emb.'):
        for name, param in self.model.named_parameters():
            if param.requires_grad and emb_name in name:
                assert name in self.backup
                param.data = self.backup[name]
        self.backup = {}

2. Exponential Moving Average (EMA)
EMA keeps a shadow copy of the model parameters, updated every training step with a decay factor, which yields a smoothed version of the weights. Evaluating with the shadow weights, especially in the later stages of training, often improves stability and final performance. The code below shows a minimal EMA utility:
# Initialize
ema = EMA(model, 0.999)
ema.register()

# During training: update the shadow weights after each optimizer step
def train():
    optimizer.step()
    ema.update()

# During evaluation: swap in the shadow weights, then restore afterwards
def evaluate():
    ema.apply_shadow()
    # evaluate model
    ema.restore()

class EMA():
    def __init__(self, model, decay):
        self.model = model
        self.decay = decay
        self.shadow = {}
        self.backup = {}

    def register(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                self.shadow[name] = param.data.clone()

    def update(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                assert name in self.shadow
                new_average = (1.0 - self.decay) * param.data + self.decay * self.shadow[name]
                self.shadow[name] = new_average.clone()

    def apply_shadow(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                assert name in self.shadow
                self.backup[name] = param.data
                param.data = self.shadow[name]

    def restore(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                assert name in self.backup
                param.data = self.backup[name]
        self.backup = {}

3. Test‑Time Augmentation (TTA)
During inference, generate several reasonable augmentations of each test sample, obtain predictions for each augmented version, and average the results. This simple strategy often yields a modest boost without changing the training pipeline.
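The averaging step above can be sketched in a few lines. The `predict_fn` and `hflip` helpers below are hypothetical stand-ins for a real model and a real augmentation (numpy is used for brevity; with PyTorch the idea is identical):

```python
import numpy as np

def tta_predict(predict_fn, x, augmentations):
    # Average predictions over the original input and each augmented copy
    preds = [predict_fn(x)]
    for aug in augmentations:
        preds.append(predict_fn(aug(x)))
    return np.mean(preds, axis=0)

# Toy example: a fake "model" and a horizontal flip for a 2x3 "image"
predict_fn = lambda img: np.array([img.mean(), img.max()])
hflip = lambda img: img[:, ::-1]
image = np.arange(6, dtype=float).reshape(2, 3)
avg = tta_predict(predict_fn, image, [hflip])
```

Only label-preserving augmentations should be used here; averaging raw probabilities (before argmax) is the usual choice.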
4. Pseudo‑Label Learning
Use a trained model to generate predictions on unlabeled or test data, treat those predictions as “pseudo‑labels,” and then continue training on the combined labeled and pseudo‑labeled data. Typically only high‑confidence predictions are kept, and care must be taken to avoid label leakage when the pseudo‑labels come from the test set.
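A common safeguard, sketched below under the assumption that only high-confidence predictions should become pseudo-labels (the threshold value is illustrative, not from the source):

```python
import numpy as np

def select_pseudo_labels(probs, threshold=0.95):
    # Keep only unlabeled samples whose top predicted probability
    # clears the confidence threshold; return their indices and labels
    confidence = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    keep = confidence >= threshold
    return np.where(keep)[0], labels[keep]

# Toy predicted probabilities for three unlabeled samples, two classes
probs = np.array([[0.98, 0.02],
                  [0.60, 0.40],
                  [0.10, 0.90]])
idx, pseudo = select_pseudo_labels(probs, threshold=0.85)
```

The selected samples are then concatenated with the labeled set for another round of training.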
5. Special‑Sample Handling
For rare, long‑tail, or low‑confidence samples, convert the classification task into a retrieval problem: represent each sample with a vector, then find its nearest neighbors in a feature bank. An ICLR 2020 survey (https://arxiv.org/abs/1910.09217) provides a comprehensive overview of such methods.
Conclusion
All five tricks are lightweight, broadly applicable, and have been shown to improve performance across many datasets. They can be combined or used individually depending on the constraints of a given project.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.