Boost Model Accuracy with 6 Proven Training Tricks

This article compiles six practical machine-learning tricks (model ensembling, adversarial training with FGM, R-Drop consistency regularization, test-time augmentation, pseudo-labeling, and neural-network missing-value imputation), explaining their principles, providing ready-to-use code snippets, and discussing their benefits and trade-offs.


Machine‑learning practitioners often rely on a handful of well‑known tricks such as learning‑rate schedules, data augmentation, dropout, and batch‑norm. This guide groups additional techniques into three categories—stable‑useful tricks, scenario‑limited tricks, and performance‑boosting tricks—offering concrete implementations and practical advice.

Stable‑Useful Tricks

1. Model Ensembling – Classic stacking or probability averaging, useful for competitions but rarely needed for research papers.
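
A minimal sketch of probability averaging across several trained models (the helper and model names are illustrative, not from the original article):

import torch

@torch.no_grad()
def ensemble_predict(models, batch_input):
    # Average the softmax probabilities of each model (assumes classification logits)
    probs = [torch.softmax(m(batch_input), dim=-1) for m in models]
    return torch.stack(probs).mean(dim=0)

# e.g. averaged = ensemble_predict([model_a, model_b, model_c], batch_input)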

2. Adversarial Training (FGM) – Adds perturbations to the embedding layer during training to improve robustness.

# Initialize
fgm = FGM(model)
for batch_input, batch_label in data:
    # Normal forward‑backward
    loss = model(batch_input, batch_label)
    loss.backward()
    # Adversarial step
    fgm.attack()  # add perturbation to embeddings
    loss_adv = model(batch_input, batch_label)
    loss_adv.backward()  # accumulate adversarial gradients
    fgm.restore()  # remove perturbation
    optimizer.step()
    model.zero_grad()

FGM implementation:

import torch

class FGM():
    """Fast Gradient Method: perturb embedding weights along the gradient direction."""
    def __init__(self, model):
        self.model = model
        self.backup = {}

    def attack(self, epsilon=1., emb_name='emb.'):
        # emb_name should match the embedding parameter names of your model
        for name, param in self.model.named_parameters():
            if param.requires_grad and emb_name in name:
                self.backup[name] = param.data.clone()  # back up original weights
                norm = torch.norm(param.grad)
                if norm != 0 and not torch.isnan(norm):
                    r_at = epsilon * param.grad / norm   # normalized gradient step
                    param.data.add_(r_at)

    def restore(self, emb_name='emb.'):
        # Restore the backed-up embedding weights after the adversarial pass
        for name, param in self.model.named_parameters():
            if param.requires_grad and emb_name in name:
                assert name in self.backup
                param.data = self.backup[name]
        self.backup = {}

Scenario‑Limited Tricks

Prompt‑based methods (e.g., PET) – effective for zero‑shot or few‑shot tasks.

Focal loss – helps with long‑tail or rare classes.
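
A standard focal-loss formulation, shown here as a sketch for reference; gamma=2 and alpha=0.25 are common defaults, not values from the article:

import torch
import torch.nn.functional as F

def focal_loss(logits, target, gamma=2.0, alpha=0.25):
    # Down-weight easy examples: (1 - p_t)^gamma rescales the usual cross-entropy
    ce = F.cross_entropy(logits, target, reduction='none')
    p_t = torch.exp(-ce)  # probability assigned to the true class
    return (alpha * (1 - p_t) ** gamma * ce).mean()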

Mixup / CutMix – useful for data‑sensitive tasks such as audio classification.
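
A minimal mixup sketch, assuming one-hot labels and a Beta(alpha, alpha) mixing weight (names and defaults are illustrative):

import numpy as np
import torch

def mixup(x, y_onehot, alpha=0.2):
    # Blend random pairs of examples and their labels with the same weight
    lam = float(np.random.beta(alpha, alpha))
    perm = torch.randperm(x.size(0))
    mixed_x = lam * x + (1 - lam) * x[perm]
    mixed_y = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return mixed_x, mixed_y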

Modified softmax for small datasets (e.g., face recognition).
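
As one example of a modified softmax, here is a CosFace-style additive-margin layer; the scale s and margin m are common defaults and the module is my own sketch, not the article's code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MarginSoftmaxLoss(nn.Module):
    def __init__(self, feat_dim, num_classes, s=30.0, m=0.35):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.s, self.m = s, m

    def forward(self, features, labels):
        # Cosine similarity between normalized features and class weights,
        # with an additive margin subtracted from the true-class logit
        cosine = F.linear(F.normalize(features), F.normalize(self.weight))
        margin = F.one_hot(labels, cosine.size(1)).float() * self.m
        return F.cross_entropy(self.s * (cosine - margin), labels)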

Domain‑specific pre‑training – re‑train a BERT base on a specialized corpus when the target domain differs significantly.
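
A hedged sketch of continued masked-language-model pre-training with HuggingFace Transformers; domain_dataset (a pre-tokenized in-domain corpus), the base checkpoint, and the training arguments are assumptions:

from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(output_dir="domain-bert", num_train_epochs=3,
                         per_device_train_batch_size=16)
# domain_dataset: tokenized sentences from the specialized corpus (assumed to exist)
Trainer(model=model, args=args, train_dataset=domain_dataset,
        data_collator=collator).train()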

Performance‑Boosting Tricks

Mixed‑Precision Training (AMP) – Simple plug‑in that yields immediate speed gains.
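
A minimal PyTorch AMP loop, reusing the data/model/optimizer names from the FGM example above (illustrative only):

import torch

scaler = torch.cuda.amp.GradScaler()
for batch_input, batch_label in data:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(batch_input, batch_label)
    scaler.scale(loss).backward()  # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)
    scaler.update()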

Gradient Accumulation – Accumulates gradients over several forward‑backward passes before an optimizer step, allowing larger effective batch sizes.
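
A minimal gradient-accumulation loop, reusing the model/data/optimizer names from the examples above; accum_steps is an assumed hyper-parameter:

accum_steps = 4
optimizer.zero_grad()
for step, (batch_input, batch_label) in enumerate(data):
    loss = model(batch_input, batch_label) / accum_steps  # scale so gradients average out
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()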

Queue or Memory Bank – Enables very large batch equivalents; see MoCo for contrastive learning.
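
A minimal feature-queue sketch in the spirit of MoCo; the dimensions and the enqueue helper are my own assumptions:

import torch
import torch.nn.functional as F

feat_dim, queue_size = 128, 4096
queue = F.normalize(torch.randn(queue_size, feat_dim), dim=1)  # bank of negative keys
ptr = 0

@torch.no_grad()
def enqueue(keys):
    # Overwrite the oldest entries with the current batch of encoded keys
    global ptr
    idx = torch.arange(ptr, ptr + keys.size(0)) % queue_size
    queue[idx] = keys
    ptr = (ptr + keys.size(0)) % queue_size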

Skipping Unnecessary Gradient Sync – In multi-GPU DDP training, wrap the accumulation steps in no_sync() so gradients are only synchronized across GPUs when the optimizer actually steps.
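
A sketch of combining no_sync() with gradient accumulation; ddp_model (a DistributedDataParallel wrapper) and accum_steps are assumed names:

import contextlib

accum_steps = 4
for step, (batch_input, batch_label) in enumerate(data):
    # Only all-reduce gradients across GPUs on the final accumulation step
    sync_now = (step + 1) % accum_steps == 0
    context = contextlib.nullcontext() if sync_now else ddp_model.no_sync()
    with context:
        loss = ddp_model(batch_input, batch_label) / accum_steps
        loss.backward()
    if sync_now:
        optimizer.step()
        optimizer.zero_grad()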

Additional Techniques

3. R‑Drop (Consistency Regularization) – Combines standard cross‑entropy with a KL‑divergence term between two stochastic forward passes of the same input, so the model becomes robust to dropout noise.

# Training context
import torch.nn as nn
import torch.nn.functional as F

ce = nn.CrossEntropyLoss(reduction='none')
kld = nn.KLDivLoss(reduction='none')
# Two forward passes over the same batch; dropout makes them stochastic
logits1 = model(input)
logits2 = model(input)
kl_weight = 0.5
ce_loss = (ce(logits1, target) + ce(logits2, target)) / 2
kl_1 = kld(F.log_softmax(logits1, dim=-1), F.softmax(logits2, dim=-1)).sum(-1)
kl_2 = kld(F.log_softmax(logits2, dim=-1), F.softmax(logits1, dim=-1)).sum(-1)
loss = (ce_loss + kl_weight * (kl_1 + kl_2) / 2).mean()  # reduce to a scalar for backward

4. Test‑Time Augmentation (TTA) – Apply lightweight augmentations at inference, average the predictions.
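
A minimal TTA sketch that averages probabilities over a list of augmentation functions (the augmentation list is an assumption):

import torch

@torch.no_grad()
def tta_predict(model, batch_input, augments):
    # augments: lightweight transforms, e.g. identity and horizontal flip
    probs = [torch.softmax(model(aug(batch_input)), dim=-1) for aug in augments]
    return torch.stack(probs).mean(dim=0)

# e.g. preds = tta_predict(model, images, [lambda x: x, lambda x: torch.flip(x, dims=[-1])])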

5. Pseudo‑Labeling – Use a trained model to generate labels for unlabeled data, then retrain with the combined dataset.

model1.fit(train_set, label, val=validation_set)  # step1
pseudo_label = model1.predict(test_set)          # step2
new_label = concat(pseudo_label, label)           # step3
new_train_set = concat(test_set, train_set)     # step3
model2.fit(new_train_set, new_label, val=validation_set)  # step4
final_predict = model2.predict(test_set)        # step5

6. Neural‑Network Missing‑Value Imputation – Mask missing features, learn a parameter to fill them, then add back to the input (similar to TabNet).
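
A minimal sketch of the learnable missing-value filling described above: zero out missing entries and add a learned fill value at those positions (the module name is my own, not from the article):

import torch
import torch.nn as nn

class LearnedImputer(nn.Module):
    def __init__(self, num_features):
        super().__init__()
        self.fill = nn.Parameter(torch.zeros(num_features))  # learned per-feature fill values

    def forward(self, x, missing_mask):
        # missing_mask: 1 where the feature is missing, 0 where it is observed
        return x * (1 - missing_mask) + self.fill * missing_mask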

These methods generally improve model robustness or final scores, but they often increase training time, and the extra gains tend to matter most when competing near the top of a leaderboard.

Figure: illustrative diagram of dropout robustness.

Figure: classification-to-retrieval diagram.

Overall, these tricks are easy to integrate, but practitioners should weigh the extra computational cost against the marginal gains, especially when scaling to large datasets.
