Practical Deep Learning Tricks: Cyclic LR, Flooding, Warmup, RAdam, Adversarial Training, Focal Loss, Dropout, Normalization, ReLU, Group Normalization, Label Smoothing, Wasserstein GAN, Skip Connections, Weight Initialization
This article presents a concise collection of practical deep-learning techniques drawn from real-world machine-learning practice — cyclic learning rates, flooding, warmup, RAdam, adversarial training, focal loss, dropout, normalization methods, ReLU, group normalization, label smoothing, Wasserstein GAN, skip connections, and weight initialization — each with a code snippet and pointers for implementation. It also accompanies a 1090-page PDF of experience shared by major internet companies at the Global Machine Learning Technology Conference.
01 Cyclic LR – Restarting the learning rate periodically lets the model converge to several different local minima within a fixed training budget, providing diverse snapshots for ensembling.

```python
# Cosine-annealing cyclic schedule: decays from LR_INIT to LR_MIN over each
# CYCLE steps, then restarts (x is the 1-based step index).
scheduler = lambda x: ((LR_INIT - LR_MIN) / 2) * (np.cos(np.pi * (np.mod(x - 1, CYCLE) / CYCLE)) + 1) + LR_MIN
```
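A minimal, self-contained sketch of the schedule above (the values of `LR_INIT`, `LR_MIN`, and `CYCLE` are illustrative, not from the article):

```python
import numpy as np

LR_INIT, LR_MIN, CYCLE = 0.1, 0.001, 100  # illustrative hyperparameters

def cyclic_lr(step):
    """Cosine schedule that restarts every CYCLE steps (step is 1-based)."""
    phase = np.mod(step - 1, CYCLE) / CYCLE  # position within the cycle, in [0, 1)
    return (LR_INIT - LR_MIN) / 2 * (np.cos(np.pi * phase) + 1) + LR_MIN

# The LR starts at LR_INIT at the beginning of every cycle ...
assert np.isclose(cyclic_lr(1), LR_INIT)
assert np.isclose(cyclic_lr(1 + CYCLE), LR_INIT)
# ... and decays toward LR_MIN just before the restart.
assert cyclic_lr(CYCLE) < cyclic_lr(CYCLE // 2) < cyclic_lr(1)
```

Each restart kicks the optimizer out of the current basin, so snapshots taken just before each restart tend to be diverse enough to ensemble.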
02 Flooding – While the training loss is above a threshold b, ordinary gradient descent is performed; once it drops below b, the gradient is reversed, so the loss hovers around the threshold. This steers the model toward a flat region of the loss landscape and can produce a double-descent curve in the test loss.

```python
# b is the flood level; gradients w.r.t. the flooded loss flip sign when loss < b.
flood = (loss - b).abs() + b
```
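A quick numerical check of that sign-flip behaviour, sketched in NumPy (the flood level b = 0.1 is illustrative):

```python
import numpy as np

def flooded(loss, b=0.1):
    """Flooding: identical to `loss` above the flood level b, mirrored below it."""
    return abs(loss - b) + b

def grad(loss, b=0.1, h=1e-6):
    """Finite-difference gradient of the flooded loss w.r.t. the raw loss."""
    return (flooded(loss + h, b) - flooded(loss - h, b)) / (2 * h)

assert np.isclose(grad(0.5), 1.0)    # above the flood level: normal descent direction
assert np.isclose(grad(0.05), -1.0)  # below it: the gradient is reversed (ascent)
```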
03 Warmup – Ramping the learning rate up over the first few epochs mitigates early over-fitting to the first mini-batches, keeps the distribution seen by the optimizer stable, and helps deep models train stably.

```python
# Inside the learning-rate schedule function:
warmup_steps = int(batches_per_epoch * 5)
warmup_lr = (initial_learning_rate * tf.cast(global_step, tf.float32)
             / tf.cast(warmup_steps, tf.float32))
return tf.cond(global_step < warmup_steps, lambda: warmup_lr, lambda: lr)
```
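The same linear-warmup logic in plain Python, as a framework-free sketch (the base LR and warmup length are illustrative):

```python
def warmup_schedule(step, base_lr=0.1, warmup_steps=500):
    """Linear warmup from 0 to base_lr over warmup_steps, then constant."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr

assert warmup_schedule(0) == 0.0                 # training starts gently
assert abs(warmup_schedule(250) - 0.05) < 1e-12  # halfway through warmup
assert warmup_schedule(10_000) == 0.1            # full LR after warmup
```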
04 RAdam – RAdam keeps exponential moving averages of the first moment (momentum) and second moment (adaptive learning rate) of each gradient coordinate, normalizing the first moment by the second to compute updates, and rectifies the adaptive learning rate in the early steps when the second-moment estimate is still unreliable.

```python
from radam import *
```
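As a rough illustration of the update rule (a sketch, not the reference implementation), a single-parameter RAdam optimizer in NumPy following the rectification from the RAdam paper:

```python
import numpy as np

def radam_optimize(grad_fn, x, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=300):
    """Minimize a scalar function with RAdam-style updates (illustrative sketch)."""
    m = v = 0.0
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g       # first moment (momentum)
        v = beta2 * v + (1 - beta2) * g * g   # second moment
        m_hat = m / (1 - beta1 ** t)
        rho_t = rho_inf - 2 * t * beta2 ** t / (1 - beta2 ** t)
        if rho_t > 4:  # variance of the adaptive LR is tractable: rectified Adam step
            r_t = np.sqrt(((rho_t - 4) * (rho_t - 2) * rho_inf)
                          / ((rho_inf - 4) * (rho_inf - 2) * rho_t))
            x -= lr * r_t * m_hat / (np.sqrt(v / (1 - beta2 ** t)) + eps)
        else:          # early steps: fall back to SGD with momentum
            x -= lr * m_hat
    return x

x_min = radam_optimize(lambda x: 2 * x, x=5.0)  # gradient of f(x) = x**2
assert abs(x_min) < 0.5                         # converges toward the minimum at 0
```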
05 Adversarial Training – Generates adversarial examples (FGSM, I-FGSM, PGD) on the fly during training, acting as a regularizer that imposes a Lipschitz constraint on the network. It can slightly reduce clean test accuracy but improves robustness.

```python
# Enable adversarial training with a single line
adversarial_training(model, 'Embedding-Token', 0.5)
```
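A minimal FGSM sketch on a toy logistic-regression model (the weights and epsilon are illustrative; the input gradient is computed analytically):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(x, y, w, b):
    """Binary cross-entropy of a logistic-regression model on one example."""
    p = sigmoid(w @ x + b)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def fgsm(x, y, w, b, eps=0.1):
    """Fast Gradient Sign Method: step in the direction that increases the loss."""
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w  # analytic gradient of the BCE loss w.r.t. the input x
    return x + eps * np.sign(grad_x)

rng = np.random.default_rng(0)
w, b = rng.normal(size=4), 0.0
x, y = rng.normal(size=4), 1.0
x_adv = fgsm(x, y, w, b)
assert bce_loss(x_adv, y, w, b) > bce_loss(x, y, w, b)  # the perturbation raises the loss
```

Training on `x_adv` alongside `x` is what gives the regularizing effect described above.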
06 Focal Loss – Addresses class imbalance by multiplying the cross-entropy loss by a modulation factor (1 - p)^G that down-weights easy, well-classified examples, focusing training on hard samples.

```python
loss = -np.log(p)         # standard cross-entropy for the true class
loss = (1 - p)**G * loss  # modulation factor with focusing parameter G (gamma)
```
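A small sketch showing the down-weighting in action (gamma = 2 is the commonly used value):

```python
import numpy as np

def focal_loss(p, gamma=2.0):
    """Focal loss for the probability p assigned to the true class."""
    return -((1 - p) ** gamma) * np.log(p)

# An easy example (p = 0.9) is down-weighted far more than a hard one (p = 0.1).
easy_ratio = focal_loss(0.9) / -np.log(0.9)  # fraction of plain CE that remains
hard_ratio = focal_loss(0.1) / -np.log(0.1)
assert easy_ratio < hard_ratio
assert np.isclose(easy_ratio, 0.01)  # (1 - 0.9)**2: only 1% of the CE loss survives
```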
07 Dropout – Randomly drops units during training to suppress over‑fitting and improve model robustness.
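A sketch of inverted dropout in NumPy (the drop probability p is illustrative); activations are scaled by 1/(1-p) at train time so no rescaling is needed at inference:

```python
import numpy as np

def dropout(x, p=0.5, seed=0):
    """Inverted dropout: zero units with probability p, scale survivors by 1/(1-p)."""
    rng = np.random.default_rng(seed)
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

x = np.ones(10_000)
out = dropout(x, p=0.5)
# Every unit is either dropped or scaled up ...
assert np.all((out == 0) | np.isclose(out, 2.0))
# ... so the expected activation is preserved.
assert abs(out.mean() - 1.0) < 0.1
```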
08 Normalization (Batch Normalization) – Normalizes each neuron's activation using the mean and variance computed over the current mini-batch, accelerating convergence and stabilizing training.

```python
x = (x - x.mean()) / x.std()
```
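A fuller forward-pass sketch with the numerical-stability epsilon and the learnable scale/shift that the one-liner above omits (gamma and beta are scalars here for simplicity):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch-norm forward pass over a batch of activations x with shape [N, D]."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # per-feature normalization
    return gamma * x_hat + beta              # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 8))
out = batch_norm(x)
assert np.allclose(out.mean(axis=0), 0.0, atol=1e-6)  # zero mean per feature
assert np.allclose(out.std(axis=0), 1.0, atol=1e-2)   # unit variance per feature
```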
09 ReLU – A simple non-linear activation that mitigates vanishing gradients, since its gradient is exactly 1 for positive inputs.

```python
x = max(x, 0)
```
10 Group Normalization – Divides the channels into groups and normalizes within each group, offering an alternative to Batch Normalization when batch sizes are small.

```python
def GroupNorm(x, gamma, beta, G, eps=1e-5):
    # x: [N, C, H, W]
    N, C, H, W = x.shape
    x = tf.reshape(x, [N, G, C // G, H, W])
    mean, var = tf.nn.moments(x, [2, 3, 4], keepdims=True)
    x = (x - mean) / tf.sqrt(var + eps)
    x = tf.reshape(x, [N, C, H, W])
    return x * gamma + beta
```
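The same computation in NumPy, a sketch that is handy for checking the shapes and per-group statistics:

```python
import numpy as np

def group_norm(x, gamma, beta, G, eps=1e-5):
    """Group normalization over an [N, C, H, W] tensor with G channel groups."""
    N, C, H, W = x.shape
    g = x.reshape(N, G, C // G, H, W)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)   # statistics are per (sample, group)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    g = (g - mean) / np.sqrt(var + eps)
    return g.reshape(N, C, H, W) * gamma + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8, 4, 4))
out = group_norm(x, gamma=1.0, beta=0.0, G=4)
# Each (sample, group) block is normalized to zero mean, unit variance.
blocks = out.reshape(2, 4, 2, 4, 4)
assert np.allclose(blocks.mean(axis=(2, 3, 4)), 0.0, atol=1e-6)
assert np.allclose(blocks.std(axis=(2, 3, 4)), 1.0, atol=1e-2)
```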
11 Label Smoothing – Converts hard one-hot labels to soft labels, smoothing the target distribution to improve generalization and reduce over-fitting.

```python
targets = (1 - label_smooth) * targets + label_smooth / num_classes
```
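A concrete sketch over a 4-class one-hot target (the smoothing factor 0.1 is the usual default):

```python
import numpy as np

def smooth_labels(targets, label_smooth=0.1):
    """Soften one-hot targets: true class gets 1 - eps + eps/K, others eps/K."""
    num_classes = targets.shape[-1]
    return (1 - label_smooth) * targets + label_smooth / num_classes

one_hot = np.array([0.0, 0.0, 1.0, 0.0])
soft = smooth_labels(one_hot)
assert np.isclose(soft.sum(), 1.0)          # still a valid distribution
assert np.isclose(soft[2], 0.925)           # 0.9 + 0.1/4 for the true class
assert np.allclose(soft[[0, 1, 3]], 0.025)  # 0.1/4 for every other class
```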
12 Wasserstein GAN – Largely resolves GAN training instability, greatly reduces mode collapse, provides a loss that correlates with sample quality (a meaningful training metric), and works even with simple fully-connected architectures.
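WGAN's recipe over a vanilla GAN amounts to four changes: drop the critic's output sigmoid, drop the logs in the loss, clip the critic's weights to [-c, c], and use RMSProp. A sketch of the critic loss and weight clipping in NumPy (the critic here is a toy linear model and c is illustrative):

```python
import numpy as np

def critic(x, w):
    """Linear critic: no sigmoid, outputs an unbounded score."""
    return x @ w

def wgan_critic_loss(real, fake, w):
    """The critic maximizes E[D(real)] - E[D(fake)], so its loss is the negative."""
    return -(critic(real, w).mean() - critic(fake, w).mean())

def clip_weights(w, c=0.01):
    """Weight clipping crudely enforces the required Lipschitz constraint."""
    return np.clip(w, -c, c)

rng = np.random.default_rng(0)
w = clip_weights(rng.normal(size=4))
assert np.all(np.abs(w) <= 0.01)
real = rng.normal(1.0, 1.0, (32, 4))
fake = rng.normal(-1.0, 1.0, (32, 4))
loss = wgan_critic_loss(real, fake, w)  # a plain scalar; no log terms anywhere
assert np.isfinite(loss)
```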
13 Skip Connection – Adds an identity mapping so that a block learns a residual: the output is F(x) + x. This prevents degradation as networks grow deeper, because a block can fall back to the identity simply by driving F(x) toward zero.

```python
y = F(x) + x
```
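A tiny residual block in NumPy illustrating that fallback-to-identity property (the layer sizes are illustrative):

```python
import numpy as np

def residual_block(x, w1, w2):
    """Two-layer residual block: y = F(x) + x with F(x) = W2 @ relu(W1 @ x)."""
    h = np.maximum(w1 @ x, 0)  # ReLU
    return w2 @ h + x          # the skip connection adds the identity

x = np.array([1.0, -2.0, 3.0])
w_zero = np.zeros((3, 3))
# With the residual branch zeroed out, the block is exactly the identity,
# which is what keeps very deep stacks trainable.
assert np.allclose(residual_block(x, w_zero, w_zero), x)
```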
14 Weight Initialization – Proper initialization prevents all neurons from starting with identical outputs; with identical weights, every neuron would receive identical gradients and the network could never break symmetry.

```python
Embedding(embeddings_initializer=word2vec_emb, input_dim=2009, output_dim=DOTA)
```
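One common scheme is He initialization for ReLU networks, drawing weights from N(0, 2/fan_in); a NumPy sketch (the layer sizes are illustrative):

```python
import numpy as np

def he_init(fan_in, fan_out, rng=None):
    """He initialization: variance 2/fan_in keeps activation scale stable under ReLU."""
    rng = rng or np.random.default_rng(0)
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

w = he_init(fan_in=512, fan_out=256)
# Random, non-identical rows break the symmetry between neurons ...
assert not np.allclose(w[0], w[1])
# ... and the empirical std matches the target sqrt(2/512) closely.
target = np.sqrt(2.0 / 512)
assert abs(w.std() - target) / target < 0.05
```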
For more detailed notes and a 1090‑page PDF, send “ML2021” to the public account linked below.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.