Practical Deep Learning Training Tricks: Cyclic LR, Flooding, Warmup, RAdam, Adversarial Training, Focal Loss, Dropout, Normalization and More
This article compiles essential deep learning training techniques, including cyclic learning rates, flooding, warmup, the RAdam optimizer, adversarial training, focal loss, dropout, batch/group/weight normalization, label smoothing, Wasserstein GAN, skip connections, and weight initialization, offering a concise explanation and a ready-to-use code snippet for each method.
Cyclic LR : Periodically restart the learning rate to explore multiple local minima within a fixed time budget.
scheduler = lambda x: ((LR_INIT - LR_MIN) / 2) * (np.cos(np.pi * (np.mod(x - 1, CYCLE) / CYCLE)) + 1) + LR_MIN
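The scheduler above can be unpacked into a plain function. A minimal sketch; `LR_INIT`, `LR_MIN`, and `CYCLE` are illustrative values, not taken from the article:

```python
import numpy as np

LR_INIT, LR_MIN, CYCLE = 0.1, 0.001, 10  # illustrative hyperparameters

def cyclic_lr(step):
    # Cosine decay from LR_INIT down to LR_MIN, restarting every CYCLE steps.
    phase = np.mod(step - 1, CYCLE) / CYCLE
    return (LR_INIT - LR_MIN) / 2 * (np.cos(np.pi * phase) + 1) + LR_MIN
```

At the first step of each cycle the learning rate jumps back to `LR_INIT`, then anneals toward `LR_MIN`; each restart gives the optimizer a chance to escape the current basin.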
Flooding : Keep the training loss hovering around a predefined threshold b to encourage a "random walk" into flatter regions of the loss landscape, improving test-loss stability.
flood = (loss - b).abs() + b
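The same trick expressed in NumPy (a sketch; the original snippet is PyTorch-style). When the loss falls below the flood level b, the sign of the gradient flips, so the optimizer ascends until the loss is back above b:

```python
import numpy as np

def flooded_loss(loss, b):
    # If loss > b this is just loss; if loss < b it becomes 2*b - loss,
    # so minimizing it pushes the loss back UP toward the flood level b.
    return np.abs(loss - b) + b
```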
Warmup : Gradually increase the learning rate in the early stage of training to avoid premature over-fitting on the first mini-batches and to stabilize the deep layers.
warmup_steps = int(batches_per_epoch * 5)
warmup_lr = (initial_learning_rate * tf.cast(global_step, tf.float32) / tf.cast(warmup_steps, tf.float32))
return tf.cond(global_step < warmup_steps, lambda: warmup_lr, lambda: lr)
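The TensorFlow snippet above implements linear warmup; the same schedule in plain Python (a framework-free sketch, function name is my own):

```python
def warmup_lr(step, base_lr, warmup_steps):
    # Ramp linearly from 0 to base_lr over warmup_steps, then hold base_lr.
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr
```

In practice this is usually combined with a decay schedule after the warmup phase ends (e.g., the cyclic schedule above).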
RAdam : Rectified Adam. Like Adam, it keeps exponential moving averages of the first- and second-order moments of the gradient, but it additionally estimates the variance of the adaptive learning rate and rectifies it during the early steps, when too few gradient samples make the second moment unreliable.
from radam import *
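The core of the rectification is a multiplier computed from the approximated length of the gradients' simple moving average. A sketch of that term alone (function name mine; the full optimizer also maintains Adam's moment estimates):

```python
import numpy as np

def radam_rectifier(t, beta2=0.999):
    # Approximated SMA length at step t; rho_inf is its limit as t -> infinity.
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    rho_t = rho_inf - 2.0 * t * beta2**t / (1.0 - beta2**t)
    if rho_t <= 4.0:
        # Variance of the adaptive LR is undefined: fall back to an
        # un-adapted, SGD-with-momentum-style step.
        return None
    # Rectification factor (< 1 early on, approaches 1 as t grows).
    return np.sqrt(((rho_t - 4) * (rho_t - 2) * rho_inf)
                   / ((rho_inf - 4) * (rho_inf - 2) * rho_t))
```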
Adversarial Training : Generates adversarial examples (e.g., FGSM, I-FGSM, PGD) during training; acting as a regularizer, it effectively imposes a local Lipschitz constraint on the network.
adversarial_training(model, 'Embedding-Token', 0.5)
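The simplest of the attacks listed, FGSM, perturbs the input one epsilon-sized step in the direction that increases the loss. A minimal NumPy sketch (the gradient would normally come from backpropagation; here it is passed in):

```python
import numpy as np

def fgsm_perturb(x, grad, eps=0.1):
    # Fast Gradient Sign Method: move each input dimension by eps in the
    # direction of the loss gradient's sign, maximizing the loss locally.
    return x + eps * np.sign(grad)
```

I-FGSM and PGD iterate this step (PGD additionally projects back into an epsilon-ball); training on the perturbed inputs is what regularizes the model.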
Focal Loss : Mitigates class imbalance by down-weighting easy samples and focusing the loss on hard examples.
loss = -np.log(p)
loss = (1 - p)**G * loss
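Putting the two lines together as one function (a sketch for the binary case; `p` is the predicted probability of the true class, `gamma` the focusing parameter):

```python
import numpy as np

def focal_loss(p, gamma=2.0):
    # (1 - p)**gamma shrinks the loss of well-classified samples (p near 1),
    # so gradients concentrate on hard examples; gamma=0 recovers
    # plain cross-entropy.
    return -(1.0 - p)**gamma * np.log(p)
```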
Dropout : Randomly drops neurons during training to reduce over‑fitting and improve model robustness.
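A sketch of the standard "inverted dropout" formulation in NumPy (function name and the seeded generator are my own; frameworks handle the train/inference switch automatically):

```python
import numpy as np

def dropout(x, rate=0.5, training=True, seed=0):
    if not training:
        return x  # identity at inference time
    rng = np.random.default_rng(seed)
    mask = rng.random(x.shape) >= rate  # keep each unit with prob 1 - rate
    # Rescale the survivors so the expected activation is unchanged,
    # which is why no scaling is needed at inference.
    return x * mask / (1.0 - rate)
```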
Normalization : Batch Normalization normalizes each neuron using mini-batch statistics; Group Normalization divides the channels into groups and normalizes within each group, making it independent of the batch size.
def GroupNorm(x, gamma, beta, G, eps=1e-5):
    # x: input features with shape [N, C, H, W]
    # gamma, beta: scale and offset, with shape [1, C, 1, 1]
    # G: number of groups for GN
    N, C, H, W = x.shape
    x = tf.reshape(x, [N, G, C // G, H, W])
    mean, var = tf.nn.moments(x, [2, 3, 4], keep_dims=True)
    x = (x - mean) / tf.sqrt(var + eps)
    x = tf.reshape(x, [N, C, H, W])
    return x * gamma + beta
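The same Group Normalization logic translated to NumPy, which makes the per-group statistics easy to inspect (a framework-free sketch of the TF snippet above):

```python
import numpy as np

def group_norm(x, gamma, beta, G, eps=1e-5):
    # x: [N, C, H, W]; gamma, beta: [1, C, 1, 1]; G: number of groups.
    N, C, H, W = x.shape
    x = x.reshape(N, G, C // G, H, W)
    # Statistics are computed per (sample, group), never across the batch.
    mean = x.mean(axis=(2, 3, 4), keepdims=True)
    var = x.var(axis=(2, 3, 4), keepdims=True)
    x = (x - mean) / np.sqrt(var + eps)
    return x.reshape(N, C, H, W) * gamma + beta
```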
ReLU : A simple non-linear activation whose gradient is exactly 1 for positive inputs, which alleviates vanishing gradients.
x = max(x, 0)
Skip Connection : Provides an identity mapping so very deep networks do not degrade; the block only has to learn a residual on top of its input.
y = F(x) + x
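The equation as a tiny sketch: even if the learned transform `F` contributes nothing, the block still passes its input through unchanged, so stacking many blocks cannot make the network worse than the identity:

```python
import numpy as np

def residual_block(x, f):
    # f is the learned transform F; the "+ x" is the identity shortcut.
    return f(x) + x
```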
Weight Initialization : Proper initialization (e.g., non-zero, variance-scaled, or pretrained embeddings) speeds up convergence and improves final model quality.
Embedding(embeddings_initializer=word2vec_emb, input_dim=2009, output_dim=DOTA)
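A sketch of one common variance-scaled scheme, He initialization, which draws weights with variance 2/fan_in so activations keep roughly unit variance through ReLU layers (function name and seed are my own):

```python
import numpy as np

def he_init(fan_in, fan_out, seed=0):
    # He initialization: std = sqrt(2 / fan_in), suited to ReLU networks.
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
```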
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.