
Practical Deep Learning Training Tricks: Cyclic LR, Flooding, Warmup, RAdam, Adversarial Training, Focal Loss, Dropout, Normalization and More

This article compiles essential deep learning training techniques—including cyclic learning rates, flooding, warmup, RAdam optimizer, adversarial training, focal loss, dropout, batch/group/weight normalization, label smoothing, Wasserstein GAN, skip connections, and weight initialization—providing concise explanations and code snippets for each method.

DataFunTalk

The article serves as a concise collection of practical tricks for machine learning and deep learning model training, offering brief explanations and ready‑to‑use code snippets for each technique.

Cyclic LR: Periodically restart the learning rate so the optimizer can explore multiple local minima within a fixed training budget.

```python
scheduler = lambda x: ((LR_INIT - LR_MIN) / 2) * (np.cos(np.pi * (np.mod(x - 1, CYCLE) / CYCLE)) + 1) + LR_MIN
```
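As a runnable sketch of the schedule above (the peak LR, floor LR, and cycle length below are illustrative assumptions, not values from the article):

```python
import numpy as np

# Assumed hyperparameters for illustration
LR_INIT, LR_MIN, CYCLE = 0.1, 0.001, 10  # peak LR, floor LR, restart period in epochs

def cyclic_lr(epoch):
    """Cosine annealing that restarts every CYCLE epochs (epochs are 1-indexed)."""
    phase = np.pi * (np.mod(epoch - 1, CYCLE) / CYCLE)
    return (LR_INIT - LR_MIN) / 2 * (np.cos(phase) + 1) + LR_MIN

lrs = [cyclic_lr(e) for e in range(1, 2 * CYCLE + 1)]
```

The learning rate decays from `LR_INIT` to near `LR_MIN` over each cycle and snaps back to the peak at every restart.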

Flooding: Keep the training loss hovering around a predefined threshold b so the optimizer performs a "random walk" near that level; this tends to settle in flatter regions of the loss surface and stabilizes test loss.

```python
flood = (loss - b).abs() + b
```
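A minimal sketch of the trick, assuming a flood level of b = 0.05. Once the loss drops below b, the gradient sign flips and the optimizer ascends back toward b instead of driving the loss to zero:

```python
# |loss - b| + b equals loss when loss > b, and 2*b - loss when loss < b,
# so the effective loss can never be pushed below the flood level b.
def flooded_loss(loss, b=0.05):
    return abs(loss - b) + b
```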

Warmup: Gradually increase the learning rate during the first few epochs to avoid premature overfitting on early mini-batches and to stabilize the deeper layers.

```python
warmup_steps = int(batches_per_epoch * 5)
warmup_lr = (initial_learning_rate * tf.cast(global_step, tf.float32)
             / tf.cast(warmup_steps, tf.float32))
return tf.cond(global_step < warmup_steps, lambda: warmup_lr, lambda: lr)
```
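Framework-free, the same idea can be sketched as follows (`base_lr` and `warmup_steps` are assumed values; the post-warmup schedule is held constant for brevity):

```python
# Linear warmup: ramp the LR from 0 to base_lr over warmup_steps,
# then hand off to the main schedule (a constant here for simplicity).
def lr_with_warmup(step, base_lr=0.1, warmup_steps=100):
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr
```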

RAdam: Maintains exponential moving averages of the first and second moments of the gradient, normalizes the first moment by the second, and rectifies the adaptive learning rate while the variance estimate is still unreliable in the early steps.

```python
from radam import *
```
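The rectification term that distinguishes RAdam from Adam can be sketched directly from the paper's formulas (beta2 = 0.999 is the usual default, assumed here):

```python
import math

def radam_rectification(t, beta2=0.999):
    """Return RAdam's rectification factor r_t at step t, or None while the
    variance estimate is untrustworthy (fall back to un-adapted momentum)."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    rho_t = rho_inf - 2.0 * t * beta2**t / (1.0 - beta2**t)
    if rho_t <= 4.0:  # too few effective samples for a reliable variance
        return None
    return math.sqrt((rho_t - 4) * (rho_t - 2) * rho_inf /
                     ((rho_inf - 4) * (rho_inf - 2) * rho_t))
```

Early in training the factor is undefined (plain momentum is used); as t grows, r_t approaches 1 and the update converges to Adam's. Recent PyTorch releases also ship a built-in `torch.optim.RAdam`.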

Adversarial Training: Generates adversarial examples (e.g., FGSM, I-FGSM, PGD) on the fly during training; the perturbed samples act as a regularizer that effectively imposes a Lipschitz constraint on the network.

```python
adversarial_training(model, 'Embedding-Token', 0.5)
```
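A toy FGSM sketch on a linear least-squares model (all names and values here are illustrative, not from the article's framework). FGSM perturbs the input by a small step in the direction of the loss gradient's sign:

```python
import numpy as np

def fgsm_perturb(x, w, y, eps=0.1):
    """Perturb input x by eps * sign(dL/dx) for L = 0.5 * (w.x - y)^2."""
    grad_x = (w @ x - y) * w          # dL/dx for the linear model
    return x + eps * np.sign(grad_x)

w = np.array([1.0, -2.0])
x = np.array([0.5, 0.5])
x_adv = fgsm_perturb(x, w, y=0.0)     # adversarial example raises the loss
```

Training on such perturbed inputs alongside the clean ones is what gives the regularizing effect.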

Focal Loss: Mitigates class imbalance by down-weighting easy samples so the loss concentrates on hard examples.

```python
loss = -np.log(p)
loss = (1 - p) ** G * loss
```
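Numerically, with the commonly used focusing parameter gamma = 2 (an assumption; the article leaves G unspecified), a well-classified sample contributes almost nothing while a misclassified one keeps most of its cross-entropy loss:

```python
import numpy as np

def focal_loss(p, gamma=2.0):
    """Binary focal loss; p is the predicted probability of the true class."""
    return -(1.0 - p) ** gamma * np.log(p)

easy = focal_loss(0.9)   # confident, correct prediction: heavily down-weighted
hard = focal_loss(0.1)   # badly wrong prediction: dominates the loss
```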

Dropout: Randomly drops neurons during training to reduce overfitting and improve model robustness.
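A minimal sketch of inverted dropout, the variant used by most frameworks: surviving activations are rescaled by 1/keep_prob at training time so inference needs no correction (keep_prob = 0.8 is an assumed value):

```python
import numpy as np

def dropout(x, keep_prob=0.8, rng=None):
    """Zero out each activation with probability 1 - keep_prob and rescale."""
    if rng is None:
        rng = np.random.default_rng(0)
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

x = np.ones(10000)
out = dropout(x)  # roughly 20% zeros, expected mean still ~1.0
```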

Normalization: Batch Normalization normalizes each neuron's activations using mini-batch statistics. Group Normalization divides the channels into groups and normalizes within each group, removing the dependence on batch size.

```python
def GroupNorm(x, gamma, beta, G, eps=1e-5):
    # x: input features with shape [N, C, H, W]
    # gamma, beta: scale and offset, with shape [1, C, 1, 1]
    # G: number of groups for GN
    N, C, H, W = x.shape
    x = tf.reshape(x, [N, G, C // G, H, W])
    mean, var = tf.nn.moments(x, [2, 3, 4], keep_dims=True)
    x = (x - mean) / tf.sqrt(var + eps)
    x = tf.reshape(x, [N, C, H, W])
    return x * gamma + beta
```
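For comparison, a NumPy sketch of batch normalization at training time (shapes and values below are illustrative): each feature column is standardized with mini-batch statistics, then rescaled by learnable gamma and beta.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """x: [N, C] mini-batch; gamma, beta: per-feature scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
y = batch_norm(x, gamma=np.ones(2), beta=np.zeros(2))  # each column ~ zero mean, unit variance
```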

ReLU: A simple non-linear activation that alleviates vanishing gradients by passing positive values through unchanged.

```python
x = max(x, 0)
```

Skip Connection: Adds an identity mapping around a block so very deep networks do not degrade; the block only needs to learn the residual.

```python
y = F(x) + x
```
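A toy residual block in NumPy illustrating why the shortcut prevents degradation: even if the learned transform F collapses to zero, the input still passes through unchanged (the layer shapes here are assumptions for illustration):

```python
import numpy as np

def residual_block(x, w, b):
    """y = F(x) + x, with F a single ReLU layer for brevity."""
    fx = np.maximum(w @ x + b, 0.0)   # F(x)
    return fx + x                     # identity shortcut

x = np.array([1.0, -1.0])
w_zero = np.zeros((2, 2))             # a "dead" block: F(x) == 0 everywhere
out = residual_block(x, w_zero, np.zeros(2))  # output is just x
```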

Weight Initialization: Proper initialization (non-zero, variance-scaled) speeds up convergence and improves final model quality.

```python
Embedding(embeddings_initializer=word2vec_emb, input_dim=2009, output_dim=DOTA)
```
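A sketch of variance-scaled initialization, using He init, a common choice for ReLU networks (the layer dimensions below are assumed): the standard deviation sqrt(2 / fan_in) keeps activation variance roughly constant across depth.

```python
import numpy as np

def he_init(fan_in, fan_out, rng=None):
    """He-normal initialization: N(0, sqrt(2 / fan_in))."""
    if rng is None:
        rng = np.random.default_rng(42)
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

w = he_init(512, 256)  # zero-mean weights with std ~ sqrt(2/512) = 0.0625
```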

Tags: Optimization, deep learning, neural networks, regularization, training tricks
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
