Practical Deep Learning Tricks: Cyclic LR, Flooding, Warmup, RAdam, Adversarial Training, Focal Loss, Dropout, Normalization, ReLU, Group Normalization, Label Smoothing, Wasserstein GAN, Skip Connections, Weight Initialization
This article presents a concise collection of practical deep-learning techniques drawn from real-world machine-learning practice — cyclic learning rates, flooding, warmup, RAdam, adversarial training, focal loss, dropout, normalization methods, ReLU, group normalization, label smoothing, Wasserstein GAN, skip connections, and weight initialization — each with a code snippet and pointers for implementation. It also accompanies a 1090-page PDF of experience shared by major internet companies at the Global Machine Learning Technology Conference.
01 Cyclic LR – Restarting the learning rate periodically lets the model converge to several different local minima within a fixed training budget, providing diverse snapshots for ensembling.

```python
# Cosine-annealing cyclic schedule: decays from LR_INIT to LR_MIN over each
# CYCLE steps, then restarts (x is the 1-based step index).
scheduler = lambda x: ((LR_INIT - LR_MIN) / 2) * (np.cos(np.pi * (np.mod(x - 1, CYCLE) / CYCLE)) + 1) + LR_MIN
```
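A minimal, self-contained sketch of the schedule above (the values of `LR_INIT`, `LR_MIN`, and `CYCLE` are illustrative, not from the article):

```python
import numpy as np

LR_INIT, LR_MIN, CYCLE = 0.1, 0.001, 100  # illustrative hyperparameters

def cyclic_lr(step):
    """Cosine schedule that restarts every CYCLE steps (step is 1-based)."""
    phase = np.mod(step - 1, CYCLE) / CYCLE  # position within the cycle, in [0, 1)
    return (LR_INIT - LR_MIN) / 2 * (np.cos(np.pi * phase) + 1) + LR_MIN

# The LR starts at LR_INIT at the beginning of every cycle ...
assert np.isclose(cyclic_lr(1), LR_INIT)
assert np.isclose(cyclic_lr(1 + CYCLE), LR_INIT)
# ... and decays toward LR_MIN just before the restart.
assert cyclic_lr(CYCLE) < cyclic_lr(CYCLE // 2) < cyclic_lr(1)
```

Each restart kicks the optimizer out of the current basin, so snapshots taken just before each restart tend to be diverse enough to ensemble.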
02 Flooding – While the training loss is above a threshold b, ordinary gradient descent is performed; once it drops below b, the gradient is reversed, so the loss hovers around the threshold. This steers the model toward a flat region of the loss landscape and can produce a double-descent curve in the test loss.

```python
# b is the flood level; gradients w.r.t. the flooded loss flip sign when loss < b.
flood = (loss - b).abs() + b
```
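A quick numerical check of that sign-flip behaviour, sketched in NumPy (the flood level b = 0.1 is illustrative):

```python
import numpy as np

def flooded(loss, b=0.1):
    """Flooding: identical to `loss` above the flood level b, mirrored below it."""
    return abs(loss - b) + b

def grad(loss, b=0.1, h=1e-6):
    """Finite-difference gradient of the flooded loss w.r.t. the raw loss."""
    return (flooded(loss + h, b) - flooded(loss - h, b)) / (2 * h)

assert np.isclose(grad(0.5), 1.0)    # above the flood level: normal descent direction
assert np.isclose(grad(0.05), -1.0)  # below it: the gradient is reversed (ascent)
```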
03 Warmup – Ramping the learning rate up over the first few epochs mitigates early over-fitting to the first mini-batches, keeps the distribution seen by the optimizer stable, and helps deep models train stably.

```python
# Inside the learning-rate schedule function:
warmup_steps = int(batches_per_epoch * 5)
warmup_lr = (initial_learning_rate * tf.cast(global_step, tf.float32)
             / tf.cast(warmup_steps, tf.float32))
return tf.cond(global_step < warmup_steps, lambda: warmup_lr, lambda: lr)
```
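The same linear-warmup logic in plain Python, as a framework-free sketch (the base LR and warmup length are illustrative):

```python
def warmup_schedule(step, base_lr=0.1, warmup_steps=500):
    """Linear warmup from 0 to base_lr over warmup_steps, then constant."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr

assert warmup_schedule(0) == 0.0                 # training starts gently
assert abs(warmup_schedule(250) - 0.05) < 1e-12  # halfway through warmup
assert warmup_schedule(10_000) == 0.1            # full LR after warmup
```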
04 RAdam – RAdam keeps exponential moving averages of the first moment (momentum) and second moment (adaptive learning rate) of each gradient coordinate, normalizing the first moment by the second to compute updates, and rectifies the adaptive learning rate in the early steps when the second-moment estimate is still unreliable.

```python
from radam import *
```
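As a rough illustration of the update rule (a sketch, not the reference implementation), a single-parameter RAdam optimizer in NumPy following the rectification from the RAdam paper:

```python
import numpy as np

def radam_optimize(grad_fn, x, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=300):
    """Minimize a scalar function with RAdam-style updates (illustrative sketch)."""
    m = v = 0.0
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g       # first moment (momentum)
        v = beta2 * v + (1 - beta2) * g * g   # second moment
        m_hat = m / (1 - beta1 ** t)
        rho_t = rho_inf - 2 * t * beta2 ** t / (1 - beta2 ** t)
        if rho_t > 4:  # variance of the adaptive LR is tractable: rectified Adam step
            r_t = np.sqrt(((rho_t - 4) * (rho_t - 2) * rho_inf)
                          / ((rho_inf - 4) * (rho_inf - 2) * rho_t))
            x -= lr * r_t * m_hat / (np.sqrt(v / (1 - beta2 ** t)) + eps)
        else:          # early steps: fall back to SGD with momentum
            x -= lr * m_hat
    return x

x_min = radam_optimize(lambda x: 2 * x, x=5.0)  # gradient of f(x) = x**2
assert abs(x_min) < 0.5                         # converges toward the minimum at 0
```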
05 Adversarial Training – Generates adversarial examples (FGSM, I-FGSM, PGD) on the fly during training, acting as a regularizer that imposes a Lipschitz constraint on the network. It can slightly reduce clean test accuracy but improves robustness.

```python
# Enable adversarial training with a single line
adversarial_training(model, 'Embedding-Token', 0.5)
```
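A minimal FGSM sketch on a toy logistic-regression model (the weights and epsilon are illustrative; the input gradient is computed analytically):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(x, y, w, b):
    """Binary cross-entropy of a logistic-regression model on one example."""
    p = sigmoid(w @ x + b)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def fgsm(x, y, w, b, eps=0.1):
    """Fast Gradient Sign Method: step in the direction that increases the loss."""
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w  # analytic gradient of the BCE loss w.r.t. the input x
    return x + eps * np.sign(grad_x)

rng = np.random.default_rng(0)
w, b = rng.normal(size=4), 0.0
x, y = rng.normal(size=4), 1.0
x_adv = fgsm(x, y, w, b)
assert bce_loss(x_adv, y, w, b) > bce_loss(x, y, w, b)  # the perturbation raises the loss
```

Training on `x_adv` alongside `x` is what gives the regularizing effect described above.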
06 Focal Loss – Addresses class imbalance by multiplying the cross-entropy loss by a modulation factor (1 - p)^G that down-weights easy, well-classified examples, focusing training on hard samples.

```python
loss = -np.log(p)         # standard cross-entropy for the true class
loss = (1 - p)**G * loss  # modulation factor with focusing parameter G (gamma)
```
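A small sketch showing the down-weighting in action (gamma = 2 is the commonly used value):

```python
import numpy as np

def focal_loss(p, gamma=2.0):
    """Focal loss for the probability p assigned to the true class."""
    return -((1 - p) ** gamma) * np.log(p)

# An easy example (p = 0.9) is down-weighted far more than a hard one (p = 0.1).
easy_ratio = focal_loss(0.9) / -np.log(0.9)  # fraction of plain CE that remains
hard_ratio = focal_loss(0.1) / -np.log(0.1)
assert easy_ratio < hard_ratio
assert np.isclose(easy_ratio, 0.01)  # (1 - 0.9)**2: only 1% of the CE loss survives
```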
07 Dropout – Randomly drops units during training to suppress over‑fitting and improve model robustness.
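A sketch of inverted dropout in NumPy (the drop probability p is illustrative); activations are scaled by 1/(1-p) at train time so no rescaling is needed at inference:

```python
import numpy as np

def dropout(x, p=0.5, seed=0):
    """Inverted dropout: zero units with probability p, scale survivors by 1/(1-p)."""
    rng = np.random.default_rng(seed)
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

x = np.ones(10_000)
out = dropout(x, p=0.5)
# Every unit is either dropped or scaled up ...
assert np.all((out == 0) | np.isclose(out, 2.0))
# ... so the expected activation is preserved.
assert abs(out.mean() - 1.0) < 0.1
```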
08 Normalization (Batch Normalization) – Normalizes each neuron's activation using the mean and variance computed over the current mini-batch, accelerating convergence and stabilizing training.

```python
x = (x - x.mean()) / x.std()
```
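A fuller forward-pass sketch with the numerical-stability epsilon and the learnable scale/shift that the one-liner above omits (gamma and beta are scalars here for simplicity):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch-norm forward pass over a batch of activations x with shape [N, D]."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # per-feature normalization
    return gamma * x_hat + beta              # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 8))
out = batch_norm(x)
assert np.allclose(out.mean(axis=0), 0.0, atol=1e-6)  # zero mean per feature
assert np.allclose(out.std(axis=0), 1.0, atol=1e-2)   # unit variance per feature
```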
09 ReLU – A simple non-linear activation that mitigates vanishing gradients, since its gradient is exactly 1 for positive inputs.

```python
x = max(x, 0)
```
10 Group Normalization – Divides the channels into groups and normalizes within each group, offering an alternative to Batch Normalization when batch sizes are small.

```python
def GroupNorm(x, gamma, beta, G, eps=1e-5):
    # x: [N, C, H, W]
    N, C, H, W = x.shape
    x = tf.reshape(x, [N, G, C // G, H, W])
    mean, var = tf.nn.moments(x, [2, 3, 4], keepdims=True)
    x = (x - mean) / tf.sqrt(var + eps)
    x = tf.reshape(x, [N, C, H, W])
    return x * gamma + beta
```
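The same computation in NumPy, a sketch that is handy for checking the shapes and per-group statistics:

```python
import numpy as np

def group_norm(x, gamma, beta, G, eps=1e-5):
    """Group normalization over an [N, C, H, W] tensor with G channel groups."""
    N, C, H, W = x.shape
    g = x.reshape(N, G, C // G, H, W)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)   # statistics are per (sample, group)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    g = (g - mean) / np.sqrt(var + eps)
    return g.reshape(N, C, H, W) * gamma + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8, 4, 4))
out = group_norm(x, gamma=1.0, beta=0.0, G=4)
# Each (sample, group) block is normalized to zero mean, unit variance.
blocks = out.reshape(2, 4, 2, 4, 4)
assert np.allclose(blocks.mean(axis=(2, 3, 4)), 0.0, atol=1e-6)
assert np.allclose(blocks.std(axis=(2, 3, 4)), 1.0, atol=1e-2)
```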
11 Label Smoothing – Converts hard one-hot labels to soft labels, smoothing the target distribution to improve generalization and reduce over-fitting.

```python
targets = (1 - label_smooth) * targets + label_smooth / num_classes
```
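A concrete sketch over a 4-class one-hot target (the smoothing factor 0.1 is the usual default):

```python
import numpy as np

def smooth_labels(targets, label_smooth=0.1):
    """Soften one-hot targets: true class gets 1 - eps + eps/K, others eps/K."""
    num_classes = targets.shape[-1]
    return (1 - label_smooth) * targets + label_smooth / num_classes

one_hot = np.array([0.0, 0.0, 1.0, 0.0])
soft = smooth_labels(one_hot)
assert np.isclose(soft.sum(), 1.0)          # still a valid distribution
assert np.isclose(soft[2], 0.925)           # 0.9 + 0.1/4 for the true class
assert np.allclose(soft[[0, 1, 3]], 0.025)  # 0.1/4 for every other class
```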
12 Wasserstein GAN – Largely resolves GAN training instability, greatly reduces mode collapse, provides a loss that correlates with sample quality (a meaningful training metric), and works even with simple fully-connected architectures.
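WGAN's recipe over a vanilla GAN amounts to four changes: drop the critic's output sigmoid, drop the logs in the loss, clip the critic's weights to [-c, c], and use RMSProp. A sketch of the critic loss and weight clipping in NumPy (the critic here is a toy linear model and c is illustrative):

```python
import numpy as np

def critic(x, w):
    """Linear critic: no sigmoid, outputs an unbounded score."""
    return x @ w

def wgan_critic_loss(real, fake, w):
    """The critic maximizes E[D(real)] - E[D(fake)], so its loss is the negative."""
    return -(critic(real, w).mean() - critic(fake, w).mean())

def clip_weights(w, c=0.01):
    """Weight clipping crudely enforces the required Lipschitz constraint."""
    return np.clip(w, -c, c)

rng = np.random.default_rng(0)
w = clip_weights(rng.normal(size=4))
assert np.all(np.abs(w) <= 0.01)
real = rng.normal(1.0, 1.0, (32, 4))
fake = rng.normal(-1.0, 1.0, (32, 4))
loss = wgan_critic_loss(real, fake, w)  # a plain scalar; no log terms anywhere
assert np.isfinite(loss)
```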
13 Skip Connection – Adds an identity mapping so that a block learns a residual: the output is F(x) + x. This prevents degradation as networks grow deeper, because a block can fall back to the identity simply by driving F(x) toward zero.

```python
y = F(x) + x
```
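A tiny residual block in NumPy illustrating that fallback-to-identity property (the layer sizes are illustrative):

```python
import numpy as np

def residual_block(x, w1, w2):
    """Two-layer residual block: y = F(x) + x with F(x) = W2 @ relu(W1 @ x)."""
    h = np.maximum(w1 @ x, 0)  # ReLU
    return w2 @ h + x          # the skip connection adds the identity

x = np.array([1.0, -2.0, 3.0])
w_zero = np.zeros((3, 3))
# With the residual branch zeroed out, the block is exactly the identity,
# which is what keeps very deep stacks trainable.
assert np.allclose(residual_block(x, w_zero, w_zero), x)
```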
14 Weight Initialization – Proper initialization prevents all neurons from starting with identical outputs; with identical weights, every neuron would receive identical gradients and the network could never break symmetry.

```python
Embedding(embeddings_initializer=word2vec_emb, input_dim=2009, output_dim=DOTA)
```
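One common scheme is He initialization for ReLU networks, drawing weights from N(0, 2/fan_in); a NumPy sketch (the layer sizes are illustrative):

```python
import numpy as np

def he_init(fan_in, fan_out, rng=None):
    """He initialization: variance 2/fan_in keeps activation scale stable under ReLU."""
    rng = rng or np.random.default_rng(0)
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

w = he_init(fan_in=512, fan_out=256)
# Random, non-identical rows break the symmetry between neurons ...
assert not np.allclose(w[0], w[1])
# ... and the empirical std matches the target sqrt(2/512) closely.
target = np.sqrt(2.0 / 512)
assert abs(w.std() - target) / target < 0.05
```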
For more detailed notes and a 1090‑page PDF, send “ML2021” to the public account linked below.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.