Review: Semi‑Supervised Learning with Ladder Networks and Virtual Adversarial Training (VAT)
This article reviews Ladder Networks and the Γ‑Model for semi‑supervised learning, explains how they minimize supervised and unsupervised costs, presents experimental results on MNIST, CIFAR‑10 and SVHN, and details the Virtual Adversarial Training method with TensorFlow code.
Ladder Networks and the Γ‑Model, introduced by The Curious AI Company, Nokia Labs and Aalto University at NIPS 2015, train a model to minimize the sum of supervised and unsupervised cost functions, building on a denoising auto‑encoder architecture.
In a denoising auto‑encoder, noise is added to the input \(x\) to produce \(\tilde{x}\); the encoder maps \(\tilde{x}\) to a latent code, the decoder reconstructs \(\hat{x}\), and the loss is the reconstruction error between \(\hat{x}\) and the clean input \(x\). Because the network must undo the corruption, its intermediate latent features are encouraged to retain rich information useful for downstream tasks.
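A minimal NumPy sketch of this objective (the identity function stands in for a real encoder–decoder, so only the loss wiring is illustrated):

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, noise_std=0.3):
    """Add isotropic Gaussian noise: x_tilde = x + n, n ~ N(0, noise_std^2)."""
    return x + noise_std * rng.normal(size=x.shape)

def reconstruction_loss(x_hat, x):
    """Mean squared reconstruction error against the CLEAN input."""
    return np.mean((x_hat - x) ** 2)

x = rng.normal(size=(8, 784))          # toy batch of "images"
x_tilde = corrupt(x)                   # corrupted input fed to the network
x_hat = x_tilde                        # stand-in for decoder(encoder(x_tilde))
loss = reconstruction_loss(x_hat, x)   # target is the clean x, not x_tilde
```

The key point is that the target of the loss is the clean \(x\), so simply copying the corrupted input through cannot drive the loss to zero.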
To directly minimize differences in deep features, the same reconstruction loss is applied to the latent variable \(z\) instead of \(x\), yielding a cost that penalizes discrepancies between clean and corrupted latent representations.
The Ladder Network runs three parallel paths per layer: a clean encoder path, a corrupted encoder path with noise injected at every layer, and a decoder path that denoises the corrupted activations. At each layer a denoising function \(g\) combines the corrupted activation with the top‑down decoder signal, and that layer's denoising cost penalizes the difference between this estimate and the corresponding clean‑path activation, which serves as the reconstruction target.
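A per‑layer denoising cost can be sketched in NumPy as follows; the scalar‑gated \(g\) below is a toy stand‑in for the paper's richer learned combinator, and the weights `a`, `b`, and `lam` are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def g(z_tilde, u, a, b):
    """Toy denoising function g: mixes the lateral corrupted signal z_tilde
    with the top-down decoder signal u. The paper's combinator is learned
    per unit; fixed scalars a, b only illustrate the interface."""
    return a * z_tilde + b * u

def layer_denoising_cost(z_clean, z_hat, lam):
    """Squared error between the denoised estimate and the clean-path
    activation, weighted by a per-layer coefficient lambda."""
    return lam * np.mean((z_hat - z_clean) ** 2)

z_clean = rng.normal(size=(8, 64))                   # clean-path activation (target)
z_tilde = z_clean + 0.3 * rng.normal(size=(8, 64))   # corrupted-path activation
u = z_clean + 0.1 * rng.normal(size=(8, 64))         # top-down decoder signal (stand-in)
z_hat = g(z_tilde, u, a=0.5, b=0.5)                  # denoised estimate
cost = layer_denoising_cost(z_clean, z_hat, lam=0.1)
```

In the full model, one such cost is computed at every layer and summed into the unsupervised part of the objective.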
The Γ‑Model is a simplified Ladder Network that applies the denoising cost only at the top layer, allowing most of the decoder to be omitted while retaining the regularization benefit.
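Under the Γ‑Model the combined objective reduces to a supervised cross‑entropy term plus a single top‑layer denoising term; a hedged NumPy sketch (random stand‑ins for the network's activations, with an assumed weight `lam`):

```python
import numpy as np

rng = np.random.default_rng(2)

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy over a labeled batch (log-sum-exp stabilized)."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -np.mean(log_p[np.arange(len(labels)), labels])

logits = rng.normal(size=(4, 10))        # clean-path logits for the labeled batch
labels = np.array([0, 3, 7, 1])
z_top_clean = rng.normal(size=(16, 10))  # top-layer activations, clean path
z_top_hat = z_top_clean + 0.2 * rng.normal(size=(16, 10))  # denoised corrupted path

lam = 0.1  # weight of the single (top-layer) denoising term
total = cross_entropy(logits, labels) + lam * np.mean((z_top_hat - z_top_clean) ** 2)
```

Because only the top layer contributes a denoising cost, the lower decoder layers never need to be instantiated.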
Experimental results show that a fully‑connected MLP on MNIST with only 50 labeled examples achieves a 1.62% test error, while training with 100 labels sometimes fails to converge. On MNIST CNNs, two models (Conv‑FC and Conv‑Small) using the Γ‑Model benefit from additional convolutions, leading to more reliable convergence. On CIFAR‑10, a large CNN with the Γ‑Model reduces error by roughly 3% when trained with 4000 labels.
Virtual Adversarial Training (VAT) is a regularization technique that measures the local smoothness of the conditional label distribution around each input. It defines a Local Distributional Smoothness (LDS) loss using the KL divergence between the model’s output distribution and that of a perturbed input.
The full objective combines the standard supervised negative log‑likelihood with the average LDS over all labeled and unlabeled samples, weighted by a coefficient \(\alpha\). The perturbation \(r_{vadv}\) is obtained by power iteration: starting from a random vector, repeatedly normalize it and replace it with the gradient of the KL divergence with respect to it, then scale the final direction to magnitude \(\epsilon\).
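Following the VAT paper's notation (with \(\hat{\theta}\) denoting the current parameters treated as constants), the pieces described above can be written as:

```latex
\mathrm{LDS}(x,\theta) =
  D_{\mathrm{KL}}\!\left[\, p(y \mid x, \hat{\theta}) \;\big\|\; p(y \mid x + r_{vadv}, \theta) \,\right],
\qquad
r_{vadv} = \operatorname*{arg\,max}_{\|r\|_2 \le \epsilon}
  D_{\mathrm{KL}}\!\left[\, p(y \mid x, \hat{\theta}) \;\big\|\; p(y \mid x + r, \hat{\theta}) \,\right]

\mathcal{L} = \ell(\mathcal{D}_l, \theta)
  + \alpha \cdot \frac{1}{N_l + N_{ul}} \sum_{x \in \mathcal{D}_l \cup \mathcal{D}_{ul}} \mathrm{LDS}(x, \theta)
```

Here \(\ell(\mathcal{D}_l, \theta)\) is the supervised negative log‑likelihood over the \(N_l\) labeled samples, and the LDS term averages over both the labeled and the \(N_{ul}\) unlabeled samples.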
The following TensorFlow implementation follows the paper’s algorithm:
def get_normalized_vector(d):
    # Divide by the max absolute value first to avoid overflow, then
    # L2-normalize over all non-batch dimensions.
    d /= (1e-12 + tf.reduce_max(tf.abs(d), range(1, len(d.get_shape())), keep_dims=True))
    d /= tf.sqrt(1e-6 + tf.reduce_sum(tf.pow(d, 2.0), range(1, len(d.get_shape())), keep_dims=True))
    return d

def generate_virtual_adversarial_perturbation(x, logit, is_training=True):
    # Power iteration: start from random noise and repeatedly move d toward
    # the direction that most increases the KL divergence.
    d = tf.random_normal(shape=tf.shape(x))
    for _ in range(FLAGS.num_power_iterations):
        d = FLAGS.xi * get_normalized_vector(d)
        logit_p = logit
        logit_m = forward(x + d, update_batch_stats=False, is_training=is_training)
        dist = L.kl_divergence_with_logit(logit_p, logit_m)
        grad = tf.gradients(dist, [d], aggregation_method=2)[0]
        d = tf.stop_gradient(grad)
    # Scale the final direction to the perturbation magnitude epsilon.
    return FLAGS.epsilon * get_normalized_vector(d)

def virtual_adversarial_loss(x, logit, is_training=True, name="vat_loss"):
    r_vadv = generate_virtual_adversarial_perturbation(x, logit, is_training=is_training)
    # Treat the clean prediction as a fixed target so gradients flow only
    # through the perturbed branch.
    logit = tf.stop_gradient(logit)
    logit_p = logit
    logit_m = forward(x + r_vadv, update_batch_stats=False, is_training=is_training)
    loss = L.kl_divergence_with_logit(logit_p, logit_m)
    return tf.identity(loss, name=name)

The code repository is https://github.com/takerum/vat_tf.git. Experiments show VAT achieves a 14.82% test error on CIFAR‑10 without data augmentation, outperforming contemporary semi‑supervised methods. An ablation study varies the perturbation magnitude \(\epsilon\) while keeping \(\alpha=1\); \(\epsilon\) is the sole hyper‑parameter that needs tuning.
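The helper `L.kl_divergence_with_logit` called by the listing above is not shown; assuming it computes the standard batch‑mean KL divergence between the softmax distributions of two logit tensors, a NumPy equivalent would look like:

```python
import numpy as np

def log_softmax(logits):
    """Row-wise log-softmax, shifted by the max for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def kl_divergence_with_logit(q_logit, p_logit):
    """Batch mean of KL( softmax(q_logit) || softmax(p_logit) ), computed
    from logits rather than probabilities to avoid log-of-zero issues."""
    log_q = log_softmax(q_logit)
    log_p = log_softmax(p_logit)
    q = np.exp(log_q)
    return np.mean(np.sum(q * (log_q - log_p), axis=1))

rng = np.random.default_rng(3)
a = rng.normal(size=(5, 10))
b = a + 0.5 * rng.normal(size=(5, 10))
kl_same = kl_divergence_with_logit(a, a)   # KL(p || p) = 0
kl_diff = kl_divergence_with_logit(a, b)   # strictly positive for b != a
```

Treating the first argument's distribution as the fixed reference matches the `stop_gradient` on the clean logits in `virtual_adversarial_loss`.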
Observations indicate that LDS values are large for points near class boundaries and decrease after each model update, confirming the regularizing effect of VAT.
