Understanding GANs: Theory, Minimax Game, and Training Challenges
This article introduces Generative Adversarial Networks (GANs), explains their minimax formulation, value function, Jensen‑Shannon divergence, common variants, and practical training issues such as gradient saturation, while also previewing the next topic on Hidden Markov Models.
Introduction to Generative Adversarial Networks (GANs)
In 2014, Ian Goodfellow and his colleagues proposed GANs, a new framework for training generative models (an idea famously first sketched out over drinks in a bar). GANs quickly spread across deep learning, spawning many variants such as WGAN, InfoGAN, f‑GAN, BiGAN, DCGAN, and IRGAN.
Conceptual Analogy
The GAN framework can be likened to a Tai‑Chi diagram: the generator (G) creates data (the "yang"), while the discriminator (D) judges authenticity (the "yin"). G samples from a prior distribution, transforms it via a neural network, and produces synthetic data; D receives both real and synthetic samples and tries to distinguish them, forming a competitive pair.
Problem Statements
Three questions are posed:
Formulate the minimax value function of GANs, give the Nash equilibrium (G*, D*) and the value at equilibrium; then derive the optimal discriminator D*_G when G is fixed, and the optimal generator G*_D when D is fixed.
Explain how GANs avoid the costly probabilistic inference required by traditional generative models.
Discuss whether the ideal minimization objective is achieved in practice and what training problems arise.
Answers and Analysis
(1) Minimax Game and Value Function
The discriminator aims to assign high probability to real data and low probability to generated data, leading to a binary‑cross‑entropy loss. Assuming equal prior for real and generated samples, the loss can be expressed as:
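In the standard formulation of Goodfellow et al. (2014), this loss is:

```latex
\mathcal{L}(D) = -\frac{1}{2}\,\mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
                 -\frac{1}{2}\,\mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

where the two expectations run over real samples x and generated samples G(z), respectively, with equal weight reflecting the equal prior.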
Maximizing the corresponding value function V(G,D) yields the classic minimax objective:
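The resulting two-player objective is:

```latex
\min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
                        + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

D is trained to maximize V while G is trained to minimize it, which is what makes the game adversarial.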
Optimizing G minimizes the Jensen‑Shannon divergence between the data distribution p_data and the generator distribution p_g. At the equilibrium p_data = p_g, the optimal discriminator outputs ½ for any input and the value function equals −log 4.
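Concretely, fixing G and maximizing V pointwise over D(x) yields the optimal discriminator, and substituting it back rewrites the objective in terms of the Jensen‑Shannon divergence:

```latex
D^*_G(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}, \qquad
V(G, D^*_G) = 2\,\mathrm{JSD}\big(p_{\text{data}} \,\|\, p_g\big) - \log 4
```

Since the JSD term is non-negative and zero exactly when p_data = p_g, the minimum value −log 4 is attained at the equilibrium.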
When D is fixed at its optimum D*_G, minimizing V(G, D*_G) over G is equivalent to minimizing the Jensen‑Shannon divergence between p_data and p_g, so the optimal generator satisfies p_g = p_data, recovering the same equilibrium.
(2) Avoiding Probabilistic Inference
Traditional generative models require explicit density functions and costly marginal or conditional probability calculations. GANs bypass this by learning a deterministic mapping f: Z→X using a neural network, where Z is sampled from a simple prior. The Jacobian of f relates the distributions of Z and X, allowing the model to implicitly represent p(X) without evaluating partition functions.
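A minimal sketch of this idea, using NumPy with placeholder parameters W and b standing in for a learned network: sampling from the model is a single forward pass through a deterministic map, and no density or partition function is ever evaluated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy generator: an affine map followed by tanh.
# W and b are placeholders for parameters a real GAN would learn.
W = rng.normal(size=(2, 4))
b = np.zeros(2)

def generate(n):
    """Sample x = f(z) by pushing prior noise through the network.

    Training only ever needs such samples (judged by the
    discriminator), never an explicit likelihood p(x)."""
    z = rng.normal(size=(n, 4))   # z drawn from a simple prior over Z
    return np.tanh(z @ W.T + b)   # deterministic mapping f: Z -> X

samples = generate(1000)
print(samples.shape)  # (1000, 2)
```

The discriminator supplies the training signal by comparing such samples against real data, which is why no marginal or conditional probabilities need to be computed.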
(3) Training Challenges
Early in training the generator produces poor samples that the discriminator easily rejects, causing vanishing gradients (optimization saturation). The discriminator’s sigmoid output D(x)=σ(o(x)) yields near‑zero gradients for G when D is too strong. This hampers G’s learning.
The derivative of the generator's loss with respect to its parameters becomes almost zero, indicating that an overly powerful discriminator provides little useful gradient for improving G. Various techniques (e.g., alternative loss functions, feature matching, label smoothing) have been proposed to mitigate this issue.
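The saturation effect can be checked numerically. Writing D(x) = σ(o) for the discriminator's logit o, the original generator loss log(1 − σ(o)) has gradient −σ(o) with respect to o, while the commonly used non-saturating alternative, maximizing log σ(o), has gradient 1 − σ(o). When D confidently rejects a fake (o strongly negative), the first gradient vanishes and the second stays large:

```python
import numpy as np

def sigmoid(o):
    return 1.0 / (1.0 + np.exp(-o))

# Discriminator logit for a fake sample it confidently rejects.
o = -5.0                              # D(G(z)) = sigmoid(-5) ~ 0.0067

# Original (saturating) loss: G minimizes log(1 - D(G(z))).
grad_saturating = -sigmoid(o)         # d/do log(1 - sigma(o)) = -sigma(o)

# Non-saturating loss: G maximizes log D(G(z)).
grad_non_saturating = 1.0 - sigmoid(o)  # d/do log sigma(o) = 1 - sigma(o)

print(abs(grad_saturating))       # tiny: almost no learning signal for G
print(abs(grad_non_saturating))   # close to 1: a usable gradient
```

This is why switching the generator to maximizing log D(G(z)) is a standard first remedy for saturation.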
Next Topic Preview: Hidden Markov Models
The upcoming article will discuss Hidden Markov Models (HMMs), a classic generative model for sequence labeling tasks such as Chinese word segmentation, POS tagging, and speech recognition. It will cover how to model Chinese word segmentation with HMMs and how to train the model from a corpus.
Hulu Beijing