Understanding GANs: Theory, Minimax Game, and Training Challenges
This article introduces Generative Adversarial Networks (GANs), explains their minimax formulation, value function, Jensen‑Shannon divergence, common variants, and practical training issues such as gradient saturation, while also previewing the next topic on Hidden Markov Models.
Introduction to Generative Adversarial Networks (GANs)
In 2014, Ian Goodfellow and his colleagues proposed GANs, a new framework for training generative models (an idea famously first sketched out over drinks in a bar). GANs quickly spread across deep learning, spawning many variants such as WGAN, InfoGAN, f‑GAN, BiGAN, DCGAN, and IRGAN.
Conceptual Analogy
The GAN framework can be likened to a Tai‑Chi diagram: the generator (G) creates data (the "yang"), while the discriminator (D) judges authenticity (the "yin"). G samples from a prior distribution, transforms it via a neural network, and produces synthetic data; D receives both real and synthetic samples and tries to distinguish them, forming a competitive pair.
Problem Statements
Three questions are posed:
Formulate the minimax value function of GANs, give the Nash equilibrium (G*, D*) and the value at equilibrium; then derive the optimal discriminator D*_G when G is fixed, and the optimal generator G*_D when D is fixed.
Explain how GANs avoid the costly probabilistic inference required by traditional generative models.
Discuss whether the ideal minimization objective is achieved in practice and what training problems arise.
Answers and Analysis
(1) Minimax Game and Value Function
The discriminator aims to assign high probability to real data and low probability to generated data, leading to a binary‑cross‑entropy loss. Assuming equal prior for real and generated samples, the loss can be expressed as:
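In the standard formulation of Goodfellow et al. (2014), this loss is:

```latex
\mathcal{L}(D) = -\frac{1}{2}\,\mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
                 -\frac{1}{2}\,\mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

where the two expectations run over real samples x and generated samples G(z), respectively, with equal weight reflecting the equal prior.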
Maximizing the corresponding value function V(G,D) yields the classic minimax objective:
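The resulting two-player objective is:

```latex
\min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
                        + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

D is trained to maximize V while G is trained to minimize it, which is what makes the game adversarial.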
Optimizing G minimizes the Jensen‑Shannon divergence between the data distribution p_data and the generator distribution p_g. At the equilibrium p_data = p_g, the optimal discriminator outputs ½ for any input and the value function equals −log 4.
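Concretely, fixing G and maximizing V pointwise over D(x) yields the optimal discriminator, and substituting it back rewrites the objective in terms of the Jensen‑Shannon divergence:

```latex
D^*_G(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}, \qquad
V(G, D^*_G) = 2\,\mathrm{JSD}\big(p_{\text{data}} \,\|\, p_g\big) - \log 4
```

Since the JSD term is non-negative and zero exactly when p_data = p_g, the minimum value −log 4 is attained at the equilibrium.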
When D is fixed at its optimum D*_G, minimizing V(G, D*_G) over G is equivalent to minimizing the Jensen‑Shannon divergence between p_data and p_g, so the optimal generator satisfies p_g = p_data, recovering the same equilibrium.
(2) Avoiding Probabilistic Inference
Traditional generative models require explicit density functions and costly marginal or conditional probability calculations. GANs bypass this by learning a deterministic mapping f: Z→X using a neural network, where Z is sampled from a simple prior. The Jacobian of f relates the distributions of Z and X, allowing the model to implicitly represent p(X) without evaluating partition functions.
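A minimal sketch of this idea, using NumPy with placeholder parameters W and b standing in for a learned network: sampling from the model is a single forward pass through a deterministic map, and no density or partition function is ever evaluated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy generator: an affine map followed by tanh.
# W and b are placeholders for parameters a real GAN would learn.
W = rng.normal(size=(2, 4))
b = np.zeros(2)

def generate(n):
    """Sample x = f(z) by pushing prior noise through the network.

    Training only ever needs such samples (judged by the
    discriminator), never an explicit likelihood p(x)."""
    z = rng.normal(size=(n, 4))   # z drawn from a simple prior over Z
    return np.tanh(z @ W.T + b)   # deterministic mapping f: Z -> X

samples = generate(1000)
print(samples.shape)  # (1000, 2)
```

The discriminator supplies the training signal by comparing such samples against real data, which is why no marginal or conditional probabilities need to be computed.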
(3) Training Challenges
Early in training the generator produces poor samples that the discriminator easily rejects, causing vanishing gradients (optimization saturation). The discriminator’s sigmoid output D(x)=σ(o(x)) yields near‑zero gradients for G when D is too strong. This hampers G’s learning.
The derivative of the generator's loss with respect to its parameters becomes almost zero, indicating that an overly powerful discriminator provides little useful gradient for improving G. Various techniques (e.g., alternative loss functions, feature matching, label smoothing) have been proposed to mitigate this issue.
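The saturation effect can be checked numerically. Writing D(x) = σ(o) for the discriminator's logit o, the original generator loss log(1 − σ(o)) has gradient −σ(o) with respect to o, while the commonly used non-saturating alternative, maximizing log σ(o), has gradient 1 − σ(o). When D confidently rejects a fake (o strongly negative), the first gradient vanishes and the second stays large:

```python
import numpy as np

def sigmoid(o):
    return 1.0 / (1.0 + np.exp(-o))

# Discriminator logit for a fake sample it confidently rejects.
o = -5.0                              # D(G(z)) = sigmoid(-5) ~ 0.0067

# Original (saturating) loss: G minimizes log(1 - D(G(z))).
grad_saturating = -sigmoid(o)         # d/do log(1 - sigma(o)) = -sigma(o)

# Non-saturating loss: G maximizes log D(G(z)).
grad_non_saturating = 1.0 - sigmoid(o)  # d/do log sigma(o) = 1 - sigma(o)

print(abs(grad_saturating))       # tiny: almost no learning signal for G
print(abs(grad_non_saturating))   # close to 1: a usable gradient
```

This is why switching the generator to maximizing log D(G(z)) is a standard first remedy for saturation.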
Next Topic Preview: Hidden Markov Models
The upcoming article will discuss Hidden Markov Models (HMMs), a classic generative model for sequence labeling tasks such as Chinese word segmentation, POS tagging, and speech recognition. It will cover how to model Chinese word segmentation with HMMs and how to train the model from a corpus.
Hulu Beijing