
MCMC Demystified: Monte Carlo Basics, Metropolis-Hastings & Gibbs Sampling

Markov Chain Monte Carlo (MCMC) extends classic Monte Carlo by generating dependent samples via a Markov chain, enabling Bayesian inference through concepts like the plug‑in principle, burn‑in, asymptotic independence, and algorithms such as Metropolis‑Hastings and Gibbs sampling, while addressing convergence and effective sample size.


Markov Chain Monte Carlo (MCMC) is a powerful Monte Carlo method widely used in Bayesian inference.

Classic Monte Carlo relies on independent samples, whereas MCMC generates a sequence of dependent observations forming a Markov chain.

Monte Carlo

To understand MCMC, one must first grasp the basics of Monte Carlo methods.

When analytical calculation is infeasible, Monte Carlo approximates characteristics of a random variable X (e.g., its expectation) by generating independent samples from X's distribution.

The empirical distribution of these samples is then used for computation; it assigns probability mass 1/n to each of the n observed values.

Example: To estimate the expectation of X, we use the sample mean of the empirical distribution.
Example: To estimate the probability that X falls below a threshold, we use the proportion of generated samples below that threshold.
Example: To approximate the variance of X, we compute the sample variance.
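The three estimates above can be sketched in a few lines of NumPy. The choice of an exponential distribution and the threshold are illustrative assumptions, not part of the article:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: X ~ Exponential(rate=2), so E[X] = 0.5 and Var[X] = 0.25.
n = 100_000
samples = rng.exponential(scale=0.5, size=n)  # scale = 1/rate

mean_est = samples.mean()           # plug-in estimate of E[X]
prob_est = (samples < 1.0).mean()   # proportion below the threshold: P(X < 1)
var_est = samples.var(ddof=1)       # sample variance as an estimate of Var[X]
```

Each estimate is exactly the analogous computation carried out on the empirical distribution of the draws.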

Plug‑in Principle

Generally, any calculation on the true distribution of X can be approximated by the analogous calculation on the empirical distribution, which is easy to perform on a computer because it is a finite‑support discrete distribution.

This approach works well; as the sample size n increases, the approximation converges to the true quantity—this is the plug‑in principle.

Markov Chain Basics

MCMC works like standard Monte Carlo but the generated samples are not independent; they form a sequence with Markov dependence.

Specifically, the samples are random variables that together constitute a Markov chain.

Markov Property

A sequence is a Markov chain if, given the current value, future observations are conditionally independent of all past values.

This memoryless property means the future distribution depends only on the present state, not on the path taken to reach it.
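The memoryless property is easy to see in a simulation: at every step, the next state is drawn using only the current state. The two-state transition matrix below is a hypothetical example:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical two-state chain; P[i, j] = probability of moving from state i to j.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

def simulate(n_steps, start=0):
    """Simulate the chain: each transition depends only on the current state,
    never on how the chain arrived there."""
    state = start
    path = [state]
    for _ in range(n_steps):
        state = rng.choice(2, p=P[state])  # memoryless transition
        path.append(state)
    return path

path = simulate(1000)
```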

Conditional and Unconditional Distributions

Because of the Markov property, analysis of a chain requires only two components: the conditional probability (or density) of the next state given the current state (the transition kernel), and the unconditional distribution of the initial state.

Asymptotic Independence

Although the terms of a Markov chain are not independent in general, the chains generated by MCMC possess the following property: as the number of steps n separating two terms grows, the dependence between them diminishes, and they become asymptotically independent.

Two terms separated by a finite number of steps n are not independent, but they approach independence as n increases.

This implies that as n becomes large, the influence of the initial value on the chain’s distribution vanishes, and the chain converges to a unique stationary distribution regardless of the starting point.
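Convergence to a unique stationary distribution can be checked numerically by propagating two different starting distributions through powers of a transition matrix. The matrix below is an illustrative assumption; its stationary distribution is (5/6, 1/6):

```python
import numpy as np

# Hypothetical two-state transition matrix.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# Distribution after 50 steps from two different starting points:
# a row vector of initial probabilities times the 50th power of P.
start_a = np.array([1.0, 0.0])  # start in state 0 with certainty
start_b = np.array([0.0, 1.0])  # start in state 1 with certainty

Pn = np.linalg.matrix_power(P, 50)
dist_a = start_a @ Pn
dist_b = start_b @ Pn
# Both rows approach the unique stationary distribution (5/6, 1/6),
# regardless of the starting point.
```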

Target Distribution

The stationary distribution of the MCMC chain coincides with the target distribution from which we wish to draw samples. As n grows, the chain’s distribution becomes increasingly close to the target distribution.

Black‑Box Approach

We can view an MCMC algorithm as a black box that accepts two inputs:

Initial value.

Target distribution.

The output is a sequence of values. Although the first value may be drawn from a distribution far from the target, the similarity to the target improves as the chain progresses.

Burn‑in Sample

Because the first few samples may be far from the target distribution, it is common practice to discard the initial draws (the burn‑in period).

For example, from 110,000 generated samples, one might discard the first 10,000 and keep the remaining 100,000.

The discarded set is called the burn‑in sample. Removing it eliminates bias from samples far from the target, improving the accuracy of Monte Carlo approximations.

Correlation and Effective Sample Size

After discarding burn‑in, the retained samples closely resemble the target distribution, but they are still correlated.

Lack of independence means that correlated observations provide less information than the same number of independent observations.

Effective sample size quantifies this: a set of dependent observations may be equivalent to a smaller number of independent ones (e.g., 1,000 dependent observations might equal 100 independent ones).

Higher correlation reduces effective sample size, making MCMC approximations less accurate; thus, much effort is devoted to reducing correlation to achieve a satisfactory effective sample size.
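A common estimator of effective sample size is n / (1 + 2 Σ ρ_k), where ρ_k is the lag-k autocorrelation. The sketch below, with a simple positive-autocorrelation truncation rule, is one of several reasonable variants; the AR(1) test chain is a hypothetical example of a strongly correlated sequence:

```python
import numpy as np

def effective_sample_size(x, max_lag=100):
    """Estimate ESS as n / (1 + 2 * sum of autocorrelations),
    truncating the sum at the first negative autocorrelation."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    x = x - x.mean()
    var = x.var()
    acf_sum = 0.0
    for k in range(1, max_lag + 1):
        rho = np.dot(x[:-k], x[k:]) / (n * var)  # lag-k autocorrelation
        if rho < 0:
            break
        acf_sum += rho
    return n / (1.0 + 2.0 * acf_sum)

# An AR(1) chain with coefficient 0.9 is strongly correlated, so its ESS
# is far below the nominal 50,000 draws.
rng = np.random.default_rng(3)
ar = np.empty(50_000)
ar[0] = 0.0
for t in range(1, len(ar)):
    ar[t] = 0.9 * ar[t - 1] + rng.normal()
ess = effective_sample_size(ar)
```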

Metropolis‑Hastings Algorithm

The Metropolis‑Hastings (M‑H) algorithm is one of the most popular MCMC methods.

Let π(x) denote the density (or mass) of the target distribution from which we wish to draw samples (e.g., a posterior density in Bayesian inference).

We also define a proposal distribution q(y|x) from which we can generate candidate samples (e.g., a multivariate normal centered at the current state x).

The generated samples, the proposal values, and the target distribution share the same dimensionality.

Algorithm

The M‑H algorithm proceeds as follows, starting from an arbitrary value x₀ within the support of the target distribution:

Generate a candidate y from the proposal distribution q(y|x), where x is the current state.

Compute the acceptance probability α = min{1, [π(y) q(x|y)] / [π(x) q(y|x)]}.

Draw a uniform random variable u on [0,1].

If u ≤ α, accept the candidate (set the next state to y); otherwise, reject it and retain the current state.
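The four steps above can be sketched as a random-walk M-H sampler. The standard-normal target is an illustrative assumption; with a symmetric proposal q(y|x) = Normal(x, step²), the ratio q(x|y)/q(y|x) equals 1 and α reduces to min{1, π(y)/π(x)}. Working on the log scale avoids numerical underflow:

```python
import numpy as np

rng = np.random.default_rng(4)

def log_target(x):
    """Unnormalized log-density of the target; a standard normal here.
    Any density known up to a constant would work the same way."""
    return -0.5 * x**2

def metropolis_hastings(n_samples, x0=0.0, step=1.0):
    """Random-walk Metropolis-Hastings with a symmetric normal proposal."""
    chain = np.empty(n_samples)
    x = x0
    for i in range(n_samples):
        y = x + step * rng.normal()                # 1. propose a candidate
        log_alpha = log_target(y) - log_target(x)  # 2. log acceptance ratio
        if np.log(rng.uniform()) <= log_alpha:     # 3.-4. accept or reject
            x = y
        chain[i] = x                               # rejection repeats the state
    return chain

chain = metropolis_hastings(50_000)
```

After discarding a burn-in prefix, the retained draws should have mean near 0 and standard deviation near 1, matching the target.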

Convergence

Under certain technical conditions, the distribution of the chain converges to the target distribution π.

Moreover, for any function with finite expectation under π, a strong law of large numbers holds for the sample averages.

Advantages of the M‑H Algorithm

The strength of Metropolis‑Hastings lies in requiring only the target density up to a multiplicative constant.

In Bayesian inference, the posterior is often known only up to such a constant because the likelihood and prior are known but the normalizing constant is not.

The acceptance probability depends only on ratios of π, allowing computation without knowing the constant.

This makes M‑H a powerful tool for sampling from distributions even when their normalizing constants are unknown.

Gibbs Sampling

Gibbs sampling is another widely used MCMC algorithm.

Sufficient Condition

Suppose we wish to sample a random vector x = (x₁,…,x_k) with joint density f(x). For each component x_i, let x_{-i} denote the vector of all other components.

If we can sample from the full conditional distributions f(x_i | x_{-i}), we can construct a Gibbs sampler.

Algorithm

The Gibbs sampler starts from an arbitrary vector and iteratively updates each component:

For each i = 1,…,k, draw a new value for x_i from its conditional distribution given the current values of all other components.

Thus, at each iteration, each block is sampled from its full conditional distribution using the most recent values of the other blocks.
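A classic worked example is a bivariate normal with unit variances and correlation ρ, chosen here as an illustrative assumption: each full conditional is N(ρ · other, 1 − ρ²), so both draws in each sweep are straightforward:

```python
import numpy as np

rng = np.random.default_rng(5)

def gibbs_bivariate_normal(n_samples, rho=0.8):
    """Gibbs sampler for a bivariate normal with unit variances and
    correlation rho; each full conditional is N(rho * other, 1 - rho^2)."""
    x1, x2 = 0.0, 0.0                 # arbitrary starting vector
    s = np.sqrt(1.0 - rho**2)         # conditional standard deviation
    out = np.empty((n_samples, 2))
    for i in range(n_samples):
        x1 = rho * x2 + s * rng.normal()  # draw x1 | x2
        x2 = rho * x1 + s * rng.normal()  # draw x2 | x1 (using the new x1)
        out[i] = (x1, x2)
    return out

draws = gibbs_bivariate_normal(50_000)
```

The sample correlation of the retained draws should recover ρ, confirming that the sweep of full-conditional updates targets the joint distribution.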

Convergence

Gibbs sampling can be shown to be a special case of Metropolis‑Hastings; consequently, its samples converge to the target distribution, and an ergodic theorem for the sample mean holds.

References

https://www.statlect.com/fundamentals-of-statistics/Markov-Chain-Monte-Carlo

https://www.statlect.com/fundamentals-of-statistics/Metropolis-Hastings-algorithm

https://www.statlect.com/fundamentals-of-statistics/Markov-Chain-Monte-Carlo-diagnostics

Tags: Bayesian inference, Monte Carlo, MCMC, Markov chain, Gibbs sampling, Metropolis-Hastings
Written by

Model Perspective

Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".
