Why PINNs Training Fails: Diagnosing and Fixing Gradient Pathologies

The article explains that physics‑informed neural networks often stall because the PDE residual loss dominates the boundary‑condition loss, causing severe gradient imbalance, and presents two remedies—an adaptive loss‑weighting scheme and a modified fully‑connected architecture—that together can improve prediction accuracy by up to two orders of magnitude.


Introduction

When practitioners train physics‑informed neural networks (PINNs) on non‑trivial PDE problems, the loss curve typically drops quickly at first and then plateaus, with the PDE residual loss decreasing while the boundary‑condition loss remains large, leading to predictions that deviate dramatically from the exact solution.

Gradient Imbalance Diagnosis

Wang, Teng and Perdikaris (2021) showed that the root cause is not network capacity or data scarcity but a "gradient pathology": the gradients of the PDE residual term can be orders of magnitude larger than those of the boundary‑condition term. Consequently, the optimizer focuses on satisfying the PDE residual and ignores the boundary conditions, producing non‑unique or completely wrong solutions.

Key insight: monitor each loss component’s gradient separately rather than the total gradient.
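A minimal sketch of this diagnostic in PyTorch, assuming a model with a PDE-residual loss loss_r and a boundary loss loss_b already computed (all names here are placeholders, not the paper's code):

import torch

def per_loss_gradient_stats(model, loss_r, loss_b):
    """Compare gradient magnitudes of the two loss components separately."""
    params = list(model.parameters())
    # retain_graph=True keeps the graph alive for the second backward pass
    grads_r = torch.autograd.grad(loss_r, params, retain_graph=True)
    grads_b = torch.autograd.grad(loss_b, params, retain_graph=True)
    flat_r = torch.cat([g.abs().flatten() for g in grads_r])
    flat_b = torch.cat([g.abs().flatten() for g in grads_b])
    print(f"residual: max={flat_r.max():.2e} mean={flat_r.mean():.2e}")
    print(f"boundary: max={flat_b.max():.2e} mean={flat_b.mean():.2e}")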

Histogram visualisations on a 2‑D Helmholtz benchmark reveal that boundary‑condition gradients are sharply concentrated near zero, while PDE‑residual gradients are widely spread, confirming the imbalance.

Theoretical Analysis on a 1‑D Poisson Equation

Assuming the trained network approximates the exact solution well, the authors derive two inequalities that bound the gradients of the PDE‑residual and boundary‑condition terms. The bounds show that as the frequency of the target solution increases, the bound on the PDE‑residual gradient grows with it, whereas the boundary‑condition gradient bound stays constant. Experiments with three frequencies confirm that higher frequencies widen the gradient gap.
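For concreteness, a sketch of the 1‑D model problem in the paper's spirit (the frequency parameter is written as C here; the paper's exact constants and bounds differ):

\[
u''(x) = f(x), \quad x \in (0,1), \qquad u(0) = u(1) = 0,
\]
\[
u^\star(x) = \sin(C\pi x) \;\Longrightarrow\; f(x) = -C^2 \pi^2 \sin(C\pi x),
\]
\[
\mathcal{L}(\theta) = \underbrace{\frac{1}{N_r} \sum_{i=1}^{N_r} \bigl( u_\theta''(x_i) - f(x_i) \bigr)^2}_{\mathcal{L}_r}
+ \underbrace{\frac{1}{N_b} \sum_{j=1}^{N_b} u_\theta(x_j)^2}_{\mathcal{L}_b}.
\]

Because f carries the factor C²π², the residual term and its gradients grow with the frequency while the boundary term does not, which matches the widening gap observed experimentally.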

Adaptive Learning‑Rate Annealing (Algorithm 1)

The first remedy adjusts the weight of each loss component dynamically during training. At each step (or every few steps) the algorithm computes the ratio of the maximum absolute PDE‑residual gradient to the mean absolute gradient of each data‑fit loss term, then smooths this estimate with an exponential moving average to obtain the weight applied to that term.

import torch

def compute_adaptive_weights(model, loss_r, loss_terms, alpha=0.9, prev_lambdas=None):
    """Self-adaptive learning-rate annealing (Wang et al. 2021, Algorithm 1).

    loss_r:       scalar PDE-residual loss
    loss_terms:   list of other loss components [L_ub, L_u0, ...]
    alpha:        moving-average factor weighting the fresh estimate
    prev_lambdas: weights from the previous update, if any
    """
    params = list(model.parameters())
    # Gradients of the residual loss with respect to every parameter tensor.
    grads_r = torch.autograd.grad(loss_r, params, retain_graph=True)
    max_grad_r = max(g.abs().max().item() for g in grads_r)
    lambdas = []
    for i, loss_i in enumerate(loss_terms):
        # allow_unused=True: some parameters may not influence this term.
        grads_i = torch.autograd.grad(loss_i, params, retain_graph=True,
                                      allow_unused=True)
        # Mean over all gradient entries, not a mean of per-tensor means.
        flat_i = torch.cat([g.abs().flatten() for g in grads_i if g is not None])
        lambda_hat = max_grad_r / (flat_i.mean().item() + 1e-8)  # eps avoids 0-division
        if prev_lambdas is not None:
            # Exponential moving average smooths the noisy per-step ratio.
            lam = (1 - alpha) * prev_lambdas[i] + alpha * lambda_hat
        else:
            lam = lambda_hat
        lambdas.append(lam)
    return lambdas

The method balances the "volume" of gradient signals, analogous to Adam’s per‑parameter adaptation but applied across loss components.
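A minimal sketch of how the returned weights might be folded into a training loop (residual_loss and boundary_loss are hypothetical helpers, and model, optimizer, and num_steps are assumed to exist; refreshing every 10 steps is an arbitrary choice):

lambdas = None
for step in range(num_steps):
    optimizer.zero_grad()
    loss_r = residual_loss(model)        # hypothetical helper
    loss_terms = [boundary_loss(model)]  # hypothetical helper
    if step % 10 == 0:  # refresh the weights every few steps
        lambdas = compute_adaptive_weights(model, loss_r, loss_terms,
                                           prev_lambdas=lambdas)
    total = loss_r + sum(lam * term for lam, term in zip(lambdas, loss_terms))
    total.backward()
    optimizer.step()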

Improved Fully‑Connected Architecture

The second remedy modifies the network topology. Inspired by transformer gating, each hidden layer receives two transformed streams U and V (linear layers followed by tanh) and a gate Z that interpolates between them:

import torch
import torch.nn as nn

class ImprovedPINNArch(nn.Module):
    """Improved fully-connected architecture of Wang et al. (2021)."""
    def __init__(self, input_dim, hidden_dim, output_dim, n_layers):
        super().__init__()
        # Two encoders project the input into the gating streams U and V.
        self.U_layer = nn.Linear(input_dim, hidden_dim)
        self.V_layer = nn.Linear(input_dim, hidden_dim)
        self.hidden_layers = nn.ModuleList()
        self.hidden_layers.append(nn.Linear(input_dim, hidden_dim))
        for _ in range(n_layers - 1):
            self.hidden_layers.append(nn.Linear(hidden_dim, hidden_dim))
        self.output_layer = nn.Linear(hidden_dim, output_dim)
        self.activation = nn.Tanh()

    def forward(self, x):
        U = self.activation(self.U_layer(x))
        V = self.activation(self.V_layer(x))
        H = self.activation(self.hidden_layers[0](x))
        for layer in self.hidden_layers[1:]:
            # Each hidden layer produces a gate that blends U and V.
            Z = self.activation(layer(H))
            H = (1 - Z) * U + Z * V
        return self.output_layer(H)

This design introduces element‑wise multiplicative interactions (the gate Z) and skip‑like connections from the input through U and V, increasing expressive power while adding only two extra linear layers.
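A quick instantiation matching the Helmholtz setup in the experiments below (the sizes are illustrative, not prescribed):

model = ImprovedPINNArch(input_dim=2, hidden_dim=50, output_dim=1, n_layers=4)
x = torch.rand(128, 2)   # a batch of 2-D collocation points
u = model(x)             # predicted field values, shape (128, 1)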

Experimental Results

Helmholtz benchmark (2‑D): A standard 4‑layer × 50‑neuron network trained for 40 000 steps yields a relative error of 1.81e‑01 (≈18 %). Adding the adaptive weight scheme (M2) reduces the error to 1.27e‑02 (≈14× improvement). Combining both the weight scheme and the improved architecture (M4) achieves a relative error of 3.69e‑03 (≈49× improvement) and dramatically lowers boundary‑region errors.

Figure: Helmholtz equation, standard PINNs prediction.

Klein‑Gordon equation (non‑linear, multi‑objective): The full scheme (M4) attains a relative error of 2.81e‑03, a 64× reduction over the baseline (M1), at roughly five times the training time because of the extra gradient statistics.

Figure: Klein‑Gordon equation, per‑loss gradient histograms.

2‑D cavity flow (Re=100): When the problem is formulated with velocity‑pressure outputs, all models fail (error > 70 %). Re‑formulating with a stream‑function‑pressure representation satisfies incompressibility by construction; the full scheme (M4) reduces the error to 3.42 % and yields physically plausible flow fields.

Figure: cavity flow, stream‑function‑pressure formulation.
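The reason the reformulation enforces incompressibility: with velocities defined as u = ∂ψ/∂y and v = −∂ψ/∂x, the divergence ∂u/∂x + ∂v/∂y vanishes identically for any network output ψ. A minimal autograd sketch (psi_net is a placeholder network mapping (x, y) to a scalar ψ; x and y are column tensors of shape (N, 1)):

def velocities_from_streamfunction(psi_net, x, y):
    """Divergence-free velocity field from a scalar stream function."""
    x = x.detach().requires_grad_(True)
    y = y.detach().requires_grad_(True)
    psi = psi_net(torch.cat([x, y], dim=1))
    # u = dpsi/dy, v = -dpsi/dx  =>  du/dx + dv/dy = 0 by construction
    u = torch.autograd.grad(psi.sum(), y, create_graph=True)[0]
    v = -torch.autograd.grad(psi.sum(), x, create_graph=True)[0]
    return u, v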

Across all benchmarks, the adaptive weight scheme alone (M2) provides 10–14× error reductions, the improved architecture alone (M3) yields similar gains, and their combination (M4) consistently delivers the best performance, often improving accuracy by 50–55× compared with the baseline.

Impact and Limitations

The paper, published in SIAM Journal on Scientific Computing, established that gradient imbalance is a fundamental cause of PINNs training failure, shifting the community’s view from ad‑hoc hyper‑parameter tuning to principled diagnosis. The adaptive weighting algorithm has been adopted in libraries such as DeepXDE, and the modified architecture has been cited in numerous follow‑up works.

Limitations include: (1) the theoretical analysis relies on the assumption that the network is already close to convergence and is limited to 1‑D linear problems; (2) the heuristic choice of max‑/mean‑based ratios lacks a rigorous justification; (3) the extra gradient‑statistics computation adds overhead, especially for many loss components; (4) all experiments are 2‑D and moderate‑scale, leaving open the question of scalability to 3‑D turbulent flows.

Subsequent research extended the analysis with Neural Tangent Kernel theory, introduced GradNorm‑style multi‑task balancing, and developed causal PINNs and comprehensive training guides that together form a full toolbox for robust PINNs training.

Tags: deep learning, adaptive loss weighting, PDE, gradient pathology, PINNs