The Emerging ‘Newton’s Law’ of Deep Learning: Toward a Scientific Theory

Amid rapid scaling of large models, a new paper by researchers from UC Berkeley, Harvard, and Stanford proposes a unified "Learning Mechanics" framework that stitches together five theoretical strands—idealized solvable settings, extreme limits, empirical laws, hyperparameter theory, and universal behavior—to begin forming a scientific theory of deep learning.


Deep learning has long lacked a solid scientific foundation; leading figures such as Yann LeCun and Geoffrey Hinton have described the field as a "desert" of theory where progress relies on engineering intuition and massive experiments.

A recent collaborative paper titled There Will Be a Scientific Theory of Deep Learning (arXiv:2604.21691) systematically gathers scattered theoretical fragments from the past decade and proposes a unifying framework called Learning Mechanics, analogous to Newtonian mechanics for physical systems.

Where Did the Basic Theory Go?

Historically, breakthroughs like AlexNet, ResNet, and the Transformer emerged from practical discoveries rather than theoretical derivations, leaving researchers to tune failing models by trial and error.

The paper identifies five research strands that converge toward a single unified theory:

Solvable Idealized Settings: Under simplifying assumptions, the dynamics of neural networks can be solved exactly; for example, gradient descent on deep linear networks provably reaches a global optimum, with dynamics the authors liken to the harmonic oscillator and the hydrogen atom in physics.

Treatable Extremes: When network dimensions are pushed to extreme limits (width, depth, batch size, learning rate), behavior becomes predictable, mirroring thermodynamic limits in physics.

Empirical Laws: Cross-architecture regularities such as neural scaling laws and the Edge of Stability resemble Kepler's laws and Snell's law, suggesting universal patterns.

Hyperparameter Theory: Concepts like μP (Maximal Update Parameterization) and central flow aim to make hyperparameters transferable across model scales, akin to dimensional analysis.

Universal Behavior: Different architectures learn remarkably similar internal representations, a phenomenon comparable to critical universality in statistical mechanics.

Solvable Idealized Settings – The “Hydrogen Atom” of Neural Networks

Deep linear networks replace nonlinear activations with identity maps, reducing a multilayer perceptron to a product of matrices. In this setting researchers can prove that stochastic gradient descent (SGD) finds the global optimum and that every update step can be tracked precisely; many qualitative features (e.g., singular-value dynamics) persist in nonlinear networks.
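To make this concrete, here is a minimal numpy sketch of full-batch gradient descent on a two-layer deep linear network; the toy data, layer sizes, and hyperparameters are illustrative assumptions, not taken from the paper. It tracks the singular values of the end-to-end matrix W2·W1, the quantity whose dynamics the solvable analyses describe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task: targets generated by a fixed linear map A_true.
d_in, d_hidden, d_out, n = 20, 64, 10, 500
A_true = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)
X = rng.normal(size=(n, d_in))
Y = X @ A_true.T

# Two-layer deep linear network: f(x) = W2 @ W1 @ x (identity activations).
W1 = 0.1 * rng.normal(size=(d_hidden, d_in))
W2 = 0.1 * rng.normal(size=(d_out, d_hidden))

lr = 0.1
for step in range(2001):
    pred = X @ W1.T @ W2.T                      # network outputs, shape (n, d_out)
    err = pred - Y
    loss = 0.5 * np.mean(np.sum(err ** 2, axis=1))
    grad_prod = err.T @ X / n                   # gradient w.r.t. the product W2 @ W1
    gW2 = grad_prod @ W1.T                      # chain rule through each factor
    gW1 = W2.T @ grad_prod
    W2 -= lr * gW2
    W1 -= lr * gW1
    if step % 500 == 0:
        sv = np.linalg.svd(W2 @ W1, compute_uv=False)[:3]
        print(f"step {step:4d}  loss {loss:.4f}  top singular values {np.round(sv, 2)}")

print("target's top singular values:", np.round(np.linalg.svd(A_true, compute_uv=False)[:3], 2))
```

As training proceeds, the singular values of the product matrix approach those of the target map, which is the kind of exactly trackable behavior the "hydrogen atom" analogy refers to.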

In the infinite-width limit, the Neural Tangent Kernel (NTK) describes training dynamics as kernel regression in a fixed reproducing-kernel Hilbert space: the kernel does not change during training, a constancy the authors compare to a conserved quantity such as the Hamiltonian of a quantum harmonic oscillator.
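As a rough numerical illustration (a sketch under assumed toy settings, not the paper's experiments), the empirical NTK of a one-hidden-layer ReLU network can be computed from per-example parameter gradients; the wider the network, the less this kernel should drift during training, which is what licenses the kernel-regression description.

```python
import numpy as np

def init_net(width, d, rng):
    """One-hidden-layer ReLU net in NTK parameterization: f(x) = v @ relu(W @ x) / sqrt(width)."""
    return rng.normal(size=(width, d)), rng.normal(size=width)

def param_grad(W, v, x):
    """Gradient of the scalar output with respect to all parameters, flattened."""
    m = v.size
    pre = W @ x
    g_W = np.outer(v * (pre > 0), x) / np.sqrt(m)   # d f / d W
    g_v = np.maximum(pre, 0.0) / np.sqrt(m)         # d f / d v
    return np.concatenate([g_W.ravel(), g_v])

def empirical_ntk(W, v, X):
    """Kernel matrix Theta[i, j] = <grad_theta f(x_i), grad_theta f(x_j)>."""
    J = np.stack([param_grad(W, v, x) for x in X])
    return J @ J.T

rng = np.random.default_rng(0)
d, n = 5, 8
X = rng.normal(size=(n, d))
y = rng.normal(size=n)                              # made-up regression targets

lr = 0.5                                            # one deliberately large step size
for width in (16, 4096):                            # narrow versus very wide hidden layer
    W, v = init_net(width, d, rng)
    K_init = empirical_ntk(W, v, X)
    # One crude SGD pass on squared loss, then re-measure the kernel.
    for x_i, y_i in zip(X, y):
        f_i = v @ np.maximum(W @ x_i, 0.0) / np.sqrt(width)
        g = lr * (f_i - y_i) * param_grad(W, v, x_i)
        W -= g[: width * d].reshape(width, d)
        v -= g[width * d:]
    drift = np.linalg.norm(empirical_ntk(W, v, X) - K_init) / np.linalg.norm(K_init)
    print(f"width {width:5d}: relative NTK drift after one pass = {drift:.3f}")
```

The exact numbers will vary with the seed and step size; the point is the trend, namely that the kernel of the wide network moves far less than that of the narrow one.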


Treatable Extremes – When Networks Become "Infinite"

Analogous to thermodynamic limits, studying networks as certain dimensions grow unbounded yields analytical insight. Notable extremes include:

Width Limit (Lazy vs. Rich Regime): Wide networks either stay near their initialization (lazy training, equivalent to kernel methods) or enter a feature-learning regime where representations evolve substantially. The transition depends on width, depth, learning rate, and batch size, constituting a genuine phase transition, akin to water freezing at 0 °C.

Depth Limit: As depth → ∞, some architectures admit a continuous description in which the layer index behaves like a continuous time variable.

Batch-size Limit: Systematic differences emerge between large-batch and small-batch training.

Learning-rate Limit: Extremely small rates correspond to gradient flow (illustrated in the sketch after this list); extremely large rates trigger qualitatively new dynamics.

These limits convert discrete empirical observations into continuous, analyzable mathematical objects.
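The learning-rate limit in particular is easy to see numerically. The sketch below uses a toy quadratic objective of my own choosing (not an example from the paper) and compares plain gradient descent against the exact gradient-flow solution while holding the total "training time" η × steps fixed.

```python
import numpy as np

# Toy quadratic loss L(w) = 0.5 * w^T H w, whose gradient flow has the
# closed-form solution w(t) = exp(-H t) w0 (H is diagonal here for simplicity).
H = np.diag([3.0, 1.0, 0.3])
w0 = np.array([1.0, -2.0, 0.5])
T = 2.0                                     # total continuous "training time"

w_flow = np.exp(-np.diag(H) * T) * w0       # exact gradient-flow endpoint at time T

for eta in (0.5, 0.1, 0.01, 0.001):
    steps = int(round(T / eta))             # keep eta * steps = T fixed
    w = w0.copy()
    for _ in range(steps):
        w -= eta * (H @ w)                  # one step of plain gradient descent
    gap = np.linalg.norm(w - w_flow)
    print(f"eta = {eta:6.3f}: ||w_GD - w_flow|| = {gap:.5f}")
```

As η shrinks, the discrete iterates converge to the continuous trajectory, which is exactly the sense in which the small-learning-rate limit yields an analyzable object.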

Empirical Laws – Deep‑Learning’s Kepler Laws

Large‑scale experiments have uncovered cross‑architecture regularities:

Neural Scaling Laws: Test loss decays as a power law in compute, parameter count, or data size (loss ∝ N^−α). The exponent α varies with task and architecture, yet the power-law form holds for Transformers, ResNets, language modeling, and image classification alike; a small fitting sketch follows this list.

Edge of Stability (EoS): With large learning rates, the largest eigenvalue of the loss Hessian stabilizes near 2/η (where η is the learning rate). This mirrors self-organized criticality in sandpiles and earthquakes and is analogized to Snell's law, which describes refraction without explaining its microscopic origin.
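To show how such an exponent is typically extracted in practice, here is a small fitting sketch; the model sizes, losses, and assumed irreducible loss below are invented for illustration, not measurements from any real training run.

```python
import numpy as np

# Hypothetical (made-up) model sizes and test losses, only to show the fitting recipe
# for a scaling law of the form  loss(N) = L_inf + c * N**(-alpha).
N = np.array([1e6, 3e6, 1e7, 3e7, 1e8, 3e8, 1e9])
loss = np.array([4.10, 3.62, 3.21, 2.90, 2.66, 2.48, 2.35])

L_inf = 1.8                                 # assumed irreducible loss (a modeling choice)
x = np.log(N)
y = np.log(loss - L_inf)

# Least-squares line in log-log space: y = log(c) - alpha * x.
slope, intercept = np.polyfit(x, y, 1)
alpha, c = -slope, np.exp(intercept)
print(f"fitted exponent alpha ≈ {alpha:.3f}, prefactor c ≈ {c:.2f}")
print("extrapolated loss at N = 1e10:", L_inf + c * 1e10 ** (-alpha))
```

The same log-log regression recipe applies whether N counts parameters, tokens, or compute; what changes across tasks and architectures is the fitted exponent, not the power-law form.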


Hyperparameter Theory – Dimensional Analysis for Deep Learning

Practical training suffers from fragile hyperparameter choices. The μP framework rescales initialization and update rules so that hyperparameters transfer zero‑shot from small to large models, effectively performing dimensional analysis on the loss landscape.
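A minimal sketch of the flavor of these rules is below: given hyperparameters tuned on a small proxy model, rescale per-layer initialization and Adam learning rates with width. The exponents used here are a simplified reading of commonly cited μP rules and should be treated as assumptions rather than the authoritative parameterization.

```python
import numpy as np

def mup_style_transfer(base_width, target_width, base_lr, base_init_std):
    """Rescale per-layer init std and Adam learning rates from a small proxy model
    to a wider target model, in the spirit of muP hyperparameter transfer.

    Assumed (simplified) rules: hidden init std ~ 1/sqrt(width), output init std ~ 1/width,
    hidden and output Adam learning rates ~ 1/width, input-layer settings unchanged.
    The real muP tables are more detailed; treat these exponents as illustrative.
    """
    r = target_width / base_width
    return {
        "input_lr":        base_lr,
        "hidden_lr":       base_lr / r,
        "output_lr":       base_lr / r,
        "input_init_std":  base_init_std,
        "hidden_init_std": base_init_std / np.sqrt(r),
        "output_init_std": base_init_std / r,
    }

# Example: hyperparameters tuned on a width-256 proxy, reused at width 4096.
for name, value in mup_style_transfer(256, 4096, base_lr=1e-3, base_init_std=0.02).items():
    print(f"{name:16s} {value:.2e}")
```

The design intent is that the size of each layer's contribution to the network output stays roughly width-independent, so a learning rate tuned on the proxy remains near-optimal at the target scale.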

Related concepts include:

Central Flow: A continuous-time description of the averaged optimization trajectory at large learning rates, intended to capture the geometry of training without the step-to-step oscillations that complicate discrete analysis.

Hyperparameter Decoupling and Elimination: Proposals to reduce the number of free hyperparameters by showing that some are redundant or can be absorbed into others.

Universal Behavior – Convergent Representations Across Architectures

Empirically, vastly different networks (e.g., ResNet vs. Vision Transformer) trained on the same dataset (ImageNet) develop highly similar intermediate activations. This representation convergence extends across modalities, suggesting a universality class similar to critical phenomena in statistical physics.
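In practice this similarity is quantified with metrics such as centered kernel alignment (CKA). Below is a minimal numpy implementation of linear CKA applied to two made-up activation matrices that share an underlying factor structure; the data are synthetic stand-ins, not activations from real ResNet or ViT models.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between activation matrices
    X (n_samples, d1) and Y (n_samples, d2) computed on the same inputs."""
    X = X - X.mean(axis=0)                  # center each feature
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
n = 200
shared = rng.normal(size=(n, 30))           # latent factors both "networks" encode
acts_a = shared @ rng.normal(size=(30, 512)) + 0.1 * rng.normal(size=(n, 512))
acts_b = shared @ rng.normal(size=(30, 768)) + 0.1 * rng.normal(size=(n, 768))
unrelated = rng.normal(size=(n, 768))       # activations with no shared structure

print("CKA(similar representations)   =", round(linear_cka(acts_a, acts_b), 3))
print("CKA(unrelated representations) =", round(linear_cka(acts_a, unrelated), 3))
```

CKA is invariant to rotations and isotropic rescaling of each representation, which is why it can compare layers of very different width and architecture.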


Ten Open Questions

Analytic Theory of Non-linear Dynamics: Most solvable results apply to linear or infinite-width limits; finite-width non-linear dynamics remain largely a black box.

Origin and Breakdown of Scaling Laws: Why do power-law relationships hold, and under what conditions do they fail?

Complete Phase Diagram of Lazy vs. Rich Regimes: What does the transition region look like, and is there a third regime?

Standard Model of Hyperparameters: Can μP, central flow, and related schemes be unified into a comprehensive guide?

Mathematical Proof of Representation Convergence: Can optimization dynamics rigorously guarantee convergent internal representations?

Theoretical Upper Bounds on Generalization Error: Why do heavily over-parameterized networks avoid severe over-fitting?

Theory-Driven Architecture Design: Can first-principles derivations replace trial-and-error in architecture search?

Emergence Mechanisms for Language and Reasoning: Under what conditions do in-context learning and chain-of-thought reasoning emerge?

Physical Symmetries and Neural Inductive Biases: Do neural networks inherently encode translational, rotational, or scale invariances, or must they learn them from data?

Formal Axiomatic System for Learning Mechanics: A rigorous set of axioms comparable to Newton's laws or quantum mechanics is still needed.

These open problems outline a roadmap; solving any of them would help move deep learning out of its "alchemy" era and toward a disciplined science.
