Finally, Researchers Uncover Deep Learning’s “Newton’s Law”

A new collaborative paper from top universities proposes a unified “Learning Mechanics” framework for deep learning, outlining five research strands—from solvable idealized models and extreme limits to empirical scaling laws and hyper‑parameter theory—while drawing analogies to classical physics and highlighting ten open challenges.

Data Party THU

May 2, 2026

Learning Mechanics – a nascent scientific theory of deep learning

The paper "There Will Be a Scientific Theory of Deep Learning" (arXiv:2604.21691), authored by researchers from UC Berkeley, Harvard, Stanford and other institutions, surveys a decade of theoretical fragments and organizes them into a coherent framework called Learning Mechanics. The authors argue that deep learning is transitioning from an engineering-driven "alchemy" to a physics-inspired science.

Five inter‑related research lines

Solvable idealized settings. In deep linear networks (multilayer perceptrons with identity activations) the model reduces to a product of weight matrices. The paper describes how gradient-descent training in this setting can be solved exactly: under suitable assumptions on the data and the initialisation, training reaches a global optimum and the trajectory has a closed-form expression. Qualitative features such as the singular-value dynamics persist in nonlinear networks. The setting is likened to the hydrogen atom in quantum mechanics.
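The singular-value dynamics can be seen in a few lines of NumPy. The sketch below is an illustration, not code from the paper: it assumes whitened inputs, so the loss reduces to 0.5 * ||W2 W1 - T||_F^2 for a target map T, and it uses full-batch gradient descent from a small random initialisation. The function name train_deep_linear and all hyper-parameter values are hypothetical choices.

```python
import numpy as np

def train_deep_linear(target, hidden=8, lr=0.05, steps=2000, init_scale=1e-3, seed=0):
    """Gradient descent on 0.5 * ||W2 @ W1 - target||_F^2 (whitened inputs assumed)."""
    rng = np.random.default_rng(seed)
    d_out, d_in = target.shape
    W1 = init_scale * rng.standard_normal((hidden, d_in))
    W2 = init_scale * rng.standard_normal((d_out, hidden))
    history = []
    for _ in range(steps):
        err = W2 @ W1 - target        # residual of the end-to-end linear map
        grad_W2 = err @ W1.T          # dL/dW2
        grad_W1 = W2.T @ err          # dL/dW1
        W2 = W2 - lr * grad_W2
        W1 = W1 - lr * grad_W1
        # Record the singular values of the end-to-end map after each step.
        history.append(np.linalg.svd(W2 @ W1, compute_uv=False))
    return W1, W2, np.array(history)

if __name__ == "__main__":
    target_svals = [3.0, 1.0, 0.3]
    _, _, history = train_deep_linear(np.diag(target_svals))
    for k, s in enumerate(target_svals):
        half_step = int(np.argmax(history[:, k] > 0.5 * s))
        print(f"mode with target {s}: reaches half strength at step {half_step}")
    print("final singular values:", history[-1])
```

In this toy setup each singular value of the product W2 W1 stays near zero for a while and then rises quickly to its target, with larger target values learned earlier.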

Extreme limits. As the network width tends to infinity, the training dynamics are described exactly by the Neural Tangent Kernel (NTK). In this limit the output function

f(x; θ_t)

evolves under a time-independent kernel

Θ(x, x′) = ∇_θ f(x; θ_0)^⊤ ∇_θ f(x′; θ_0)

that remains constant during training, analogous to a conserved Hamiltonian. The authors distinguish a "lazy" (kernel-like) regime from a "rich" (feature-learning) regime and map the transition to a phase boundary that depends on width, depth, learning rate and batch size.
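A finite-width version of this kernel can be computed directly as the Gram matrix of parameter gradients. The following sketch is an assumption-laden toy, not the paper's setup: it uses a two-layer tanh network with 1/sqrt(width) output scaling, a hand-written Jacobian, and five synthetic inputs. The helper names (make_data, empirical_ntk, kernel_drift) are hypothetical. It measures how far the empirical kernel moves during training at two widths.

```python
import numpy as np

def make_data(n=5, d=3, seed=1):
    # Synthetic unit-norm inputs and smooth targets (illustrative only).
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    return X, np.sin(3.0 * X[:, 0])

def init_params(width, d_in, rng):
    return rng.standard_normal((width, d_in)), rng.standard_normal(width)

def forward(W, a, X):
    # f(x) = a . tanh(W x) / sqrt(width)
    return np.tanh(X @ W.T) @ a / np.sqrt(len(a))

def jacobian(W, a, X):
    # Rows: data points. Columns: parameters, ordered as [a, W flattened row-major].
    m = len(a)
    H = np.tanh(X @ W.T)                                   # (n, m)
    J_a = H / np.sqrt(m)                                   # df/da_j
    J_W = ((1.0 - H**2) * a)[:, :, None] * X[:, None, :] / np.sqrt(m)  # df/dW_jk
    return np.concatenate([J_a, J_W.reshape(len(X), -1)], axis=1)

def empirical_ntk(W, a, X):
    J = jacobian(W, a, X)
    return J @ J.T

def kernel_drift(width, X, y, lr=0.1, steps=200, seed=0):
    """Relative change of the empirical NTK after full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    W, a = init_params(width, X.shape[1], rng)
    K0 = empirical_ntk(W, a, X)
    for _ in range(steps):
        # Gradient of 0.5 * ||f(X) - y||^2 with respect to the flattened parameters.
        grad = jacobian(W, a, X).T @ (forward(W, a, X) - y)
        a = a - lr * grad[:width]
        W = W - lr * grad[width:].reshape(W.shape)
    K1 = empirical_ntk(W, a, X)
    return np.linalg.norm(K1 - K0) / np.linalg.norm(K0)

if __name__ == "__main__":
    X, y = make_data()
    for width in (32, 4096):
        print(f"width {width}: relative kernel change {kernel_drift(width, X, y):.4f}")
```

A smaller relative change at the larger width is the finite-width signature of the lazy regime; a kernel that moves substantially indicates feature learning.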

Empirical laws. Large‑scale experiments reveal cross‑architecture regularities:

Neural scaling laws: test loss L scales as a power law L ∝ C^{‑α} with compute C, model parameters, or data size, where the exponent α varies with task and architecture.

Edge‑of‑Stability (EoS): during training with a large learning rate η, the maximal Hessian eigenvalue λ_max automatically stabilises around 2/η, indicating a self‑organised critical point.

The paper draws analogies to Kepler’s laws and Snell’s law, respectively.
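A power law of the form L ∝ C^{-α} is a straight line in log-log coordinates, so the exponent can be estimated with an ordinary least-squares fit. The sketch below uses synthetic, noiseless numbers chosen only for illustration; they are not measurements from the paper, and fit_power_law is a hypothetical helper name.

```python
import numpy as np

def fit_power_law(compute, loss):
    """Fit loss = a * compute**(-alpha) by linear regression in log-log space."""
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
    return -slope, np.exp(intercept)   # (alpha, a)

if __name__ == "__main__":
    # Synthetic data: exponent and prefactor are made-up illustration values.
    compute = np.logspace(15, 21, 7)
    loss = 50.0 * compute ** -0.3
    alpha, prefactor = fit_power_law(compute, loss)
    print(f"estimated alpha = {alpha:.3f}, prefactor = {prefactor:.2f}")
```

On real training runs the fit is usually restricted to the range where the curve is actually linear in log-log space, since an irreducible loss floor bends the curve at large compute.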

Hyper-parameter theory. The Maximal Update Parameterization (μP) rescales initialization and per-layer update sizes so that the learning rate, weight decay and other hyper-parameters tuned on a small model transfer zero-shot across model scales. μP is presented as a dimensional-analysis tool for deep learning. Related concepts include Central Flow, a smoothed description of the optimisation trajectory that averages out its oscillations, and the broader goal of hyper-parameter decoupling or elimination.
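To make the idea of width-dependent rescaling concrete, here is a sketch of one commonly quoted formulation of μP for Adam: hidden and output weights get a learning rate proportional to 1/fan_in, input weights keep a width-independent learning rate, and output weights use an initialisation variance proportional to 1/fan_in^2. The article does not state the exact rules the paper uses, so treat these as assumptions. The names LayerScaling and mup_adam_scaling are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class LayerScaling:
    init_std: float   # standard deviation for weight initialisation
    adam_lr: float    # per-layer Adam learning rate

def mup_adam_scaling(kind, fan_in, base_fan_in, base_lr):
    """Width-dependent scaling relative to a tuned base model (assumed rules).

    kind: "input", "hidden" or "output" weight matrix.
    fan_in / base_fan_in: fan-in of the layer in the target and base model.
    """
    ratio = fan_in / base_fan_in
    if kind == "input":
        # Fan-in is the data dimension, which does not grow with width.
        return LayerScaling(init_std=1.0 / math.sqrt(fan_in), adam_lr=base_lr)
    if kind == "hidden":
        return LayerScaling(init_std=1.0 / math.sqrt(fan_in), adam_lr=base_lr / ratio)
    if kind == "output":
        # Variance ~ 1/fan_in^2, normalised to match the base model at ratio 1.
        return LayerScaling(init_std=math.sqrt(base_fan_in) / fan_in,
                            adam_lr=base_lr / ratio)
    raise ValueError(f"unknown layer kind: {kind}")

if __name__ == "__main__":
    for width in (256, 1024, 4096):
        print(width, mup_adam_scaling("hidden", width, base_fan_in=256, base_lr=1e-3))
```

The point of the design is that base_lr is tuned once on the small model and then reused unchanged, with the width ratio absorbing the scale dependence.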

Universal behaviour. Experiments show that vastly different architectures (e.g., ResNet vs. Vision Transformer) trained on the same dataset converge to highly similar internal representations, a phenomenon termed Representation Convergence or the Universal Representation Hypothesis. The authors compare this to critical universality in statistical physics, where disparate physical systems share the same scaling exponents near a phase transition.
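Comparing representations across architectures requires a similarity measure that ignores rotations and rescalings of the feature space. The article does not say which measure the paper uses; the sketch below assumes linear centered kernel alignment (CKA), one widely used choice, and applies it to random matrices standing in for layer activations.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices of shape (n_examples, n_features)."""
    X = X - X.mean(axis=0)            # center each feature
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 20))            # stand-in for one model's activations
    Q, _ = np.linalg.qr(rng.standard_normal((20, 20)))
    rotated = 2.5 * X @ Q                         # same representation, rotated and rescaled
    unrelated = rng.standard_normal((500, 20))    # independent features
    print("rotated copy:", linear_cka(X, rotated))
    print("unrelated:   ", linear_cka(X, unrelated))
```

A score near 1 for the rotated copy and a much lower score for unrelated features is the behaviour one wants from a measure used to test representation convergence.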

Key demonstrations

Deep linear networks provide an analytically tractable “hydrogen atom” of deep learning, with SGD dynamics fully characterised.

In the infinite‑width NTK limit the training dynamics reduce to kernel regression in a fixed reproducing‑kernel Hilbert space.

Scaling‑law plots (see image) confirm power‑law decay of loss with compute across models such as Transformers and ResNets.

Edge‑of‑Stability plots (see image) illustrate λ_max stabilising near 2/η for large η.

Representation‑convergence visualisations (see image) compare middle‑layer activations of a ResNet and a Vision Transformer on ImageNet, revealing near‑identical structure.
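The value 2/η in the Edge-of-Stability observation is the classical stability threshold of gradient descent on a quadratic: for loss 0.5 * λ * x^2, each step multiplies x by (1 - ηλ), which shrinks x only when λ < 2/η. The sketch below checks this on a one-dimensional quadratic. It illustrates where the threshold comes from, not the self-stabilising behaviour seen in real networks, and gd_on_quadratic is a hypothetical helper name.

```python
def gd_on_quadratic(curvature, lr, steps=50, x0=1.0):
    """Run gradient descent on 0.5 * curvature * x**2 and return the final |x|."""
    x = x0
    for _ in range(steps):
        x -= lr * curvature * x       # gradient of the quadratic is curvature * x
    return abs(x)

if __name__ == "__main__":
    lr = 0.1                          # stability threshold is 2 / lr = 20
    for curvature in (19.0, 21.0):
        print(f"curvature {curvature}: final |x| = {gd_on_quadratic(curvature, lr):.4g}")
```

Just below the threshold the iterate oscillates in sign while decaying; just above it the oscillation grows without bound.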

Open research questions

Analytic theory for nonlinear dynamics of finite‑width networks.

Fundamental origin of scaling laws and conditions under which they break.

Complete phase diagram of lazy vs. rich regimes and possible intermediate regimes.

Unified “standard model” of hyper‑parameters that integrates μP, Central Flow and other schemes.

Mathematical proof of representation convergence from optimisation dynamics.

Theoretical upper bounds on generalisation error for heavily over‑parameterised models.

First‑principles guidance for architecture design.

Mechanisms behind emergent language and reasoning abilities (in‑context learning, chain‑of‑thought).

Relationship between physical symmetries (translation, rotation, scale) and neural network inductive biases.

Formal axiomatic system for Learning Mechanics analogous to Newton’s laws or quantum postulates.

By framing deep‑learning phenomena with analogies to classical mechanics, quantum mechanics and statistical physics, the authors propose that Learning Mechanics could become the “periodic table” of neural‑network theory, providing a unified language for why networks learn, how they generalise, and what fundamental limits exist.

Source: 机器学习算法与自然语言处理, 机器之心

The original article is about 5,000 characters long, with a suggested reading time of 10 minutes.

Computers have their physics, too.
