There Will Be a Scientific Theory of Deep Learning

Deep learning has become one of the most successful engineering paradigms in modern artificial intelligence. Yet its theoretical foundations remain incomplete. We know how to train large models, but we still lack a unified scientific account of why they learn, how their internal representations form, and why their performance follows such regular patterns at scale.

The paper There Will Be a Scientific Theory of Deep Learning argues that this gap will not last forever. Its central claim is that a scientific theory of deep learning is beginning to emerge, and that this theory will look less like a collection of isolated guarantees and more like a mechanics of learning: a mathematical and empirical science of how neural networks move through parameter space during training.

This proposed field is called learning mechanics. The analogy is deliberately physical. Classical mechanics studies how objects move under forces. Learning mechanics would study how models move under gradients.

\[ F = ma \qquad\Longleftrightarrow\qquad \theta_{t+1} = \theta_t - \eta \nabla L(\theta_t) \]

A rough analogy: physics describes motion under forces; learning mechanics describes parameter motion under gradients.

From empirical craft to scientific theory

Machine learning has always had theory. Classical learning theory asks how many samples are needed to generalize. Optimization theory asks when training procedures converge. Statistical theory studies bias, variance, complexity control, and estimation. These perspectives remain essential.

But modern deep learning has changed the object of study. Deep networks are large, non-convex, overparameterized systems. They do not merely fit functions. They form internal representations, display phase-like transitions during training, exhibit strong dependence on parameterization and scale, and often obey surprisingly simple empirical laws.

A typical gradient-based learning rule already suggests the dynamical viewpoint:

\[ \theta^{(t+1)} = \theta^{(t)} - \eta \nabla L(\theta^{(t)}) \]

The parameters at step \(t+1\) are obtained by moving the current parameters in the negative gradient direction.

The architecture, dataset, objective function, initialization, and optimizer together define the system’s equations of motion. Nothing is hidden in principle: we can inspect every weight, activation, gradient, loss value, and checkpoint. The obstacle is not opacity. The obstacle is complexity.

This distinction matters. Deep learning is hard to understand not because we cannot observe it, but because the observed system is nonlinear, high-dimensional, and strongly coupled. That is precisely the kind of situation in which scientific theories often emerge through simplification, limiting regimes, macroscopic variables, and empirical laws.

What would learning mechanics explain?

A mature learning mechanics would not track every parameter individually. That would be as unhelpful as trying to understand a gas by following every molecule. Instead, it would identify the right coarse variables and explain their dynamics.

Such variables might include training loss, test loss, sharpness of the loss landscape, learned representations, feature geometry, scaling laws, optimizer-induced regularization, hyperparameter transfer, and universal behavior across architectures.

The paper proposes that learning mechanics should be fundamental, mathematical, predictive, comprehensive, intuitive, useful, and humble. The last criterion is important. Like any scientific theory, learning mechanics would have a domain of validity. It would not explain every special case. It would aim to describe the important regimes in which modern deep learning actually operates.

1. Solvable models already exist

Physics often begins with idealized systems: the harmonic oscillator, the ideal gas, the hydrogen atom, the Ising model. These systems are not the whole world, but they provide footholds for understanding it.

Deep learning theory has its own idealized systems. One important example is the deep linear network:

\[ f(x;\theta) = W_L W_{L-1} \cdots W_1 x \]

A deep linear network is linear in the input, but nonlinear in its parameters because of the product of matrices.

At first sight, this model seems too simple. There are no nonlinear activations, so the resulting function is linear in the input. But the training problem remains nonlinear in the parameters. This makes deep linear networks a useful laboratory for studying learning dynamics.

They reproduce several phenomena that also appear in more realistic neural networks: saddle-dominated loss landscapes, phase-like learning dynamics, separation of timescales, dependence on initialization, low-rank implicit bias, and edge-of-stability behavior.

One recurring lesson is that learning is often progressive. Networks tend to acquire some components of the task before others. In simplified settings, dominant modes of the signal are learned first, while weaker or more detailed modes appear later.

Simplified models do not need to reproduce all of deep learning to be useful. They are valuable when they isolate a mechanism that also appears in realistic systems.

2. Infinite limits reveal distinct regimes of learning

Modern neural networks are enormous. This may seem hostile to theory, but large systems often simplify in appropriate limits. Thermodynamics, statistical mechanics, and fluid mechanics all rely on this idea.

In deep learning, one of the most important limits is the infinite-width limit. As the number of neurons per hidden layer tends to infinity, the training dynamics may simplify. But the result depends crucially on how the network is parameterized and initialized.

The lazy regime

In the lazy regime, the parameters move very little during training. The hidden representations remain almost fixed, and the network behaves like a linear approximation around its initialization:

\[ f_{\mathrm{lin}}(x;\theta) = f(x;\theta_0) + \nabla_{\theta} f(x;\theta_0)^{\top}(\theta-\theta_0) \]

This regime is closely related to the Neural Tangent Kernel:

\[ K_{\mathrm{NTK}}(x,x') = \nabla_{\theta} f(x;\theta_0)^{\top} \nabla_{\theta} f(x';\theta_0) \]

The NTK measures similarity between two inputs through the inner product of their parameter gradients at initialization.

The lazy regime is analytically tractable. It connects neural network training to kernel regression. But it fails to capture the strongest form of representation learning, because the features barely change.

The rich regime

In the rich regime, the hidden representations evolve substantially during training. The model does not simply adjust a fixed feature map; it learns new features adapted to the data.

Regime	Parameter movement	Representations	Typical behavior
Lazy learning	Small	Nearly fixed	Kernel-like behavior
Rich learning	Substantial	Evolving	Feature learning

This distinction is one of the central insights of recent deep learning theory. It shows that choices that may appear technical, such as initialization scale or output parameterization, can determine the qualitative nature of learning.

3. Scaling laws are empirical laws of deep learning

Many scientific theories begin with empirical regularities. Kepler’s laws came before Newtonian gravity. The gas laws came before statistical mechanics. Deep learning now has its own empirical regularities, the most famous of which are scaling laws.

Across many systems, test loss decreases predictably as model size, dataset size, or compute increases. A simplified form is:

\[ L(N) \approx L_{\infty} + A N^{-\alpha} \]

Loss often follows a power law in a resource \(N\), such as parameter count, data size, or compute.

A more general expression can include compute, data, and parameter count:

\[ L(C,D,P) \approx L_{\infty} + A C^{-\alpha} + B D^{-\beta} + E P^{-\gamma} \]

This is striking. Given the complexity of modern neural networks, one might expect performance to depend on countless details in an irregular way. Instead, large-scale behavior often follows smooth, low-dimensional laws.

The major open question is why these laws arise. Possible explanations involve the structure of natural data, the effective dimension of data manifolds, feature hierarchies, feature superposition, architectural biases, and optimizer effects. A mature theory should not merely fit scaling laws after the fact. It should predict their exponents from properties of the data, architecture, and training procedure.

4. Optimization dynamics show macroscopic regularities

The loss landscape of a neural network is high-dimensional and complicated. Still, some aggregate quantities behave regularly. One important example is sharpness, often measured through the largest eigenvalue of the Hessian of the loss:

\[ \lambda_{\max}\!\left(\nabla_{\theta}^{2} L(\theta)\right) \]

During full-batch gradient descent, many networks exhibit a phenomenon called the edge of stability. Sharpness increases during training and then hovers near a stability threshold determined by the learning rate:

\[ \lambda_{\max} \approx \frac{2}{\eta} \]

If curvature becomes too large relative to the learning rate, gradient descent becomes unstable.

This suggests that optimization dynamics are not arbitrary. Even when individual parameter trajectories are too complex to describe, their macroscopic behavior may follow simple laws.

5. Hyperparameters can be disentangled

Deep learning depends on many hyperparameters: learning rate, batch size, width, depth, initialization scale, weight decay, momentum, optimizer choice, and parameterization. For practitioners, these are knobs to tune. For theory, they are system parameters that shape the dynamics.

Recent theory has begun to explain how some of these parameters interact. One example is the relation between learning rate and batch size. In certain regimes, increasing both together can preserve similar training dynamics.

Another important example is Maximal Update Parameterization, or muP. The aim is to choose a parameterization in which hyperparameters tuned on small models remain close to optimal when transferred to larger models.

\[ \eta = \eta_0 \cdot \mathrm{width}^{c} \]

A scaling rule separates a scale-independent coefficient from a width-dependent factor.

The practical implication is substantial. If a learning rate, initialization scale, or optimizer setting can be tuned on a small model and transferred to a larger model, the cost of large-scale training can be reduced. This is an example of theory already becoming useful engineering.

6. Universal phenomena appear across systems

Deep learning systems differ widely. Convolutional networks, transformers, residual networks, diffusion models, recurrent networks, and mixture-of-experts architectures all have different structures. They are trained on different modalities and objectives.

Yet the paper emphasizes growing evidence for universal behavior. Different architectures can reach similar performance when compute, data, and training recipes are controlled. Different models can learn similar internal representations. In some cases, vision and language models appear to become more representationally aligned as performance improves.

This motivates what is sometimes called the Platonic representation hypothesis: sufficiently capable models may converge toward similar representations of the underlying structure of the world, despite differences in architecture or training details.

The hypothesis remains open and contested. But if it is even partly correct, it would be important. It would suggest that neural representations are not arbitrary artifacts of individual models. They may reflect stable structure in natural data.

Learning mechanics and mechanistic interpretability

The paper draws a useful distinction between learning mechanics and mechanistic interpretability.

Mechanistic interpretability aims to reverse-engineer trained models by identifying features, circuits, mechanisms, and learned algorithms. Learning mechanics, by contrast, aims to explain the dynamics that give rise to those structures.

The analogy is clear: mechanistic interpretability is closer to the biology of deep learning, while learning mechanics would be closer to its physics.

These approaches are complementary. Interpretability provides concrete phenomena that theory should explain. Theory can clarify when interpretability assumptions are valid, such as linear representation, locality, sparsity, and compositionality.

If mechanistic interpretability asks what mechanisms a trained model contains, learning mechanics asks how those mechanisms came to exist.

Why this matters for AI safety

A theory of deep learning would also matter for AI safety. As models become more capable, we need better tools for describing how capabilities emerge, how representations form, how training interventions affect behavior, and how systems change with scale.

A scientific theory would not solve alignment by itself. But it could provide a firmer basis for evaluation, interpretability, robustness, auditing, and governance. It is hard to regulate or control a technology whose relevant variables are not well understood.

Open directions

Solvable nonlinear models

Deep linear networks and kernel models have taught us a great deal, but real deep learning involves nonlinear feature learning. A major challenge is to find simplified models that remain analytically tractable while capturing genuine nonlinear representation learning.

A theory of natural data

Neural networks succeed because real data is structured. Images, text, audio, code, and other natural signals contain hierarchy, locality, redundancy, compositionality, sparsity, symmetries, and power-law statistics. A theory of learning likely requires a theory of the structure of data.

Functional simplicity

Networks often seem biased toward simpler functions. This appears under many names: simplicity bias, spectral bias, implicit regularization, maximum-margin bias, and low-rank bias. A general definition of the functional complexity minimized by deep learning remains missing.

Hyperparameter elimination

If hyperparameters can be understood through scaling rules and limiting regimes, then some of the empirical burden of training may disappear. A long-term goal is to reduce the amount of manual tuning required to train large models.

Predicting scaling exponents

Scaling laws are useful because they are predictive. But their exponents are usually measured empirically. A deeper theory should explain where these exponents come from and predict them before large-scale training.

What to take away

The paper does not claim that the theory of deep learning is already complete. It clearly is not. Its argument is more careful: several pieces of such a theory are already visible.

There are solvable idealized models.
There are informative infinite limits.
There are empirical laws such as scaling laws and edge-of-stability behavior.
There are theories of hyperparameter scaling and transfer.
There are universal phenomena across architectures and modalities.

Together, these pieces suggest that deep learning is not merely a collection of empirical tricks. It may be a complex system governed by discoverable principles.

The central question is changing. Instead of asking only whether a model generalizes, we can ask what effective laws govern its learning. Instead of asking only what features a model contains, we can ask how those features formed. Instead of asking which learning rate works, we can ask what dynamical regime a learning rate induces.

Conclusion

Deep learning occupies an unusual historical position. It is already a powerful technology, but much of its development still relies on empirical practice. The paper argues that this is a temporary stage. A scientific theory is beginning to emerge, and it will likely look like a mechanics: a quantitative study of model motion, representation formation, scaling behavior, and training dynamics.

Deep learning will remain complex. But complexity is not the same as incomprehensibility. Physics did not begin by explaining the entire universe. It began with falling bodies, pendulums, gases, circuits, spectra, and empirical regularities.

Deep learning may now be at a similar point: the moment where engineering begins to turn into science.

Reference
Jamie Simon, Daniel Kunin, Alexander Atanasov, Enric Boix-Adsera, Blake Bordelon, Jeremy Cohen, Nikhil Ghosh, Florentin Guth, Arthur Jacot, Mason Kamb, Dhruva Karkada, Eric J. Michaud, Berkan Ottlik, Joseph Turnbull. There Will Be a Scientific Theory of Deep Learning. arXiv, 2026.