The Geometry of Gradient Descent


Gradient descent is taught as an optimization algorithm. It is better understood as a dynamical system on a manifold. The geometry of the loss landscape determines not just whether the algorithm converges, but what it converges to and how it generalizes.

Curvature and Convergence

The Hessian matrix at a critical point describes the local behavior of the loss to second order. Its eigenvalues give the curvature along each eigenvector: a positive eigenvalue means the loss curves upward in that direction (a valley), a negative eigenvalue means it curves downward (a ridge), and a zero eigenvalue means the direction is flat to second order.
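To make the eigenvalue picture concrete, here is a minimal sketch that classifies a critical point from the signs of its Hessian eigenvalues. The helper names (numerical_hessian, classify_critical_point), the finite-difference step, and the tolerance are assumptions made for this illustration, not part of any library.

```python
import numpy as np

def numerical_hessian(f, x, eps=1e-4):
    """Approximate the Hessian of f at x with central finite differences."""
    x = np.asarray(x, dtype=float)
    n = x.size
    hess = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.zeros(n)
            e_j = np.zeros(n)
            e_i[i] = eps
            e_j[j] = eps
            hess[i, j] = (
                f(x + e_i + e_j) - f(x + e_i - e_j)
                - f(x - e_i + e_j) + f(x - e_i - e_j)
            ) / (4.0 * eps ** 2)
    return hess

def classify_critical_point(f, x, tol=1e-6):
    """Label a critical point by the signs of its Hessian eigenvalues.

    Assumes x really is a critical point (zero gradient); tol is an
    illustrative threshold for treating an eigenvalue as zero.
    """
    eigvals = np.linalg.eigvalsh(numerical_hessian(f, x))
    if np.all(eigvals > tol):
        return "local minimum"
    if np.all(eigvals < -tol):
        return "local maximum"
    if np.any(np.abs(eigvals) <= tol):
        return "degenerate (flat direction)"
    return "saddle point"

if __name__ == "__main__":
    origin = [0.0, 0.0]
    print(classify_critical_point(lambda p: p[0] ** 2 + p[1] ** 2, origin))
    print(classify_critical_point(lambda p: p[0] ** 2 - p[1] ** 2, origin))
    print(classify_critical_point(lambda p: p[0] ** 2 + p[1] ** 4, origin))
```

The third function has a flat direction at the origin: the Hessian cannot tell whether it is a minimum, which is exactly where second-order information runs out.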

In high dimensions, saddle points vastly outnumber local minima. A critical point can be a local minimum only if no Hessian eigenvalue is negative, and under simple random-matrix models the probability of that shrinks at least exponentially with dimension. This is why gradient descent in deep networks rarely gets stuck at bad local minima: it stalls near saddle points, which it can eventually escape.
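The counting argument can be checked numerically under one strong assumption: that the Hessian at a random critical point behaves like a random symmetric Gaussian matrix. Real loss landscapes are not that simple, so treat this as a sketch of the intuition rather than a statement about any particular network. The function name fraction_all_positive is made up for this example.

```python
import numpy as np

def fraction_all_positive(dim, trials=2000, seed=0):
    """Estimate how often a random symmetric matrix has only positive eigenvalues.

    Each matrix is (A + A.T) / 2 with A filled with standard normal entries,
    a stand-in (an assumption of this sketch) for a Hessian at a random
    critical point.
    """
    rng = np.random.default_rng(seed)
    count = 0
    for _ in range(trials):
        a = rng.standard_normal((dim, dim))
        h = (a + a.T) / 2.0
        if np.linalg.eigvalsh(h).min() > 0.0:
            count += 1
    return count / trials

if __name__ == "__main__":
    for dim in (1, 2, 3, 4, 6):
        print(dim, fraction_all_positive(dim))
```

In one dimension the fraction is about one half. It drops quickly as the dimension grows, and by a handful of dimensions a positive-definite sample is already rare.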

The Role of Noise

Stochastic gradient descent (SGD) introduces noise that serves a geometric purpose. The noise is anisotropic: its covariance follows the covariance of the per-example gradients, so it is larger in directions where those gradients vary more. This naturally biases the algorithm toward flatter minima, which tend to generalize better.
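A small experiment shows what anisotropic noise looks like. The sketch below uses a toy linear regression with squared loss, where per-example gradients have a closed form. The dataset, the feature scales, and the function names are all assumptions chosen for illustration.

```python
import numpy as np

def make_toy_data(n=5000, seed=0):
    """Two features with very different scales, plus a little label noise."""
    rng = np.random.default_rng(seed)
    scales = np.array([3.0, 0.3])
    X = rng.standard_normal((n, 2)) * scales
    y = X @ np.array([1.0, 1.0]) + 0.1 * rng.standard_normal(n)
    return X, y

def per_example_gradients(w, X, y):
    """Gradient of 0.5 * (x @ w - y)**2 for each example, shape (n, d)."""
    residuals = X @ w - y
    return residuals[:, None] * X

def gradient_noise_covariance(w, X, y):
    """Covariance of per-example gradients at w.

    For a minibatch of B examples drawn independently, the covariance of
    the minibatch gradient is this matrix divided by B.
    """
    grads = per_example_gradients(w, X, y)
    return np.cov(grads, rowvar=False)

if __name__ == "__main__":
    X, y = make_toy_data()
    cov = gradient_noise_covariance(np.zeros(2), X, y)
    eigvals = np.linalg.eigvalsh(cov)
    print("noise covariance eigenvalues:", eigvals)
    print("anisotropy ratio:", eigvals[-1] / eigvals[0])
```

If the noise were isotropic, the two eigenvalues would be roughly equal. Here they differ by orders of magnitude, so SGD is pushed much harder along one direction than the other.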

SGD is not merely a noisy version of gradient descent. It is a different algorithm with different convergence properties, and the noise is a feature, not a bug.

Flatness and Generalization

The connection between flat minima and generalization has been debated extensively. The sharpness of a minimum, often measured by the largest eigenvalue of the Hessian, correlates with generalization performance, but the relationship is not causal in any simple sense.
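For models too large to form the Hessian explicitly, the largest eigenvalue is commonly estimated with power iteration on Hessian-vector products. The sketch below approximates those products by finite differences of the gradient. The function names and step size are assumptions of this example, and power iteration finds the eigenvalue of largest magnitude, which coincides with the largest eigenvalue only when the Hessian has no dominant negative curvature, as at a minimum.

```python
import numpy as np

def hessian_vector_product(grad_fn, w, v, eps=1e-5):
    """Approximate H @ v by a central difference of the gradient along v."""
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2.0 * eps)

def largest_hessian_eigenvalue(grad_fn, w, iters=200, seed=0):
    """Estimate the dominant Hessian eigenvalue at w by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(w.shape)
    v /= np.linalg.norm(v)
    eigval = 0.0
    for _ in range(iters):
        hv = hessian_vector_product(grad_fn, w, v)
        eigval = float(v @ hv)      # Rayleigh quotient for the current unit vector
        norm = np.linalg.norm(hv)
        if norm == 0.0:
            break                   # v lies in a flat direction; nothing to amplify
        v = hv / norm
    return eigval

if __name__ == "__main__":
    # Toy quadratic loss 0.5 * w @ A @ w, whose Hessian is A everywhere.
    A = np.diag([1.0, 4.0, 10.0])
    print(largest_hessian_eigenvalue(lambda w: A @ w, np.zeros(3)))
```

Only gradients are needed, so the same idea carries over to any model where a gradient function is available.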

What matters is not the absolute sharpness, but the sharpness relative to the scale of the parameters, because a reparameterization can rescale the weights, and with them the Hessian, without changing the function the network computes. This is where the PAC-Bayesian framework provides insight: it bounds the generalization gap in terms of the KL divergence between a posterior distribution over the learned parameters and a prior, which implicitly accounts for the geometry of the loss landscape.
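A two-parameter toy model shows why absolute sharpness is not meaningful on its own. Take the model f(x) = a * b * x fit to a target slope of 1: every point with a * b = 1 is a global minimum and computes the same function, yet the largest Hessian eigenvalue varies widely along that set. The model, the function names, and the scale-aware quantity at the end are assumptions chosen for this sketch, not a standard definition.

```python
import numpy as np

def loss(a, b):
    """Squared error of the slope a * b against the target slope 1."""
    return 0.5 * (a * b - 1.0) ** 2

def hessian_at_minimum(a, b):
    """Analytic Hessian of the loss at a point with a * b = 1.

    There the Hessian is [[b*b, 1], [1, a*a]], which equals the outer
    product of (b, a) with itself.
    """
    g = np.array([b, a], dtype=float)
    return np.outer(g, g)

def sharpness(a, b):
    """Largest Hessian eigenvalue: the usual absolute sharpness measure."""
    return float(np.linalg.eigvalsh(hessian_at_minimum(a, b)).max())

def scaled_sharpness(a, b):
    """Curvature weighted by squared parameter size (illustrative choice)."""
    h = hessian_at_minimum(a, b)
    return float(a ** 2 * h[0, 0] + b ** 2 * h[1, 1])

if __name__ == "__main__":
    for a, b in [(1.0, 1.0), (10.0, 0.1)]:
        print(a, b, loss(a, b), sharpness(a, b), scaled_sharpness(a, b))
```

Both points have zero loss and represent the same function, but the absolute sharpness is 2 at (1, 1) and about 100 at (10, 0.1). The parameter-weighted quantity stays at 2 in both cases, which is the sense in which sharpness has to be read relative to parameter scale.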


Understanding gradient descent requires understanding the space it moves through. The algorithm is simple. The landscape is not.

