This paper shows that a perturbed form of
gradient
descent converges to a second-order stationary point in a number
iterations which depends only poly-logarithmically on dimension (i.e.,
it is almost "dimension-free"). The convergence rate of this procedure
matches the well-known convergence rate of gradient descent to
first-order stationary points, up to log factors. When all saddle
points are non-degenerate, all second-order stationary points are local
minima, and our result thus shows that perturbed gradient descent can
escape saddle points almost for free. Our results can be directly
applied to many machine learning applications, including deep learning.
As a particular concrete example of such an application, we show that
our results can be used directly to establish sharp global convergence
rates for matrix factorization. Our results rely on a novel
characterization of the geometry around saddle points, which may be of
independent interest to the non-convex optimization community.