Rethinking Learning Rate Schedules for Stochastic Optimization.
with Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli. Submitted.
There is a stark disparity between the learning rate schedules used
in the practice of large scale machine learning and what are considered
admissible learning rate schedules prescribed in the theory of
stochastic approximation. Recent results, such as in the
'superconvergence' methods which use oscillating learning rates, serve
to emphasize this point even more. One plausible explanation is that
nonconvex neural network training procedures are better suited to the
use of fundamentally different learning rate schedules, such as
the ``cut the learning rate every constant number of epochs'' method
(which more closely resembles an exponentially decaying learning rate
schedule); note that this widely used schedule is in stark contrast to
the polynomial decay schemes prescribed in the stochastic approximation
literature, which are indeed shown to be (worst case) optimal for
classes of convex optimization problems.
The main contribution
of this work shows that the picture is far more nuanced, where we do
not even need to move to nonconvex optimization to show other learning
rate schemes can be far more effective. In fact, even for the simple
case of stochastic linear regression with a fixed time horizon, the
rate achieved by any polynomial decay scheme is suboptimal compared to
the statistical minimax rate (by a factor of condition number); in
contrast the ```''cut the learning rate every constant number of
epochs'' provides an exponential improvement (depending only
logarithmically on the condition number) compared to any polynomial
decay scheme. Finally, it is important to ask if our theoretical
insights are somehow fundamentally tied to quadratic loss minimization
(where we have circumvented minimax lower bounds for more general
convex optimization problems)? Here, we conjecture that recent results
which make the gradient norm small at a near optimal rate, for both
convex and nonconvex optimization, may also provide more insights into
learning rate schedules used in practice.

Learning Twolayer Neural
Networks with Symmetric Inputs.
with Rohith Kuditipudi, Xiang Wang and Zhize Li. In ICLR
2019.
We give a new algorithm for learning a twolayer
neural network under a general class of input distributions. Assuming
there is a groundtruth twolayer network
y=Aσ(Wx)+ξ,
where A,W are weight matrices, ξ represents noise, and the number of
neurons in the hidden layer is no larger than the input or output, our
algorithm is guaranteed to recover the parameters A,W of the
groundtruth network. The only requirement on the input x is that it is
symmetric, which still allows highly complicated and structured input.
Our algorithm is based on the methodofmoments framework and extends
several results in tensor decompositions. We use spectral algorithms to
avoid the complicated nonconvex optimization in learning neural
networks. Experiments show that our algorithm can robustly learn the
groundtruth neural network with a small number of samples for many
symmetric input distributions.

Understanding Composition of
Word Embeddings via
Tensor Decomposition.
with Abraham Frandsen. In ICLR
2019.
Word embedding is a powerful tool in natural
language processing. In this paper we consider the problem of word
embedding composition  given vector representations of two
words, compute a vector for the entire phrase. We give a generative
model that can capture specific syntactic relations between words.
Under our model, we prove that the correlations between three words
(measured by their PMI) form a tensor that has an approximate low rank
Tucker decomposition. The result of the Tucker decomposition gives the
word embeddings as well as a core tensor, which can be used to produce
better compositions of the word embeddings. We also complement our
theoretical results with experiments that verify our assumptions, and
demonstrate the effectiveness of the new composition method.

HighDimensional Robust Mean
Estimation
in NearlyLinear Time.
with Yu Cheng and Ilias Diakonikolas. In SODA
2019.
We study the fundamental problem of
highdimensional mean estimation in a robust model where a constant
fraction of the samples are adversarially corrupted. Recent work gave
the first polynomial time algorithms for this problem with
dimensionindependent error guarantees for several families of
structured distributions. In this work, we give the first nearlylinear
time algorithms for highdimensional robust mean estimation.
Specifically, we focus on distributions with (i) known covariance and
subgaussian tails, and (ii) unknown bounded covariance. Given N
samples on R^d, an εfraction of which may be arbitrarily corrupted,
our algorithms run in time O~(Nd)/poly(ε) and approximate the true mean
within the informationtheoretically optimal error, up to constant
factors. Previous robust algorithms with comparable error guarantees
have running times Ω~(Nd2), for ε=Ω(1). Our algorithms rely on a
natural family of SDPs parameterized by our current guess ν for the
unknown mean μ⋆. We give a winwin analysis establishing the following:
either a nearoptimal solution to the primal SDP yields a good
candidate for μ⋆  independent of our current guess ν  or the dual
SDP yields a new guess ν′ whose distance from μ⋆ is smaller by a
constant factor. We exploit the special structure of the corresponding
SDPs to show that they are approximately solvable in nearlylinear
time. Our approach is quite general, and we believe it can also be
applied to obtain nearlylinear time algorithms for other
highdimensional robust learning problems.

Stronger generalization
bounds for deep nets via a compression approach.
with Sanjeev Arora, Behnam Neyshabur and Yi Zhang. In ICML
2018.
Deep nets generalize well despite having more
parameters than the number of training samples. Recent works try to
give an explanation using PACBayes and Marginbased analyses, but do
not as yet result in sample complexity bounds better than naive
parameter counting. The current paper shows generalization bounds
that're orders of magnitude better in practice. These rely upon new
succinct reparametrizations of the trained net  a compression that
is explicit and efficient. These yield generalization bounds via a
simple compressionbased framework introduced here. Our results also
provide some theoretical justification for widespread empirical success
in compressing deep nets. Analysis of correctness of our compression
relies upon some newly identified \textquotedblleft noise
stability\textquotedblright properties of trained deep nets, which are
also experimentally verified. The study of these properties and
resulting generalization bounds are also extended to convolutional
nets, which had eluded earlier attempts on proving generalization.

Minimizing Nonconvex Population
Risk from Rough Empirical Risk.
with Chi Jin, Lydia T. Liu, Michael I. Jordan. In NeurIPS 2018.
Population riskthe expectation of the loss
over the sampling mechanismis always of primary interest in machine
learning. However, learning algorithms only have access to empirical
risk, which is the average loss over training examples. Although the
two risks are typically guaranteed to be pointwise close, for
applications with nonconvex nonsmooth losses (such as modern deep
networks), the effects of sampling can transform a wellbehaved
population risk into an empirical risk with a landscape that is
problematic for optimization. The empirical risk can be nonsmooth, and
it may have many additional local minima. This paper considers a
general optimization framework which aims to find approximate local
minima of a smooth nonconvex function F (population risk) given only
access to the function value of another function f (empirical risk),
which is pointwise close to F. We propose a simple algorithm based on
stochastic gradient descent (SGD) on a smoothed version of f which is
guaranteed to find a secondorder stationary point if f and F
are close enough, thus escaping all saddle points of F and all the
additional local minima introduced by f. We also provide an almost
matching lower bound showing that our SGDbased approach achieves the
optimal tradeoff between the relevant parameters, as well as the
optimal dependence on problem dimension d, among all algorithms making
a polynomial number of queries. As a concrete example, we show that our
results can be directly used to give sample complexities for learning a
ReLU unit, whose empirical risk is nonsmooth.

NonConvex Matrix Completion
Against a SemiRandom Adversary.
with Yu Cheng. In COLT 2018.
Matrix completion is a wellstudied problem with
many machine learning applications. In practice, the problem is often
solved by nonconvex optimization algorithms. However, the current
theoretical analysis for nonconvex algorithms relies heavily on the
assumption that every entry is observed with exactly the same
probability p, which is not realistic in practice.
In this paper, we investigate a more realistic semirandom model, where
the probability of observing each entry is at least p. Even with this
mild semirandom perturbation, we can construct counterexamples where
existing nonconvex algorithms get stuck in bad local optima.
In light of the negative results, we propose a preprocessing step that
tries to reweight the semirandom input, so that it becomes "similar"
to a random input. We give a nearlylinear time algorithm for this
problem, and show that after our preprocessing, all the local minima
of the nonconvex objective can be used to approximately recover the
underlying groundtruth matrix.

Beyond Logconcavity: Provable
Guarantees
for Sampling Multimodal Distributions using Simulated Tempering
Langevin Monte
Carlo
with Holden Lee and Andrej Risteski. NIPS 2017 Bayesian
Inference Workshop. In NeurIPS 2018.
A key task in Bayesian statistics is sampling from
distributions that
are only specified up to a partition function (i.e., constant of
proportionality). However, without any assumptions, sampling (even
approximately) can be #Phard, and few works have provided "beyond
worstcase" guarantees for such settings. For logconcave
distributions, classical results going back to Bakry and \'Emery (1985)
show that natural continuoustime Markov chains called Langevin
diffusions mix in polynomial time. The most salient feature of
logconcavity violated in practice is unimodality: commonly, the
distributions we wish to sample from are multimodal. In the presence
of multiple deep and wellseparated modes, Langevin diffusion suffers
from torpid mixing. We address this problem by combining Langevin
diffusion with simulated tempering. The result is a Markov chain that
mixes more rapidly by transitioning between different temperatures of
the distribution. We analyze this Markov chain for the canonical
multimodal distribution: a mixture of gaussians (of equal variance).
The algorithm based on our Markov chain provably samples from
distributions that are close to mixtures of gaussians, given access to
the gradient of the logpdf. For the analysis, we use a spectral
decomposition theorem for graphs (Gharan and Trevisan, 2014) and a
Markov chain decomposition technique (Madras and Randall, 2002).

Global Convergence of
Policy Gradient Methods for Linearized Control Problems
with Maryam Fazel, Sham M. Kakade and Mehran Mesbahi. In ICML
2018.
Direct policy gradient methods for reinforcement
learning and continuous control problems are a popular approach for a
variety of reasons: 1) they are easy to implement without explicit
knowledge of the underlying model 2) they are an "endtoend" approach,
directly optimizing the performance metric of interest 3) they
inherently allow for richly parameterized policies. A notable drawback
is that even in the most basic continuous control problem (that of
linear quadratic regulators), these methods must solve a nonconvex
optimization problem, where little is understood about their efficiency
from both computational and statistical perspectives. In contrast,
system identification and model based planning in optimal control
theory have a much more solid theoretical footing, where much is known
with regards to their computational and statistical properties. This
work bridges this gap showing that (model free) policy gradient methods
globally converge to the optimal solution and are efficient
(polynomially so in relevant problem dependent quantities) with regards
to their sample and computational complexities.

Learning Onehiddenlayer Neural
Networks
with Landscape Design.
with Tengyu Ma and Jason D. Lee. To appear in ICLR
2018.
We consider the problem of learning a
onehiddenlayer neural network: we assume the input x is from Gaussian
distribution and the label y = a*\sigma(Bx), where a is a nonnegative
vector in m dimensions, B is a fullrank weight matrix. We first give
an analytic formula for the population risk of the standard squared
loss and demonstrate that it implicitly attempts to decompose a
sequence of lowrank tensors simultaneously. Inspired by the formula,
we design a nonconvex objective function G whose landscape is
guaranteed to have the following properties:
1. All local minima of G are also global minima.
2. All global minima of G correspond to the ground truth parameters.
3. The value and gradient of G can be estimated using samples.
With these properties, stochastic gradient descent on G provably
converges to the global minimum and learn the groundtruth parameters.
We also prove finite sample complexity result and validate the results
by simulations.

Generalization and Equilibrium
in Generative Adversarial Nets (GANs)
with Sanjeev Arora, Yingyu Liang, Tengyu Ma and Yi Zhang. In ICML
2017.
This paper makes progress on several open
theoretical
issues related to Generative Adversarial Networks. A definition is
provided for what it means for the training to generalize, and it is
shown that generalization is not guaranteed for the popular distances
between distributions such as JensenShannon or Wasserstein. We
introduce a new metric called neural net distance for which
generalization does occur. We also show that an approximate pure
equilibrium in the 2player game exists for a natural training
objective (Wasserstein). Showing such a result has been an open problem
(for any training objective). Finally, the above theoretical ideas lead
us to propose a new training protocol, MIX+GAN, which can be combined
with any existing method. We present experiments showing that it
stabilizes and improves some existing methods.

On the ability of neural nets to
express distributions
with Holden Lee, Andrej Risteski, Tengyu Ma and Sanjeev
Arora. In COLT
2017.
Deep neural nets have caused a revolution in many
classification tasks. A related ongoing revolutionalso theoretically
not understoodconcerns their ability to serve as generative models
for complicated types of data such as images and texts. These models
are trained using ideas like variational autoencoders and Generative
Adversarial Networks.
We take a first cut at explaining the
expressivity of multilayer nets by giving a sufficient criterion for a
function to be approximable by a neural network with n hidden layers. A
key ingredient is Barron's Theorem [Barron1993], which gives a
Fourier criterion for approximability of a function by a neural network
with 1 hidden layer. We show that a composition of n functions which
satisfy certain Fourier conditions ("Barron functions") can be
approximated by a n+1layer neural network. For probability
distributions, this translates into a criterion for a probability
distribution to be approximable in Wasserstein distancea natural
metric on probability distributionsby a neural network applied to a
fixed base distribution (e.g., multivariate gaussian).
Building up
recent lower bound work, we also give an example function that shows
that composition of Barron functions is more expressive than Barron
functions alone.

No Spurious Local Minima in
Nonconvex Low Rank Problems: A Unified Geometric Analysis
with Chi Jin, Yi Zheng. In ICML
2017.
In this paper we develop a new framework that
captures
the common landscape underlying the common nonconvex lowrank matrix
problems including matrix sensing, matrix completion and robust PCA. In
particular, we show for all above problems (including asymmetric
cases): 1) all local minima are also globally optimal; 2) no highorder
saddle points exists. These results explain why simple algorithms such
as stochastic gradient descent have global converge, and efficiently
optimize these nonconvex objective functions in practice. Our
framework connects and simplifies the existing analyses on optimization
landscapes for matrix sensing and symmetric matrix completion. The
framework naturally leads to new results for asymmetric matrix
completion and robust PCA.

How to Escape Saddle Points
Efficiently
with Chi Jin, Praneeth Netrapalli, Sham M. Kakade, Michael I. Jordan.
In ICML
2017.
This paper shows that a perturbed form of
gradient
descent converges to a secondorder stationary point in a number
iterations which depends only polylogarithmically on dimension (i.e.,
it is almost "dimensionfree"). The convergence rate of this procedure
matches the wellknown convergence rate of gradient descent to
firstorder stationary points, up to log factors. When all saddle
points are nondegenerate, all secondorder stationary points are local
minima, and our result thus shows that perturbed gradient descent can
escape saddle points almost for free. Our results can be directly
applied to many machine learning applications, including deep learning.
As a particular concrete example of such an application, we show that
our results can be used directly to establish sharp global convergence
rates for matrix factorization. Our results rely on a novel
characterization of the geometry around saddle points, which may be of
independent interest to the nonconvex optimization community.

Online Service with Delay
with Yossi Azar, Arun Ganesh and Debmalya Panigrahi. In
STOC 2017.
In this paper, we introduce the online
service with delay problem.
In this problem, there are $n$ points in a metric space that issue
service requests over time, and a server that serves these requests.
The goal is to minimize the
sum of distance traveled by the server and the total delay in serving
the requests.
This problem models the fundamental
tradeoff between batching requests to improve locality and reducing
delay to improve response time, that has many applications
in operations management, operating systems, logistics, supply chain
management, and scheduling.
Our main result is to show a polylogarithmic competitive ratio for the
online service with delay
problem. This result is obtained by an algorithm that we call the preemptive
service algorithm. The salient feature of this algorithm
is a process called preemptive service, which uses a novel combination
of (recursive) time forwarding and spatial exploration on a metric
space. We hope this
technique will be useful for related problems such as reordering buffer
management, online TSP, vehicle routing, etc. We also generalize our
results to k >1 servers.

Nonconvex optimization with local search
heuristics has been widely used in machine
learning, achieving many stateofart results. It becomes increasingly
important
to understand why they can work for these NPhard problems on typical
data. The
landscape of many objective functions in learning have been conjectured
to have the
geometric property that “all local optima are (approximately) global
optima”, and thus
they can be solved efficiently by local search algorithms. However,
establishing such
property can be very difficult.
In this paper we analyze the optimization landscape of the random
overcomplete
tensor decomposition problem, which has many applications in
unsupervised leaning,
especially in learning latent variable models. In practice, it can be
efficiently solved by
gradient ascent on a nonconvex objective. We show that for any small
constant c > 0,
among the set of points with function values (1 + c)factor larger than
the expectation
of the function, all the local maxima are approximate global maxima.
Previously, the
best known result only characterizes the geometry in small
neighborhoods around the
true components. Our result implies that even with an initialization
that is barely
better than the random guess, the gradient ascent algorithm is
guaranteed to solve this
problem

Provable Learning of Noisyor
Networks
with Sanjeev Arora, Tengyu Ma and Andrej Risteski. In
STOC 2017.
Many machine learning applications use latent
variable models to explain structure in data, whereby visible variables
(= coordinates of the given datapoint) are explained as a probabilistic
function of some hidden variables. Finding parameters with the maximum
likelihood is NPhard even in very simple settings. In recent years,
provably efficient algorithms were nevertheless developed for models
with linear structures: topic models, mixture models, hidden markov
models, etc. These algorithms use matrix or tensor decomposition, and
make some reasonable assumptions about the parameters of the underlying
model.
But matrix or tensor decomposition seems of little use when the latent
variable model has nonlinearities. The current paper shows how to make
progress: tensor decomposition is applied for learning the singlelayer
noisy or network, which is a textbook
example of a Bayes net, and
used for example in the classic QMRDT software for diagnosing which
disease(s) a patient may have by observing the symptoms he/she
exhibits.
The technical novelty here, which should be useful in other settings in
future, is analysis of tensor decomposition in presence of systematic
error (i.e., where the noise/error is correlated with the signal, and
doesn't decrease as number of samples goes to infinity). This requires
rethinking all steps of tensor decomposition methods from the ground
up.
For simplicity our analysis is stated assuming that the network
parameters were chosen from a probability distribution but the method
seems more generally applicable.

Homotopy Analysis for Tensor PCA
with Anima Anandkumar, Yuan Deng and Hossein Mobahi. To appear in COLT
2017.
Developing efficient and guaranteed nonconvex
algorithms has been an important challenge in modern machine learning.
Algorithms with good empirical performance such as stochastic gradient
descent often lack theoretical guarantees. In this paper, we analyze
the class of homotopy or continuation methods for global optimization
of nonconvex functions. These methods start from an objective function
that is efficient to optimize (e.g. convex), and progressively modify
it to obtain the required objective, and the solutions are passed along
the homotopy path. For the challenging problem of tensor PCA, we prove
global convergence of the homotopy method in the "high noise" regime.
The signaltonoise requirement for our algorithm is tight in the sense
that it matches the recovery guarantee for the best degree4
sumofsquares algorithm. In addition, we prove a phase transition
along the homotopy path for tensor PCA. This allows to simplify the
homotopy method to a local search algorithm, viz., tensor power
iterations, with a specific initialization and a noise injection
procedure, while retaining the theoretical guarantees.

Matrix Completion has No
Spurious Local Minimum
with Jason D. Lee and Tengyu Ma. In NIPS
2016. Best
Student Paper Award.
Matrix completion is a basic machine learning
problem that has wide applications, especially in collaborative
filtering and recommender systems. Simple nonconvex optimization
algorithms are popular and effective in practice. Despite recent
progress in proving various nonconvex algorithms converge from a good
initial point, it remains unclear why random or arbitrary
initialization suffices in practice. We prove that the commonly used
nonconvex objective function for matrix completion has no spurious
local minima  all local minima must also be global. Therefore, many
popular optimization algorithms such as (stochastic) gradient descent
can provably solve matrix completion with arbitrary
initialization in polynomial time.

Provable Algorithms for
Inference in Topic Models
with Sanjeev Arora, Frederic Koehler, Tengyu Ma and Ankur Moitra. In ICML
2016.
Recently, there has been considerable progress on
designing algorithms with provable guarantees  typically using linear
algebraic methods  for parameter learning in latent variable models.
But designing provable algorithms for inference has proven to be more
challenging. Here we take a first step towards provable inference in
topic models. We leverage a property of topic models that enables us to
construct simple linear estimators for the unknown topic proportions
that have small variance, and consequently can work with short
documents. Our estimators also correspond to finding an estimate around
which the posterior is wellconcentrated. We show lower bounds that for
shorter documents it can be information theoretically impossible to
find the hidden topics. Finally, we give empirical results that
demonstrate that our algorithm works on realistic topic models. It
yields good solutions on synthetic data and runs in time comparable to
a single iteration of Gibbs sampling.

Efficient
approaches for escaping higher order saddle points in nonconvex
optimization
with Anima Anandkumar. In COLT
2016.
Local search heuristics for nonconvex
optimizations are popular in applied machine learning. However, in
general it is hard to guarantee that such algorithms even converge to a
local minimum, due to the existence of complicated saddle point
structures in high dimensions. Many functions have degenerate saddle
points such that the first and second order derivatives cannot
distinguish them with local optima. In this paper we use higher order
derivatives to escape these saddle points: we design the first
efficient algorithm guaranteed to converge to a third order local
optimum (while existing techniques are at most second order). We also
show that it is NPhard to extend this further to finding fourth order
local optima.

Efficient
Algorithms for Largescale Generalized Eigenvector Computation and
Canonical Correlation Analysis
with Chi Jin, Sham M. Kakade, Praneeth Netrapalli, Aaron Sidford. In ICML
2016.
This paper considers the problem of
canonicalcorrelation analysis (CCA) (Hotelling, 1936) and, more
broadly, the generalized eigenvector problem for a pair of symmetric
matrices. These are two fundamental problems in data analysis and
scientific computing with numerous applications in machine learning and
statistics (Shi and Malik, 2000; Hardoon et al., 2004; Witten et al.,
2009).
We provide simple iterative algorithms, with
improved runtimes, for solving these problems that are globally
linearly convergent with moderate dependencies on the condition numbers
and eigenvalue gaps of the matrices involved.
We obtain our results by reducing CCA to the
topk generalized eigenvector problem. We solve this problem through a
general framework that simply requires black box access to an
approximate linear system solver. Instantiating this framework with
accelerated gradient descent we obtain a near linear running time.
Our algorithm is linear in the input size
and the number of components k up to a log(k) factor. This is essential
for handling largescale matrices that appear in practice. To the best
of our knowledge this is the first such algorithm with global linear
convergence. We hope that our results prompt further research and
ultimately improve the practical running time for performing these
important data analysis procedures on large data sets.

Rich
Component Analysis
with
James Zou. In ICML
2016.
In many settings, we have multiple data sets
(also called views) that capture different and overlapping aspects of
the same phenomenon. We are often interested in finding patterns that
are unique to one or to a subset of the views. For example, we might
have one set of molecular observations and one set of physiological
observations on the same group of individuals, and we want to quantify
molecular patterns that are uncorrelated with physiology. Despite being
a common problem, this is highly challenging when the correlations come
from complex distributions. In this paper, we develop the general
framework of Rich Component Analysis (RCA) to model settings where the
observations from different views are driven by different sets of
latent components, and each component can be a complex,
highdimensional distribution. We introduce algorithms based on
cumulant extraction that provably learn each of the components without
having to model the other components. We show how to integrate RCA with
stochastic gradient descent into a metaalgorithm for learning general
models, and demonstrate substantial improvement in accuracy on several
synthetic and real datasets in both supervised and unsupervised tasks.
Our method makes it possible to learn latent variable models when we
don't have samples from the true model but only samples after complex
perturbations.

Decomposing
Overcomplete 3rd Order Tensors using SumofSquares Algorithms
with Tengyu Ma. In RANDOM 2015.
Tensor rank and lowrank tensor
decompositions have many applications
in learning and complexity theory. Most known algorithms use unfoldings
of tensors and can only handle rank up to n^{p/2} for a pth order
tensor. Previously no efficient algorithm can decompose 3rd order
tensors when the rank is superlinear in the dimension. Using ideas
from sumofsquares hierarchy, we give the first quasipolynomial time
algorithm that can decompose a random 3rd order tensor decomposition
when the rank is as large as n^{3/2}/polylog n.
We also give a polynomial time algorithm for
certifying the injective
norm of random low rank tensors. Our tensor decomposition algorithm
exploits the relationship between injective norm and the tensor
components. The proof relies on interesting tools for decoupling random
variables to prove better matrix concentration bounds, which can be
useful in other settings.

Intersecting
Faces: Nonnegative Matrix Factorization With New Guarantees
with
James Zou. In ICML
2015.
Nonnegative
matrix factorization (NMF) is a natural model of admixture and is
widely used in science and engineering. A plethora of algorithms have
been developed to tackle NMF, but due to the nonconvex nature of the
problem, there is little guarantee on how well these methods work.
Recently a surge of research have focused on a very restricted class of
NMFs, called separable NMF, where provably correct algorithms have been
developed. In this paper, we propose the notion of subsetseparable
NMF, which substantially generalizes the property of separability. We
show that subsetseparability is a natural necessary condition for the
factorization to be unique or to have minimum volume. We developed the
FaceIntersect algorithm which provably and efficiently solves
subsetseparable NMF under natural conditions, and we prove that our
algorithm is robust to small noise. We explored the performance of
FaceIntersect on simulations and discuss settings where it empirically
outperformed the stateofart methods. Our work is a step towards
finding provably correct algorithms that solve large classes of NMF
problems.

Unregularizing:
approximate proximal point and faster stochastic algorithms for
empirical risk minimization
with Roy Frostig, Sham M. Kakade and Aaron Sidford. In ICML 2015.
We develop a family of accelerated
stochastic algorithms that minimize
sums of convex functions. Our algorithms improve upon the fastest
running time for empirical risk minimization (ERM), and in particular
linear leastsquares regression, across a wide range of problem
settings. To achieve this, we establish a framework based on the
classical proximal point algorithm. Namely, we provide several
algorithms that reduce the minimization of a strongly convex function
to approximate minimizations of regularizations of the function. Using
these results, we accelerate recent fast stochastic algorithms in a
blackbox fashion. Empirically, we demonstrate that the resulting
algorithms exhibit notions of stability that are advantageous in
practice. Both in theory and in practice, the provided algorithms reap
the computational benefits of adding a large strongly convex
regularization term, without incurring a corresponding bias to the
original problem.

Escaping
From Saddle Points  Online Stochastic Gradient for Tensor
Decomposition
with Furong Huang, Chi Jin and Yang Yuan. In COLT
2015.
We analyze stochastic gradient descent for
optimizing nonconvex
functions. In many cases for nonconvex functions the goal is to find a
reasonable local minimum, and the main concern is that gradient updates
are trapped in saddle points. In this paper we identify strict saddle
property for nonconvex problem that allows for efficient optimization.
Using this property we show that stochastic gradient descent converges
to a local minimum in a polynomial number of iterations. To the best of
our knowledge this is the first work that gives global convergence
guarantees for stochastic gradient descent on nonconvex functions with
exponentially many local minima and saddle points.
Our
analysis can be applied to orthogonal tensor decomposition, which is
widely used in learning a rich class of latent variable models. We
propose a new optimization formulation for the tensor decomposition
problem that has strict saddle property. As a result we get the first
online algorithm for orthogonal tensor decomposition with global
convergence guarantee.

Learning
Mixtures of Gaussians in High Dimensions
with Qingqing Huang and Sham M. Kakade. In STOC
2015.
Efficiently learning mixture of Gaussians is
a fundamental problem in
statistics and learning theory. Given samples coming from a random one
out of k Gaussian distributions in R^n, the learning problem asks to
estimate the means and the covariance matrices of these Gaussians. This
learning problem arises in many areas ranging from the natural sciences
to the social sciences, and has also found many machine learning
applications.
Unfortunately, learning mixture of
Gaussians is an
information theoretically hard problem: in order to learn the
parameters up to a reasonable accuracy, the number of samples required
is exponential in the number of Gaussian components in the worst case.
In this work, we show that provided we are in high enough dimensions,
the class of Gaussian mixtures is learnable in its most general form
under a smoothed analysis framework, where the parameters are randomly
perturbed from an adversarial starting point. In particular, given
samples from a mixture of Gaussians with randomly perturbed parameters,
when n > {\Omega}(k^2), we give an algorithm that learns the
parameters with polynomial running time and using polynomial number of
samples.
The central algorithmic ideas consist of
new ways
to decompose the moment tensor of the Gaussian mixture by exploiting
its structural properties. The symmetries of this tensor are derived
from the combinatorial structure of higher order moments of Gaussian
distributions (sometimes referred to as Isserlis' theorem or Wick's
theorem). We also develop new tools for bounding smallest singular
values of structured random matrices, which could be useful in other
smoothed analysis settings.

Simple,
Efficient, and Neural Algorithms for Sparse Coding
with Sanjeev Arora, Tengyu Ma and Ankur Moitra. In COLT
2015.
Sparse
coding is a basic task in many fields including signal
processing, neuroscience and machine learning where the goal is to
learn a basis that enables a sparse representation of a given set of
data, if one exists. Its standard formulation is as a nonconvex
optimization problem which is solved in practice by heuristics based on
alternating minimization. Re cent work has resulted in several
algorithms for sparse coding with provable guarantees, but somewhat
surprisingly these are outperformed by the simple alternating
minimization heuristics. Here we give a general framework for
understanding alternating minimization which we leverage to analyze
existing heuristics and to design new ones also with provable
guarantees. Some of these algorithms seem implementable on simple
neural architectures, which was the original motivation of Olshausen
and Field (1997a) in introducing sparse coding. We also give the first
efficient algorithm for sparse coding that works almost up to the
information theoretic limit for sparse recovery on incoherent
dictionaries. All previous algorithms that approached or surpassed this
limit run in time exponential in some natural parameter. Finally, our
algorithms improve upon the sample complexity of existing approaches.
We believe that our analysis framework will have applications in other
settings where simple iterative algorithms are used.

Competing
with the Empirical Risk Minimizer in a Single Pass
with Roy Frostig, Sham M. Kakade and Aaron Sidford. In COLT
2015.
Many optimization problems that arise in
science and engineering are
those in which we only have a stochastic approximation to the
underlying objective (e.g. estimation problems such as linear
regression). That is, given some distribution D over functions f(x), we
wish to minimize P(x)=E[f(x)], using as few samples from D as possible.
In the absence of computational constraints, the empirical risk
minimizer (ERM)  the minimizer on a sample average of observed data
 is widely regarded as the estimation strategy of choice due to its
desirable statistical convergence properties. Our goal is to do as well
as the empirical risk minimizer on every problem while minimizing the
use of computational resources such as running time and space usage.
This work provides a simple streaming
algorithm for solving this
problem, with performance guarantees competitive with that of the ERM.
In particular, under standard regularity assumptions on D, our
algorithm enjoys the following properties:
 The algorithm can be implemented in
linear time with a single pass
of the observed data, using space linear in the size of a single
sample.
 The algorithm achieves the same
statistical rate of
convergence as the empirical risk minimizer on every problem D (even
considering constant factors).
 The algorithm's performance depends on
the initial error at a rate that decreases superpolynomially.
Moreover, we quantify the rate of
convergence of both our algorithm
and that of the ERM, showing that, after a number of samples that
depend only on the regularity parameters, the streaming algorithm
rapidly approaches the leading order rate of ERM in expectation.

Understanding
Overcomplete 3rd Order Tensors
with Anima Anandkumar and Majid Janzamin. Short version in COLT
2015.
This
is a series of ongoing works where we try to understand how to
decompose overcomplete thirdorder tensors. In general, decomposing
third order tensors can be useful in learning many latent variable
models (multiview model, mixture of Gaussians, etc.) Previous
algorithms usually can either only handle the undercomplete
regime where the number of components is smaller than the number of of
dimensions, or require access of higher order tensors which often
results in higher sample complexity and running time. Surprisingly,
many heuristics (such as Tensor Power Method or Alternating Least
Squares) work very well in practice. In these works we try to
understand when is decomposing thirdorder tensors tractable. Several
parts are available on arxiv.
The first part [Arxiv]
tries to analyze Tensor Power Method. In this paper we show if there is
a good initialization (that is constantclose to some component),
Tensor Power Method will converge to vectors that are very close to the
true component. We also give an algorithm that converges to the exact
true component given the (already reasonably close) results of the
tensor power method. Good initializations can be found if there are
some supervised information, or by spectral method if ther number of
component k is only a constant factor larger than the dimension d.
The second part [Arxiv]
tries to understand how many samples do we need in order to estimate
the moment tensor accurately enough for several models.
We show surprisingly in many natural cases the sample complexity is
nearly linear in the number of components. In general the bounds in
this paper is tighter than many previous works.
The third part (more independent from the first two) [Arxiv]
tries to understand how well does Tensor Power Method perform without
really good initialization.
We use techniques from Approximate Message Passing to analyze the
dynamics of the tensor power method, and show that the initialization
does not need to be very close to the true component for tensor power
method to perform well.
This partly explains why tensor power method still works in the mildly
overcomplete settings.
See also Majid Janzamin's page about this project for more
information.

Provable
Bounds for Learning Some Deep Representations
with Sanjeev Arora, Aditya Bhaskara and Tengyu Ma. In ICML 2014.
We
give algorithms with provable guarantees that learn a class of deep
nets in the generative model view popularized by Hinton and others. Our
generative model is an n node multilayer neural net that has degree at
most n^{\gamma} for some \gamma < 1 and each edge has a random
edge
weight in [1,1]. Our algorithm learns almost all
networks in
this class with polynomial running time. The sample complexity is
quadratic or cubic depending upon the details of the model. The
algorithm uses layerwise learning. It is based upon a novel idea of
observing correlations among features and using these to infer the
underlying edge structure via a global graph recovery procedure. The
analysis of the algorithm reveals interesting structure of neural
networks with random edge weights.

New
Algorithms for Learning Incoherent and Overcomplete Dictionaries
with Sanjeev Arora and Ankur Moitra. In COLT 2014.
A
m*n matrix A is said to be muincoherent if each pair of columns has
inner product at most mu/\sqrt{n}. Starting with the pioneering work of
Donoho and Huo such matrices (often called dictionaries)
have played a central role in signal processing, statistics and machine
learning. They allow sparse recovery:
there are efficient algorithms for representing a given vector as a
sparse linear combination of the columns of A (if such a combination
exists). However, in many applications ranging from sparse
coding
in machine learning to image denoising, the dictionary is unknown and
has to be learned from random examples of the form Y=AX where X is
drawn from an appropriate distribution  this is the dictionary
learning
problem. Existing proposed solutions such as the Method of Optimal
Directions (MOD) or KSVD do not provide any guarantees on their
performance nor do they necessarily learn a dictionary for which one
can solve sparse recovery problems. The only exception is the recent
work of Spielman, Wang and Wright which gives a polynomial time
algorithm for dictionary learning when A has full column rank
(in particular m is at most n). However, in most settings of interest,
dictionaries need to be {\em overcomplete} (i.e., m is larger than n).
Here we give the first polynomial time
algorithm for dictionary
learning when A is overcomplete. It succeeds under natural conditions
on how X is generated, provided that X has at most
k=Cmin(\sqrt{n}/mu
logn,m^{1/2eps})
nonzero entries (for any eps>0). In
other words it can handle
almost as many nonzeros as the best sparse recovery algorithms could
tolerate even if one knew the dictionary A exactly.

Towards
a better approximation for sparsest cut?
with Sanjeev Arora and Ali Kemal Sinop. In FOCS
2013
We
give a new (1+eps)approximation for sparsest cut problem on graphs
where small sets expand significantly more than the sparsest cut (sets
of size n/r expand by a factor \sqrt{\log n\log r} bigger, for some
small r; this condition holds for many natural graph families). We give
two different algorithms. One involves GuruswamiSinop rounding on the
levelr Lasserre relaxation. The other is combinatorial and involves a
new notion called Small Set Expander Flows
(inspired by the expander flows
of ARV) which we show exists in the input graph. Both algorithms run in
time 2^{O(r)} poly(n). We also show similar approximation algorithms in
graphs with genus g with an analogous local expansion condition. This
is the first algorithm we know of that achieves (1+eps)approximation
on such general family of graphs.

A
Tensor Approach to Learning Mixed Membership Community Models
with Anima Anandkumar, Daniel Hsu and Sham M. Kakade. In COLT 2013. Also JMLR Vol 15.
Modeling
community formation and detecting hidden communities in networks is a
well studied problem. However, theoretical analysis of community
detection has been mostly limited to models with nonoverlapping
communities such as the stochastic block model. In this paper, we
remove this restriction, and consider a family of probabilistic network
models with overlapping communities, termed as the mixed membership
Dirichlet model, first introduced in Airoldi et al. This model allows
for nodes to have fractional memberships in multiple communities and
assumes that the community memberships are drawn from a Dirichlet
distribution. We propose a unified approach to learning these models
via a tensor spectral decomposition method. Our estimator is based on
loworder moment tensor of the observed network, consisting of 3star
counts. Our learning method is fast and is based on simple linear
algebra operations, e.g. singular value decomposition and tensor power
iterations. We provide guaranteed recovery of community memberships and
model parameters and present a careful finite sample analysis of our
learning method. Additionally, our results match the best known scaling
requirements in the special case of the stochastic block model.

Topic
models provide a useful method for dimensionality reduction and
exploratory data analysis in large text corpora. Most approaches to
topic model inference have been based on a maximum likelihood
objective. Efficient algorithms exist that approximate this objective,
but they have no provable guarantees. Recently, algorithms have been
introduced that provide provable bounds, but these algorithms are not
practical because they are inefficient and not robust to violations of
model assumptions. In this paper we present an algorithm for topic
model inference that is both provable and practical. The algorithm
produces results comparable to the best MCMC implementations while
running orders of magnitude faster.

Tensor
decompositions for learning latent variable models
with Anima Anandkumar, Daniel Hsu, Sham M. Kakade, Matus Telgarsky. In JMLR Vol 15.
This
work considers a computationally and statistically efficient parameter
estimation method
for a wide class of latent variable modelsincluding Gaussian mixture
models, hidden Markov
models, and latent Dirichlet allocationwhich exploits a certain tensor
structure in their loworder
observable moments (typically, of second and thirdorder).
Specifically, parameter estimation
is reduced to the problem of extracting a certain (orthogonal)
decomposition of a symmetric
tensor derived from the moments; this decomposition can be viewed as a
natural generalization
of the singular value decomposition for matrices. Although tensor
decompositions are generally
intractable to compute, the decomposition of these specially structured
tensors can be efficiently
obtained by a variety of approaches, including power iterations and
maximization approaches
(similar to the case of matrices). A detailed analysis of a robust
tensor power method is provided,
establishing an analogue of Wedin's perturbation theorem for the
singular vectors of
matrices. This implies a robust and computationally tractable
estimation approach for several
popular latent variable models.

Provable
ICA with Unknown Gaussian Noise, and Implications for Gaussian Mixtures
and Autoencoders
with Sanjeev Arora, Ankur Moitra and Sushant Sachdeva. in NIPS 2012
We
present a new algorithm for Independent Component Analysis (ICA) which
has provable performance guarantees. In particular, suppose we are
given samples of the form y = Ax + \eta where A is an unknown n*n
matrix and x is a random variable whose components are independent and
have a fourth moment strictly less than that of a standard Gaussian
random variable, \eta is an ndimensional Gaussian random variable with
unknown covariance \Sigma: We give an algorithm that provable recovers
A and \Sigma up to an additive \epsilon whose running time and sample
complexity are polynomial in n and 1 / \epsilon. To accomplish this, we
introduce a novel "quasiwhitening" step that may be useful in other
contexts in which the covariance of Gaussian noise is not known in
advance. We also give a general framework for finding all local optima
of a function (given an oracle for approximately finding just one) and
this is a crucial step in our algorithm, one that has been overlooked
in previous attempts, and allows us to control the accumulation of
error when we find the columns of A one by one via local search.

Learning
Topic Models  Going beyond SVD
with Sanjeev Arora and Ankur Moitra. In FOCS
2012.
In
this work we revisit the topic modeling problem, which is used for
automatic comprehension and classification of data in a variety of
settings. A number of foundational works both in machine learning and
in theory have suggested a probabilistic model for documents, whereby
documents arise as a convex combination of (i.e. distribution on) a
small number of topic vectors, each topic vector being a distribution
on words (i.e. a vector of wordfrequencies).
Theoretical studies of topic modeling focus on learning the model's
parameters assuming the data is actually generated from it. Existing
approaches for the most part rely on Singular Value Decomposition(SVD),
and consequently have one of two limitations: these works need to
either assume that each document contains only one topic, or else can
only recover the span of the topic vectors instead of the topic vectors
themselves.
This paper formally justifies Nonnegative Matrix Factorization(NMF) as
a main tool in this context, which is an analog of SVD where all
vectors are nonnegative. Using this tool we give the first
polynomialtime algorithm for learning topic models without the above
two limitations. The algorithm uses a fairly mild assumption about the
underlying topic matrix called separability, which is usually found to
hold in reallife data.
We hope that this paper will motivate further theoretical results that
use NMF as a replacement for SVD  just as NMF has come to replace SVD
in many applications.

Computing
a Nonnegative Matrix Factorization  Provably
with Sanjeev Arora, Ravi Kannan and Ankur Moitra. In STOC 2012.
.
The nonnegative matrix factorization problem asks for factorizing an
n*m matrix M into AW where A is n*r and W is r*m, further the entries
of the two matrices A and W are required to be nonnegative. This
problem has a extremely broad range of applications, from machine
learning to communication complexity. We give algorithms for exact and
approximate versions of the problem when r is small (which is the case
for most applications) and a hardness result that shows our results
cannot be substancially improved.

Finding
Overlapping Communities in Social Networks: Toward a Rigorous Approach
with Sanjeev Arora, Sushant Sachdeva and Grant Schoenebeck, In
EC 2012.
A
community in a social network is usually understood to be a group of
nodes more densely connected with each other than with the rest of the
network. Many existing methods for finding them use heuristic
algorithms where the solution concept is defined in terms of an
NPcomplete problem such as CLIQUE or HIERARCHICAL CLUSTERING. We
introduce a more formal approach whereby we make rigorous definitions
that sidestep the pitfall of high computational complexity and allow
efficient algorithms for finding communities.


New
Tools for Graph Coloring
with Sanjeev Arora, In APPROX
2011
We
explore the possibility of better coloring algorithm using the new
concept threshold rank. We are able to analyze high level Lasserre
Relaxations for low threshold rank graphs and distance transitive
graphs, and our algorithm makes significantly better progress (towards
O(log n) coloring) in these cases compared to previous algorithms
(where the number of colors is polynomial). We also purpose a plausible
conjecture, that if true would yield good subexponential time
algorithm for graph coloring.

New
Algorithms for Learning in Presence of Errors
with Sanjeev Arora, In ICALP
2011
We
give new algorithms for a variety of randomlygenerated instances of
computational problems using a linearization technique that reduces to
solving a system of linear equations. Our techniques can be applied to
the low noise case of the wellknown LWE
(Learning With Errors) problem, giving a slightly
subexponential time algorithm which suggests known hardness results are
almost tight.

New
Results on Simple Stochastic Games
with Decheng Dai. In ISAAC
2009
We
give a new algorithm for Simple Stochastic Games when the number of
RANDOM vertices is small. We also show that coarse approximation is as
hard as fine approximation for Simple Stochastic Games.

