Deep/Representation
Learning |
 |
In representation learning,
we
have worked on algorithms for problems such as topic models, sparse
coding, social networks and noisy-or networks. For deep learning, I
have worked on topics related to optimization, generalization and GANs. |
Explaining
Landscape
Connectivity of Low-cost Solutions for Multilayer Nets.   
with Rohith Kuditipudi, Xiang Wang, Holden Lee, Yi Zhang, Zhiyuan Li,
Wei Hu and Sanjeev Arora. To appear in NeurIPS 2019.
Mode
connectivity is a surprising phenomenon in the loss landscape of deep
nets. Optima---at least those discovered by gradient-based
optimization---turn out to be connected by simple paths on which the
loss function is almost constant. Often, these paths can be chosen to
be piece-wise linear, with as few as two segments.
We give mathematical explanations for this phenomenon, assuming generic
properties (such as dropout stability and noise stability) of
well-trained deep nets, which have previously been identified as part
of understanding the generalization properties of deep nets. Our
explanation holds for realistic multilayer nets, and experiments are
presented to verify the theory.
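As an illustration of the phenomenon (not code from the paper), here is a minimal sketch that evaluates the training loss along a two-segment piecewise-linear path between two solutions; loss_fn, theta_a, theta_b and the midpoint theta_mid are hypothetical stand-ins for a trained model's loss and flattened parameter vectors.

```python
import numpy as np

def path_loss(theta_a, theta_b, theta_mid, loss_fn, n_points=20):
    """Evaluate loss_fn along the two-segment path theta_a -> theta_mid -> theta_b.

    theta_* are flattened parameter vectors; loss_fn maps a parameter vector
    to a scalar loss (both are hypothetical stand-ins in this sketch).
    """
    ts = np.linspace(0.0, 1.0, n_points)
    first = [loss_fn((1 - t) * theta_a + t * theta_mid) for t in ts]
    second = [loss_fn((1 - t) * theta_mid + t * theta_b) for t in ts]
    return np.array(first + second)

# Mode connectivity predicts that for two well-trained solutions theta_a, theta_b
# there exists a midpoint theta_mid for which max(path_loss(...)) is barely larger
# than the losses at the two endpoints.
```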
|
Learning
Two-layer Neural
Networks with Symmetric Inputs.  
with Rohith Kuditipudi, Xiang Wang and Zhize Li. In ICLR
2019.
We give a new algorithm for learning a
two-layer
neural network under a general class of input distributions. Assuming
there is a ground-truth two-layer network
y=Aσ(Wx)+ξ,
where A,W are weight matrices, ξ represents noise, and the number of
neurons in the hidden layer is no larger than the input or output, our
algorithm is guaranteed to recover the parameters A,W of the
ground-truth network. The only requirement on the input x is that its distribution is symmetric (x and -x are equally likely), which still allows highly complicated and structured inputs.
Our algorithm is based on the method-of-moments framework and extends
several results in tensor decompositions. We use spectral algorithms to
avoid the complicated non-convex optimization in learning neural
networks. Experiments show that our algorithm can robustly learn the
ground-truth neural network with a small number of samples for many
symmetric input distributions.
|
Understanding
Composition of
Word Embeddings via
Tensor Decomposition.  
with Abraham Frandsen. In ICLR
2019.
Word embedding is a powerful tool in natural
language processing. In this paper we consider the problem of word
embedding composition --- given vector representations of two
words, compute a vector for the entire phrase. We give a generative
model that can capture specific syntactic relations between words.
Under our model, we prove that the correlations between three words
(measured by their PMI) form a tensor that has an approximate low rank
Tucker decomposition. The result of the Tucker decomposition gives the
word embeddings as well as a core tensor, which can be used to produce
better compositions of the word embeddings. We also complement our
theoretical results with experiments that verify our assumptions, and
demonstrate the effectiveness of the new composition method.
|
Stronger
generalization
bounds for deep nets via a compression approach.   
with Sanjeev Arora, Behnam Neyshabur and Yi Zhang. In ICML
2018.
Deep nets generalize well despite having
more
parameters than the number of training samples. Recent works try to
give an explanation using PAC-Bayes and Margin-based analyses, but do
not as yet result in sample complexity bounds better than naive
parameter
counting. The current paper shows generalization bounds that are orders
of magnitude better in practice. These rely upon new succinct
reparametrizations of the trained net --- a compression that is
explicit and efficient. These yield generalization bounds via a simple
compression-based framework introduced here. Our results also provide
some theoretical justification for widespread empirical success in
compressing deep nets. Analysis of correctness of our compression
relies upon some newly identified "noise stability" properties of trained deep nets, which are
also experimentally verified. The study of these properties and
resulting generalization bounds are also extended to convolutional
nets, which had eluded earlier attempts on proving generalization.
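As a toy illustration of the compression idea (not the paper's actual scheme, which is a randomized projection tailored to the noise-stability properties), the sketch below replaces a weight matrix by a low-rank truncation and counts the parameters the compressed layer needs.

```python
import numpy as np

def compress_layer(W, rank):
    """Replace weight matrix W by its best rank-`rank` approximation (SVD truncation).

    Only an illustration of compression-based parameter counting; the paper's
    compression is a different, noise-stability-aware scheme.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    W_hat = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
    # Parameters needed to store the compressed layer vs. the original matrix.
    n_compressed = rank * (W.shape[0] + W.shape[1] + 1)
    return W_hat, n_compressed, W.size

W = np.random.randn(512, 512)
W_hat, n_comp, n_orig = compress_layer(W, rank=20)
print(n_comp, "vs", n_orig, "relative error",
      np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```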
|
Learning
One-hidden-layer Neural
Networks
with Landscape Design.  
with Tengyu Ma and Jason D. Lee. In ICLR
2018.
We consider the problem of learning a
one-hidden-layer neural network: we assume the input x is drawn from a Gaussian distribution and the label y = aᵀσ(Bx), where a is a nonnegative vector in m dimensions and B is a full-rank weight matrix. We first give
an analytic formula for the population risk of the standard squared
loss and demonstrate that it implicitly attempts to decompose a
sequence of low-rank tensors simultaneously. Inspired by the formula,
we design a non-convex objective function G whose landscape is
guaranteed to have the following properties:
1. All local minima of G are also global minima.
2. All global minima of G correspond to the ground truth parameters.
3. The value and gradient of G can be estimated using samples.
With these properties, stochastic gradient descent on G provably converges to the global minimum and learns the ground-truth parameters. We also prove a finite sample complexity result and validate the results
by simulations.
|
Generalization
and Equilibrium
in Generative Adversarial Nets (GANs)   
with Sanjeev Arora, Yingyu Liang, Tengyu Ma and Yi Zhang. In ICML
2017.
This paper makes progress on several open
theoretical
issues related to Generative Adversarial Networks. A definition is
provided for what it means for the training to generalize, and it is
shown that generalization is not guaranteed for the popular distances
between distributions such as Jensen-Shannon or Wasserstein. We
introduce a new metric called neural net distance for which
generalization does occur. We also show that an approximate pure
equilibrium in the 2-player game exists for a natural training
objective (Wasserstein). Showing such a result has been an open problem
(for any training objective). Finally, the above theoretical ideas lead
us to propose a new training protocol, MIX+GAN, which can be combined
with any existing method. We present experiments showing that it
stabilizes and improves some existing methods.
|
On
the ability of neural nets to
express distributions  
with Holden Lee, Andrej Risteski, Tengyu Ma and Sanjeev
Arora. In COLT
2017.
Deep neural nets have caused a revolution
in many
classification tasks. A related ongoing revolution---also theoretically
not understood---concerns their ability to serve as generative models
for complicated types of data such as images and texts. These models
are trained using ideas like variational autoencoders and Generative
Adversarial Networks.
We take a first cut at explaining the
expressivity of multilayer nets by giving a sufficient criterion for a
function to be approximable by a neural network with n hidden layers. A
key ingredient is Barron's Theorem [Barron1993], which gives a
Fourier criterion for approximability of a function by a neural network
with 1 hidden layer. We show that a composition of n functions which
satisfy certain Fourier conditions ("Barron functions") can be
approximated by an (n+1)-layer neural network. For probability
distributions, this translates into a criterion for a probability
distribution to be approximable in Wasserstein distance---a natural
metric on probability distributions---by a neural network applied to a
fixed base distribution (e.g., multivariate gaussian).
Building on recent lower bound work, we also give an example function that shows
that composition of Barron functions is more expressive than Barron
functions alone.
|
Provable
Learning of Noisy-or
Networks  
with Sanjeev Arora, Tengyu Ma and Andrej Risteski. In
STOC 2017.
Many machine learning applications use
latent
variable models to explain structure in data, whereby visible variables
(= coordinates of the given datapoint) are explained as a probabilistic
function of some hidden variables. Finding parameters with the maximum
likelihood is NP-hard even in very simple settings. In recent years,
provably efficient algorithms were nevertheless developed for models
with linear structures: topic models, mixture models, hidden Markov
models, etc. These algorithms use matrix or tensor decomposition, and
make some reasonable assumptions about the parameters of the underlying
model.
But matrix or tensor decomposition seems of little use when the latent
variable model has nonlinearities. The current paper shows how to make
progress: tensor decomposition is applied for learning the single-layer
noisy or network, which is a
textbook
example of a Bayes net, and
used for example in the classic QMR-DT software for diagnosing which
disease(s) a patient may have by observing the symptoms he/she
exhibits.
The technical novelty here, which should be useful in other settings in
future, is an analysis of tensor decomposition in the presence of systematic error (i.e., where the noise/error is correlated with the signal and does not decrease as the number of samples goes to infinity). This requires
rethinking all steps of tensor decomposition methods from the ground
up.
For simplicity our analysis is stated assuming that the network
parameters were chosen from a probability distribution but the method
seems more generally applicable.
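For concreteness, here is a small sketch of the generative model in question: a single-layer noisy-or network in which hidden binary "diseases" probabilistically trigger observed "symptoms". The prior vector and the matrix Q below are hypothetical parameters, not values from the paper.

```python
import numpy as np

def sample_noisy_or(prior, Q, n_samples, rng=np.random.default_rng(0)):
    """Sample symptom vectors from a single-layer noisy-or network.

    prior: probability each hidden disease d_j is on, shape (k,)
    Q:     Q[i, j] = probability that disease j alone triggers symptom i, shape (n, k)
    Returns an (n_samples, n) binary matrix of observed symptoms.
    """
    d = (rng.random((n_samples, prior.shape[0])) < prior).astype(float)  # hidden diseases
    # P(symptom i off | d) = prod_j (1 - Q[i, j])^{d_j}
    p_off = np.exp(d @ np.log1p(-Q).T)                                   # shape (n_samples, n)
    return (rng.random(p_off.shape) >= p_off).astype(int)

symptoms = sample_noisy_or(np.full(5, 0.2), np.random.rand(20, 5) * 0.5, 1000)
```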
|
Provable
Algorithms for
Inference in Topic Models  
with Sanjeev Arora, Frederic Koehler, Tengyu Ma and Ankur Moitra. In ICML
2016.
Recently, there has been considerable
progress on
designing algorithms with provable guarantees -- typically using linear
algebraic methods -- for parameter learning in latent variable models.
But designing provable algorithms for inference has proven to be more
challenging. Here we take a first step towards provable inference in
topic models. We leverage a property of topic models that enables us to
construct simple linear estimators for the unknown topic proportions
that have small variance, and consequently can work with short
documents. Our estimators also correspond to finding an estimate around
which the posterior is well-concentrated. We show lower bounds demonstrating that for shorter documents it can be information-theoretically impossible to
find the hidden topics. Finally, we give empirical results that
demonstrate that our algorithm works on realistic topic models. It
yields good solutions on synthetic data and runs in time comparable to
a single iteration of Gibbs sampling.
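To make the "simple linear estimator" idea concrete, here is a hedged sketch: it applies a fixed linear map (simply the pseudo-inverse of the topic-word matrix A here, which is not the specific low-variance estimator constructed in the paper) to a document's empirical word distribution.

```python
import numpy as np

def linear_topic_estimator(A, word_counts):
    """Estimate topic proportions for one document with a fixed linear map.

    A: (vocab_size, n_topics) column-stochastic topic-word matrix.
    word_counts: (vocab_size,) word counts for the document.
    The pseudo-inverse map below is only an illustration of the idea; the paper
    designs a specific estimator with provably small variance.
    """
    empirical = word_counts / max(word_counts.sum(), 1)
    h = np.linalg.pinv(A) @ empirical
    h = np.clip(h, 0.0, None)              # push the estimate back toward the simplex
    return h / max(h.sum(), 1e-12)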
|
Rich
Component Analysis  with
James Zou. In ICML 2016.
In many settings, we have multiple data sets
(also called views) that capture different and overlapping aspects of
the same phenomenon. We are often interested in finding patterns that
are unique to one or to a subset of the views. For example, we might
have one set of molecular observations and one set of physiological
observations on the same group of individuals, and we want to quantify
molecular patterns that are uncorrelated with physiology. Despite being
a common problem, this is highly challenging when the correlations come
from complex distributions. In this paper, we develop the general
framework of Rich Component Analysis (RCA) to model settings where the
observations from different views are driven by different sets of
latent components, and each component can be a complex,
high-dimensional distribution. We introduce algorithms based on
cumulant extraction that provably learn each of the components without
having to model the other components. We show how to integrate RCA with
stochastic gradient descent into a meta-algorithm for learning general
models, and demonstrate substantial improvement in accuracy on several
synthetic and real datasets in both supervised and unsupervised tasks.
Our method makes it possible to learn latent variable models when we
don't have samples from the true model but only samples after complex
perturbations.
|
Intersecting
Faces: Non-negative Matrix Factorization With New Guarantees   with
James Zou. In ICML 2015.
Non-negative
matrix factorization (NMF) is a natural model of admixture and is
widely used in science and engineering. A plethora of algorithms have
been developed to tackle NMF, but due to the non-convex nature of the
problem, there is little guarantee on how well these methods work.
Recently a surge of research has focused on a very restricted class of
NMFs, called separable NMF, where provably correct algorithms have been
developed. In this paper, we propose the notion of subset-separable
NMF, which substantially generalizes the property of separability. We
show that subset-separability is a natural necessary condition for the
factorization to be unique or to have minimum volume. We developed the
Face-Intersect algorithm which provably and efficiently solves
subset-separable NMF under natural conditions, and we prove that our
algorithm is robust to small noise. We explored the performance of
Face-Intersect on simulations and discuss settings where it empirically
outperformed the state-of-the-art methods. Our work is a step towards
finding provably correct algorithms that solve large classes of NMF
problems.
|
Simple,
Efficient, and Neural Algorithms for Sparse Coding 
with Sanjeev Arora, Tengyu Ma and Ankur Moitra. In COLT
2015.
Sparse
coding is a basic task in many fields including signal
processing, neuroscience and machine learning where the goal is to
learn a basis that enables a sparse representation of a given set of
data, if one exists. Its standard formulation is as a non-convex
optimization problem which is solved in practice by heuristics based on
alternating minimization. Recent work has resulted in several
algorithms for sparse coding with provable guarantees, but somewhat
surprisingly these are outperformed by the simple alternating
minimization heuristics. Here we give a general framework for
understanding alternating minimization which we leverage to analyze
existing heuristics and to design new ones also with provable
guarantees. Some of these algorithms seem implementable on simple
neural architectures, which was the original motivation of Olshausen
and Field (1997a) in introducing sparse coding. We also give the first
efficient algorithm for sparse coding that works almost up to the
information theoretic limit for sparse recovery on incoherent
dictionaries. All previous algorithms that approached or surpassed this
limit run in time exponential in some natural parameter. Finally, our
algorithms improve upon the sample complexity of existing approaches.
We believe that our analysis framework will have applications in other
settings where simple iterative algorithms are used.
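The alternating minimization template analyzed here roughly alternates a sparse decoding step with a dictionary update. The toy sketch below (a hard-thresholding decoder plus a gradient dictionary step) illustrates the template under made-up constants, but is not one of the paper's exact update rules.

```python
import numpy as np

def alt_min_sparse_coding(Y, k, n_atoms, n_iters=50, lr=0.1, rng=np.random.default_rng(0)):
    """Toy alternating minimization for Y ~= A X with k-sparse columns of X."""
    d, n = Y.shape
    A = rng.standard_normal((d, n_atoms))
    A /= np.linalg.norm(A, axis=0)
    for _ in range(n_iters):
        # Decoding step: keep the k largest correlations per sample (hard threshold).
        X = A.T @ Y
        thresh = -np.sort(-np.abs(X), axis=0)[k - 1]
        X[np.abs(X) < thresh] = 0.0
        # Dictionary step: gradient step on ||Y - A X||_F^2, then renormalize columns.
        A += lr * (Y - A @ X) @ X.T / n
        A /= np.linalg.norm(A, axis=0)
    return A
```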
|
Provable
Bounds for Learning Some Deep Representations 
with Sanjeev Arora, Aditya Bhaskara and Tengyu Ma. In ICML 2014.
We
give algorithms with provable guarantees that learn a class of deep
nets in the generative model view popularized by Hinton and others. Our
generative model is an n node multilayer neural net that has degree at
most n^γ for some γ < 1, and each edge has a random edge weight in [−1,1]. Our algorithm learns almost all
networks in
this class with polynomial running time. The sample complexity is
quadratic or cubic depending upon the details of the model. The
algorithm uses layerwise learning. It is based upon a novel idea of
observing correlations among features and using these to infer the
underlying edge structure via a global graph recovery procedure. The
analysis of the algorithm reveals interesting structure of neural
networks with random edge weights.
|
New
Algorithms for Learning Incoherent and Overcomplete Dictionaries
 
with Sanjeev Arora and Ankur Moitra. In COLT
2014.
An m×n matrix A is said to be μ-incoherent if each pair of columns has inner product at most μ/√n. Starting with the pioneering work of
Donoho and Huo such matrices (often called dictionaries)
have played a central role in signal processing, statistics and machine
learning. They allow sparse recovery:
there are efficient algorithms for representing a given vector as a
sparse linear combination of the columns of A (if such a combination
exists). However, in many applications ranging from sparse
coding
in machine learning to image denoising, the dictionary is unknown and
has to be learned from random examples of the form Y=AX where X is
drawn from an appropriate distribution --- this is the dictionary
learning
problem. Existing proposed solutions such as the Method of Optimal
Directions (MOD) or K-SVD do not provide any guarantees on their
performance nor do they necessarily learn a dictionary for which one
can solve sparse recovery problems. The only exception is the recent
work of Spielman, Wang and Wright which gives a polynomial time
algorithm for dictionary learning when A has full column rank
(in particular m is at most n). However, in most settings of interest,
dictionaries need to be overcomplete (i.e., m is larger than n).
Here we give the first polynomial time
algorithm for dictionary
learning when A is overcomplete. It succeeds under natural conditions
on how X is generated, provided that X has at most k = C·min(√n/(μ log n), m^{1/2−ε}) non-zero entries (for any ε > 0). In
other words it can handle
almost as many non-zeros as the best sparse recovery algorithms could
tolerate even if one knew the dictionary A exactly.
|
A
Tensor Approach to Learning Mixed Membership Community Models  
with Anima Anandkumar, Daniel Hsu and Sham M. Kakade. In
COLT 2013. Also JMLR Vol 15.
Modeling
community formation and detecting hidden communities in networks is a
well studied problem. However, theoretical analysis of community
detection has been mostly limited to models with non-overlapping
communities such as the stochastic block model. In this paper, we
remove this restriction, and consider a family of probabilistic network
models with overlapping communities, termed as the mixed membership
Dirichlet model, first introduced in Airoldi et al. This model allows
for nodes to have fractional memberships in multiple communities and
assumes that the community memberships are drawn from a Dirichlet
distribution. We propose a unified approach to learning these models
via a tensor spectral decomposition method. Our estimator is based on the low-order moment tensor of the observed network, consisting of 3-star
counts. Our learning method is fast and is based on simple linear
algebra operations, e.g. singular value decomposition and tensor power
iterations. We provide guaranteed recovery of community memberships and
model parameters and present a careful finite sample analysis of our
learning method. Additionally, our results match the best known scaling
requirements in the special case of the stochastic block model.
|
A Practical Algorithm for Topic Modeling with Provable Guarantees
with Sanjeev Arora, Yonatan Halpern, David Mimno, Ankur Moitra, David Sontag, Yichen Wu and Michael Zhu. In ICML 2013.
Topic
models provide a useful method for dimensionality reduction and
exploratory data analysis in large text corpora. Most approaches to
topic model inference have been based on a maximum likelihood
objective. Efficient algorithms exist that approximate this objective,
but they have no provable guarantees. Recently, algorithms have been
introduced that provide provable bounds, but these algorithms are not
practical because they are inefficient and not robust to violations of
model assumptions. In this paper we present an algorithm for topic
model inference that is both provable and practical. The algorithm
produces results comparable to the best MCMC implementations while
running orders of magnitude faster.
|
Provable
ICA with Unknown Gaussian Noise, and Implications for Gaussian Mixtures
and Autoencoders 
with Sanjeev Arora, Ankur Moitra and Sushant Sachdeva. In NIPS 2012.
We
present a new algorithm for Independent Component Analysis (ICA) which
has provable performance guarantees. In particular, suppose we are
given samples of the form y = Ax + η, where A is an unknown n×n matrix, x is a random variable whose components are independent and have a fourth moment strictly less than that of a standard Gaussian random variable, and η is an n-dimensional Gaussian random variable with unknown covariance Σ. We give an algorithm that provably recovers A and Σ up to an additive ε, whose running time and sample complexity are polynomial in n and 1/ε. To accomplish this, we
introduce a novel "quasi-whitening" step that may be useful in other
contexts in which the covariance of Gaussian noise is not known in
advance. We also give a general framework for finding all local optima
of a function (given an oracle for approximately finding just one) and
this is a crucial step in our algorithm, one that has been overlooked
in previous attempts, and allows us to control the accumulation of
error when we find the columns of A one by one via local search.
|
Learning
Topic Models - Going beyond SVD 
with Sanjeev Arora and Ankur Moitra. In FOCS
2012.
In
this work we revisit the topic modeling problem, which is used for
automatic comprehension and classification of data in a variety of
settings. A number of foundational works both in machine learning and
in theory have suggested a probabilistic model for documents, whereby
documents arise as a convex combination of (i.e. distribution on) a
small number of topic vectors, each topic vector being a distribution
on words (i.e. a vector of word-frequencies).
Theoretical studies of topic modeling focus on learning the model's
parameters assuming the data is actually generated from it. Existing
approaches for the most part rely on Singular Value Decomposition (SVD),
and consequently have one of two limitations: these works need to
either assume that each document contains only one topic, or else can
only recover the span of the topic vectors instead of the topic vectors
themselves.
This paper formally justifies Nonnegative Matrix Factorization (NMF) as
a main tool in this context, which is an analog of SVD where all
vectors are nonnegative. Using this tool we give the first
polynomial-time algorithm for learning topic models without the above
two limitations. The algorithm uses a fairly mild assumption about the
underlying topic matrix called separability, which is usually found to
hold in real-life data.
We hope that this paper will motivate further theoretical results that
use NMF as a replacement for SVD - just as NMF has come to replace SVD
in many applications.
|
Computing
a Nonnegative Matrix Factorization --- Provably 
with Sanjeev Arora, Ravi Kannan and Ankur Moitra. In STOC 2012.
The nonnegative matrix factorization problem asks for factorizing an
n*m matrix M into AW where A is n*r and W is r*m, further the entries
of the two matrices A and W are required to be nonnegative. This
problem has an extremely broad range of applications, from machine learning to communication complexity. We give algorithms for exact and approximate versions of the problem when r is small (which is the case for most applications) and a hardness result that shows our results cannot be substantially improved.
|
Finding
Overlapping Communities in Social Networks: Toward a Rigorous Approach
 
with Sanjeev Arora, Sushant Sachdeva and Grant Schoenebeck. In EC 2012.
A
community in a social network is usually understood to be a group of
nodes more densely connected with each other than with the rest of the
network. Many existing methods for finding them use heuristic
algorithms where the solution concept is defined in terms of an
NP-complete problem such as CLIQUE or HIERARCHICAL CLUSTERING. We
introduce a more formal approach whereby we make rigorous definitions
that sidestep the pitfall of high computational complexity and allow
efficient algorithms for finding communities.
|
Non-convex
Optimization |

|
For non-convex optimization,
we try to identify geometric properties of the objective function that
allow the optimum to be found efficiently. We have been able to show that several interesting problems, including matrix completion and tensor decompositions, have a nice strict-saddle property (see this blog post). I also hope to find
new optimization techniques that go beyond strict-saddle functions. |
Stabilized
SVRG: Simple Variance
Reduction for Nonconvex Optimization.  
with Zhize Li, Weiyao Wang, Xiang Wang. In COLT
2019.
Variance
reduction techniques like SVRG provide simple and fast algorithms for
optimizing a convex finite-sum objective. For nonconvex objectives,
these techniques can also find a first-order stationary point (with
small gradient). However, in nonconvex optimization it is often crucial
to find a second-order stationary point (with small gradient and almost
PSD Hessian). In this paper, we show that Stabilized SVRG (a simple variant of SVRG) can find an ϵ-second-order stationary point using only O(n^{2/3}/ϵ^2 + n/ϵ^{1.5}) stochastic gradients. To the best of our knowledge,
this is the first second-order guarantee for a simple variant of SVRG.
The running time almost matches the known guarantees for finding
ϵ-first-order stationary points.
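For reference, the basic SVRG template looks as follows; Stabilized SVRG, the paper's variant, adds a stabilization shift around each snapshot that is not shown in this sketch. Here grad_i(x, i) is a hypothetical oracle returning the gradient of the i-th summand.

```python
import numpy as np

def svrg(grad_i, x0, n, n_epochs=10, inner_steps=None, lr=0.05, rng=np.random.default_rng(0)):
    """Basic SVRG on f(x) = (1/n) * sum_i f_i(x); grad_i(x, i) returns grad f_i(x)."""
    x = x0.copy()
    inner_steps = inner_steps or n
    for _ in range(n_epochs):
        snapshot = x.copy()
        full_grad = np.mean([grad_i(snapshot, i) for i in range(n)], axis=0)
        for _ in range(inner_steps):
            i = rng.integers(n)
            # Variance-reduced gradient estimate.
            v = grad_i(x, i) - grad_i(snapshot, i) + full_grad
            x -= lr * v
    return x
```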
|
The
Step Decay Schedule: A Near Optimal, Geometrically Decaying Learning
Rate Schedule for Least Squares  
with Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli. To appear in NeurIPS 2019.
There is a stark disparity between the
learning
rate schedules used in the practice of large scale machine learning and
what are considered admissible learning rate schedules prescribed in
the theory of stochastic approximation. Recent results, such as in the
'super-convergence' methods which use oscillating learning rates, serve
to emphasize this point even more.
One plausible explanation is that non-convex neural network training
procedures are better suited to the use of fundamentally different
learning rate schedules, such as the "cut the learning rate every constant number of epochs" method (which more closely resembles
an exponentially decaying learning rate schedule); note that this
widely used schedule is in stark contrast to the polynomial decay
schemes prescribed in the stochastic approximation literature, which
are indeed shown to be (worst case) optimal for classes of convex
optimization problems.
The main contribution of this work shows that the picture is far more
nuanced, where we do not even need to move to non-convex optimization
to show other learning rate schemes can be far more effective. In fact,
even for the simple case of stochastic linear regression with a fixed
time horizon, the rate achieved by any polynomial decay scheme is
sub-optimal compared to the statistical minimax rate (by a factor of
condition number); in contrast, the "cut the learning rate every constant number of epochs" schedule provides an exponential improvement
(depending only logarithmically on the condition number) compared to
any polynomial decay scheme. Finally, it is important to ask whether our theoretical insights are somehow fundamentally tied to quadratic loss minimization (where we have circumvented minimax lower bounds for more general convex optimization problems). Here, we conjecture that
recent results which make the gradient norm small at a near optimal
rate, for both convex and non-convex optimization, may also provide
more insights into learning rate schedules used in practice.
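The two families of schedules being compared can be written in a couple of lines; the constants below are illustrative placeholders, not the ones analyzed in the paper.

```python
def step_decay_lr(epoch, lr0=0.1, drop=0.5, epochs_per_drop=30):
    """Geometric 'step decay': cut the learning rate by `drop` every fixed block of epochs."""
    return lr0 * drop ** (epoch // epochs_per_drop)

def poly_decay_lr(step, lr0=0.1, power=1.0):
    """Polynomial decay lr0 / (1 + t)^power, the classical stochastic-approximation choice."""
    return lr0 / (1.0 + step) ** power
```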
|
Minimizing
Nonconvex Population Risk from Rough Empirical Risk.  
with Chi Jin, Lydia T. Liu, Michael I. Jordan. In NeurIPS 2018.
Population risk---the expectation of
the loss
over the sampling mechanism---is always of primary interest in machine
learning. However, learning algorithms only have access to empirical
risk, which is the average loss over training examples. Although the
two risks are typically guaranteed to be pointwise close, for
applications with nonconvex nonsmooth losses (such as modern deep
networks), the effects of sampling can transform a well-behaved
population risk into an empirical risk with a landscape that is
problematic for optimization. The empirical risk can be nonsmooth, and
it may have many additional local minima. This paper considers a
general optimization framework which aims to find approximate local
minima of a smooth nonconvex function F (population risk) given only
access to the function value of another function f (empirical risk),
which is pointwise close to F. We propose a simple algorithm based on
stochastic gradient descent (SGD) on a smoothed version of f which is
guaranteed to find a second-order stationary point if f and F
are close enough, thus escaping all saddle points of F and all the
additional local minima introduced by f. We also provide an almost
matching lower bound showing that our SGD-based approach achieves the
optimal trade-off between the relevant parameters, as well as the
optimal dependence on problem dimension d, among all algorithms making
a polynomial number of queries. As a concrete example, we show that our
results can be directly used to give sample complexities for learning a
ReLU unit, whose empirical risk is nonsmooth.
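A minimal sketch of the algorithmic idea, assuming a hypothetical oracle grad_f that returns a (sub)gradient of the empirical risk: perturbing the query point with Gaussian noise makes each step a stochastic gradient of the smoothed surrogate E_z[f(x + z)]. The paper's algorithm follows this template with specific choices of smoothing radius and step size.

```python
import numpy as np

def smoothed_sgd(grad_f, x0, sigma=0.1, lr=0.01, n_steps=1000, rng=np.random.default_rng(0)):
    """SGD on a Gaussian-smoothed surrogate of a possibly nonsmooth empirical risk f.

    Evaluating grad_f at x + z with z ~ N(0, sigma^2 I) yields a stochastic
    gradient of the smoothed function E_z[f(x + z)].
    """
    x = x0.copy()
    for _ in range(n_steps):
        z = sigma * rng.standard_normal(x.shape)
        x -= lr * grad_f(x + z)
    return x
```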
|
Non-Convex
Matrix Completion
Against a Semi-Random Adversary.  
with Yu Cheng. In COLT 2018.
Matrix completion is a well-studied
problem with
many machine learning applications. In practice, the problem is often
solved by non-convex optimization algorithms. However, the current
theoretical analysis for non-convex algorithms relies heavily on the
assumption that every entry is observed with exactly the same
probability p, which is not realistic in practice.
In this paper, we investigate a more realistic semi-random model, where
the probability of observing each entry is at least p. Even with this
mild semi-random perturbation, we can construct counter-examples where
existing non-convex algorithms get stuck in bad local optima.
In light of the negative results, we propose a pre-processing step that
tries to re-weight the semi-random input, so that it becomes "similar"
to a random input. We give a nearly-linear time algorithm for this
problem, and show that after our pre-processing, all the local minima
of the non-convex objective can be used to approximately recover the
underlying ground-truth matrix.
|
Beyond
Log-concavity: Provable
Guarantees
for Sampling Multi-modal Distributions using Simulated Tempering
Langevin Monte
Carlo  
with Holden Lee and Andrej Risteski. NIPS 2017 Bayesian
Inference Workshop. In NeurIPS 2018.
A key task in Bayesian statistics is
sampling from
distributions that
are only specified up to a partition function (i.e., constant of
proportionality). However, without any assumptions, sampling (even
approximately) can be #P-hard, and few works have provided "beyond
worst-case" guarantees for such settings. For log-concave
distributions, classical results going back to Bakry and Émery (1985)
show that natural continuous-time Markov chains called Langevin
diffusions mix in polynomial time. The most salient feature of
log-concavity violated in practice is uni-modality: commonly, the
distributions we wish to sample from are multi-modal. In the presence
of multiple deep and well-separated modes, Langevin diffusion suffers
from torpid mixing. We address this problem by combining Langevin
diffusion with simulated tempering. The result is a Markov chain that
mixes more rapidly by transitioning between different temperatures of
the distribution. We analyze this Markov chain for the canonical
multi-modal distribution: a mixture of gaussians (of equal variance).
The algorithm based on our Markov chain provably samples from
distributions that are close to mixtures of gaussians, given access to
the gradient of the log-pdf. For the analysis, we use a spectral
decomposition theorem for graphs (Gharan and Trevisan, 2014) and a
Markov chain decomposition technique (Madras and Randall, 2002).
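A simplified sketch of Langevin dynamics combined with simulated tempering: logp and grad_logp are assumed oracles for the unnormalized log-density and its gradient, and the per-level weights (estimated partition functions) used by the actual algorithm are omitted, so the level-swap acceptance below is only illustrative.

```python
import numpy as np

def tempered_langevin(grad_logp, logp, x0, betas=(1.0, 0.5, 0.25),
                      step=1e-2, n_steps=5000, rng=np.random.default_rng(0)):
    """Langevin dynamics with simulated tempering over inverse temperatures `betas`."""
    x, level = x0.copy(), 0
    for _ in range(n_steps):
        beta = betas[level]
        # Langevin step targeting p(x)^beta.
        noise = np.sqrt(2 * step) * rng.standard_normal(x.shape)
        x = x + step * beta * grad_logp(x) + noise
        # Occasionally propose moving to a neighboring temperature level.
        if rng.random() < 0.1:
            new = min(max(level + rng.choice([-1, 1]), 0), len(betas) - 1)
            accept = np.exp((betas[new] - betas[level]) * logp(x))  # weights omitted
            if rng.random() < min(1.0, accept):
                level = new
    return x
```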
|
Global
Convergence of
Policy Gradient Methods for Linearized Control Problems  
with Maryam Fazel, Sham M. Kakade and Mehran Mesbahi. In ICML 2018.
Direct policy gradient methods for
reinforcement
learning and continuous control problems are a popular approach for a
variety of reasons: 1) they are easy to implement without explicit knowledge of the underlying model; 2) they are an "end-to-end" approach, directly optimizing the performance metric of interest; and 3) they inherently allow for richly parameterized policies. A notable drawback
is that even in the most basic continuous control problem (that of
linear quadratic regulators), these methods must solve a non-convex
optimization problem, where little is understood about their efficiency
from both computational and statistical perspectives. In contrast,
system identification and model based planning in optimal control
theory have a much more solid theoretical footing, where much is known
with regards to their computational and statistical properties. This
work bridges this gap by showing that (model free) policy gradient methods
globally converge to the optimal solution and are efficient
(polynomially so in relevant problem dependent quantities) with regards
to their sample and computational complexities.
|
Learning
One-hidden-layer Neural
Networks
with Landscape Design.  
with Tengyu Ma and Jason D. Lee. In ICLR
2018.
We consider the problem of learning a
one-hidden-layer neural network: we assume the input x is drawn from a Gaussian distribution and the label y = aᵀσ(Bx), where a is a nonnegative vector in m dimensions and B is a full-rank weight matrix. We first give
an analytic formula for the population risk of the standard squared
loss and demonstrate that it implicitly attempts to decompose a
sequence of low-rank tensors simultaneously. Inspired by the formula,
we design a non-convex objective function G whose landscape is
guaranteed to have the following properties:
1. All local minima of G are also global minima.
2. All global minima of G correspond to the ground truth parameters.
3. The value and gradient of G can be estimated using samples.
With these properties, stochastic gradient descent on G provably converges to the global minimum and learns the ground-truth parameters. We also prove a finite sample complexity result and validate the results
by simulations.
|
No
Spurious Local Minima in
Nonconvex Low Rank Problems: A Unified Geometric Analysis  
with Chi Jin and Yi Zheng. In ICML
2017.
In this paper we develop a new
framework that
captures
the common landscape underlying many widely studied non-convex low-rank matrix problems, including matrix sensing, matrix completion and robust PCA. In particular, we show for all of the above problems (including asymmetric cases): 1) all local minima are also globally optimal; 2) no high-order saddle points exist. These results explain why simple algorithms such as stochastic gradient descent converge globally and efficiently optimize these non-convex objective functions in practice. Our
framework connects and simplifies the existing analyses on optimization
landscapes for matrix sensing and symmetric matrix completion. The
framework naturally leads to new results for asymmetric matrix
completion and robust PCA.
|
How
to Escape Saddle Points
Efficiently  
with Chi Jin, Praneeth Netrapalli, Sham M. Kakade, Michael I. Jordan.
In ICML
2017.
This paper shows that a perturbed
form of
gradient
descent converges to a second-order stationary point in a number of iterations which depends only poly-logarithmically on dimension (i.e.,
it is almost "dimension-free"). The convergence rate of this procedure
matches the well-known convergence rate of gradient descent to
first-order stationary points, up to log factors. When all saddle
points are non-degenerate, all second-order stationary points are local
minima, and our result thus shows that perturbed gradient descent can
escape saddle points almost for free. Our results can be directly
applied to many machine learning applications, including deep learning.
As a particular concrete example of such an application, we show that
our results can be used directly to establish sharp global convergence
rates for matrix factorization. Our results rely on a novel
characterization of the geometry around saddle points, which may be of
independent interest to the non-convex optimization community.
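The high-level recipe can be sketched as follows, assuming a hypothetical gradient oracle grad_f; the paper's analysis fixes the perturbation radius, the step size, and how often perturbations may occur, whereas the constants below are placeholders.

```python
import numpy as np

def perturbed_gradient_descent(grad_f, x0, lr=0.01, n_steps=10000,
                               grad_tol=1e-3, radius=1e-2, rng=np.random.default_rng(0)):
    """Gradient descent that adds a small random perturbation whenever the gradient is tiny.

    Perturbing near first-order stationary points lets the iterate escape
    strict saddle points, following the paper's high-level recipe.
    """
    x = x0.copy()
    for _ in range(n_steps):
        g = grad_f(x)
        if np.linalg.norm(g) <= grad_tol:
            # Near a stationary point: jitter by a random direction of norm `radius`.
            d = rng.standard_normal(x.shape)
            x = x + radius * d / np.linalg.norm(d)
        else:
            x = x - lr * g
    return x
```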
|
On the Optimization Landscape of Tensor Decompositions
with Tengyu Ma. In NIPS 2017.
Non-convex optimization with local search heuristics has been widely used in machine learning, achieving many state-of-the-art results. It becomes increasingly important to understand why they work for these NP-hard problems on typical data. The landscape of many objective functions in learning has been conjectured to have the geometric property that "all local optima are (approximately) global optima", and thus they can be solved efficiently by local search algorithms. However, establishing such a property can be very difficult.
In this paper we analyze the optimization landscape of the random over-complete tensor decomposition problem, which has many applications in unsupervised learning, especially in learning latent variable models. In practice, it can be efficiently solved by gradient ascent on a non-convex objective. We show that for any small constant c > 0, among the set of points with function values a (1 + c)-factor larger than the expectation of the function, all the local maxima are approximate global maxima. Previously, the best known result only characterized the geometry in small neighborhoods around the true components. Our result implies that even with an initialization that is barely better than a random guess, the gradient ascent algorithm is guaranteed to solve this problem.
|
Homotopy
Analysis for Tensor PCA  
with Anima Anandkumar, Yuan Deng and Hossein Mobahi. In COLT
2017.
Developing efficient and guaranteed
nonconvex
algorithms has been an important challenge in modern machine learning.
Algorithms with good empirical performance such as stochastic gradient
descent often lack theoretical guarantees. In this paper, we analyze
the class of homotopy or continuation methods for global optimization
of nonconvex functions. These methods start from an objective function
that is efficient to optimize (e.g. convex), and progressively modify
it to obtain the required objective, and the solutions are passed along
the homotopy path. For the challenging problem of tensor PCA, we prove
global convergence of the homotopy method in the "high noise" regime.
The signal-to-noise requirement for our algorithm is tight in the sense
that it matches the recovery guarantee for the best degree-4
sum-of-squares algorithm. In addition, we prove a phase transition
along the homotopy path for tensor PCA. This allows us to simplify the
homotopy method to a local search algorithm, viz., tensor power
iterations, with a specific initialization and a noise injection
procedure, while retaining the theoretical guarantees.
|
Matrix
Completion has No
Spurious Local Minimum  
with Jason D. Lee and Tengyu Ma. In NIPS
2016. Best
Student Paper Award.
Matrix completion is a basic machine
learning
problem that has wide applications, especially in collaborative
filtering and recommender systems. Simple non-convex optimization
algorithms are popular and effective in practice. Despite recent
progress in proving various non-convex algorithms converge from a good
initial point, it remains unclear why random or arbitrary
initialization suffices in practice. We prove that the commonly used
non-convex objective function for matrix completion has no spurious
local minima -- all local minima must also be global. Therefore, many
popular optimization algorithms such as (stochastic) gradient descent
can provably solve matrix completion with arbitrary
initialization in polynomial time.
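A minimal sketch of the non-convex objective in question: factorized gradient descent on the squared error over observed entries, with M_obs and mask as hypothetical inputs and with the row-norm regularizer used in the analysis omitted.

```python
import numpy as np

def matrix_completion_gd(M_obs, mask, rank, lr=0.01, n_iters=2000, rng=np.random.default_rng(0)):
    """Gradient descent on the non-convex objective ||mask * (U V^T - M)||_F^2.

    M_obs holds observed values (arbitrary where mask == 0); mask is a 0/1 matrix
    of observed entries. The no-spurious-local-minima result is about this kind
    of factorized objective (plus a regularizer not shown here).
    """
    n, m = M_obs.shape
    U = 0.1 * rng.standard_normal((n, rank))
    V = 0.1 * rng.standard_normal((m, rank))
    for _ in range(n_iters):
        R = mask * (U @ V.T - M_obs)   # residual on observed entries only
        U -= lr * R @ V
        V -= lr * R.T @ U
    return U, V
```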
|
Efficient
approaches for escaping higher order saddle points in non-convex
optimization 
with Anima Anandkumar. In COLT
2016.
Local search heuristics for
non-convex
optimizations are popular in applied machine learning. However, in
general it is hard to guarantee that such algorithms even converge to a
local minimum, due to the existence of complicated saddle point
structures in high dimensions. Many functions have degenerate saddle
points such that the first and second order derivatives cannot
distinguish them from local optima. In this paper we use higher order
derivatives to escape these saddle points: we design the first
efficient algorithm guaranteed to converge to a third order local
optimum (while existing techniques are at most second order). We also
show that it is NP-hard to extend this further to finding fourth order
local optima.
|
Efficient
Algorithms for Large-scale Generalized Eigenvector Computation and
Canonical Correlation Analysis  
with Chi Jin, Sham M. Kakade, Praneeth Netrapalli, Aaron Sidford. In ICML
2016.
This paper considers the problem of
canonical-correlation analysis (CCA) (Hotelling, 1936) and, more
broadly, the generalized eigenvector problem for a pair of symmetric
matrices. These are two fundamental problems in data analysis and
scientific computing with numerous applications in machine learning and
statistics (Shi and Malik, 2000; Hardoon et al., 2004; Witten et al.,
2009).
We provide simple iterative
algorithms, with
improved runtimes, for solving these problems that are globally
linearly convergent with moderate dependencies on the condition numbers
and eigenvalue gaps of the matrices involved.
We obtain our results by reducing CCA
to the
top-k generalized eigenvector problem. We solve this problem through a
general framework that simply requires black box access to an
approximate linear system solver. Instantiating this framework with
accelerated gradient descent we obtain a near linear running time.
Our algorithm is linear in the input
size
and the number of components k up to a log(k) factor. This is essential
for handling large-scale matrices that appear in practice. To the best
of our knowledge this is the first such algorithm with global linear
convergence. We hope that our results prompt further research and
ultimately improve the practical running time for performing these
important data analysis procedures on large data sets.
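As a small illustration of the reduction, the sketch below forms the generalized eigenvector problem whose top eigenvector gives the leading canonical directions, and solves it with a dense solver; the paper's point is to solve such problems at scale with iterative linear-system solvers rather than a dense factorization.

```python
import numpy as np
from scipy.linalg import eigh

def cca_top_direction(X, Y, reg=1e-6):
    """Top canonical correlation directions via a generalized eigenvector problem.

    Solves B v = lambda A v with B the cross-covariance block matrix and A the
    block-diagonal covariance matrix (regularized to be positive definite).
    """
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    n = X.shape[0]
    Sxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / n
    B = np.block([[np.zeros_like(Sxx), Sxy], [Sxy.T, np.zeros_like(Syy)]])
    A = np.block([[Sxx, np.zeros_like(Sxy)], [np.zeros_like(Sxy).T, Syy]])
    vals, vecs = eigh(B, A)                 # generalized symmetric eigenproblem
    top = vecs[:, -1]                       # eigenvector for the largest eigenvalue
    return top[:X.shape[1]], top[X.shape[1]:]
```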
|
Un-regularizing:
approximate proximal point and faster stochastic algorithms for
empirical risk minimization 
with Roy Frostig, Sham M. Kakade and Aaron Sidford. In ICML 2015.
We develop a family of accelerated
stochastic algorithms that minimize
sums of convex functions. Our algorithms improve upon the fastest
running time for empirical risk minimization (ERM), and in particular
linear least-squares regression, across a wide range of problem
settings. To achieve this, we establish a framework based on the
classical proximal point algorithm. Namely, we provide several
algorithms that reduce the minimization of a strongly convex function
to approximate minimizations of regularizations of the function. Using
these results, we accelerate recent fast stochastic algorithms in a
black-box fashion. Empirically, we demonstrate that the resulting
algorithms exhibit notions of stability that are advantageous in
practice. Both in theory and in practice, the provided algorithms reap
the computational benefits of adding a large strongly convex
regularization term, without incurring a corresponding bias to the
original problem.
|
Escaping
From Saddle Points --- Online Stochastic Gradient for Tensor
Decomposition  
with Furong Huang, Chi Jin and Yang Yuan. In COLT
2015.
We analyze stochastic gradient descent
for
optimizing non-convex
functions. In many cases for non-convex functions the goal is to find a
reasonable local minimum, and the main concern is that gradient updates
are trapped in saddle points. In this paper we identify a strict saddle property for non-convex problems that allows for efficient optimization.
Using this property we show that stochastic gradient descent converges
to a local minimum in a polynomial number of iterations. To the best of
our knowledge this is the first work that gives global convergence
guarantees for stochastic gradient descent on non-convex functions with
exponentially many local minima and saddle points.
Our
analysis can be applied to orthogonal tensor decomposition, which is
widely used in learning a rich class of latent variable models. We
propose a new optimization formulation for the tensor decomposition
problem that has strict saddle property. As a result we get the first
online algorithm for orthogonal tensor decomposition with global
convergence guarantee.
|
Competing
with the Empirical Risk Minimizer in a Single Pass 
with Roy Frostig, Sham M. Kakade and Aaron Sidford. In COLT
2015.
Many optimization problems that arise
in
science and engineering are
those in which we only have a stochastic approximation to the
underlying objective (e.g. estimation problems such as linear
regression). That is, given some distribution D over functions f(x), we
wish to minimize P(x)=E[f(x)], using as few samples from D as possible.
In the absence of computational constraints, the empirical risk
minimizer (ERM) -- the minimizer on a sample average of observed data
-- is widely regarded as the estimation strategy of choice due to its
desirable statistical convergence properties. Our goal is to do as well
as the empirical risk minimizer on every problem while minimizing the
use of computational resources such as running time and space usage.
This work provides a simple streaming
algorithm for solving this
problem, with performance guarantees competitive with that of the ERM.
In particular, under standard regularity assumptions on D, our
algorithm enjoys the following properties:
- The algorithm can be implemented in linear time with a single pass of the observed data, using space linear in the size of a single sample.
- The algorithm achieves the same statistical rate of convergence as the empirical risk minimizer on every problem D (even considering constant factors).
- The algorithm's performance depends on the initial error at a rate that decreases super-polynomially.
Moreover, we quantify the rate of
convergence of both our algorithm
and that of the ERM, showing that, after a number of samples that
depend only on the regularity parameters, the streaming algorithm
rapidly approaches the leading order rate of ERM in expectation.
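A minimal single-pass sketch in the spirit of the algorithm, specialized to least squares: plain SGD with iterate averaging, using memory linear in the dimension. The actual algorithm and its step-size choices differ in the details.

```python
import numpy as np

def streaming_least_squares(sample_stream, dim, lr=0.01):
    """Single-pass SGD with iterate averaging for least-squares regression.

    sample_stream yields (a, b) pairs with per-sample loss (a.x - b)^2 / 2;
    only O(dim) memory is used. Iterate averaging is the standard device
    behind ERM-competitive streaming rates.
    """
    x = np.zeros(dim)
    x_avg = np.zeros(dim)
    for t, (a, b) in enumerate(sample_stream, start=1):
        x -= lr * (a @ x - b) * a          # stochastic gradient step
        x_avg += (x - x_avg) / t           # running average of iterates
    return x_avg
```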
|
|
Tensor
Decompositions |

|
Tensors are higher-order generalizations of matrices (see this post for an introduction). I
have worked on designing efficient and
robust algorithms for tensor decompositions, as well as applying these
techniques to learning many latent variable models.
|
Learning
Two-layer Neural
Networks with Symmetric Inputs.  
with Rohith Kuditipudi, Xiang Wang and Zhize Li. In ICLR
2019.
We give a new algorithm for learning
a
two-layer
neural network under a general class of input distributions. Assuming
there is a ground-truth two-layer network
y=Aσ(Wx)+ξ,
where A,W are weight matrices, ξ represents noise, and the number of
neurons in the hidden layer is no larger than the input or output, our
algorithm is guaranteed to recover the parameters A,W of the
ground-truth network. The only requirement on the input x is that its distribution is symmetric (x and -x are equally likely), which still allows highly complicated and structured inputs.
Our algorithm is based on the method-of-moments framework and extends
several results in tensor decompositions. We use spectral algorithms to
avoid the complicated non-convex optimization in learning neural
networks. Experiments show that our algorithm can robustly learn the
ground-truth neural network with a small number of samples for many
symmetric input distributions.
|
Understanding
Composition of
Word Embeddings via
Tensor Decomposition.  
with Abraham Frandsen. In ICLR
2019.
Word embedding is a powerful tool in
natural
language processing. In this paper we consider the problem of word
embedding composition --- given vector representations of two
words, compute a vector for the entire phrase. We give a generative
model that can capture specific syntactic relations between words.
Under our model, we prove that the correlations between three words
(measured by their PMI) form a tensor that has an approximate low rank
Tucker decomposition. The result of the Tucker decomposition gives the
word embeddings as well as a core tensor, which can be used to produce
better compositions of the word embeddings. We also complement our
theoretical results with experiments that verify our assumptions, and
demonstrate the effectiveness of the new composition method.
|
On the Optimization Landscape of Tensor Decompositions
with Tengyu Ma. In NIPS 2017.
Non-convex optimization with local search heuristics has been widely used in machine learning, achieving many state-of-the-art results. It becomes increasingly important to understand why they work for these NP-hard problems on typical data. The landscape of many objective functions in learning has been conjectured to have the geometric property that "all local optima are (approximately) global optima", and thus they can be solved efficiently by local search algorithms. However, establishing such a property can be very difficult.
In this paper we analyze the optimization landscape of the random over-complete tensor decomposition problem, which has many applications in unsupervised learning, especially in learning latent variable models. In practice, it can be efficiently solved by gradient ascent on a non-convex objective. We show that for any small constant c > 0, among the set of points with function values a (1 + c)-factor larger than the expectation of the function, all the local maxima are approximate global maxima. Previously, the best known result only characterized the geometry in small neighborhoods around the true components. Our result implies that even with an initialization that is barely better than a random guess, the gradient ascent algorithm is guaranteed to solve this problem.
|
Provable
Learning of Noisy-or
Networks  
with Sanjeev Arora, Tengyu Ma and Andrej Risteski. In STOC 2017.
Many machine learning applications
use latent
variable models to explain structure in data, whereby visible variables
(= coordinates of the given datapoint) are explained as a probabilistic
function of some hidden variables. Finding parameters with the maximum
likelihood is NP-hard even in very simple settings. In recent years,
provably efficient algorithms were nevertheless developed for models
with linear structures: topic models, mixture models, hidden Markov
models, etc. These algorithms use matrix or tensor decomposition, and
make some reasonable assumptions about the parameters of the underlying
model.
But matrix or tensor decomposition seems of little use when the latent
variable model has nonlinearities. The current paper shows how to make
progress: tensor decomposition is applied for learning the single-layer
noisy-or network, which is a textbook example of a Bayes net, and
used for example in the classic QMR-DT software for diagnosing which
disease(s) a patient may have by observing the symptoms he/she
exhibits.
The technical novelty here, which should be useful in other settings in
future, is an analysis of tensor decomposition in the presence of systematic error (i.e., where the noise/error is correlated with the signal and does not decrease as the number of samples goes to infinity). This requires
rethinking all steps of tensor decomposition methods from the ground
up.
For simplicity our analysis is stated assuming that the network
parameters were chosen from a probability distribution but the method
seems more generally applicable.
|
Homotopy
Analysis for Tensor PCA  
with Anima Anandkumar, Yuan Deng and Hossein Mobahi. In COLT 2017.
Developing efficient and guaranteed
nonconvex
algorithms has been an important challenge in modern machine learning.
Algorithms with good empirical performance such as stochastic gradient
descent often lack theoretical guarantees. In this paper, we analyze
the class of homotopy or continuation methods for global optimization
of nonconvex functions. These methods start from an objective function
that is efficient to optimize (e.g. convex), and progressively modify
it to obtain the required objective, and the solutions are passed along
the homotopy path. For the challenging problem of tensor PCA, we prove
global convergence of the homotopy method in the "high noise" regime.
The signal-to-noise requirement for our algorithm is tight in the sense
that it matches the recovery guarantee for the best degree-4
sum-of-squares algorithm. In addition, we prove a phase transition
along the homotopy path for tensor PCA. This allows us to simplify the
homotopy method to a local search algorithm, viz., tensor power
iterations, with a specific initialization and a noise injection
procedure, while retaining the theoretical guarantees.
|
Rich
Component Analysis  with
James Zou. In ICML 2016.
In many settings, we have multiple
data sets
(also called views) that capture different and overlapping aspects of
the same phenomenon. We are often interested in finding patterns that
are unique to one or to a subset of the views. For example, we might
have one set of molecular observations and one set of physiological
observations on the same group of individuals, and we want to quantify
molecular patterns that are uncorrelated with physiology. Despite being
a common problem, this is highly challenging when the correlations come
from complex distributions. In this paper, we develop the general
framework of Rich Component Analysis (RCA) to model settings where the
observations from different views are driven by different sets of
latent components, and each component can be a complex,
high-dimensional distribution. We introduce algorithms based on
cumulant extraction that provably learn each of the components without
having to model the other components. We show how to integrate RCA with
stochastic gradient descent into a meta-algorithm for learning general
models, and demonstrate substantial improvement in accuracy on several
synthetic and real datasets in both supervised and unsupervised tasks.
Our method makes it possible to learn latent variable models when we
don't have samples from the true model but only samples after complex
perturbations.
|
Decomposing
Overcomplete 3rd Order Tensors using Sum-of-Squares Algorithms  
with Tengyu Ma. In RANDOM
2015.
Tensor rank and low-rank tensor
decompositions have many applications
in learning and complexity theory. Most known algorithms use unfoldings
of tensors and can only handle rank up to n^{p/2} for a p-th order
tensor. Previously no efficient algorithm could decompose 3rd order tensors when the rank is super-linear in the dimension. Using ideas
from sum-of-squares hierarchy, we give the first quasi-polynomial time
algorithm that can decompose a random 3rd order tensor when the rank is as large as n^{3/2}/polylog(n).
We also give a polynomial time
algorithm for
certifying the injective
norm of random low rank tensors. Our tensor decomposition algorithm
exploits the relationship between injective norm and the tensor
components. The proof relies on interesting tools for decoupling random
variables to prove better matrix concentration bounds, which can be
useful in other settings.
|
Escaping
From Saddle Points --- Online Stochastic Gradient for Tensor
Decomposition  
with Furong Huang, Chi Jin and Yang Yuan. In COLT
2015.
We analyze stochastic gradient descent
for
optimizing non-convex
functions. In many cases for non-convex functions the goal is to find a
reasonable local minimum, and the main concern is that gradient updates
are trapped in saddle points. In this paper we identify a strict saddle property for non-convex problems that allows for efficient optimization.
Using this property we show that stochastic gradient descent converges
to a local minimum in a polynomial number of iterations. To the best of
our knowledge this is the first work that gives global convergence
guarantees for stochastic gradient descent on non-convex functions with
exponentially many local minima and saddle points.
Our
analysis can be applied to orthogonal tensor decomposition, which is
widely used in learning a rich class of latent variable models. We
propose a new optimization formulation for the tensor decomposition
problem that has strict saddle property. As a result we get the first
online algorithm for orthogonal tensor decomposition with global
convergence guarantee.
|
Learning
Mixtures of Gaussians in High Dimensions 
with Qingqing Huang and Sham M. Kakade. In STOC
2015.
Efficiently learning mixtures of Gaussians is a fundamental problem in
statistics and learning theory. Given samples coming from a random one
out of k Gaussian distributions in R^n, the learning problem asks to
estimate the means and the covariance matrices of these Gaussians. This
learning problem arises in many areas ranging from the natural sciences
to the social sciences, and has also found many machine learning
applications.
Unfortunately, learning mixtures of Gaussians is an information-theoretically hard problem: in order to learn the
parameters up to a reasonable accuracy, the number of samples required
is exponential in the number of Gaussian components in the worst case.
In this work, we show that provided we are in high enough dimensions,
the class of Gaussian mixtures is learnable in its most general form
under a smoothed analysis framework, where the parameters are randomly
perturbed from an adversarial starting point. In particular, given
samples from a mixture of Gaussians with randomly perturbed parameters,
when n ≥ Ω(k^2), we give an algorithm that learns the
parameters in polynomial time using a polynomial number of samples.
The central algorithmic ideas consist
of
new ways
to decompose the moment tensor of the Gaussian mixture by exploiting
its structural properties. The symmetries of this tensor are derived
from the combinatorial structure of higher order moments of Gaussian
distributions (sometimes referred to as Isserlis' theorem or Wick's
theorem). We also develop new tools for bounding smallest singular
values of structured random matrices, which could be useful in other
smoothed analysis settings.
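As a quick numerical check of the Isserlis/Wick structure mentioned above (a sanity check of my own, not part of the paper's algorithm): for a single Gaussian with mean μ and covariance Σ, the third moment satisfies E[x_a x_b x_c] = μ_a μ_b μ_c + μ_a Σ_{bc} + μ_b Σ_{ac} + μ_c Σ_{ab}, and the mixture's moment tensor is the corresponding weighted sum over components.

import numpy as np

rng = np.random.default_rng(0)
n = 4
mu = rng.normal(size=n)
B = rng.normal(size=(n, n))
Sigma = B @ B.T                                # a valid covariance matrix

x = rng.multivariate_normal(mu, Sigma, size=1_000_000)
empirical = np.einsum('sa,sb,sc->abc', x, x, x) / x.shape[0]

wick = (np.einsum('a,b,c->abc', mu, mu, mu)
        + np.einsum('a,bc->abc', mu, Sigma)
        + np.einsum('b,ac->abc', mu, Sigma)
        + np.einsum('c,ab->abc', mu, Sigma))

print(np.max(np.abs(empirical - wick)))        # small: only sampling error remains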
|
Understanding
Overcomplete 3rd Order Tensors
with Anima Anandkumar and Majid Janzamin. Short version in COLT
2015.
This
is a series of ongoing works where we try to understand how to
decompose overcomplete third-order tensors. In general, decomposing
third order tensors can be useful in learning many latent variable
models (multi-view models, mixtures of Gaussians, etc.). Previous
algorithms can usually either handle only the undercomplete regime,
where the number of components is smaller than the dimension, or
require access to higher-order tensors, which often results in higher
sample complexity and running time. Surprisingly, many heuristics (such
as the tensor power method or alternating least squares) work very well
in practice. In these works we try to understand when decomposing
third-order tensors is tractable. Several parts are available on arXiv.
The first part [Arxiv]
analyzes the tensor power method. In this paper we show that if there
is a good initialization (one that is constant-close to some
component), the tensor power method converges to vectors that are very
close to the true component. We also give an algorithm that converges
to the exact true component given the (already reasonably close) output
of the tensor power method. Good initializations can be found if some
supervised information is available, or by a spectral method if the
number of components k is only a constant factor larger than the dimension d.
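For reference, here is a minimal numpy sketch of the basic power iteration being analyzed (illustrative only; the dimension, rank, and initialization quality are arbitrary choices): starting from a vector correlated with one component, the update u ← T(I, u, u)/||T(I, u, u)|| converges to a vector close to that component.

import numpy as np

rng = np.random.default_rng(0)
n, r = 100, 150                                # mildly overcomplete: r > n
A = rng.normal(size=(n, r))
A /= np.linalg.norm(A, axis=0)                 # unit-norm components a_i
T = np.einsum('il,jl,kl->ijk', A, A, A)        # T = sum_i a_i (x) a_i (x) a_i

u = 0.6 * A[:, 0] + 0.8 * rng.normal(size=n) / np.sqrt(n)   # constant-close start
u /= np.linalg.norm(u)
for _ in range(30):
    u = np.einsum('ijk,j,k->i', T, u, u)       # power update T(I, u, u)
    u /= np.linalg.norm(u)

print(abs(u @ A[:, 0]))                        # close to 1: u approximately recovers a_0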
The second part [Arxiv]
studies how many samples are needed to estimate the moment tensor
accurately enough for several models.
We show that, surprisingly, in many natural cases the sample complexity
is nearly linear in the number of components. In general, the bounds in
this paper are tighter than those in many previous works.
The third part (largely independent of the
first two) [Arxiv]
studies how well the tensor power method performs without a very good
initialization.
We use techniques from approximate message passing to analyze the
dynamics of the tensor power method, and show that the initialization
does not need to be very close to the true component for the method to
perform well.
This partly explains why the tensor power method still works in mildly
overcomplete settings.
See also Majid Janzamin's page about this project for more
information.
|
A
Tensor Approach to Learning Mixed Membership Community Models  
with Anima Anandkumar, Daniel Hsu and Sham M. Kakade. In
COLT 2013. Also JMLR Vol 15.
Modeling
community formation and detecting hidden communities in networks is a
well studied problem. However, theoretical analysis of community
detection has been mostly limited to models with non-overlapping
communities such as the stochastic block model. In this paper, we
remove this restriction, and consider a family of probabilistic network
models with overlapping communities, termed the mixed membership
Dirichlet model, first introduced by Airoldi et al. This model allows
for nodes to have fractional memberships in multiple communities and
assumes that the community memberships are drawn from a Dirichlet
distribution. We propose a unified approach to learning these models
via a tensor spectral decomposition method. Our estimator is based on
a low-order moment tensor of the observed network, consisting of 3-star
counts. Our learning method is fast and is based on simple linear
algebra operations, e.g. singular value decomposition and tensor power
iterations. We provide guaranteed recovery of community memberships and
model parameters and present a careful finite sample analysis of our
learning method. Additionally, our results match the best known scaling
requirements in the special case of the stochastic block model.
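As a small illustration of the 3-star statistic the estimator is built from (a sketch of the counting step only, on a toy random graph with arbitrary partitions; not the full estimator): split the nodes into a source set X and three destination sets A, B, C, and for each triple (a, b, c) record the fraction of x in X adjacent to all three.

import numpy as np

rng = np.random.default_rng(0)
n = 400
adj = (rng.random((n, n)) < 0.05).astype(float)    # toy random graph
adj = np.triu(adj, 1)
adj = adj + adj.T                                  # symmetric, no self-loops

idx = rng.permutation(n)
X, A, B, C = np.array_split(idx, 4)                # disjoint node partitions

# T[a, b, c] = fraction of x in X adjacent to all of a in A, b in B, c in C
T = np.einsum('xa,xb,xc->abc',
              adj[np.ix_(X, A)], adj[np.ix_(X, B)], adj[np.ix_(X, C)]) / len(X)
print(T.shape)                                     # (100, 100, 100)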
|
Tensor
decompositions for learning latent variable models  
with Anima Anandkumar, Daniel Hsu, Sham M. Kakade, Matus Telgarsky. In JMLR Vol 15.
This work considers a computationally and statistically efficient
parameter estimation method for a wide class of latent variable models,
including Gaussian mixture models, hidden Markov models, and latent
Dirichlet allocation, which exploits a certain tensor structure in
their low-order observable moments (typically, of second and third
order). Specifically, parameter estimation is reduced to the problem of
extracting a certain (orthogonal) decomposition of a symmetric tensor
derived from the moments; this decomposition can be viewed as a natural
generalization of the singular value decomposition for matrices.
Although tensor decompositions are generally intractable to compute,
the decomposition of these specially structured tensors can be
efficiently obtained by a variety of approaches, including power
iterations and maximization approaches (similar to the case of
matrices). A detailed analysis of a robust tensor power method is
provided, establishing an analogue of Wedin's perturbation theorem for
the singular vectors of matrices. This implies a robust and
computationally tractable estimation approach for several popular
latent variable models.
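As a small illustration of the reduction (a simplified, noiseless sketch on a toy model of my own, not the paper's full method): for moments of the form M2 = Σ_i w_i μ_i μ_i^T and M3 = Σ_i w_i μ_i ⊗ μ_i ⊗ μ_i with linearly independent μ_i, a whitening map W with W^T M2 W = I turns M3 into an orthogonally decomposable tensor, which a (robust) tensor power method can then handle.

import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 3
mu = rng.normal(size=(d, k))                       # linearly independent means mu_i
w = np.array([0.5, 0.3, 0.2])                      # mixing weights

M2 = np.einsum('i,ai,bi->ab', w, mu, mu)           # sum_i w_i mu_i mu_i^T
M3 = np.einsum('i,ai,bi,ci->abc', w, mu, mu, mu)   # exact third moment, no noise

# Whitening map W (d x k) with W^T M2 W = I, from the top-k eigendecomposition.
vals, vecs = np.linalg.eigh(M2)
W = vecs[:, -k:] / np.sqrt(vals[-k:])

T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)    # whitened k x k x k tensor

# Its components v_i = sqrt(w_i) * W^T mu_i form an orthonormal set:
V = np.sqrt(w) * (W.T @ mu)
print(np.round(V.T @ V, 6))                        # (approximately) the identity matrix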
|
Other
Topics |
|
I am broadly interested in
problems related to Theoretical Computer Science and Machine Learning. I
have also worked on other topics such as approximation algorithms,
complexity of financial derivatives and robust optimization. |
Faster
Algorithms for
High-Dimensional Robust Covariance Estimation.  
with Yu Cheng, Ilias Diakonikolas and David P. Woodruff. In COLT
2019
.
We
study
the problem of estimating the covariance matrix of a high-dimensional
distribution when a small constant fraction of the samples can be
arbitrarily corrupted. Recent work gave the first polynomial time
algorithms for this problem with near-optimal error guarantees for
several natural structured distributions. Our main contribution is to
develop faster algorithms for this problem whose running time nearly
matches that of computing the empirical covariance.
Given N=Ω(d^2/ϵ^2) samples from a d-dimensional Gaussian distribution,
an ϵ-fraction of which may be arbitrarily corrupted, our algorithm runs
in time O(d^{3.26})/poly(ϵ) and approximates the unknown covariance
matrix to optimal error up to a logarithmic factor. Previous robust
algorithms with comparable error guarantees all have runtimes Ω(d^{2ω})
when ϵ=Ω(1), where ω is the exponent of matrix multiplication. We also
provide evidence that improving the running time of our algorithm may
require new algorithmic techniques.
|
High-Dimensional
Robust Mean
Estimation
in Nearly-Linear Time.  
with Yu Cheng and Ilias Diakonikolas. In SODA
2019.
We study the fundamental
problem of
high-dimensional mean estimation in a robust model where a constant
fraction of the samples are adversarially corrupted. Recent work gave
the first polynomial time algorithms for this problem with
dimension-independent error guarantees for several families of
structured distributions. In this work, we give the first nearly-linear
time algorithms for high-dimensional robust mean estimation.
Specifically, we focus on distributions with (i) known covariance and
sub-gaussian tails, and (ii) unknown bounded covariance. Given N
samples on R^d, an ε-fraction of which may be arbitrarily corrupted,
our algorithms run in time O~(Nd)/poly(ε) and approximate the true mean
within the information-theoretically optimal error, up to constant
factors. Previous robust algorithms with comparable error guarantees
have running times Ω~(Nd^2) for ε=Ω(1). Our algorithms rely on a
natural family of SDPs parameterized by our current guess ν for the
unknown mean μ⋆. We give a win-win analysis establishing the following:
either a near-optimal solution to the primal SDP yields a good
candidate for μ⋆ -- independent of our current guess ν -- or the dual
SDP yields a new guess ν′ whose distance from μ⋆ is smaller by a
constant factor. We exploit the special structure of the corresponding
SDPs to show that they are approximately solvable in nearly-linear
time. Our approach is quite general, and we believe it can also be
applied to obtain nearly-linear time algorithms for other
high-dimensional robust learning problems.
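For intuition about the problem (a toy filtering baseline of my own, not the nearly-linear time SDP-based algorithm from the paper): a constant fraction of adversarial points can pull the plain sample mean far from the truth, while repeatedly removing points with extreme projections along the top eigenvector of the empirical covariance already limits the damage on this simple Gaussian example.

import numpy as np

rng = np.random.default_rng(0)
d, N, eps = 50, 20_000, 0.1

clean = rng.normal(size=(int((1 - eps) * N), d))   # inliers from N(0, I), true mean 0
outliers = np.full((int(eps * N), d), 5.0)         # adversarial cluster far away
X = np.vstack([clean, outliers])

def filtered_mean(X, rounds=15, tail=0.02):
    X = X.copy()
    for _ in range(rounds):
        centered = X - X.mean(axis=0)
        _, _, Vt = np.linalg.svd(centered, full_matrices=False)
        scores = np.abs(centered @ Vt[0])          # projection on the top direction
        X = X[scores <= np.quantile(scores, 1 - tail)]
    return X.mean(axis=0)

print(np.linalg.norm(X.mean(axis=0)))              # ~3.5: the plain mean is hijacked
print(np.linalg.norm(filtered_mean(X)))            # much smaller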
|
Online
Service with Delay 
with Yossi Azar, Arun Ganesh and Debmalya Panigrahi. In
STOC 2017.
In this paper, we introduce the online
service with delay problem.
In this problem, there are $n$ points in a metric space that issue
service requests over time, and a server that serves these requests.
The goal is to minimize the
sum of distance traveled by the server and the total delay in serving
the requests.
This problem models the fundamental
tradeoff between batching requests to improve locality and reducing
delay to improve response time, which has many applications
in operations management, operating systems, logistics, supply chain
management, and scheduling.
Our main result is to show a
poly-logarithmic competitive ratio for the
online service with delay
problem. This result is obtained by an algorithm that we call the preemptive
service algorithm. The salient feature of this algorithm
is a process called preemptive service, which uses a novel combination
of (recursive) time forwarding and spatial exploration on a metric
space. We hope this
technique will be useful for related problems such as reordering buffer
management, online TSP, vehicle routing, etc. We also generalize our
results to k > 1 servers.
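To make the objective concrete, here is a toy simulation on a line metric with a naive rule (serve the oldest pending request once its waiting time exceeds its distance); the instance, the rule, and the unit costs are illustrative assumptions of mine, not the preemptive service algorithm from the paper.

# requests arrive over time at points on a line; the single server starts at 0.
# Cost = total distance moved + total waiting time (delay) of served requests.
requests = [(0, 4.0), (1, 4.5), (2, -3.0), (6, 0.5)]    # (arrival_time, position)

server, move_cost, delay_cost = 0.0, 0.0, 0.0
pending, t, i = [], 0, 0
while i < len(requests) or pending:
    while i < len(requests) and requests[i][0] <= t:    # admit new arrivals
        pending.append(requests[i])
        i += 1
    if pending:
        arrival, pos = pending[0]                       # oldest pending request
        if t - arrival >= abs(pos - server):            # waited at least its distance
            move_cost += abs(pos - server)
            delay_cost += t - arrival
            server = pos
            pending.pop(0)
            continue                                    # maybe serve more at time t
    t += 1

print(move_cost, delay_cost, move_cost + delay_cost)    # 15.5 19.0 34.5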
|
Towards
a better approximation for sparsest cut?  
with Sanjeev Arora and Ali Kemal Sinop. In FOCS
2013
We
give a new (1+ε)-approximation for the sparsest cut problem on graphs
where small sets expand significantly more than the sparsest cut (sets
of size n/r expand by a factor of \sqrt{\log n \log r} more, for some
small r; this condition holds for many natural graph families). We give
two different algorithms. One involves Guruswami-Sinop rounding on the
level-r Lasserre relaxation. The other is combinatorial and involves a
new notion called Small Set Expander Flows
(inspired by the expander flows
of ARV) which we show exists in the input graph. Both algorithms run in
time 2^{O(r)} poly(n). We also give similar approximation algorithms
for graphs of genus g with an analogous local expansion condition. This
is the first algorithm we know of that achieves a (1+ε)-approximation
on such a general family of graphs.
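For reference, a minimal sketch of the quantity being approximated (uniform sparsest cut, on an arbitrary toy graph): the sparsity of a cut S is the number of edges crossing (S, V \ S) divided by min(|S|, |V \ S|), and the sparsest cut minimizes this over all S.

from itertools import combinations

edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]   # two triangles + a bridge
n = 6

def sparsity(S):
    S = set(S)
    crossing = sum(1 for u, v in edges if (u in S) != (v in S))
    return crossing / min(len(S), n - len(S))

best = min((sparsity(S), S) for k in range(1, n // 2 + 1)
           for S in combinations(range(n), k))
print(best)     # sparsity 1/3, achieved by S = (0, 1, 2): only the bridge edge is cut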
|
New
Tools for Graph Coloring 
with Sanjeev Arora. In APPROX
2011
We
explore the possibility of better coloring algorithms using the new
concept of threshold rank. We are able to analyze high-level Lasserre
relaxations for low threshold rank graphs and distance transitive
graphs, and our algorithm makes significantly better progress (towards
an O(log n) coloring) in these cases compared to previous algorithms
(where the number of colors is polynomial). We also propose a plausible
conjecture that, if true, would yield a good sub-exponential time
algorithm for graph coloring.
|
New
Algorithms for Learning in Presence of Errors 
with Sanjeev Arora. In ICALP
2011
We
give new algorithms for a variety of randomly generated instances of
computational problems using a linearization technique that reduces
them to solving a system of linear equations. Our techniques can be
applied to the low-noise case of the well-known LWE
(Learning With Errors) problem, giving a slightly
subexponential time algorithm, which suggests that known hardness
results are almost tight.
|
New
Results on Simple Stochastic Games 
with Decheng Dai. In ISAAC
2009
We
give a new algorithm for Simple Stochastic Games when the number of
RANDOM vertices is small. We also show that coarse approximation is as
hard as fine approximation for Simple Stochastic Games.
|