
Research Interest 
I'm interested in applying techniques from theoretical computer
science to problems in machine learning, with the hope of developing
new algorithms with provable guarantees. The meta problem is: machine
learning people have been applying heuristic algorithms such as local
search or EM type algorithm to find solutions to many hard problems.
Are these problems really easy or hard? If they are easy can we use
algorithms with provable guarantees; if they are hard is there any
reasonable model where they become easy or even the known heuristic
algorithms can be proved to work?
My thesis: Provable Algorithms for Machine Learning Problems


Publications

Efficient approaches for escaping higher order saddle points in nonconvex optimization Arxiv
with Anima Anandkumar. To appear in COLT 2016.
[Abstract]
Local search heuristics for nonconvex optimizations are popular in applied machine learning. However, in general it is hard to guarantee that such algorithms even converge to a local minimum, due to the existence of complicated saddle point structures in high dimensions. Many functions have degenerate saddle points such that the first and second order derivatives cannot distinguish them with local optima. In this paper we use higher order derivatives to escape these saddle points: we design the first efficient algorithm guaranteed to converge to a third order local optimum (while existing techniques are at most second order). We also show that it is NPhard to extend this further to finding fourth order local optima.

Efficient Algorithms for Largescale Generalized Eigenvector Computation and Canonical Correlation Analysis
Arxiv
with Chi Jin, Sham M. Kakade, Praneeth Netrapalli, Aaron Sidford. To appear in ICML 2016.
[Abstract]
This paper considers the problem of canonicalcorrelation analysis (CCA) (Hotelling, 1936) and, more broadly, the generalized eigenvector problem for a pair of symmetric matrices. These are two fundamental problems in data analysis and scientific computing with numerous applications in machine learning and statistics (Shi and Malik, 2000; Hardoon et al., 2004; Witten et al., 2009).
We provide simple iterative algorithms, with improved runtimes, for solving these problems that are globally linearly convergent with moderate dependencies on the condition numbers and eigenvalue gaps of the matrices involved.
We obtain our results by reducing CCA to the topk generalized eigenvector problem. We solve this problem through a general framework that simply requires black box access to an approximate linear system solver. Instantiating this framework with accelerated gradient descent we obtain a near linear running time.
Our algorithm is linear in the input size and the number of components k up to a log(k) factor. This is essential for handling largescale matrices that appear in practice. To the best of our knowledge this is the first such algorithm with global linear convergence. We hope that our results prompt further research and ultimately improve the practical running time for performing these important data analysis procedures on large data sets.

In many settings, we have multiple data sets (also called views) that capture different and overlapping aspects of the same phenomenon. We are often interested in finding patterns that are unique to one or to a subset of the views. For example, we might have one set of molecular observations and one set of physiological observations on the same group of individuals, and we want to quantify molecular patterns that are uncorrelated with physiology. Despite being a common problem, this is highly challenging when the correlations come from complex distributions. In this paper, we develop the general framework of Rich Component Analysis (RCA) to model settings where the observations from different views are driven by different sets of latent components, and each component can be a complex, highdimensional distribution. We introduce algorithms based on cumulant extraction that provably learn each of the components without having to model the other components. We show how to integrate RCA with stochastic gradient descent into a metaalgorithm for learning general models, and demonstrate substantial improvement in accuracy on several synthetic and real datasets in both supervised and unsupervised tasks. Our method makes it possible to learn latent variable models when we don't have samples from the true model but only samples after complex perturbations.

Tensor rank and lowrank tensor decompositions have many applications
in learning and complexity theory. Most known algorithms use unfoldings
of tensors and can only handle rank up to n^{p/2} for a pth order
tensor. Previously no efficient algorithm can decompose 3rd order
tensors when the rank is superlinear in the dimension. Using ideas
from sumofsquares hierarchy, we give the first quasipolynomial time
algorithm that can decompose a random 3rd order tensor decomposition
when the rank is as large as n^{3/2}/polylog n.
We also give a polynomial time algorithm for certifying the injective
norm of random low rank tensors. Our tensor decomposition algorithm
exploits the relationship between injective norm and the tensor
components. The proof relies on interesting tools for decoupling random
variables to prove better matrix concentration bounds, which can be
useful in other settings.

Intersecting Faces: Nonnegative Matrix Factorization With New Guarantees Arxiv
with James Zou. In ICML 2015.
[Abstract]
Nonnegative
matrix factorization (NMF) is a natural model of admixture and is
widely used in science and engineering. A plethora of algorithms have
been developed to tackle NMF, but due to the nonconvex nature of the
problem, there is little guarantee on how well these methods work.
Recently a surge of research have focused on a very restricted class of
NMFs, called separable NMF, where provably correct algorithms have been
developed. In this paper, we propose the notion of subsetseparable
NMF, which substantially generalizes the property of separability. We
show that subsetseparability is a natural necessary condition for the
factorization to be unique or to have minimum volume. We developed the
FaceIntersect algorithm which provably and efficiently solves
subsetseparable NMF under natural conditions, and we prove that our
algorithm is robust to small noise. We explored the performance of
FaceIntersect on simulations and discuss settings where it empirically
outperformed the stateofart methods. Our work is a step towards
finding provably correct algorithms that solve large classes of NMF
problems.

Unregularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization Arxiv
with Roy Frostig, Sham M. Kakade and Aaron Sidford. In ICML 2015.
[Abstract]
We develop a family of accelerated stochastic algorithms that minimize
sums of convex functions. Our algorithms improve upon the fastest
running time for empirical risk minimization (ERM), and in particular
linear leastsquares regression, across a wide range of problem
settings. To achieve this, we establish a framework based on the
classical proximal point algorithm. Namely, we provide several
algorithms that reduce the minimization of a strongly convex function
to approximate minimizations of regularizations of the function. Using
these results, we accelerate recent fast stochastic algorithms in a
blackbox fashion. Empirically, we demonstrate that the resulting
algorithms exhibit notions of stability that are advantageous in
practice. Both in theory and in practice, the provided algorithms reap
the computational benefits of adding a large strongly convex
regularization term, without incurring a corresponding bias to the
original problem.

Escaping From Saddle Points  Online Stochastic Gradient for Tensor Decomposition Arxiv
with Furong Huang, Chi Jin and Yang Yuan. In COLT 2015.
[Abstract]
We analyze stochastic gradient descent for optimizing nonconvex
functions. In many cases for nonconvex functions the goal is to find a
reasonable local minimum, and the main concern is that gradient updates
are trapped in saddle points. In this paper we identify strict saddle
property for nonconvex problem that allows for efficient optimization.
Using this property we show that stochastic gradient descent converges
to a local minimum in a polynomial number of iterations. To the best of
our knowledge this is the first work that gives global convergence
guarantees for stochastic gradient descent on nonconvex functions with
exponentially many local minima and saddle points. Our
analysis can be applied to orthogonal tensor decomposition, which is
widely used in learning a rich class of latent variable models. We
propose a new optimization formulation for the tensor decomposition
problem that has strict saddle property. As a result we get the first
online algorithm for orthogonal tensor decomposition with global
convergence guarantee.

Learning Mixtures of Gaussians in High Dimensions Arxiv
with Qingqing Huang and Sham M. Kakade. In STOC 2015.
[Abstract]
Efficiently learning mixture of Gaussians is a fundamental problem in
statistics and learning theory. Given samples coming from a random one
out of k Gaussian distributions in R^n, the learning problem asks to
estimate the means and the covariance matrices of these Gaussians. This
learning problem arises in many areas ranging from the natural sciences
to the social sciences, and has also found many machine learning
applications.
Unfortunately, learning mixture of Gaussians is an
information theoretically hard problem: in order to learn the
parameters up to a reasonable accuracy, the number of samples required
is exponential in the number of Gaussian components in the worst case.
In this work, we show that provided we are in high enough dimensions,
the class of Gaussian mixtures is learnable in its most general form
under a smoothed analysis framework, where the parameters are randomly
perturbed from an adversarial starting point. In particular, given
samples from a mixture of Gaussians with randomly perturbed parameters,
when n > {\Omega}(k^2), we give an algorithm that learns the
parameters with polynomial running time and using polynomial number of
samples.
The central algorithmic ideas consist of new ways
to decompose the moment tensor of the Gaussian mixture by exploiting
its structural properties. The symmetries of this tensor are derived
from the combinatorial structure of higher order moments of Gaussian
distributions (sometimes referred to as Isserlis' theorem or Wick's
theorem). We also develop new tools for bounding smallest singular
values of structured random matrices, which could be useful in other
smoothed analysis settings.

Simple, Efficient, and Neural Algorithms for Sparse Coding Arxiv
with Sanjeev Arora, Tengyu Ma and Ankur Moitra. In COLT 2015.
[Abstract]
Sparse coding is a basic task in many fields including signal
processing, neuroscience and machine learning where the goal is to
learn a basis that enables a sparse representation of a given set of
data, if one exists. Its standard formulation is as a nonconvex
optimization problem which is solved in practice by heuristics based on
alternating minimization. Re cent work has resulted in several
algorithms for sparse coding with provable guarantees, but somewhat
surprisingly these are outperformed by the simple alternating
minimization heuristics. Here we give a general framework for
understanding alternating minimization which we leverage to analyze
existing heuristics and to design new ones also with provable
guarantees. Some of these algorithms seem implementable on simple
neural architectures, which was the original motivation of Olshausen
and Field (1997a) in introducing sparse coding. We also give the first
efficient algorithm for sparse coding that works almost up to the
information theoretic limit for sparse recovery on incoherent
dictionaries. All previous algorithms that approached or surpassed this
limit run in time exponential in some natural parameter. Finally, our
algorithms improve upon the sample complexity of existing approaches.
We believe that our analysis framework will have applications in other
settings where simple iterative algorithms are used.

Competing with the Empirical Risk Minimizer in a Single Pass Arxiv
with Roy Frostig, Sham M. Kakade and Aaron Sidford. In COLT 2015.
[Abstract]
Many optimization problems that arise in science and engineering are
those in which we only have a stochastic approximation to the
underlying objective (e.g. estimation problems such as linear
regression). That is, given some distribution D over functions f(x), we
wish to minimize P(x)=E[f(x)], using as few samples from D as possible.
In the absence of computational constraints, the empirical risk
minimizer (ERM)  the minimizer on a sample average of observed data
 is widely regarded as the estimation strategy of choice due to its
desirable statistical convergence properties. Our goal is to do as well
as the empirical risk minimizer on every problem while minimizing the
use of computational resources such as running time and space usage.
This work provides a simple streaming algorithm for solving this
problem, with performance guarantees competitive with that of the ERM.
In particular, under standard regularity assumptions on D, our
algorithm enjoys the following properties:
 The algorithm can be implemented in linear time with a single pass
of the observed data, using space linear in the size of a single
sample.
 The algorithm achieves the same statistical rate of
convergence as the empirical risk minimizer on every problem D (even
considering constant factors).
 The algorithm's performance depends on the initial error at a rate that decreases superpolynomially.
Moreover, we quantify the rate of convergence of both our algorithm
and that of the ERM, showing that, after a number of samples that
depend only on the regularity parameters, the streaming algorithm
rapidly approaches the leading order rate of ERM in expectation.

Understanding Overcomplete 3rd Order Tensors
with Anima Anandkumar and Majid Janzamin. Short version in COLT 2015.
[Details]
This
is a series of ongoing works where we try to understand how to
decompose overcomplete thirdorder tensors. In general, decomposing
third order tensors can be useful in learning many latent variable
models (multiview model, mixture of Gaussians, etc.) Previous
algorithms usually can either only handle the undercomplete
regime where the number of components is smaller than the number of of
dimensions, or require access of higher order tensors which often
results in higher sample complexity and running time. Surprisingly,
many heuristics (such as Tensor Power Method or Alternating Least
Squares) work very well in practice. In these works we try to
understand when is decomposing thirdorder tensors tractable. Several
parts are available on arxiv.
The first part [Arxiv]
tries to analyze Tensor Power Method. In this paper we show if there is
a good initialization (that is constantclose to some component),
Tensor Power Method will converge to vectors that are very close to the
true component. We also give an algorithm that converges to the exact
true component given the (already reasonably close) results of the
tensor power method. Good initializations can be found if there are
some supervised information, or by spectral method if ther number of
component k is only a constant factor larger than the dimension d.
The second part [Arxiv]
tries to understand how many samples do we need in order to estimate
the moment tensor accurately enough for several models.
We show surprisingly in many natural cases the sample complexity is
nearly linear in the number of components. In general the bounds in
this paper is tighter than many previous works.
The third part (more independent from the first two) [Arxiv]
tries to understand how well does Tensor Power Method perform without
really good initialization.
We use techniques from Approximate Message Passing to analyze the
dynamics of the tensor power method, and show that the initialization
does not need to be very close to the true component for tensor power
method to perform well.
This partly explains why tensor power method still works in the mildly
overcomplete settings.
See also Majid Janzamin's page about this project for more information.

Provable Bounds for Learning Some Deep Representations Arxiv
with Sanjeev Arora, Aditya Bhaskara and Tengyu Ma. In ICML 2014.
[Abstract]
We
give algorithms with provable guarantees that learn a class of deep
nets in the generative model view popularized by Hinton and others. Our
generative model is an n node multilayer neural net that has degree at
most n^{\gamma} for some \gamma < 1 and each edge has a random edge
weight in [1,1]. Our algorithm learns almost all networks in
this class with polynomial running time. The sample complexity is
quadratic or cubic depending upon the details of the model. The
algorithm uses layerwise learning. It is based upon a novel idea of
observing correlations among features and using these to infer the
underlying edge structure via a global graph recovery procedure. The
analysis of the algorithm reveals interesting structure of neural
networks with random edge weights.

New Algorithms for Learning Incoherent and Overcomplete Dictionaries Arxiv
with Sanjeev Arora and Ankur Moitra. In COLT 2014.
[Abstract]
A
m*n matrix A is said to be muincoherent if each pair of columns has
inner product at most mu/\sqrt{n}. Starting with the pioneering work of
Donoho and Huo such matrices (often called dictionaries) have played a central role in signal processing, statistics and machine learning. They allow sparse recovery:
there are efficient algorithms for representing a given vector as a
sparse linear combination of the columns of A (if such a combination
exists). However, in many applications ranging from sparse coding
in machine learning to image denoising, the dictionary is unknown and
has to be learned from random examples of the form Y=AX where X is
drawn from an appropriate distribution  this is the dictionary learning
problem. Existing proposed solutions such as the Method of Optimal
Directions (MOD) or KSVD do not provide any guarantees on their
performance nor do they necessarily learn a dictionary for which one
can solve sparse recovery problems. The only exception is the recent
work of Spielman, Wang and Wright which gives a polynomial time
algorithm for dictionary learning when A has full column rank
(in particular m is at most n). However, in most settings of interest,
dictionaries need to be {\em overcomplete} (i.e., m is larger than n).
Here we give the first polynomial time algorithm for dictionary
learning when A is overcomplete. It succeeds under natural conditions
on how X is generated, provided that X has at most
k=Cmin(\sqrt{n}/mu logn,m^{1/2eps})
nonzero entries (for any eps>0). In other words it can handle
almost as many nonzeros as the best sparse recovery algorithms could
tolerate even if one knew the dictionary A exactly.

We
give a new (1+eps)approximation for sparsest cut problem on graphs
where small sets expand significantly more than the sparsest cut (sets
of size n/r expand by a factor \sqrt{\log n\log r} bigger, for some
small r; this condition holds for many natural graph families). We give
two different algorithms. One involves GuruswamiSinop rounding on the
levelr Lasserre relaxation. The other is combinatorial and involves a
new notion called Small Set Expander Flows (inspired by the expander flows
of ARV) which we show exists in the input graph. Both algorithms run in
time 2^{O(r)} poly(n). We also show similar approximation algorithms in
graphs with genus g with an analogous local expansion condition. This
is the first algorithm we know of that achieves (1+eps)approximation
on such general family of graphs.

A Tensor Approach to Learning Mixed Membership Community Models Arxiv
with Anima Anandkumar, Daniel Hsu and Sham M. Kakade. In COLT 2013. Also JMLR Vol 15.
[Abstract]
Modeling
community formation and detecting hidden communities in networks is a
well studied problem. However, theoretical analysis of community
detection has been mostly limited to models with nonoverlapping
communities such as the stochastic block model. In this paper, we
remove this restriction, and consider a family of probabilistic network
models with overlapping communities, termed as the mixed membership
Dirichlet model, first introduced in Airoldi et al. This model allows
for nodes to have fractional memberships in multiple communities and
assumes that the community memberships are drawn from a Dirichlet
distribution. We propose a unified approach to learning these models
via a tensor spectral decomposition method. Our estimator is based on
loworder moment tensor of the observed network, consisting of 3star
counts. Our learning method is fast and is based on simple linear
algebra operations, e.g. singular value decomposition and tensor power
iterations. We provide guaranteed recovery of community memberships and
model parameters and present a careful finite sample analysis of our
learning method. Additionally, our results match the best known scaling
requirements in the special case of the stochastic block model.

A Practical Algorithm for Topic Modeling with Provable Guarantees Arxiv
with Sanjeev Arora, Yoni Halpern, David Mimno, Ankur Moitra, David Sontag, Yichen Wu, Michael Zhu, in ICML 2013
[Abstract]
Topic
models provide a useful method for dimensionality reduction and
exploratory data analysis in large text corpora. Most approaches to
topic model inference have been based on a maximum likelihood
objective. Efficient algorithms exist that approximate this objective,
but they have no provable guarantees. Recently, algorithms have been
introduced that provide provable bounds, but these algorithms are not
practical because they are inefficient and not robust to violations of
model assumptions. In this paper we present an algorithm for topic
model inference that is both provable and practical. The algorithm
produces results comparable to the best MCMC implementations while
running orders of magnitude faster.

Tensor decompositions for learning latent variable models Arxiv
with Anima Anandkumar, Daniel Hsu, Sham M. Kakade, Matus Telgarsky. In JMLR Vol 15.
[Abstract]
This work considers a computationally and statistically efficient parameter estimation method
for a wide class of latent variable modelsincluding Gaussian mixture models, hidden Markov
models, and latent Dirichlet allocationwhich exploits a certain tensor structure in their loworder
observable moments (typically, of second and thirdorder). Specifically, parameter estimation
is reduced to the problem of extracting a certain (orthogonal) decomposition of a symmetric
tensor derived from the moments; this decomposition can be viewed as a natural generalization
of the singular value decomposition for matrices. Although tensor decompositions are generally
intractable to compute, the decomposition of these specially structured tensors can be efficiently
obtained by a variety of approaches, including power iterations and maximization approaches
(similar to the case of matrices). A detailed analysis of a robust tensor power method is provided,
establishing an analogue of Wedin's perturbation theorem for the singular vectors of
matrices. This implies a robust and computationally tractable estimation approach for several
popular latent variable models.

Provable ICA with Unknown Gaussian Noise, and Implications for Gaussian Mixtures and Autoencoders Arxiv
with Sanjeev Arora, Ankur
Moitra and Sushant Sachdeva. in NIPS 2012[Abstract]
We
present a new algorithm for Independent Component Analysis (ICA) which
has provable performance guarantees. In particular, suppose we are
given samples of the form y = Ax + \eta where A is an unknown n*n
matrix and x is a random variable whose components are independent and
have a fourth moment strictly less than that of a standard Gaussian
random variable, \eta is an ndimensional Gaussian random variable with
unknown covariance \Sigma: We give an algorithm that provable recovers
A and \Sigma up to an additive \epsilon whose running time and sample
complexity are polynomial in n and 1 / \epsilon. To accomplish this, we
introduce a novel "quasiwhitening" step that may be useful in other
contexts in which the covariance of Gaussian noise is not known in
advance. We also give a general framework for finding all local optima
of a function (given an oracle for approximately finding just one) and
this is a crucial step in our algorithm, one that has been overlooked
in previous attempts, and allows us to control the accumulation of
error when we find the columns of A one by one via local search.

In this work we
revisit the topic modeling problem, which is used for automatic
comprehension and classification of data in a variety of settings. A
number of foundational works both in machine learning and in theory
have suggested a probabilistic model for documents, whereby documents
arise as a convex combination of (i.e. distribution on) a small number
of topic vectors, each topic vector being a distribution on words (i.e.
a vector of wordfrequencies).
Theoretical studies of topic
modeling focus on learning the model's parameters assuming the data is
actually generated from it. Existing approaches for the most part rely
on Singular Value Decomposition(SVD), and consequently have one of two
limitations: these works need to either assume that each document
contains only one topic, or else can only recover the span of the topic
vectors instead of the topic vectors themselves.
This paper
formally justifies Nonnegative Matrix Factorization(NMF) as a main tool
in this context, which is an analog of SVD where all vectors are
nonnegative. Using this tool we give the first polynomialtime
algorithm for learning topic models without the above two limitations.
The algorithm uses a fairly mild assumption about the underlying topic
matrix called separability, which is usually found to hold in reallife
data.
We hope that this paper will motivate further theoretical
results that use NMF as a replacement for SVD  just as NMF has come to
replace SVD in many applications.

Computing
a Nonnegative Matrix Factorization  Provably Arxiv
with Sanjeev Arora, Ravi Kannan and
Ankur
Moitra. In STOC
2012. [Abstract]
. The nonnegative matrix factorization
problem asks for factorizing an n*m matrix M into AW where A is n*r and
W is r*m, further the entries of the two matrices A and W are required
to be nonnegative. This problem has a extremely broad range of
applications, from machine learning to communication complexity. We
give algorithms for exact and approximate versions of the problem when
r is small (which is the case for most applications) and a hardness
result that shows our results cannot be substancially improved.

Finding
Overlapping Communities in Social Networks: Toward a
Rigorous Approach Arxiv
with Sanjeev Arora, Sushant Sachdeva and
Grant Schoenebeck, In EC 2012.[Abstract]
A community in a social network is usually understood to be a group of
nodes more densely connected with each other than with the rest of the
network. Many existing methods for finding them use heuristic
algorithms where the solution concept is defined in terms of an
NPcomplete problem such as CLIQUE or HIERARCHICAL CLUSTERING. We
introduce a more formal approach whereby we make rigorous definitions
that sidestep the pitfall of high computational complexity and allow
efficient algorithms for finding communities.

Computational
Complexity and Information Asymmetry in
Financial Products PDF
with Sanjeev Arora, Boaz Barak and Markus
Brunnermeier. In ICS
2010[More Info]
For some frequently asked questions about the paper, see our FAQ page.
A short version appeared in Communications
of the ACM,
2011 Issue 5.
Full text PDF
See some blog comments on this paper: Intractability Center, Freedom
to Tinker, Boing
Boing, Daily
Kos, In
Theory, Healthy
Algorithms, Lipton's
blog
We put forward the issue of Computational
Complexity in the analysis of Financial Derivatives, and show that
there is a fundamental difficulty in pricing financial derivatives even
in very simple settings.
This is still a working paper. The latest version (updated Jan. 2012)
can be found here:
Computational
Complexity and Information Asymmetry in
Financial Products

We explore the possibility of better coloring algorithm using the new
concept threshold rank. We are able to analyze high level
Lasserre Relaxations for low threshold rank graphs and distance
transitive graphs, and our algorithm makes significantly better
progress
(towards O(log n) coloring) in these cases compared to previous
algorithms (where the number of colors is polynomial). We also purpose
a plausible conjecture, that if true would yield good subexponential
time algorithm for graph coloring.

We give new algorithms for a variety of randomlygenerated instances of
computational
problems using a linearization technique that reduces to solving a
system of linear equations. Our techniques can be applied to the low
noise case of the wellknown LWE
(Learning With Errors) problem, giving a slightly
subexponential time algorithm which suggests known hardness results are
almost tight.

We give a new algorithm for Simple Stochastic Games when the number of
RANDOM vertices is small. We also show that coarse approximation
is as hard as fine approximation for Simple Stochastic Games.



Contact

Email: 
rongge AT cs
DOT duke DOT edu 
Tel: 
+1 (919) 6607330 
Mail: 
Duke University
Campus Box 90129
308 Research Drive (LSRC Building)
Room D226
Durham, NC 27708 USA 



