
Research
I am broadly interested in theoretical computer science and machine learning. Most of my research is in one of the following directions:
• Representation Learning: Modern machine learning algorithms such as deep learning try to automatically learn useful hidden representations of documents, images or other forms of data. We try to understand which structures are useful, and how to design provable algorithms for learning these hidden structures.
• Non-convex Optimization: Most machine learning problems can be formalized as non-convex optimization problems. However, in the worst case non-convex optimization is NP-hard. We try to identify properties of the problems that make them "easy" to solve, and design more efficient algorithms.
• Tensor Decompositions: Tensors are high order generalizations of matrices. Tensor decomposition is a powerful algebraic tool that can be used to learn many latent variable models. We try to design faster, more robust algorithms for tensor decomposition, and apply them to new problems.
I have also worked on other topics such as approximation algorithms and complexity of financial derivatives.
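To make the tensor decomposition direction concrete, here is a minimal sketch (my own toy code, not an algorithm from any particular paper) of the basic tensor power iteration that recovers one component of an orthogonally decomposable symmetric 3-tensor:

```python
import numpy as np

def tensor_power_iteration(T, n_iters=100, seed=0):
    """Recover one component of an (assumed) orthogonally decomposable
    symmetric 3-tensor T = sum_i w_i * a_i (x) a_i (x) a_i via power iteration."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(T.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(n_iters):
        u = np.einsum('ijk,j,k->i', T, u, u)  # contract T with u twice: T(I, u, u)
        u /= np.linalg.norm(u)
    lam = np.einsum('ijk,i,j,k->', T, u, u, u)  # eigenvalue estimate T(u, u, u)
    return lam, u

# Toy orthogonal tensor: the components are the standard basis vectors.
d = 4
weights = np.array([3.0, 2.0, 1.5, 1.0])
T = np.zeros((d, d, d))
for w, e in zip(weights, np.eye(d)):
    T += w * np.einsum('i,j,k->ijk', e, e, e)

lam, u = tensor_power_iteration(T)  # converges to one (weight, component) pair
```

With a random start the iteration converges to one of the components; in practice one deflates (subtracts the found component) and repeats, and the robustness of this loop to noise is exactly where the technical difficulty lies.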
Representation Learning
In representation learning, we have worked on algorithms for problems such as topic models, sparse coding, social networks and noisy-or networks. I hope to work on representation learning for more applications, including deep representations.

Provable Learning of Noisy-or Networks, with Sanjeev Arora, Tengyu Ma and Andrej Risteski. Manuscript.
Many machine learning applications use latent variable models to explain structure in data, whereby visible variables (i.e., the coordinates of the given data point) are explained as a probabilistic function of some hidden variables. Finding parameters with the maximum likelihood is NP-hard even in very simple settings. In recent years, provably efficient algorithms were nevertheless developed for models with linear structure: topic models, mixture models, hidden Markov models, etc. These algorithms use matrix or tensor decomposition and make some reasonable assumptions about the parameters of the underlying model. But matrix and tensor decomposition seem of little use when the latent variable model has nonlinearities. This paper shows how to make progress: tensor decomposition is applied to learn the single-layer noisy-or network, which is a textbook example of a Bayes net and is used, for example, in the classic QMR-DT software for diagnosing which disease(s) a patient may have by observing the symptoms he/she exhibits. The technical novelty, which should be useful in other settings in the future, is an analysis of tensor decomposition in the presence of systematic error (i.e., where the noise/error is correlated with the signal and does not decrease as the number of samples goes to infinity). This requires rethinking all steps of tensor decomposition methods from the ground up. For simplicity our analysis assumes that the network parameters were chosen from a probability distribution, but the method seems more generally applicable.

Provable Algorithms for Inference in Topic Models, with Sanjeev Arora, Frederic Koehler, Tengyu Ma and Ankur Moitra. In ICML 2016.
Recently, there has been considerable progress on designing algorithms with provable guarantees -- typically using linear algebraic methods -- for parameter learning in latent variable models. But designing provable algorithms for inference has proven to be more challenging. Here we take a first step towards provable inference in topic models. We leverage a property of topic models that enables us to construct simple linear estimators for the unknown topic proportions that have small variance, and consequently can work with short documents. Our estimators also correspond to finding an estimate around which the posterior is well concentrated. We show lower bounds proving that for shorter documents it can be information-theoretically impossible to find the hidden topics. Finally, we give empirical results demonstrating that our algorithm works on realistic topic models: it yields good solutions on synthetic data and runs in time comparable to a single iteration of Gibbs sampling.

Rich Component Analysis, with James Zou. In ICML 2016.
In many settings, we have multiple data sets (also called views) that capture different and overlapping aspects of the same phenomenon. We are often interested in finding patterns that are unique to one or to a subset of the views. For example, we might have one set of molecular observations and one set of physiological observations on the same group of individuals, and we want to quantify molecular patterns that are uncorrelated with physiology. Despite being a common problem, this is highly challenging when the correlations come from complex distributions. In this paper, we develop the general framework of Rich Component Analysis (RCA) to model settings where the observations from different views are driven by different sets of latent components, and each component can be a complex, high-dimensional distribution. We introduce algorithms based on cumulant extraction that provably learn each of the components without having to model the other components.
We show how to integrate RCA with stochastic gradient descent into a meta-algorithm for learning general models, and demonstrate substantial improvements in accuracy on several synthetic and real datasets in both supervised and unsupervised tasks. Our method makes it possible to learn latent variable models when we do not have samples from the true model but only samples after complex perturbations.

Intersecting Faces: Non-negative Matrix Factorization With New Guarantees, with James Zou. In ICML 2015.
Non-negative matrix factorization (NMF) is a natural model of admixture and is widely used in science and engineering. A plethora of algorithms have been developed to tackle NMF, but due to the non-convex nature of the problem, there is little guarantee on how well these methods work. Recently, a surge of research has focused on a very restricted class of NMFs, called separable NMF, where provably correct algorithms have been developed. In this paper, we propose the notion of subset-separable NMF, which substantially generalizes the property of separability. We show that subset-separability is a natural necessary condition for the factorization to be unique or to have minimum volume. We develop the Face-Intersect algorithm, which provably and efficiently solves subset-separable NMF under natural conditions, and we prove that our algorithm is robust to small noise. We explore the performance of Face-Intersect on simulations and discuss settings where it empirically outperforms state-of-the-art methods. Our work is a step towards finding provably correct algorithms that solve large classes of NMF problems.

Simple, Efficient, and Neural Algorithms for Sparse Coding, with Sanjeev Arora, Tengyu Ma and Ankur Moitra. In COLT 2015.
Sparse coding is a basic task in many fields, including signal processing, neuroscience and machine learning, where the goal is to learn a basis that enables a sparse representation of a given set of data, if one exists.
Its standard formulation is as a non-convex optimization problem which is solved in practice by heuristics based on alternating minimization. Recent work has resulted in several algorithms for sparse coding with provable guarantees, but somewhat surprisingly these are outperformed by the simple alternating minimization heuristics. Here we give a general framework for understanding alternating minimization, which we leverage to analyze existing heuristics and to design new ones that also have provable guarantees. Some of these algorithms seem implementable on simple neural architectures, which was the original motivation of Olshausen and Field (1997a) in introducing sparse coding. We also give the first efficient algorithm for sparse coding that works almost up to the information-theoretic limit for sparse recovery on incoherent dictionaries. All previous algorithms that approached or surpassed this limit run in time exponential in some natural parameter. Finally, our algorithms improve upon the sample complexity of existing approaches. We believe that our analysis framework will have applications in other settings where simple iterative algorithms are used.

Provable Bounds for Learning Some Deep Representations, with Sanjeev Arora, Aditya Bhaskara and Tengyu Ma. In ICML 2014.
We give algorithms with provable guarantees that learn a class of deep nets in the generative model view popularized by Hinton and others. Our generative model is an n-node multilayer neural net that has degree at most n^γ for some γ < 1, and each edge has a random weight in [-1, 1]. Our algorithm learns almost all networks in this class in polynomial time. The sample complexity is quadratic or cubic depending upon the details of the model. The algorithm uses layerwise learning. It is based upon a novel idea of observing correlations among features and using these to infer the underlying edge structure via a global graph recovery procedure. The analysis of the algorithm reveals interesting structure of neural networks with random edge weights.

New Algorithms for Learning Incoherent and Overcomplete Dictionaries, with Sanjeev Arora and Ankur Moitra. In COLT 2014.
An m×n matrix A is said to be μ-incoherent if each pair of columns has inner product at most μ/√n. Starting with the pioneering work of Donoho and Huo, such matrices (often called dictionaries) have played a central role in signal processing, statistics and machine learning. They allow sparse recovery: there are efficient algorithms for representing a given vector as a sparse linear combination of the columns of A (if such a combination exists). However, in many applications ranging from sparse coding in machine learning to image denoising, the dictionary is unknown and has to be learned from random examples of the form Y = AX, where X is drawn from an appropriate distribution -- this is the dictionary learning problem. Existing proposed solutions such as the Method of Optimal Directions (MOD) or K-SVD do not provide any guarantees on their performance, nor do they necessarily learn a dictionary for which one can solve sparse recovery problems. The only exception is the recent work of Spielman, Wang and Wright, which gives a polynomial-time algorithm for dictionary learning when A has full column rank (in particular, m is at most n). However, in most settings of interest, dictionaries need to be overcomplete (i.e., m is larger than n). Here we give the first polynomial-time algorithm for dictionary learning when A is overcomplete. It succeeds under natural conditions on how X is generated, provided that X has at most k = C·min(√n/(μ log n), m^(1/2−ε)) non-zero entries (for any ε > 0). In other words, it can handle almost as many non-zeros as the best sparse recovery algorithms could tolerate even if one knew the dictionary A exactly.
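The alternating minimization heuristics analyzed in the sparse coding and dictionary learning papers above share a decode/update template. Here is a minimal toy sketch of that template (hypothetical code assuming hard-thresholding decoding and a least-squares dictionary update, not the exact algorithms from the papers):

```python
import numpy as np

def alt_min_dictionary_learning(Y, k, s, n_iters=30, seed=0):
    """Toy alternating minimization for Y ~ A X with s-sparse columns of X.
    Decode: keep the s largest-magnitude entries of A^T y for each sample.
    Update: least-squares refit of A, then renormalize its columns."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    A = rng.standard_normal((n, k))
    A /= np.linalg.norm(A, axis=0)
    for _ in range(n_iters):
        X = A.T @ Y                                   # correlate samples with current dictionary
        cutoff = -np.sort(-np.abs(X), axis=0)[s - 1]  # per-column s-th largest magnitude
        X[np.abs(X) < cutoff] = 0.0                   # hard-threshold to s-sparse codes
        A = Y @ np.linalg.pinv(X)                     # least-squares dictionary update
        A /= np.linalg.norm(A, axis=0) + 1e-12        # keep columns unit-norm
    return A, X
```

The point of the analysis frameworks above is exactly to explain when loops like this one converge to the true dictionary rather than a spurious fixed point.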
A Tensor Approach to Learning Mixed Membership Community Models, with Anima Anandkumar, Daniel Hsu and Sham M. Kakade. In COLT 2013; also JMLR Vol. 15.
Modeling community formation and detecting hidden communities in networks is a well-studied problem. However, theoretical analysis of community detection has mostly been limited to models with non-overlapping communities, such as the stochastic block model. In this paper, we remove this restriction and consider a family of probabilistic network models with overlapping communities, termed the mixed membership Dirichlet model, first introduced by Airoldi et al. This model allows nodes to have fractional memberships in multiple communities and assumes that the community memberships are drawn from a Dirichlet distribution. We propose a unified approach to learning these models via a tensor spectral decomposition method. Our estimator is based on a low-order moment tensor of the observed network, consisting of 3-star counts. Our learning method is fast and is based on simple linear algebra operations, e.g., singular value decomposition and tensor power iterations. We provide guaranteed recovery of community memberships and model parameters, and present a careful finite-sample analysis of our learning method. Additionally, our results match the best known scaling requirements in the special case of the stochastic block model.

A Practical Algorithm for Topic Modeling with Provable Guarantees, with Sanjeev Arora, Yoni Halpern, David Mimno, Ankur Moitra, David Sontag, Yichen Wu and Michael Zhu. In ICML 2013.
Topic models provide a useful method for dimensionality reduction and exploratory data analysis in large text corpora. Most approaches to topic model inference have been based on a maximum likelihood objective. Efficient algorithms exist that approximate this objective, but they have no provable guarantees. Recently, algorithms have been introduced that provide provable bounds, but these algorithms are not practical because they are inefficient and not robust to violations of model assumptions. In this paper we present an algorithm for topic model inference that is both provable and practical. The algorithm produces results comparable to the best MCMC implementations while running orders of magnitude faster.

Provable ICA with Unknown Gaussian Noise, and Implications for Gaussian Mixtures and Autoencoders, with Sanjeev Arora, Ankur Moitra and Sushant Sachdeva. In NIPS 2012.
We present a new algorithm for Independent Component Analysis (ICA) that has provable performance guarantees. In particular, suppose we are given samples of the form y = Ax + η, where A is an unknown n×n matrix, x is a random variable whose components are independent and have a fourth moment strictly less than that of a standard Gaussian random variable, and η is an n-dimensional Gaussian random variable with unknown covariance Σ. We give an algorithm that provably recovers A and Σ up to an additive ε, with running time and sample complexity polynomial in n and 1/ε. To accomplish this, we introduce a novel "quasi-whitening" step that may be useful in other contexts in which the covariance of Gaussian noise is not known in advance. We also give a general framework for finding all local optima of a function (given an oracle for approximately finding just one); this is a crucial step in our algorithm, one that has been overlooked in previous attempts, and it allows us to control the accumulation of error when we find the columns of A one by one via local search.

Learning Topic Models -- Going Beyond SVD, with Sanjeev Arora and Ankur Moitra. In FOCS 2012.
In this work we revisit the topic modeling problem, which is used for automatic comprehension and classification of data in a variety of settings.
A number of foundational works both in machine learning and in theory have suggested a probabilistic model for documents, whereby documents arise as a convex combination of (i.e., distribution on) a small number of topic vectors, each topic vector being a distribution on words (i.e., a vector of word frequencies). Theoretical studies of topic modeling focus on learning the model's parameters assuming the data is actually generated from it. Existing approaches for the most part rely on Singular Value Decomposition (SVD), and consequently have one of two limitations: these works need to either assume that each document contains only one topic, or else can only recover the span of the topic vectors instead of the topic vectors themselves. This paper formally justifies Nonnegative Matrix Factorization (NMF) as a main tool in this context; NMF is an analog of SVD where all vectors are nonnegative. Using this tool we give the first polynomial-time algorithm for learning topic models without the above two limitations. The algorithm uses a fairly mild assumption about the underlying topic matrix called separability, which is usually found to hold in real-life data. We hope that this paper will motivate further theoretical results that use NMF as a replacement for SVD, just as NMF has come to replace SVD in many applications.

Computing a Nonnegative Matrix Factorization -- Provably, with Sanjeev Arora, Ravi Kannan and Ankur Moitra. In STOC 2012.
The nonnegative matrix factorization problem asks for factorizing an n×m matrix M into AW, where A is n×r, W is r×m, and the entries of both matrices are required to be nonnegative. This problem has an extremely broad range of applications, from machine learning to communication complexity. We give algorithms for exact and approximate versions of the problem when r is small (which is the case for most applications), and a hardness result showing that our results cannot be substantially improved.
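The separability assumption in the topic modeling and NMF results above has a simple geometric meaning: some rows of the data matrix are the vertices ("anchors") of the convex hull of all the rows. Here is a minimal toy sketch of a successive-projection style anchor finder (an illustration of the geometric idea, not the algorithm from these papers):

```python
import numpy as np

def find_anchor_rows(M, r):
    """Greedy successive projection: repeatedly pick the row of M with the
    largest residual norm, then project all rows off that direction.
    Under separability the picked rows are the anchor (vertex) rows."""
    R = M.astype(float).copy()
    anchors = []
    for _ in range(r):
        j = int(np.argmax(np.linalg.norm(R, axis=1)))
        anchors.append(j)
        u = R[j] / np.linalg.norm(R[j])
        R -= np.outer(R @ u, u)  # remove the component along the chosen row
    return anchors

# Toy separable instance: the first r rows of A form the identity, so the
# first r rows of M = A W are exactly the rows of W, i.e. the anchors.
rng = np.random.default_rng(0)
r, m = 3, 10
W = rng.random((r, m)) + 0.1
C = rng.random((6, r))
C /= C.sum(axis=1, keepdims=True)   # remaining rows: convex mixtures of anchors
M = np.vstack([np.eye(r), C]) @ W
anchors = find_anchor_rows(M, r)    # recovers rows 0, 1, 2 in some order
```

Because the norm of a convex combination is strictly smaller than the largest vertex norm (unless the vertices are parallel), each greedy step picks a new anchor; making this robust to noise is where the real algorithmic work lies.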
Finding Overlapping Communities in Social Networks: Toward a Rigorous Approach, with Sanjeev Arora, Sushant Sachdeva and Grant Schoenebeck. In EC 2012.
A community in a social network is usually understood to be a group of nodes more densely connected with each other than with the rest of the network. Many existing methods for finding them use heuristic algorithms where the solution concept is defined in terms of an NP-complete problem such as CLIQUE or HIERARCHICAL CLUSTERING. We introduce a more formal approach whereby we make rigorous definitions that sidestep the pitfall of high computational complexity and allow efficient algorithms for finding communities.
Non-convex Optimization
For non-convex optimization, we try to identify geometric properties of the objective function that allow the optimum to be found efficiently. We have been able to show that several interesting problems, including matrix completion and tensor decomposition, have a nice strict-saddle property (see this blog post). I also hope to find new optimization techniques that go beyond strict-saddle functions.
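A strict-saddle function is one where every saddle point has a direction of strictly negative curvature, so a small amount of noise is enough for gradient descent to escape. A minimal sketch (my own toy example, not code from any of the papers above) on f(x, y) = (x² − 1)² + y², which has a strict saddle at the origin and minima at (±1, 0):

```python
import numpy as np

def noisy_gradient_descent(grad, x0, lr=0.05, noise=1e-3, n_steps=2000, seed=0):
    """Gradient descent plus small isotropic noise; on strict-saddle functions
    the noise pushes the iterate off the saddle's unstable manifold."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x - lr * grad(x) + noise * rng.standard_normal(x.shape)
    return x

# f(x, y) = (x^2 - 1)^2 + y^2: strict saddle at (0, 0), minima at (+-1, 0).
def grad_f(v):
    x, y = v
    return np.array([4.0 * x * (x * x - 1.0), 2.0 * y])

# Starting at x = 0, noiseless gradient descent slides straight into the
# saddle and stays there; the added noise drives it to one of the two minima.
x_final = noisy_gradient_descent(grad_f, [0.0, 0.5])
```

The strict-saddle property is what guarantees there is always such an escape direction, so this simple perturbation argument applies at every saddle of the objective.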