Home Research Publications Teaching

Guarantees for Tuning the Step Size using a Learning-to-Learn Approacharxivabstract
with Xiang Wang, Shuai Yuan and Chenwei Wu.
Learning-to-learn (using optimization algorithms to learn a new optimizer) has successfully trained efficient optimizers in practice. This approach relies on meta-gradient descent on a meta-objective based on the trajectory that the optimizer generates. However, there were few theoretical guarantees on how to avoid meta-gradient explosion/vanishing problems, or how to train an optimizer with good generalization performance. In this paper, we study the learning-to-learn approach on a simple problem of tuning the step size for quadratic loss. Our results show that although there is a way to design the meta-objective so that the meta-gradient remain polynomially bounded, computing the meta-gradient directly using backpropagation leads to numerical issues that look similar to gradient explosion/vanishing problems. We also characterize when it is necessary to compute the meta-objective on a separate validation set instead of the original training set. Finally, we verify our results empirically and show that a similar phenomenon appears even for more complicated learned optimizers parametrized by neural networks.

Mildly Overparametrized Neural Nets can Memorize Training Data Efficiently arxivabstract
with Haoyu Zhao and Runzhe Wang
It has been observed  that deep neural networks can memorize: they achieve 100\% accuracy on training data. Recent theoretical results explained such behavior in highly overparametrized regimes, where the number of neurons in each layer is larger than the number of training samples. In this paper, we show that neural networks can be trained to memorize training data perfectly in a mildly overparametrized regime, where the number of parameters is just a constant factor more than the number of training samples, and the number of neurons is much smaller.

Extracting Latent State Representations with Linear Dynamics from Rich Observations arxivabstract
with Abraham Frandsen.

Recently, many reinforcement learning techniques were shown to have provable guarantees in the simple case of linear dynamics, especially in problems like linear quadratic regulators. However, in practice, many reinforcement learning problems try to learn a policy directly from rich, high dimensional representations such as images. Even if there is an underlying dynamics that is linear in the correct latent representations (such as position and velocity), the rich representation is likely to be nonlinear and can contain irrelevant features. In this work we study a model where there is a hidden linear subspace in which the dynamics is linear. For such a model we give an efficient algorithm for extracting the linear subspace with linear dynamics. We then extend our idea to extracting a nonlinear mapping, and empirically verify the effectiveness of our approach in simple settings with rich observations.

High-Dimensional Robust Mean Estimation via Gradient Descent arxivabstract
with Yu Cheng, Ilias Diakonikolas and Mahdi Soltanoltabi. In ICML 2020
We study the problem of high-dimensional robust mean estimation in the presence of a constant fraction of adversarial outliers. A recent line of work has provided sophisticated polynomial-time algorithms for this problem with dimension-independent error guarantees for a range of natural distribution families.
In this work, we show that a natural non-convex formulation of the problem can be solved directly by gradient descent. Our approach leverages a novel structural lemma, roughly showing that any approximate stationary point of our non-convex objective gives a near-optimal solution to the underlying robust estimation task. Our work establishes an intriguing connection between algorithmic high-dimensional robust statistics and non-convex optimization, which may have broader applications to other robust estimation tasks.

Customizing ML Predictions For Online Algorithms arxivabstract
with Keerti Anand and Debmalya Panigrahi. In ICML 2020
A popular line of recent research incorporates ML advice in the design of online algorithms to improve their performance in typical instances. These papers treat the ML algorithm as a blackbox, and redesign online algorithms to take advantage of ML predictions. In this paper, we ask the complementary question: can we redesign ML algorithms to provide better predictions for online algorithms? We explore this question in the
context of the classic rent-or-buy problem, and show that incorporating optimization benchmarks in ML loss functions leads to significantly better performance, while maintaining a worst-case adversarial result when the advice is completely wrong. We support this finding both through theoretical bounds and numerical simulations.

Optimization landscape of Tucker decomposition arxivabstract
with Abraham Frandsen. In Mathematical Programming 2020. 

Tucker decomposition is a popular technique for many data analysis and machine learning applications. Finding a Tucker decomposition is a nonconvex optimization problem. As the scale of the problems increases, local search algorithms such as stochastic gradient descent have become popular in practice. In this paper, we characterize the optimization landscape of the Tucker decomposition problem. In particular, we show that if the tensor has an exact Tucker decomposition, for a standard nonconvex objective of Tucker decomposition, all local minima are also globally optimal. We also give a local search algorithm that can find an approximate local (and global) optimal solution in polynomial time.

Estimating Normalizing Constants for Log-Concave Distributions: Algorithms and Lower Bounds arxivabstractvideo
with Holden Lee and Jianfeng Lu. In STOC 2020.
Estimating the normalizing constant of an unnormalized probability distribution has important applications in computer science, statistical physics, machine learning, and statistics. In this work, we consider the problem of estimating the normalizing constant to within a multiplication factor of 1+\eps for a -strongly convex and L-smooth function f, given query access to f(x) and ∇f(x). We give both algorithms and lowerbounds for this problem. Using an annealing algorithm combined with a multilevel Monte Carlo method based on underdamped Langevin dynamics, we show that O(d^{4/3}/\eps^2) queries to ∇f are sufficient. Moreover, we provide an information theoretic lowerbound, showing that at least d^{1−o(1)}/\eps^{2−o(1)} queries are necessary. This provides a first nontrivial lowerbound for the problem.

Explaining Landscape Connectivity of Low-cost Solutions for Multilayer Nets. arxivabstractblog
with Rohith Kuditipudi, Xiang Wang, Holden Lee, Yi Zhang, Zhiyuan Li, Wei Hu and Sanjeev Arora. In NeurIPS 2019 .
Mode connectivity is a surprising phenomenon in the loss landscape of deep nets. Optima---at least those discovered by gradient-based optimization---turn out to be connected by simple paths on which the loss function is almost constant. Often, these paths can be chosen to be piece-wise linear, with as few as two segments.
We give mathematical explanations for this phenomenon, assuming generic properties (such as dropout stability and noise stability) of well-trained deep nets, which have previously been identified as part of understanding the generalization properties of deep nets. Our explanation holds for realistic multilayer nets, and experiments are presented to verify the theory.

Faster Algorithms for High-Dimensional Robust Covariance Estimation. arxivabstract
with Yu Cheng, Ilias Diakonikolas and David P. Woodruff. In COLT 2019 .
We study the problem of estimating the covariance matrix of a high-dimensional distribution when a small constant fraction of the samples can be arbitrarily corrupted. Recent work gave the first polynomial time algorithms for this problem with near-optimal error guarantees for several natural structured distributions. Our main contribution is to develop faster algorithms for this problem whose running time nearly matches that of computing the empirical covariance.
Given N=Ω(d^2/ϵ^2) samples from a d-dimensional Gaussian distribution, an ϵ-fraction of which may be arbitrarily corrupted, our algorithm runs in time O(d^{3.26})/poly(ϵ) and approximates the unknown covariance matrix to optimal error up to a logarithmic factor. Previous robust algorithms with comparable error guarantees all have runtimes Ω(d^{2ω}) when ϵ=Ω(1), where ω is the exponent of matrix multiplication. We also provide evidence that improving the running time of our algorithm may require new algorithmic techniques.

Stabilized SVRG: Simple Variance Reduction for Nonconvex Optimization. arxivabstract
with Zhize Li, Weiyao Wang, Xiang Wang. In COLT 2019.
Variance reduction techniques like SVRG provide simple and fast algorithms for optimizing a convex finite-sum objective. For nonconvex objectives, these techniques can also find a first-order stationary point (with small gradient). However, in nonconvex optimization it is often crucial to find a second-order stationary point (with small gradient and almost PSD hessian). In this paper, we show that Stabilized SVRG (a simple variant of SVRG) can find an ϵ-second-order stationary point using only O(n^{2/3}/ϵ^2+n/ϵ^{1.5}) stochastic gradients. To our best knowledge, this is the first second-order guarantee for a simple variant of SVRG. The running time almost matches the known guarantees for finding ϵ-first-order stationary points.

The Step Decay Schedule: A Near Optimal, Geometrically Decaying Learning Rate Schedule for Least Squares arxivabstract
with Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli. To appear in NeurIPS 2019.

There is a stark disparity between the learning rate schedules used in the practice of large scale machine learning and what are considered admissible learning rate schedules prescribed in the theory of stochastic approximation. Recent results, such as in the 'super-convergence' methods which use oscillating learning rates, serve to emphasize this point even more.
One plausible explanation is that non-convex neural network training procedures are better suited to the use of fundamentally different learning rate  schedules, such as the ``cut the learning rate every constant number of epochs'' method (which more closely resembles an exponentially decaying learning rate schedule); note that this widely used schedule is in stark contrast to the polynomial decay schemes prescribed in the stochastic approximation literature, which are indeed shown to be (worst case) optimal for classes of convex optimization problems.

The main contribution of this work shows that the picture is far more nuanced, where we do not even need to move to non-convex optimization to show other learning rate schemes can be far more effective. In fact, even for the simple case of stochastic linear regression with a fixed time horizon, the rate achieved by any polynomial decay scheme is sub-optimal compared to the statistical minimax rate (by a factor of condition number); in contrast the ```''cut the learning rate every constant number of epochs'' provides an exponential improvement (depending only logarithmically on the condition number) compared to any polynomial decay scheme.  Finally, it is important to ask if our theoretical insights are somehow fundamentally tied to quadratic loss minimization (where we have circumvented minimax lower bounds for more general convex optimization problems)? Here, we conjecture that recent results which make the gradient norm small at a near optimal rate, for both convex and non-convex optimization, may also provide more insights into learning rate schedules used in practice.

Learning Two-layer Neural Networks with Symmetric Inputs. arxivabstract
with Rohith Kuditipudi, Xiang Wang and Zhize Li. In ICLR 2019.

We give a new algorithm for learning a two-layer neural network under a general class of input distributions. Assuming there is a ground-truth two-layer network y=Aσ(Wx)+ξ, where A,W are weight matrices, ξ represents noise, and the number of neurons in the hidden layer is no larger than the input or output, our algorithm is guaranteed to recover the parameters A,W of the ground-truth network. The only requirement on the input x is that it is symmetric, which still allows highly complicated and structured input. Our algorithm is based on the method-of-moments framework and extends several results in tensor decompositions. We use spectral algorithms to avoid the complicated non-convex optimization in learning neural networks. Experiments show that our algorithm can robustly learn the ground-truth neural network with a small number of samples for many symmetric input distributions.

Understanding Composition of Word Embeddings via Tensor Decomposition. arxivabstract
with Abraham Frandsen. In ICLR 2019.

Word embedding is a powerful tool in natural language processing. In this paper we consider the problem of word embedding composition --- given vector representations of two words, compute a vector for the entire phrase. We give a generative model that can capture specific syntactic relations between words. Under our model, we prove that the correlations between three words (measured by their PMI) form a tensor that has an approximate low rank Tucker decomposition. The result of the Tucker decomposition gives the word embeddings as well as a core tensor, which can be used to produce better compositions of the word embeddings. We also complement our theoretical results with experiments that verify our assumptions, and demonstrate the effectiveness of the new composition method.

High-Dimensional Robust Mean Estimation in Nearly-Linear Time. arxivabstract
with Yu Cheng and Ilias Diakonikolas. In SODA 2019.

We study the fundamental problem of high-dimensional mean estimation in a robust model where a constant fraction of the samples are adversarially corrupted. Recent work gave the first polynomial time algorithms for this problem with dimension-independent error guarantees for several families of structured distributions. In this work, we give the first nearly-linear time algorithms for high-dimensional robust mean estimation. Specifically, we focus on distributions with (i) known covariance and sub-gaussian tails, and (ii) unknown bounded covariance. Given N samples on R^d, an ε-fraction of which may be arbitrarily corrupted, our algorithms run in time O~(Nd)/poly(ε) and approximate the true mean within the information-theoretically optimal error, up to constant factors. Previous robust algorithms with comparable error guarantees have running times Ω~(Nd2), for ε=Ω(1). Our algorithms rely on a natural family of SDPs parameterized by our current guess ν for the unknown mean μ⋆. We give a win-win analysis establishing the following: either a near-optimal solution to the primal SDP yields a good candidate for μ⋆ -- independent of our current guess ν -- or the dual SDP yields a new guess ν′ whose distance from μ⋆ is smaller by a constant factor. We exploit the special structure of the corresponding SDPs to show that they are approximately solvable in nearly-linear time. Our approach is quite general, and we believe it can also be applied to obtain nearly-linear time algorithms for other high-dimensional robust learning problems.

Stronger generalization bounds for deep nets via a compression approach. arxivabstractblog
with Sanjeev Arora, Behnam Neyshabur and Yi Zhang. In ICML 2018.

Deep nets generalize well despite having more parameters than the number of training samples. Recent works try to give an explanation using PAC-Bayes and Margin-based analyses, but do not as yet result in sample complexity bounds better than naive parameter counting. The current paper shows generalization bounds that're orders of magnitude better in practice. These rely upon new succinct reparametrizations of the trained net --- a compression that is explicit and efficient. These yield generalization bounds via a simple compression-based framework introduced here. Our results also provide some theoretical justification for widespread empirical success in compressing deep nets. Analysis of correctness of our compression relies upon some newly identified \textquotedblleft noise stability\textquotedblright properties of trained deep nets, which are also experimentally verified. The study of these properties and resulting generalization bounds are also extended to convolutional nets, which had eluded earlier attempts on proving generalization.

Minimizing Nonconvex Population Risk from Rough Empirical Risk. arxivabstract
with Chi Jin, Lydia T. Liu, Michael I. Jordan. In NeurIPS 2018.

Population risk---the expectation of the loss over the sampling mechanism---is always of primary interest in machine learning. However, learning algorithms only have access to empirical risk, which is the average loss over training examples. Although the two risks are typically guaranteed to be pointwise close, for applications with nonconvex nonsmooth losses (such as modern deep networks), the effects of sampling can transform a well-behaved population risk into an empirical risk with a landscape that is problematic for optimization. The empirical risk can be nonsmooth, and it may have many additional local minima. This paper considers a general optimization framework which aims to find approximate local minima of a smooth nonconvex function F (population risk) given only access to the function value of another function f (empirical risk), which is pointwise close to F. We propose a simple algorithm based on stochastic gradient descent (SGD) on a smoothed version of f which is guaranteed to find a second-order stationary point if f and F are close enough, thus escaping all saddle points of F and all the additional local minima introduced by f. We also provide an almost matching lower bound showing that our SGD-based approach achieves the optimal trade-off between the relevant parameters, as well as the optimal dependence on problem dimension d, among all algorithms making a polynomial number of queries. As a concrete example, we show that our results can be directly used to give sample complexities for learning a ReLU unit, whose empirical risk is nonsmooth.

Non-Convex Matrix Completion Against a Semi-Random Adversary. arxivabstract
with Yu Cheng. In COLT 2018.

Matrix completion is a well-studied problem with many machine learning applications. In practice, the problem is often solved by non-convex optimization algorithms. However, the current theoretical analysis for non-convex algorithms relies heavily on the assumption that every entry is observed with exactly the same probability p, which is not realistic in practice.
In this paper, we investigate a more realistic semi-random model, where the probability of observing each entry is at least p. Even with this mild semi-random perturbation, we can construct counter-examples where existing non-convex algorithms get stuck in bad local optima.
In light of the negative results, we propose a pre-processing step that tries to re-weight the semi-random input, so that it becomes "similar" to a random input. We give a nearly-linear time algorithm for this problem, and show that after our pre-processing, all the local minima of the non-convex objective can be used to approximately recover the underlying ground-truth matrix.

Beyond Log-concavity: Provable Guarantees for Sampling Multi-modal Distributions using Simulated Tempering Langevin Monte Carlo arxivabstract
with Holden Lee and Andrej Risteski. NIPS 2017 Bayesian Inference Workshop. In NeurIPS 2018.

A key task in Bayesian statistics is sampling from distributions that are only specified up to a partition function (i.e., constant of proportionality). However, without any assumptions, sampling (even approximately) can be #P-hard, and few works have provided "beyond worst-case" guarantees for such settings. For log-concave distributions, classical results going back to Bakry and \'Emery (1985) show that natural continuous-time Markov chains called Langevin diffusions mix in polynomial time. The most salient feature of log-concavity violated in practice is uni-modality: commonly, the distributions we wish to sample from are multi-modal. In the presence of multiple deep and well-separated modes, Langevin diffusion suffers from torpid mixing. We address this problem by combining Langevin diffusion with simulated tempering. The result is a Markov chain that mixes more rapidly by transitioning between different temperatures of the distribution. We analyze this Markov chain for the canonical multi-modal distribution: a mixture of gaussians (of equal variance). The algorithm based on our Markov chain provably samples from distributions that are close to mixtures of gaussians, given access to the gradient of the log-pdf. For the analysis, we use a spectral decomposition theorem for graphs (Gharan and Trevisan, 2014) and a Markov chain decomposition technique (Madras and Randall, 2002).

Global Convergence of Policy Gradient Methods for Linearized Control Problems arxivabstract
with Maryam Fazel, Sham M. Kakade and Mehran Mesbahi. In ICML 2018.

Direct policy gradient methods for reinforcement learning and continuous control problems are a popular approach for a variety of reasons: 1) they are easy to implement without explicit knowledge of the underlying model 2) they are an "end-to-end" approach, directly optimizing the performance metric of interest 3) they inherently allow for richly parameterized policies. A notable drawback is that even in the most basic continuous control problem (that of linear quadratic regulators), these methods must solve a non-convex optimization problem, where little is understood about their efficiency from both computational and statistical perspectives. In contrast, system identification and model based planning in optimal control theory have a much more solid theoretical footing, where much is known with regards to their computational and statistical properties. This work bridges this gap showing that (model free) policy gradient methods globally converge to the optimal solution and are efficient (polynomially so in relevant problem dependent quantities) with regards to their sample and computational complexities.

Learning One-hidden-layer Neural Networks with Landscape Design. arxivabstract
with Tengyu Ma and Jason D. Lee. To appear in ICLR 2018.

We consider the problem of learning a one-hidden-layer neural network: we assume the input x is from Gaussian distribution and the label y = a*\sigma(Bx), where a is a nonnegative vector in m dimensions, B is a full-rank weight matrix. We first give an analytic formula for the population risk of the standard squared loss and demonstrate that it implicitly attempts to decompose a sequence of low-rank tensors simultaneously. Inspired by the formula, we design a non-convex objective function G whose landscape is guaranteed to have the following properties:
1. All local minima of G are also global minima.
2. All global minima of G correspond to the ground truth parameters.
3. The value and gradient of G can be estimated using samples.
With these properties, stochastic gradient descent on G provably converges to the global minimum and learn the ground-truth parameters. We also prove finite sample complexity result and validate the results by simulations.

Generalization and Equilibrium in Generative Adversarial Nets (GANs) arxivabstractblog
with Sanjeev Arora, Yingyu Liang, Tengyu Ma and Yi Zhang. In ICML 2017.

This paper makes progress on several open theoretical issues related to Generative Adversarial Networks. A definition is provided for what it means for the training to generalize, and it is shown that generalization is not guaranteed for the popular distances between distributions such as Jensen-Shannon or Wasserstein. We introduce a new metric called neural net distance for which generalization does occur. We also show that an approximate pure equilibrium in the 2-player game exists for a natural training objective (Wasserstein). Showing such a result has been an open problem (for any training objective). Finally, the above theoretical ideas lead us to propose a new training protocol, MIX+GAN, which can be combined with any existing method. We present experiments showing that it stabilizes and improves some existing methods.

On the ability of neural nets to express distributions arxivabstract
with Holden Lee, Andrej Risteski, Tengyu Ma and Sanjeev Arora. In COLT 2017.

Deep neural nets have caused a revolution in many classification tasks. A related ongoing revolution---also theoretically not understood---concerns their ability to serve as generative models for complicated types of data such as images and texts. These models are trained using ideas like variational autoencoders and Generative Adversarial Networks.

We take a first cut at explaining the expressivity of multilayer nets by giving a sufficient criterion for a function to be approximable by a neural network with n hidden layers. A key ingredient is Barron's Theorem [Barron1993], which gives a Fourier criterion for approximability of a function by a neural network with 1 hidden layer. We show that a composition of n functions which satisfy certain Fourier conditions ("Barron functions") can be approximated by a n+1-layer neural network. For probability distributions, this translates into a criterion for a probability distribution to be approximable in Wasserstein distance---a natural metric on probability distributions---by a neural network applied to a fixed base distribution (e.g., multivariate gaussian).

Building up recent lower bound work, we also give an example function that shows that composition of Barron functions is more expressive than Barron functions alone.

No Spurious Local Minima in Nonconvex Low Rank Problems: A Unified Geometric Analysis arxivabstract
with Chi Jin, Yi Zheng. In ICML 2017.

In this paper we develop a new framework that captures the common landscape underlying the common non-convex low-rank matrix problems including matrix sensing, matrix completion and robust PCA. In particular, we show for all above problems (including asymmetric cases): 1) all local minima are also globally optimal; 2) no high-order saddle points exists. These results explain why simple algorithms such as stochastic gradient descent have global converge, and efficiently optimize these non-convex objective functions in practice. Our framework connects and simplifies the existing analyses on optimization landscapes for matrix sensing and symmetric matrix completion. The framework naturally leads to new results for asymmetric matrix completion and robust PCA.

How to Escape Saddle Points Efficiently arxivabstract
with Chi Jin, Praneeth Netrapalli, Sham M. Kakade, Michael I. Jordan. In ICML 2017.

This paper shows that a perturbed form of gradient descent converges to a second-order stationary point in a number iterations which depends only poly-logarithmically on dimension (i.e., it is almost "dimension-free"). The convergence rate of this procedure matches the well-known convergence rate of gradient descent to first-order stationary points, up to log factors. When all saddle points are non-degenerate, all second-order stationary points are local minima, and our result thus shows that perturbed gradient descent can escape saddle points almost for free. Our results can be directly applied to many machine learning applications, including deep learning. As a particular concrete example of such an application, we show that our results can be used directly to establish sharp global convergence rates for matrix factorization. Our results rely on a novel characterization of the geometry around saddle points, which may be of independent interest to the non-convex optimization community.

Online Service with Delay abstract
with Yossi Azar, Arun Ganesh and Debmalya Panigrahi. In STOC 2017.

In this paper, we introduce the online service with delay problem. In this problem, there are $n$ points in a metric space that issue service requests over time, and a server that serves these requests. The goal is to minimize the sum of distance traveled by the server and the total delay in serving the requests. This problem models the fundamental tradeoff between batching requests to improve locality and reducing delay to improve response time, that has many applications in operations management, operating systems, logistics, supply chain management, and scheduling.

Our main result is to show a poly-logarithmic competitive ratio for the online service with delay problem. This result is obtained by an algorithm that we call the preemptive service algorithm. The salient feature of this algorithm is a process called preemptive service, which uses a novel combination of (recursive) time forwarding and spatial exploration on a metric space. We hope this technique will be useful for related problems such as reordering buffer management, online TSP, vehicle routing, etc. We also generalize our results to k >1 servers.

On the Optimization Landscape of Tensor decompositions arxivabstract
with Tengyu Ma. In NIPS 2017. Best Theoretical Work Award in NIPS 2016 Non-convex Workshop.

Non-convex optimization with local search heuristics has been widely used in machine learning, achieving many state-of-art results. It becomes increasingly important to understand why they can work for these NP-hard problems on typical data. The landscape of many objective functions in learning have been conjectured to have the geometric property that “all local optima are (approximately) global optima”, and thus they can be solved efficiently by local search algorithms. However, establishing such property can be very difficult.
In this paper we analyze the optimization landscape of the random over-complete tensor decomposition problem, which has many applications in unsupervised leaning, especially in learning latent variable models. In practice, it can be efficiently solved by gradient ascent on a non-convex objective. We show that for any small constant c > 0, among the set of points with function values (1 + c)-factor larger than the expectation of the function, all the local maxima are approximate global maxima. Previously, the best known result only characterizes the geometry in small neighborhoods around the true components. Our result implies that even with an initialization that is barely better than the random guess, the gradient ascent algorithm is guaranteed to solve this problem

Provable Learning of Noisy-or Networks arxivabstract
with Sanjeev Arora, Tengyu Ma and Andrej Risteski. In STOC 2017.

Many machine learning applications use latent variable models to explain structure in data, whereby visible variables (= coordinates of the given datapoint) are explained as a probabilistic function of some hidden variables. Finding parameters with the maximum likelihood is NP-hard even in very simple settings. In recent years, provably efficient algorithms were nevertheless developed for models with linear structures: topic models, mixture models, hidden markov models, etc. These algorithms use matrix or tensor decomposition, and make some reasonable assumptions about the parameters of the underlying model.
But matrix or tensor decomposition seems of little use when the latent variable model has nonlinearities. The current paper shows how to make progress: tensor decomposition is applied for learning the single-layer noisy or network, which is a textbook example of a Bayes net, and used for example in the classic QMR-DT software for diagnosing which disease(s) a patient may have by observing the symptoms he/she exhibits.
The technical novelty here, which should be useful in other settings in future, is analysis of tensor decomposition in presence of systematic error (i.e., where the noise/error is correlated with the signal, and doesn't decrease as number of samples goes to infinity). This requires rethinking all steps of tensor decomposition methods from the ground up.
For simplicity our analysis is stated assuming that the network parameters were chosen from a probability distribution but the method seems more generally applicable.

Homotopy Analysis for Tensor PCA arxivabstract
with Anima Anandkumar, Yuan Deng and Hossein Mobahi. In COLT 2017.

Developing efficient and guaranteed nonconvex algorithms has been an important challenge in modern machine learning. Algorithms with good empirical performance such as stochastic gradient descent often lack theoretical guarantees. In this paper, we analyze the class of homotopy or continuation methods for global optimization of nonconvex functions. These methods start from an objective function that is efficient to optimize (e.g. convex), and progressively modify it to obtain the required objective, and the solutions are passed along the homotopy path. For the challenging problem of tensor PCA, we prove global convergence of the homotopy method in the "high noise" regime. The signal-to-noise requirement for our algorithm is tight in the sense that it matches the recovery guarantee for the best degree-4 sum-of-squares algorithm. In addition, we prove a phase transition along the homotopy path for tensor PCA. This allows to simplify the homotopy method to a local search algorithm, viz., tensor power iterations, with a specific initialization and a noise injection procedure, while retaining the theoretical guarantees.

Matrix Completion has No Spurious Local Minimum arxivabstract
with Jason D. Lee and Tengyu Ma. In NIPS 2016. Best Student Paper Award.

Matrix completion is a basic machine learning problem that has wide applications, especially in collaborative filtering and recommender systems. Simple non-convex optimization algorithms are popular and effective in practice. Despite recent progress in proving various non-convex algorithms converge from a good initial point, it remains unclear why random or arbitrary initialization suffices in practice. We prove that the commonly used non-convex objective function for matrix completion has no spurious local minima -- all local minima must also be global. Therefore, many popular optimization algorithms such as (stochastic) gradient descent can provably solve matrix completion with arbitrary initialization in polynomial time.

Provable Algorithms for Inference in Topic Models arxivabstract
with Sanjeev Arora, Frederic Koehler, Tengyu Ma and Ankur Moitra. In ICML 2016.

Recently, there has been considerable progress on designing algorithms with provable guarantees -- typically using linear algebraic methods -- for parameter learning in latent variable models. But designing provable algorithms for inference has proven to be more challenging. Here we take a first step towards provable inference in topic models. We leverage a property of topic models that enables us to construct simple linear estimators for the unknown topic proportions that have small variance, and consequently can work with short documents. Our estimators also correspond to finding an estimate around which the posterior is well-concentrated. We show lower bounds that for shorter documents it can be information theoretically impossible to find the hidden topics. Finally, we give empirical results that demonstrate that our algorithm works on realistic topic models. It yields good solutions on synthetic data and runs in time comparable to a single iteration of Gibbs sampling.

Efficient approaches for escaping higher order saddle points in non-convex optimization arxiv abstract
with Anima Anandkumar. In COLT 2016.

Local search heuristics for non-convex optimizations are popular in applied machine learning. However, in general it is hard to guarantee that such algorithms even converge to a local minimum, due to the existence of complicated saddle point structures in high dimensions. Many functions have degenerate saddle points such that the first and second order derivatives cannot distinguish them with local optima. In this paper we use higher order derivatives to escape these saddle points: we design the first efficient algorithm guaranteed to converge to a third order local optimum (while existing techniques are at most second order). We also show that it is NP-hard to extend this further to finding fourth order local optima.

Efficient Algorithms for Large-scale Generalized Eigenvector Computation and Canonical Correlation Analysis arxivabstract
with Chi Jin, Sham M. Kakade, Praneeth Netrapalli, Aaron Sidford. In ICML 2016.

This paper considers the problem of canonical-correlation analysis (CCA) (Hotelling, 1936) and, more broadly, the generalized eigenvector problem for a pair of symmetric matrices. These are two fundamental problems in data analysis and scientific computing with numerous applications in machine learning and statistics (Shi and Malik, 2000; Hardoon et al., 2004; Witten et al., 2009).

We provide simple iterative algorithms, with improved runtimes, for solving these problems that are globally linearly convergent with moderate dependencies on the condition numbers and eigenvalue gaps of the matrices involved.

We obtain our results by reducing CCA to the top-k generalized eigenvector problem. We solve this problem through a general framework that simply requires black box access to an approximate linear system solver. Instantiating this framework with accelerated gradient descent we obtain a near linear running time.

Our algorithm is linear in the input size and the number of components k up to a log(k) factor. This is essential for handling large-scale matrices that appear in practice. To the best of our knowledge this is the first such algorithm with global linear convergence. We hope that our results prompt further research and ultimately improve the practical running time for performing these important data analysis procedures on large data sets.

Rich Component Analysis arxiv abstractwith James Zou. In ICML 2016.

In many settings, we have multiple data sets (also called views) that capture different and overlapping aspects of the same phenomenon. We are often interested in finding patterns that are unique to one or to a subset of the views. For example, we might have one set of molecular observations and one set of physiological observations on the same group of individuals, and we want to quantify molecular patterns that are uncorrelated with physiology. Despite being a common problem, this is highly challenging when the correlations come from complex distributions. In this paper, we develop the general framework of Rich Component Analysis (RCA) to model settings where the observations from different views are driven by different sets of latent components, and each component can be a complex, high-dimensional distribution. We introduce algorithms based on cumulant extraction that provably learn each of the components without having to model the other components. We show how to integrate RCA with stochastic gradient descent into a meta-algorithm for learning general models, and demonstrate substantial improvement in accuracy on several synthetic and real datasets in both supervised and unsupervised tasks. Our method makes it possible to learn latent variable models when we don't have samples from the true model but only samples after complex perturbations.

Decomposing Overcomplete 3rd Order Tensors using Sum-of-Squares Algorithms arxivabstract
with Tengyu Ma. In RANDOM 2015.

Tensor rank and low-rank tensor decompositions have many applications in learning and complexity theory. Most known algorithms use unfoldings of tensors and can only handle rank up to n^{p/2} for a p-th order tensor. Previously no efficient algorithm can decompose 3rd order tensors when the rank is super-linear in the dimension. Using ideas from sum-of-squares hierarchy, we give the first quasi-polynomial time algorithm that can decompose a random 3rd order tensor decomposition when the rank is as large as n^{3/2}/polylog n.

We also give a polynomial time algorithm for certifying the injective norm of random low rank tensors. Our tensor decomposition algorithm exploits the relationship between injective norm and the tensor components. The proof relies on interesting tools for decoupling random variables to prove better matrix concentration bounds, which can be useful in other settings.

Intersecting Faces: Non-negative Matrix Factorization With New Guarantees arxivabstractwith James Zou. In ICML 2015.

Non-negative matrix factorization (NMF) is a natural model of admixture and is widely used in science and engineering. A plethora of algorithms have been developed to tackle NMF, but due to the non-convex nature of the problem, there is little guarantee on how well these methods work. Recently a surge of research have focused on a very restricted class of NMFs, called separable NMF, where provably correct algorithms have been developed. In this paper, we propose the notion of subset-separable NMF, which substantially generalizes the property of separability. We show that subset-separability is a natural necessary condition for the factorization to be unique or to have minimum volume. We developed the Face-Intersect algorithm which provably and efficiently solves subset-separable NMF under natural conditions, and we prove that our algorithm is robust to small noise. We explored the performance of Face-Intersect on simulations and discuss settings where it empirically outperformed the state-of-art methods. Our work is a step towards finding provably correct algorithms that solve large classes of NMF problems.

Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization arxiv abstract
with Roy Frostig, Sham M. Kakade and Aaron Sidford. In ICML 2015.

We develop a family of accelerated stochastic algorithms that minimize sums of convex functions. Our algorithms improve upon the fastest running time for empirical risk minimization (ERM), and in particular linear least-squares regression, across a wide range of problem settings. To achieve this, we establish a framework based on the classical proximal point algorithm. Namely, we provide several algorithms that reduce the minimization of a strongly convex function to approximate minimizations of regularizations of the function. Using these results, we accelerate recent fast stochastic algorithms in a black-box fashion. Empirically, we demonstrate that the resulting algorithms exhibit notions of stability that are advantageous in practice. Both in theory and in practice, the provided algorithms reap the computational benefits of adding a large strongly convex regularization term, without incurring a corresponding bias to the original problem.

Escaping From Saddle Points --- Online Stochastic Gradient for Tensor Decomposition arxiv abstractblog
with Furong Huang, Chi Jin and Yang Yuan. In COLT 2015.

We analyze stochastic gradient descent for optimizing non-convex functions. In many cases for non-convex functions the goal is to find a reasonable local minimum, and the main concern is that gradient updates are trapped in saddle points. In this paper we identify strict saddle property for non-convex problem that allows for efficient optimization. Using this property we show that stochastic gradient descent converges to a local minimum in a polynomial number of iterations. To the best of our knowledge this is the first work that gives global convergence guarantees for stochastic gradient descent on non-convex functions with exponentially many local minima and saddle points.

Our analysis can be applied to orthogonal tensor decomposition, which is widely used in learning a rich class of latent variable models. We propose a new optimization formulation for the tensor decomposition problem that has strict saddle property. As a result we get the first online algorithm for orthogonal tensor decomposition with global convergence guarantee.

Learning Mixtures of Gaussians in High Dimensions arxiv abstract
with Qingqing Huang and Sham M. Kakade. In STOC 2015.

Efficiently learning mixture of Gaussians is a fundamental problem in statistics and learning theory. Given samples coming from a random one out of k Gaussian distributions in R^n, the learning problem asks to estimate the means and the covariance matrices of these Gaussians. This learning problem arises in many areas ranging from the natural sciences to the social sciences, and has also found many machine learning applications.

Unfortunately, learning mixture of Gaussians is an information theoretically hard problem: in order to learn the parameters up to a reasonable accuracy, the number of samples required is exponential in the number of Gaussian components in the worst case. In this work, we show that provided we are in high enough dimensions, the class of Gaussian mixtures is learnable in its most general form under a smoothed analysis framework, where the parameters are randomly perturbed from an adversarial starting point. In particular, given samples from a mixture of Gaussians with randomly perturbed parameters, when n > {\Omega}(k^2), we give an algorithm that learns the parameters with polynomial running time and using polynomial number of samples.

The central algorithmic ideas consist of new ways to decompose the moment tensor of the Gaussian mixture by exploiting its structural properties. The symmetries of this tensor are derived from the combinatorial structure of higher order moments of Gaussian distributions (sometimes referred to as Isserlis' theorem or Wick's theorem). We also develop new tools for bounding smallest singular values of structured random matrices, which could be useful in other smoothed analysis settings.

Simple, Efficient, and Neural Algorithms for Sparse Coding arxiv abstract
with Sanjeev Arora, Tengyu Ma and Ankur Moitra. In COLT 2015.
Sparse coding is a basic task in many fields including signal processing, neuroscience and machine learning where the goal is to learn a basis that enables a sparse representation of a given set of data, if one exists. Its standard formulation is as a non-convex optimization problem which is solved in practice by heuristics based on alternating minimization. Re- cent work has resulted in several algorithms for sparse coding with provable guarantees, but somewhat surprisingly these are outperformed by the simple alternating minimization heuristics. Here we give a general framework for understanding alternating minimization which we leverage to analyze existing heuristics and to design new ones also with provable guarantees. Some of these algorithms seem implementable on simple neural architectures, which was the original motivation of Olshausen and Field (1997a) in introducing sparse coding. We also give the first efficient algorithm for sparse coding that works almost up to the information theoretic limit for sparse recovery on incoherent dictionaries. All previous algorithms that approached or surpassed this limit run in time exponential in some natural parameter. Finally, our algorithms improve upon the sample complexity of existing approaches. We believe that our analysis framework will have applications in other settings where simple iterative algorithms are used.

Competing with the Empirical Risk Minimizer in a Single Pass arxiv abstract
with Roy Frostig, Sham M. Kakade and Aaron Sidford. In COLT 2015.

Many optimization problems that arise in science and engineering are those in which we only have a stochastic approximation to the underlying objective (e.g. estimation problems such as linear regression). That is, given some distribution D over functions f(x), we wish to minimize P(x)=E[f(x)], using as few samples from D as possible. In the absence of computational constraints, the empirical risk minimizer (ERM) -- the minimizer on a sample average of observed data -- is widely regarded as the estimation strategy of choice due to its desirable statistical convergence properties. Our goal is to do as well as the empirical risk minimizer on every problem while minimizing the use of computational resources such as running time and space usage.

This work provides a simple streaming algorithm for solving this problem, with performance guarantees competitive with that of the ERM. In particular, under standard regularity assumptions on D, our algorithm enjoys the following properties:

  • The algorithm can be implemented in linear time with a single pass of the observed data, using space linear in the size of a single sample.
  • The algorithm achieves the same statistical rate of convergence as the empirical risk minimizer on every problem D (even considering constant factors).
  • The algorithm's performance depends on the initial error at a rate that decreases super-polynomially.

Moreover, we quantify the rate of convergence of both our algorithm and that of the ERM, showing that, after a number of samples that depend only on the regularity parameters, the streaming algorithm rapidly approaches the leading order rate of ERM in expectation.

Understanding Overcomplete 3rd Order Tensors abstract
with Anima Anandkumar and Majid Janzamin. Short version in COLT 2015.

This is a series of ongoing works where we try to understand how to decompose overcomplete third-order tensors. In general, decomposing third order tensors can be useful in learning many latent variable models (multi-view model, mixture of Gaussians, etc.) Previous algorithms usually can either only handle the undercomplete regime where the number of components is smaller than the number of of dimensions, or require access of higher order tensors which often results in higher sample complexity and running time. Surprisingly, many heuristics (such as Tensor Power Method or Alternating Least Squares) work very well in practice. In these works we try to understand when is decomposing third-order tensors tractable. Several parts are available on arxiv.

The first part [Arxiv] tries to analyze Tensor Power Method. In this paper we show if there is a good initialization (that is constant-close to some component), Tensor Power Method will converge to vectors that are very close to the true component. We also give an algorithm that converges to the exact true component given the (already reasonably close) results of the tensor power method. Good initializations can be found if there are some supervised information, or by spectral method if ther number of component k is only a constant factor larger than the dimension d.

The second part [Arxiv] tries to understand how many samples do we need in order to estimate the moment tensor accurately enough for several models. We show surprisingly in many natural cases the sample complexity is nearly linear in the number of components. In general the bounds in this paper is tighter than many previous works.

The third part (more independent from the first two) [Arxiv] tries to understand how well does Tensor Power Method perform without really good initialization. We use techniques from Approximate Message Passing to analyze the dynamics of the tensor power method, and show that the initialization does not need to be very close to the true component for tensor power method to perform well. This partly explains why tensor power method still works in the mildly overcomplete settings.

See also Majid Janzamin's page about this project for more information.

Provable Bounds for Learning Some Deep Representations arxiv abstract
with Sanjeev Arora, Aditya Bhaskara and Tengyu Ma. In ICML 2014.
We give algorithms with provable guarantees that learn a class of deep nets in the generative model view popularized by Hinton and others. Our generative model is an n node multilayer neural net that has degree at most n^{\gamma} for some \gamma < 1 and each edge has a random edge weight in [-1,1]. Our algorithm learns almost all networks in this class with polynomial running time. The sample complexity is quadratic or cubic depending upon the details of the model. The algorithm uses layerwise learning. It is based upon a novel idea of observing correlations among features and using these to infer the underlying edge structure via a global graph recovery procedure. The analysis of the algorithm reveals interesting structure of neural networks with random edge weights.

New Algorithms for Learning Incoherent and Overcomplete Dictionaries arxivabstract
with Sanjeev Arora and Ankur Moitra. In COLT 2014.

A m*n matrix A is said to be mu-incoherent if each pair of columns has inner product at most mu/\sqrt{n}. Starting with the pioneering work of Donoho and Huo such matrices (often called dictionaries) have played a central role in signal processing, statistics and machine learning. They allow sparse recovery: there are efficient algorithms for representing a given vector as a sparse linear combination of the columns of A (if such a combination exists). However, in many applications ranging from sparse coding in machine learning to image denoising, the dictionary is unknown and has to be learned from random examples of the form Y=AX where X is drawn from an appropriate distribution --- this is the dictionary learning problem. Existing proposed solutions such as the Method of Optimal Directions (MOD) or K-SVD do not provide any guarantees on their performance nor do they necessarily learn a dictionary for which one can solve sparse recovery problems. The only exception is the recent work of Spielman, Wang and Wright which gives a polynomial time algorithm for dictionary learning when A has full column rank (in particular m is at most n). However, in most settings of interest, dictionaries need to be {\em overcomplete} (i.e., m is larger than n).

Here we give the first polynomial time algorithm for dictionary learning when A is overcomplete. It succeeds under natural conditions on how X is generated, provided that X has at most

k=Cmin(\sqrt{n}/mu logn,m^{1/2-eps})

non-zero entries (for any eps>0). In other words it can handle almost as many non-zeros as the best sparse recovery algorithms could tolerate even if one knew the dictionary A exactly.

Towards a better approximation for sparsest cut? arxivabstract
with Sanjeev Arora and Ali Kemal Sinop. In FOCS 2013
We give a new (1+eps)-approximation for sparsest cut problem on graphs where small sets expand significantly more than the sparsest cut (sets of size n/r expand by a factor \sqrt{\log n\log r} bigger, for some small r; this condition holds for many natural graph families). We give two different algorithms. One involves Guruswami-Sinop rounding on the level-r Lasserre relaxation. The other is combinatorial and involves a new notion called Small Set Expander Flows (inspired by the expander flows of ARV) which we show exists in the input graph. Both algorithms run in time 2^{O(r)} poly(n). We also show similar approximation algorithms in graphs with genus g with an analogous local expansion condition. This is the first algorithm we know of that achieves (1+eps)-approximation on such general family of graphs.

A Tensor Approach to Learning Mixed Membership Community Models arxivabstract
with Anima Anandkumar, Daniel Hsu and Sham M. Kakade. In COLT 2013. Also JMLR Vol 15.
Modeling community formation and detecting hidden communities in networks is a well studied problem. However, theoretical analysis of community detection has been mostly limited to models with non-overlapping communities such as the stochastic block model. In this paper, we remove this restriction, and consider a family of probabilistic network models with overlapping communities, termed as the mixed membership Dirichlet model, first introduced in Airoldi et al. This model allows for nodes to have fractional memberships in multiple communities and assumes that the community memberships are drawn from a Dirichlet distribution. We propose a unified approach to learning these models via a tensor spectral decomposition method. Our estimator is based on low-order moment tensor of the observed network, consisting of 3-star counts. Our learning method is fast and is based on simple linear algebra operations, e.g. singular value decomposition and tensor power iterations. We provide guaranteed recovery of community memberships and model parameters and present a careful finite sample analysis of our learning method. Additionally, our results match the best known scaling requirements in the special case of the stochastic block model.

A Practical Algorithm for Topic Modeling with Provable Guarantees arxivabstractcode
with Sanjeev Arora, Yoni Halpern, David Mimno, Ankur Moitra, David Sontag, Yichen Wu, Michael Zhu, in ICML 2013
Topic models provide a useful method for dimensionality reduction and exploratory data analysis in large text corpora. Most approaches to topic model inference have been based on a maximum likelihood objective. Efficient algorithms exist that approximate this objective, but they have no provable guarantees. Recently, algorithms have been introduced that provide provable bounds, but these algorithms are not practical because they are inefficient and not robust to violations of model assumptions. In this paper we present an algorithm for topic model inference that is both provable and practical. The algorithm produces results comparable to the best MCMC implementations while running orders of magnitude faster.

Tensor decompositions for learning latent variable models arxiv abstractblog
with Anima Anandkumar, Daniel Hsu, Sham M. Kakade, Matus Telgarsky. In JMLR Vol 15.
This work considers a computationally and statistically efficient parameter estimation method
for a wide class of latent variable models|including Gaussian mixture models, hidden Markov
models, and latent Dirichlet allocation|which exploits a certain tensor structure in their loworder
observable moments (typically, of second- and third-order). Specifically, parameter estimation
is reduced to the problem of extracting a certain (orthogonal) decomposition of a symmetric
tensor derived from the moments; this decomposition can be viewed as a natural generalization
of the singular value decomposition for matrices. Although tensor decompositions are generally
intractable to compute, the decomposition of these specially structured tensors can be efficiently
obtained by a variety of approaches, including power iterations and maximization approaches
(similar to the case of matrices). A detailed analysis of a robust tensor power method is provided,
establishing an analogue of Wedin's perturbation theorem for the singular vectors of
matrices. This implies a robust and computationally tractable estimation approach for several
popular latent variable models.

Provable ICA with Unknown Gaussian Noise, and Implications for Gaussian Mixtures and Autoencoders arxiv abstract
with Sanjeev Arora, Ankur Moitra and Sushant Sachdeva. in NIPS 2012
We present a new algorithm for Independent Component Analysis (ICA) which has provable performance guarantees. In particular, suppose we are given samples of the form y = Ax + \eta where A is an unknown n*n matrix and x is a random variable whose components are independent and have a fourth moment strictly less than that of a standard Gaussian random variable, \eta is an n-dimensional Gaussian random variable with unknown covariance \Sigma: We give an algorithm that provable recovers A and \Sigma up to an additive \epsilon whose running time and sample complexity are polynomial in n and 1 / \epsilon. To accomplish this, we introduce a novel "quasi-whitening" step that may be useful in other contexts in which the covariance of Gaussian noise is not known in advance. We also give a general framework for finding all local optima of a function (given an oracle for approximately finding just one) and this is a crucial step in our algorithm, one that has been overlooked in previous attempts, and allows us to control the accumulation of error when we find the columns of A one by one via local search.

Learning Topic Models - Going beyond SVD arxiv abstract
with Sanjeev Arora and Ankur Moitra. In FOCS 2012.
In this work we revisit the topic modeling problem, which is used for automatic comprehension and classification of data in a variety of settings. A number of foundational works both in machine learning and in theory have suggested a probabilistic model for documents, whereby documents arise as a convex combination of (i.e. distribution on) a small number of topic vectors, each topic vector being a distribution on words (i.e. a vector of word-frequencies).

Theoretical studies of topic modeling focus on learning the model's parameters assuming the data is actually generated from it. Existing approaches for the most part rely on Singular Value Decomposition(SVD), and consequently have one of two limitations: these works need to either assume that each document contains only one topic, or else can only recover the span of the topic vectors instead of the topic vectors themselves.

This paper formally justifies Nonnegative Matrix Factorization(NMF) as a main tool in this context, which is an analog of SVD where all vectors are nonnegative. Using this tool we give the first polynomial-time algorithm for learning topic models without the above two limitations. The algorithm uses a fairly mild assumption about the underlying topic matrix called separability, which is usually found to hold in real-life data.

We hope that this paper will motivate further theoretical results that use NMF as a replacement for SVD - just as NMF has come to replace SVD in many applications.

Computing a Nonnegative Matrix Factorization --- Provably arxiv abstract
with Sanjeev Arora, Ravi Kannan and Ankur Moitra. In STOC 2012.
. The nonnegative matrix factorization problem asks for factorizing an n*m matrix M into AW where A is n*r and W is r*m, further the entries of the two matrices A and W are required to be nonnegative. This problem has a extremely broad range of applications, from machine learning to communication complexity. We give algorithms for exact and approximate versions of the problem when r is small (which is the case for most applications) and a hardness result that shows our results cannot be substancially improved.

Finding Overlapping Communities in Social Networks: Toward a Rigorous Approach arxivabstract
with Sanjeev Arora, Sushant Sachdeva and Grant Schoenebeck, In EC 2012.
A community in a social network is usually understood to be a group of nodes more densely connected with each other than with the rest of the network. Many existing methods for finding them use heuristic algorithms where the solution concept is defined in terms of an NP-complete problem such as CLIQUE or HIERARCHICAL CLUSTERING. We introduce a more formal approach whereby we make rigorous definitions that sidestep the pitfall of high computational complexity and allow efficient algorithms for finding communities.

Computational Complexity and Information Asymmetry in Financial Products pdfabstractfaq
with Sanjeev Arora, Boaz Barak and Markus Brunnermeier. In ICS 2010

A short version appeared in Communications of the ACM, 2011 Issue 5.
Full text PDF

See some blog comments on this paper: Intractability Center, Freedom to Tinker, Boing Boing, Daily Kos, In Theory, Healthy Algorithms, Lipton's blog

We put forward the issue of Computational Complexity in the analysis of Financial Derivatives, and show that there is a fundamental difficulty in pricing financial derivatives even in very simple settings.

This is still a working paper. The latest version (updated Jan. 2012) can be found here:
Computational Complexity and Information Asymmetry in
Financial Products

New Tools for Graph Coloring pdf abstract
with Sanjeev Arora, In APPROX 2011
We explore the possibility of better coloring algorithm using the new concept threshold rank. We are able to analyze high level Lasserre Relaxations for low threshold rank graphs and distance transitive graphs, and our algorithm makes significantly better progress (towards O(log n) coloring) in these cases compared to previous algorithms (where the number of colors is polynomial). We also purpose a plausible conjecture, that if true would yield good sub-exponential time algorithm for graph coloring.

New Algorithms for Learning in Presence of Errors pdf abstract
with Sanjeev Arora, In ICALP 2011
We give new algorithms for a variety of randomly-generated instances of computational problems using a linearization technique that reduces to solving a system of linear equations. Our techniques can be applied to the low noise case of the well-known LWE (Learning With Errors) problem, giving a slightly subexponential time algorithm which suggests known hardness results are almost tight.

New Results on Simple Stochastic Games pdf abstract
with Decheng Dai. In ISAAC 2009
We give a new algorithm for Simple Stochastic Games when the number of RANDOM vertices is small. We also show that coarse approximation is as hard as fine approximation for Simple Stochastic Games.