
Rong Ge

I am the Cue Family Associate Professor in the Computer Science Department at Duke University.

I received my Ph.D. from the Computer Science Department of Princeton University, where my advisor was Sanjeev Arora. Before joining Duke, I was a post-doc at Microsoft Research, New England.


My Research
I am broadly interested in theoretical computer science and machine learning. Modern machine learning algorithms such as deep learning try to automatically learn useful hidden representations of the data. How can we formalize hidden structures in the data, and how do we design efficient algorithms to find them? My research aims to answer these questions by studying problems that arise in analyzing text, images and other forms of data, using techniques such as non-convex optimization and tensor decompositions. See the Research page for more details.

My thesis: Provable Algorithms for Machine Learning Problems


Students & Post-docs
Current PhD students:
Muthu Chidambaram
Ruomin Huang

Graduated students:
Abraham Frandsen
Xiang Wang
Keerti Anand (co-advised with Debmalya Panigrahi)
Chenwei Wu
Mo Zhou


Post-docs:
Holden Lee (with Jianfeng Lu, now faculty at Johns Hopkins University)
Yu Cheng (with many others in algorithms group, now faculty at Brown University)


Selected Recent Publications
Provably learning diverse features in multi-view data with midpoint mixup
with Muthu Chidambaram, Xiang Wang, Chenwei Wu. In ICML 2023.

Mixup is a data augmentation technique that relies on training using random convex combinations of data points and their labels. In recent years, Mixup has become a standard primitive used in the training of state-of-the-art image classification models due to its demonstrated benefits over empirical risk minimization with regards to generalization and robustness. In this work, we try to explain some of this success from a feature learning perspective. We focus our attention on classification problems in which each class may have multiple associated features (or views) that can be used to predict the class correctly. Our main theoretical results demonstrate that, for a non-trivial class of data distributions with two features per class, training a 2-layer convolutional network using empirical risk minimization can lead to learning only one feature for almost all classes while training with a specific instantiation of Mixup succeeds in learning both features for every class. We also show empirically that these theoretical insights extend to the practical settings of image benchmarks modified to have multiple features.
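As a concrete illustration of the Mixup primitive described above, here is a minimal PyTorch sketch (the function name and defaults are my own; the "midpoint" variant studied in the paper corresponds to fixing the mixing coefficient at 1/2 rather than sampling it from a Beta distribution):

```python
import torch

def mixup_batch(x, y_onehot, alpha=1.0, midpoint=False):
    """Form convex combinations of a batch with a shuffled copy of itself.

    midpoint=True fixes lambda = 0.5 (the midpoint Mixup variant the paper
    analyzes); otherwise lambda ~ Beta(alpha, alpha) as in standard Mixup.
    """
    lam = 0.5 if midpoint else torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))                      # random pairing of examples
    x_mix = lam * x + (1 - lam) * x[perm]                 # mix inputs
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]   # mix (one-hot) labels
    return x_mix, y_mix
```

Training then minimizes the usual cross-entropy between the network's output on x_mix and the soft label y_mix.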


Depth Separation with Multilayer Mean-Field Networks
with Yunwei Ren, Mo Zhou. In ICLR 2023.

Depth separation -- why a deeper network is more powerful than a shallower one -- has been a major problem in deep learning theory. Previous results often focus on representation power. For example, Safran et al. (2019) constructed a function that is easy to approximate using a 3-layer network but not approximable by any 2-layer network. In this paper, we show that this separation is in fact algorithmic: one can learn the function constructed by Safran et al. (2019) using an overparameterized network with polynomially many neurons efficiently. Our result relies on a new way of extending the mean-field limit to multilayer networks, and a decomposition of loss that factors out the error introduced by the discretization of infinite-width mean-field networks.


Outlier-robust sparse estimation via non-convex optimization
with Yu Cheng, Ilias Diakonikolas, Shivam Gupta, Daniel Kane, Mahdi Soltanolkotabi. In NeurIPS 2022.
We explore the connection between outlier-robust high-dimensional statistics and non-convex optimization in the presence of sparsity constraints, with a focus on the fundamental tasks of robust sparse mean estimation and robust sparse PCA. We develop novel and simple optimization formulations for these problems such that any approximate stationary point of the associated optimization problem yields a near-optimal solution for the underlying robust estimation task. As a corollary, we obtain that any first-order method that efficiently converges to stationarity yields an efficient algorithm for these tasks. The obtained algorithms are simple, practical, and succeed under broader distributional assumptions compared to prior work.
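For a sense of the robust sparse mean estimation task the abstract refers to, here is a toy NumPy baseline: a coordinate-wise trimmed mean followed by keeping the top-k coordinates. This is only an illustration of the problem setup with made-up constants, not the paper's stationary-point-based formulation or algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k, eps = 100, 2000, 5, 0.1   # dimension, samples, sparsity, corruption rate

# k-sparse true mean; an eps fraction of the samples are replaced by outliers.
mu = np.zeros(d)
mu[:k] = 3.0
X = rng.normal(mu, 1.0, size=(n, d))
n_out = int(eps * n)
X[:n_out] = rng.normal(20.0, 1.0, size=(n_out, d))

# Coordinate-wise trimmed mean: drop the eps tails in each coordinate, then
# hard-threshold to the k largest-magnitude coordinates.
trim = int(eps * n)
Xs = np.sort(X, axis=0)
est = Xs[trim:n - trim].mean(axis=0)
top_k = np.argsort(-np.abs(est))[:k]
est_sparse = np.zeros(d)
est_sparse[top_k] = est[top_k]

print("estimation error:", np.linalg.norm(est_sparse - mu))
```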

A local convergence theory for mildly over-parameterized two-layer neural network
with Mo Zhou and Chi Jin. In COLT 2021.
While over-parameterization is widely believed to be crucial for the success of optimization for neural networks, most existing theories on over-parameterization do not fully explain the reason -- they either work in the Neural Tangent Kernel regime where neurons don't move much, or require an enormous number of neurons. In practice, when the data is generated using a teacher neural network, even mildly over-parameterized neural networks can achieve 0 loss and recover the directions of teacher neurons. In this paper we develop a local convergence theory for mildly over-parameterized two-layer neural networks. We show that as long as the loss is already lower than a threshold (polynomial in relevant parameters), all student neurons in an over-parameterized two-layer neural network will converge to one of the teacher neurons, and the loss will go to 0. Our result holds for any number of student neurons as long as it is at least as large as the number of teacher neurons, and our convergence rate is independent of the number of student neurons. A key component of our analysis is a new characterization of the local optimization landscape -- we show the gradient satisfies a special case of the Lojasiewicz property, which is different from the local strong convexity or PL conditions used in previous work.
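The teacher-student phenomenon mentioned in the abstract is easy to play with numerically. Below is a small, self-contained PyTorch sketch of that setting under my own simplifying assumptions (fixed second-layer weights equal to 1, Gaussian inputs, plain gradient descent); it is meant only to illustrate the setup, not to reproduce the paper's guarantees:

```python
import torch

torch.manual_seed(0)
d, m_teacher, m_student, n = 20, 4, 6, 2000   # mildly over-parameterized: 6 > 4

# Teacher: two-layer ReLU network with unit-norm first-layer neurons and
# second-layer weights fixed to 1.
W_teacher = torch.randn(m_teacher, d)
W_teacher /= W_teacher.norm(dim=1, keepdim=True)

X = torch.randn(n, d)
y = torch.relu(X @ W_teacher.t()).sum(dim=1, keepdim=True)

# Student: same architecture with a few extra neurons, trained by gradient descent.
W_student = torch.randn(m_student, d, requires_grad=True)
opt = torch.optim.SGD([W_student], lr=0.05)
for step in range(3000):
    opt.zero_grad()
    pred = torch.relu(X @ W_student.t()).sum(dim=1, keepdim=True)
    loss = torch.mean((pred - y) ** 2)
    loss.backward()
    opt.step()

# How well does each student neuron align with its closest teacher direction?
cos = torch.nn.functional.normalize(W_student.detach(), dim=1) @ W_teacher.t()
print("final loss:", loss.item())
print("best alignment per student neuron:", cos.max(dim=1).values)
```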

Contact
Email: rongge AT cs DOT duke DOT edu
Tel: +1 (919) 660-7330
Mail: Duke University
Campus Box 90129
308 Research Drive (LSRC Building)
Room D226
Durham, NC 27708, USA