Statistics Seminar Speaker: Abhishek Chakrabortty, 12/14/2018

The Statistics Seminar speaker for Friday, December 14, 2018, is Abhishek Chakrabortty, a postdoctoral researcher in the Department of Statistics and the DBEI at the University of Pennsylvania, where he is mentored by Prof. Hongzhe Li and Prof. T. Tony Cai. Dr. Chakrabortty received his Ph.D. in Biostatistics from Harvard University, where he was advised by Prof. Tianxi Cai, and his Bachelor's and Master's degrees in Statistics from the Indian Statistical Institute, Kolkata. His research interests lie broadly at the interface of semi-parametric inference, high-dimensional statistics, and statistical learning in semi-supervised or weakly supervised settings, with applications to the analysis of large and complex observational datasets arising in modern biomedical studies.

Talk: Semi-Supervised Inference with Large and High Dimensional Data: A Semi-Parametric Perspective

Abstract: The abundance of large and complex datasets in the current big data era has created a host of novel statistical challenges for properly harnessing such rich (but often incomplete) information. One such challenge is statistical inference in semi-supervised (SS) settings, where, apart from a moderately sized supervised dataset (L), one also has a much larger unsupervised dataset (U) available. Such datasets arise naturally when the response, unlike the covariates, is difficult and/or expensive to obtain, a frequent scenario in modern studies involving large databases, including biomedical data such as electronic health records (EHR). It is natural to investigate whether and how the information from U can be exploited to improve efficiency over a given supervised approach.

In this talk, I will consider SS inference for a class of standard Z-estimation problems. I will first discuss the subtleties and associated challenges that necessitate a semi-parametric perspective. I will then demonstrate a family of SS Z-estimators that are robust and adaptive, ensuring that they are always at least as efficient as the supervised estimator, and more efficient (optimal in some cases) when the information from U actually relates to the parameter of interest. These properties are crucial for advocating ‘safe’ use of unlabeled data and are often unaddressed. Our framework provides a much-needed unified understanding of these problems. Multiple EHR data applications will also be presented to exhibit the practical benefits of our estimator. In the later part of the talk, I will consider SS inference in high-dimensional settings and demonstrate the remarkable benefits that the unlabeled data provide in seamlessly obtaining a family of SS estimators with asymptotic linear expansions, without directly requiring any of the sparsity conditions or debiasing needed in supervised settings. This, in particular, facilitates high-dimensional inference under minimal assumptions.