Xihong Lin, PhD, is Professor and former Chair of Biostatistics and Coordinating Director of the Program in Quantitative Genomics at the Harvard T. H. Chan School of Public Health, as well as Professor of Statistics at Harvard University. Dr. Lin’s research interests lie in the development and application of scalable statistical and machine learning methods for the analysis of massive genetic and genomic data along with complex epidemiological, biobank and health data. Dr. Lin is also known for her contributions to epidemic modeling during the early phase of the COVID-19 pandemic. Dr. Lin was elected to the US National Academy of Sciences in 2018 and the US National Academy of Medicine in 2023. She received the 2002 Mortimer Spiegelman Award from the American Public Health Association, the 2006 Presidents’ Award and the 2017 FN David Award of the Committee of Presidents of Statistical Societies (COPSS), the 2022 Jerome Sacks Award for Outstanding Cross-Disciplinary Research from the National Institute of Statistical Sciences, and the 2022 Marvin Zelen Leadership in Statistical Science Award. She is an elected fellow of the American Statistical Association, the Institute of Mathematical Statistics, and the International Statistical Institute. She is a recipient of the MERIT Award (2007-2015) and the Outstanding Investigator Award (OIA) (R35) (2015-2029) from the National Cancer Institute (NCI). Dr. Lin served as Chair of COPSS (2010-2012) and is a former member of the Committee on Applied and Theoretical Statistics of the National Academy of Sciences. Dr. Lin is a former Coordinating Editor of Biometrics and the founding co-editor of Statistics in Biosciences. She has served on numerous NIH and NSF review panels.
Talk: Fast Distributed Principal Component Analysis of Large-Scale Federated Data
Abstract: Principal component analysis (PCA) is one of the most popular methods for dimension reduction. In light of the rapid growth of large-scale data in federated ecosystems, the traditional PCA method is often not applicable because of privacy protection considerations and the large computational burden. Algorithms have been proposed to lower the computational cost, but few can handle both high dimensionality and massive sample size under the distributed setting. In this paper, we propose the FAst DIstributed (FADI) PCA method for federated data when both the dimension d and the sample size n are ultra-large, by simultaneously performing parallel computing along d and distributed computing along n. Specifically, we utilize L parallel copies of p-dimensional fast sketches to divide the computing burden along d and aggregate the results distributively along the split samples. We present FADI under a general framework applicable to multiple statistical problems, and establish comprehensive theoretical results under the general framework. We show that FADI enjoys the same non-asymptotic error rate as the traditional PCA when Lp ≥ d. We also derive inferential results that characterize the asymptotic distribution of FADI, and show a phase-transition phenomenon as Lp increases. We also discuss estimation of the number of low ranks of a covariance matrix by Bulk Eigenvalue Matching Analysis (BEMA). We perform extensive simulations to show that FADI substantially outperforms the existing methods in computational efficiency while preserving accuracy, and validate the distributional phase-transition phenomenon through numerical experiments. We apply FADI to the 1000 Genomes data to study the population structure. This is joint work with Shuting Shen and Junwei Lu.
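For readers who want a concrete picture of the idea in the abstract, the following is a minimal NumPy sketch, not the authors' implementation. It assumes Gaussian sketching matrices, mean-zero data already split by rows across sites, and aggregation of the L sketch-level estimates by stacking their bases and taking a final SVD. The function name `sketched_distributed_pca` and all parameter names are hypothetical.

```python
import numpy as np

def sketched_distributed_pca(site_data, K, p, L, seed=0):
    """Illustrative sketch-based distributed PCA (assumptions noted above).

    site_data : list of (n_s x d) arrays, one per site, rows are samples
    K         : number of leading components to estimate (requires p >= K)
    p         : dimension of each random sketch
    L         : number of independent sketches
    """
    rng = np.random.default_rng(seed)
    d = site_data[0].shape[1]
    n = sum(X.shape[0] for X in site_data)

    bases = []
    for _ in range(L):  # each sketch could run on its own worker
        omega = rng.standard_normal((d, p))  # assumed Gaussian sketch
        # Every site shares only a d x p summary, never its raw rows;
        # the summaries add up to (sample covariance) @ omega.
        Y = sum(X.T @ (X @ omega) for X in site_data) / n
        U, _, _ = np.linalg.svd(Y, full_matrices=False)
        bases.append(U[:, :K])  # top-K directions from this sketch

    # Combine the L estimates: the leading left singular vectors of the
    # stacked bases are the leading eigenvectors of the averaged projectors.
    stacked = np.hstack(bases) / np.sqrt(L)
    U, _, _ = np.linalg.svd(stacked, full_matrices=False)
    return U[:, :K]
```

In this sketch each SVD acts on a d x p or d x LK matrix rather than the full d x d covariance, which is where the saving along d would come from; calling it with a list of per-site arrays returns a d x K matrix with orthonormal columns.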