The Statistics Seminar Speaker for Friday, February 19, 2016, is Wanjie Wang, a Postdoctoral Associate in the Department of Biostatistics and Epidemiology and the Department of Statistics at the University of Pennsylvania. Wang's main research interest lies in high-dimensional statistical inference. This includes studying the separation boundary of accessibility and impossibility for statistical problems in high-dimensional settings, such as the clustering, signal recovery, and detection problems; developing new statistical methods for high-dimensional data with rare and weak signals, such as the Important Features PCA (IF-PCA) algorithm for the clustering problem; and modeling and solving real-world problems, such as the detection and evaluation of sparse simultaneous signals for genetic associations between two diseases based on GWAS data.
Wang is also interested in applying statistics to real-world problems, such as developing rank-based tests for genomics data with excessive zeros and a Current-Threshold model for neuron coding. She received a master's degree and a Ph.D. in Statistics from Carnegie Mellon University, and completed her undergraduate studies in the School of Mathematical Sciences at Peking University.
Title: Important Features PCA (IF-PCA) for Large-Scale Inference, with Applications in Gene Microarrays
Abstract: Clustering is a major problem in statistics with many applications. In the Big Data era, it faces two main challenges: (1) the number of features is much larger than the sample size; (2) the signals are sparse and weak, masked by a large amount of noise.
We propose a new tuning-free clustering procedure for large-scale data, Important Features PCA (IF-PCA). IF-PCA consists of a feature selection step, a PCA step, and a k-means step. The first two steps reduce the data dimensions recursively while preserving the main information. As a consequence, IF-PCA is fast and accurate, and it performs competitively when applied to 10 gene microarray data sets.
We also propose a model that can capture the rarity and weakness of the signals. Under this model, the statistical limits for the clustering problem and for IF-PCA have been found.
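For readers who want a concrete picture of the three-step pipeline named in the abstract (feature selection, PCA, k-means), the following is a minimal Python sketch and not the speaker's implementation. The abstract does not say how features are scored or thresholded, so the sketch makes its own assumptions: it scores each feature by a Kolmogorov-Smirnov distance to the standard normal and keeps a fixed number of top-scoring features. The parameter n_keep and all function names are hypothetical, and unlike the tuning-free procedure in the talk, this sketch needs n_keep to be chosen by hand. The simulated data at the end are also made up for illustration.

```python
import math
import numpy as np

# Standard normal CDF applied elementwise (math.erf works on scalars only).
_norm_cdf = np.vectorize(lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))

def standardize(X):
    """Center each column and scale it to unit sample standard deviation.
    Assumes no column is constant."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def ks_scores(Z):
    """Per-feature Kolmogorov-Smirnov distance to N(0, 1), scaled by sqrt(n).
    This scoring rule is an assumption of the sketch."""
    n = Z.shape[0]
    F = _norm_cdf(np.sort(Z, axis=0))
    upper = np.arange(1, n + 1)[:, None] / n
    lower = np.arange(0, n)[:, None] / n
    return np.sqrt(n) * np.maximum(upper - F, F - lower).max(axis=0)

def kmeans(U, k, rng, n_init=10, n_iter=100):
    """Plain Lloyd's algorithm with random restarts; returns the best labels."""
    best_labels, best_cost = None, np.inf
    for _ in range(n_init):
        centers = U[rng.choice(len(U), size=k, replace=False)]
        for _ in range(n_iter):
            dist = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            new_centers = np.array([
                U[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        # Recompute assignments and cost against the final centers.
        dist = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        cost = dist.min(axis=1).sum()
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels

def if_pca_sketch(X, k, n_keep, seed=0):
    """Feature selection, then PCA, then k-means (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    Z = standardize(X)
    # Step 1: keep the n_keep features with the largest scores.
    keep = np.argsort(ks_scores(Z))[-n_keep:]
    # Step 2: PCA on the kept features via SVD; use k - 1 left singular vectors.
    U, _, _ = np.linalg.svd(Z[:, keep], full_matrices=False)
    # Step 3: k-means on the low-dimensional projection.
    labels = kmeans(U[:, :k - 1], k, rng)
    return labels, keep

# Simulated example: 200 samples, 500 features, only the first 20 carry signal.
rng = np.random.default_rng(1)
n, p, s = 200, 500, 20
y = rng.integers(0, 2, size=n)
X = rng.standard_normal((n, p))
X[:, :s] += 3.0 * (2 * y[:, None] - 1)

labels, keep = if_pca_sketch(X, k=2, n_keep=20)
agree = np.mean(labels == y)
print("clustering accuracy:", max(agree, 1 - agree))
print("useful features among those kept:", int(np.sum(keep < s)))
```

The accuracy is computed up to a swap of the two cluster labels, since clustering recovers groups but not their names. The simulated signal here is deliberately strong; the rare and weak regime discussed in the abstract is much harder.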
Refreshments will be served after the seminar in 1181 Comstock Hall.