Experts explore and discuss current research.

Statistics Seminar speakers include researchers across fields discussing current topics in Statistics and presenting new research.

Seminars are offered via Zoom (Meeting ID: 984 2423 1705, Passcode: 354857). Recorded talks will be posted on the Statistics Seminar channel as they become available (Cornell NetID required).

If you are interested in joining the seminar mailing list, contact Megan Adams (mls99 [at] cornell [dot] edu).

If there are no upcoming events showing on this page, please check out our archives to get an idea of the caliber of guests and talk topics we typically host. 

 

Time: 4:30-5:30 p.m.
Date: Wednesday, January 21, 2026
Location: 350 Computing and Information Science Building
Speaker: Tudor Manole, MIT
Title: A Statistical Framework for Benchmarking Quantum Computers


Time: 4:30-5:30 p.m.
Date: Monday, January 26, 2026
Location: 250 Computing and Information Science Building
Speaker: Licong Lin, UC Berkeley
Title: Towards a Statistical Theory of Contrastive Learning in Modern AI

 

Time: 4:30-5:30 p.m.
Date: Wednesday, January 28, 2026
Location: 350 Computing and Information Science Building
Speaker: Keyon Vafa, Harvard University
Title: Assessing AI’s Implicit World Models

 

Time: 4:30-5:30 p.m.
Date: Monday, February 2, 2026
Location: 350 Computing and Information Science Building
Speaker: Jiawei Ge, Princeton University
Title: Toward Reliable AI: The Statistical Foundations of Out-of-Distribution Generalization

 

Time: 4:30-5:30 p.m.
Date: Wednesday, February 4, 2026
Location: 350 Computing and Information Science Building
Speaker: Yihong Gu, Harvard University
Title: Causality pursuit from heterogeneous environments

 

Time: 4:30-5:30 p.m.
Date: Wednesday, February 18, 2026
Location: 350 Computing and Information Science Building
Speaker: Florentin Guth, Faculty Fellow, Center for Data Science, NYU
Title: Learning normalized probability models with dual score matching

Archive

Time: 4:30-5:30 p.m.
Date: Wednesday, September 10, 2025
Location: 350 Computing and Information Science Building
Speaker: Florentina Bunea, Professor of Statistics and Data Science, Cornell Bowers
Title: From softmax mixture ensembles to mixtures of experts, with applications to LLM output summarization
 


Time: 4:30-5:30 p.m.
Date: Wednesday, September 17, 2025
Location: 350 Computing and Information Science Building
Speaker: Yuchen Wu, Assistant Professor, School of Operations Research and Information Engineering, Cornell University
Title: Modern Sampling Paradigms: from Posterior Sampling to Generative AI

 

Time: 4:30-5:30 p.m.
Date: Wednesday, September 24, 2025
Location: 350 Computing and Information Science Building
Speaker: Kangjie Zhou, Founder's Postdoctoral Fellow, Department of Statistics, Columbia University
Title: Dynamic Factor Analysis of High-dimensional Recurrent Events

 

Time: 4:30-5:30 p.m.
Date: Wednesday, October 1, 2025
Location: 350 Computing and Information Science Building
Speaker: Chenyang Zhong, Assistant Professor, Department of Statistics, Columbia University
Title: Variational Inference for Latent Variable Models in High Dimensions

 

Time: 4:30-5:30 p.m.
Date: Wednesday, October 8, 2025
Location: 350 Computing and Information Science Building
Speaker: Alberto Gonzalez Sanz, Assistant Professor, Department of Statistics, Columbia University
Title: Quadratically regularized optimal transport

 

Time: 4:30-5:30 p.m.
Date: Wednesday, October 15, 2025
Location: 350 Computing and Information Science Building
Speaker: Quinn Simonis, Cornell University
Title: Empirical Bayesian Modeling of Kronecker Product Relevancy in Gaussian Arrays

 

Time: 4:30-5:30 p.m.
Date: Wednesday, October 22, 2025
Location: 350 Computing and Information Science Building
Speaker: Jeff Miller, Associate Professor of Biostatistics, Harvard T.H. Chan School of Public Health
Title: Bayesian model criticism using uniform parametrization checks

 

Time: 4:30-5:30 p.m.
Date: Wednesday, October 29, 2025
Location: 350 Computing and Information Science Building
Speaker: Xiao Wang, Head and J.O. Berger and M.E. Bock Professor of Statistics, Purdue University
Title: Neural Amortized Bayesian Computation

 

Time: 4:30-5:30 p.m.
Date: Wednesday, November 5, 2025
Location: 350 Computing and Information Science Building
Speaker: Gemma Moran, Assistant Professor, Rutgers Statistics Department
Title: Nonlinear Multi-Study Factor Analysis

 

Time: 4:30-5:30 p.m.
Date: Wednesday, November 12, 2025
Location: 350 Computing and Information Science Building
Speaker: Ziv Goldfeld, Associate Professor, School of Electrical and Computer Engineering, Cornell Engineering
Title: Robust and Geometry-Aware Distribution Estimation via Optimal Transport 

 

Time: 4:30-5:30 p.m.
Date: Wednesday, November 19, 2025
Location: 350 Computing and Information Science Building
Speaker: Ilya Shpitser, John C. Malone Associate Professor, Johns Hopkins Whiting School of Engineering
Title: Graphical models for missing data not at random: identification, inference, and imputation

4.30.25: Revisiting Total Variation Denoising: New Perspectives and Generalizations
Sabyasachi Chatterjee 
Abstract: Total Variation Denoising (TVD) is a fundamental denoising/smoothing method. We will present a new local minmax/maxmin formula producing two estimators which sandwich the univariate TVD estimator at every point. Operationally, this formula gives a local definition of TVD as a minmax/maxmin of a simple function of local averages. We will show that this minmax/maxmin formula is generalizable and can be used to define other TVD-like estimators. In particular, we will present higher-order polynomial versions of TVD, defined pointwise as lying between minmax and maxmin optimizations of penalized local polynomial regressions over intervals of different scales.
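
For readers unfamiliar with the baseline estimator, here is a minimal sketch of univariate TVD as a penalized least-squares problem, solved with the generic convex solver cvxpy. This is an illustrative formulation of the classical estimator, not the local minmax/maxmin construction of the talk:

import numpy as np
import cvxpy as cp

# Noisy piecewise-constant signal
rng = np.random.default_rng(0)
truth = np.repeat([0.0, 2.0, -1.0], 50)
y = truth + rng.normal(scale=0.5, size=truth.size)

# Univariate TVD: minimize 0.5*||y - x||^2 + lam * sum_i |x_{i+1} - x_i|
lam = 2.0
x = cp.Variable(y.size)
objective = cp.Minimize(0.5 * cp.sum_squares(y - x) + lam * cp.norm1(cp.diff(x)))
cp.Problem(objective).solve()

x_hat = x.value  # piecewise-constant estimate of the signal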

 

4.23.25: Optimal vintage factor analysis with deflation varimax
Xin Bing
Abstract: Vintage factor analysis is an important type of factor analysis that aims first to find a low-dimensional representation of the original data, and then to seek a rotation such that the rotated low-dimensional representation is scientifically meaningful. The most widely used vintage factor analysis is Principal Component Analysis (PCA) followed by the varimax rotation. Despite its popularity, few theoretical guarantees have been available to date, mainly because the varimax rotation requires solving a non-convex optimization problem over the set of orthogonal matrices.
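
To make the pipeline concrete, here is a minimal sketch of PCA followed by the classic varimax rotation (the standard Kaiser-style SVD iteration, not the deflation variant proposed in the talk):

import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-8):
    # Classic varimax: find an orthogonal rotation R maximizing the
    # variance of the squared loadings, via iterative SVD updates.
    p, k = loadings.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        g = loadings.T @ (L**3 - (gamma / p) * L @ np.diag((L**2).sum(axis=0)))
        u, s, vt = np.linalg.svd(g)
        R = u @ vt
        d_new = s.sum()
        if d_new < d * (1 + tol):
            break
        d = d_new
    return loadings @ R

# PCA step: top-k principal directions of the centered data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X -= X.mean(axis=0)
_, _, Vt = np.linalg.svd(X, full_matrices=False)
loadings = Vt[:3].T          # p x k matrix of PCA loadings
rotated = varimax(loadings)  # sparser, more interpretable loadings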

 

4.16.25: Mitigating Biases in Evaluation: Offline and Online Settings
Jingyan Wang
Abstract: Evaluation -- to estimate the quality of items or people -- is central to many real-world applications such as admissions, grading, hiring, and peer review. In this talk, I will present two vignettes of my research towards understanding and mitigating various sources of biases in evaluation.
(1) Outcome-induced bias: We consider how people’s ratings are affected by experiences irrelevant to the evaluation objective. For example, in teaching evaluation, students who receive higher grades tend to rate their instructors more positively. In such scenarios, we propose mild non-parametric assumptions to model the bias, design an adaptive correction algorithm, and prove its consistency guarantees.
(2) Symbiosis bias: We consider A/B testing that aims to compare the performance of a pair of online algorithms. As a concrete example, consider a company experimenting with two recommendation algorithms and deciding which one to deploy in production. Symbiosis bias refers to the interference where the performance of one algorithm is influenced by the other algorithm through data feedback loops. Through a bandit formulation, we provide preliminary results on sign preservation properties of such A/B tests.

 

4.9.25: Federated Reinforcement Learning: Statistical, Communication and Computation Trade-offs
Yuejie Chi
Abstract: Reinforcement learning (RL), concerning decision making in uncertain environments, lies at the heart of modern artificial intelligence. Due to the high dimensionality, training RL agents typically requires a significant amount of computation and data to achieve desirable performance. However, data collection can be extremely time-consuming in real-world applications with limited access, especially when performed by a single agent. On the other hand, it is plausible to leverage multiple agents to collect data simultaneously, under the premise that they can collaboratively learn a global policy in a federated manner, without the need to share local data. This talk addresses the fundamental statistical, communication, and computation trade-offs in the algorithmic design of federated RL, covering both blessings and curses in the presence of data and task heterogeneity across agents.

 

2.27.25: Challenges and Opportunities in Assumption-free and Robust Inference
Yuetian Luo
Abstract: With the growing application of data science to complex, high-stakes tasks, ensuring the reliability of statistical inference methods has become increasingly critical. This talk considers two key challenges to achieving this goal: model misspecification and data corruption, highlighting their associated difficulties and potential solutions. In the first part, we investigate the problem of distribution-free algorithm risk evaluation, uncovering fundamental limitations on answering such questions with limited amounts of data. To navigate this challenge, we also discuss how incorporating an assumption about algorithmic stability might help. The second part focuses on constructing robust confidence intervals in the presence of arbitrary data contamination. We show that when the proportion of contamination is unknown, uncertainty quantification incurs a substantial cost: optimal robust confidence intervals must be significantly wider.

 

2.26.25: Power Enhancement in Statistical Inference for Large and Complex Data
Xiufan Yu
Abstract: Statistical inference is fundamental to data analysis and decision-making. Different tests may vary in performance across different settings, each excelling in distinct high-power regions. Over the past decade, power enhancement techniques have attracted growing attention in both theoretical and applied statistics, aiming to develop robust tests that remain reliably powerful across a broad spectrum of alternative hypotheses.
In this talk, I will present my recent work on power enhancement in high-dimensional heterogeneous mediation analysis, introducing a powerful inferential method to examine the existence of active mediators in high-dimensional linear and generalized mediation models. Existing tests based on the total indirect effect are often underpowered when the mediation effects are non-homogeneous. To address this limitation, we develop enhanced tests that are proven to maintain strong power under various mediation patterns, including homogeneous, heterogeneous and even contrasting mediation settings. 

 

2.25.25: Randomization Tests for Robust Causal Inference in Network Experiments
Panos Toulis
Abstract: Network experiments pose unique challenges for causal inference due to interference, where cause-effect relationships are confounded by network interactions among experimental units. This paper focuses on group formation experiments, where individuals are randomly assigned to groups and their responses are observed—for example, do first-year students achieve better grades when randomly paired with academically stronger roommates? We extend classical Fisher Randomization Tests (FRTs) to this setting, resulting in tests that are exact in finite samples and justified solely by the randomization itself. We also establish sufficient theoretical conditions under which general FRTs for network peer effects reduce to computationally efficient permutation tests. Our analysis identifies equivariance as a key algebraic property ensuring the validity of permutation tests under network interference.
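
As background, here is a minimal sketch of a generic Fisher Randomization Test via Monte Carlo re-randomization, for a two-group difference in means without interference; the network setting of the talk is considerably more involved:

import numpy as np

def randomization_test(y, z, n_draws=10_000, seed=0):
    # FRT of the sharp null (no treatment effect for any unit):
    # under the null the observed outcomes are fixed, so we re-draw
    # the treatment assignment and recompute the test statistic.
    rng = np.random.default_rng(seed)
    stat = lambda z_: y[z_ == 1].mean() - y[z_ == 0].mean()
    observed = stat(z)
    draws = np.array([stat(rng.permutation(z)) for _ in range(n_draws)])
    # Counting the observed assignment keeps the test valid in finite samples
    return (1 + np.sum(np.abs(draws) >= np.abs(observed))) / (1 + n_draws)

rng = np.random.default_rng(1)
z = rng.permutation(np.repeat([0, 1], 50))   # completely randomized assignment
y = 0.5 * z + rng.normal(size=100)           # outcomes with a true effect of 0.5
print(randomization_test(y, z))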

 

2.21.25: Facets of regularization in overparameterized machine learning
Pratik Patil
Abstract: Modern machine learning often operates in an overparameterized regime in which the number of parameters far exceeds the number of observations. In this regime, models can exhibit surprising generalization behaviors: (1) models can overfit with zero training error yet still generalize well (benign overfitting), and in some cases, even when explicit regularization is added and tuned, the optimal choice is no regularization at all (obligatory overfitting); (2) the generalization error can vary non-monotonically with the model or sample size (double/multiple descent). These behaviors challenge classical notions of overfitting and the role of explicit regularization.
In this talk, I will present theoretical and methodological results related to these behaviors, primarily focusing on the concrete case of ridge regularization. 
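
As a concrete reference point, here is a minimal sketch contrasting ridge regression with the ridgeless (minimum-norm) interpolator in an overparameterized linear model. Sweeping the penalty this way is how such generalization curves are typically traced; the setup is illustrative, not the speaker's experiments:

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 500                      # overparameterized: p >> n
beta = rng.normal(size=p) / np.sqrt(p)
X, X_test = rng.normal(size=(n, p)), rng.normal(size=(2000, p))
y = X @ beta + rng.normal(size=n)
y_test = X_test @ beta

def ridge(X, y, lam):
    # Closed-form ridge estimator; lam = 0 gives the minimum-norm
    # least-squares interpolator via the pseudoinverse.
    if lam == 0:
        return np.linalg.pinv(X) @ y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for lam in [0, 0.01, 1, 100, 10_000]:
    err = np.mean((X_test @ ridge(X, y, lam) - y_test) ** 2)
    print(f"lambda = {lam:>6}: test MSE = {err:.3f}")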

 

2.19.25: Posterior Conformal Prediction
Yao Zhang
Abstract: Conformal prediction is a popular technique for constructing prediction intervals with distribution-free coverage guarantees. The coverage is marginal, holding on average over the entire population but not necessarily for any specific subgroup. In this talk, I will introduce a new method, posterior conformal prediction (PCP), which generates prediction intervals with both marginal and approximate conditional coverage for clusters (or subgroups) naturally discovered in the data. PCP achieves these guarantees by modelling the conditional conformity score distribution as a mixture of cluster distributions. Compared to other methods with approximate conditional coverage, this approach produces tighter intervals, particularly when the test data is drawn from clusters that are well represented in the validation data. PCP can also be applied to guarantee conditional coverage on user-specified subgroups, in which case it achieves robust coverage on smaller subgroups within the specified subgroups. In classification, the theory underlying PCP allows for adjusting the coverage level based on the classifier’s confidence, achieving significantly smaller sets than standard conformal prediction sets. Experiments demonstrate the performance of PCP on diverse datasets from socio-economic, scientific and healthcare applications.