Department of Statistics & Operations Research

Statistics Seminars

2019/2020

To subscribe to the list, please follow this link or send an email to 12345yekutiel@tauex.tau.ac.il54321 (remove the numbers unless you are a spammer…)

Second Semester

17 March

Felix Abramovich, TAU

 

High-dimensional classification by sparse logistic regression

24 March

Nitay Alon, TAU

 

The effect of drift on the Skorokhod-embedded distribution of stopped Brownian Motion

12 May

Jeff Wu, Georgia Tech.

19 May

Taeho Kim, Haifa University

2 June

Andrea Saltelli, University of Bergen

First Semester

29 October

Daniel Yekutieli, TAU

 

Hierarchical Bayes Modeling for Large-Scale Inference

26 November

Noa Molshatzki, USC

 

Methods to Identify Key Predictors and Interactions Using Machine Learning

10 December

Jacob Bien, USC

 

Reluctant Interaction Modeling

24 December

Arbel Harpak, Columbia University

 

Interpreting and deconstructing polygenic scores

31 December

Yoav Zemel, Statistical Laboratory, University of Cambridge

 

Optimal Transport: Fast Probabilistic Approximation with Exact Solvers    

7 January

Christian Müller,  Ludwig-Maximilians-University Munich

 

Perspective M-Estimation: Constructions, optimization, and biological applications

9 January

Yaniv Romano, Stanford

 

Reliability, Equity, and Reproducibility in Modern Machine Learning 

14 January

Eitan Greenshtein, Central Bureau of Statistics, Israel

 

Generalized Maximum Likelihood Estimators and their applications to stratified sampling and post-stratification with many unobserved strata

Seminars are held on Tuesdays at 10:30 am in Schreiber Building, Room 309 (see the TAU map). The seminar organizer is Daniel Yekutieli.

To join the seminar mailing list, or for any other inquiries, please call (03)-6409612 or email 12345yekutiel@post.tau.ac.il54321 (remove the numbers unless you are a spammer…)

Seminars from previous years

ABSTRACTS

·         Daniel Yekutieli, TAU

 

 

Hierarchical Bayes Modeling for Large-Scale Inference

 

Bayesian modeling is now ubiquitous in problems of large-scale inference, even when frequentist criteria are in mind for evaluating the performance of a procedure. By far the most popular in the statistical literature of the past decade and a half are empirical Bayes methods, which have been shown in practice to improve significantly over strictly-frequentist competitors in many different problems. As an alternative to empirical Bayes methods, in this paper we propose hierarchical Bayes modeling for large-scale problems, and address two separate points that, in our opinion, deserve more attention. The first is nonparametric “deconvolution” methods that are applicable also outside the sequence model. The second point is the adequacy of Bayesian modeling for situations where the parameters are by assumption deterministic. We provide partial answers to both: first, we demonstrate how our methodology applies in the analysis of a logistic regression model. Second, we appeal to Robbins's compound decision theory and provide an extension, to give formal justification for the Bayesian approach in the sequence case.
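As a concrete illustration of the hierarchical approach in the sequence case, the following minimal sketch fits a toy Gaussian sequence model by Gibbs sampling, placing priors on the hyperparameters and sampling them rather than plugging in point estimates as empirical Bayes would. The model, priors, and data below are illustrative assumptions, not the talk's actual construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy Gaussian sequence model: X_i ~ N(theta_i, 1) with theta_i ~ N(mu, tau2).
# Hierarchical Bayes places priors on (mu, tau2) and samples them, rather
# than plugging in point estimates as empirical Bayes would.
n = 500
theta_true = rng.normal(2.0, 1.5, n)
x = theta_true + rng.normal(size=n)

mu, tau2 = 0.0, 1.0          # initial hyperparameter values
draws = []
for _ in range(2000):
    # theta_i | x, mu, tau2: conjugate normal update (shrinkage toward mu)
    v = 1.0 / (1.0 + 1.0 / tau2)
    theta = rng.normal(v * (x + mu / tau2), np.sqrt(v))
    # mu | theta, tau2 under a flat prior
    mu = rng.normal(theta.mean(), np.sqrt(tau2 / n))
    # tau2 | theta, mu under an inverse-gamma(1, 1) prior
    sse = np.sum((theta - mu) ** 2)
    tau2 = 1.0 / rng.gamma(1.0 + n / 2.0, 1.0 / (1.0 + sse / 2.0))
    draws.append(theta)

theta_hat = np.mean(draws[500:], axis=0)          # posterior means
print("MSE of posterior mean:", np.mean((theta_hat - theta_true) ** 2))
print("MSE of the raw MLE:  ", np.mean((x - theta_true) ** 2))
```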

 

 

 

 

·         Noa Molshatzki, USC

 

Methods to Identify Key Predictors and Interactions Using Machine Learning

 

Gradient Boosting Model (GBM) is a tree-based machine learning method that can be applied to explore large healthcare datasets. GBM automatically accounts for nonlinearities and interactions, but the model is not directly interpretable. Importance statistics are often applied to understand key predictors and interactions, but these statistics are biased towards predictors with many categories and have no threshold for extracting important associations. A solution is to create a reference null distribution by repeatedly calculating importance statistics under an altered outcome. In this work, we (1) apply existing methods to identify key predictors and interactions with GBM, (2) propose a novel improvement to identify key interactions, and (3) apply the methods to a large healthcare dataset.

We used a simulation study to assess the ability of the methods to correctly identify true associations. We analyzed data from Kaiser Permanente Southern California (KPSC) electronic medical records of ~70,000 pregnant women to detect determinants of gestational diabetes mellitus (GDM).

In the simulation study, we identified important predictors, interactions and nonlinearities while maintaining a low false discovery rate. Our novel method was computationally efficient (short run time) and had good discovery performance compared to the existing approach. In KPSC data, we identified known GDM risk factors and potentially novel nonlinearities and interactions. In conclusion, the reference null approach is an important tool for GBM interpretation.
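The reference null idea lends itself to a short sketch: refit the GBM on permuted outcomes so that every apparent importance is pure noise, and use the resulting null distribution as a selection threshold. The sketch below uses scikit-learn's GradientBoostingClassifier as a stand-in implementation; the number of permutations, the max-statistic, and the 95% cutoff are illustrative assumptions, not the authors' exact recipe.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)

# Toy data: only the first two of ten predictors affect the outcome.
n, p = 2000, 10
X = rng.normal(size=(n, p))
y = (X[:, 0] + X[:, 1] ** 2 + rng.normal(size=n) > 1).astype(int)

def importances(X, y):
    """Fit a GBM and return its per-predictor importance statistics."""
    gbm = GradientBoostingClassifier(n_estimators=100, max_depth=3)
    return gbm.fit(X, y).feature_importances_

obs = importances(X, y)

# Reference null: refit on permuted outcomes so every apparent importance
# is pure noise, and record the maximum importance per permutation.
null_max = [importances(X, rng.permutation(y)).max() for _ in range(20)]
threshold = np.quantile(null_max, 0.95)
print("predictors passing the null threshold:", np.where(obs > threshold)[0])
```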

 

 

 

 

 

·         Eitan Greenshtein, Central Bureau of Statistics, Israel

 

Generalized Maximum Likelihood Estimators and their applications to stratified sampling and post-stratification with many unobserved strata

 

Consider the problem of estimating a weighted average of the means of $n$ strata, based on a random sample  with realized $K_i$ observations from stratum $i, \; i=1,...,n$.

 

This task is non-trivial in cases where, for a significant portion of the strata, the corresponding $K_i=0$. Such a situation may arise in post-stratification, when a very fine stratification is desired. A fine stratification may be desired so that assumptions or approximations, such as Missing At Random conditional on strata, are plausible. A fine stratification could also be desired in observational studies, when the goal is to estimate an average treatment effect by averaging the effects in small, homogeneous strata.


Our approach is based on applying Generalized Maximum Likelihood Estimators (GMLE), and ideas related to Non-Parametric Empirical Bayes, in order to estimate the means of the strata $i$ with corresponding $K_i=0$. There are no assumptions about a relation between the means of the unobserved strata (i.e., those with $K_i=0$) and those of the observed strata.

 

The performance of our approach is demonstrated both in simulations and on a real data set. Some consistency and asymptotic results are also presented. In addition, related basic results about GMLE estimation of the mean of mixtures of exponential families are provided.
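As a hedged illustration of the GMLE idea (not the authors' estimator), the sketch below computes a nonparametric maximum likelihood estimate of the mixing distribution of Poisson strata means by EM over a fixed grid of support points; the grid, the Poisson model and the two-point truth are all assumptions made for the example.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(2)

# Strata means theta_i are drawn from an unknown mixing distribution G;
# we observe one Poisson count per stratum and estimate G by maximizing
# the mixture likelihood over a fixed grid (an EM version of the GMLE).
theta = rng.choice([1.0, 5.0], size=300, p=[0.7, 0.3])   # hidden truth
x = rng.poisson(theta)

grid = np.linspace(0.1, 10.0, 60)            # support points for G-hat
w = np.full(grid.size, 1.0 / grid.size)      # mixing weights to be fitted
L = poisson.pmf(x[:, None], grid[None, :])   # likelihood of x_i at each point

for _ in range(500):                         # EM updates of the weights
    post = L * w
    post /= post.sum(axis=1, keepdims=True)  # posterior over the grid
    w = post.mean(axis=0)

# E_G[theta] under the fitted G: a prediction usable even for a stratum
# with K_i = 0, since it borrows strength from the observed strata only.
print("estimated E_G[theta]:", np.sum(w * grid))
```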

 

 

 

 

 

·         Christian Müller,  Ludwig-Maximilians-University Munich

 

Perspective M-Estimation: Constructions, optimization, and biological applications 

 

In high-dimensional statistics, finding the maximum likelihood estimate for a statistical model often amounts to solving a (convex) non-smooth optimization problem. One particular model for maximum likelihood-type estimation (M-estimation) which generalizes a large class of well-known estimators, including Huber's concomitant M-estimators and the scaled Lasso, is the perspective M-estimation model. Perspective M-estimation leverages the observation that convex M-estimators with concomitant scale, as well as various regularizers, are instances of perspective functions, and is thus amenable to efficient global optimization. We extend this model to allow for regression models with compositional covariate data, which are commonplace in biology, including microbiome and metabolomics data. We introduce new perspective M-estimators that can handle outliers in outcome variables and heteroscedasticity in the covariates, and show how to solve the associated non-smooth optimization problem with proximal algorithms. We find excellent empirical performance of the estimators on synthetic and real-world prediction tasks involving human gut and soil microbiome data.

 

This is joint work with Patrick L. Combettes, NC State.
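For a feel for concomitant-scale estimation, here is a minimal sketch of the classical scaled Lasso, one of the estimators the perspective framework generalizes: alternate between a Lasso fit with penalty proportional to the current scale and a residual-based scale update. This is only an illustration of joint (beta, sigma) estimation, not the perspective-function solver discussed in the talk; the penalty level and stopping rule are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)

# Scaled Lasso: estimate the coefficients and the noise scale jointly by
# alternating a Lasso fit (penalty proportional to the current scale)
# with a residual-based scale update.
n, p, s = 200, 500, 5
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:s] = 1.0
y = X @ beta_true + 0.5 * rng.normal(size=n)

lam0 = np.sqrt(2 * np.log(p) / n)    # universal penalty level
sigma = np.std(y)                    # crude initial scale
for _ in range(20):
    fit = Lasso(alpha=sigma * lam0, max_iter=5000).fit(X, y)
    sigma_new = np.sqrt(np.mean((y - fit.predict(X)) ** 2))
    if abs(sigma_new - sigma) < 1e-6:
        break
    sigma = sigma_new

print("estimated noise scale:", sigma)              # true value is 0.5
print("selected coefficients:", np.flatnonzero(fit.coef_)[:10])
```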

 

 

 

·         Jacob Bien, USC

 

 

 

Reluctant Interaction Modeling

Including pairwise interactions between the predictors of a regression model can produce models that predict better. However, fitting such interaction models on typical data sets in biology and other fields can require solving enormous variable selection problems with billions of interactions. The scale of such problems demands methods that are computationally cheap (both in time and memory) yet still have sound statistical properties. Motivated by these large-scale problem sizes, we adopt a very simple guiding principle: one should prefer a main effect over an interaction if all else is equal. This "reluctance" toward interactions, while reminiscent of the hierarchy principle for interactions, is much less restrictive. We design a computationally efficient method built upon this principle and provide theoretical results indicating favorable statistical properties. Empirical results show dramatic computational improvement without sacrificing statistical properties. For example, the proposed method can solve a problem with 10 billion interactions with 5-fold cross-validation in under 7 hours on a single CPU. This is joint work with Guo Yu and Ryan Tibshirani.
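A hedged sketch of the reluctant idea, as the abstract describes it: fit main effects first, let interactions compete only for what the main effects leave unexplained, then refit. The staging below and its tuning choices (LassoCV, ten screened candidates) are illustrative assumptions, not the authors' exact algorithm.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(4)

# Stage 0: simulate data with two main effects and one strong interaction.
n, p = 500, 30
X = rng.normal(size=(n, p))
y = X[:, 0] + X[:, 1] + 2 * X[:, 0] * X[:, 1] + rng.normal(size=n)

# Stage 1: main effects only; interactions get no say yet.
main_fit = LassoCV(cv=5).fit(X, y)
r = y - main_fit.predict(X)

# Stage 2: screen interactions by how much of the residual they explain.
pairs = list(combinations(range(p), 2))
scores = [abs(np.dot(X[:, j] * X[:, k], r)) for j, k in pairs]
top = [pairs[i] for i in np.argsort(scores)[-10:]]   # keep 10 candidates

# Stage 3: refit with main effects plus the surviving interactions.
Z = np.column_stack([X] + [X[:, j] * X[:, k] for j, k in top])
final_fit = LassoCV(cv=5).fit(Z, y)
kept = [top[i - p] for i in np.flatnonzero(final_fit.coef_) if i >= p]
print("interactions kept:", kept)   # should include (0, 1)
```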

 

 

 

 

·         Arbel Harpak,  Simons Society Fellow and Postdoctoral Researcher, Columbia University

 

Interpreting and deconstructing polygenic scores

 

A polygenic score is a predictor of a person’s trait value computed from his or her genotype. Polygenic scores sum over the genetic effects of the alleles carried by a person—as estimated in a genome-wide association study (GWAS) for the trait of interest.  Fields as diverse as clinical risk prediction, evolutionary genetics, social sciences and embryo selection are rapidly adopting polygenic scores. 
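The computation behind a polygenic score is a single weighted sum, as the toy example below shows; the genotypes and effect sizes here are simulated stand-ins, not GWAS output.

```python
import numpy as np

rng = np.random.default_rng(5)

# PGS_i = sum_j beta_j * g_ij, where g_ij in {0, 1, 2} counts the effect
# alleles person i carries at variant j and beta_j is the effect size
# estimated in a GWAS. Everything below is simulated.
n_people, n_variants = 1000, 5000
freqs = rng.uniform(0.05, 0.5, n_variants)            # allele frequencies
G = rng.binomial(2, freqs, size=(n_people, n_variants))
beta_hat = rng.normal(0.0, 0.01, n_variants)          # "GWAS" estimates

pgs = G @ beta_hat                                    # one score per person
print("first five scores:", pgs[:5])
```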

 

I will show that the prediction accuracy of polygenic scores can be highly sensitive to tiny biases in GWAS effect estimates, and further that the prediction accuracy of polygenic scores depends on characteristics such as the socio-economic status, age or sex of the people in which the GWAS and the prediction are conducted. These dependencies highlight the complexities of interpreting polygenic scores and the potential for serious inequities in their application in the clinic and beyond.

 

A key reason for these dependencies is that GWAS estimates are also influenced by factors other than direct genetic effects, including population-structure confounding, mating patterns, indirect genetic effects of relatives, and gene-by-environment interactions. I will discuss the development of tools to tease apart the different factors contributing to GWAS associations, and ultimately to improve the predictive ability and the interpretation of polygenic scores.

 

 

·         Yoav Zemel, Statistical Laboratory, University of Cambridge

 

Optimal Transport: Fast Probabilistic Approximation with Exact Solvers

 

We propose a simple subsampling scheme for fast randomized approximate computation of optimal transport distances on finite spaces. This scheme operates on a random subset of the full data and can use any exact algorithm as a black-box back-end, including state-of-the-art solvers and entropically penalized versions. It is based on averaging the exact distances between empirical measures generated from independent samples from the original measures and can easily be tuned towards higher accuracy or shorter computation times. To this end, we give non-asymptotic deviation bounds for its accuracy in the case of discrete optimal transport problems. In particular, we show that in many important instances, including images (2D-histograms), the approximation error is independent of the size of the full problem. We present numerical experiments that demonstrate that a very good approximation in typical applications can be obtained in a computation time that is several orders of magnitude smaller than what is required for exact computation of the full problem.
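A minimal sketch of the subsampling scheme as described above: draw independent samples from each measure, solve the exact optimal transport problem between the resulting empirical measures with a black-box solver, and average over repetitions. Here scipy's assignment solver plays the role of the exact back-end (valid because equal-size uniform empirical measures reduce OT to an assignment problem); the sample size and repetition count are illustrative choices.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(6)

def subsampled_ot(xs_all, ys_all, s=100, reps=10):
    """Average exact OT cost (squared Euclidean) between size-s empirical
    measures drawn from two point clouds with uniform weights. For equal
    sample sizes the exact problem is an assignment problem, so scipy's
    solver serves as the exact black-box back-end."""
    costs = []
    for _ in range(reps):
        xs = xs_all[rng.choice(len(xs_all), s)]
        ys = ys_all[rng.choice(len(ys_all), s)]
        C = ((xs[:, None, :] - ys[None, :, :]) ** 2).sum(-1)
        rows, cols = linear_sum_assignment(C)
        costs.append(C[rows, cols].mean())
    return float(np.mean(costs))

# Two big point clouds standing in for discrete measures (e.g. histograms).
a = rng.normal(0.0, 1.0, (5000, 2))
b = rng.normal(1.0, 1.0, (5000, 2))
print(subsampled_ot(a, b))   # approximates the squared W2 distance (= 2)
```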

 

 

·         Yaniv Romano, Stanford


Reliability, Equity, and Reproducibility in Modern Machine Learning 

Modern machine learning algorithms have achieved remarkable performance in a myriad of applications, and are increasingly used to make impactful decisions in the hiring process, criminal sentencing, and healthcare diagnostics, and even to make new scientific discoveries. The use of data-driven algorithms in high-stakes applications is exciting yet alarming: these methods are extremely complex, often brittle, and notoriously hard to analyze and interpret. Naturally, concerns have been raised about the reliability, fairness, and reproducibility of the output of such algorithms. This talk introduces statistical tools that can be wrapped around any “black-box” algorithm to provide valid inferential results while taking advantage of their impressive performance. We present novel developments in conformal prediction and quantile regression, which rigorously guarantee the reliability of complex predictive models, and show how these methodologies can be used to treat individuals equitably. Next, we focus on reproducibility and introduce an operational selective inference tool that builds upon the knockoff framework and leverages recent progress in deep generative models. This methodology allows for reliable identification of a subset of important features that is likely to explain a phenomenon under study in a challenging setting where the data distribution is unknown, e.g., mutations that are truly linked to changes in drug resistance.
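One of the methodologies mentioned, conformalized quantile regression, can be sketched in a few lines: fit lower and upper quantile regressors with any black-box learner, then calibrate the band on held-out data so that coverage is guaranteed. The learner, split sizes and data below are assumptions for illustration, not the talk's experiments.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(7)

# Heteroscedastic toy data: the noise level grows with |x|.
n = 3000
X = rng.uniform(-3, 3, (n, 1))
y = np.sin(X[:, 0]) + np.abs(X[:, 0]) * rng.normal(size=n)
tr, cal, te = np.split(rng.permutation(n), [1500, 2500])
alpha = 0.1

# Black-box lower and upper quantile regressors.
lo = GradientBoostingRegressor(loss="quantile", alpha=alpha / 2).fit(X[tr], y[tr])
hi = GradientBoostingRegressor(loss="quantile", alpha=1 - alpha / 2).fit(X[tr], y[tr])

# Conformity scores: how far calibration points fall outside the band.
scores = np.maximum(lo.predict(X[cal]) - y[cal], y[cal] - hi.predict(X[cal]))
q = np.quantile(scores, (1 - alpha) * (1 + 1 / len(cal)))

# Widen the band by q; coverage >= 1 - alpha holds regardless of the model.
inside = (y[te] >= lo.predict(X[te]) - q) & (y[te] <= hi.predict(X[te]) + q)
print("empirical coverage on test data:", inside.mean())
```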

 

·         Felix Abramovich, TAU

 

High-dimensional classification by sparse logistic regression

 

In this talk we consider high-dimensional classification. We first discuss high-dimensional binary classification by sparse logistic regression, propose a model/feature selection procedure based on penalized maximum likelihood with a complexity penalty on the model size, and derive non-asymptotic bounds for the resulting misclassification excess risk. Implementation of any complexity-penalty-based criterion, however, requires a combinatorial search over all possible models. To find a model selection procedure that is computationally feasible for high-dimensional data, we consider the logistic Lasso and Slope classifiers and show that they also achieve the optimal rate. We then extend the proposed approach to multiclass classification by sparse multinomial logistic regression.

 

This is joint work with Vadim Grinshtein and Tomer Levy.
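A logistic Lasso classifier of the kind studied in the talk can be run directly with l1-penalized logistic regression; the sketch below shows feature selection and classification in one step on simulated data. The penalty level is an arbitrary illustrative choice, and none of the talk's theory is reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(8)

# Sparse high-dimensional setup: only the first 10 of 1000 features matter.
n, p, s = 300, 1000, 10
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:s] = 2.0
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ beta))))

# Logistic Lasso: the l1 penalty performs feature selection inside the fit.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("features selected:", np.flatnonzero(clf.coef_))
print("training accuracy:", clf.score(X, y))
```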

 

 

·         Nitay Alon, TAU

 

 

The effect of drift on the Skorokhod-embedded distribution of stopped Brownian Motion

 

 

For some solutions of the Skorokhod Embedding Problem of a density $f_0(\cdot)$ in standard Brownian motion, if the drift changes from zero to a small $\mu$, the embedded density will be approximately proportional to $f_0(x)e^{\mu x}$. This allows the application of HMMs to financial-type data without assuming a parametric model for the Markov-modulated distributions. This model, the Skorokhod semiparametric hidden Markov model, is presented in detail, including a cascade for maximum likelihood estimation consisting of an outer grid search over the $\mu$-related parameters and an inner Baum-Welch-type algorithm. Various relevant embedding stopping times (Azéma-Yor, Chacon-Walsh, Dubins, Rost and Root) will be briefly outlined.
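The tilting statement can be checked numerically: multiply a density by $e^{\mu x}$ and renormalize. The sketch below does this on a grid, with a standard normal as a stand-in for $f_0$; the choice of $f_0$ and $\mu$ are assumptions for illustration, not the talk's embedding.

```python
import numpy as np

# Tilt a density f0 by exp(mu * x) and renormalize on a grid.
x = np.linspace(-6, 6, 2001)
dx = x[1] - x[0]
f0 = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)   # stand-in target density

mu = 0.1
tilted = f0 * np.exp(mu * x)
tilted /= tilted.sum() * dx                     # renormalize to a density

# For this Gaussian stand-in the tilted mean shifts by exactly mu.
print("mean under the tilted density:", np.sum(x * tilted) * dx)
```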