17 March
Felix Abramovich
High-dimensional classification by sparse logistic regression

24 March
Nitay Alon, TAU
The effect of drift on the Skorokhod-embedded distribution of stopped Brownian Motion

12 May
Jeff Wu, Georgia Tech

19 May
Taeho Kim, Haifa University

2 June
Andrea Saltelli, University of Bergen

29 October
Daniel Yekutieli, TAU
Hierarchical Bayes Modeling for Large-Scale Inference

26 November
Noa Molshatzki, USC
Methods to Identify Key Predictors and Interactions Using Machine Learning

10 December
Jacob Bien, USC
Reluctant Interaction Modeling

24 December
Arbel Harpak, Columbia University
Interpreting and deconstructing polygenic scores

31 December
Yoav Zemel, Statistical Laboratory, University of Cambridge
Optimal Transport: Fast Probabilistic Approximation with Exact Solvers

7 January
Christian Müller, Ludwig-Maximilians-University Munich
Perspective M-Estimation: Constructions, optimization, and biological applications

9 January
Yaniv Romano, Stanford
Reliability, Equity, and Reproducibility in Modern Machine Learning

14 January
Eitan Greenshtein, Central Bureau of Statistics, Israel
Generalized Maximum Likelihood Estimators and their applications to stratified sampling and post-stratification with many unobserved strata
Seminars are held on Tuesdays, 10:30 am, Schreiber Building, Room 309 (see the TAU map). The seminar organizer is Daniel Yekutieli.
To join the seminar mailing list, or for any other inquiries, please call (03)-6409612 or email 12345yekutiel@post.tau.ac.il54321 (remove the numbers unless you are a spammer…)
Seminars from previous years
ABSTRACTS
· Daniel Yekutieli, TAU
Hierarchical Bayes Modeling for Large-Scale Inference
Bayesian modeling is now ubiquitous in problems of large-scale inference, even when frequentist criteria are in mind for evaluating the performance of a procedure. By far the most popular in the statistical literature of the past decade and a half are empirical Bayes methods, which have been shown in practice to improve significantly over strictly-frequentist competitors in many different problems. As an alternative to empirical Bayes methods, in this paper we propose hierarchical Bayes modeling for large-scale problems, and address two separate points that, in our opinion, deserve more attention. The first concerns nonparametric “deconvolution” methods that are applicable also outside the sequence model. The second is the adequacy of Bayesian modeling for situations where the parameters are by assumption deterministic. We provide partial answers to both: first, we demonstrate how our methodology applies in the analysis of a logistic regression model. Second, we appeal to Robbins's compound decision theory and provide an extension, to give formal justification for the Bayesian approach in the sequence case.
· Noa Molshatzki, USC
Methods to Identify Key Predictors and Interactions Using Machine Learning
Gradient Boosting Model (GBM) is a tree-based machine learning method that can be applied to explore large healthcare datasets. GBM automatically accounts for nonlinearities and interactions but the model is not directly interpretable. Importance statistics are often applied to understand key predictors and interactions, but these statistics are biased towards predictors with many categories and have no threshold to extract important associations. A solution is to create a reference null distribution by repeatedly calculating importance statistics under an altered outcome. In this work, we (1) apply existing methods to identify key predictors and interactions with GBM, (2) propose a novel improvement to identify key interactions, and (3) apply the methods to a large healthcare dataset.
We used a simulation study to assess the ability of the methods to correctly identify true associations. We analyzed data from Kaiser Permanente Southern California (KPSC) electronic medical records of ~70,000 pregnant women to detect determinants of gestational diabetes mellitus (GDM).
In the simulation study, we identified important predictors, interactions and nonlinearities while maintaining low false discovery. Our novel method was computationally efficient (short run time) and had good discovery performance compared to the existing approach. In KPSC data, we identified known GDM risk factors and potentially novel nonlinearity and interactions. In conclusion, the reference null approach is an important tool for GBM interpretation.
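The reference-null recipe described above can be sketched in a few self-contained lines. As a stand-in for a GBM importance statistic, the toy importance below is just the absolute covariance of each predictor with the outcome; the permutation loop and the max-over-predictors threshold are the part the abstract describes (an illustrative sketch, not the authors' implementation):

```python
import random

def importance(xcols, y):
    """Toy stand-in for a GBM importance statistic:
    absolute covariance of each predictor with the outcome."""
    n = len(y)
    ybar = sum(y) / n
    scores = []
    for col in xcols:
        xbar = sum(col) / n
        cov = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(col, y)) / n
        scores.append(abs(cov))
    return scores

def null_threshold(xcols, y, n_perm=200, seed=0):
    """Reference null: recompute importances under a permuted (altered)
    outcome and use the 95th percentile of the per-permutation maxima
    as a threshold for declaring a predictor important."""
    rng = random.Random(seed)
    yp = list(y)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(yp)                      # breaks any X-y association
        maxima.append(max(importance(xcols, yp)))
    maxima.sort()
    return maxima[int(0.95 * len(maxima))]

# Synthetic check: x0 drives the outcome, x1 is pure noise.
rng = random.Random(1)
x0 = [rng.gauss(0, 1) for _ in range(300)]
x1 = [rng.gauss(0, 1) for _ in range(300)]
y = [a + rng.gauss(0, 0.5) for a in x0]
obs = importance([x0, x1], y)
thr = null_threshold([x0, x1], y)
selected = [j for j, s in enumerate(obs) if s > thr]
print(selected)
```

The same wrapper works with any importance statistic: only predictors whose observed importance exceeds the null threshold are retained.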
· Eitan Greenshtein, Central Bureau of Statistics, Israel
Generalized Maximum Likelihood Estimators and their applications to stratified sampling and post-stratification with many unobserved strata
Consider the problem of estimating a weighted average of the means of $n$ strata, based on a random sample with realized $K_i$ observations from stratum $i, \; i=1,...,n$.
This task is non-trivial in cases where, for a significant portion of the strata, the corresponding $K_i=0$. Such a situation may arise in post-stratification, when a very fine stratification is desired. A fine stratification could be desired so that assumptions or approximations, such as Missing At Random conditional on strata, become appealing. It could also be desired in observational studies, in order to estimate the average treatment effect by averaging the effects in small and homogeneous strata.
Our approach is based on applying Generalized Maximum Likelihood Estimators (GMLE), and ideas related to Non-Parametric Empirical Bayes, in order to estimate the means of strata $i$ with corresponding $K_i=0$. There are no assumptions about a relation between the means of the unobserved strata (i.e., those with $K_i=0$) and those of the observed strata. The performance of our approach is demonstrated both in simulations and on a real data set. Some consistency and asymptotic results are also presented. In addition, related basic results about GMLE estimation of the mean of mixtures of exponential families are provided.
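To make the GMLE idea concrete, here is a minimal sketch in a simplified Gaussian setting: observations y_i ~ N(theta_i, 1) with theta_i drawn from an unknown prior G, and the GMLE of G computed by EM over a fixed support grid. The grid, the simulated data, and the Gaussian likelihood are illustrative assumptions, not the talk's actual stratified-sampling setup:

```python
import math, random

def npmle_em(y, grid, iters=300):
    """GMLE (NPMLE) of the mixing distribution G in the model
    y_i ~ N(theta_i, 1), theta_i ~ G, via EM over a fixed support grid."""
    w = [1.0 / len(grid)] * len(grid)
    lik = [[math.exp(-0.5 * (yi - t) ** 2) for t in grid] for yi in y]
    for _ in range(iters):
        new_w = [0.0] * len(grid)
        for row in lik:                       # E-step: posterior over grid points
            denom = sum(wj * lj for wj, lj in zip(w, row))
            for j in range(len(grid)):
                new_w[j] += w[j] * row[j] / denom
        w = [v / len(y) for v in new_w]       # M-step: renormalize
    return w

# Two groups of stratum means (0 and 3); the fitted prior's mean serves as
# an estimate for strata with no observations.
rng = random.Random(0)
truth = [0.0] * 50 + [3.0] * 50
y = [t + rng.gauss(0, 1) for t in truth]
grid = [i * 0.25 for i in range(-8, 25)]      # support points from -2 to 6
w = npmle_em(y, grid)
prior_mean = sum(wj * t for wj, t in zip(w, grid))
print(round(prior_mean, 2))
```

The estimated mixing distribution makes no assumption relating observed and unobserved strata; its mean is close to the overall average of the stratum means.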
· Christian Müller, Ludwig-Maximilians-University Munich
Perspective M-Estimation: Constructions, optimization, and biological applications
In high-dimensional statistics, finding the maximum likelihood estimate associated with a statistical model often entails solving a (convex) non-smooth optimization problem. One particular model for maximum likelihood-type estimation (M-estimation) which generalizes a large class of well-known estimators, including Huber's concomitant M-estimators and the scaled Lasso, is the perspective M-estimation model. Perspective M-estimation leverages the observation that convex M-estimators with concomitant scale as well as various regularizers are instances of perspective functions and is thus amenable to efficient global optimization. We extend this model to allow for regression models with compositional covariate data which are commonplace in biology, including microbiome and metabolomics data. We introduce new perspective M-estimators that can handle outliers in outcome variables and heteroscedasticity in the covariates and show how to solve the associated non-smooth optimization problem with proximal algorithms. We find excellent empirical performance of the estimators on synthetic and real-world prediction tasks involving human gut and soil microbiome data.
This is joint work with Patrick L. Combettes, NC State.
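One concrete member of this family is the scaled Lasso, whose perspective objective ||y − Xβ||²/(2nσ) + σ/2 + λ||β||₁ is jointly convex in β and σ. The plain alternating sketch below (coordinate descent in β at fixed scale, then the closed-form concomitant update σ = ||y − Xβ||/√n) is a simplified illustration, not the proximal machinery of the talk:

```python
import math, random

def soft(z, t):
    """Soft-thresholding, the proximal map of the l1 penalty."""
    return math.copysign(max(abs(z) - t, 0.0), z)

def scaled_lasso(X, y, lam, iters=200):
    """Alternate a coordinate-descent Lasso sweep (beta, at fixed sigma)
    with the closed-form concomitant scale update sigma = ||y-Xb||/sqrt(n)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)
    sigma = math.sqrt(sum(r * r for r in resid) / n)
    for _ in range(iters):
        for j in range(p):
            xj = [row[j] for row in X]
            sj = sum(v * v for v in xj) / n
            z = beta[j] + sum(v * r for v, r in zip(xj, resid)) / (n * sj)
            bj = soft(z, lam * sigma / sj)
            if bj != beta[j]:
                resid = [r - (bj - beta[j]) * v for r, v in zip(resid, xj)]
                beta[j] = bj
        sigma = math.sqrt(sum(r * r for r in resid) / n)
    return beta, sigma

# Synthetic data: y depends only on the first feature; noise sd is 1.
rng = random.Random(2)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(100)]
y = [2 * row[0] + rng.gauss(0, 1) for row in X]
beta, sigma = scaled_lasso(X, y, lam=0.2)
print([round(b, 1) for b in beta], round(sigma, 1))
```

The appeal of the concomitant formulation is that the noise scale σ is estimated jointly with β rather than plugged in, which is exactly what the perspective-function view makes tractable.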
· Jacob Bien, USC
Reluctant Interaction Modeling
Including pairwise interactions between the predictors of a regression model can produce better predicting models. However, to fit such interaction models on typical data sets in biology and other fields can often require solving enormous variable selection problems with billions of interactions. The scale of such problems demands methods that are computationally cheap (both in time and memory) yet still have sound statistical properties. Motivated by these large-scale problem sizes, we adopt a very simple guiding principle: One should prefer a main effect over an interaction if all else is equal. This "reluctance" to interactions, while reminiscent of the hierarchy principle for interactions, is much less restrictive. We design a computationally efficient method built upon this principle and provide theoretical results indicating favorable statistical properties. Empirical results show dramatic computational improvement without sacrificing statistical properties. For example, the proposed method can solve a problem with 10 billion interactions with 5-fold cross-validation in under 7 hours on a single CPU. This is joint work with Guo Yu and Ryan Tibshirani.
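The guiding principle can be sketched as a two-stage procedure: fit main effects only, then screen candidate interactions by how strongly their products correlate with the stage-one residual. Everything below (the gradient-descent least-squares fit, the screening rule, the synthetic data) is an illustrative simplification, not the authors' algorithm:

```python
import random

def ols_gd(X, y, steps=2000, lr=0.05):
    """Least-squares fit by gradient descent (keeps the sketch dependency-free)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(steps):
        resid = [yi - sum(b * v for b, v in zip(beta, row)) for row, yi in zip(X, y)]
        grad = [-2.0 * sum(r * row[j] for r, row in zip(resid, X)) / n for j in range(p)]
        beta = [b - lr * g for b, g in zip(beta, grad)]
    return beta

def abs_corr(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return abs(cov / (su * sv))

# Simulated data with a true x0*x1 interaction.
rng = random.Random(0)
n, p = 400, 3
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [row[0] + row[1] + 2 * row[0] * row[1] + rng.gauss(0, 0.5) for row in X]

beta = ols_gd(X, y)                                   # stage 1: main effects only
resid = [yi - sum(b * v for b, v in zip(beta, row)) for row, yi in zip(X, y)]
pairs = [(j, k) for j in range(p) for k in range(j, p)]
scores = {(j, k): abs_corr(resid, [row[j] * row[k] for row in X]) for j, k in pairs}
best = max(scores, key=scores.get)                    # stage 2: screen interactions
print(best)
```

Because only products that explain residual signal are ever considered, the screen never has to materialize all pairwise interactions at once, which is what makes the billions-of-interactions scale feasible.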
· Arbel Harpak, Simons Society Fellow and Postdoctoral Researcher, Columbia University
Interpreting and deconstructing polygenic scores
A polygenic score is a predictor of a person’s trait value computed from his or her genotype. Polygenic scores sum over the genetic effects of the alleles carried by a person—as estimated in a genome-wide association study (GWAS) for the trait of interest. Fields as diverse as clinical risk prediction, evolutionary genetics, social sciences and embryo selection are rapidly adopting polygenic scores.
I will show that the prediction accuracy of polygenic scores can be highly sensitive to tiny biases in GWAS effect estimates, and further that the prediction accuracy of polygenic scores depends on characteristics such as the socio-economic status, age or sex of the people in whom the GWAS and the prediction are conducted. These dependencies highlight the complexities of interpreting polygenic scores and the potential for serious inequities in their application in the clinic and beyond.
A key reason for these dependencies is that GWAS estimates are also influenced by factors other than direct genetic effects—including population structure confounding, mating patterns, indirect genetic effects of relatives and other gene-by-environment interactions. I will discuss the development of tools to tease apart the different factors contributing to GWAS associations, and ultimately improve the prediction ability and the interpretation of polygenic scores.
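The basic object is simple to compute: a weighted sum of a person's effect-allele counts, with weights taken from GWAS effect-size estimates. In this sketch the variant IDs and effect sizes are made up purely for illustration:

```python
def polygenic_score(genotype, effects):
    """Polygenic score: sum of GWAS effect-size estimates, each weighted by
    the number of effect alleles (0, 1 or 2) carried at that variant."""
    return sum(eff * genotype.get(variant, 0) for variant, eff in effects.items())

# Hypothetical effect estimates and one person's genotype (illustrative only).
effects = {"rs1": 0.10, "rs2": -0.05, "rs3": 0.20}
genotype = {"rs1": 2, "rs2": 1, "rs3": 0}
score = polygenic_score(genotype, effects)
print(round(score, 2))
```

The fragility the talk describes enters through the `effects` dictionary: small systematic biases in those estimates propagate linearly into every score computed from them.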
· Yoav Zemel, Statistical Laboratory, University of Cambridge
Optimal Transport: Fast Probabilistic Approximation with Exact Solvers
We propose a simple subsampling scheme for fast randomized approximate computation of optimal transport distances on finite spaces. This scheme operates on a random subset of the full data and can use any exact algorithm as a black-box back-end, including state-of-the-art solvers and entropically penalized versions. It is based on averaging the exact distances between empirical measures generated from independent samples from the original measures and can easily be tuned towards higher accuracy or shorter computation times. To this end, we give non-asymptotic deviation bounds for its accuracy in the case of discrete optimal transport problems. In particular, we show that in many important instances, including images (2D-histograms), the approximation error is independent of the size of the full problem. We present numerical experiments that demonstrate that a very good approximation in typical applications can be obtained in a computation time that is several orders of magnitude smaller than what is required for exact computation of the full problem.
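In one dimension the exact solver is just a sort, which makes the subsampling scheme easy to sketch: draw small independent samples from each measure, run the exact solver on the subsample, and average over repetitions. The code below is a toy 1-D illustration of this idea (the paper's solvers and deviation bounds cover general finite spaces):

```python
import random

def w1_exact_1d(x, y):
    """Exact 1-D Wasserstein-1 distance between equal-size empirical
    measures: monotone matching of sorted samples is optimal in 1-D."""
    return sum(abs(a - b) for a, b in zip(sorted(x), sorted(y))) / len(x)

def w1_subsampled(x, y, s=100, reps=20, seed=0):
    """Subsampling approximation: run the exact solver (here the 1-D sort)
    on independent samples of size s from each measure, then average."""
    rng = random.Random(seed)
    vals = [w1_exact_1d([rng.choice(x) for _ in range(s)],
                        [rng.choice(y) for _ in range(s)])
            for _ in range(reps)]
    return sum(vals) / reps

# Two large empirical measures; the second is the first shifted by 1,
# so the true W1 distance is 1.
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(20000)]
y = [rng.gauss(1, 1) for _ in range(20000)]
approx = w1_subsampled(x, y)
print(round(approx, 2))
```

The parameters `s` and `reps` play the tuning role described in the abstract: larger values trade computation time for accuracy, while the full 20000-point problem is never solved.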
· Yaniv Romano, Stanford
Reliability, Equity, and Reproducibility in Modern Machine Learning
Modern machine learning algorithms have achieved remarkable performance in a myriad of applications, and are increasingly used to make impactful decisions in the hiring process, criminal sentencing, healthcare diagnostics, and even to make new scientific discoveries. The use of data-driven algorithms in high-stakes applications is exciting yet alarming: these methods are extremely complex, often brittle, and notoriously hard to analyze and interpret. Naturally, concerns have been raised about the reliability, fairness, and reproducibility of the output of such algorithms. This talk introduces statistical tools that can be wrapped around any “black-box” algorithm to provide valid inferential results while taking advantage of their impressive performance. We present novel developments in conformal prediction and quantile regression, which rigorously guarantee the reliability of complex predictive models, and show how these methodologies can be used to treat individuals equitably. Next, we focus on reproducibility and introduce an operational selective inference tool that builds upon the knockoff framework and leverages recent progress in deep generative models. This methodology allows for reliable identification of a subset of important features that is likely to explain a phenomenon under study in a challenging setting where the data distribution is unknown, e.g., mutations that are truly linked to changes in drug resistance.
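To give a flavor of the conformal part of the talk, here is a minimal split-conformal sketch: any black-box regressor is fit on one half of the data, absolute residuals on the other half calibrate the interval width, and the resulting band has finite-sample marginal coverage of at least 1 − α. The toy least-squares "black box" and the simulated data are my own assumptions:

```python
import math, random

def split_conformal(fit, x_train, y_train, x_cal, y_cal, alpha=0.1):
    """Split conformal prediction: fit a black-box regressor on one half,
    compute absolute residuals on the calibration half, and return a
    function mapping x to an interval with marginal coverage >= 1-alpha."""
    model = fit(x_train, y_train)
    scores = sorted(abs(yi - model(xi)) for xi, yi in zip(x_cal, y_cal))
    k = math.ceil((len(scores) + 1) * (1 - alpha))   # conformal quantile index
    q = scores[min(k, len(scores)) - 1]
    return lambda x: (model(x) - q, model(x) + q)

def fit_line(xs, ys):
    """Toy black box: simple least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return lambda x: a + b * x

# Simulate, split into train/calibration halves, then check coverage
# on fresh data.
rng = random.Random(0)
xs = [rng.uniform(-2, 2) for _ in range(2000)]
ys = [2 * x + rng.gauss(0, 1) for x in xs]
band = split_conformal(fit_line, xs[:1000], ys[:1000], xs[1000:], ys[1000:])
xt = [rng.uniform(-2, 2) for _ in range(1000)]
yt = [2 * x + rng.gauss(0, 1) for x in xt]
cover = sum(1 for x, y in zip(xt, yt) if band(x)[0] <= y <= band(x)[1]) / len(xt)
print(round(cover, 2))
```

The guarantee holds no matter what `fit` is, which is the "wrapped around any black box" property the abstract emphasizes.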
· Felix Abramovich
High-dimensional classification by sparse logistic regression
In this talk we consider high-dimensional classification. We first discuss high-dimensional binary classification by sparse logistic regression, propose a model/feature selection procedure based on penalized maximum likelihood with a complexity penalty on the model size, and derive non-asymptotic bounds for the resulting misclassification excess risk. Implementation of any complexity-penalty-based criterion, however, requires a combinatorial search over all possible models. To obtain a model selection procedure computationally feasible for high-dimensional data, we consider logistic Lasso and Slope classifiers and show that they also achieve the optimal rate. We further extend the proposed approach to multiclass classification by sparse multinomial logistic regression.
This is joint work with Vadim Grinshtein and Tomer Levy.
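A logistic Lasso classifier of the kind discussed can be fit with proximal gradient descent (ISTA): a gradient step on the average logistic loss followed by soft-thresholding. The sketch below, on simulated data with two truly active features, is purely illustrative of this computationally feasible alternative to combinatorial model search:

```python
import math, random

def soft(z, t):
    """Soft-thresholding operator, the proximal map of the l1 penalty."""
    return math.copysign(max(abs(z) - t, 0.0), z)

def logistic_lasso(X, y, lam, steps=500, lr=0.5):
    """ISTA for l1-penalized logistic regression: gradient step on the
    average logistic loss, then coordinate-wise soft-thresholding."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(steps):
        probs = [1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, row))))
                 for row in X]
        grad = [sum((pi - yi) * row[j] for pi, yi, row in zip(probs, y, X)) / n
                for j in range(p)]
        beta = [soft(b - lr * g, lr * lam) for b, g in zip(beta, grad)]
    return beta

# Simulated data: only features 0 and 1 affect the class label.
rng = random.Random(0)
n, p = 500, 10
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [1 if rng.random() < 1.0 / (1.0 + math.exp(-(2 * row[0] - 2 * row[1]))) else 0
     for row in X]
beta = logistic_lasso(X, y, lam=0.05)
support = [j for j, b in enumerate(beta) if abs(b) > 0.1]
print(support)
```

The l1 penalty zeroes out the noise coordinates while keeping the two informative ones, which is the feature-selection behavior the optimality results concern.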
· Nitay Alon, TAU
The effect of drift on the Skorokhod-embedded distribution of stopped Brownian Motion
For some solutions of the Skorokhod Embedding Problem of a density f0(·) in standard Brownian motion, if the drift changes from zero to a small µ, the embedded density will be approximately proportional to f0(x) exp(µx). This allows the application of Hidden Markov Models (HMM) to financial-type data without assuming a parametric model for the Markov-modulated distributions. This Skorokhod Semiparametric Hidden Markov Model is presented in detail, including a cascade for MLE consisting of an outer grid search over the µ-related parameters and an inner Baum-Welch type algorithm. Various relevant embedding stopping times (Azéma-Yor, Chacon-Walsh, Dubins, Rost and Root) will be briefly outlined.
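The tilting relation at the heart of the talk — the embedded density moving from f0(x) to something approximately proportional to f0(x) exp(µx) under a small drift µ — is easy to illustrate numerically on a toy discrete density (my own example, not taken from the talk):

```python
import math

def tilt(density, mu):
    """Exponential tilt of a discrete density:
    f_mu(x) proportional to f0(x) * exp(mu * x), renormalized."""
    weights = {x: f * math.exp(mu * x) for x, f in density.items()}
    z = sum(weights.values())
    return {x: w / z for x, w in weights.items()}

f0 = {-1: 0.25, 0: 0.5, 1: 0.25}     # toy embedded density under zero drift
f_mu = tilt(f0, 0.2)                  # small positive drift
mean0 = sum(x * f for x, f in f0.items())
mean_mu = sum(x * f for x, f in f_mu.items())
print(mean0 < mean_mu)                # True
```

A small positive drift shifts probability mass upward, exactly the distortion the semiparametric HMM exploits to separate regimes without a parametric model.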