Department of Statistics & Operations Research

Statistics Seminars  2020/2021

 

To subscribe to the list, please follow this link or send email to 12345yekutiel@tauex.tau.ac.il54321 (remove numbers unless you are a spammer…)

 

Second Semester

16 March

Tirza Routtenberg, BGU

Performance Bounds for Estimation After Model Selection

Abstract

Zoom recording

6 April

Nir Keret, TAU

Optimal Cox Regression Subsampling Procedure with Rare Events

Abstract

Zoom recording

13 April

Malgorzata Bogdan, Wroclaw U. of Science and Technology

Ghost Quantitative Trait Loci and hotspots: What might happen if the signal is not sparse?

Abstract

Zoom recording

18 May

Gil Kur, MIT

On the Minimal Error of Empirical Risk Minimization

Abstract

Zoom recording

25 May

Vladimir Vovk, Royal Holloway, University of London

 

Abstract

Zoom recording

1 June

Assaf Rabinowicz, TAU

 

Abstract

Zoom recording

8 June

David Steinberg, TAU

 

Abstract

Zoom recording

 

 

 

 

 

 

 

First Semester

20 October

Felix Abramovich, TAU

High-dimensional classification by sparse logistic regression

Abstract

Zoom recording

27 October

Taeho Kim, Haifa University

Improved Multiple Confidence Intervals via Thresholding Informed by Prior Information

Abstract

Zoom recording

12 November

Somabha Mukherjee, UPenn

Statistical Inference on Dependent Combinatorial Data: The Ising Model

Abstract

Zoom recording

17 November

Amit Moscovich, Princeton

Nonparametric estimation of high-dimensional shape spaces with applications to structural biology

Abstract

Zoom recording

1 December

Ruth Heller, TAU

Optimal control of false discovery criteria in the two-group model

Abstract

Zoom recording

8 December  (6:30pm)

Alon Kipnis, Stanford

Two-sample problem for large, sparse, high-dimensional distributions under rare/weak perturbations

Abstract

Zoom recording

15 December

Yves Rozenholc, Paris Descartes

Differential Analysis in Transcriptomic: the Strength of Randomly Picking so-called Reference Genes

Abstract

Zoom recording

29 December

Dan Vilenchik, Ben Gurion U.

Computational-statistical tradeoffs in the problem of finding sparse Principal Components in high-dimensional data

Abstract

 

5 January

Rui Castro

 

 

 

 

 

 


Seminars are held on Tuesdays, 10.30 am, Schreiber Building, 309 (see the TAU map). The seminar organizer is Daniel Yekutieli.

To join the seminar mailing list, or for any other inquiries, please call (03)-6409612 or email 12345yekutiel@post.tau.ac.il54321 (remove numbers unless you are a spammer…)

 


Seminars from previous years


 

 

 

ABSTRACTS

 

 

·         Felix Abramovich, TAU

 

 High-dimensional classification by sparse logistic regression

 

In this talk we consider high-dimensional classification. We first discuss high-dimensional binary classification by sparse logistic regression, propose a model/feature selection procedure based on penalized maximum likelihood with a complexity penalty on the model size, and derive non-asymptotic bounds for the resulting misclassification excess risk. Implementation of any complexity penalty-based criterion, however, requires a combinatorial search over all possible models. To find a model selection procedure that is computationally feasible for high-dimensional data, we consider logistic Lasso and Slope classifiers and show that they also achieve the optimal order. We further extend the proposed approach to multiclass classification by sparse multinomial logistic regression and discuss various possible types of sparsity in the multiclass setup.

 

This is joint work with Vadim Grinshtein and Tomer Levy.

 

 

 

·         Taeho Kim, Haifa University

 

 

Improved Multiple Confidence Intervals via Thresholding Informed by Prior Information

 

Consider a statistical decision problem where multiple sets of parameters are of interest. To simultaneously infer about these parameters, a multiple interval estimator (MIE) can be constructed. In this study, an MIE with better performance than existing MIEs, in particular, relative to a z-based MIE, is developed using a thresholding approach. The determination of the thresholds in this MIE is informed by assigning prior distributions to each of the sets of parameters. The performance of the MIE is evaluated using two measures: (i) a global coverage rate and (ii) a global expected content, which are both averages with respect to the prior distribution. The proposed MIE procedure, which is developed with respect to these two performance measures, is called a Bayes MIE with thresholding (BMIE_Thres).

A multivariate normal model with the conjugate prior is utilized to develop the BMIE_Thres for the mean vector. The behavior of BMIE_Thres is then analytically investigated in terms of the performance measures. It is shown that the performance of the BMIE_Thres approaches that of the z-based MIE as the thresholds become large.

In this presentation, in-season baseball batting average data and leukemia gene expression data are used to demonstrate the procedure for the known and unknown standard deviation settings, respectively. In addition, simulation studies are presented to compare the BMIE_Thres with the classical and Bayes MIEs.

 

 

 

·         Somabha Mukherjee, Department of Statistics, The Wharton School, University of Pennsylvania

 

Statistical Inference on Dependent Combinatorial Data: The Ising Model 

 

Dependent data arise in all avenues of science, technology and society, such as Facebook friendship networks, epidemic networks, election data and peer group effects. Analysis of dependent combinatorial data is crucial for understanding the behavior of edge and higher-order motif estimates in very large and inaccessible networks, deriving asymptotics of graph-based tests for equality of distributions, in the study of coincidences, and many more seemingly diverse areas in statistics and probability. In this talk, I am going to focus on the Ising model, which is a useful framework introduced by statistical physicists, and later used by statisticians, for modeling dependent binary data. In its original form, the Ising model can capture only pairwise interactions, which are seldom observed in the real world. For example, in a peer group, the decision of an individual is affected not just by pairwise communications, but by interactions with larger community tuples. It is also known in physics that atoms on a crystal surface interact not just in pairs, but in triplets and higher-order tuples. These higher-order interactions can be captured by the so-called tensor Ising models, where the Hamiltonian (sufficient statistic) is a multilinear form of degree p. I will show how to estimate the natural parameters of this model, why maximum-likelihood estimation fails in more general Ising models, and will briefly talk about the asymptotics of the parameter estimates in this model. The asymptotics are highly non-standard, characterized by the presence of a critical curve in the interior of the parameter space on which the estimates have a limiting mixture distribution, and a surprising superefficiency phenomenon at the boundary point(s) of this critical curve.
I will also consider a more realistic version of the Ising model, which is a generalization of the vanilla logistic regression, and talk briefly about estimating the natural parameters of this model under sparsity assumptions on the parameters. Towards the end, I will talk briefly about some other places where dependent combinatorial data arise, including graph-based nonparametric tests for equality of 

 

 

·         Amit Moscovich, Princeton University.

 

 Nonparametric estimation of high-dimensional shape spaces with applications to structural biology

 

Over the last twenty years, there have been major advances in non-linear dimensionality reduction, or manifold learning, and nonparametric regression of high-dimensional datasets with low intrinsic dimensionality.  A key idea in this field is the use of data-dependent Fourier-like basis vectors given by the eigenvectors of a graph Laplacian.  These eigenvectors provide a natural basis for representing and estimating smooth signals. Their use for estimation over arbitrary domains generalizes the classical notion of regression using orthogonal function series. In this talk, I will discuss the application of such methods for mapping spaces of volumetric shapes with continuous motion. Three lines of research will be presented:

(i) High-dimensional nonparametric estimation of distributions of volumetric signals from noisy linear measurements.

(ii) Leveraging the Wasserstein optimal transport metric for manifold learning and clustering.

(iii) Non-linear independent component analysis for analyzing independent motions.

A key motivation for this work comes from structural biology, where breakthrough advances in cryo-electron microscopy have led to thousands of atomic-resolution reconstructions of various proteins in their native states.  However, the success of this field has been mostly limited to the estimation of rigid structures, while many important macromolecules contain several parts that can move in a continuous fashion, thus forming a manifold of conformations which cannot be estimated using existing tools.  The methods described in this talk present progress towards the solution of this grand challenge, namely the extension of point-estimation methods which output a single 3D conformation to estimators of entire manifolds of conformations.

 

 

·         Ruth Heller, TAU

 

Optimal control of false discovery criteria in the two-group model

The highly influential two-group model in testing a large number of statistical hypotheses assumes that the test statistics are drawn independently from a mixture of a high probability null distribution and a low probability alternative. Optimal control of the marginal false discovery rate (mFDR), in the sense that it provides maximal power (expected true discoveries) subject to mFDR control, is known to be achieved by thresholding the local false discovery rate (locFDR), the probability of the hypothesis being null given the set of test statistics, with a fixed threshold. We address the challenge of optimally controlling the popular false discovery rate (FDR) or positive FDR (pFDR) in the general two-group model, which also allows for dependence between the test statistics. These criteria are less conservative than the mFDR criterion, so they make more rejections in expectation.
We derive their optimal multiple testing (OMT) policies, which turn out to be thresholding the locFDR with a threshold that is a function of the entire set of statistics. We develop an efficient algorithm for finding these policies, and use it for problems with thousands of hypotheses. We illustrate these procedures on gene expression studies.

Joint work with Saharon Rosset
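
As a rough illustration of the fixed-threshold locFDR rule mentioned in the abstract, the sketch below computes the locFDR in the simplest independent two-group setting. It assumes a standard normal null, a unit-variance normal alternative, and illustrative values for the null proportion `pi0`, the alternative mean `alt_mean` and the threshold; all names and numbers are hypothetical. It is not the OMT policy of the talk, whose threshold depends on the entire set of statistics.

```python
from math import exp, pi, sqrt


def normal_pdf(z, mean=0.0):
    """Density of a unit-variance normal distribution at z."""
    return exp(-0.5 * (z - mean) ** 2) / sqrt(2.0 * pi)


def loc_fdr(z, pi0=0.9, alt_mean=3.0):
    """Posterior probability that a statistic z comes from the null.

    Assumed model (illustrative only): null N(0, 1) with weight pi0,
    alternative N(alt_mean, 1) with weight 1 - pi0, independent statistics.
    """
    null_part = pi0 * normal_pdf(z)
    alt_part = (1.0 - pi0) * normal_pdf(z, alt_mean)
    return null_part / (null_part + alt_part)


def reject_fixed_threshold(z_values, threshold=0.2, pi0=0.9, alt_mean=3.0):
    """Reject each hypothesis whose locFDR is at most a fixed threshold."""
    return [loc_fdr(z, pi0, alt_mean) <= threshold for z in z_values]
```

Under these assumptions, a statistic near zero has a locFDR close to one and is retained, while a statistic far in the direction of the alternative has a small locFDR and is rejected.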

 

 

·         Alon Kipnis, Stanford

 

Two-sample problem for large, sparse, high-dimensional distributions under rare/weak perturbations

Consider two samples, each obtained by independent draws from two possibly different distributions over the same finite and large set of features. We would like to test whether the two distributions are identical, or not. We propose a method to perform a two-sample test of this form by taking feature-by-feature p-values based on a binomial allocation model and combining the p-values using Higher Criticism. Performance on real-world data (e.g. authorship attribution challenges) shows this to be an effective unsupervised, untrained discriminator even under violations of the binomial allocation model.

We analyze the method in a 'rare/weak departures' setting where, if two distributions are actually different, they differ only in relatively few features and only by relatively subtle amounts. We perform a phase diagram analysis in which the phase space quantifies how rare and how weak such departures are. Although our proposal does not require any formal specification of an alternative hypothesis, nor does it require any specification of a baseline or null hypothesis, in the limit where counts are high, the method delivers the optimal phase diagram in the rare/weak setting: it is asymptotically fully powerful inside the region of phase space where a formally specified test would have been fully powerful. In the limit where counts are low, we derive the phase diagram as well, although the optimality of the resulting diagram remains an open question.
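
The two steps named in the abstract, feature-by-feature binomial p-values followed by Higher Criticism, can be sketched as below. This is a minimal illustration under simplifying assumptions, not the speaker's implementation: the binomial success probability is taken to be the first sample's share of the total count, the Higher Criticism statistic is the standard maximum of standardized gaps over the smallest half of the p-values (`gamma=0.5`), and all function names are hypothetical.

```python
from math import comb, sqrt


def binom_two_sided_pvalue(x, n, p):
    """Exact two-sided binomial p-value: total probability of outcomes
    that are no more likely than the observed count x."""
    if n == 0:
        return 1.0
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    cutoff = pmf[x] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= cutoff))


def higher_criticism(pvalues, gamma=0.5):
    """Maximum standardized gap between i/N and the i-th smallest p-value,
    taken over the smallest gamma fraction of the p-values.
    Returns -inf if every candidate p-value equals 0 or 1."""
    n = len(pvalues)
    best = float("-inf")
    for i, p in enumerate(sorted(pvalues)[: max(1, int(gamma * n))], start=1):
        if 0.0 < p < 1.0:  # skip degenerate p-values
            best = max(best, sqrt(n) * (i / n - p) / sqrt(p * (1 - p)))
    return best


def two_sample_hc(counts1, counts2, gamma=0.5):
    """Higher Criticism of per-feature binomial allocation p-values.

    Assumption: under the null, a feature's count in sample 1 is
    Binomial(x + y, n1 / (n1 + n2)), where n1 and n2 are the sample totals.
    """
    n1, n2 = sum(counts1), sum(counts2)
    p = n1 / (n1 + n2)
    pvals = [binom_two_sided_pvalue(x, x + y, p) for x, y in zip(counts1, counts2)]
    return higher_criticism(pvals, gamma)
```

Large values of the statistic suggest that a few features deviate from the common allocation; calibrating a rejection threshold is outside the scope of this sketch.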

 

 

·         Yves Rozenholc, Paris Descartes

 

 

Differential Analysis in Transcriptomic: the Strength of Randomly Picking so-called Reference Genes.

Transcriptomic analyses are characterized by not being directly quantitative and by providing only relative measurements of expression levels, up to an unknown individual scaling factor. Assuming that some housekeeping genes are known, one can use their observed relative expression levels to obtain a normalization (Vandesompele et al. 2002). However, in exploratory differential analysis, it is easily understandable that reference genes cannot always be known in advance. Apart from the crude normalization by the total count (Marioni et al. 2008), several methods have been proposed to circumvent this issue: upper quantile (Bullard et al. 2010), trimmed mean of M values (TMM) (Mark D. Robinson and Oshlack 2010) and interindividual median count ratio across genes (Anders and Huber 2010), which can be found in the Bioconductor packages DESeq2 (Love, Huber, and Anders 2014) and edgeR (Mark D Robinson, McCarthy, and Smyth 2010). More recently, Li et al. (2012) proposed using log-linear fits to detect differentially expressed (DE) genes; however, this approach also relies on a scaling factor estimation, achieved by starting from the total count and iteratively selecting a subset of genes associated with small values of a Poisson goodness-of-fit statistic. All these methods are based on the belief that reference genes may be identified because their expression levels are expected to be more stable in the overall population. However, first, one can easily understand that the unknown scaling factors may have a strong deleterious effect on this belief; second, one can build counter-examples to such approaches by considering reference genes showing more variability than non-housekeeping ones.

In brief, current procedures for differential analysis in such high-throughput transcriptomic experiments are built on a preliminary step, which consists in finding some non-differential expressions to estimate the scaling factors. The data are then reused for testing. It is not only unsatisfactory to lack a good recipe for this first step, but also improper and, statistically, worse to perform a differential analysis by first having to run a non-differential analysis on the same data.

Our intensive iterative random procedure for detection can be summarized as follows. At each step of the iteration, a random subset of genes is selected and considered to be made of reference genes, used to get a normalization. After this normalization, the non-selected genes are tested for differential behaviors. Along the iterations, the detections for each gene are pooled. After the iterations, the pooled detections are compared to the rates of potential wrong detections due to randomly mis-picking genes from the unknown set of DE genes. Our method controls the FWER for any test procedure having its level and power controlled when the scaling factors are known. It is adaptive to the unknown number of genes which would be detectable, given the observations, if the scaling factors were known, assuming only that the number of DE genes is less than half of the total number of genes.

Moreover, exploiting the fact that our procedure behaves as if reference genes were available, we propose and study a unified testing procedure for differential analysis, adapted to our random detector, for the two classical models, Poisson and negative binomial. This test derives from a procedure in which the scaling factors would be known, and in this sense it satisfies the requirements of our random procedure in terms of type I and type II errors. Assuming that the expression levels are high enough, we study its properties. It is shown to be approximately standard Gaussian, and we derive a non-asymptotic control of this approximation, so that the level of the test is well controlled at finite sample sizes.

 

 

·         Dan Vilenchik, Ben Gurion U.

 

Computational-statistical tradeoffs in the problem of finding sparse Principal Components in high-dimensional data 

The problem of consistently estimating the covariance matrix of a p-dimensional random variable X is well understood when the ratio p/n goes to zero “sufficiently fast” (n being the number of samples). In many applied scenarios one is trying to solve an easier task, estimating the leading eigenvector(s) of the covariance matrix (known as its leading Principal Component(s)). However, in a high-dimensional setting, where the ratio p/n is a constant or even grows to infinity with n, both tasks become much trickier. In some cases, efficiency and statistical consistency need to be traded off. One popular approach to settling this trade-off is to use various efficient estimators that guarantee consistency only in a certain regime of parameters. In this talk we consider a different approach. We suggest a hierarchy of estimators, where each level in the hierarchy is an estimator that spends more computational resources than its predecessors. We provide a rigorous analysis of our approach in the spiked-covariance model, where we explicate the required level in the hierarchy to guarantee a statistically consistent solution, as a function of the SNR. We also provide simulation results that demonstrate the usefulness of our approach.

 Paper appeared in COLT 2020.

 

·         Tirza Routtenberg, BGU

 

Performance Bounds for Estimation After Model Selection

In many practical parameter estimation problems, such as coefficient estimation of polynomial regression and direction-of-arrival (DOA) estimation, the exact model is unknown and a model selection stage is performed prior to estimation. This data-based model selection stage affects the subsequent estimation, e.g. by introducing a selection bias. Thus, new methodologies are needed for both frequentist and Bayesian estimation. In this study, the problem of estimating unknown parameters after a data-based model selection stage is considered. In the considered setup, the selection of a model is equivalent to the recovery of the deterministic support of the unknown parameter vector. We assume that the data-based model selection criterion is given and analyze the consequent Bayesian and frequentist estimation properties for this specific criterion. For Bayesian parameter estimation after model selection, we develop the selective Bayesian Cramér-Rao bound (CRB) on the mean-squared-error (MSE) of coherent estimators that force unselected parameters to zero. Similarly, for the frequentist (non-Bayesian) estimation of deterministic unknown parameters, we derive the corresponding frequentist CRB on the MSE of any coherent estimator, which is also Lehmann-unbiased. To this end, the relevant Lehmann-unbiasedness is defined, with respect to the model selection rule. We analyze the properties of the proposed selective CRBs including the order relation with the oracle CRBs that assume knowledge of the model. The selective CRBs are evaluated in simulations and are shown to be informative lower bounds on the performance of practical coherent estimators. As time permits, I will discuss similar ideas that can be applied to estimation in Good-Turing models.

 

 

·         Nir Keret, TAU

 

Optimal Cox Regression Subsampling Procedure with Rare Events

Massive survival datasets are becoming increasingly prevalent with the development of the healthcare industry. Such datasets pose computational challenges unprecedented in traditional survival analysis use-cases. A popular way of coping with massive datasets is to downsample them to a more manageable size, such that the computational resources can be afforded by the researcher. Cox proportional hazards regression has remained one of the most popular statistical models for the analysis of survival data to date. This work addresses the setting of right-censored and possibly left-truncated data with rare events, such that the observed failure times constitute only a small portion of the overall sample. We propose Cox regression subsampling-based estimators that approximate their full-data partial-likelihood-based counterparts, by assigning optimal sampling probabilities to censored observations, and including all observed failures in the analysis. Asymptotic properties of the proposed estimators are established under suitable regularity conditions, and simulation studies are carried out to evaluate the finite sample performance of the estimators. We further apply our procedure to UK Biobank data on colorectal cancer genetic and environmental risk factors.

 

·         Malgorzata Bogdan, Wroclaw U. of Science and Technology

 

Ghost Quantitative Trait Loci and hotspots: What might happen if the signal is not sparse?

Ghost quantitative trait loci (QTL) are the false discoveries in QTL mapping that arise due to the “accumulation” of the polygenic effects, uniformly distributed over the genome. The locations of ghost QTL depend on a specific sample correlation structure determined by the genotypes at all loci and have a tendency to replicate when the same genotypes are used to study multiple QTL, e.g. using recombinant inbred lines or studying the expression QTL. In this case, the ghost QTL phenomenon can lead to false hotspots, where multiple QTL show apparent linkage to the same locus. We illustrate the problem using the classic backcross design and propose a solution based on the extended mixed effect model, where the random effects are allowed to have a nonzero mean. We provide formulas for estimating the thresholds for the corresponding t-test statistics and use them in the stepwise selection strategy, which allows for a simultaneous detection of several QTL. We report the results of extensive simulation studies which illustrate that our approach can eliminate ghost QTL/false hotspots, while preserving a high power of true QTL detection. This is joint work with Jonas Wallin (Lund University), Piotr Szulc (University of Wroclaw), Rebecca Doerge (CMU) and David Siegmund (Stanford).

 

 

 

·         Gil Kur, MIT

 

On the Minimal Error of Empirical Risk Minimization

In recent years, highly expressive machine learning models, i.e. models that can express rich classes of functions, have become more and more commonly used due to their success in both regression and classification tasks; examples include deep neural nets, kernel machines and more. From the point of view of classical statistical theory (the minimax theory), rich models tend to have a higher minimax rate, i.e. any estimator must have a high risk (a “worst case scenario” error). Therefore, it seems that for modern models the classical theory may be too conservative and strict.

In this talk, we consider the most popular procedure for regression tasks, namely Empirical Risk Minimization with squared loss (ERM), and analyze its minimal squared error in both the random and the fixed design settings, under the assumption of a convex family of functions; that is, the minimal squared error that the ERM attains when estimating any function in our class, in both settings. In the fixed design setting, we show that the error is governed by the global complexity of the entire class. In contrast, in random design, the ERM may only adapt to simpler models if the local neighborhoods around the regression function are nearly as complex as the class itself, a somewhat counter-intuitive conclusion. We provide sharp lower bounds for performance of ERM for both Donsker and non-Donsker classes. This talk is based on joint work with Alexander Rakhlin.