14 March 
Jonathan Rosenblatt, BGU 


21 March 
David Horn, TAU 

The Weight-Shape Decomposition of Scale-Space Distributions: a Framework for Clustering Algorithms 
28 March 
Hilary Finucane, MIT 


4 April 
Yakir Reshef, MIT 

Estimating functional correlation from genome-wide association study summary statistics 
16 May 
Amit Moscovich Eiger, TAU 

Minimax-optimal semi-supervised regression on unknown manifolds 
23 May 
Marianna Pensky, University of Central Florida 


6 June 
Zhaohui Qin, Emory University 

Utilizing Big Data to solve the small data inference problem 
27 June 
Ofer Harel, UConn 


3 July 
Nathan Srebro, TTI 

1 November 
Daniel Yekutieli, TAU 


8 November 
Dan Garber, Toyota Technological Institute at Chicago 


16 November 
Geoff Vining, Virginia Tech – Schreiber 008 

A Cautionary Note on Bayesian Approaches within Quality Improvement 
29 November 
Dovi Poznanski, TAU 

From Kaplun to Schreiber in 24 slides: a non-formal astrostatistics talk 
13 December 
Daniel Nevo, Harvard 


20 December 
Tamar Sofer, University of Washington 


3 January 
Roee Guttman, Brown University 


10 January 
Yaakov Malinovsky, UMBC 


17 January 
Assaf Weinstein, Stanford 


24 January 
Aya Cohen, Technion 






Seminars are held on Tuesdays at 10:30 am in the Schreiber Building, room 309 (see the TAU map). The seminar organizer is Daniel Yekutieli.
To join the seminar mailing list, or for any other inquiries, please call (03) 6409612 or email 12345yekutiel@post.tau.ac.il54321 (remove the numbers unless you are a spammer…)
Seminars from previous years
ABSTRACTS
From post-hoc analysis to post-selection inference
I will give an introductory talk explaining the connection between the work of Tukey and Scheffé on post-hoc analysis, Benjamini and Hochberg's work on the FDR, the work of Efron and colleagues on the Bayesian FDR, my work with Benjamini on selective inference, the work of Berk et al. on post-selection inference, and recent work on frequentist and Bayesian post-selection inferences based on the conditional likelihood.
· Dan Garber, Toyota Technological Institute at Chicago
Faster Projection-free Machine Learning and Optimization
Projected gradient descent (PGD) and its close variants are often considered the methods of choice for solving a large variety of machine learning optimization problems, including empirical risk minimization, statistical learning, and online convex optimization. This is not surprising, since PGD is often optimal in a very appealing information-theoretic sense. However, for many problems PGD is infeasible both in theory and in practice, since each step requires computing an orthogonal projection onto the feasible set. In many important cases, such as when the feasible set is a non-trivial polytope or a convex surrogate for a low-rank structure, computing the projection is computationally inefficient in high-dimensional settings. An alternative is the conditional gradient (CG) method, aka the Frank-Wolfe algorithm, which replaces the expensive projection step with a linear optimization step over the feasible set. Indeed, in many problems of interest the linear optimization step admits much more efficient algorithms than the projection step, which is the reason for the substantial renewed interest in this method in the past decade. On the downside, the convergence rates of the CG method often fall behind those of PGD and its variants.
In this talk I will survey an ongoing effort to design CG variants that, on the one hand, enjoy the cheap iteration complexity of the original method and, on the other hand, converge provably faster and are applicable to a wider variety of machine learning settings. In particular, I will focus on the cases in which the feasible set is either a polytope or a convex surrogate for low-rank matrices. Results will be demonstrated on applications including LASSO, video co-localization, optical character recognition, matrix completion, and multiclass classification.
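To make the projection-free idea concrete, here is a minimal sketch of the basic Frank-Wolfe method (not the speaker's faster variants) on an l1-ball-constrained least-squares problem; the matrix `A`, the planted sparse solution, and the radius `tau` are made up for illustration.

```python
import numpy as np

def frank_wolfe_l1(grad, x0, tau, iters=200):
    """Conditional gradient (Frank-Wolfe) over the l1-ball of radius tau.
    Each step solves a *linear* problem over the feasible set -- here just
    an argmax over coordinates -- instead of an orthogonal projection."""
    x = x0.copy()
    for t in range(iters):
        g = grad(x)
        i = np.argmax(np.abs(g))           # linear minimization oracle:
        s = np.zeros_like(x)               # the best vertex of the l1-ball
        s[i] = -tau * np.sign(g[i])
        gamma = 2.0 / (t + 2.0)            # standard step-size schedule
        x = (1 - gamma) * x + gamma * s    # convex combination stays feasible
    return x

# Toy LASSO-style problem: min ||Ax - b||^2 subject to ||x||_1 <= tau.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
b = A @ np.array([1.0, -1.0] + [0.0] * 8)   # planted sparse solution
x_hat = frank_wolfe_l1(lambda x: 2 * A.T @ (A @ x - b), np.zeros(10), tau=2.0)
```

Note that the iterate never needs to be projected: it is always a convex combination of l1-ball vertices, so feasibility is automatic.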
· Geoff Vining, Virginia Tech
A Cautionary Note on Bayesian Approaches within Quality Improvement
Bayesian approaches are increasingly popular within the statistics community. However, they currently do not seem to find wide application within the industrial statistics/quality improvement community. This paper examines some of the basic reasons why. It begins by reviewing Box's perspective on the scientific method and discovery. It then examines Deming's concepts of analytic versus enumerative studies. Together, these concepts provide a framework for evaluating where Bayesian approaches make good sense, where they make little sense, and where they fall somewhere in between. This paper uses examples based on statistical sampling plans and the design and analysis of experiments to illustrate its basic points.
Evaluation of Within Group Agreement
Complex, multilevel theories are common in behavioral sciences research, where notions of collective phenomena such as group affect, team efficacy, and organizational climate are studied. A major challenge for researchers working in these areas is that higher-level phenomena often cannot be assessed directly; rather, inferences must be made from data collected at lower levels of analysis. In many cases, these phenomena are understood conceptually to arise from lower levels, often from the individuals within these collectives. The methodological implication is that measurement should take place at the lower level (e.g., the individual level) and the data should then be aggregated to the level of interest (e.g., the group or organizational level). It is accepted that within-group agreement is a prerequisite for aggregating individual ratings to the group level. Agreement reflects the degree to which the members of the group share a similar view, so that the aggregated value can be used to reflect their view.
When justifying aggregation, agreement indices such as rWG(J) or AD are used together with the intraclass correlation (ICC) to demonstrate agreement and consistency among lower-level units. Despite the progress on evaluating agreement based on rWG(J) or AD, there are still many practical questions about how to infer from the calculated agreement indices whether the agreement is large enough to justify aggregation.
In the seminar I shall introduce the rWG(J) and AD indices and explain their properties and how they are used (and misused). I shall describe and discuss the RGR method (Bliese & Halverson, 1996), which compares the estimated agreement indices and ICC obtained for actual team data to those of "pseudo teams" formed by randomly combining individual responses into "teams". I shall also point out open questions that still remain concerning how to use the observed values of these indices to infer about agreement, and briefly describe recent new developments.
Joint work with Etti Doveh.
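For readers unfamiliar with these indices, here is a minimal sketch of the single-item rWG (James, Demaree & Wolf, 1984), the building block of the multi-item rWG(J); the toy ratings and scale are made up for illustration.

```python
import numpy as np

def rwg(ratings, n_options):
    """Single-item within-group agreement index r_WG: one minus the ratio of
    the observed rating variance to the variance expected under a uniform
    (no-agreement) null on an A-point scale, which is (A^2 - 1) / 12."""
    s2 = np.var(ratings, ddof=1)                 # observed variance of group ratings
    sigma2_eu = (n_options ** 2 - 1) / 12.0      # uniform-null ("expected") variance
    return 1.0 - s2 / sigma2_eu

# Perfect agreement yields 1; scattered ratings yield values near zero or below.
perfect = rwg([4, 4, 4, 4], n_options=5)
scattered = rwg([1, 2, 4, 5], n_options=5)
```

Values below zero are possible when ratings are more dispersed than the uniform null, which is one of the interpretation difficulties the talk touches on.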
· Dovi Poznanski, Astronomy, TAU
From Kaplun to Schreiber in 24 slides: a non-formal astrostatistics talk
Prompted by a random lunch discussion, I will discuss with this esteemed crowd a few of the projects my team has worked on, or is currently advancing, hoping to give you a glimpse of what we do in the field of "big-data astronomy" and spark some discussion on the methodology. I will discuss how we stack spectra of extragalactic objects in order to recover the tiny imprint of the gas in our own galaxy, a project to study supermassive black holes via instrumental systematic noise, and our discoveries using an anomaly detection algorithm that we developed.
Causal mediation analysis for generalized linear models
In epidemiological, social science, and other scientific studies, mediation analysis is often carried out to assess whether the effect of a treatment or an exposure on an outcome of interest is mediated by another covariate. This task concerns the underlying causal mechanism. In this talk, I will first present the counterfactual framework for causal inference and provide background on causal mediation analysis while introducing the causal parameters of interest. A common method for mediation analysis, termed "the difference method", compares estimates from models with and without the suspected mediator, and results in estimates that can have a causal interpretation under certain assumptions. I will formulate the problem for generalized linear models and consider the issue of having the same link function for the conditional and marginal models. Causal mediation effects will then be estimated by utilizing a data duplication algorithm together with a generalized estimating equations approach that also provides straightforward variance estimation.
This is joint work with Xiaomei Liao and Donna Spiegelman.
· Tamar Sofer, University of Washington
Novel approaches for analysis of complex genetic data sets
The Hispanic Community Health Study is a large genetic health study of Hispanic/Latino individuals. Study participants were sampled via a two-stage design, leading to a complicated correlation structure, where people may be both genetically and environmentally correlated. I will present two analysis approaches for studies with such complicated structure: a method for estimating the proportion of outcome variance due to genetic effects (heritability), and particularly confidence intervals for heritability, and a meta-analysis method for combining association studies conducted on multiple study strata, when individuals are correlated between strata.
· Roee Guttman, Brown University
Beyond Difference-in-Differences – A Bayesian Procedure to Estimate the Effects of Nursing Home Bed-Hold Policies
Nursing home bed-hold policies provide continuity of care for Medicaid beneficiaries by paying nursing homes to reserve beds so that residents can return to their facility of occupancy following an acute hospitalization. Two outcomes that are useful in assessing the effects of these policies on the quality of care are the nursing home's rates of acute hospitalization and mortality. Evaluation of policy implications in the absence of randomized experiments has been an important research question in health services research, quantitative sociology, and economics. Difference-in-Differences (DID) methods have frequently been used to account for changes over time unrelated to the policy. Using DID, the change experienced by the group subjected to the policy is adjusted by the change experienced by the group not subjected to the policy. The underlying assumption is that the time trend in the control group is an adequate proxy for the time trend that would have occurred in the treatment group in the absence of the policy. DID may suffer from weaknesses when more than two time points are considered, when the outcomes are not normally distributed or are not scalar, and when the treatment effect is heterogeneous. We propose a new Bayesian procedure that relies on multiply imputing the potential outcomes using past outcomes to overcome these weaknesses. We provide an efficient algorithm to approximate the full Bayesian procedure, and we apply it to estimate the impact of nursing home bed-hold policies.
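For intuition, the classical two-group, two-period DID contrast that the proposed Bayesian procedure generalizes can be sketched as follows; the hospitalization rates below are made-up numbers, not results from the study.

```python
# Two-group, two-period difference-in-differences: the treated group's change
# is adjusted by the control group's change, which proxies for the common
# time trend. All rates here are illustrative, made-up values.
treated_pre, treated_post = 0.20, 0.17   # group subject to the bed-hold policy
control_pre, control_post = 0.21, 0.20   # comparison group without the policy

did = (treated_post - treated_pre) - (control_post - control_pre)  # about -0.02
```

The weaknesses listed in the abstract (multiple time points, non-normal or non-scalar outcomes, heterogeneous effects) are exactly where this simple contrast stops being adequate.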
· Yaakov Malinovsky, University of Maryland, Baltimore County
Nested Group Testing Procedures
Group testing has its origin in the identification of syphilis in the US army during World War II. It is a useful method that has broad applications in medicine, engineering, and even in airport security control. Consider a finite population of N units, where unit i has probability p of being defective. A group test is a simultaneous test on an arbitrary group of units with two possible outcomes: all units are good, or at least one of the units is defective. The group testing problem is to construct a procedure that classifies all units in a given population with as small as possible an expected number of tests. In this talk I review previously known results in the group testing literature and present new results characterizing the optimality of commonly used nested group testing procedures. If time allows, the generalized group testing problem (where unit i has probability p_i of being defective) will be discussed as well. This is joint work with Paul Albert, NCI.
References:
Malinovsky, Y. and Albert, P. S. (2016). Revisiting nested group testing procedures: new results, comparisons, and robustness. Available at https://arxiv.org/pdf/1608.06330v2.pdf.
Malinovsky, Y. (2016). Sterrett procedure for the generalized group testing problem. Available at https://arxiv.org/pdf/1609.04478v2.pdf.
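As background, the expected number of tests per unit under Dorfman's classical two-stage procedure (the simplest ancestor of the nested procedures discussed in the talk) has a closed form; a quick sketch, with an illustrative defect probability:

```python
# Dorfman's two-stage procedure: test a pool of k units; if the pool tests
# positive, test each of its k units individually.
def dorfman_tests_per_unit(p, k):
    # 1/k for the pooled test, plus one test per unit when the pool is positive
    return 1.0 / k + 1.0 - (1.0 - p) ** k

p = 0.01                                  # per-unit defect probability (made up)
best_k = min(range(2, 50), key=lambda k: dorfman_tests_per_unit(p, k))
# The optimal pool size is close to the classical 1/sqrt(p) rule of thumb,
# and the expected cost is far below one test per unit.
```

Nested procedures such as Sterrett's improve on this by retesting a positive pool adaptively rather than unit by unit, which is where the optimality results in the talk come in.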
· Assaf Weinstein, Stanford
Empirical Bayes Estimation of a Heteroscedastic Normal Mean
I will revisit a classical problem: X_i ~ N(theta_i, V_i) independently, with V_i known, i = 1, ..., n, and the goal is to estimate the (non-random) means theta_i under the sum of squared errors. When the variances are all equal, linear empirical Bayes estimators, which model the true means as i.i.d. random variables, lead to (essentially) the James-Stein estimator and have strong frequentist justifications. In the heteroscedastic case, such empirical Bayes estimators are less adequate if the V_i and theta_i are dependent in their empirical distribution. We suggest a new empirical Bayes procedure that groups together observations with similar variances and applies a spherically symmetric estimator to each group separately. Our estimator is exactly minimax and at the same time asymptotically achieves the risk of a stronger oracle than the usual one. The motivation for the new estimator comes from extending a compound decision theory argument from equal variances to unequal variances.
This is joint work with Larry Brown, Zhuang Ma and CunHui Zhang.
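For context, a sketch of the equal-variance (positive-part) James-Stein estimator that the talk generalizes to unequal variances; the sparse mean vector and the data are simulated for illustration.

```python
import numpy as np

def james_stein(x, v):
    """Positive-part James-Stein estimator for X_i ~ N(theta_i, v) with equal
    known variance v: shrink all observations toward zero by a common,
    data-driven factor."""
    shrink = max(0.0, 1.0 - (len(x) - 2) * v / np.sum(x ** 2))
    return shrink * x

# Sparse means: shrinkage greatly reduces the total squared error.
rng = np.random.default_rng(1)
theta = np.zeros(50)
theta[:5] = 3.0                            # a few non-zero means
x = theta + rng.standard_normal(50)        # X_i ~ N(theta_i, 1)
err_mle = np.sum((x - theta) ** 2)         # loss of the raw observations
err_js = np.sum((james_stein(x, 1.0) - theta) ** 2)
```

With unequal V_i, applying one common shrinkage factor is exactly what breaks down when the variances and means are dependent, motivating the talk's grouping-by-variance construction.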
A hypothesis testing view of searchlight pattern analysis
Searchlight Multi-Voxel Pattern Analysis (MVPA) has been tremendously popular in the neuroimaging community since its introduction about 10 years ago. The idea of fitting a local/scan/searchlight classifier can also be found in the genetics literature. In this talk I will outline a typical MVPA analysis pipeline and cast it as a statistical multivariate hypothesis test, so that it may be compared to the mass-univariate approach (i.e., multiple univariate testing). Seen as a multivariate testing problem, I will discuss the implied hypotheses, potential power gains, and computational shortcuts.
Some of the ideas in this talk have been published in [1-3]. Some are still work in progress and are yet to be published.
[1] Gilron, Roee, Jonathan Rosenblatt, and Roy Mukamel. “Addressing the ‘problem’ of Temporal Correlations in MVPA Analysis.” In Proceeding of the The 6th International Workshop on Pattern Recognition in Neuroimaging, 2016.
[2] Gilron, Roee, et al. "What's in a pattern? Examining the type of signal multivariate analysis uncovers at the group level." NeuroImage 146 (2017): 113-120.
[3] Rosenblatt, Jonathan, Roee Gilron, and Roy Mukamel. “BetterThanChance Classification for Signal Detection.” arXiv:1608.08873 [Stat], August 31, 2016. http://arxiv.org/abs/1608.08873.
The Weight-Shape Decomposition of Scale-Space Distributions: a Framework for Clustering Algorithms
We propose an analysis scheme which addresses the scale-space distribution, based on Gaussian kernels applied to data points in feature space. By adding an entropy-like variable we prove that the scale-space probability distribution can be written as a product of a weight function and a shape distribution. This weight-shape decomposition allows for the construction of three different clustering schemes. Clustering based on the shape distribution coincides with the Quantum Clustering method.
The clustering methodologies are based on the flow of replica points in feature space. We demonstrate and compare them on natural datasets. Our scheme provides an analytic demonstration of pure point and line attractors of replica dynamics. The appearance of the latter will be demonstrated in a big-data analysis.
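As a generic sketch of the starting object (the scale-space density built from Gaussian kernels at the data points, not the paper's weight-shape decomposition itself), with made-up points and scale:

```python
import numpy as np

def scale_space_density(x, data, sigma):
    """Scale-space probability distribution at point x: a normalized sum of
    Gaussian kernels centered at the data points. The scale sigma sets the
    resolution at which cluster structure is examined."""
    d2 = np.sum((data - x) ** 2, axis=1)   # squared distances to all data points
    k = np.exp(-d2 / (2 * sigma ** 2))     # Gaussian kernel contributions
    dim = data.shape[1]
    return k.sum() / (len(data) * (2 * np.pi * sigma ** 2) ** (dim / 2))

data = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])  # a tight pair plus an outlier
dense = scale_space_density(np.array([0.05, 0.0]), data, sigma=0.5)
sparse = scale_space_density(np.array([2.5, 2.5]), data, sigma=0.5)
```

Replica-point clustering schemes of the kind described in the abstract move copies of the data points along functionals of such a density, so maxima (and, in the paper's analysis, line attractors) determine the clusters.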
Heritability enrichment of specifically expressed genes identifies disease-relevant cell types and tissues
For many diseases and traits, genome-wide association studies (GWAS) have identified a large number of associated regions of the genome, but moving from an associated region of the genome to a better understanding of the relevant biological processes often requires in vitro experiments done in the right cell types or tissues. The relevant cell types and tissues are often unknown, and identifying them is a key step in learning biology from GWAS. In this talk, I will describe our recent work on identifying disease-relevant cell types and tissues by joint analysis of GWAS data with gene expression data.
I will first describe stratified LD score regression, a method that uses GWAS summary statistics to fit a random effects model. The parameters of this model provide information about the disease such as whether regions of the genome active in a given tissue (e.g., liver) tend to be more associated with disease than regions of the genome active in a second tissue (e.g., brain), adjusting for several confounders and modeling the fact that there are causal variants that are not included in the GWAS. I will then describe our application of this method to gene expression data from several sources, including the GTEx and PsychENCODE consortia, together with GWAS summary statistics for 48 diseases and traits with an average sample size of 86,850. In this analysis, we identified many enrichments, including an enrichment of inhibitory neurons over excitatory neurons for bipolar disorder, and enrichments in the cortex for schizophrenia and in the striatum for migraine. Our results demonstrate that our approach is a powerful way to leverage gene expression data for interpreting GWAS signal.
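As a simplified illustration, the single-annotation ancestor of this method reduces to a linear regression of chi-squared association statistics on LD scores. The sketch below simulates from the implied mean relationship with made-up parameters and a crude noise model; the stratified version in the talk adds one slope per annotation, regression weights, and confounding adjustments.

```python
import numpy as np

# Under the basic LD score regression model, E[chi2_j] = N * h2 * l_j / M + 1,
# so the slope of chi2 on the LD score l_j recovers the heritability h2.
rng = np.random.default_rng(2)
M, N, h2 = 5000, 50_000, 0.4              # variants, GWAS sample size, heritability
ld_scores = rng.uniform(1, 200, size=M)   # l_j: per-variant LD scores (made up)
chi2 = N * h2 * ld_scores / M + 1 + rng.standard_normal(M)  # simplified noise

X = np.column_stack([np.ones(M), ld_scores])
intercept, slope = np.linalg.lstsq(X, chi2, rcond=None)[0]
h2_hat = slope * M / N                    # heritability recovered from the slope
```

The intercept near 1 is the feature that absorbs confounding such as population stratification, which is one reason the approach is robust when applied to real summary statistics.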
Estimating functional correlation from genome-wide association study summary statistics
Genome-wide association studies (GWAS) have grown tremendously in recent years, and new large-scale genomics data sets have provided a lens through which to interpret these data to learn disease biology. This is often done by combining GWAS data with genomic annotations containing unsigned information about whether a genetic variant is relevant or not to a biological process such as transcription factor binding. However, there are also many genomic annotations that yield signed information about whether a variant promotes or hinders a biological process. We introduce a method for estimating whether variants with concordant signs according to a genomic annotation also have concordant directions of effect on a trait of interest. Our approach is model-based, requires only GWAS summary statistics, accounts for correlations among genetic variants and the presence of unmeasured variants, and has the advantage of robustness to some plausible types of confounding. We present preliminary findings obtained by applying our method using signed annotations constructed using a sequence-based predictor of transcription factor binding.
Minimax-optimal semi-supervised regression on unknown manifolds
In recent years, many semi-supervised regression and classification methods have been proposed. These methods have demonstrated empirical success on some data sets, whereas on others the unlabeled data did not appear to help.
To analyze semi-supervised learning theoretically, it is often assumed that the data points lie on a low-dimensional manifold. Under this assumption, [1] and [2] have shown that classical nonparametric regression methods, using only the labeled data, can achieve optimal rates of convergence. This implies that asymptotically, as the number of labeled points tends to infinity, unlabeled data do not help. However, typical semi-supervised scenarios involve few labeled points and plenty of unlabeled ones.
In this work [3], we clarify the potential benefits of unlabeled data under the manifold assumption, given a fixed number of labeled points. Specifically, we prove that for a Lipschitz function on a manifold, a simple semi-supervised regression method based on geodesic k-nearest neighbors achieves the finite-sample minimax bound on the mean squared error, provided that sufficiently many unlabeled points are available. Furthermore, we show that this approach is computationally efficient, requiring only O(kN log N) operations to estimate the regression function for all N labeled and unlabeled points. We illustrate this approach on two datasets with a manifold structure: indoor localization using WiFi fingerprints and facial pose estimation. In both cases, the proposed method is more accurate and much faster than the popular Laplacian eigenvector regressor [4].
The talk should be accessible to anyone with a general background in statistics and machine learning. Specifically, no knowledge of manifold geometry or minimax theory is assumed.
[1] Bickel, P. J. and Li, B. "Local polynomial regression on unknown manifolds." Tomography, Networks and Beyond (2007).
[2] Lafferty, J. and Wasserman, L. "Statistical analysis of semi-supervised regression." NIPS (2007).
[3] Moscovich, A., Jaffe, A. and Nadler, B. "Minimax-optimal semi-supervised regression on unknown manifolds." AISTATS (2017). http://proceedings.mlr.press/v54/moscovich17a.html
[4] Belkin, M. and Niyogi, P. "Semi-supervised learning on Riemannian manifolds." Machine Learning (2004).
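A minimal sketch of the geodesic nearest-neighbor idea from [3], heavily simplified for illustration: it uses a dense Floyd-Warshall shortest-path computation rather than the O(kN log N) implementation in the paper, and the arc data, graph degree, and labeled points are all made up.

```python
import numpy as np

def geodesic_knn_regress(X, y_labeled, labeled_idx, k_graph=3, k_reg=1):
    """Semi-supervised regression in the spirit of geodesic k-NN: build a
    k-NN graph over ALL points (labeled and unlabeled), approximate geodesic
    distances by graph shortest paths, then predict each point's value from
    its geodesically nearest labeled neighbor(s)."""
    n = len(X)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    W = np.full((n, n), np.inf)
    np.fill_diagonal(W, 0.0)
    for i in range(n):                      # connect each point to its k_graph nearest
        for j in np.argsort(d[i])[1:k_graph + 1]:
            W[i, j] = W[j, i] = d[i, j]
    for m in range(n):                      # Floyd-Warshall shortest paths
        W = np.minimum(W, W[:, m:m + 1] + W[m:m + 1, :])
    preds = np.empty(n)
    for i in range(n):                      # average the k_reg nearest labeled values
        nearest = np.argsort(W[i, labeled_idx])[:k_reg]
        preds[i] = y_labeled[nearest].mean()
    return preds

# Points along an arc (a 1-D manifold in 2-D); only three of them are labeled.
t = np.linspace(0, 1, 20)
X = np.column_stack([np.cos(3 * t), np.sin(3 * t)])
labeled_idx = np.array([0, 10, 19])
preds = geodesic_knn_regress(X, t[labeled_idx], labeled_idx)
```

The unlabeled points earn their keep in the graph: they make the shortest-path distances follow the arc rather than cut across the ambient space, which is the mechanism behind the finite-sample gains discussed in the talk.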
· Marianna Pensky, University of Central Florida
Classification with many classes: challenges and pluses.
We consider high-dimensional multi-class classification of normal vectors where, unlike in standard settings, the number of classes may also be large. We derive non-asymptotic conditions on the effects of significant features, and lower and upper bounds on the distances between classes, required for successful feature selection and classification with a given accuracy. Furthermore, we study an asymptotic setup where the number of classes grows with the dimension of the feature space and the sample sizes. To the best of our knowledge, our paper is the first to study this important model. In particular, we present an interesting and, at first glance, somewhat counterintuitive phenomenon: the precision of classification can improve as the number of classes grows.
· Zhaohui Qin, Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA 30322, USA
Utilizing Big Data to solve the small data inference problem – Alternatives to hierarchical models with applications to genomics data
Modern high-throughput biotechnologies such as microarrays and next-generation sequencing produce a massive amount of information for each sample assayed. However, in a typical high-throughput experiment only a limited amount of data is observed for each individual feature, the classical "large p, small n" problem. The Bayesian hierarchical model, capable of borrowing strength across features within the same dataset, has been recognized as an effective tool for analyzing such data. However, the shrinkage effect, the most prominent feature of hierarchical models, can lead to undesirable overcorrection for some features. In this work, we discuss possible causes of the overcorrection problem and propose several alternative solutions. Our strategy is rooted in the fact that in the Big Data era, large amounts of historical data are available and should be taken advantage of. Our strategy presents a new framework to enhance the Bayesian hierarchical model.
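A toy sketch of the normal-normal empirical Bayes shrinkage underlying such hierarchical models, illustrating the overcorrection the talk addresses: a genuinely extreme feature is pulled hard toward the bulk. All numbers are simulated for illustration.

```python
import numpy as np

# Normal-normal empirical Bayes: each feature's estimate is pulled toward the
# grand mean, with weight set by the estimated between-feature variance.
rng = np.random.default_rng(3)
s2 = 1.0                                  # known sampling variance per feature
x = rng.standard_normal(200)              # observed effects; most true effects ~ 0
x[0] = 8.0                                # one genuinely large effect
mu = x.mean()
tau2 = max(x.var() - s2, 1e-8)            # method-of-moments between-feature variance
b = s2 / (s2 + tau2)                      # shrinkage weight toward the grand mean
posterior_mean = b * mu + (1 - b) * x     # the extreme feature is shrunk heavily
```

Because the between-feature variance is estimated from the mostly-null bulk, the single large effect ends up far below its observed value, which is the overcorrection that historical data can help diagnose and repair.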
Imputation of Race and Ethnicity in Health Insurance Claims
The State of Connecticut is currently populating an All Payers Claims Database (APCD) which will hold all healthcare claims data for residents of Connecticut. The APCD will be a valuable resource for the study of healthcare delivery, costs and outcomes. It is also a potential resource for the study of health disparities in Connecticut. However, since very few healthcare claims records include the race and ethnicity of the beneficiary (approximately 3%), their use for the study of health disparities is very limited. The imputation of race and ethnicity in these claims data would greatly increase the value of the data held in the APCD and may lead to better healthcare outcomes for CT residents. Currently no model exists to impute race and ethnicity in CT healthcare claims. This project aims to use previously existing CT birth records data held by the Department of Public Health (DPH) to produce an imputation model that can be used to impute race and ethnicity in CT healthcare claims, thereby greatly increasing the utility of the data in the CT APCD. In addition, the model created for this project can be then extended for use in other states, increasing the general utility of healthcare claims. (This is joint work with Robert Aseltine and Yishu Xue).
Supervised Learning without Discrimination
As machine learning is increasingly being used in areas protected by anti-discrimination law, or in other domains which are socially and morally sensitive, the problem of algorithmically measuring and avoiding prohibited discrimination in machine learning is pressing. What does it mean for a predictor not to discriminate with respect to a protected group (e.g., according to race, gender, etc.)? We propose a notion of non-discrimination that can be measured statistically and used algorithmically, and that avoids many of the pitfalls of previous definitions. We further study what types of discrimination and non-discrimination can be identified with oblivious tests, which treat the predictor as an opaque black box, and what different oblivious tests tell us about possible discrimination.
Joint work with Suriya Gunasekar, Moritz Hardt, Mesrob Ohannessian, Eric Price and Blake Woodworth.
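A sketch of one such oblivious test, here assuming an equalized-odds style criterion that compares group-conditional true- and false-positive rates; the predictor is treated purely as a black box, and the toy labels, predictions, and groups are made up.

```python
import numpy as np

def equalized_odds_gaps(y_true, y_pred, group):
    """Oblivious discrimination test: uses only (prediction, outcome, group),
    never the predictor's internals. Reports the between-group gaps in
    true-positive and false-positive rates; an equalized-odds style criterion
    asks for both gaps to be zero. Assumes exactly two groups."""
    rates = {}
    for g in np.unique(group):
        m = group == g
        tpr = y_pred[m & (y_true == 1)].mean()   # P(Yhat=1 | Y=1, group=g)
        fpr = y_pred[m & (y_true == 0)].mean()   # P(Yhat=1 | Y=0, group=g)
        rates[g] = (tpr, fpr)
    (t0, f0), (t1, f1) = rates.values()
    return abs(t0 - t1), abs(f0 - f1)

y = np.array([1, 1, 0, 0, 1, 1, 0, 0])       # true outcomes
yhat = np.array([1, 0, 0, 0, 1, 1, 1, 0])    # black-box predictions
grp = np.array([0, 0, 0, 0, 1, 1, 1, 1])     # protected group membership
tpr_gap, fpr_gap = equalized_odds_gaps(y, yhat, grp)
```

Part of the talk's point is what such tests can and cannot identify: because they only see the joint distribution of (prediction, outcome, group), different underlying mechanisms can produce identical gap values.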