Department of Statistics & Operations Research

Statistics Seminars

2008/2009

To subscribe to the list, please follow this link or send email to 12345saharon@post.tau.ac.il54321 (remove numbers unless you are a spammer…)

 

Second Semester

3 March

Yoav Benjamini, Tel Aviv University

 

Simultaneous and selective inference: current successes and future challenges

10 March

Purim

17 March

Daniel Yekutieli, Tel Aviv University

 

Adjusted Bayesian inference for selected parameters

24 March

No seminar (MCP09)

31 March

Armin Schwartzman, Harvard University

 

The Effect of Correlation in False Discovery Rate Estimation

April

No seminars scheduled (Pesach, Yom Hashoah, Yom Hazikaron)

5 May

Amir Globerson, Hebrew University

 

Exact probabilistic inference via iterative refinement

12 May

Eran Halperin, Tel Aviv University

 

Estimating Local Ancestry in Recently Admixed Populations

19 May

Allan Sampson, University of Pittsburgh

 

Modeling Issues For Multiple Outcomes In Post-Mortem Tissue Studies

 

26 May

Lawrence Brown, Wharton School of Business, University of Pennsylvania

 

In-Season Prediction of Batting Averages: A Field-test of Basic Empirical Bayes and Bayes Methodologies

9 June

Inbal Yahav, University of Maryland

 

Combining Residuals and Monitor Charts to Detect Outbreaks in Syndromic Surveillance Data

16 June

Ofer Harel, University of Connecticut

 

Multiple imputation for correcting verification bias in estimating sensitivity and specificity

23 June

Alex Goldenshluger, Haifa University

 

On Selection/Aggregation of Estimators



First Semester

28 October

Yulia Gavrilov, Tel Aviv University

 

The Multiple Stage FDR Controlling Procedure and Its Use in Model Selection

4 November

Isaac Meilijson, Tel Aviv University

 

The Garman-Klass volatility estimator revisited

11 November

Shahar Mendelson, Australian National University and Technion

 

Aggregation and empirical minimization

18 November

Yuval Nov, Haifa University

 

Minimum-Norm Estimation for Binormal Receiver Operating Characteristic (ROC) Curves

25 November

Andrea De Gaetano, Università Cattolica del Sacro Cuore, Rome

 

Physiological models: design issues and parameter estimation

2 December

No seminar

9 December

Elad Ziv, University of California, San Francisco

 

Effect of Genetic Architecture on Prediction of Complex Diseases

16 December

Lilach Hadany, Tel Aviv University

 

Second order models of genetic variation: sex, stress, and adaptation

23 December

Eitan Greenshtein, Central Bureau of Statistics

 

Application of Non Parametric Empirical Bayes Estimation to High Dimensional Classification

30 December

Haim Bar, Cornell University

 

Random effects and shrinkage estimation in comparative microarray experiments

6 January

Elad Yom Tov, IBM Research

 

Towards automatic debugging of concurrent programs

13 January

No seminar

20 January

Meir Smorodinsky, Tel Aviv University

 

Laplace – the father of probability theory and its applications to statistics and other fields: some historical remarks

27 January

No seminar (originally scheduled Alex Goldenshluger, postponed to next semester)

 


 

Seminars are held on Tuesdays at 10:30 am in the Schreiber Building, Room 309 (see the TAU map). The seminar organizer is Saharon Rosset.

To join the seminar mailing list, or for any other inquiries, please call (03)-6408820 or email 12345saharon@post.tau.ac.il54321 (remove numbers unless you are a spammer…)

 


Seminars from previous years, organized by Daniel Yekutieli:

 

Organized by Felix Abramovich:


ABSTRACTS

 

  • Yulia Gavrilov, Tel Aviv University

The Multiple Stage FDR Controlling Procedure and Its Use in Model Selection

Abstract: Our work deals with the problem of variable selection from the point of view offered by multiple hypotheses testing. The connection between these two statistical fields is natural, since dropping a variable from the prediction equation is equivalent to setting its coefficient to zero, and whether the coefficient is indeed nonzero is a question that can be answered by testing. In this work we offer a new adaptive multiple testing procedure with proven FDR control, both asymptotically and in finite samples. We compare the power of the proposed procedure to that of other adaptive procedures with proven FDR control, and discuss the FDR property of the proposed multiple stage step-down testing procedure for positively dependent test statistics. We introduce this testing procedure as a penalized method for model selection and show its good performance in an example. We also demonstrate how easily it can be implemented using standard statistical software. We then compare the performance of competing model selection procedures on different data sets, with the reference performance being that of a newly defined “random oracle”: the oracle's model selection performance on a data-dependent nested family of potential models. Using a simulation study we show that the proposed penalized procedure has empirical minimax performance in this setting. We finish with an example showing that the multiple stage penalized procedure also performs well for the logistic model.

Joint work with Yoav Benjamini. This talk is part of a Ph.D. thesis defense.
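As background for the step-down machinery discussed above, here is a minimal sketch of a generic step-down multiple testing procedure. The vector of critical constants is left as an input; the specific constants of the multiple stage FDR procedure are not reproduced, and the BH-type defaults in the usage line are an illustrative assumption only.

import numpy as np

def step_down_rejections(pvalues, crit):
    """Generic step-down multiple testing: order the p-values and reject the
    hypotheses with the k smallest p-values, where k is the largest index such
    that p_(i) <= crit[i-1] for every i <= k. The critical constants `crit`
    must be nondecreasing; the specific constants of the multiple stage FDR
    procedure in the talk are NOT reproduced here."""
    p = np.asarray(pvalues, dtype=float)
    order = np.argsort(p)
    passed = p[order] <= np.asarray(crit, dtype=float)
    k = 0
    for ok in passed:          # stop at the first failure (step-down)
        if not ok:
            break
        k += 1
    rejected = np.zeros(p.size, dtype=bool)
    rejected[order[:k]] = True
    return rejected

# Illustrative call with BH-type linear constants (an assumption, not the
# talk's constants): crit_i = i * q / m for q = 0.05 and m tests.
p = np.array([0.001, 0.012, 0.03, 0.2, 0.7])
crit = 0.05 * np.arange(1, p.size + 1) / p.size
print(step_down_rejections(p, crit))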

  • Isaac Meilijson, Tel Aviv University

The Garman-Klass volatility estimator revisited

Abstract: The Garman-Klass unbiased estimator of the variance per unit time of a zero-drift Brownian Motion B, based on the usual financial data that report, for time windows of equal length, the open, minimum, maximum and close values, is quadratic in the statistic S1=(CLOSE-OPEN, OPEN-MIN, MAX-OPEN). This estimator, with efficiency 7.4 with respect to the classical estimator (CLOSE-OPEN)^2, is widely believed to be of minimal variance. The current report disproves this belief by exhibiting an unbiased estimator with slightly but strictly higher efficiency, 7.7322. The essence of the improvement lies in the observation that the data should be compressed to the statistic S2, defined on W(t) = B(0) + [B(t)-B(0)] sign[B(1)-B(0)] as S1 was defined on the Brownian path B(t). The best S2-based quadratic unbiased estimator is presented explicitly. The Cramér-Rao upper bound for the efficiency of unbiased estimators, corresponding to the efficiency of large-sample Maximum Likelihood estimators, is 8.471. This bound cannot be attained because the distribution is not of exponential type.
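For readers unfamiliar with the estimator being revisited, a minimal sketch of the classical Garman-Klass variance estimate computed from open/high/low/close bars is given below; this is the textbook form only, not the improved estimator constructed in the talk.

import numpy as np

def garman_klass_variance(o, h, l, c):
    """Classical Garman-Klass variance estimate per bar from OHLC data
    (textbook form; the higher-efficiency estimator from the talk is not
    reproduced here)."""
    o, h, l, c = (np.asarray(v, dtype=float) for v in (o, h, l, c))
    log_hl = np.log(h / l)
    log_co = np.log(c / o)
    return 0.5 * log_hl**2 - (2.0 * np.log(2.0) - 1.0) * log_co**2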

  • Shahar Mendelson, Australian National University and Technion

Aggregation and empirical minimization

Abstract: Given a finite set of estimators, the problem of aggregation is to construct a new estimator whose risk is as close as possible to the risk of the best estimator in the set. It was conjectured that empirical minimization performed in the convex hull of the given set is an optimal aggregation method. In this talk I will show that this conjecture is false. I will also show that, despite this, empirical minimization in the convex hull of a well-chosen, empirically determined subset of the original set is an optimal aggregation method.

  • Yuval Nov, Haifa University

Minimum-Norm Estimation for Binormal Receiver Operating Characteristic (ROC) Curves

Abstract: The Receiver Operating Characteristic (ROC) curve is often used to assess the usefulness of a diagnostic test.  We present a novel method to estimate the parameters of a popular semi-parametric ROC model, called the binormal model.  Our method is based on minimization of the functional distance between two estimators of an unknown transformation postulated by the model, and has a simple, closed-form solution.  We study the asymptotics of our estimators, show via simulation that they compare favorably with existing estimators, and illustrate how covariates may be incorporated into the norm minimization framework.
Joint work with Ori Davidov.
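For reference, the binormal ROC model named in the title has the standard parametric form below; the talk's minimum-norm estimator of the parameters is not shown here.

% Binormal ROC model: with \Phi the standard normal cdf, the curve is
% parameterized by an intercept a and a slope b, so that for a false
% positive rate t in (0,1),
\mathrm{ROC}(t) \;=\; \Phi\bigl(a + b\,\Phi^{-1}(t)\bigr), \qquad 0 < t < 1 .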


  • Andrea De Gaetano, Università Cattolica del Sacro Cuore, Rome

Physiological models: design issues and parameter estimation

Abstract: In this talk the glucose/insulin physiological control system will be described, together with its connection to the increasing prevalence of diabetes. The criteria by which a mathematical model can be judged to be appropriate will be discussed. Counter-examples from the literature will be used to show pitfalls in parameter estimation techniques. Finally, the application of GLS and MCMC to the estimation of model parameters will be described.

  • Elad Ziv, University of California, San Francisco

Effect of Genetic Architecture on Prediction of Complex Diseases

Abstract: Complex diseases are diseases that are partially influenced by genetic factors but have no clear Mendelian pattern of inheritance. The vast majority of medical traits are considered complex diseases. One of the goals of modern genetics is to discover the genetic variants that underlie these traits. It is assumed that discovery of these genetic variants will lead to genetic tests that can be used to improve disease prediction in individuals. A common model of complex diseases, the "common disease, common variant" model, suggests that by identifying common variants that have modest effects on disease, geneticists will be able to combine these into accurate disease prediction tools. In contrast, rare variants that substantially increase risk have traditionally been considered minimally useful from a public health perspective. We relate the allele frequency, the population attributable risk and the C-statistic, a standard tool for quantifying disease prediction that is equivalent to the area under the ROC curve. Using these relationships we demonstrate that rare variants with a strong effect on disease are actually more useful in a public health setting than common variants with modest effects. Furthermore, we demonstrate that under many plausible circumstances, most of the disease prediction potential in the genome actually lies in the rare allele frequency range. Finally, we consider the practical sample size limitations on discovery for both rare and common variants related to disease, and ask at what point the disease prediction potential of the genome has been optimally exploited. Our results have several important consequences for study design and for clinical practice: (1) Study designs may have reached, or be close to reaching, the limit of common variants useful for disease prediction; in contrast, the majority of rare variants with disease prediction potential remain undiscovered. (2) Designing clinical prediction tools that optimally exploit the disease prediction potential of the genome will require sequencing individual patients at some genetic loci.
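As a quick reminder of the quantity referred to above, a minimal sketch of the C-statistic (the area under the ROC curve) as the probability that a random case outscores a random control is given below; the relationships to allele frequency and attributable risk derived in the talk are not reproduced.

import numpy as np

def c_statistic(case_scores, control_scores):
    """C-statistic / AUC: the probability that a randomly chosen case receives
    a higher risk score than a randomly chosen control, counting ties as 1/2."""
    cases = np.asarray(case_scores, dtype=float)[:, None]
    controls = np.asarray(control_scores, dtype=float)[None, :]
    return float((cases > controls).mean() + 0.5 * (cases == controls).mean())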

  • Lilach Hadany, Tel Aviv University

Second order models of genetic variation: sex, stress, and adaptation

Abstract: Genetic variation provides the raw material for evolutionary change. In most population genetics models, variation is assumed to be generated at a uniform rate, depending on the genes coding for variation but not on the state of the individual. In this talk I discuss the implications of a new assumption - that the generation of genetic variation is itself plastic, so that genetic variation is generated at higher rates under stress. We found that stress-induced genetic variation can evolve under a wide parameter range, and might help explain the evolution of sex and the mechanisms of complex adaptation. Theoretical models and experimental evidence will be discussed.

  • Eitan Greenshtein, Central Bureau of Statistics

Application of Non Parametric Empirical Bayes Estimation to High Dimensional Classification

Abstract: We consider the problem of classification using a high dimensional feature space. In a paper by Bickel and Levina (2004), it is recommended to use naive-Bayes classifiers, i.e., to treat the features as if they were statistically independent. Consider now a sparse setup, where only a few of the features are informative for classification. Fan and Fan (2007) suggested a variable selection and classification method called FAIR. The FAIR method improves the design of naive-Bayes classifiers in sparse setups. The improvement is due to reducing the noise in estimating the features' means; this reduction arises because only the means of a few selected variables need to be estimated.
We also consider the design of naive-Bayes classifiers. We show that a good alternative to variable selection is estimation of the means through a certain non-parametric empirical Bayes procedure. In sparse setups the empirical Bayes procedure implicitly performs an efficient variable selection. It also adapts very well to non-sparse setups, and has the advantage of making use of the information from many "weakly informative" variables, which variable-selection-based classification procedures give up on using. We compare our method with FAIR and other classification methods in simulations for sparse and non-sparse setups, and in real data examples involving classification of normal versus malignant tissues based on microarray data.
Joint work with Junyong Park.
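A minimal sketch of the naive-Bayes rule discussed above, for two classes with independent Gaussian features and a common known variance per feature; the talk's contribution, replacing the plug-in mean estimates with nonparametric empirical Bayes estimates, is only indicated in the comment.

import numpy as np

def naive_bayes_classify(x, mu0, mu1, sigma2):
    """Two-class naive-Bayes rule with independent Gaussian features and known
    per-feature variances sigma2: classify to class 1 when the independence-
    based log-likelihood ratio is positive. In the talk, the plug-in means
    mu0, mu1 would be replaced by nonparametric empirical Bayes estimates
    (not shown here)."""
    x, mu0, mu1, sigma2 = (np.asarray(v, dtype=float) for v in (x, mu0, mu1, sigma2))
    llr = np.sum((mu1 - mu0) * (x - (mu0 + mu1) / 2.0) / sigma2)
    return int(llr > 0)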

  • Haim Bar, Cornell University

Random effects and shrinkage estimation in comparative microarray experiments

Abstract: A mixture of mixed-effects models for the comparison of (normalized) microarray data from two treatment groups is proposed. Approximate maximum likelihood fitting is accomplished via a fast EM-type algorithm. Posterior odds of treatment/gene interactions, derived from the model, involve shrinkage estimates of both the interactions and the gene-specific error variances. Genes are classified as being associated with treatment based on the posterior odds or on the local FDR with a fixed cutoff. The approach is shown to perform very well, and to be more numerically stable, when compared with some well-known competitors. In principle the model can be generalized to more complex designs and to multi-platform microarray data.

  • Elad Yom Tov, IBM Research

Towards automatic debugging of concurrent programs

Abstract: Concurrent computer programs are fast becoming prevalent in many critical applications. Unfortunately, these programs are especially difficult to test and debug. Recently, it has been suggested that injecting random timing noise into multiple concurrency-related points within a program can assist in eliciting bugs within the program. Upon eliciting the bug, it is necessary to identify a minimal set of program locations that indicate the source of the bug to the programmer.
I will show how this problem can be formulated as a sampling and feature selection problem, and propose batch and active sampling methods as a solution to the problem. I will focus on the active sampling algorithm, analyze its convergence properties, and show how our approach can pinpoint specific lines in the code which are related to bugs in the program, even in very large programs.

  • Yoav Benjamini, Tel Aviv University

Simultaneous and selective inference: current successes and future challenges

Abstract: Building upon a short review of historical trends in MCP research, I shall explain why the current decade can be viewed as a second golden era for our field. I argue that much of the success stems from our being able to address real current needs. At the same time, this success has generated a plethora of concepts of error rate and power, as well as a multiplicity of methods for addressing them. This may seem all too confusing to the users of our methodology and poses a threat.
To avoid the threat, it is our responsibility to match our theoretical goals to the goals of our clients: scientists, educators, public policy setters, engineers or business people. Only then should we match the methods to the theoretical goals. I shall discuss some of the considerations that are related to the needs of clients: addressing simultaneous inference or selective inference, testing or estimation, decision-making or scientific reporting.
I shall then further argue that the vitality of our field in the future - as a research area - depends upon our ability to continue to address the real needs of statistical analysis in specific current problems. I shall demonstrate these needs in two application areas that offer new challenges and have received less attention in our community to date.

  • Daniel Yekutieli, Tel Aviv University

Adjusted Bayesian inference for selected parameters

Abstract: We address the problem of providing inference for parameters selected after viewing the data. A frequentist solution to this problem is False Discovery Rate adjusted inference. We explain the role of selection in controlling the occurrence of false discoveries in Bayesian analysis, and argue that Bayesian inference may also be affected by selection, in particular Bayesian inference based on subjective priors. We introduce selection-adjusted Bayesian methodology based on the conditional posterior distribution of the parameters given selection; show how it can be used to specify selection criteria; explain how it relates to the Bayesian FDR approach; and apply it to microarray data.
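Schematically, and only as a rough illustration of what conditioning on selection does to a posterior (the talk's precise formulation may differ), if the selection event is determined by the data, the conditional posterior replaces the likelihood by its selection-conditional version:

% A schematic selection-adjusted posterior: the likelihood f(y | theta) is
% replaced by the conditional likelihood given the selection event A = {Y in S}.
\pi(\theta \mid y, A) \;\propto\; \pi(\theta)\,
\frac{f(y \mid \theta)}{\Pr(Y \in S \mid \theta)} .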

  • Armin Schwartzman, Harvard University

The Effect of Correlation in False Discovery Rate Estimation

Abstract: Current FDR methods mostly ignore the correlation structure in the data. The objective of this work is to quantify the effect of correlation in FDR analysis. Specifically, we derive practical approximations for the expectation, variance, and quantiles of the FDR estimator for arbitrarily correlated data. This is achieved using a negative binomial model for the number of false discoveries, where the parameters are found empirically from the data. We show that correlation may increase the bias and variance dramatically with respect to the independent case, and that in some extreme cases, such as an exchangeable correlation structure, the FDR estimator fails to be consistent as the number of tests gets large.
This is joint work with Xihong Lin.
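For context, the standard plug-in FDR estimator at a p-value threshold t, whose bias and variance under correlation the talk quantifies, has the familiar form below (with m tests and an estimate of the null proportion); this is background notation, not the talk's negative binomial model.

% Standard plug-in FDR estimator at threshold t, for m tests:
\widehat{\mathrm{FDR}}(t) \;=\;
\frac{\hat{\pi}_0\, m\, t}{\max\bigl\{\#\{i : p_i \le t\},\, 1\bigr\}} .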

  • Amir Globerson, Hebrew University

Exact probabilistic inference via iterative refinement

Abstract: Graphical models are a powerful tool for representing distributions over complex multivariate objects such as images or documents. Although graphical models have been used with considerable success in many domains, such as machine vision and signal processing, it is theoretically NP-hard to infer even simple model properties, such as the most likely assignment (MAP). This difficulty has been addressed in practice by designing approximate inference algorithms (such as belief propagation) that often work well, although with relatively weak theoretical guarantees.
A related approach to approximating the MAP problem is via Linear Programming relaxations. However, these still do not yield an exact solution in the general case. I will present an approach that uses sequences of LP relaxations that yield tighter and tighter approximations, in a manner that is tailored to a specific inference problem.  These approximations can be solved using simple message passing algorithms, which are derived from the convex dual of the LP relaxation.
I will show how this approach can be applied to difficult inference problems such as protein design and stereo vision, and in fact often yields provably exact solutions to these problems.

Based on joint work with David Sontag, Talya Meltzer, Tommi Jaakkola and Yair Weiss.
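As background, the basic first-order LP relaxation of pairwise MAP inference, which the iterative refinement described above tightens, optimizes over locally consistent pseudo-marginals; the talk's cluster-based tightenings and message passing updates are not reproduced here.

% First-order LP relaxation of pairwise MAP: maximize over pseudo-marginals \mu
\max_{\mu \ge 0} \;\;
\sum_{i}\sum_{x_i} \theta_i(x_i)\,\mu_i(x_i)
\;+\; \sum_{(i,j)\in E}\sum_{x_i,x_j} \theta_{ij}(x_i,x_j)\,\mu_{ij}(x_i,x_j)
\quad \text{s.t.}\quad
\sum_{x_j}\mu_{ij}(x_i,x_j)=\mu_i(x_i),\;\;
\sum_{x_i}\mu_i(x_i)=1 .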
 

  • Eran Halperin, Tel Aviv University

Estimating Local Ancestry in Recently Admixed Populations

Abstract: Large-scale genotyping of genetic variants (Single Nucleotide Polymorphisms - SNPs) has shown great promise in identifying markers that could be linked to diseases, based on evidence of correlation between the disease and the marker. One of the major obstacles in performing these studies is that the underlying population sub-structure could produce spurious associations. Population sub-structure can be caused by the presence of two distinct sub-populations or by a single pool of admixed individuals, such as African Americans or Latinos; such populations are formed by the encounter and subsequent mixing of two or more populations some 10-20 generations ago. In such populations, different bases in the genome may have originated from different populations.
In this talk, I will describe methods for the inference of local ancestry in such individuals. I will describe two methods that we have recently developed to detect admixture, or the locus-specific ancestry, in an admixed population. We have run extensive experiments to characterize the important parameters that have to be optimized when considering this problem; I will describe the results of these experiments in the context of existing tools such as SABER and STRUCTURE.

  • Lawrence Brown, Wharton School of Business, University of Pennsylvania

In-Season Prediction of Batting Averages: A Field-test of Basic Empirical Bayes and Bayes Methodologies

Abstract: Batting average is one of the principal performance measures for an individual baseball player. It has a simple numerical structure: the number of successful attempts, “Hits”, as a proportion of the total number of qualifying attempts, “At-Bats”. This situation, with Hits as a number of successes within a qualifying number of attempts, makes it natural to statistically model each player’s batting average as a binomial variable outcome. This is a common data structure in many statistical applications, and so the methodological study here has implications for a wide range of applications. [No prior knowledge about baseball is required for this talk, and if you have none then don’t expect to have much more when you leave.]
   We will look at batting records for every Major League player over the course of a single season (2005). The primary focus is on using only the batting record from an earlier part of the season (e.g., the first 3 months) in order to predict the batter’s latent ability, and consequently to predict his batting-average performance for the remainder of the season. Since we are using a season that has already concluded, we can validate our predictive performance by comparing the predicted values to the actual values for the remainder of the season.
   The methodological purpose of this study is to gain experience with a variety of predictive methods applicable to a much wider range of situations. Several of the methods to be investigated derive from empirical Bayes and hierarchical Bayes interpretations. Although the general ideas behind these techniques have been understood for many decades*, some of these methods have only been refined relatively recently in a manner that promises to more accurately fit data such as that at hand.
   One feature of all of the statistical methodologies here is the preliminary use of a particular form of variance stabilizing transformation in order to transform the binomial data problem into a somewhat more familiar structure involving (approximately) Normal random variables with known variances. This transformation technique is also useful in validating the binomial model assumption that is the conceptual basis for all our analyses. If time permits we will also describe how it can be used to test for the presence of “streaky hitters” whose latent ability appears to significantly change over time.

* A particularly relevant background reference is Efron, B. and Morris, C. (1977), “Stein’s paradox in statistics,” Scientific American 236, 119-127, and the earlier, more technical version: Efron, B. and Morris, C. (1975), “Data analysis using Stein’s estimator and its generalizations,” Journal of the American Statistical Association 70, 311-319.
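One common variance stabilizing transformation of the kind described above (shown as an illustration; the exact variant used in the talk may differ) maps binomial counts to approximately normal variables with known variance:

% Arcsine-square-root variance stabilization for X ~ Binomial(n, p):
\tilde{X} \;=\; \arcsin\!\sqrt{\frac{X + 1/4}{\,n + 1/2\,}}
\;\;\approx\;\; \mathcal{N}\!\Bigl(\arcsin\sqrt{p},\; \tfrac{1}{4n}\Bigr).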

  • Inbal Yahav, University of Maryland

Combining Residuals and Monitor Charts to Detect Outbreaks in Syndromic Surveillance Data

Abstract: The main goal of biosurveillance is the early detection of disease outbreaks. Advances in technology have allowed the collection, transfer, and storage of pre-diagnostic information in addition to traditional diagnostic data. Such data carry the potential of an earlier outbreak signature.
Analyzing data with the goal of detecting outbreaks is composed of two main stages: preprocessing the data to remove explainable patterns, and monitoring the 'clean' data, using monitor charts, to determine outbreaks. The literature suggests a variety of preprocessing functions and monitor charts for controlling syndromic surveillance data. However, it is well known that each of these functions is tuned for specific outbreaks. For example, Shewhart is optimal for detecting spikes in the data, while EWMA is a better detector for exponential outbreaks.
When considering syndromic surveillance data streams, the shape of an outbreak is still unknown. This leaves open the questions of which function is optimal for such data and whether there is one function that outperforms all others. Motivated by these questions, we propose methods that use a combination of functions to analyze and monitor syndromic surveillance data streams. We consider combining methods at different stages of the monitoring process, i.e. combining residuals vs. combining monitor charts. For the combination of monitor charts we present a static method where the weight of each chart is predefined. For the combination of residuals we propose an adaptive method based on recent history.

Joint work with Galit Shmueli.
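A minimal sketch of the two monitor charts named above, applied to preprocessed residuals (illustrative thresholds only; the combination schemes proposed in the talk are not reproduced):

import numpy as np

def shewhart_alarms(resid, k=3.0):
    """Shewhart chart on standardized residuals: alarm when a single point
    exceeds k standard deviations (good for detecting spikes)."""
    z = (resid - np.mean(resid)) / np.std(resid)
    return np.abs(z) > k

def ewma_alarms(resid, lam=0.2, L=2.7):
    """EWMA chart: z_t = lam*x_t + (1-lam)*z_{t-1}; alarm when z_t leaves
    +/- L * sigma_z (better suited to gradual, e.g. exponential, outbreaks)."""
    x = (resid - np.mean(resid)) / np.std(resid)
    z, alarms = 0.0, []
    for t, xt in enumerate(x, start=1):
        z = lam * xt + (1.0 - lam) * z
        sigma_z = np.sqrt(lam / (2.0 - lam) * (1.0 - (1.0 - lam) ** (2 * t)))
        alarms.append(abs(z) > L * sigma_z)
    return np.array(alarms)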

  • Ofer Harel, University of Connecticut

Multiple imputation for correcting verification bias in estimating sensitivity and specificity

Abstract: Sensitivity and specificity are widely used to describe a diagnostic test. When all subjects have test results and known true status, the estimation of sensitivity and specificity is built on two binomial distributions; this estimation is not a trivial task. In the case in which all subjects are screened using a common test, and only a subset of these subjects is tested using a gold standard test, there is a risk of bias, called verification bias. When not all subjects have been verified, special methods of estimation need to be used. There are several methods to estimate the sensitivity, the specificity and their standard errors in this kind of situation. The standard methods were developed under some special cases of the verification choices. Approaching this problem from a missing data perspective allows us to use the Multiple Imputation (MI) technique in order to impute the data. We adopt the MI framework and develop different MI procedures using the most common "complete-data" methods. We compare the procedures among themselves and with the standard (incomplete-data) methods. We illustrate our procedure using a biomedical data example.
This is joint work with Andrew Zhou.
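A minimal sketch of how multiply imputed completed-data estimates (for example, the sensitivity computed on each imputed data set) are pooled with Rubin's rules; the imputation model for the unverified subjects, which is the substance of the talk, is not shown.

import numpy as np

def rubin_pool(estimates, variances):
    """Pool m completed-data estimates and their variances using Rubin's rules:
    returns the pooled estimate and its total standard error."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    q_bar = q.mean()                     # pooled point estimate
    u_bar = u.mean()                     # average within-imputation variance
    b = q.var(ddof=1)                    # between-imputation variance
    total = u_bar + (1.0 + 1.0 / m) * b  # total variance
    return q_bar, float(np.sqrt(total))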

  • Alex Goldenshluger, Haifa University

On Selection/Aggregation of Estimators

Abstract: The talk concentrates on the aggregation problem, which can be formulated as follows. Assume that we have a family of estimators built on the basis of available observations. The goal is to select an estimator whose risk is as close as possible to that of the best estimator in the family. We propose a general scheme that applies to families of arbitrary estimators and a wide variety of models and global risk measures. We derive oracle inequalities and show that they are unimprovable in some sense. Numerical results demonstrate good practical behavior of the procedure.
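Schematically, the oracle inequalities referred to above bound the risk of the selected or aggregated estimator by that of the best estimator in the family plus a remainder term; a generic form (the talk's precise constants and remainder are not reproduced) is:

% Generic oracle inequality for a selected/aggregated estimator \tilde{f}
% from a family f_1, ..., f_M, with remainder \delta_n(M) -> 0:
\mathbb{E}\, R(\tilde{f}) \;\le\; C \,\min_{1 \le k \le M} R(f_k) \;+\; \delta_n(M).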