
Department of Statistics &
Operations Research
Statistics Seminars
2009/2010
To subscribe to the list, please
follow this
link or send email to 12345saharon@post.tau.ac.il54321
(remove numbers unless you are a spammer…)

Second Semester

Date | Speaker | Notes | Title
16 February | Yaacov Ritov, Hebrew University | Early bird special | Lost with a MAP in a foreign neighborhood
16 March | Ruth Heller, Technion | | Sensitivity analysis for the cross-match test, with applications in genomics
23 March | Ishay Weissman, Technion | | Dependence Measures for Multivariate Extreme Value Distributions
13 April | Philip Stark, University of California, Berkeley | ** Sponsored by the Nathan and Lily Silver Chair in Applied Statistics ** | Justice and Inequalities
27 April | Bradley Jones, JMP Software | | Efficient Designs with Minimal Aliasing
4 May | Yoram Gal-Ezer | | Where was Bonferroni logically wrong and does the FDR correct his mistake?
11 May | Boaz Nadler, Weizmann Institute | | Principal Component Analysis in Noisy High Dimensional Settings
20 May | Jason Fine, University of North Carolina, Chapel Hill | Special Thursday time | Sensitivity testing for nonidentifiable models, with application to longitudinal data with noninformative dropout
27 May | Yair Goldberg, University of North Carolina, Chapel Hill | Special Thursday time | Censored quantile regression using inverse probability of censoring weighted average
1 June | Yoav Benjamini, Tel Aviv University | | Some thoughts on replicability
First Semester

Date | Speaker | Notes | Title
3 November | Ronny Luss, Tel Aviv University | | Predicting Abnormal Returns From News Using Text Classification
10 November | Saharon Rosset, Tel Aviv University | | Can we infer historical population movements from principal component analysis of genetic data? A 30-year old argument rages on
24 November | Elad Hazan, IBM Research | | Decision-making under uncertainty for structured problems
15 December | Nayantara Bhatnagar, Hebrew University | | Rapid and Slow Convergence of Simulated Tempering and Swapping
22 December | Yuval Nardi, Technion | 11:00 am (note special time) | Maxima of asymptotically Gaussian random fields
29 December | Alan Izenman, Temple University | | Regularization, Sparsity, and Rank Restrictions in High-Dimensional Regression
5 January | Gal Elidan, Hebrew University | | The "Ideal Parent" Algorithm
19 January | Malka Gorfine, Technion | | Statistical Methods for Genetic Risk Estimation of Rare Complex Genetic Diseases
Seminars are held on Tuesdays, 10:30 am, Schreiber Building, 309 (see the TAU map).
The seminar organizer is Saharon Rosset.
To join the seminar mailing list, or for any other inquiries, please call (03)-6408820 or email 12345saharon@post.tau.ac.il54321
(remove numbers unless you are a spammer…)
Seminars from previous years
ABSTRACTS
- Ronny Luss, Tel Aviv University
Predicting Abnormal Returns From News Using Text Classification
Abstract:
We show how text from news articles can be used to predict intraday price movements
of financial assets using support vector machines. Multiple kernel learning is
used to combine equity returns with text as predictive features to increase
classification performance, and we develop an analytic center cutting plane
method to solve the kernel learning problem efficiently. We observe that while
the direction of returns is not predictable using either text or returns, their
size is, with text features producing significantly better performance than
historical returns alone.
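As background for readers new to the setup, here is a minimal sketch of text classification with a linear SVM (Python with scikit-learn); it is not the authors' multiple kernel learning or cutting-plane method, and the toy corpus and labels are hypothetical placeholders.

# Minimal sketch of the basic ingredient: classifying news text with a
# linear SVM in scikit-learn. This is NOT the authors' multiple kernel
# learning / cutting-plane method; the toy corpus and labels below are
# hypothetical placeholders, not their data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

docs = [
    "Company X beats earnings expectations",
    "Regulator opens probe into Company X",
    "Company X announces routine board meeting",
    "Company X recalls flagship product",
]
# Hypothetical labels: 1 = large absolute intraday return after the story.
labels = [1, 1, 0, 1]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(docs, labels)
print(model.predict(["Company X misses revenue forecast"]))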
- Saharon Rosset, Tel Aviv University
Can we infer
historical population movements from principal component analysis of genetic
data? A 30-year old argument rages on
Abstract:
The seminal Science paper by Menozzi, Piazza and Cavalli-Sforza in 1978, and the book by the same authors in
1994, established the use of principal component analysis of genetic data for
making inferences about human history and migration. Specifically, the 1978
paper concluded that the Neolithic expansion (circa 6000 BC) had a major effect
on the European genetic landscape. In 2008, a Nature Genetics paper by Novembre and Stephens claimed that the apparent patterns in
these original works "resemble mathematical artifacts" which are
expected if the genetic data were generated by local gene exchange only (i.e.,
no long-range migration). Their arguments are based on the properties of Toeplitz matrices and their eigen-decompositions.
We re-examine the properties of the original data and the relevant mathematical
results, and demonstrate that the arguments of Novembre
and Stephens do not apply in this case. We also perform a critical re-analysis
of the original data with "modern" tools and conclude that the
original conclusions are statistically valid, though their historical
interpretation is difficult to verify.
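The Toeplitz point at issue is easy to reproduce numerically. Below is a minimal, illustrative sketch (Python with numpy/scipy, not a re-analysis of the genetic data) showing that the leading eigenvectors of a Toeplitz covariance built from purely local correlation are smooth, wave-like patterns of the kind that can be mistaken for migration gradients.

# Minimal sketch: the leading eigenvectors of a Toeplitz covariance matrix
# (correlation decaying with distance along a line of sampling locations)
# are smooth, low-frequency wave patterns, even though the model contains
# no long-range migration. Illustrative only.
import numpy as np
from scipy.linalg import toeplitz

n = 200                                   # sampling locations on a line
rho = 0.9                                 # correlation decaying with distance
cov = toeplitz(rho ** np.arange(n))       # Sigma[i, j] = rho ** |i - j|

eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
for k in range(1, 5):
    v = eigvecs[:, -k]                    # k-th leading eigenvector
    sign_changes = int(np.sum(np.diff(np.sign(v)) != 0))
    print(f"PC{k}: {sign_changes} sign change(s) along the line")
# Typically PC1 has 0 sign changes, PC2 has 1, PC3 has 2, and so on:
# smooth gradients and waves in the location index, with no migration
# anywhere in the model.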
- Elad Hazan, IBM Research
Decision-making under uncertainty for structured problems
Abstract:
Decision-making in the face of uncertainty over future outcomes is a
fundamental problem of statistics and operations research with numerous
applications. In this talk I'll describe recent algorithmic advances, both in
terms of accuracy as well as computational efficiency.
We describe the first efficient algorithm for the problem of online linear
optimization in the limited-feedback (bandit) setting that achieves the optimal regret bound. This resolves a question that had been open since the work of Awerbuch and Kleinberg in 2004, and is made possible by a
new technique for controlling the exploration-exploitation tradeoff, inspired
by convex optimization. Next we describe new prediction algorithms which attain
optimal regret bounds in both worst-case and stochastic scenarios. Obtaining tight performance bounds for prediction that interpolate between the worst-case and stochastic settings was considered a fundamental open question.
Based on work with Jacob Abernethy, Satyen Kale and
Alexander Rakhlin.
- Nayantara Bhatnagar, Hebrew University
Rapid and Slow
Convergence of Simulated Tempering and Swapping
Abstract: Markov
Chain Monte Carlo samplers are ubiquitous in statistical mechanics and Bayesian
statistics and have been analyzed extensively in theoretical computer science.
When the distribution being sampled from is multimodal, these samplers often
require a long running time to converge close to the desired distribution.
Multimodal posterior distributions arise very commonly in model selection,
mixture models and in statistical mechanical models. Simulated tempering and
swapping are two methods designed to sample more effectively from multimodal
distributions. In this work we show that
even these algorithms can fail to converge quickly and propose
modifications that can speed up the convergence.
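For readers unfamiliar with the swapping algorithm, here is a minimal illustrative sketch (Python/numpy) on a toy bimodal target; it shows the mechanics only and says nothing about the rapid/slow mixing results of the talk.

# Minimal sketch of the "swapping" (parallel tempering) algorithm on a toy
# bimodal target: a mixture of two well-separated Gaussians. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    # log of an (unnormalized) mixture of N(-5, 1) and N(+5, 1)
    return np.logaddexp(-0.5 * (x + 5.0) ** 2, -0.5 * (x - 5.0) ** 2)

betas = np.array([0.05, 0.2, 0.5, 1.0])   # inverse temperatures, target last
x = np.zeros(len(betas))                  # one chain per temperature
n_iter, step = 20000, 2.0
samples = []

for it in range(n_iter):
    # Random-walk Metropolis update within each tempered chain.
    for i, beta in enumerate(betas):
        prop = x[i] + step * rng.normal()
        if np.log(rng.uniform()) < beta * (log_target(prop) - log_target(x[i])):
            x[i] = prop
    # Propose swapping the states of a random pair of adjacent temperatures.
    i = rng.integers(len(betas) - 1)
    log_ratio = (betas[i] - betas[i + 1]) * (log_target(x[i + 1]) - log_target(x[i]))
    if np.log(rng.uniform()) < log_ratio:
        x[i], x[i + 1] = x[i + 1], x[i]
    samples.append(x[-1])                 # keep the beta = 1 (target) chain

samples = np.array(samples)
print("fraction of target-chain samples in each mode:",
      np.mean(samples < 0), np.mean(samples > 0))

A single chain at beta = 1 with the same random-walk step would rarely cross between the two modes; the swaps with the flatter (small-beta) chains are what allow mixing, and the talk concerns when this help is, or is not, enough.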
- Yuval Nardi, Technion
Maxima of asymptotically Gaussian random fields
Abstract:
The distribution of maxima of asymptotically (in a sense to be made precise in
the talk) Gaussian random fields over nice Euclidean sets is investigated. I
will describe a novel approach that may be used to yield asymptotic expansions
for such extremal probabilities. The approach builds on a measure-transformation argument followed by local approximation
arguments. A specific application from the realm of signal detection will
accompany the derivation. If time permits, I will show how to utilize the
approach for constructing simultaneous confidence bands for an unknown
(multivariate) density function.
- Alan Izenman, Temple University
Regularization, Sparsity, and Rank Restrictions in High-Dimensional Regression
Abstract: As
enormous data sets become the norm rather than the exception, statistics as a
scientific discipline is changing to keep up with this development. Of particular interest are regression
problems in which attention to high dimensionality has become an important part of determining how to proceed. In
multiple regression, regularization and sparsity
considerations have led to new methodologies for dealing with the
high-dimensionality, low sample-size situation.
In multivariate regression, rank restrictions have led to a reduced-rank
regression model that incorporates many of the classical
dimensionality-reduction methodologies, such as principal component analysis
and canonical variate analysis, as special
cases. In this talk, we discuss problems
of working with regression data when there are a large number of variables and
a relatively small number of observations, and we explore some new graphical
ideas for determining the effective dimensionality of multivariate regression
data.
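As a quick reminder of the rank-restriction idea mentioned above, here is a hedged sketch of classical reduced-rank regression (Python/numpy), using the standard closed-form solution that projects the OLS fit onto the leading singular directions of the fitted values; it is not the new graphical methodology of the talk.

# Minimal sketch of classical reduced-rank multivariate regression:
# minimize ||Y - X B||_F^2 subject to rank(B) <= r. The solution projects
# the OLS coefficient matrix onto the top-r right singular vectors of the
# OLS fitted values. Illustrative only.
import numpy as np

rng = np.random.default_rng(1)
n, p, q, r = 200, 10, 6, 2                  # samples, predictors, responses, rank

B_true = rng.normal(size=(p, r)) @ rng.normal(size=(r, q))   # rank-r truth
X = rng.normal(size=(n, p))
Y = X @ B_true + 0.5 * rng.normal(size=(n, q))

B_ols = np.linalg.lstsq(X, Y, rcond=None)[0]     # unrestricted OLS, p x q
Yhat = X @ B_ols
_, _, Vt = np.linalg.svd(Yhat, full_matrices=False)
P_r = Vt[:r].T @ Vt[:r]                          # projector onto top-r directions
B_rrr = B_ols @ P_r                              # rank-r coefficient matrix

print("rank of B_rrr:", np.linalg.matrix_rank(B_rrr))
print("fit error, OLS vs reduced-rank:",
      np.linalg.norm(Y - X @ B_ols), np.linalg.norm(Y - X @ B_rrr))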
- Gal Elidan, Hebrew University
The "Ideal
Parent" Algorithm
Abstract:
Bayesian networks are a formalism for encoding high-dimensional structured
joint distributions. The appeal of Bayesian networks is that an intuitive graphical representation combined with a principled probabilistic foundation leads to a compact representation of the distribution in a
decomposable form. This compact representation also facilitates efficient
methods for performing probabilistic computations, and automatic methods for parameter
estimation. Indeed, the past two decades
have seen an exponential growth in research related to these models.
Despite many innovative advances, model selection, or searching for a
beneficial structure of a Bayesian network, remains a formidable computational
task, which limits most applications to parameter estimation. This problem is
even more acute when learning networks in the presence of missing values or
hidden variables --- a scenario that is part of many real-life problems.
In this work we present a general method for dramatically speeding up model selection for continuous-variable Bayesian networks with common parametric distributions. In short, we efficiently evaluate the approximate merit of candidate structure modifications and apply time-consuming (exact) computations only to the most promising ones, thereby
achieving significant improvement in the running time of the search algorithm, without
compromising the quality of the solution.
Our method also naturally and efficiently facilitates the addition of
useful new hidden variables into the
network structure --- an automatic, factor-analysis-like task that is typically considered both conceptually
difficult and computationally prohibitive. We demonstrate our method on
synthetic and real-life datasets, both for learning structure on fully and partially observable data, and for
introducing new hidden variables during structure search.
- Malka Gorfine, Technion
Statistical Methods for Genetic Risk Estimation of Rare Complex Genetic Diseases
Abstract:
With the advances in the genetic dissection of complex diseases, the public has
been increasingly interested in an individual's genetic risk for developing
these diseases. Generally, there are two
aspects to the estimation of genetic risk: estimation of mutation carriership probability for a disease gene and prediction
of disease probability given the mutation status of the disease gene. Residual risk heterogeneity is widespread even after adjusting for the disease gene, and accounting for it is thus important for obtaining accurate risk estimates. However, residual risk heterogeneity is ignored in all currently available estimation procedures. We propose to account for the residual risk heterogeneity through the use of frailty models and data from a case-control family study. Another
common complication in complex diseases is that a disease gene can affect
multiple diseases. Thus, a subject censored due to another cause that is
related to the same gene is no longer independent of the age at onset of the
primary disease under study. We tackle this problem in the competing risks
framework. All the new estimation procedures developed in this work are
investigated extensively by simulations, and their asymptotic properties are
provided. The methods are illustrated with real data sets.
- Yaacov Ritov, Hebrew University
Lost with a MAP
in a foreign neighborhood
Abstract: We consider the maximum a posteriori path (MAP) estimator of an HMM process. We show that this estimator may be unreasonable when the state space is non-finite, or the process is in continuous time. We argue that this casts doubt on the usefulness of the concept in the standard finite-state-space, discrete-time HMM model. We then discuss a similar phenomenon in the completely different model of sparse regression.
- Ruth Heller, Technion
Sensitivity analysis for the cross-match test, with applications in genomics
Abstract: The cross-match test is an exact, distribution-free test of no treatment effect on a high-dimensional outcome in a randomized
experiment. The test uses optimal nonbipartite
matching to pair 2I subjects into I pairs based on similar outcomes, and the
cross-match statistic A is the number of times a treated subject was paired
with a control, rejecting for small values of A. If the test is applied in an
observational study in which treatments are not randomly assigned, it may be
comparing treated and control subjects who are not comparable, and may
therefore falsely reject a true null hypothesis of no treatment effect. We
develop a sensitivity analysis for the cross-match test, and apply it in an
observational study of the effects of smoking on gene expression levels. In
addition, we develop a sensitivity analysis for a standard multiple testing
procedure using the cross-match test and apply it to 1762 molecular function
categories in Gene Ontology.
Based on work with Shane Jensen, Paul Rosenbaum, and Dylan Small.
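For concreteness, here is a minimal sketch of the cross-match statistic itself (Python with numpy, scipy and networkx); the matching and permutation reference below are illustrative stand-ins, not the exact null distribution or the sensitivity analysis developed in the talk.

# Minimal sketch: the cross-match statistic A. Pair 2I subjects by optimal
# nonbipartite matching on their (high-dimensional) outcomes, then count
# pairs containing one treated and one control subject. The exact null
# distribution is replaced here by a crude permutation reference, and the
# sensitivity analysis of the talk is not reproduced.
import numpy as np
import networkx as nx
from scipy.spatial.distance import squareform, pdist

rng = np.random.default_rng(0)
n, p = 20, 50                               # 2I = 20 subjects, 50-dim outcomes
treated = np.array([1] * 10 + [0] * 10)     # treatment indicator
outcomes = rng.normal(size=(n, p))
outcomes[treated == 1] += 0.4               # a modest treatment effect

dist = squareform(pdist(outcomes))
W = dist.max() + 1.0
G = nx.Graph()
for i in range(n):
    for j in range(i + 1, n):
        # transform so that maximizing weight == minimizing total distance
        G.add_edge(i, j, weight=W - dist[i, j])
pairs = nx.max_weight_matching(G, maxcardinality=True)

def cross_match(z):
    return sum(z[i] != z[j] for i, j in pairs)

A_obs = cross_match(treated)
# The matching depends only on the outcomes, so under the null of no effect
# the randomization distribution of A is obtained by permuting treatment
# labels while keeping the matching fixed.
null = [cross_match(rng.permutation(treated)) for _ in range(2000)]
print("A =", A_obs, " one-sided p ~", np.mean([a <= A_obs for a in null]))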
- Ishay Weissman, Technion
Dependence Measures for Multivariate Extreme Value Distributions
Abstract: The dependence structure of multivariate extremes
will be discussed first. Then, two dependence measures will be presented. These
measures are suitable for any number of dimensions and are invariant under
increasing transformations of the components. They possess an additional desirable property, which their competitors lack, making them natural
dependence measures for multivariate extremes. A surprising connection to the
largest spacing among iid uniform random variables
will be discussed. This connection is useful as a diagnostic tool for the
quality of random number generators.
- Philip Stark, University of California, Berkeley
Justice and Inequalities
Abstract: I will discuss some problems in election auditing
and litigation that can be solved using probability inequalities. The lead example, illustrated with case
studies in auditing elections and estimating damages in civil litigation, is to
construct nonparametric one-sided confidence bounds for the mean of a
nonnegative population. If time permits, I will also discuss a contested
election in which a simple probability inequality provided evidence the court
found persuasive. This seminar is partly
a plea for help from probabilists: I hope someone in the audience can point me
to inequalities that are sharper than those I'm using.
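As a hedged illustration of the flavor of inequality involved (not necessarily one of the sharper bounds used in the talk), Markov's inequality alone already yields a conservative one-sided lower confidence bound for the mean of a nonnegative random variable:

% Hedged illustration only, not necessarily a bound from the talk:
% for X >= 0 with mean \mu > 0 and any 0 < \alpha < 1, Markov's inequality gives
\[
  \Pr\!\left(X \ge \frac{\mu}{\alpha}\right) \le \alpha
  \quad\Longrightarrow\quad
  \Pr\!\left(\mu \ge \alpha X\right) \ge 1 - \alpha ,
\]
% so the closed interval [\alpha X, \infty) is a conservative level-(1-\alpha)
% lower confidence bound for \mu based on a single nonnegative observation;
% sharper, multi-sample refinements of the same idea exist.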
- Bradley Jones, JMP Software
Efficient Designs with Minimal Aliasing
Abstract: For some experimenters, a disadvantage of the
standard optimal design approach is that it does not consider explicitly the
aliasing of specified model terms with terms that are potentially important but
are not included in the model. For example, when constructing an optimal design
for a first-order model, aliasing of main effects and interactions is not
considered. This can lead to designs that are optimal for estimation of the
primary effects of interest, yet have undesirable aliasing structures. In this
talk, I explain how to construct exact designs that minimize expected squared
bias subject to constraints on design efficiency. I will demonstrate the method on several examples that allow for comparison with standard textbook
approaches.
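To make the notion of aliasing concrete, here is a minimal sketch (Python/numpy) computing the alias matrix of a small regular fractional factorial; it illustrates the structure being penalized, not the expected-squared-bias optimization of the talk.

# Minimal sketch: the alias matrix of a regular 2^(3-1) fractional factorial
# with defining relation I = ABC. It shows how omitted two-factor
# interactions bias the main-effect estimates. Illustrative only.
import itertools
import numpy as np

# 2^(3-1) design: C = A*B, so I = ABC
runs = np.array([[a, b, a * b] for a, b in itertools.product([-1, 1], repeat=2)])
A, B, C = runs.T

X1 = np.column_stack([np.ones(4), A, B, C])      # intercept + main effects
X2 = np.column_stack([A * B, A * C, B * C])      # omitted 2-factor interactions

alias = np.linalg.solve(X1.T @ X1, X1.T @ X2)    # alias matrix (X1'X1)^-1 X1'X2
print(np.round(alias, 2))
# Each main effect is fully aliased with one interaction (e.g. A with BC),
# which is exactly the kind of structure a bias criterion penalizes.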
- Yoram Gal-Ezer
Where was Bonferroni logically wrong and does the FDR correct his mistake?
Abstract: In the case of a first positive result after many tries, with sample sizes large enough to detect a considerable effect, the need to correct the P value is quite intuitive. But this need is not really justified by the increased chance of encountering a false positive result, as is usually thought, since the chance of encountering a real positive is also increased. The real reason is to make up for a low prior expectation of a real effect behind the specific positive result, because the absence of significance in the other tries probably reflects a low prior probability.
This pitfall apparently misled Bonferroni and made him offer a wrong formula for the adjustment of the P value. It was wrong because it did not address the above-mentioned effect of the prior expectation of a real effect. The only approach completely capable of handling prior expectation is Bayes's. The only role of a 'family' of multiple comparisons is to serve as a database for the assessment of the prior expectation. For this task, the appropriate comparisons to include are as many as possible of the available comparisons with an assumed equal prior probability of a real effect.
The FDR approach will be shown to lie, in effect, halfway toward a completely prior-expectation-oriented approach. But this is not really enough for assessing the credibility of a specific finding, and the thought that it can be used for this purpose will be presented as an "optical illusion".
Whenever either such a "database" for assessing the prior expectation or general guidelines for the prior are available, taking them into account is not just a matter of choice. Any other option will lead to much less realistic results, unless a uniform prior distribution can be considered no less realistic than other distributions.
- Boaz Nadler, Weizmann Institute
Principal Component Analysis in Noisy High Dimensional Settings
Abstract: Principal Component Analysis (PCA) is perhaps the
most widely used method in multivariate analysis.
In this talk I'll first review some recent results regarding the behavior of the first few (largest) eigenvalues and corresponding eigenvectors in PCA when the observed high-dimensional data are of low rank but corrupted by noise.
Second, I'll present some applications of these results, mainly to the problem
of non-parametric detection of signals embedded in noise.
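A minimal numerical sketch of the setting (Python/numpy, illustrative only): when p and n are comparable, pure-noise sample eigenvalues spread over the Marchenko-Pastur bulk, and a low-rank signal is visible only when its leading sample eigenvalue emerges beyond the bulk edge.

# Minimal sketch: sample eigenvalues of high-dimensional data that is
# low-rank signal plus white noise (the "spiked covariance" setting).
# With unit noise variance, pure-noise sample eigenvalues fill the
# Marchenko-Pastur bulk, whose upper edge is (1 + sqrt(p/n))**2; only a
# sufficiently strong signal eigenvalue separates from that bulk.
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 200
signal_strength = 5.0

u = rng.normal(size=p)
u /= np.linalg.norm(u)                       # a single signal direction
scores = rng.normal(size=(n, 1)) * np.sqrt(signal_strength)
X = scores @ u[None, :] + rng.normal(size=(n, p))    # rank-1 signal + noise

sample_cov = X.T @ X / n
eigvals = np.sort(np.linalg.eigvalsh(sample_cov))[::-1]
bulk_edge = (1 + np.sqrt(p / n)) ** 2
print("top five sample eigenvalues:", np.round(eigvals[:5], 2))
print("Marchenko-Pastur bulk edge: ", round(bulk_edge, 2))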
- Jason Fine, University of North Carolina, Chapel Hill
Sensitivity testing for nonidentifiable models, with application to longitudinal data with noninformative dropout
Abstract: I consider the problem of evaluating a
statistical hypothesis when some model characteristics are non-identifiable
from observed data. Such a scenario is common in meta-analysis for assessing
publication bias and in longitudinal studies for evaluating a covariate effect
when dropouts are likely to be informative. One possible approach to this
problem is to fix a minimal set of sensitivity parameters conditional upon
which hypothesized parameters are identifiable. I discuss existing approaches
to inference derived by assessing the sensitivity of parameter estimates to the
sensitivity parameter.
I propose to formally evaluate the hypothesis of interest using an infimum statistic over the whole support of the sensitivity parameter, and I discuss the associated inferential challenges. I characterize the limiting distribution of the statistic as a
process in the sensitivity parameter, which involves a careful theoretical
analysis of its behavior under model misspecification. In practice, I suggest a
nonparametric bootstrap procedure to implement this infimum
test as well as to construct confidence bands for simultaneous pointwise tests across all values of the sensitivity
parameter, adjusting for multiple testing. The methodology's practical utility
is illustrated in an analysis of a longitudinal psychiatric study.
- Yair Goldberg, University of North Carolina, Chapel Hill
Censored quantile regression using inverse probability of censoring
weighted average
Abstract: Quantile regression has
recently attracted attention as an alternative to the Cox proportional hazard
model for analysis of censored survival data. We propose a novel approach for
linear censored quantile regression based on inverse
probability of censoring weighted average. The only assumptions required to
ensure validity of the proposed method are linearity at the quantile
level of interest, and independence of the survival time and the censoring,
conditional on the covariates. The regression estimator is found by minimizing
a convex objective function. This minimization can be performed using linear
programming. We prove consistency and asymptotic normality of the proposed
estimator. The simplicity of the proposed approach, its efficient computation,
and the relatively weak assumptions under which this approach is valid make it
a valuable alternative to existing approaches for quantile
regression.
Joint work with Prof. M. R. Kosorok
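A hedged sketch of the basic inverse-probability-of-censoring-weighting idea (Python with numpy/scipy): weight the uncensored observations by the inverse of the Kaplan-Meier estimate of the censoring survival function and minimize the weighted check loss. This illustrates the general idea only, not necessarily the exact estimating equation or the linear-programming formulation proposed in the talk.

# Minimal sketch of IPCW-weighted quantile regression for censored data:
# uncensored observations are weighted by 1 / G_hat(T), where G_hat is the
# Kaplan-Meier estimate of the censoring survival function, and a weighted
# check (pinball) loss is minimized. Illustrative only.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, tau = 500, 0.5
X = np.column_stack([np.ones(n), rng.uniform(size=n)])
beta_true = np.array([1.0, 2.0])
T = X @ beta_true + rng.normal(scale=0.5, size=n)   # latent survival times
C = rng.uniform(1.0, 6.0, size=n)                   # independent censoring times
Y = np.minimum(T, C)                                # observed follow-up times
delta = (T <= C).astype(float)                      # 1 = event observed

def km_censoring_survival(y, d):
    """Kaplan-Meier estimate of G(t) = P(C > t), evaluated at each y[i]."""
    m = len(y)
    order = np.argsort(y)
    d_s = d[order]
    at_risk = np.arange(m, 0, -1)
    factors = 1.0 - (1.0 - d_s) / at_risk        # censorings are "events" for C
    surv = np.empty(m)
    surv[order] = np.cumprod(factors)
    return np.clip(surv, 0.05, None)             # crude truncation for stability

w = delta / km_censoring_survival(Y, delta)       # IPCW weights (0 if censored)

def weighted_check_loss(beta):
    resid = Y - X @ beta
    return np.sum(w * resid * (tau - (resid < 0)))

beta_hat = minimize(weighted_check_loss, x0=np.zeros(2), method="Nelder-Mead").x
print("true beta:", beta_true, "  estimate:", np.round(beta_hat, 2))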
- Yoav Benjamini, Tel Aviv University
Some thoughts on replicability
Abstract: The problems of replicability
in scientific investigations that are based on statistical analyses will be
reviewed, with examples from behavioral genetics, clinical trials, functional magnetic resonance imaging, and microarray analysis. Selective
inference, mixed models analysis, and partial conjunction analysis will be
presented as important tools in the efforts to assure replicability.