Department of Statistics & Operations Research

Statistics Seminars

2010/2011

To subscribe to the list, please follow this link or send email to 12345saharon@post.tau.ac.il54321 (remove numbers unless you are a spammer…)

 

Second Semester

1 March

Alfred Inselberg, Tel Aviv Univ., Univ. of Singapore and San Diego SuperComputing Center

 

Visualization & Data Mining for High Dimensional Data

22 March

Shulamith Gross, Baruch College

 

Evaluating Probability Forecasts

29 March

Dror Kenett, Tel Aviv University
Hidden features of financial markets

 

12 April

Gadi Fibich, Tel Aviv University
Aggregate Diffusion Dynamics in Agent-Based Models with a Spatial Structure

 

26 April

Matan Gavish, Stanford University

 

Harmonic Analysis of Transposable Arrays

3 May

Marina Bogomolov, Tel Aviv University

 

Hierarchical Testing of Subsets of Hypotheses

17 May

Theofanis Sapatinas, University of Cyprus

 

Functional Time-Series Prediction: Application to the Prediction of the Daily Power Demand Shapes for the Electricity Authority of Cyprus

24 May

Ehud Aharoni, IBM Research

 

Generalized Alpha Investing: Definitions, Optimality Results, and Application to Public Databases

31 May

Ofer Harel, University of Connecticut

 

Strategies for Data Analysis with Two Types of Missing Values

28 June

Adi Wyner, The Wharton School, University of Pennsylvania

** special summer seminar**

Uncertainty in 1000 Year Old Reconstructions of the Earth Temperatures: Criticisms and Rejoinder

 

 

First Semester

12 October

Yulia Gavrilov

 

Peak Detection as Multiple Testing

19 October

Danny Feldman, California Institute of Technology

 

Core-sets and their applications

26 October

Itai Dattner, Haifa University

 

On deconvolution of distribution functions

2 November

Ronny Luss, Tel Aviv University

 

Isotonic Recursive Partitioning

16 November

Felix Abramovich, Tel Aviv University

 

Model Selection in Regression: some new (?) thoughts on the old (?) problem

30 November

Yair Goldberg, University of North Carolina

 

Q-Learning with Censored Data

14 December

Gideon Dror, Academic College of Tel-Aviv-Yaffo

 

Analysis of Swimming Sessions Using Semi-Markov Models

21 December

Nir Atias, Tel Aviv University

 

An Algorithmic Framework for Predicting Side-Effects of Drugs

28 December

Ori Rosen, University of Texas at El Paso

 

Adaptive Spectral Estimation for Nonstationary Time Series

4 January

Noam Shental, Open University

 

Identification of rare alleles and their carriers using compressed se(que)nsing

11 January

Galit Shmueli, University of Maryland

 

COM-Poisson Regression: A Flexible Model for Count Data

18 January

Youngjo Lee, Seoul University

 

Likelihood approach to large-scale multiple testing


 

Seminars are held on Tuesdays, 10.30 am, Schreiber Building, 309 (see the TAU map). The seminar organizer is Saharon Rosset.

To join the seminar mailing list or any other inquiries - please call (03)-6408820 or email 12345saharon@post.tau.ac.il54321 (remove numbers unless you are a spammer…)

 


Seminars from previous years


ABSTRACTS

 

  • Yulia Gavrilov

Peak Detection as Multiple Testing

Abstract: This talk considers the problem of detecting equal-shaped non-overlapping unimodal peaks in the presence of Gaussian ergodic stationary noise, where the number, location and heights of the peaks are unknown. A multiple testing approach is proposed in which, after kernel smoothing, the presence of a peak is tested at each observed local maximum. The procedure provides strong control of the family wise error rate and the false discovery rate asymptotically as both the signal-to-noise ratio (SNR) and the search space get large, where the search space may grow exponentially as a function of SNR. Simulations assuming a Gaussian peak shape and a Gaussian autocorrelation function show that desired error levels are achieved for relatively low SNR and are robust to partial peak overlap. Simulations also show that detection power is maximized when the smoothing bandwidth is close to the bandwidth of the signal peaks, akin to the well-known matched filter theorem in signal processing. The procedure is illustrated in an analysis of electrical recordings of neuronal cell activity.
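
As a rough, hypothetical illustration of the pipeline (not the paper's procedure, whose error guarantees rest on the null distribution of heights at local maxima of the smoothed noise), the sketch below smooths a simulated noisy signal, runs a naive one-sided z-test at each observed local maximum, and applies a Benjamini-Hochberg correction; all constants are made up.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.signal import find_peaks
from scipy.stats import norm

rng = np.random.default_rng(0)
n, sigma, bw = 5000, 1.0, 5.0                       # length, noise sd, smoothing bandwidth (arbitrary)
centers = rng.choice(n, size=10, replace=False)     # 10 hidden unimodal peaks
signal = sum(4.0 * np.exp(-0.5 * ((np.arange(n) - c) / bw) ** 2) for c in centers)
y = signal + rng.normal(0.0, sigma, n)

ys = gaussian_filter1d(y, bw)                       # kernel smoothing
delta = np.zeros(n); delta[n // 2] = 1.0
w = gaussian_filter1d(delta, bw)                    # smoothing weights
sd_smooth = sigma * np.sqrt(np.sum(w ** 2))         # sd of smoothed white noise

peaks, _ = find_peaks(ys)                           # observed local maxima
pvals = norm.sf(ys[peaks] / sd_smooth)              # naive z-tests (ignore max-selection bias)

q = 0.05                                            # Benjamini-Hochberg at level q
order = np.argsort(pvals)
ok = pvals[order] <= q * np.arange(1, len(pvals) + 1) / len(pvals)
k = ok.nonzero()[0].max() + 1 if ok.any() else 0
print(f"{k} of {len(peaks)} local maxima declared significant")
```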

 

  • Danny Feldman, California Institute of Technology

Core-sets and their applications

Abstract: A coreset (or, core-set) for a given problem is a "compressed" representation of its input, in the sense that a solution for the problem with the (small) coreset as input would yield an approximate solution to the problem with the original (large) input.
This relatively new paradigm yields algorithms that are both provable and practical for problems in fields such as Streaming, Compressed Sensing, Machine Learning, Time Series Analysis, Distributed Computing, GPU-based algorithms, and privacy-preserving algorithms.
In this talk, I will present coresets that provide efficient and provable algorithms for problems such as k-mean/median clustering, and L_p matrix approximation (known as PCA/SVD/Linear Regression for p=2) in high-dimensional spaces.
If time permits, we will also construct privacy-preserving coresets which allow us to answer an unbounded number of queries on the dataset without disclosing information about specific individuals.
This talk is based on papers co-authored with: Amos Fiat, Haim Kaplan, Michael Langberg, Morteza Monemizadeh, Kobbi Nissim, Micha Sharir, Christian Sohler, and David Woodruff.
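
To make the "compressed representation" idea concrete, here is the standard (1+ε)-approximation definition in the k-means setting; this generic form is common in the coreset literature and is given only as background:

```latex
% C (with weights w) is a (1+\epsilon)-coreset of P for k-means if, for every set Q of k centers,
(1-\epsilon)\sum_{p\in P}\min_{q\in Q}\lVert p-q\rVert^2
\;\le\; \sum_{c\in C} w(c)\min_{q\in Q}\lVert c-q\rVert^2
\;\le\; (1+\epsilon)\sum_{p\in P}\min_{q\in Q}\lVert p-q\rVert^2 .
```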

 

  • Itai Dattner, Haifa University

On deconvolution of distribution functions

Abstract: It is well known that rates of convergence of estimators in deconvolution problems are affected by the smoothness of the error density and the density to be estimated. However, the problem of distribution deconvolution is more delicate than previously recognized. We derive different rates of convergence with respect to the tail behavior of the error characteristic function. We present order-optimal deconvolution estimators, both for known and unknown error distributions. An adaptive estimator which achieves the optimal rates within a logarithmic factor is developed. Simulation studies comparing the adaptive estimator to other methods are presented and support the superiority of our method. An example with real data is also discussed.
Based on joint works with Alexander Goldenshluger and Benjamin Reiser.
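
As textbook background (not necessarily the estimators of the talk), the classical kernel deconvolution estimator for Y = X + ε with known error characteristic function φ_ε divides the empirical characteristic function of Y by φ_ε and regularizes with a kernel φ_K and bandwidth h:

```latex
\hat{f}_X(x) \;=\; \frac{1}{2\pi}\int e^{-itx}\,\phi_K(th)\,
\frac{n^{-1}\sum_{j=1}^{n} e^{itY_j}}{\phi_\varepsilon(t)}\,dt ,
\qquad
\hat{F}_X(x) \;=\; \int_{-\infty}^{x} \hat{f}_X(u)\,du .
```

The tail behaviour of φ_ε in the denominator is exactly what drives the convergence rates discussed in the abstract.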

 

  • Ronny Luss, Tel Aviv University

Isotonic Recursive Partitioning

Abstract: Isotonic regression is a well-known nonparametric tool for fitting monotonic data and has been studied from both theoretical and practical aspects. We present a new algorithm for isotonic regression based on recursively partitioning the predictor space through solution of progressively smaller "best cut" subproblems. This creates a regularized sequence of isotonic models of increasing model complexity that converges to the global isotonic regression solution. The models along the sequence are often more accurate than the unregularized isotonic regression model because of the complexity control they offer. We offer quantification of this complexity control through estimation of degrees of freedom along the path. We also develop efficient methods for generating the global solution through this sequence of structured subproblems, as each subproblem is equivalent to a network flow problem for which efficient algorithms exist. The success of the regularized models in prediction and our algorithm's favorable computational properties are demonstrated through a series of simulated and real data experiments.
This is joint work with Saharon Rosset and Moni Shahar.
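
For orientation, a minimal sketch of plain one-dimensional isotonic regression via the pool-adjacent-violators algorithm, i.e., the unregularized fit that the recursive partitioning sequence converges to (this is not the IRP algorithm itself):

```python
import numpy as np

def pava(y, w=None):
    """Weighted least-squares isotonic (non-decreasing) fit via pool-adjacent-violators."""
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    levels, weights, counts = [], [], []            # current blocks: fitted level, weight, size
    for yi, wi in zip(y, w):
        levels.append(yi); weights.append(wi); counts.append(1)
        while len(levels) > 1 and levels[-2] > levels[-1]:   # merge blocks that violate monotonicity
            wt = weights[-2] + weights[-1]
            lv = (weights[-2] * levels[-2] + weights[-1] * levels[-1]) / wt
            ct = counts[-2] + counts[-1]
            levels[-2:], weights[-2:], counts[-2:] = [lv], [wt], [ct]
    return np.repeat(levels, counts)

print(pava([1.0, 3.0, 2.0, 4.0, 3.5, 5.0]))   # -> [1.  2.5  2.5  3.75  3.75  5.]
```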

 

  • Felix Abramovich, Tel Aviv University

Model Selection in Regression: some new (?) thoughts on the old (?) problem

Abstract: We discuss the model selection problem in linear regression, where the number of potential predictors may be much larger than the number of observations. The proposed Bayesian approach implies penalized least squares estimation with a complexity penalty associated with a prior on the model size. We present optimality properties of the resulting Bayesian model selector and compare it with several well-known existing counterparts. Computational problems and possible solutions will also be discussed.
This is joint work with Vadim Grinshtein from the Open University.
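
Schematically (generic notation, with the exact penalty left to the talk), selectors of this type solve

```latex
\hat{M} \;=\; \arg\min_{M}\;\Bigl\{\, \lVert y - X_M\hat{\beta}_M\rVert_2^2 \;+\; \mathrm{Pen}\bigl(|M|\bigr) \,\Bigr\},
```

where |M| is the model size; Pen(k) = 2σ²k corresponds to AIC/Cp, Pen(k) = σ²k log n to BIC, and penalties growing like σ²k log(p/k) are the kind that arise from placing a prior on the model size in the sparse, p much larger than n regime.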

 

  • Yair Goldberg, University of North Carolina

Q-Learning with Censored Data

Abstract: We develop methodology for a multistage decision problem with a flexible number of stages, in which the rewards are survival times subject to censoring. We present a novel Q-learning algorithm that is adjusted for censored data and allows a flexible number of stages. We provide finite sample bounds on the generalization error of the policy learned by the algorithm, and show that when the optimal Q-functions belong to the approximation spaces, the expected survival time for policies obtained by the algorithm converges to that of the optimal policy. We demonstrate the algorithm's performance using a simulation study.
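
Ignoring the censoring adjustment that is the heart of the talk, finite-horizon Q-learning rests on the standard backward recursion (generic notation, T stages, states S_t, actions A_t, rewards Y_t):

```latex
Q_T(s,a) = \mathbb{E}\bigl[\,Y_T \mid S_T=s,\,A_T=a\,\bigr], \qquad
Q_t(s,a) = \mathbb{E}\Bigl[\,Y_t + \max_{a'} Q_{t+1}(S_{t+1},a') \,\Bigm|\, S_t=s,\,A_t=a\,\Bigr],
\qquad \hat{\pi}_t(s) = \arg\max_{a} \hat{Q}_t(s,a).
```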

 

  • Gideon Dror, Academic College of Tel-Aviv-Yaffo

Analysis of Swimming Sessions Using Semi-Markov Models

Abstract: Detailed monitoring of training sessions of elite athletes is an important component of their training. In this talk I will describe an application that performs a precise segmentation and labeling of swimming sessions. This allows a comprehensive break-down of the training session, including lap times, detailed statistics of strokes, and turns. To this end we use semi-Markov models (SMM), a formalism for labeling and segmenting sequential data. Using the trained model on test swimming sessions of different swimmers provides highly accurate segmentation as well as perfect labeling of individual segments.
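
In generic notation (not necessarily the exact parameterization used here), a semi-Markov model scores a segmentation of the session into m segments with labels s_i, durations d_i and observations x_i as

```latex
p(s_{1:m}, d_{1:m}, x_{1:m}) \;=\; \prod_{i=1}^{m} p(s_i \mid s_{i-1})\; p(d_i \mid s_i)\; p(x_i \mid s_i, d_i),
```

so that, unlike a hidden Markov model, segment durations (e.g. lap or stroke durations) receive an explicit distribution.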

 

  • Nir Atias, Tel Aviv University

An Algorithmic Framework for Predicting Side-Effects of Drugs

Abstract: One of the critical stages in drug development is the identification of potential side effects for promising drug leads. Large scale clinical experiments aimed at discovering such side effects are very costly and may miss subtle or rare side effects. Previous attempts to systematically predict side effects are sparse and consider each side effect independently. In this work we report on a novel approach to predict the side effects of a given drug, taking into consideration information on other drugs and their side effects. Starting from a query drug, a combination of canonical correlation analysis and network-based diffusion is applied to predict its side effects.
We evaluate our method by measuring its performance in a cross validation setting using a comprehensive data set of 692 drugs and their known side effects derived from package inserts. For 34% of the drugs the top scoring side effect matches a known side effect of the drug.  Remarkably, even on unseen data, our method is able to infer side effects that highly match existing knowledge. In addition, we show that our method outperforms a prediction scheme that considers each side effect separately.  Our method thus represents a promising step toward shortcutting the process and reducing the cost of side effect elucidation.
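
As a generic illustration of the network-diffusion ingredient (standard propagation notation, not necessarily the exact scheme of this work): with a normalized similarity matrix W over drugs and prior scores Y for the query, one iterates

```latex
F^{(t+1)} \;=\; \alpha\, W F^{(t)} + (1-\alpha)\, Y, \qquad 0<\alpha<1,
\qquad F^{(t)} \;\longrightarrow\; (1-\alpha)\,(I-\alpha W)^{-1} Y ,
```

which smooths the prior evidence over the network before candidate side effects are ranked.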

 

  • Ori Rosen, University of Texas at El Paso

Adaptive Spectral Estimation for Nonstationary Time Series

Abstract: We propose methodology for analyzing possibly nonstationary time series by adaptively dividing the time series into an unknown but finite number of segments and estimating the corresponding local spectra by smoothing splines.
The model is formulated in a Bayesian framework, and the estimation relies on reversible jump Markov chain Monte Carlo (RJMCMC) methods. For a given segmentation of the time series, the likelihood function is approximated via a product of local Whittle likelihoods. Thus, no parametric assumption is made about the process underlying the time series. The number and lengths of the segments are assumed unknown and may change from one MCMC iteration to another. The method is illustrated with simulated and real data.
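
For a single stationary segment of length n with periodogram I(ω_k) at the Fourier frequencies ω_k = 2πk/n, the local Whittle log-likelihood used above has the standard form

```latex
\ell_W(f) \;\approx\; -\sum_{k}\Bigl\{\, \log f(\omega_k) \;+\; \frac{I(\omega_k)}{f(\omega_k)} \,\Bigr\},
```

and the product of such terms across segments replaces an exact parametric likelihood.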

 

  • Noam Shental, The Open University of Israel

Identification of rare alleles and their carriers using compressed se(que)nsing

Abstract: Identification of rare variants by resequencing is important both for detecting novel variations and for screening individuals for known disease alleles. New technologies enable low-cost resequencing of target regions, although it is still prohibitive to test more than a few individuals. We propose a novel pooling design that enables the recovery of novel or known rare alleles and their carriers in groups of individuals. The method is based on a Compressed Sensing (CS) approach, which is general, simple and efficient. CS allows the use of generic algorithmic tools for simultaneous identification of multiple variants and their carriers. We model the experimental procedure and show via computer simulations that it enables the recovery of rare alleles and their carriers in larger groups than were possible before. Our approach can also be combined with barcoding techniques to provide a feasible solution based on current resequencing costs. For example, when targeting a small enough genomic region (~100 bp) and using only ~10 sequencing lanes and ~10 distinct barcodes per lane, one recovers the identity of 4 rare allele carriers out of a population of over 4000 individuals. We demonstrate the performance of our approach over several publicly available experimental data sets, including the 1000 Genomes Pilot 3 study.
We believe our approach may significantly improve cost effectiveness in future Association Studies, and in screening large DNA cohorts for specific risk alleles.
Joint work with Amnon Amir from the Weizmann Institute of Science, and Or Zuk from the Broad Institute of MIT and Harvard
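
In generic compressed sensing notation (the pooling-design specifics are in the paper), the recovery step is a sparse inverse problem:

```latex
y \;=\; A x + e, \qquad
\hat{x} \;=\; \arg\min_{x} \lVert x\rVert_1 \quad\text{subject to}\quad \lVert A x - y\rVert_2 \le \epsilon ,
```

where x encodes the (rare, hence sparse) allele counts across individuals and A records which individuals were mixed into which pool.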

 

  • Galit Shmueli, University of Maryland

COM-Poisson Regression: A Flexible Model for Count Data

Abstract: Poisson regression is a popular tool for modeling count data and is applied in a vast array of applications from the social to the physical sciences and beyond. Real data, however, are often over- or under-dispersed and, thus, not conducive to Poisson regression. We propose a regression model based on the Conway–Maxwell-Poisson (COM-Poisson) distribution to address this problem. The COM-Poisson regression generalizes the well-known Poisson and logistic regression models, and is suitable for fitting count data with a wide range of dispersion levels. With a GLM approach that takes advantage of exponential family properties, we discuss model estimation, inference, diagnostics, and interpretation, and present a test for determining the need for a COM-Poisson regression over a standard Poisson regression. We compare the COM-Poisson to several alternatives and illustrate its advantages and usefulness on data sets with varying dispersion levels.
Joint work with Kim Sellers, Georgetown University.
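
For reference, the COM-Poisson probability mass function is

```latex
\Pr(Y=y) \;=\; \frac{\lambda^{y}}{(y!)^{\nu}\, Z(\lambda,\nu)}, \qquad
Z(\lambda,\nu) \;=\; \sum_{j=0}^{\infty} \frac{\lambda^{j}}{(j!)^{\nu}}, \qquad y = 0,1,2,\dots,
```

with ν = 1 recovering the Poisson distribution, ν < 1 giving over-dispersion and ν > 1 under-dispersion; the regression model links covariates to λ (and possibly ν) within this family.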

 

  • Youngjo Lee, Seoul University

Likelihood approach to large-scale multiple testing

Abstract: To date, only frequentist, Bayesian and empirical Bayes approaches have been studied for the large-scale inference problem of testing simultaneously hundreds or thousands of hypotheses. One consequence has been the need to consider an empirical null distribution in order to avoid the problem of rejection of all null hypotheses when sample sizes are large. We present the multiple testing problem as a multiple prediction problem of whether each null hypothesis is true or not. We introduce a hierarchical random-effects model and show how the extended likelihood is built. It is shown that the maximum likelihood prediction has a certain oracle property. The extended likelihood leads to new testing procedures, which are optimal for the usual loss function in hypothesis testing. The new tests control the local probability of false discovery for individual tests in order to maintain the false discovery rate globally. One advantage of this likelihood approach is that there is no need to consider an empirical null distribution. Several examples are considered to show how to use the likelihood method in practice.
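
As background for the "local probability of false discovery" mentioned above, recall the standard two-groups formulation (the talk replaces its empirical-Bayes fitting with an extended-likelihood one):

```latex
f(z) \;=\; \pi_0 f_0(z) + (1-\pi_0) f_1(z), \qquad
\operatorname{fdr}(z) \;=\; \Pr(\text{null} \mid z) \;=\; \frac{\pi_0 f_0(z)}{f(z)} ,
```

so keeping fdr(z) small for each rejected test is what maintains the false discovery rate globally.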

 

  • Alfred Inselberg, Tel Aviv Univ., Univ. of Singapore and San Diego SuperComputing Center

Visualization & Data Mining for High Dimensional Data

Abstract: A dataset with M items has 2^M subsets, any one of which may be the one satisfying our objectives. With a good data display and interactivity, our fantastic pattern-recognition ability can not only cut great swaths through this combinatorial explosion, but also extract insights from the visual patterns. These are the core reasons for data visualization. With parallel coordinates, the search for relations in multivariate datasets is transformed into a 2-D pattern recognition problem. Guidelines and strategies for knowledge discovery are illustrated on several real datasets (financial, process control, credit score, intrusion detection, etc.), one with hundreds of variables. A geometric classification algorithm is presented and applied to complex datasets. It has low computational complexity, providing the classification rule explicitly and visually. The minimal set of variables required to state the rule (features) is found and ordered by their predictive value. Multivariate relations can be modeled as hypersurfaces and used for decision support. A model of a (real) country's economy reveals sensitivities, impact of constraints, trade-offs and economic sectors unknowingly competing for the same resources. An overview of the methodology provides foundational understanding: learning the patterns corresponding to various multivariate relations. These patterns are robust in the presence of errors, which is good news for applications.
The parallel coordinates methodology has been applied to collision avoidance and conflict resolution algorithms for air traffic control (3 USA patents), computer vision (1 USA patent), data mining (1 USA patent), optimization, decision support and elsewhere.
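
A minimal sketch of a static parallel-coordinates display (using pandas and the Iris data purely as stand-ins; the interactive tools and large datasets of the talk are of course far richer):

```python
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)                     # small multivariate dataset, for illustration only
df = iris.frame.rename(columns={"target": "species"})
df["species"] = df["species"].map(dict(enumerate(iris.target_names)))

# each observation becomes a polyline across the parallel axes;
# relations between variables show up as patterns between adjacent axes
parallel_coordinates(df, class_column="species", alpha=0.4)
plt.tight_layout()
plt.show()
```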

 

  • Shulamith Gross, Baruch College

Evaluating Probability Forecasts

Abstract: Probability forecasts of events are routinely used in climate predictions, in forecasting default probabilities on bank loans, or in estimating the probability of a patient's positive response to treatment. Scoring rules have long been used to assess the efficacy of the forecast probabilities after observing the occurrence, or non-occurrence, of the predicted events. We develop a statistical theory for scoring rules and propose an alternative approach to the evaluation of probability forecasts. This approach uses loss functions relating the predicted to the actual probabilities of the events, and applies martingale theory to exploit the temporal structure between the forecast and the subsequent occurrence or non-occurrence of the event.
In epidemiology, a variety of indices, such as PPV (positive predictive value), NPV, the predictiveness curve, the area under the ROC curve and the R2 difference between models, are routinely used without adequate inferential tools. We provide such tools.
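
A standard example of a scoring rule for binary events, given here only as background, is the Brier score

```latex
\mathrm{BS} \;=\; \frac{1}{n}\sum_{i=1}^{n} (p_i - y_i)^2, \qquad y_i \in \{0,1\},
```

where p_i is the forecast probability and y_i indicates whether the event occurred; the approach of the talk instead relates forecast to actual event probabilities through loss functions and exploits the temporal structure via martingale arguments.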

  • Dror Kenett, Tel Aviv University

Hidden features of financial markets

Abstract: The ongoing economic crisis has re-emphasized the need to study and understand the dynamics of the stock market. Furthermore, the fact that the crisis itself, with its magnitude and severity, was poorly predicted by existing economic theories highlights the need to rethink certain key economic concepts. Such attempts are carried out by econophysicists, who have been performing rigorous empirical and analytical research on financial markets for the past two decades. One of the focuses of these investigations has been the correlations between stock price movements, and the extraction of meaningful information from these correlations regarding the behavior of the market. This talk will introduce new methods, based on the concept of stock partial correlations, to uncover important characteristics of financial markets. After an overview of the relevant concepts, two new methods will be presented: the stock dependency networks, and the Index Cohesive Force (ICF). These methods have been used to study the New York stock market (NYSE) and the Tel Aviv stock market (TASE). The result of this study was a successful uncovering of hidden features of these markets, features that were previously unobservable in the standard stock correlations. The main findings of this study will be presented, followed by a discussion of the possible applications of these methodologies in economic practice.
Joint work with Eshel Ben-Jacob
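
The basic building block of these methods is the partial correlation of stocks i and j given a mediating variable m (for instance the index), which removes the co-movement explained by m:

```latex
\rho(i,j \mid m) \;=\;
\frac{\rho(i,j) - \rho(i,m)\,\rho(j,m)}
     {\sqrt{\bigl(1-\rho(i,m)^{2}\bigr)\bigl(1-\rho(j,m)^{2}\bigr)}} .
```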

 

  • Gadi Fibich, Tel Aviv University

Aggregate Diffusion Dynamics in Agent-Based Models with a Spatial Structure

Abstract: The diffusion or adoption of new products (such as fax machines, Skype, Facebook, the iPad, etc.) is one of the key problems in marketing research. In recent years, this problem has often been studied numerically, using agent-based models (ABMs). In this talk I will focus on the analysis of the aggregate diffusion dynamics in ABMs with a spatial structure. In one-dimensional ABMs, the aggregate diffusion dynamics can be explicitly calculated, without using the mean-field approximation. In multidimensional ABMs, we introduce a clusters-dynamics approach, and use it to derive an analytic approximation of the aggregate diffusion dynamics. The clusters-dynamics approximation shows that the aggregate diffusion dynamics does not depend on the average distance between individuals, but rather on the expansion rate of clusters of adopters. Therefore, the grid dimension has a large effect on the aggregate adoption dynamics, but a small-world structure and heterogeneity among individuals have only a minor effect. Our results suggest that the one-dimensional model and the fully-connected Bass model provide a lower bound and an upper bound, respectively, for the aggregate diffusion dynamics in agent-based models with "any" spatial structure.
This is joint work with Ro'i Gibori and Eitan Muller
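
For reference, the fully-connected benchmark mentioned above is the classical Bass model, in which the adopted fraction F(t) evolves as

```latex
\frac{dF(t)}{dt} \;=\; \bigl(p + q\,F(t)\bigr)\,\bigl(1 - F(t)\bigr), \qquad F(0) = 0,
```

with p the external (advertising) influence and q the internal (word-of-mouth) influence; the agent-based models of the talk replace the fully-connected interaction with a spatial structure.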

 

  • Matan Gavish, Stanford University

Harmonic Analysis of Transposable Arrays

Abstract: "Transposable Arrays" (TA) are data matrices in which both rows and columns should be treated as features and have nontrivial correlation structure. This data type is ubiquitous in modern applications (e.g. gene microarrays, recommender systems, text-document matrices), where analysis often aims at common tasks and encounters common difficulties. Tasks of interest include finding interpretable structure for the row set and the column set, missing value imputation, quantization, denoising and anomaly detection. Still, most known treatments of TA data are ad hoc.
I will describe a nonparametric model allowing systematic treatment of TAs (and, more generally, higher rank data tensors) and algorithms for performing these tasks. Our treatment is motivated by a theorem of J. O. Strömberg, whereby the tensor product of Haar bases is extremely efficient in describing functions of bounded mixed derivatives in high dimensions. Fitting the model means simultaneously organizing the rows and columns of the TA to yield small "mixed variation". Once fitted, the above tasks are performed in the coefficient domain of the tensor product of Haar-like bases.
Any interpretable TA analysis produces complicated output, which is hard to visualize on paper. As a computational interlude, we'll play with an interactive user interface developed to explore the output of TA analysis algorithms.

Joint work with Ronald Coifman (Yale) and Boaz Nadler (Weizmann).
References: Haar-like bases (ICML 2010); Harmonic analysis of data bases (book chapter)

 

  • Marina Bogomolov, Tel Aviv University

Hierarchical Testing of Subsets of Hypotheses

Abstract: As the size of the large testing problems encountered in practice keeps increasing, more of these problems have further structure, where the set of hypotheses can be partitioned into subsets and a discovery of some signal in a subset is of interest on top of the discovery of a signal in each of the many hypotheses on its own. Furthermore, the true state of the tested signals tends to be more similar within these subsets than across the subsets. Examples are regions in the brain in functional MRI research, sets of genes in genomic research, or geographical areas in disease outbreak monitoring. The challenges in the analysis of such multiple testing problems will be discussed, and previous efforts to address them will be reviewed. We then present a few new methods to control various aspects of the False Discovery Rate, and discuss their benefits and limitations.
Joint work with Yoav Benjamini. This talk is part of my PhD thesis defense.
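
As background, the non-hierarchical Benjamini-Hochberg procedure that these methods build on works on the ordered p-values p_(1) ≤ … ≤ p_(m):

```latex
k \;=\; \max\Bigl\{\, i : p_{(i)} \le \tfrac{i}{m}\, q \,\Bigr\},
\qquad \text{reject } H_{(1)},\dots,H_{(k)} ,
```

which controls the false discovery rate at level q under independence; the hierarchical setting additionally requires controlling analogous error rates over subsets and within selected subsets.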

 

  • Theofanis Sapatinas, University of Cyprus

Functional Time-Series Prediction: Application to the Prediction of the Daily Power Demand Shapes for the Electricity Authority of Cyprus

Abstract: We consider the prediction problem of a time series on a whole time interval in terms of its past.
The approach that we adopt is based on functional kernel non-parametric regression estimation techniques, where observations are discrete recordings of segments of an underlying stochastic process, considered as curves. We estimate conditional expectations by using appropriate wavelet decompositions of the segmented sample paths. We also propose a method to select the bandwidth for functional time series prediction. The idea underlying this method is to calculate the empirical risk of prediction using past segments of the observed series and to select as the bandwidth for prediction the one which minimizes this risk. We illustrate the usefulness of the proposed functional wavelet-kernel prediction methodology on the prediction of the daily power demand shapes for the Electricity Authority of Cyprus. Some recent advances will also be discussed.
(This is joint work with Anestis Antoniadis (Université Joseph Fourier, France) and Efstathios Paparoditis (University of Cyprus, Cyprus).)
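
In generic notation, the functional kernel (Nadaraya-Watson type) predictor of the next segment is a weighted average of the segments that followed past segments similar to the current one:

```latex
\hat{X}_{n+1}(t) \;=\;
\frac{\sum_{i=1}^{n-1} K\!\bigl(d(X_n, X_i)/h\bigr)\, X_{i+1}(t)}
     {\sum_{i=1}^{n-1} K\!\bigl(d(X_n, X_i)/h\bigr)} ,
```

where d is a distance between curves (computed here on wavelet coefficients of the segments) and the bandwidth h is the one minimizing the empirical prediction risk described above.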

 

  • Ehud Aharoni, IBM Research

Generalized Alpha Investing: Definitions, Optimality Results, and Application to Public Databases

Abstract: The increasing prevalence and utility of large, public databases necessitates the development of appropriate methods for controlling false discovery. Motivated by this problem, we discuss the generic problem of testing a possibly infinite stream of null hypotheses. In this context, Foster and Stine (2007) proposed a false discovery measure they called mFDR, and an approach for controlling it named alpha investing. We generalize alpha investing and use our generalization to derive optimal allocation rules for the case of simple hypotheses. We demonstrate empirically that this approach is more powerful than alpha investing while controlling mFDR.
We then present the concept of quality preserving databases (QPD), originally introduced in Aharoni et al. (2010), which formalizes efficient public database management to simultaneously save costs and control false discovery. We show how one variant of generalized alpha investing can be used to control mFDR in a QPD and lead to a significant reduction in costs compared to naïve approaches for controlling the family-wise error rate implemented in Aharoni et al. (2010).
Joint work with Saharon Rosset of Tel Aviv University.
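
For reference, the error measure being controlled is the marginal false discovery rate of Foster and Stine, defined (up to notation) for a tuning constant η > 0 as

```latex
\mathrm{mFDR}_{\eta} \;=\; \frac{\mathbb{E}(V)}{\mathbb{E}(R) + \eta} ,
```

where V and R count the false and total rejections over the stream of tests; alpha-investing rules keep this below a target level by spending "alpha-wealth" on each test and earning wealth back when a rejection is made.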

 

  • Ofer Harel, University of Connecticut

Strategies for Data Analysis with Two Types of Missing Values

Abstract: Analysis of incomplete data requires assumptions about the mechanism that divides the complete sample into its observed and missing parts. When two different types of missing values occur in the same data set, one should also consider the process that partitions the missing data into the two groups. In this presentation, I extend Rubin’s (1976) concept of missing at random to two sets of missing values, describing conditions under which the missing-data processes may be completely or partially ignored. Multiple imputation (Rubin, 1987) may be carried out in two stages, allowing us to measure rates of missing information due to each type of missing value. For illustration, I apply these concepts and techniques to several data examples. (This is joint work with Joe Schafer).
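
For reference, when m completed-data analyses yield estimates Q̂⁽ˡ⁾ with within-imputation variances U⁽ˡ⁾, Rubin's (1987) combining rules give

```latex
\bar{Q} = \frac{1}{m}\sum_{l=1}^{m}\hat{Q}^{(l)}, \qquad
\bar{U} = \frac{1}{m}\sum_{l=1}^{m} U^{(l)}, \qquad
B = \frac{1}{m-1}\sum_{l=1}^{m}\bigl(\hat{Q}^{(l)}-\bar{Q}\bigr)^{2}, \qquad
T = \bar{U} + \Bigl(1+\tfrac{1}{m}\Bigr)B ,
```

and the relative size of the between-imputation component B within T is what quantifies the rate of missing information, which the two-stage imputation scheme splits between the two types of missing values.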

 

  • Adi Wyner, The Wharton School, University of Pennsylvania

Uncertainty in 1000 Year Old Reconstructions of the Earth Temperatures: Criticisms and Rejoinder

Abstract: In my last visit, I discussed a working paper on the accuracy of naturally occurring proxies of temperatures. I concluded that a proper statistical analysis of the data reveals that the uncertainty in such reconstructions is too large to answer the "burning" question of our time: are contemporary temperatures greater than at any time in the last 1000 years? My paper, co-authored with my former student Blake McShane, has since been published in the Annals of Applied Statistics, setting the record for the most downloaded paper in the journal's (short) history. In my talk today, I review the findings of our paper, and then I will discuss the 13 discussion papers submitted in response and our rejoinder. The discussion will focus on the difficulties of time series regression, PC regression, cross-validation and the Lasso.