Department of Statistics & Operations Research

Statistics Seminars

2005/2006

Note: the program is not final and is subject to possible changes

Second Term

7, March	Hovav Dror, Tel Aviv University
	Robust Experimental Design for Multivariate Generalized Linear Models
21, March	Isaac Meilijson, Tel Aviv University
	Pricing financial instruments with stochastic volatility - a Bayesian approach.
4, April	Robert J. Adler, Technion
	Rice and Geometry (via the Brain)
25, April	Ruth Heller, Tel Aviv University
	False Discovery Rates for Spatial Signals
16, May	Vered Madar, Tel Aviv University - in defense of her PhD thesis.
	Simultaneous Confidence Intervals for Multiple Parameters with More Power to Determine the Sign
23, May	Itzhak Gilboa, Tel Aviv University
	Empirical Similarity and Objective Probabilities
30, May	Daniel Yekutieli, Tel Aviv University
	Hierarchical FDR controlling procedure
6, June	Eytan Domany, Weizmann Institute of Science
	Predicting outcome in breast cancer: the search for a robust list of predictive genes.
13, June	Michael Elad, Technion
	Sparse and Redundant Signal Representation, and its Role in Image Processing
20, June	Hovav Dror, Tel Aviv University
	Sequential experimental design for multivariate GLM
27, June	Assaf Oron, University of Washington
	A small correction to Isotonic Regression

First Term

27, September	Anat Reiner, Tel Aviv University
	Complexity of Data and Analysis Related to FDR Control in Microarray Experiments
29, November	Dorron Levy, Comverse
	Software failure early alert
11, December*	Dror Berel, Tel Aviv University
	USING CART FOR DESCRIBING THE tRNA OPERATIVE CODE OF …
13, December*	Albert Vexler, National Institute of Child Health and Human Development, NIH
	Nonparametric deconvolution applied to a tradition/nontraditional pooling design
3, January*	David Steinebrg
	Sample Size For Positive and Negative Predictive Value in Diagnostic Research
10, January	Saharon Rosset
	The Genographic Project: Background and Some Statistical Challenges
16, January*	Galit Shmueli, University of Maryland
	A Functional Data Analytic Approach To Empirical eCommerce Research

Seminars are held on Tuesdays, 10.30 am, Schreiber Building, 309 (see the TAU map ). is served before.

The seminar organizer is Daniel Yekutieli.

To join the seminar mailing list or any other inquiries - please call (03)-6409612 or email yekutiel@post.tau.ac.il

Details of previous seminars:

ABSTRACTS

Anat Reiner, Tel Aviv University

Complexity of Data and Analysis Related to FDR Control in Microarray Experiments

Statistical issues involved in the analysis of gene expression data are encountered in many other types of data, and thus their exploration may offer insights for a wide range of applications. One of the major concerns is multiple testing, since the inference of interest is for each gene separately, given a set of thousands of genes examined in one experiment. The false discovery rate (FDR) is proposed to control the type I error, but its implementation is challenged by several statistical aspects related both to the data itself and the analytical process used.

One aspect arises since microarray data is typically subjected to technological and biological factors that are potential causes of dependencies between the multiple test statistics. I will discuss the effect of dependence on FDR behavior and examine more closely the frequently encountered case of two-sided tests. A few scenarios of dependency structure will be represented along with the respective least favorable cases.

In addition, the control of the FDR is complicated when conducting an analysis containing several research questions. Such analysis may include pairwise comparisons and interaction contrasts on a gene level. Furthermore, correlation analysis may be used when incorporating phenotypic measures with the genetic expression data, for the purpose of exploring the relations between them. I will discuss a strategy for FDR control in such cases, based on arranging the hypotheses in hierarchical manner, and propose a solution for additional statistical problems that may arise. A functional genomic research problem will be presented.

Dorron Levy

Software failure early alert

Software failures pose significant operational and financial burden on Telecom companies. Expenses on support tend to be high, with companies having hundreds of support personnel on board, tending urgent system problems. Maintenance is complex following large system variability, complexity, continuous changes and extremely low tolerance for failure.

An effort to achieve some early alert for failure is described. Data from field-installed systems is analyzed to find some normative behavior metrics and identify early shifts from normal behavior. Some initial results are presented, with emphasis on simplicity (for field worthiness) and early detection (for financial effectiveness). The work is almost purely empirical, with initial field implementation shows promise.

Dror Berel, Tel Aviv University

USING A CLSSIFICATION TREE MODEL FOR DESCRIBING THE tRNA OPERATIVE CODE OF AMINO ACIDS AMONG THE BIOLOGY DOMAINS ARCHAEL AND EUBACTERIAL

The aim of this research is to describe and analyze the sequence data of the upper section of the tRNA molecule, known as the Acceptor Stem. We used data from creatures that belong to the Archaea and Eubacteria biological domains. We analyzed the relation between these sequences and the Amino Acids determined by the tRNA molecule by using the Classification Tree model of CART.

The model performed well, explaining 96% and 92% of the observations which belong to the Eubacteria, and Archaea respectively. The performance of the model was tested by using both a Cross Validation technique, and simulation. It was found that the results of the full model are the best possible for
future data. This implies that there is no over fitting of the model to the data. Also tested the connection to the Anti Codon mechanism, which is related to the Amino Acid determination. In addition, we also developed a novel graphical presentation of the Acceptor Stem sequences, which includes a measure of the data heterogeneity.

Albert Vexler, Biometry and Mathematical Statistics Branch, National Institute of Child Health and Human Development, NIH

Nonparametric deconvolution applied to a tradition/nontraditional pooling design

We present developed methodologies for distribution-free estimation of a density function based upon observed sums or pooled data. The proposed
methods employ a fourier approach to nonparametric deconvolution of a density estimate. It is shown that reconstruction of the density function of a random variable by the density of its observed sums requires strong conditions on distribution functions. To relax these assumptions, a nontraditional pooling design to generate partial sums is proposed. The methods are exemplified using data from a study of biomarkers associated with coronary heart disease.

David Steinebrg, Tel Aviv University

Sample Size For Positive and Negative Predictive Value in Diagnostic Research

Important properties of diagnostic methods are their sensitivity, specificity, and positive and negative predictive values (PPV and NPV). These methods are typically assessed via case-control samples, which include one cohort of cases, known to have the disease, and a second control cohort of disease-free subjects. Such studies give direct estimates of sensitivity and specificity, but only indirect estimates of PPV and NPV, which also depend on the disease prevalence in the tested population. We develop formulas for optimal allocation of the sample between the case and control cohorts and for computing sample size when the goal of the study is to prove that the test procedure exceeds pre-stated bounds for PPV and/or NPV.

This is joint work with Rick Chappell and Jason Fine Department of Biostatistics University of Wisconsin

Saharon Rosset, Watson, IBM

The Genographic Project: Background and Some Statistical Challenges

The Genographic project is a research partnership of National Geographic and IBM, intended to investigate the migration history of humans across the globe through the "history book" hidden in the DNA of each one of us. This project is unprecedented in scope: it targets genetic testing and analysis of 100,000 members of indigenous populations around the world and at least as many members of the general public who choose to purchase participation kits. For more details please go to: www.nationalgeographic.com/genographic

In this talk, I will first give a brief overview of this project and its scope. I will then survey some statistical challenges that arise from it, and concentrate on one or both of the following topics, as time permits: 1. Maximum likelihood estimation of mutation probabilities and coalescent tree sizes in mtDNA.
The control region of the mitochondrial DNA mutates much more quickly than most DNA in our body. Consequently it contains useful information for phylogenetic analysis. However the mutation rate across this region is known to vary greatly. I will describe some data from the Genographic project, including observed mtDNA control region mutations and haplogroup classification of several thousand individuals. The poisson likelihood associated with the number of mutations at each locus in each haplogroup reduces to a binomial likelihood given the observed data. This maximum likelihood problem can be solved as a binomial GLM with a complementary log-log link function. The resulting estimates are useful for improving mtDNA classification models and phylogenetic analysis, and also give interesting insights into the population history of different haplogroups. 2. Power analysis of tests for discovery of admixture between modern humans and Neanderthals. I will discuss the feasibility of discovering inter-breeding between modern humans and Neanderthals in Europe through sequencing of nuclear DNA of modern Europeans. I will show how the length of regions sequenced and number of individuals sampled affect the probability of rejecting the ``null'' of no inter-breeding under various alternatives.

Galit Shmueli, Department of Decision & Information Technologies, Robert H Smith School of Business, University of Maryland

A Functional Data Analytic Approach To Empirical eCommerce Research

Electronic commerce has received an extreme surge of popularity in recent years. While the field of economics has created many theories for understanding economic behavior at the individual and market level, many of these theories were developed before the emergence of the world wide web, and do not carry over to the new online environment. Consider online auctions: While auction theory has been studied for a long time from a game-theory perspective, the electronic implementation of the auction mechanism poses new and challenging research questions. Luckily, empirical research of eCommerce is blessed by an ever increasing amount of readily-available, high-quality data, and is therefore thriving. However, analysis methods used in this research community for extracting information from data have not kept up with the vast amount and the complex structure of eCommerce data. The currently used statistical methods have typically been limited to “off the shelf" methods such as regression-type modeling.

In this talk, we present a novel statistical approach and set of tools called Functional Data Analysis (FDA) and discuss its usefulness for empirical research of eCommerce. We show how this approach allows the researcher to study, for the first time, dynamic concepts such as process-evolution and, associated with that, process-dynamics in the eCommerce context. We illustrate these ideas by focusing on online auctions, showing that understanding priceevolution

and its dynamics can be helpful in characterizing, differentiating, and even forecasting an auction.

There are multiple statistical challenges in applying FDA in the eCommerce context. Many of these arise from the non-standard data structure that arises in typical eCommerce applications. We describe such challenges and some solutions, but also unanswered questions. Finally, we present interesting results obtained from analyzing online auction data from eBay.com using a functional approach. We show how methods such as curve clustering and functional regression models shed new light on the dynamics that take place in online auctions.

Joint with Wolfgang Jank. A series of relevant papers is available at http://www.smith.umd.edu/ceme/statistics/papers.html

Hovav Dror, Tel Aviv University

Robust Experimental Design for Multivariate Generalized Linear Models

Optimal experimental designs for generalized linear models (GLM) depend on the unknown coefficients, and two experiments having the same model but different coefficient values will typically have different optimal designs. Therefore, unlike experimental design for linear models, the prior knowledge and estimates of the outcome of an experiment must be taken into account.

Prior work on local optimal experimental designs for GLM is mainly focused on a simple linear effect and one design variable. Generalizing these local results to take account of uncertainty is even more difficult. As a result, literature concerning multivariate robust designs for GLM is scarce.

We describe a fast and simple method for finding local D-optimal designs for high-order multivariate models. With this capability in hand we suggest a simple heuristic capable of finding designs that are robust to most parameters an experimenter might consider, including uncertainty in the coefficient values, in the linear predictor equation and in the link function. The procedure is based on K-means clustering of local optimal designs. Clustering, with its simplicity and minimal computation needs, is demonstrated to outperform more complex and sophisticated methods.

Isaac Meilijson, Tel Aviv University

Pricing financial instruments with stochastic volatility - a Bayesian approach.

This talk will review Generalized Inverse Gaussian and Hyperbolic distributions, presenting a family richer than Gamma of conjugate priors
for (one over the) variance of a normal variable. This family yields light-tail predictive distributions, unlike the t distribution induced by Gamma priors.

The variance will be permitted to change along time by the introduction of two conflicting experts, one Bayesian (unknown variance doesn't change) and one Empirical Bayesian (variance is sampled from the prior towards each observation). The logarithmic expert opinion pooling
of Hammond-Genest-Zidek will be applied.

Robert J. Adler, Technion - Israel Institute of Technology

Rice and Geometry (via the Brain)

The classic Rice formula for the expected number of upcrossings of a smooth stationary Gaussian process on the real line is one of the oldest and most important results in the theory of smooth stochastic processes. It has been generalised over the years to non-stationary and non-Gaussian processes, both over the reals and over more complex parameter spaces, and to vector valued rather than real valued processes.

Over the last few years, jointly with Jonathan Taylor, we have discovered a kind of Rice "super formula" which incorporates effectively all the (constant variance) special cases known until now. More interesting, however, is that it shows that these formulae all have a deep geometric interpretation, giving a version of the Kinematic Fundamental Formula of Integral and Differential Geometry for Gauss space.

To keep things reasonably concrete, and to put them into a statistical framework, I shall take as a motivating application of these results some hypothesis testing issues in brain imaging, and relate much of what I have to say to them.

· Ruth Heller, Tel-Aviv University

False Discovery Rates for Spatial Signals

We suggest a new approach to multiple testing for signal presence in spatial data that tests cluster units rather than individual locations. This approach leads to increased signal to noise ratio within the unit tested as well to a reduced number of hypotheses tests compared. We introduce a powerful adaptive procedure to control the size weighted FDR on clusters, i.e. the size of erroneously rejected clusters out of the total size of clusters rejected. Once the cluster discoveries have been made, we suggest ’cleaning’ loca tions in which the signal is absent by a hierarchical testing procedure that controls the expected proportion of locations in which false rejections occur. We discuss an application to functional MRI which motivated this research and demonstrate the advantages of the proposed methodology on an example.

· Vered Madar, Tel-Aviv University

Simultaneous Confidence Intervals for Multiple Parameters with More Power to Determine the Sign

We offer new simultaneous two-sided confidence intervals for the estimation of the expectations of k normal random variables. We invert a family of rectangular acceptance regions that have minimal incursion to the other quadrants into a confidence set, and project its convex-hull onto the axes to get simultaneous confidence intervals. Besides offering simultaneous coverage, the new intervals also provide slightly stronger sign classification than offered by the simultaneous conventional two-sided intervals.

Finally, we shall illustrate the new intervals on a clinical trial that has several primary endpoints.

· Itzhak Gilboa, Tel Aviv University

Empirical Similarity and Objective Probabilities

We suggest to extend the definition of probability as empirical frequency by the introduction of similarity-weighted frequencies. This formula is axiomatized as follows. A decision maker is asked to express her beliefs by assigning probabilities to certain possible states. We focus on the relationship between her database and her beliefs. We show that, if beliefs given a union of two databases are a convex combination of beliefs given each of the databases, beliefs are a similarity-weighted average of the beliefs induced by each past case. In an attempt to retain objectivity, we propose that the similarity function be estimated from the data.

Joint work with A. Billot, G. Gayer, O. Lieberman, A. Postlewaite, D. Samet, D. Schmeidler

Daniel Yekutieli, Tel Aviv University

Hierarchical FDR controlling procedures

I will introduce FDR trees - a new class of hierarchical FDR controlling procedures. In this new testing approach, rather than test all the hypotheses simultaneously, the tested hypotheses are arranged in a tree of disjoint subfamilies and the tree of subfamilies is tested hierarchically. This is a very flexible and powerful testing framework suited to perform complex statistical analysis. I will present the theoretical properties of FDR trees and demonstrate the use of FDR trees with several examples.

· Eytan Domany, Department of Physics of Complex Systems, Weizmann Institute of Science

Predicting outcome in breast cancer: the search for a robust list of predictive genes.

Predicting at the time of discovery the prognosis and metastatic potential of breast cancer is a major challenge in current clinical research. Numerous recent studies searched for gene expression signatures that outperform traditionally used clinical parameters in outcome prediction. Finding such a signature will free many patients of the suffering and toxicity associated with adjuvant chemotherapy given to them under current protocols, even though they do not need such treatment. A reliable set of predictive genes will contribute also to a better understanding of the biological mechanism of metastasis. Several groups have published ranked lists of predictive genes and reported good predictive performance based on them. However, the gene lists obtained for the same clinical types of patients by different groups differed widely and had only very few genes in common, raising doubts about the reliability and robustness of the reported predictive gene lists. The main source of the problem was shown to be [1] the highly fluctuative nature of the correlation of single genes' expression with outcome, on which the ranking was based. The underlying biological reason is the heterogeneity of the disease; to stabilize the genes' ranked list a much larger number of samples (patients) are needed than what has been used so far. We introduced [2] a novel mathematical method, PAC ranking, for evaluating the robustness of such rank-based lists. We calculated for several published datasets the number of samples that are needed in order to achieve any desired level of reproducibility. For example, in order to achieve a typical overlap of 50% between two predictive lists of genes, breast cancer studies would need the expression profiles of several thousand early-discovery patients.

[1] L. Ein-Dor, I. Kela, G. Getz, D. Givol and E. Domany, Bioinformatics 21, 171 (2005)

[2] Liat Ein-Dor, Or Zuk and Eytan Domany, PNAS 103, 5923 (2006)

· Michael Elad, Technion

Sparse and Redundant Signal Representation, and its Role in Image Processing

In signal and image processing, we often use transforms in order to simplify operations or to enable better treatment to the given data. A recent trend in these fields is the use of overcomplete linear transforms that lead to a sparse description of signals. This new breed of methods is more difficult to use, often requiring more computations. Still, they are much more effective in applications such as signal compression and inverse problems. In fact, much of the success attributed to the wavelet transform in recent years, is directly related to the above-mentioned trend. In this talk I plan to present a survey of this recent path of research, and its main results. I will discuss both the theoretic and the applicative sides to this field. No previous knowledge is assumed (... just common sense, and little bit of linear algebra).

Hovav Dror, Tel Aviv University

SEQUENTIAL EXPERIMENTAL DESIGN FOR MULTIVARIATE GENERALIZED LINEAR MODELS

A one-stage experimental plan requires the researcher to fix in advance the factor settings at which data will be observed. Sequential experimental design allows updating and improving the experimental plan following the data already observed. We consider the problem of choosing sequential plans when the response is modeled by a GLM. A common setting is sensitivity testing" (also known as "dose-response" or "up and down"). In a typical experiment, a dose level is chosen at each step and success or failure of a treatment is recorded.

We suggest a new procedure for the sequential choice of observations and show it is superior in efficiency to commonly used procedures, such as the "Bruceton" test (Dixon and Mood, 1948), the Langlie (1965) test or Neyer's (1994) procedure. The suggested procedure is based on a D-optimality criterion, and on a Bayesian approximation that exploits a discretization of the parameter space.

Perhaps more important than the improved efficiency, the suggested algorithm can be used in many situations where the former algorithms do not apply. These include extension from the fully sequential design to any partition of the experiment to blocks of observations, from a binary response to any GLM (including Poisson count models), and from the univariate case to the treatment of multiple predictors.

We present a comparison of results obtained with the new algorithm versus the "Bruceton" method on an actual sensitivity test conducted recently at an industrial plant. We also provide comprehensive comparison of techniques via a Monte-Carlo simulation.

· Assaf Oron, Department of Statistics, University of Washington, Seattle

Small correction to Isotonic Regression

Many scientific and engineering experiments focus on a monotone function y=F(x), using a sample of observations at different design points, {(x₁,y₁)… (x_n,y_n)}. Some examples are sensory stimulus-response experiments, drug dose-response studies or survival function studies. Isotonic Regression (IR) is the standard non-parametric solution for estimation of F in case some observations violate monotonicity. IR locally ‘flattens’ monotonicity-violating sequences, and so as the extent of such sequences increases, IR’s output may resemble a staircase function. This is undesirable, especially when F is known to be smooth and strictly monotone. Current solutions to this problem use ad-hoc smoothing algorithms. User intervention is required to choose the algorithm and set the value of smoothing parameters.

We propose a statistical solution that does not require such intervention, and therefore can directly replace IR in many applications. This solution – centered isotonic regression (CIR; temporary name) – is based on a conditional expectation analysis of the IR estimate, given the location of monotonicity-violating sequences. It can be shown that CIR has lower MSE than IR, under conditions which are met in typical applications. The advantages of CIR are demonstrated using re-analysis of published data.