Note: the program is not final and is subject to possible changes.

7 March | Hovav Dror | Robust Experimental Design for Multivariate Generalized Linear Models
21 March | Isaac Meilijson, Tel Aviv University | Pricing financial instruments with stochastic volatility - a Bayesian approach
4 April | Robert J. Adler, Technion | Rice and Geometry (via the Brain)
25 April | Ruth Heller | False Discovery Rates for Spatial Signals
16 May | Vered Madar | Simultaneous Confidence Intervals for Multiple Parameters with More Power to Determine the Sign
23 May | Itzhak Gilboa |
30 May | Daniel Yekutieli, Tel Aviv University |
6 June | Eytan Domany, Weizmann Institute of Science | Predicting outcome in breast cancer: the search for a robust list of predictive genes
13 June | Michael Elad, Technion | Sparse and Redundant Signal Representation, and its Role in Image Processing
20 June | Hovav Dror | Sequential Experimental Design for Multivariate Generalized Linear Models
27 June | Assaf Oron, University of Washington | Small correction to Isotonic Regression
27 September | Anat Reiner | Complexity of Data and Analysis Related to FDR Control in Microarray Experiments
29 November | Dorron Levy, Comverse | Software failure early alert
11 December* | Dror Berel | Using a Classification Tree Model for Describing the tRNA Operative Code of Amino Acids Among the Biology Domains Archaeal and Eubacterial
13 December* | Albert Vexler, National Institute of Child Health and Human Development, NIH | Nonparametric deconvolution applied to a traditional/nontraditional pooling design
3 January* | David Steinberg | Sample Size for Positive and Negative Predictive Value in Diagnostic Research
10 January | Saharon Rosset | The Genographic Project: Background and Some Statistical Challenges
16 January* | Galit Shmueli | A Functional Data Analytic Approach To Empirical eCommerce Research
Seminars are held on Tuesdays, 10:30 am, Schreiber Building, room 309 (see the TAU map). Refreshments are served before the seminar.
The seminar organizer is Daniel Yekutieli.
To join the seminar mailing list, or for any other inquiries, please call (03)-6409612 or email yekutiel@post.tau.ac.il
Details of previous seminars:
Complexity of Data and Analysis Related to FDR Control in Microarray Experiments

Statistical issues involved in the analysis of gene expression data are encountered in many other types of data, and thus their exploration may offer insights for a wide range of applications. One of the major concerns is multiple testing, since the inference of interest is for each gene separately, given a set of thousands of genes examined in one experiment. The false discovery rate (FDR) has been proposed to control the type I error, but its implementation is challenged by several statistical aspects related both to the data itself and to the analytical process used.

One aspect arises because microarray data is typically subject to technological and biological factors that are potential causes of dependencies between the multiple test statistics. I will discuss the effect of dependence on FDR behavior and examine more closely the frequently encountered case of two-sided tests. A few scenarios of dependency structure will be presented, along with the respective least favorable cases.

In addition, control of the FDR is complicated when an analysis contains several research questions. Such an analysis may include pairwise comparisons and interaction contrasts at the gene level. Furthermore, correlation analysis may be used when incorporating phenotypic measures with the gene expression data, for the purpose of exploring the relations between them. I will discuss a strategy for FDR control in such cases, based on arranging the hypotheses in a hierarchical manner, and propose a solution for additional statistical problems that may arise. A functional genomic research problem will be presented.
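For readers unfamiliar with the basic procedure: FDR control is most commonly implemented with the Benjamini-Hochberg step-up rule, which the hierarchical strategies discussed in the talk build upon. A minimal NumPy sketch (the p-values below are illustrative):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a boolean mask of rejected hypotheses with the FDR controlled
    at level q (under independence or positive dependence).
    """
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Compare sorted p-values against the BH thresholds i*q/m.
    below = p[order] <= q * np.arange(1, m + 1) / m
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max() + 1  # largest i satisfying the bound
        rejected[order[:k]] = True
    return rejected

# p-values from 15 hypothetical gene-level tests
pv = [0.0001, 0.0004, 0.0019, 0.0095, 0.0201, 0.0278, 0.0298, 0.0344,
      0.0459, 0.3240, 0.4262, 0.4929, 0.5719, 0.7095, 0.9019]
print(benjamini_hochberg(pv, q=0.05).sum())  # 4 rejections
```

Note the step-up character: the fourth p-value is rejected even though it exceeds its own threshold's smaller predecessors, because the comparison is made from the largest index down.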
Software failure early alert

Software failures pose a significant operational and financial burden on telecom companies. Expenses on support tend to be high, with companies keeping hundreds of support personnel on board to attend to urgent system problems. Maintenance is complex owing to large system variability, complexity, continuous change, and an extremely low tolerance for failure.

An effort to achieve some early alert for failure is described. Data from field-installed systems is analyzed to find normative behavior metrics and to identify early shifts from normal behavior. Some initial results are presented, with emphasis on simplicity (for field worthiness) and early detection (for financial effectiveness). The work is almost purely empirical, and initial field implementation shows promise.
Using a Classification Tree Model for Describing the tRNA Operative Code of Amino Acids Among the Biology Domains Archaeal and Eubacterial

The aim of this research is to describe and analyze the sequence data of the upper section of the tRNA molecule, known as the Acceptor Stem. We used data from organisms belonging to the Archaea and Eubacteria biological domains. We analyzed the relation between these sequences and the amino acids determined by the tRNA molecule using the classification tree model of CART.

The model performed well, explaining 96% and 92% of the observations belonging to the Eubacteria and Archaea, respectively. The performance of the model was tested using both cross-validation and simulation. It was found that the results of the full model are the best possible for future data, which implies that there is no overfitting of the model to the data. We also tested the connection to the anticodon mechanism, which is related to amino acid determination. In addition, we developed a novel graphical presentation of the Acceptor Stem sequences, which includes a measure of the data heterogeneity.
Nonparametric deconvolution applied to a traditional/nontraditional pooling design

We present methodologies for distribution-free estimation of a density function based upon observed sums, or pooled data. The proposed methods employ a Fourier approach to nonparametric deconvolution of a density estimate. It is shown that reconstruction of the density function of a random variable from the density of its observed sums requires strong conditions on the distribution functions. To relax these assumptions, a nontraditional pooling design that generates partial sums is proposed. The methods are exemplified using data from a study of biomarkers associated with coronary heart disease.
Sample Size For Positive and
Negative Predictive Value in Diagnostic Research
Important properties of diagnostic methods are their sensitivity, specificity,
and positive and negative predictive values (PPV and NPV). These methods are
typically assessed via case-control samples, which include one cohort of cases,
known to have the disease, and a second control cohort of disease-free
subjects. Such studies give direct estimates of sensitivity and specificity,
but only indirect estimates of PPV and NPV, which also depend on the disease
prevalence in the tested population. We develop formulas for optimal
allocation of the sample between the case and control cohorts and for computing
sample size when the goal of the study is to prove that the test procedure
exceeds pre-stated bounds for PPV and/or NPV.
This is joint work with Rick Chappell and Jason Fine, Department of Biostatistics, University of
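The dependence of PPV and NPV on prevalence, which makes the indirect estimation necessary, is simply Bayes' rule. A small illustration (the sensitivity, specificity and prevalence values are made up):

```python
def ppv_npv(sens, spec, prev):
    """Positive/negative predictive values from sensitivity, specificity,
    and disease prevalence, via Bayes' rule."""
    ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    npv = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
    return ppv, npv

# The same test looks very different in a screening vs. a referral population.
print(ppv_npv(0.90, 0.95, 0.01))   # low prevalence: PPV is poor
print(ppv_npv(0.90, 0.95, 0.30))   # high prevalence: PPV is strong
```

With 1% prevalence the PPV of this quite accurate test is only about 15%, which is why a case-control study alone, with its artificial 50/50 mix, cannot report PPV directly.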
The Genographic Project: Background and Some Statistical Challenges

The Genographic Project is a research partnership of National Geographic and IBM, intended to investigate the migration history of humans across the globe through the "history book" hidden in the DNA of each one of us. The project is unprecedented in scope: it targets genetic testing and analysis of 100,000 members of indigenous populations around the world, and at least as many members of the general public who choose to purchase participation kits. For more details please go to: www.nationalgeographic.com/genographic

In this talk, I will first give a brief overview of the project and its scope. I will then survey some statistical challenges that arise from it, and concentrate on one or both of the following topics, as time permits:

1. Maximum likelihood estimation of mutation probabilities and coalescent tree sizes in mtDNA. The control region of the mitochondrial DNA mutates much more quickly than most DNA in our body; consequently, it contains useful information for phylogenetic analysis. However, the mutation rate across this region is known to vary greatly. I will describe some data from the Genographic Project, including observed mtDNA control region mutations and haplogroup classification of several thousand individuals. The Poisson likelihood associated with the number of mutations at each locus in each haplogroup reduces to a binomial likelihood given the observed data. This maximum likelihood problem can be solved as a binomial GLM with a complementary log-log link function. The resulting estimates are useful for improving mtDNA classification models and phylogenetic analysis, and also give interesting insights into the population history of different haplogroups.

2. Power analysis of tests for discovery of admixture between modern humans and Neanderthals. I will discuss the feasibility of discovering inter-breeding between modern humans and Neanderthals in
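The reduction in topic 1 works because, when mutation counts are Poisson, the chance of observing at least one mutation is p = 1 - exp(-lambda), so cloglog(p) = log(lambda) is linear in log-rate plus log-exposure. A sketch on simulated data (the one-covariate model and all parameter values are illustrative, not the Genographic analysis; the fit uses direct maximum likelihood via scipy rather than a GLM routine):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)

# Hypothetical data: x stands in for log "tree size" (exposure); a mutation is
# observed at a locus iff at least one Poisson event occurred, so
#   P(Y = 1) = 1 - exp(-exp(a + b*x)),  i.e.  cloglog(p) = a + b*x.
n = 5000
x = rng.uniform(-1.0, 1.0, n)
a_true, b_true = -0.5, 1.2
p = 1.0 - np.exp(-np.exp(a_true + b_true * x))
y = rng.binomial(1, p)

def negloglik(theta):
    # Bernoulli negative log-likelihood under the cloglog link
    eta = theta[0] + theta[1] * x
    pr = np.clip(1.0 - np.exp(-np.exp(eta)), 1e-12, 1 - 1e-12)
    return -np.sum(y * np.log(pr) + (1 - y) * np.log1p(-pr))

fit = minimize(negloglik, x0=np.zeros(2), method="BFGS")
a_hat, b_hat = fit.x
print(a_hat, b_hat)   # close to (-0.5, 1.2)
```

Standard GLM software fits the same model directly by specifying a binomial family with a complementary log-log link.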
A Functional Data Analytic Approach To Empirical eCommerce Research

Electronic commerce has seen a surge of popularity in recent years. While the field of economics has created many theories for understanding economic behavior at the individual and market level, many of these theories were developed before the emergence of the World Wide Web and do not carry over to the new online environment. Consider online auctions: while auction theory has long been studied from a game-theoretic perspective, the electronic implementation of the auction mechanism poses new and challenging research questions. Luckily, empirical research of eCommerce is blessed by an ever-increasing amount of readily available, high-quality data, and is therefore thriving. However, the analysis methods used in this research community for extracting information from data have not kept up with the vast amount and complex structure of eCommerce data. The statistical methods currently used have typically been limited to "off the shelf" methods such as regression-type modeling.

In this talk, we present a novel statistical approach and set of tools called Functional Data Analysis (FDA) and discuss its usefulness for empirical research of eCommerce. We show how this approach allows the researcher to study, for the first time, dynamic concepts such as process evolution and, associated with that, process dynamics in the eCommerce context. We illustrate these ideas by focusing on online auctions, showing that understanding price evolution and its dynamics can be helpful in characterizing, differentiating, and even forecasting an auction.

There are multiple statistical challenges in applying FDA in the eCommerce context. Many of these arise from the non-standard data structures typical of eCommerce applications. We describe such challenges and some solutions, as well as unanswered questions. Finally, we present interesting results obtained from analyzing online auction data from eBay.com using a functional approach. We show how methods such as curve clustering and functional regression models shed new light on the dynamics that take place in online auctions.

Joint work with Wolfgang Jank. A series of relevant papers is available at
http://www.smith.umd.edu/ceme/statistics/papers.html
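To give a flavor of the curve-clustering idea, here is a toy sketch on synthetic "price paths" (not the eBay data): each auction's price sequence is represented functionally by the coefficients of a fitted low-order polynomial, and the auctions are then clustered in that coefficient space. The two simulated regimes (steady rise vs. late "sniping") are invented for illustration:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 50)          # normalized auction time

def make_curve(kind):
    # Two hypothetical price-evolution regimes: steady rise vs. late sniping
    base = 2.0 + 3.0 * t if kind == 0 else 1.0 + 4.0 * t**4
    return base + rng.normal(0.0, 0.1, t.size)

curves = [make_curve(0) for _ in range(20)] + [make_curve(1) for _ in range(20)]

# Functional representation: cubic polynomial coefficients per auction
coefs = np.array([np.polyfit(t, c, deg=3) for c in curves])

# Cluster auctions in coefficient space
centroids, labels = kmeans2(coefs, 2, seed=3, minit="++")
print(labels[:20], labels[20:])        # the two regimes separate cleanly
```

In real applications splines or other smoothers replace the polynomial, but the principle is the same: the object being clustered is the curve, not a scalar summary.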
Robust Experimental Design for Multivariate Generalized Linear Models

Optimal experimental designs for generalized linear models (GLMs) depend on the unknown coefficients: two experiments having the same model but different coefficient values will typically have different optimal designs. Therefore, unlike experimental design for linear models, prior knowledge and estimates of the outcome of an experiment must be taken into account.

Prior work on locally optimal experimental designs for GLMs has mainly focused on a simple linear effect and one design variable. Generalizing these local results to take account of uncertainty is even more difficult. As a result, literature concerning multivariate robust designs for GLMs is scarce.

We describe a fast and simple method for finding locally D-optimal designs for high-order multivariate models. With this capability in hand, we suggest a simple heuristic capable of finding designs that are robust to most sources of uncertainty an experimenter might consider, including uncertainty in the coefficient values, in the linear predictor equation, and in the link function. The procedure is based on K-means clustering of locally optimal designs. Clustering, with its simplicity and minimal computational needs, is demonstrated to outperform more complex and sophisticated methods.
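A toy sketch of the clustering heuristic, for the simplest case where the local answer is known in closed form: in one-variable logistic regression the locally D-optimal design puts its two support points where the linear predictor equals +/-1.5434. Sampling coefficients from a prior, pooling the resulting local support points, and running K-means yields a small robust design. The prior and the cluster count below are purely illustrative, and this is only a caricature of the multivariate procedure described in the talk:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(2)

# Prior draws of the logistic-model coefficients in eta = a + b*x (illustrative)
a = rng.normal(0.0, 0.5, 200)
b = rng.normal(1.5, 0.2, 200)

# Locally D-optimal design for one-variable logistic regression: two support
# points where the linear predictor equals +/-1.5434 (classical closed form)
C = 1.5434
support = np.concatenate([(C - a) / b, (-C - a) / b]).reshape(-1, 1)

# Cluster the pooled local support points into a small robust design
centers, _ = kmeans2(support, 4, seed=5, minit="++")
print(np.sort(centers.ravel()))   # candidate robust design points
```

The appeal of the approach is visible even here: the clustered design hedges across the prior instead of committing to the optimum of a single guessed coefficient vector.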
Pricing financial instruments with stochastic volatility - a Bayesian approach

This talk will review the Generalized Inverse Gaussian and Hyperbolic distributions, presenting a family of conjugate priors for (one over the) variance of a normal variable that is richer than the Gamma family. This family yields light-tailed predictive distributions, unlike the t distribution induced by Gamma priors. The variance will be permitted to change over time by the introduction of two conflicting experts, one Bayesian (the unknown variance does not change) and one empirical Bayesian (the variance is sampled from the prior for each observation). The logarithmic expert opinion pooling of Hammond-Genest-Zidek will be applied.
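Logarithmic opinion pooling combines the experts' densities as a renormalized weighted geometric mean, f proportional to f1^w * f2^(1-w). For two normal experts the pool is again normal, with precision the weighted average of the precisions. A quick numerical check of that closed form (the weights and parameters are illustrative):

```python
import numpy as np

def norm_pdf(x, mu, s2):
    return np.exp(-(x - mu) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

x = np.linspace(-10.0, 10.0, 20001)
dx = x[1] - x[0]
mu1, s1 = 0.0, 1.0          # expert 1
mu2, s2 = 2.0, 4.0          # expert 2
w = 0.7                     # weight on expert 1

# Logarithmic pooling: weighted geometric mean of densities, renormalized
pool = norm_pdf(x, mu1, s1) ** w * norm_pdf(x, mu2, s2) ** (1 - w)
pool /= pool.sum() * dx

# Closed form: normal with precision w/s1 + (1-w)/s2, precision-weighted mean
tau = w / s1 + (1 - w) / s2
mu_pool = (w / s1 * mu1 + (1 - w) / s2 * mu2) / tau
mean_num = (x * pool).sum() * dx
print(mean_num, mu_pool)    # the numerical and closed-form means agree
```

Unlike linear (mixture) pooling, the logarithmic pool of unimodal experts stays unimodal, which is one reason it is attractive for combining the two volatility experts in the talk.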
Rice and Geometry (via the Brain)

The classic Rice formula for the expected number of upcrossings of a smooth stationary Gaussian process on the real line is one of the oldest and most important results in the theory of smooth stochastic processes. It has been generalised over the years to non-stationary and non-Gaussian processes, both over the reals and over more complex parameter spaces, and to vector-valued rather than real-valued processes.

Over the last few years, jointly with Jonathan Taylor, we have discovered a kind of Rice "super formula" which incorporates effectively all the (constant variance) special cases known until now. More interesting, however, is that it shows that these formulae all have a deep geometric interpretation, giving a version of the Kinematic Fundamental Formula of Integral and Differential Geometry for Gauss space.

To keep things reasonably concrete, and to put them into a statistical framework, I shall take as a motivating application of these results some hypothesis testing issues in brain imaging, and relate much of what I have to say to them.
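For a stationary, unit-variance Gaussian process, the classic Rice formula reads E[N_u(0,T)] = (T / 2*pi) * sqrt(lambda2) * exp(-u^2 / 2), with lambda2 the second spectral moment. A quick Monte Carlo sanity check on the random cosine process X(t) = xi1*cos(t) + xi2*sin(t), for which lambda2 = 1 (the level, horizon and simulation sizes below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

u = 1.0                        # crossing level
T = 2 * np.pi                  # one period of the process
t = np.linspace(0.0, T, 2000)

counts = []
for _ in range(4000):
    xi = rng.normal(size=2)
    x = xi[0] * np.cos(t) + xi[1] * np.sin(t)
    # count upcrossings of level u on the time grid
    counts.append(np.sum((x[:-1] < u) & (x[1:] >= u)))

rice = T / (2 * np.pi) * np.exp(-u**2 / 2)   # lambda2 = 1 for this process
print(np.mean(counts), rice)                  # both close to exp(-1/2) ~ 0.607
```

For this particular process the agreement can also be seen directly: X(t) = R*cos(t - phi) with Rayleigh amplitude R, so there is exactly one upcrossing per period whenever R > u, an event of probability exp(-u^2/2).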
Ruth Heller
False Discovery Rates for Spatial Signals

We suggest a new approach to multiple testing for signal presence in spatial data that tests cluster units rather than individual locations. This approach leads to an increased signal-to-noise ratio within the tested unit, as well as to a reduced number of hypothesis tests. We introduce a powerful adaptive procedure to control the size-weighted FDR on clusters, i.e. the size of erroneously rejected clusters out of the total size of clusters rejected. Once the cluster discoveries have been made, we suggest 'cleaning' locations in which the signal is absent by a hierarchical testing procedure that controls the expected proportion of locations in which false rejections occur. We discuss an application to functional MRI, which motivated this research, and demonstrate the advantages of the proposed methodology on an example.
Vered Madar
Hierarchical FDR controlling procedures

I will introduce FDR trees - a new class of hierarchical FDR controlling procedures. In this new testing approach, rather than testing all the hypotheses simultaneously, the tested hypotheses are arranged in a tree of disjoint subfamilies, and the tree of subfamilies is tested hierarchically. This is a very flexible and powerful testing framework, suited to performing complex statistical analyses. I will present the theoretical properties of FDR trees and demonstrate their use with several examples.
Eytan Domany, Department of Physics of Complex Systems, Weizmann Institute of Science
Predicting outcome in breast cancer: the search for a robust list of predictive genes

Predicting, at the time of discovery, the prognosis and metastatic potential of breast cancer is a major challenge in current clinical research. Numerous recent studies searched for gene expression signatures that outperform traditionally used clinical parameters in outcome prediction. Finding such a signature would free many patients from the suffering and toxicity associated with the adjuvant chemotherapy given to them under current protocols, even though they do not need such treatment. A reliable set of predictive genes would also contribute to a better understanding of the biological mechanism of metastasis.

Several groups have published ranked lists of predictive genes and reported good predictive performance based on them. However, the gene lists obtained for the same clinical types of patients by different groups differed widely and had only very few genes in common, raising doubts about the reliability and robustness of the reported predictive gene lists. The main source of the problem was shown to be [1] the highly fluctuating nature of the correlation of single genes' expression with outcome, on which the ranking was based. The underlying biological reason is the heterogeneity of the disease; to stabilize the ranked gene list, a much larger number of samples (patients) is needed than has been used so far.

We introduced [2] a novel mathematical method, PAC ranking, for evaluating the robustness of such rank-based lists. We calculated, for several published datasets, the number of samples needed to achieve any desired level of reproducibility. For example, in order to achieve a typical overlap of 50% between two predictive lists of genes, breast cancer studies would need the expression profiles of several thousand early-discovery patients.
[1] L. Ein-Dor,
[2] Liat Ein-Dor, Or Zuk and Eytan Domany, PNAS 103, 5923 (2006)
Sparse and Redundant Signal Representation, and its Role in Image Processing

In signal and image processing, we often use transforms in order to simplify operations or to enable better treatment of the given data. A recent trend in these fields is the use of overcomplete linear transforms that lead to a sparse description of signals. This new breed of methods is more difficult to use, often requiring more computation. Still, they are much more effective in applications such as signal compression and inverse problems. In fact, much of the success attributed to the wavelet transform in recent years is directly related to the above-mentioned trend. In this talk I plan to present a survey of this recent path of research and its main results. I will discuss both the theoretical and the applicative sides of this field. No previous knowledge is assumed (... just common sense, and a little bit of linear algebra).
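One standard way to compute a sparse representation over an overcomplete dictionary is the greedy Orthogonal Matching Pursuit, one of several pursuit algorithms in this literature. A small NumPy sketch on synthetic data (the random dictionary and the 3-sparse signal are invented for illustration):

```python
import numpy as np

def omp(D, y, k):
    """Orthogonal Matching Pursuit: greedily select up to k atoms of the
    dictionary D (columns, assumed unit norm) and least-squares fit y."""
    residual, support = y.copy(), []
    for _ in range(k):
        # pick the atom most correlated with the current residual
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
        if np.linalg.norm(residual) < 1e-10:
            break
    x = np.zeros(D.shape[1])
    x[support] = coef
    return x

rng = np.random.default_rng(0)
n, m = 50, 80                           # overcomplete: more atoms than samples
D = rng.normal(size=(n, m))
D /= np.linalg.norm(D, axis=0)          # unit-norm atoms

x_true = np.zeros(m)
x_true[[5, 17, 42]] = [5.0, -4.0, 3.0]  # a 3-sparse representation
y = D @ x_true

x_hat = omp(D, y, k=6)                  # a few extra iterations for safety
print(np.linalg.norm(D @ x_hat - y))    # essentially zero
```

The point of the exercise is the overcompleteness: y lives in 50 dimensions, yet the pursuit finds an exact description using only 3 of the 80 atoms.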
Sequential Experimental Design for Multivariate Generalized Linear Models

A one-stage experimental plan requires the researcher to fix in advance the factor settings at which data will be observed. Sequential experimental design allows updating and improving the experimental plan in light of the data already observed. We consider the problem of choosing sequential plans when the response is modeled by a GLM. A common setting is "sensitivity testing" (also known as "dose-response" or "up and down"). In a typical experiment, a dose level is chosen at each step and the success or failure of a treatment is recorded.
We suggest a new procedure for the sequential choice of observations and show it is superior in efficiency to commonly used procedures, such as the "Bruceton" test (Dixon and Mood, 1948), the Langlie (1965) test or Neyer's (1994) procedure. The suggested procedure is based on a D-optimality criterion, and on a Bayesian approximation that exploits a discretization of the parameter space.
Perhaps more important than the improved efficiency, the suggested algorithm can be used in many situations where the former algorithms do not apply. These include extension from the fully sequential design to any partition of the experiment into blocks of observations, from a binary response to any GLM (including Poisson count models), and from the univariate case to the treatment of multiple predictors.
We present a comparison of results obtained with the new algorithm versus the "Bruceton" method on an actual sensitivity test conducted recently at an industrial plant. We also provide a comprehensive comparison of techniques via Monte Carlo simulation.
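For reference, the "Bruceton" (Dixon and Mood, 1948) baseline is the simplest of the up-and-down rules: test one unit at a time, stepping the stress level down after a response and up after a non-response. A sketch on a simulated logistic threshold (the threshold distribution, step size and crude median estimator are illustrative, not the procedures of the talk):

```python
import numpy as np

rng = np.random.default_rng(4)

mu, scale = 10.0, 0.8     # latent logistic threshold distribution (made up)
step = 0.5                # fixed step size of the up-and-down rule
level = 8.0               # starting stress level

levels = []
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(level - mu) / scale))  # P(response at level)
    response = rng.random() < p
    levels.append(level)
    level += -step if response else step             # down after a "go", up after a "no-go"

# A crude median estimate: average the visited levels after a burn-in
est = np.mean(levels[50:])
print(est)   # hovers near the true median, mu = 10.0
```

The rule concentrates observations near the median stress, which is exactly why it estimates the 50% point well but carries little information about the tails, one motivation for the D-optimal sequential procedures above.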
Assaf Oron, Department of Statistics, University of Washington
Small correction to Isotonic Regression
Many scientific and engineering experiments focus on a monotone function y=F(x), using a sample of observations at different design points, {(x1,y1)… (xn,yn)}. Some examples are sensory stimulus-response experiments, drug dose-response studies, and survival function studies. Isotonic Regression (IR) is the standard non-parametric solution for estimating F when some observations violate monotonicity. IR locally 'flattens' monotonicity-violating sequences, so as the extent of such sequences increases, IR's output may resemble a staircase function. This is undesirable, especially when F is known to be smooth and strictly monotone. Current solutions to this problem use ad-hoc smoothing algorithms; user intervention is required to choose the algorithm and to set the values of smoothing parameters.
We propose a statistical solution that does not require such intervention, and therefore can directly replace IR in many applications. This solution – centered isotonic regression (CIR; temporary name) – is based on a conditional expectation analysis of the IR estimate, given the location of monotonicity-violating sequences. It can be shown that CIR has lower MSE than IR, under conditions which are met in typical applications. The advantages of CIR are demonstrated using re-analysis of published data.
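The staircase behavior described above comes from IR's pooling step. The standard pool-adjacent-violators algorithm (PAVA) that computes plain IR, not the proposed CIR, fits in a few lines:

```python
import numpy as np

def isotonic_regression(y, w=None):
    """Pool Adjacent Violators: weighted least-squares fit that is
    non-decreasing in the index."""
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    # maintain blocks of (mean, weight, count); merge while order is violated
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(w1 * m1 + w2 * m2) / wt, wt, n1 + n2])
    return np.concatenate([[m] * n for m, _, n in blocks])

# A monotonicity-violating pair is locally flattened into a "stair"
print(isotonic_regression([1.0, 3.0, 2.0, 4.0]))   # [1.  2.5 2.5 4. ]
```

Each flat stretch in the output is one pooled block; CIR, as described above, replaces such a stretch by a single point at its center and interpolates, which is what removes the staircase artifact.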