Department of Statistics & Operations Research

Statistics Seminars

2006/2007


Note: the program is not final and is subject to change


Summer term

19, July*

Katherine S. Pollard, UC Davis Genome Center & Department of Statistics

 

 Detecting Lineage-Specific Evolution (slides)


Second term

6, March

Tom Trigano, Hebrew University

 

Statistical signal processing and gamma spectrometry: some processing methods for higher counting rates

27, March

Dean Foster, University of Pennsylvania

 

On the Intrinsic Dimensionality of Multi-View Regression

1, May

Nitzan Rosenfeld, Rosetta Genomics

 

MicroRNA discovery and application for diagnostics

8, May

Svetlana Bunimovich, Tel Aviv University

 

Immunotherapy treatment of Bladder Cancer: A mathematical model

29, May

Marianna Pensky, University of Central Florida (download paper)

 

Bayesian Approach to Estimation and Testing in Time Course Microarray Experiments

19, June

Ruth Heller, Tel Aviv University

 

Screening for Partial Conjunction Hypotheses

 

 

First term

31, October

Meir Smorodinsky, Tel Aviv University

 

Probabilistic modeling of risks in medical procedures

7, November

Felix Abramovich, Tel Aviv University

 

Prelude and Fugue in Bayesian Testimation

28, November

Isaac Meilijson and Alon Kaufman, Tel Aviv University

 

Does the neuron's gene expression carry information on its synaptic connectivity? 

26, December

Adi Ben-Israel, Rutgers University

 

Probabilistic Distance Clustering (slides)

2, January*

David M. Steinberg, Tel Aviv University

 

Orthogonal Latin Hypercube Designs

9, January

Sigal Levy, Tel Aviv University

 

The analysis of time-dependent computer experiments

16, January

Tal Pupko, Tel Aviv University

 

Probabilistic evolutionary models and their applications

* Notice: special time / date / venue

 


 

Seminars are held on Tuesdays at 10:30 am in the Schreiber Building, Room 309 (see the TAU map). Coffee is served beforehand.

The seminar organizer is Daniel Yekutieli.

To join the seminar mailing list, or for any other inquiries, please call (03)-6409612 or email yekutiel@post.tau.ac.il.

 




ABSTRACTS

 

 

·      Meir Smorodinsky, Tel Aviv University

 

Probabilistic modeling of risks in medical procedures

 

Medical procedures are commonly evaluated by weighing, across the population of patients, their cost and the risk involved in performing them against the benefit they may yield. The individual patient, however, is interested in his own chances of recovery. To this end, I will present probabilistic continuous-time noise models for the individual risk estimates of the available procedures, given the patient's medical record.

 

 

·      Felix Abramovich, Tel Aviv University

Prelude and Fugue in Bayesian Testimation

In the Prelude (a joint composition with Claudia Angelini, CNR Napoli) we present a Bayesian multiple testing procedure. A hierarchical prior model is based on imposing a prior distribution on the number of hypotheses arising from alternatives (false nulls). We then apply the maximum a posteriori (MAP) rule to find the most likely configuration of nulls and alternatives. We discuss the relations between the proposed MAP procedure and several of its existing frequentist counterparts.

In the Fugue (a joint composition with Vadim Grinshtein, the Open University, and Marianna Pensky, University of Central Florida) we apply the developed MAP procedure to the normal means problem of recovering a high-dimensional vector observed in white noise. The resulting Bayesian testimator leads to a general thresholding rule which accommodates many of the known thresholding and model selection procedures as particular cases. We discuss the optimality of the MAP testimator and specify the class of priors for which it is adaptively minimax for a wide range of sparse sequences.
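
As a rough illustration of the testimation idea, here is a minimal Python sketch. It uses a deliberately simplified two-group prior (an independent Bernoulli indicator per coordinate with a Gaussian slab; this prior is an assumption of the sketch, not the talk's hierarchy), under which the MAP configuration of nulls and alternatives reduces to thresholding per-coordinate Bayes factors:

    import numpy as np

    # Simplified MAP "testimation" for the normal means model x_i = theta_i + N(0, sigma^2).
    # Hypothetical prior (an assumption of this sketch): theta_i = 0 with probability
    # 1 - p, and theta_i ~ N(0, tau^2) with probability p, independently over i.
    # The MAP configuration then keeps exactly those coordinates whose log Bayes
    # factor exceeds the prior log-odds in favor of a null.
    def map_testimate(x, sigma=1.0, tau=3.0, p=0.01):
        s2, t2 = sigma**2, tau**2
        # log Bayes factor of N(0, s2 + t2) against N(0, s2) at each observation
        log_bf = 0.5 * x**2 * t2 / (s2 * (s2 + t2)) - 0.5 * np.log1p(t2 / s2)
        keep = log_bf > np.log((1 - p) / p)            # MAP: declare alternative
        return np.where(keep, x * t2 / (s2 + t2), 0.0), keep   # shrink kept coords

    rng = np.random.default_rng(1)
    theta = np.zeros(500)
    theta[:20] = 5.0                                   # a sparse mean vector
    x = theta + rng.standard_normal(500)
    theta_hat, keep = map_testimate(x)
    print(keep.sum(), "coordinates declared non-null")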

 

 

·      Isaac Meilijson and Alon Kaufman, Tel Aviv  University

Does the neuron's gene expression carry information on its synaptic connectivity?

The speakers will discuss a large-scale investigation on the nematode Caenorhabditis elegans – joint work with Gideon Dror and Eytan Ruppin.

 

 

·      Adi Ben-Israel, Rutgers University

Probabilistic Distance Clustering

Clustering is the process of partitioning a data set into clusters, i.e. subsets of data points that are similar in some sense. In probabilistic clustering, cluster membership is expressed by the probability that a point x belongs to a cluster C. In distance clustering, "similar" means close with respect to a given distance function (Euclidean, Mahalanobis, etc.).
I present a new approach and method for probabilistic clustering of data. Given the clusters' centers and sizes (with the sizes estimated if unknown) and the distances of the data points from these centers, the probability that a point belongs to a cluster is assumed to be inversely proportional to its distance from the center of that cluster, and directly proportional to the cluster's size. The method is based on this assumption and on the joint distance function, a weighted harmonic mean of the distances from all cluster centers, which evolves during the iterations and captures the data in its low contours. The method is simple, fast, and works well.
In addition to clustering, I present applications to the location of several capacitated facilities and to the demixing of mixtures of distributions, where the proposed method is a viable alternative to the EM method for estimating the relevant parameters.
Joint work with Cem Iyigun, Rutgers University.
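
A minimal Python sketch of the membership rule described above: the probability formula (cluster size over distance, normalized) follows the abstract, while the particular center and size updates are plausible guesses rather than the speaker's exact iteration:

    import numpy as np

    def pd_cluster(X, k, iters=50, seed=0):
        """Probabilistic distance clustering sketch: membership of x in cluster j
        is proportional to size_j / d_j(x); centers and sizes are then re-estimated
        from the memberships (the update details are assumptions of this sketch)."""
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        sizes = np.full(k, 1.0 / k)
        for _ in range(iters):
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
            p = sizes / d                        # p(x in C_j) ~ size_j / d_j(x)
            p /= p.sum(axis=1, keepdims=True)    # normalize over clusters
            w = p**2 / d                         # membership weights for the update
            centers = (w[:, :, None] * X[:, None, :]).sum(0) / w.sum(0)[:, None]
            sizes = p.mean(axis=0)
        return p, centers, sizes

    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal(0, 1, (300, 2)), rng.normal(5, 1, (100, 2))])
    p, centers, sizes = pd_cluster(X, k=2)
    print(centers.round(2), sizes.round(2))      # sizes reflect the 3:1 imbalance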


 

·      David M. Steinberg, Tel Aviv University

 

Orthogonal Latin Hypercube Designs

 

Latin Hypercube (LHC) designs are one of the most popular choices for experiments run on computer simulators.  As first proposed by McKay, Beckman and Conover in 1979, LHC designs guarantee that input factor settings are uniformly spread for each single factor, but rely on “random mating” to achieve good spread in high dimensions.  In experiments with many factors, some pairs of factors typically have moderately high correlations, and a number of schemes have been proposed to reduce the correlations.  In this talk we show how to generate completely orthogonal LHC designs by rotating two-level factorial designs into LHC designs.  For example, with 256 runs, our method produces an LHC design with 248 orthogonal factors.  The best previously known result achieved only 14 orthogonal factors.  As a side problem, we show how to arrange the columns of a saturated two-level fractional factorial with 2^m runs so that every consecutive set of m columns is a full two-level factorial.  We will also present some results on orthogonal designs that are “nearly” LHC designs.

 

This is joint work with Dennis Lin.
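
For context, this short sketch shows the "random mating" problem the construction solves: a randomly generated Latin hypercube leaves non-negligible pairwise correlations between factors, which the rotated designs eliminate exactly:

    import numpy as np

    def random_lhc(n_runs, n_factors, rng):
        # each column is an independent random permutation of n equally spaced levels
        levels = (np.arange(n_runs) + 0.5) / n_runs
        return np.column_stack([rng.permutation(levels) for _ in range(n_factors)])

    rng = np.random.default_rng(3)
    D = random_lhc(256, 20, rng)
    C = np.corrcoef(D, rowvar=False)
    off_diag = np.abs(C[np.triu_indices_from(C, k=1)])
    print("largest |correlation| between factors:", off_diag.max().round(3))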

 

 

·               Sigal Levy, Tel Aviv University

The analysis of time-dependent computer experiments

Computer experiments are a convenient substitute for real-life experiments when such experiments are too complex, expensive or time-consuming.  This work is concerned with experiments in which the output from each computer run is a dense time trace that is a function of some low-dimensional set of explanatory variables. We suggest two-stage methods for modelling and predicting such data, separating the model for time from the model for the explanatory variables. The time dependence is modelled by fitting known basis functions, such as splines, as well as data-derived, shape-based basis functions. Such basis functions are generated by clustering the data into similarly shaped functions and taking the mean function of each cluster as a basis function. Several methods were tested for modelling the relation to the explanatory variables. Bayesian and other models that used additional information about the data set were considered as a means of improving the fit obtained by the two-stage methods.

Two data sets were analysed using these methods: a simulation of response to chemotherapy, which yields the amount of cancer cells in a patient’s body under different chemotherapy treatment protocols, and a circadian rhythm simulation, showing mRNA production and degradation throughout a sleep-wake cycle. Results show that the shape-based estimation methods were efficient in both cases, and that Kriging usually predicted the relation to the explanatory variables most accurately.
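
A small Python sketch of the shape-based, two-stage idea (the clustering choice and the synthetic data here are illustrative assumptions): traces are clustered into similarly shaped groups, the cluster means serve as basis functions, and stage one reduces each trace to its coefficients on that basis:

    import numpy as np
    from sklearn.cluster import KMeans

    def shape_basis(Y, n_basis, seed=0):
        # Y: (n_runs, n_times) output traces; cluster them and use each
        # cluster's mean curve as a data-derived, shape-based basis function
        km = KMeans(n_clusters=n_basis, n_init=10, random_state=seed).fit(Y)
        return km.cluster_centers_.T                 # (n_times, n_basis)

    def stage_one_coefs(Y, B):
        # least-squares coefficients of every trace on the basis B
        return np.linalg.lstsq(B, Y.T, rcond=None)[0].T

    rng = np.random.default_rng(4)
    t = np.linspace(0.0, 1.0, 100)
    x = rng.uniform(size=(40, 1))                    # one explanatory variable
    Y = np.exp(-4.0 * x * t[None, :]) + 0.01 * rng.standard_normal((40, 100))
    B = shape_basis(Y, n_basis=4)
    A = stage_one_coefs(Y, B)
    # stage two would model the coefficient matrix A as a function of x
    # (e.g. by Kriging), separating the time model from the input model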

 

·               Tal Pupko, Tel Aviv University

Probabilistic evolutionary models and their applications

In my talk I will first give the needed biological background and provide the motivation for using evolutionary models; for example, I will discuss the evolutionary relationships between humans and Neanderthals. I will then explain what probabilistic evolutionary models are and why they are needed in the context of phylogenetic tree reconstruction, and give the statistical/mathematical background of these continuous-time Markov models. I will discuss the maximum likelihood approach for learning with these models, explain the computational challenges, and present a few applications.
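
As a minimal illustration of such continuous-time Markov models, here is the classical Jukes-Cantor substitution model in Python, where transition probabilities over a branch come from the matrix exponential of the rate matrix:

    import numpy as np
    from scipy.linalg import expm

    # Jukes-Cantor model on {A, C, G, T}: all substitutions share one rate.
    # Q is the CTMC generator; transition probabilities over a branch of
    # length t are P(t) = expm(Q t).
    mu = 1.0
    Q = (mu / 3.0) * (np.ones((4, 4)) - 4.0 * np.eye(4))   # rows sum to zero
    for t in (0.1, 1.0, 10.0):
        P = expm(Q * t)
        print(f"t = {t}:  P(no change) = {P[0, 0]:.3f},  P(specific change) = {P[0, 1]:.3f}")
    # As t grows, P(t) tends to the uniform stationary distribution (0.25 each);
    # likelihoods of observed alignments under P(t) drive ML tree estimation.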

 

·           Tom Trigano, Hebrew University

 

 

Statistical signal processing and gamma spectrometry: some processing methods for higher counting rates

 

The main objective of gamma spectrometry is to characterize the radioactive elements of an unknown source by studying the energy of the emitted gamma photons. When a photon interacts with a detector, its photonic energy is converted into an electrical pulse, whose integral energy is measured.

Since the detector has a finite resolution, close arrival times of photons, which can be modeled as a homogeneous Poisson process, cause pile-ups of individual pulses. This phenomenon distorts energy spectra by introducing, among other perturbations, multiple fake spikes.

Since the shape of photonic impulses depends on many physical parameters, we consider this problem in a nonparametric framework. By introducing an adapted model based on two marked point processes, we establish a nonlinear relation between the probability measure associated with the observations and the probability density function we wish to estimate. This provides a framework for the problem, which can be considered as a problem of nonlinear density deconvolution and nonparametric density estimation from indirect measurements.

Using these considerations, we propose an estimator obtained by direct inversion. We show that this estimator is consistent and almost achieves the usual rate of convergence obtained in classical nonparametric density estimation in the L2 sense. We show in both simulated and real examples that the distortions caused by the pile-up phenomenon are well corrected by the algorithms derived from our estimators.
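
A toy simulation of the pile-up phenomenon described above (the two-line source and detector parameters are invented for illustration): photons arriving closer together than the pulse duration are integrated as one pulse, so the recorded spectrum grows spurious peaks at sums of the true lines:

    import numpy as np

    rng = np.random.default_rng(5)
    rate, pulse_len, n = 0.05, 10.0, 200_000        # Poisson rate, pulse width
    arrivals = np.cumsum(rng.exponential(1.0 / rate, n))
    energies = rng.choice([100.0, 150.0], n)        # hypothetical two-line source
    recorded = []
    i = 0
    while i < n:
        e, j = energies[i], i + 1
        while j < n and arrivals[j] - arrivals[j - 1] < pulse_len:
            e += energies[j]                        # overlapping pulses pile up
            j += 1
        recorded.append(e)
        i = j
    hist, edges = np.histogram(recorded, bins=np.arange(50.0, 505.0, 10.0))
    top = edges[np.argsort(hist)[-5:]]
    print("five tallest spectral bins start at:", np.sort(top))
    # besides the true lines at 100 and 150, fake spikes appear near 200, 250, 300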

 

 

·               Dean Foster, University of Pennsylvania

On the Intrinsic Dimensionality of Multi-View Regression

In the multi-view regression problem, we have a regression problem where the input variable can be partitioned into two different views, and it is assumed that either view of the example would be sufficient for learning --- this is essentially the co-training assumption for the regression problem. For example, the task might be to identify a person, and the two views might be a video stream and an audio stream of that person.

We show how Canonical Correlation Analysis (CCA, related to PCA for two random variables) implies a ridge regression algorithm, where we can characterize the intrinsic dimensionality of this regression problem by the correlation of the two views. An interesting aspect of our analysis is that the norm used by the ridge regression algorithm is derived from the CCA --- no norm or Hilbert space is assumed a priori (unlike in kernel methods).

(with Sham Kakade)
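
The CCA-derived regression norm is the talk's contribution; as background, this sketch merely computes the canonical correlations of two views by whitening each view and taking singular values of the cross-covariance:

    import numpy as np

    def canonical_correlations(X, Y, reg=1e-8):
        X = X - X.mean(axis=0)
        Y = Y - Y.mean(axis=0)
        n = len(X)
        Cxx = X.T @ X / n + reg * np.eye(X.shape[1])   # small ridge for stability
        Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
        Cxy = X.T @ Y / n
        Lx = np.linalg.cholesky(Cxx)                   # whiten each view; the
        Ly = np.linalg.cholesky(Cyy)                   # singular values of the
        M = np.linalg.inv(Lx) @ Cxy @ np.linalg.inv(Ly).T   # whitened cross-cov.
        return np.linalg.svd(M, compute_uv=False)      # = canonical correlations

    rng = np.random.default_rng(6)
    z = rng.standard_normal((1000, 2))                 # shared latent signal
    X = z @ rng.standard_normal((2, 5)) + 0.5 * rng.standard_normal((1000, 5))
    Y = z @ rng.standard_normal((2, 6)) + 0.5 * rng.standard_normal((1000, 6))
    print(canonical_correlations(X, Y).round(2))       # two large, the rest near 0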

 

·              Nitzan Rosenfeld, Rosetta Genomics

MicroRNA discovery and application for diagnostics

I will describe the computational methodology we used for identification of microRNAs in the human genome. I will introduce some of the challenges and pitfalls of cancer diagnostics, and open the topic for discussion. I will conclude by presenting our algorithmic approach to tissue classification.

 

·               Svetlana Bunimovich, Tel Aviv University

Immunotherapy treatment of Bladder Cancer: A mathematical model

 

I present a modeling study of bladder cancer growth and its treatment via immunotherapy with Bacillus Calmette-Guérin (BCG), an attenuated strain of Mycobacterium bovis. BCG immunotherapy is a clinically established procedure for the treatment of superficial bladder cancer. However, the mode of action has not yet been fully elucidated, despite extensive biological research. The mathematical model presented here attempts to gain insight into the different dynamical outcomes arising from tumor-immune interactions in the bladder. I studied two types of treatment: continuous and pulsed BCG therapy. Attention is given to estimating parameters and validating the model using published data from in vitro, mouse and human studies. A mathematical analysis of the differential equations identifies multiple equilibrium points, their stability properties, and bifurcation points. Intriguing regimes of bistability are identified, in which treatment can lead either to a tumor-free equilibrium or to a full-blown tumor, depending only on initial conditions. In the case of continuous therapy, the model makes clear that the intensity of immunotherapy must be kept within limited bounds: while small treatment levels may fail to clear the tumor, a treatment that is too large can lead to an over-stimulated immune system with dangerous side effects for the patient. The model predicts: i) regimes in which immunotherapy cannot help; ii) the optimal BCG dosage, since intense therapy can incur damage and side effects via the immune system; and iii) quantitative relationships between the BCG dosage, the cancer’s initial condition and the tumour growth rate that can be calculated prior to treatment.

 

Impulsive differential equations are used for studying periodic BCG instillations (pulsed BCG therapy). The mathematical analysis defines the critical threshold values of the BCG instillation dose and rate of pulsing for tumor elimination. The final goal in this work is to determine the applicable treatment regime that prevents the immune system side effects (caused by BCG) and enhances tumor destruction.
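
To show how pulsed (impulsive) therapy is typically simulated, here is an illustrative Python fragment; the two-equation toy system below is an invented stand-in, not the speaker's bladder cancer model:

    import numpy as np
    from scipy.integrate import solve_ivp

    # Toy tumor/BCG system (invented for illustration): tumor x grows and is
    # killed on contact with BCG b; b decays and is consumed.  Instillations
    # enter as impulses: integrate between doses, then jump b upward.
    def rhs(t, y, r=0.1, kill=0.4, decay=0.3, consume=0.05):
        x, b = y
        return [r * x - kill * b * x, -decay * b - consume * b * x]

    dose, period, y = 1.0, 7.0, [1.0, 0.0]
    for cycle in range(20):
        sol = solve_ivp(rhs, (cycle * period, (cycle + 1) * period), y, max_step=0.1)
        y = [sol.y[0, -1], sol.y[1, -1] + dose]   # impulse: add one BCG dose
    print("tumor burden after 20 instillation cycles:", round(y[0], 4))
    # with a smaller dose the tumor can regrow in this toy system; the talk's
    # analysis characterizes such threshold doses and pulsing rates exactly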

 

 

·              Marianna Pensky, University of Central Florida

Bayesian Approach to Estimation and Testing in Time Course Microarray Experiments

The objective of this work is to develop a truly functional, fully Bayesian method that allows one to identify differentially expressed genes in a time-course microarray experiment. Each gene expression profile is modeled as an expansion over some orthonormal basis, with the coefficients and the number of basis functions estimated from the data. The proposed procedure deals successfully with various technical difficulties that arise in microarray time-course experiments, such as the small number of available observations, non-uniform sampling intervals, the presence of missing data or multiple measurements, and temporal dependence between observations for each gene. The procedure allows one to account for various types of errors, thus offering a good compromise between nonparametric techniques and those based on normality assumptions. The method accounts for multiplicity, and selects and ranks differentially expressed genes. In addition, all evaluations are performed using analytic expressions; hence, the entire procedure requires very little computational effort. The quality of the procedure is studied by simulations. Finally, the procedure is applied to a case study of human breast cancer cells stimulated with estrogen, leading to the discovery of some new significant genes that were not flagged earlier due to the high variability in the raw data.

Joint work with Claudia Angelini, Daniela De Canditiis and Margherita Mutarelli
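
To fix ideas, this sketch expands a single noisy, non-uniformly sampled profile over an orthonormal basis with a data-chosen number of coefficients; Legendre polynomials and BIC are stand-ins here for the paper's basis and fully Bayesian selection:

    import numpy as np
    from numpy.polynomial import legendre

    def fit_profile(t, y, max_deg=6):
        """Expand a time course over Legendre polynomials on rescaled time,
        picking the number of basis functions by BIC (a simplified stand-in
        for the fully Bayesian choice described in the abstract)."""
        u = 2.0 * (t - t.min()) / (t.max() - t.min()) - 1.0   # map into [-1, 1]
        n, best = len(y), (np.inf, None)
        for deg in range(max_deg + 1):
            c = legendre.legfit(u, y, deg)
            rss = float(np.sum((y - legendre.legval(u, c)) ** 2))
            bic = n * np.log(rss / n) + (deg + 1) * np.log(n)
            if bic < best[0]:
                best = (bic, c)
        return best[1]                        # coefficients summarizing the profile

    rng = np.random.default_rng(7)
    t = np.sort(rng.uniform(0.0, 48.0, 10))   # ten non-uniform sampling times (hours)
    y = np.sin(t / 8.0) + 0.2 * rng.standard_normal(10)
    print(fit_profile(t, y).round(2))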

 

·              Ruth Heller, Tel Aviv University

Screening for Partial Conjunction Hypotheses

We consider the problem of testing the partial conjunction null hypothesis that fewer than u out of n null hypotheses are false. It offers an in-between approach to testing the global null that all n null hypotheses are true, and the conjunction null that not all of the n alternative hypotheses are true. We address the problem of testing many partial conjunction hypotheses simultaneously, a problem that arises when combining maps of p-values. Each map contains a large number of locations, and the n p-values per location come from different yet related hypotheses. We suggest powerful test statistics for testing the partial conjunction null hypothesis that are valid under dependence between the p-values as well as under independence. We suggest control of the FDR for testing the partial conjunction hypotheses, and we prove that the Benjamini-Hochberg (BH) FDR controlling procedure remains valid under various dependency structures. We apply our screening method to important examples from microarray meta-analysis and fMRI group analysis, and discuss its usefulness for inference on spatial signals.

This seminar is part of the defense of Ruth Heller's PhD thesis.
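
One concrete way to build a partial conjunction test (Fisher's combination of the n-u+1 largest p-values, valid here under the assumption of independent p-values) is sketched below; screening many locations would then feed these p-values into the BH procedure:

    import numpy as np
    from scipy.stats import chi2

    def partial_conjunction_p(pvals, u):
        """p-value for H0: fewer than u of the n nulls are false.  Under H0 at
        least n - u + 1 nulls are true, so Fisher's method applied to the
        n - u + 1 largest p-values is valid (independent p-values assumed)."""
        p = np.sort(np.asarray(pvals))
        tail = p[u - 1:]                      # p_(u), ..., p_(n)
        stat = -2.0 * np.log(tail).sum()
        return chi2.sf(stat, df=2 * len(tail))

    rng = np.random.default_rng(8)
    quiet = rng.uniform(size=5)               # one location, no signal in 5 studies
    active = np.concatenate([rng.uniform(0, 1e-3, 4), rng.uniform(size=1)])
    print(partial_conjunction_p(quiet, u=3), partial_conjunction_p(active, u=3))
    # across many locations (voxels, genes), these p-values would then be
    # screened with the Benjamini-Hochberg FDR procedure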

 

·             Katherine S. Pollard, UC Davis Genome Center & Department of Statistics

        Detecting Lineage-Specific Evolution

Genomic regions that vary in their patterns of sequence conservation across a phylogeny are interesting candidates for the study of evolutionary shifts in function. We have developed two comparative genomic methods for detecting lineage-specific evolution on a genome-wide scale. The first approach, called DLESS, is based on a phylogenetic hidden Markov model (phylo-HMM), which does not require the lineage of interest or the element boundaries to be determined a priori. Applying DLESS to the ENCODE regions of the human genome, we detected differences in patterns of loss and gain of conserved elements between coding and non-coding regions and between vertebrate clades. DLESS has very little power, however, to identify changes in substitution rate on a single lineage. To address this question, we developed a second method that begins with a set of ancestrally conserved elements and applies a likelihood ratio test to screen these for the subset whose substitution rate is significantly higher in a lineage of interest. With this approach we identified 202 Human Accelerated Regions (HARs), which are highly conserved among mammals but show a significant increase in the rate of substitutions in the human genome since divergence from the chimp-human ancestor. Bioinformatic characteristics of the HARs suggest that many are involved in the regulation of gene expression. The most dramatically accelerated region, HAR1, is part of a novel RNA gene (HAR1F) that is expressed during human cortical development.
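
A caricature of the second method's lineage test (the Poisson substitution model and the numbers below are illustrative assumptions, not the talk's phylogenetic likelihoods): compare a shared-rate null against a lineage-specific rate with a likelihood ratio test:

    import numpy as np
    from scipy.stats import chi2

    def acceleration_lrt(k_focal, t_focal, k_rest, t_rest):
        """Toy LRT for rate acceleration on one lineage, with substitution
        counts modeled as Poisson(rate * branch length).
        H0: one shared rate; H1: the focal lineage has its own rate."""
        def loglik(k, t, r):
            return k * np.log(r * t) - r * t          # log k! cancels in the ratio
        r0 = (k_focal + k_rest) / (t_focal + t_rest)  # pooled ML rate under H0
        r1, r2 = k_focal / t_focal, k_rest / t_rest   # separate ML rates under H1
        lr = 2.0 * (loglik(k_focal, t_focal, r1) + loglik(k_rest, t_rest, r2)
                    - loglik(k_focal, t_focal, r0) - loglik(k_rest, t_rest, r0))
        return chi2.sf(lr, df=1)

    # many substitutions on a short focal branch vs few on a long background:
    print(acceleration_lrt(k_focal=18, t_focal=1.0, k_rest=6, t_rest=10.0))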
function. We have developed two comparative genomic methods for detecting lineage-specific evolution on a genome-wide scale. The first approach, called DLESS, is based on a phylogenetic  hidden Markov model (phylo-HMM), which does not require the lineage of interest or the element boundaries to be determined a priori. Applying DLESS to the ENCODE regions of the human genome, we detected differences in patterns of loss and gain of conserved elements between coding and non-coding regions and between vertebrate clades. DLESS has very little power, however, to identify changes in substitution rate on a single lineage. To address this question, we developed a second method that begins with a set of ancestrally conserved elements and applies a likelihood ratio test to screen these for the subset whose substitution rate is significantly higher in a lineage of interest. With this approach we identified 202 Human Accelerated Regions (HARs),  which are highly conserved among mammals but show a significant increase in the rate of substitutions  in the human genome since divergence from the chimp-human ancestor. Bioinformatic  characteristics of the HARs suggest that many are involved in the regulation of gene expression. The most dramatically accelerated region, HAR1, is part of a novel RNA gene (HAR1F) that is  expressed during human cortical development.