Department of Statistics & Operations Research

Statistics Seminars

2004/2005

Note: the program is not final and is subject to possible changes

First Term

2, November	Ori Davidov, Haifa University
	When is the mean self-consistent?
30, November	Anat Sakov, Tel Aviv University
	Mice Behavior and Laboratories, Statistical Challenges and Proposed Solutions
21, December	David Steinberg, Tel Aviv University
	Identifying critical parameters in simulations: a case study of a nuclear waste repository
28, December	Saharon Rosset, IBM Watson
	A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning
4, January	Eitan Bachmat, Ben Gurion University
	Airplane boarding, disk I/O scheduling, patience sorting, surface growth and space-time geometry

Second Term

22, February	Roelof Helmers, CWI, Amsterdam, The Netherlands
	The Empirical Edgeworth expansion for a studentized trimmed mean
14, March*	Tsachy Weissman, Stanford University
	Discrete Denoising for Channels with Memory
15, March	Peter McCullagh, University of Chicago
	Spatial correlation in field trials
22, March	Malka Gorfine, Bar Ilan University
	Survival analysis with general semiparametric shared frailty model - prospective and retrospective designs.
27, March*	Ibragimov
	On estimation of analytic functions
29, March	Amir Herman, Tel Aviv University
	Analyzing Leukemia Survival Data Focusing on non-Remissing and Remissing patients
24, May	Nicole A. Lazar, University of Georgia
	The Use of Resampling and Visualization for the Comparison of Changepoint Location in Two Independent Curves

Summer Term

27, June*	Jay H. Beder, University of Wisconsin - Milwaukee
	Box-Hunter resolution in arbitrary fractional factorial designs
11, July*	Abraham Wyner, Department of Statistics, The Wharton School, University of Pennsylvania
	Boosted Classification Trees and Class Probability/Quantile Estimation

Seminars are held on Tuesdays, 10.30 am, Schreiber Building, 309 (see the TAU map). is served before.

* Seminar held at other time and place.

The seminar organizer is Daniel Yekutieli. To join the seminar mailing list and get updated information about current/forthcoming seminars, and for other inquiries, call (03)-6409612 or email yekutiel@post.tau.ac.il


ABSTRACTS

Ori Davidov, Haifa University

When is the mean self-consistent?

We study the conditions under which the sample mean is self-consistent, and therefore an optimal predictor, for an arbitrary observation in the sample.

Anat Sakov, Tel Aviv University

Mice Behavior and Laboratories, Statistical Challenges and Proposed Solutions

In the field of behavior genetics, behavior patterns of mice genotypes (strains) are characterized via different measures (end-points), in order to associate them with particular genes. Genotypes are usually compared within a single laboratory, and questions regarding the replicability of results from one laboratory to the other eventually arise. We propose to view this problem using the mixed-effects model. The replicability problem is relevant whenever observations and conclusions are extended beyond a single laboratory (and not only in behavior genetics).

Our approach is presented in the context of mouse locomotor behavior. Mouse locations in an arena are recorded and pre-processed, and then end-points are computed. The process is executed in different laboratories. The differences between genotypes are then assessed using the mixed model.

Time permitting, we will discuss our ongoing research: many end-points are measured on each mouse, and dimension reduction becomes of interest. However, due to the complexity of the data, usual principal component analysis is not valid. We present the problem, our strategy and preliminary results.

This is joint work with Yoav Benjamini, Neri Kafkafi, Greg Elmer, Itay Hen, Ilan Golani and his students.

· David M. Steinberg, Tel Aviv University

Identifying critical parameters in simulations: a case study of a nuclear waste repository

Joint work with: Tamir Reisin and Eyal Hashavia, Soreq Nuclear Research Center; Gideon Leonard, Licensing Division, Israel Atomic Energy Commission.

An important issue in nuclear waste disposal is to assess the potential risk to human life over the very long time scales that are associated with the decay of radioactive isotopes. Typical risk analyses consider the migration of radionuclides into the food and water supply during tens of thousands of years. There is good understanding of the physics that govern decay and migration so that physical models can be derived and implemented in computer simulators, which are thus important tools in carrying out risk analyses. The physical models require as inputs a large number of parameters that govern the interaction between the isotopes and the repository site (for example, precipitation and isotope-specific distribution coefficients). The exact values of the parameters depend on the specifics of the repository site and the isotopes themselves. Prior knowledge of many of the parameter values is often vague. A concern in planning a repository is to identify the parameters that are most influential in controlling the risk.

The focus of this talk will be on a case study to identify critical parameters for a nuclear waste repository using the RESRAD simulator, developed at Argonne National Laboratory. The study involved more than 20 different input parameters. Among numerous outcomes associated with risk, we focus on the maximal equivalent annual dose in the drinking water during a 10,000-year time frame. Many conceivable input settings lead to no migration at all. When migration does occur, the maximal doses can vary by orders of magnitude. Statistical analyses to identify influential parameters must take into account this combination of highly skewed, yet truncated, outcome data.

The problem described here falls into a general area known as “the design and analysis of computer experiments.” We will discuss some general ideas for how to attack such problems and then develop some specific ideas for the waste repository problem. In particular, we propose a two-phase analysis strategy that examines, first, whether or not migration is present and then the extent of migration, assuming that it occurs.

· Saharon Rosset, IBM Watson

A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

The term "semi-supervised learning" is used in the Machine Learning community to describe a predictive modeling task where some of the training labels are known, and some are hidden. We consider the situation in semi-supervised learning where the "label sampling" mechanism stochastically depends on the true response (as well as potentially on the features). We suggest a method of moments for estimating this stochastic dependence using the unlabeled data. This is potentially useful for two distinct purposes: (a) as an input to a supervised learning procedure which can be used to "de-bias" its results using labeled data only, and (b) as a potentially interesting learning task in itself. We present several examples to illustrate the practical usefulness of our method.

Joint work with Ji Zhu, Hui Zou and Trevor Hastie

· Eitan Bachmat, Department of Computer Science, BGU

Airplane boarding, disk I/O scheduling, patience sorting, surface growth and space-time geometry

We show that several discrete random processes, listed in the title, which arise in diverse disciplines can be asymptotically analyzed via two-dimensional space-time geometry. One particularly interesting process models the way passengers board an airplane. We use the geometry to study the effectiveness of airline boarding policies as implemented by announcements of the form "passengers from row 40 and above are now welcome to board the plane", often heard around airport terminals. We will show that the effectiveness of such policies depends crucially on a parameter which is related to the interior design of the airplane (leg room, number of passengers per row). As the parameter increases, the boarding policy experiences a phase transition in which it passes from being effective to being detrimental. Unfortunately, we seem to be on the wrong side of the phase transition.

We will also briefly explain the relation between fluctuations in airplane boarding time and random matrix theory.

If time permits, we will discuss other examples, which include scheduling of I/O requests to a simplified model of a disk drive and the polynuclear growth model, which is a 1+1 dimensional surface growth model in the Kardar-Parisi-Zhang universality class.

No knowledge of space-time (a.k.a. Lorentzian) geometry is needed (the speaker himself hardly knows anything about it).

Joint work with Danny Berend, Luba Sapir and Steve Skiena.

· Roelof Helmers, CWI, Amsterdam, The Netherlands

The Empirical Edgeworth expansion for a studentized trimmed mean

We establish the validity of the empirical Edgeworth expansion (EE) for a studentized trimmed mean, under the sole condition that the underlying distribution function of the observations satisfies a local smoothness condition near the two quantiles where the trimming occurs.

A simple explicit formula for the $N^{-1/2}$ term of the EE, correcting for skewness and bias ($N$ being the sample size), will be given.

In particular, our result supplements previous work by Hall and Padmanabhan (1992) and Putter and van Zwet (1998).

The proof is based on a U-statistic type approximation and also uses a version of Bahadur's (1966) representation for sample quantiles.

This is joint work with Nadezhda Gribkova (St. Petersburg).

· Tsachy Weissman, Stanford University

Discrete Denoising for Channels with Memory

The problem of estimating a finite-alphabet signal corrupted by a finite-alphabet noise process, aka "discrete denoising", is arising in an increasing variety of applications spanning engineering, statistics, computer science, and biology.

For the case where the corruption mechanism (channel) is memoryless (independent noise components), a practical (linear time) algorithm dubbed DUDE (Discrete Universal DEnoiser) was recently introduced. This denoiser was shown to achieve optimum performance (in the limit of large data sets), with no a priori knowledge of statistical (or any other) properties of the noiseless data.

This talk will present an extension of the algorithm that accommodates possible memory (dependence) in the noise. We establish asymptotic optimality of the proposed denoiser under a mild mixing condition on the noise process. The algorithm has near-linear complexity in both time and space. We present empirical evidence supporting the theoretical predictions and highlighting the benefit in taking the channel memory into account.

Based on joint work with Rui Zhang.

Time permitting, I will briefly mention related research on discrete denoising.

· Peter McCullagh, University of Chicago

Spatial correlation in field trials

Fairfield Smith (1935) initiated the first systematic study of the nature of spatial correlation in field trials. The variation of yields from plots of various sizes was studied by aggregation of adjacent plots. It was found that the sample variance per unit area does not decrease in proportion to the plot area as would be expected if yields on distinct plots were independent. Fairfield Smith found that the sample variance per unit area decreases according to a power law. The index is not a universal agricultural constant: it ranges from 0.25 to 0.75 depending on the crop and on the season.
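As a rough illustration of how such a power-law index might be estimated, the sketch below assumes the variance per unit area follows V(x) = V(1) · x^(-b) for plots of x unit areas, and recovers b as the negated slope of a least-squares line on the log-log scale. The function name `power_law_index` and the data are hypothetical and synthetic; they are not taken from the study. Under independence between plots one would expect b = 1.

```python
import math

def power_law_index(plot_sizes, variances):
    """Estimate b in V(x) = V(1) * x**(-b) by least squares on the log-log scale.

    plot_sizes: plot areas in units of the basic plot (all > 0)
    variances:  sample variance per unit area for each plot size (all > 0)
    """
    xs = [math.log(s) for s in plot_sizes]
    ys = [math.log(v) for v in variances]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    # The fitted slope is -b, so negate it to report the index.
    return -sxy / sxx

# Synthetic, noise-free data generated with index b = 0.5 (an assumption
# chosen to lie inside the 0.25 to 0.75 range quoted in the abstract).
sizes = [1, 2, 4, 8, 16]
variances = [10.0 * s ** -0.5 for s in sizes]
b_hat = power_law_index(sizes, variances)
```

With real uniformity-trial data the points would scatter around the line, and the fitted index would carry sampling error that this sketch does not quantify.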

This talk describes a large-scale study of 25 uniformity trials, many of which were also studied by Fairfield Smith. The primary purpose of the study is not so much the comparison of strategies for the estimation of variety effects, as the understanding of natural or non-anthropogenic spatial variation of crop yields. The term 'field crop' is interpreted to include annual cash crops such as cereals, beans, potatoes, beets and brassicas, and also fruit crops such as oranges, lemons, peaches, apples, olives and walnuts. In each trial, yields were recorded on a semi-regular grid of rectangular plots of known size and known spacing. Geometric information is necessary in order to study deviations from isotropy and to study spatial covariances.

Our findings are as follows:
(i) Agricultural processes have infinite range. (ii) The spatial component of variation is not only self-similar but also conformally invariant. (iii) Most agricultural processes are isotropic or close to isotropic. Where anisotropy is present, it is associated with the direction of drills in the field.

These conclusions point to a generalized covariance model of the form
$$ \operatorname{cov}(Y(x), Y(x')) = \sigma_0^2 \delta_{x - x'} - \sigma_1^2 \log|x - x'| $$
in which the first term is white noise and the second term is the de Wijs process defined in integrated contrasts. Non-isotropic anthropogenic effects associated with rows and/or columns can be included in addition if necessary.

Details of the study are given in the technical report www.stat.uchicago.edu/~pmcc/reml/terroir.pdf, written jointly with David Clifford.

· Malka Gorfine, Department of Mathematics, Bar-Ilan University

Survival analysis with general semiparametric shared frailty model - prospective and retrospective designs.

In this talk a simple estimation procedure will be presented for a general frailty model, for the analysis of prospective correlated failure times and for the analysis of failure time data from case-control family studies. Large-sample theory for the proposed estimators of both the regression coefficient vector and the dependence parameter, under each design, will be discussed. The proposed approaches provide a framework capable of handling general frailty distributions with finite moments and yield an explicit consistent variance estimator.

Joint work with D.M. Zucker and L. Hsu

· Ibragimov

On estimation of analytic functions

No abstract provided

· Amir Herman, Tel Aviv University

Analyzing Leukemia Survival Data Focusing on non-Remissing and Remissing patients

Analyzing survival data in oncological diseases (e.g. leukemia) is different from analyzing survival data for non-oncological diseases. In classic survival data analysis the event is defined as the death of a patient. However, in leukemias and some other cancers we often use ‘event-free survival’ to describe the desirable endpoint of the treatment. In this case an event is considered to be either relapse of the disease or the death of the patient.

This endpoint entails a problem: not all the patients are disease-free at the point of entry to the study. In fact, inducing remission of the disease is itself one of the desirable endpoints of a treatment.

In our work we compare two methods to analyze such data. One method is to include the non-remissing patients as events at time zero, since the disease in these patients has never been remitted. The second method is to split the analysis and to analyze the patients from diagnosis to remission and from remission to the familiar endpoint of “relapse or death”.

We conducted a simulation to examine the errors and the influential parameters on the bias of the parameter estimates using the Cox proportional hazards model. The bias of the estimator from the split method was influenced by the sample size and time of follow-up. The main parameters that influenced the parameter estimate of the method with events at time zero, were the relative risk at time 0 and the proportional size of the non-remissing patients group.

We then analyzed pediatric AML data using the split analysis approach and the whole analysis approach. A further advantage of the split approach was the ability to include variables that exist only at the time of remission. One such variable is the remission index, defined as:

{log(platelets at time of remission) / log(platelets at time of diagnosis)} × {log(hemoglobin at time of remission) / log(hemoglobin at time of diagnosis)}. This index was found to be an important prognostic factor for remaining in remission.
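The remission index defined above can be computed directly, as in the following minimal sketch. The function name, the argument names and the example values are hypothetical; the abstract does not state the measurement units or the base of the logarithm. Each factor is a ratio of two logarithms, so the result does not depend on the base, but it does depend on the units in which the counts are expressed.

```python
import math

def remission_index(platelets_dx, platelets_rem, hgb_dx, hgb_rem):
    """Product of the two log-ratios named in the abstract.

    *_dx  = value at time of diagnosis, *_rem = value at time of remission.
    All inputs must be positive, and the diagnosis values must differ
    from 1, otherwise a denominator log(.) would be zero.
    """
    platelet_ratio = math.log(platelets_rem) / math.log(platelets_dx)
    hemoglobin_ratio = math.log(hgb_rem) / math.log(hgb_dx)
    return platelet_ratio * hemoglobin_ratio

# Hypothetical patient (units assumed for illustration only):
# platelets 50 -> 200 (10^9 per litre), hemoglobin 8 -> 12 (g/dL).
index = remission_index(50, 200, 8, 12)
```

An index above 1 in this example reflects both counts rising between diagnosis and remission; a patient whose values are unchanged has an index of exactly 1.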

· Nicole A. Lazar, Department of Statistics, University of Georgia

The Use of Resampling and Visualization for the Comparison of Changepoint Location in Two Independent Curves

In many psychological and psychiatric studies, a research question of interest relates to the presence, or absence, of a changepoint. Working with developmental data, for instance, researchers might want to know at what age the reaction time of children on a particular task becomes the same as that of adults. Much research has been done over the years on the changepoint problem for a single sample. I consider in this talk a generalization of the simple changepoint problem, namely, comparing the location of the changepoint in two independent samples.

The psychological question that motivated this work was twofold: first, to discover at what age children stop making errors (on a visual processing task) at a higher rate than adults; and second, to discover whether this age is different for autistics than for healthy controls. Clearly, these two aspects are linked, and involve locating and comparing the changepoints (if they exist) for the two groups. A straightforward algorithm gives practitioners an easily implemented way to find the changepoints in each group separately. A combination of visualization and resampling (permutation and bootstrap) allows one to further attach significance statements. In this talk, I will highlight some of the issues that arose in attempting to find a solution to the problem as it was presented, and point to open questions that still remain.

· Jay H. Beder, Dept. of Mathematical Sciences, University of Wisconsin – Milwaukee

Box-Hunter resolution in arbitrary fractional factorial designs

In 1961 Box and Hunter defined the resolution of a regular fractional factorial design as a measure of the amount of aliasing in the fraction. They indicated that the maximum resolution is equal to the minimum length of a defining word.

Since then, various approaches have been offered to generalize the concept of resolution to arbitrary (possibly mixed-level) fractions. These have generally been based on estimability and on the assumption that high-order interactions are absent, rather than on the alias structure of the fraction. In this talk we will formulate a generalization of Box-Hunter resolution based on an idea that may be traced back to Rao (1947). Using it, we show that in an arbitrary fraction of maximum strength t and maximum resolution R, we have R = t+1. This generalizes the wordlength criterion.
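For readers unfamiliar with the wordlength criterion, the sketch below illustrates it in the classical setting only: a regular two-level fraction given by independent generator words. It does not implement the generalization to arbitrary fractions discussed in the talk. The function name `resolution` and the example generators are illustrative assumptions.

```python
from itertools import combinations

def resolution(generators):
    """Resolution of a regular two-level fraction: the length of the
    shortest word in the defining relation generated by `generators`.

    Each generator is a string of factor letters, e.g. "ABCE" for I = ABCE.
    Generators are assumed to be independent.
    """
    gens = [frozenset(g) for g in generators]
    words = set()
    for r in range(1, len(gens) + 1):
        for combo in combinations(gens, r):
            word = frozenset()
            for g in combo:
                # Multiplying words cancels squared letters, which is
                # the symmetric difference of the letter sets.
                word = word ^ g
            if word:  # skip the identity, which only arises from dependent generators
                words.add(word)
    return min(len(w) for w in words)

# 2^(6-2) design with I = ABCE = BCDF (= ADEF): shortest word has 4 letters.
res_iv = resolution(["ABCE", "BCDF"])
# 2^(5-1) design with I = ABCDE: a single word of 5 letters.
res_v = resolution(["ABCDE"])
```

In the notation of the abstract, a design of resolution R in this sense has strength t = R - 1, which matches the relation R = t+1 in the regular case.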

· Abraham Wyner, Department of Statistics, The Wharton School, University of Pennsylvania

Boosted Classification Trees and Class Probability/Quantile Estimation

The standard by which binary classifiers are usually judged, misclassification error, assumes equal cost of misclassifying the two classes or, equivalently, classifying at the 1/2 quantile of the conditional class probability function $P(y=1|x)$. Boosted classification trees are known to perform quite well for such problems. In this talk we consider the use of boosting algorithms for two more general problems: 1) classification with unequal costs or, equivalently, classification at quantiles other than 1/2, and 2) estimation of the conditional class probability function $P(y=1|x)$. We then consider the practice of over/under-sampling of the two classes. We present an algorithm that uses the boosting algorithm AdaBoost in conjunction with Over/Under-Sampling and Jittering of the data ("JOUS-Boost"). This algorithm is simple, yet successful, and it preserves the advantage of relative protection against overfitting, but for arbitrary misclassification costs and, equivalently, arbitrary quantile boundaries.
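The equivalence between unequal costs and classification at a quantile other than 1/2 can be made concrete with a short sketch. It assumes the class probability p = P(y=1|x) is known, and that a false positive costs `cost_fp` and a false negative costs `cost_fn` (both names are hypothetical). Predicting class 1 has expected cost cost_fp · (1 - p) and predicting class 0 has expected cost cost_fn · p, so class 1 is preferred exactly when p exceeds cost_fp / (cost_fp + cost_fn). This is not JOUS-Boost itself, only the decision rule that motivates it.

```python
def cost_threshold(cost_fp, cost_fn):
    """Quantile q such that predicting class 1 minimizes expected cost when p > q."""
    return cost_fp / (cost_fp + cost_fn)

def classify(p, cost_fp, cost_fn):
    """Cost-minimizing label for a known class probability p = P(y=1|x)."""
    return 1 if p > cost_threshold(cost_fp, cost_fn) else 0

# Equal costs recover the usual 1/2 cutoff.
q_equal = cost_threshold(1.0, 1.0)
# A false negative four times as costly as a false positive lowers the cutoff to 0.2.
q_unequal = cost_threshold(1.0, 4.0)
label = classify(0.3, 1.0, 4.0)
```

In practice p is unknown, which is why estimating $P(y=1|x)$, or classifying directly at the required quantile, becomes the problem of interest.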