Department of Statistics & Operations Research

Statistics Seminars

2017/2018

To subscribe to the list, please follow this link or send email to 12345yekutiel@post.tau.ac.il54321 (remove numbers unless you are a spammer…)


 

 

 

 

Second  Semester

 

 

5 March

James A. Evans, University of Chicago 

 

Centralized Scientific Communities More Likely Generate Non-replicable Results

19 March

Nalini Ravishanker, University of Connecticut

 

Modeling Inter-event Durations in High-Frequency Time Series

26 March

Stacey Cherny, Tel Aviv University

 

Longitudinal Heritability of Childhood Aggression: Twin Modelling using SEM

16 April

Uri Shalit, Technion

 

Predicting individual-level treatment effects in patients: challenges and proposed best practices

30 April

Lucas Janson, Harvard University

 

Modeling X in High-Dimensional Inference

4 June

Giles Hooker, Cornell

 

Random Forests and Inference: U and V Statistics

11 June

Judith Somekh, Haifa University

 

Batch correction evaluation framework using a-priori gene-gene associations: applied to the GTEx dataset 

18 June

Dorothea Dumuid, University of South Australia

 

Statistical Adventures in the Emerging Field of Time-Use Epidemiology

 

 

 

First Semester

23 October

Adam Kapelner, City University of New York

 

Harmonizing Fully Optimal Designs with Classic Randomization in Fixed Trial Experiments

6 November

Daniel Nevo, TAU

 

LAGO: The adaptive Learn-As-you-GO design for multi-stage intervention studies

27 November

Liran Katzir,  Final Ltd.

 

Social network size estimation via sampling

25 December

Bella Vakulenko-Lagun, Harvard

 

Some methods to recover from selection bias in survival data

1 January

Meir Feder, TAU

 

Universal Learning for Individual Data

8 January

Adi Berliner Senderey, Clalit

 

Effective implementation of evidence based medicine in Healthcare

 

 

 

 

 
 
 
 
 
 
 

 

 
 
 

 

 

 

 

 

 

 


 

 

Seminars are held on Tuesdays, 10.30 am, Schreiber Building, 309 (see the TAU map ). The seminar organizer is Daniel Yekutieli.

To join the seminar mailing list or any other inquiries - please call (03)-6409612 or email 12345yekutiel@post.tau.ac.il54321 (remove numbers unless you are a spammer…)

 


Seminars from previous years


 

 

 

ABSTRACTS

 

 

 

 

 

·         Daniel Nevo, TAU

 

LAGO: The adaptive Learn-As-you-GO design for multi-stage intervention studies

 

In large-scale public-health intervention studies, the intervention is a package consisting of multiple components. The intervention package is chosen in a small pilot study and then implemented in a large-scale setup. However, for various reasons I will discuss, this approach can lead to an implementation failure.

In this talk, I will present a new design, called the learn-as-you-go (LAGO) adaptive design. In the LAGO design, the intervention package is adapted in stages during the study based on past outcomes. Typically, an effective intervention package is sought, while minimizing cost. The main complication when analyzing data from a LAGO study is that interventions in later stages depend upon the outcomes in the previous stages. Under the setup of logistic regression, I will present asymptotic theory for LAGO studies and tools that can be used by researchers in practice. The LAGO design will be illustrated via application to the BetterBirth Study, which aimed to improve maternal and neonatal outcomes in India.

 

 

 

 

·         Adam Kapelner, City University of New York

 

Harmonizing Fully Optimal Designs with Classic Randomization in Fixed Trial Experiments

 

There is a movement in design of experiments away from the classic randomization put forward by Fisher, Cochran and others to one based on optimization. In fixed-sample trials comparing two groups, measurements of subjects are known in advance and subjects can be divided optimally into two groups based on a criterion of homogeneity or "imbalance" between the two groups. These designs are far from random. This talk seeks to understand the benefits and the costs over classic randomization in the context of different performance criteria such as Efron's worst-case analysis. In the criterion that we motivate, randomization beats optimization. However, the optimal design is shown to lie between these two extremes. Much-needed further work will provide a procedure to find this optimal design in different scenarios in practice. Until then, it is best to randomize.

 

 

 

 

·         Liran Katzir,  financial algorithms researcher at Final Ltd.

 

Social network size estimation via sampling

 

This presentation addresses the problem of estimating the number of users in online social networks. While such networks occasionally publish user numbers, there are good reasons to validate their reports. The proposed algorithm can also estimate the cardinality of network sub-populations.  Since this information is seldom voluntarily divulged, algorithms must limit themselves to the social networks’ public APIs. No other external information can be assumed.  Additionally, due to obvious traffic and privacy concerns, the number of API requests must also be severely limited. Thus, the main focus is on minimizing the number of API requests needed to achieve good estimates. Our approach is to view a social network as an undirected graph and use the public interface to produce a random walk. By counting the number of collisions, an estimate is produced using a non-uniform samples version of the birthday paradox. The algorithms are validated on several publicly available social network datasets.
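
The collision idea can be illustrated with a minimal Python sketch. This is not the speaker's algorithm: it assumes the sampled nodes behave like independent draws from the random walk's stationary distribution, in which a node is visited with probability proportional to its degree. Under that assumption the expected number of colliding pairs, the sum of sampled degrees, and the sum of inverse sampled degrees combine into the size estimate (sum of degrees) × (sum of inverse degrees) / (2 × collisions). The function names are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def count_collisions(samples: Sequence[Hashable]) -> int:
    """Count unordered pairs of sample positions that hit the same node."""
    counts = Counter(samples)
    return sum(c * (c - 1) // 2 for c in counts.values())


def estimate_size(samples: Sequence[Hashable], degrees: Sequence[int]) -> float:
    """Degree-weighted birthday-paradox estimate of the number of nodes.

    samples[i] is the node visited at step i and degrees[i] is its degree.
    Assumption (not from the abstract): samples are independent draws with
    probability proportional to degree.
    """
    if len(samples) != len(degrees):
        raise ValueError("samples and degrees must have the same length")
    collisions = count_collisions(samples)
    if collisions == 0:
        raise ValueError("no collisions observed; draw more samples")
    sum_deg = sum(degrees)
    sum_inv_deg = sum(1.0 / d for d in degrees)
    return sum_deg * sum_inv_deg / (2 * collisions)
```

When all degrees are equal the estimate reduces to r² / (2C) for r samples and C collisions, which is the familiar birthday-paradox estimate for uniform sampling.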

 

 

 

·         Bella Vakulenko-Lagun, Harvard

 

Some methods to recover from selection bias in survival data

 

We consider several study designs resulting in truncated survival data.  First, we look at a study with delayed entry, where the left truncation time and the lifetime of interest are dependent. The critical assumption in using standard methods for truncated data is the assumption of quasi-independence or factorization. If this condition does not hold, the standard methods cannot be used. We address one specific scenario that can result in dependence between truncation and event times - this is covariates-induced dependent truncation. While in regression models for time-to-event data this type of dependence does not present any problem, in nonparametric estimation of the lifetime distribution P(X), ignoring the dependence might cause bias. We propose two methods that are able to account for this dependence and allow consistent estimation of P(X).

 

Our estimators for dependently truncated data will be inefficient if we use them when there is no dependence between truncation and event times. Therefore it is important to test for independence. The common knowledge is that we can test for quasi-independence, that is "independence in the observable region". We derived two other conditions, called factorization conditions, which are indistinguishable from quasi-independence, given data at hand. This means that in the standard analysis of truncated data, when we assume quasi-independence, we ultimately make an untestable assumption in order to estimate the distribution of the target lifetime. This non-identifiability problem has not been recognized before.

 

Finally, we consider retrospectively ascertained time-to-event data resulting in right truncation, and discuss estimation of regression coefficients in the Cox model. We suggest an approach that incorporates external information in order to solve the problem of non-positivity that often happens with right-truncated data. 

 

 

·         Meir Feder, TAU

 

 

Universal Learning for Individual Data

 

Universal learning is considered from an information theoretic point of view following the universal prediction approach originated by Solomonoff, Kolmogorov, Rissanen, Cover, Ziv and others and developed in the 90's by F&Merhav. Interestingly, the extension to learning is not straightforward. In previous works we considered on-line learning and supervised learning in a stochastic setting. Yet, the most challenging case is batch learning, where prediction is done on a test sample once the entire training data is observed, in the individual setting where the features and labels, both of the training and test, are specific individual quantities.

 

Our results provide schemes that for any individual data compete with a "genie" (or reference) that knows the true test label. We suggest design criteria and develop the corresponding universal learning schemes, where the main proposed scheme is termed Predictive Normalized Maximum Likelihood (pNML). We demonstrate that pNML learning and its variations provide robust, "stable" learning solutions that outperform the current leading approach based on Empirical Risk Minimization (ERM). Furthermore, the pNML construction provides a pointwise indication for the learnability. This measures the uncertainty in learning the specific test challenge with the given training examples, letting the learner know when it does not know.

 

Joint work with Yaniv Fogel and Koby Bibas

 

 

 

·         Adi Berliner Senderey, Clalit

 

Effective implementation of evidence based medicine in Healthcare

 

Two projects illustrating use of data for determining effective treatment policies are presented.

 

1. Machine Learning in Healthcare – Shifting the Focus to Fairness  – by Noam Barda

This project deals with an algorithm for improving fairness in predictive models. The method is meant to address concerns regarding potential unfairness of prediction models towards groups which are underrepresented in the training dataset and thus might receive uncalibrated scores. The algorithm was implemented on widely used risk models, including the ACC/AHA 2013 model for cardiovascular events and the FRAX model for osteoporotic fractures, and tested on a large real-world sample. Based on a joint work with Noa Dagan, Guy Rothblum, Gal Yona, Ran Balicer and Eitan Bachmat.

 

2. Rates of Ischemic stroke, Death and Bleeding in Men and Women with Non-Valvular Atrial Fibrillation – by Adi Berliner Senderey

Data regarding the thromboembolic risk and differences in outcomes in men and women with non-valvular atrial fibrillation (NVAF) are inconsistent. The aim of the present study is to evaluate differences in treatment strategies and risk of ischemic stroke, death, and bleeding between men and women in a large, population-based cohort of individuals with NVAF. Based on a joint work with Yoav Arnson, Moshe Hoshen, Adi Berliner Senderey, Orna Reges, Ran Balicer, Morton Leibowitz, Meytal Avgil Tsadok, Moti Haim.

 

 

 

·         James A. Evans, University of Chicago 

 

 

Centralized Scientific Communities More Likely Generate Non-replicable Results


Growing concern that published results, including those widely agreed upon, may lack replicability is often modeled, but rarely empirically examined amidst the rapid increase of biomedical publications. We introduce a novel, high-throughput replication strategy aligning 64,412 published findings about 51,292 distinct drug-gene interaction claims (e.g., Benzo(a)pyrene decreases expression of SLC22A3) with high-throughput experiments performed through the NIH LINCS L1000 program. We show (1) that claims reported in a single paper replicate 19.0% (95% confidence interval [CI], 16.9% to 21.2%) more frequently than expected, while those reported in multiple papers and widely agreed upon replicate 45.5% (95% CI, 21.8% to 74.2%) more frequently, manifesting collective correction in science. Nevertheless (2), among the 2,493 interactions reported in two or more papers, centralized scientific communities perpetuate less replicable claims, demonstrating how centralized collaborations weaken collective inquiry. Decentralized, disconnected research communities involve more independent teams, use more diverse methodologies, and draw on more diverse prior knowledge, generating the most robust, replicable results. Our findings highlight the importance of policies that foster decentralized collaboration to promote robust biomedical advance. Our large-scale approach holds promise for identifying reliable biomedical results out of numerous published experiments.

 

 

·         Nalini Ravishanker, University of Connecticut, Storrs

 

 

Modeling Inter-event Durations in High-Frequency Time Series

 

This talk will discuss statistical analysis of durations between events for high-frequency financial time series obtained from the Trade and Quotes (TAQ) database. The class of logarithmic autoregressive conditional duration (Log ACD) models provides a rich framework for analyzing durations, and recent research is focused on developing fast and accurate methods for fitting these models to long time series of durations under the least restrictive assumptions. This talk will describe use of Godambe-Durbin martingale estimating functions, and will discuss three approaches for parameter estimation: solution of nonlinear estimating equations, recursive formulas for the vector-valued parameter estimates, and iterated component-wise scalar recursions. It will further show how penalizing the estimating functions can achieve sparsity. This is joint work with Yaohua Zhang, Jian Zou, and A. Thavaneswaran.

 

 

 

 

·         Stacey Cherny, Tel Aviv University

 

 

Longitudinal Heritability of Childhood Aggression: Twin Modelling using SEM

 

Twin studies are a powerful tool for partitioning trait variance and covariance into genetic and environmental components. I discuss the basics of the methodology, why the twin study is an important tool in psychology and medicine, and apply this methodology to the study of childhood aggression using two of the largest twin cohorts ever collected, the Netherlands Twin Register (NTR) and the Twins Early Development Study (TEDS; United Kingdom). In NTR, maternal ratings on aggression from the Child Behavior Checklist (CBCL) were available for 10,765 twin pairs at age 7, for 8,557 twin pairs at age 9/10, and for 7,176 twin pairs at age 12. In TEDS, parental ratings of conduct disorder from the Strengths and Difficulties Questionnaire (SDQ) were available for 6,897 twin pairs at age 7, 3,028 twin pairs at age 9, and 5,716 twin pairs at age 12. In both studies, stability and heritability of aggressive behavioral problems were high. Heritability was on average somewhat, but significantly, lower in TEDS (around 60%) than in NTR (between 50% and 80%) and sex differences were slightly larger in the NTR sample. In both studies, the influence of shared environment was similar: in boys, shared environment explained around 20% of the variation in aggression across all ages, while in girls its influence was absent around age 7 and only came into play at later ages. Longitudinal genetic correlations were the main reason for stability of aggressive behavior.
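
As a back-of-the-envelope illustration of how twin data partition variance, the sketch below applies Falconer's classical formulas to the correlations of identical (MZ) and fraternal (DZ) twin pairs. This is a much simpler device than the structural equation models used in the talk, and the input correlations are hypothetical values chosen for illustration, not figures from NTR or TEDS.

```python
def falconer_components(r_mz: float, r_dz: float) -> dict:
    """Split standardized trait variance using Falconer's formulas.

    r_mz: trait correlation between identical (monozygotic) twins
    r_dz: trait correlation between fraternal (dizygotic) twins
    """
    a2 = 2.0 * (r_mz - r_dz)  # additive genetic share (heritability)
    c2 = 2.0 * r_dz - r_mz    # shared-environment share
    e2 = 1.0 - r_mz           # non-shared environment plus measurement error
    return {"A": a2, "C": c2, "E": e2}


# Hypothetical correlations, not taken from either cohort.
parts = falconer_components(r_mz=0.8, r_dz=0.5)
print(parts)
```

The three shares always sum to one by construction, which is a quick sanity check on the arithmetic.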

 

 

 

·         Uri Shalit, Technion

 

Predicting individual-level treatment effects in patients: challenges and proposed best practices

 

One of the most inspiring promises of using machine learning in healthcare is learning how to optimally treat individual patients based on data from past patients. I will discuss the challenges that come up when addressing this task, and why standard machine learning methods can catastrophically fail. I will then propose best-practices based on ideas from causal inference, along with the necessary identification assumptions for learning treatment recommendations. I will present two case studies: one dealing with treatment of chronic disease using data from a large health provider, and one dealing with acute care using data from a university hospital.

 

 

 

 

 

 

·         Lucas Janson, Harvard University

 

Modeling X in High-Dimensional Inference

For answering questions about the relationship between a response variable Y and a set of explanatory variables X, most statistical methods focus their assumptions on the conditional distribution of Y given X (or Y | X for short). I will describe some benefits of shifting those assumptions from the conditional distribution Y | X to the distribution of X instead, especially when X is high-dimensional. I will briefly review my recent methodological work on knockoffs and the conditional randomization test for high-dimensional controlled variable selection, and explain how the model-X framework endows them with desirable properties like finite-sample error control, power, modeling flexibility, and robustness. Then I will discuss two new papers: The first is a general computational framework using tools from Markov chain Monte Carlo for exactly sampling model-X knockoffs for arbitrary distributions for X, even when the normalizing constant is unknown. The second is how to perform exact high-dimensional controlled variable selection while only assuming a flexible parametric model for X.
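
The conditional randomization test mentioned above can be sketched in a few lines of Python. The sketch is illustrative only and rests on loud assumptions: the distribution of the covariate of interest given the other covariates is taken to be known and is supplied by the caller as a sampler, and the test statistic (absolute correlation with Y) is an arbitrary choice. All names are hypothetical.

```python
import random
from statistics import mean


def abs_corr(x, y):
    """Absolute Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return abs(sxy) / (sxx * syy) ** 0.5


def crt_p_value(x_j, y, sample_x_j, n_resamples=199, rng=None):
    """Conditional randomization test p-value for one covariate.

    sample_x_j(rng) must return a fresh copy of the covariate drawn from
    its (assumed known) distribution given the other covariates.
    """
    rng = rng or random.Random(0)
    t_obs = abs_corr(x_j, y)
    exceed = 0
    for _ in range(n_resamples):
        if abs_corr(sample_x_j(rng), y) >= t_obs:
            exceed += 1
    # The +1 terms count the observed statistic itself, keeping the test valid.
    return (1 + exceed) / (n_resamples + 1)


# Toy data: the covariate is assumed to be standard normal and independent
# of any other covariates, so its conditional law equals its marginal law.
data_rng = random.Random(1)
n = 50
x = [data_rng.gauss(0, 1) for _ in range(n)]
y = [2 * a + data_rng.gauss(0, 1) for a in x]
p = crt_p_value(x, y, lambda r: [r.gauss(0, 1) for _ in range(n)])
```

Note that the only modeling assumption sits in the sampler for X; nothing is assumed about how Y depends on X.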

 

 

 

·         Giles Hooker, Cornell

Random Forests and Inference: U and V Statistics

Breiman's "Two Cultures" essay presents a dichotomy between model-based and algorithmic statistics. This talk presents results that start to bridge this gap. We show that when the bootstrap procedure in Random Forests is replaced with subsampling, the resulting learner can be represented as an infinite-order, incomplete, random-kernel U-statistic for which a limiting normal distribution can be derived. Moreover, the limiting variance can be estimated at no additional work. Very recent work shows that subsampling with replacement, falling into the category of V-statistics, yields improved variance estimates.

Using this result, we can compare the predictions made by a model learned with a feature of interest, to those made by a model learned without it and ask whether the differences between these could have arisen by chance. By evaluating the model at a structured set of points we can also ask whether it differs significantly from an additive model. We demonstrate these results in an application to citizen-science data collected by Cornell's Laboratory of Ornithology.
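
For readers unfamiliar with the terminology, the following Python sketch shows what complete and incomplete U-statistics are, using a deliberately simple kernel. It is not the random-forest construction from the talk, where the kernel would be a tree's prediction from a subsample. Here the kernel is the classical pairwise one whose complete U-statistic equals the unbiased sample variance. Function names are hypothetical.

```python
import random
from itertools import combinations
from statistics import mean, variance


def variance_kernel(pair):
    """Symmetric kernel of order 2 whose U-statistic is the sample variance."""
    a, b = pair
    return (a - b) ** 2 / 2.0


def complete_u_statistic(data, kernel, k):
    """Average the kernel over every size-k subset of the data."""
    return mean(kernel(subset) for subset in combinations(data, k))


def incomplete_u_statistic(data, kernel, k, n_subsamples, rng):
    """Average the kernel over randomly drawn size-k subsamples.

    Each subsample is drawn without replacement, mirroring the
    subsampling (rather than bootstrap) scheme described above.
    """
    return mean(kernel(rng.sample(data, k)) for _ in range(n_subsamples))


data = [1.0, 4.0, 2.0, 7.0]
full = complete_u_statistic(data, variance_kernel, k=2)
approx = incomplete_u_statistic(data, variance_kernel, 2, 200, random.Random(0))
```

The incomplete version trades exactness for cost: it evaluates the kernel on a fixed number of subsamples rather than on all of them, which is what makes the approach feasible when each kernel evaluation means fitting a tree.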

 

 

 

·         Judith Somekh, Haifa University

Batch correction evaluation framework using a-priori gene-gene associations: applied to the GTEx dataset

Correcting a heterogeneous dataset that presents artefacts from several confounders is often an essential bioinformatics task. Attempting to remove these batch effects will result in some biologically meaningful signals being lost. Thus, a central challenge is assessing if the removal of unwanted technical variation harms the biological signal that is of interest to the researcher. I will describe a novel framework to evaluate the effectiveness of batch correction methods and their tendency toward over- or under-correction. The approach is based on comparing co-expression of adjusted gene-gene pairs to a-priori knowledge of highly confident gene-gene associations based on thousands of unrelated experiments derived from an external reference. The framework includes three steps: (1) data adjustment with the desired methods; (2) calculating gene-gene co-expression measurements for adjusted datasets; (3) evaluating the performance of the co-expression measurements against a gold standard. Using the framework, five batch correction methods were evaluated and applied to RNA-seq data of six representative tissue datasets derived from the GTEx project.

 

 

 

 

 

 

 

 

·         Dot Dumuid, University of South Australia

Statistical Adventures in the Emerging Field of Time-Use Epidemiology

How we spend our time influences our health. Previously, researchers have been concerned with relationships between individual activities (e.g., sleep or physical activity) and health. Yet these activities are all inter-related; you can't change one without changing others. This is because each day we get a limited amount of time – 24 hours. We can spend this time in three ways: sleeping, being sedentary or being active. These three activities are mutually exclusive (we can only do one at a time) and exhaustive (there is never a time when we are not doing one of them). Spending more time in one activity can only be at the expense of the other behaviours. Thus time-use data are a special kind of data – they are compositional data, and should be analysed using statistical methods that respect their unique properties.

In this talk, we explore how compositional data analysis can be used to model time-use data. I will present examples of compositional data analysis to (1) predict how a health outcome may change when a set duration (e.g. 30 minutes) is reallocated from one activity to another; (2) explore how activity composition is predicted to change across chronic disease progression; (3) determine whether there are differences in activity composition between groups of people (e.g., socioeconomic groups, control/treatment groups in a clinical trial). Finally, I will share preliminary findings for the optimisation of time spending across sleep, sedentary behaviour and physical activity, in order to achieve the best overall health.
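
A minimal Python sketch of the two basic operations behind this kind of analysis follows: expressing a day as a composition with a log-ratio transform, and reallocating a fixed duration between activities while keeping the 24-hour total. The centred log-ratio (clr) is used here as one common choice of transform; the abstract does not say which transform the speaker uses, and the daily durations are hypothetical.

```python
import math

MINUTES_PER_DAY = 1440.0


def closure(parts, total=1.0):
    """Rescale positive parts so that they sum to `total`."""
    s = sum(parts)
    return [total * p / s for p in parts]


def clr(parts):
    """Centred log-ratio transform: log of each part minus the mean log."""
    logs = [math.log(p) for p in parts]
    centre = sum(logs) / len(logs)
    return [v - centre for v in logs]


def reallocate(day, src, dst, minutes):
    """Move `minutes` from activity `src` to `dst`; the total is unchanged."""
    if day[src] <= minutes:
        raise ValueError("source activity must keep a positive duration")
    new_day = dict(day)
    new_day[src] -= minutes
    new_day[dst] += minutes
    return new_day


# Hypothetical day, in minutes.
day = {"sleep": 480.0, "sedentary": 600.0, "active": 360.0}
new_day = reallocate(day, "sedentary", "active", 30.0)
coords = clr(closure(list(day.values())))
```

The clr coordinates always sum to zero, reflecting the fact that a three-part composition carries only two free dimensions.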