

5 March 
James A. Evans, University of Chicago 

Centralized Scientific Communities More Likely Generate Nonreplicable Results 
19 March 
Nalini Ravishanker, University of Connecticut 

Modeling Interevent Durations in High-Frequency Time Series
26 March 
Stacey Cherny, Tel Aviv University 

Longitudinal Heritability of Childhood Aggression: Twin Modelling using SEM 
16 April 
Uri Shalit, Technion 

Predicting individual-level treatment effects in patients: challenges and proposed best practices
30 April 
Lucas Janson, Harvard University

Modeling X in High-Dimensional Inference
4 June 
Giles Hooker, Cornell

Random Forests and Inference: U and V Statistics
11 June 
Judith Somekh, Haifa University

Batch correction evaluation framework using a priori gene-gene associations: applied to the GTEx dataset
18 June 
Dorothea Dumuid, University of South Australia 

Statistical Adventures in the Emerging Field of Time-Use Epidemiology
23 October 
Adam Kapelner, City University of New York 

Harmonizing Fully Optimal Designs with Classic Randomization in Fixed Trial Experiments 
6 November 
Daniel Nevo, TAU 

LAGO: The adaptive Learn-As-you-GO design for multi-stage intervention studies
27 November 
Liran Katzir, Final Ltd.

Social network size estimation via sampling
25 December 
Bella Vakulenko-Lagun, Harvard

Some methods to recover from selection bias in survival data 
1 January 
Meir Feder, TAU

Universal Learning for Individual Data
8 January 
Adi Berliner Senderey, Clalit 

Effective implementation of evidence-based medicine in healthcare
Seminars are held on Tuesdays at 10:30 am, Schreiber Building, room 309 (see the TAU map). The seminar organizer is Daniel Yekutieli.
To join the seminar mailing list, or for any other inquiries, please call (03) 6409612 or email 12345yekutiel@post.tau.ac.il54321 (remove the numbers unless you are a spammer…)
Seminars from previous years
ABSTRACTS
· Daniel Nevo, TAU
LAGO: The adaptive Learn-As-you-GO design for multi-stage intervention studies
In large-scale public-health intervention studies, the intervention is a package consisting of multiple components. The intervention package is chosen in a small pilot study and then implemented at large scale. However, for various reasons I will discuss, this approach can lead to implementation failure.
In this talk, I will present a new design, called the Learn-As-you-GO (LAGO) adaptive design. In the LAGO design, the intervention package is adapted in stages during the study based on past outcomes. Typically, an effective intervention package is sought, while minimizing cost. The main complication when analyzing data from a LAGO study is that interventions in later stages depend upon the outcomes of the previous stages. Under the setup of logistic regression, I will present asymptotic theory for LAGO studies and tools that can be used by researchers in practice. The LAGO design will be illustrated via application to the BetterBirth Study, which aimed to improve maternal and neonatal outcomes in India.
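A toy, one-dimensional sketch of the learn-as-you-go idea (not the paper's full multi-component design): fit a logistic model to stage-1 binary outcomes, then choose the cheapest stage-2 intervention "intensity" whose predicted success probability reaches a target. All numbers, the 1-D intensity parametrization, and the cost-proportional-to-intensity assumption are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_logistic(x, y, iters=50):
    """Fit p(y=1|x) = sigmoid(b0 + b1*x) by Newton-Raphson."""
    X = np.column_stack([np.ones_like(x), x])
    b = np.zeros(2)
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ b))
        W = p * (1 - p)
        b += np.linalg.solve((X * W[:, None]).T @ X, X.T @ (y - p))
    return b

# Stage 1 ("pilot"): try a range of intervention intensities, observe outcomes.
true_b = np.array([-2.0, 1.5])           # hypothetical true effect
x1 = rng.uniform(0, 3, size=200)         # stage-1 intervention intensities
p1 = 1 / (1 + np.exp(-(true_b[0] + true_b[1] * x1)))
y1 = rng.binomial(1, p1)

b_hat = fit_logistic(x1, y1)

# "Learn as you go": cheapest stage-2 intensity whose predicted success
# probability reaches the target (cost assumed proportional to intensity).
target = 0.8
grid = np.linspace(0, 3, 301)
pred = 1 / (1 + np.exp(-(b_hat[0] + b_hat[1] * grid)))
x2 = float(grid[np.argmax(pred >= target)])
print(f"recommended stage-2 intensity: {x2:.2f}")
```

In the real design this adaptation step would repeat over several stages, with the dependence between stages accounted for in the asymptotic theory.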
· Adam Kapelner, City University of New York
Harmonizing Fully Optimal Designs with Classic Randomization in Fixed Trial Experiments
There is a movement in design of experiments away from the classic randomization put forward by Fisher, Cochran and others to one based on optimization. In fixed-sample trials comparing two groups, measurements of subjects are known in advance, and subjects can be divided optimally into two groups based on a criterion of homogeneity or "imbalance" between the two groups. These designs are far from random. This talk seeks to understand the benefits and the costs of optimization over classic randomization in the context of different performance criteria, such as Efron's worst-case analysis. In the criterion that we motivate, randomization beats optimization; however, the optimal design is shown to lie between these two extremes. Much-needed further work will provide a procedure for finding these optimal designs in different scenarios in practice. Until then, it is best to randomize.
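The randomization-versus-optimization trade-off can be illustrated with a minimal numpy sketch: measure the Mahalanobis imbalance between group means for balanced random allocations, and compare a typical random draw against the best of many draws (a crude stand-in for an optimized design). The sizes and criterion here are illustrative assumptions, not the talk's actual procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

def imbalance(X, assign):
    """Mahalanobis distance between the covariate means of the two groups."""
    d = X[assign == 1].mean(axis=0) - X[assign == 0].mean(axis=0)
    S = np.cov(X, rowvar=False)
    return float(d @ np.linalg.solve(S, d))

n, p = 100, 5
X = rng.normal(size=(n, p))              # subject covariates, known in advance

def random_assign():
    """Classic balanced randomization: n/2 subjects to each arm."""
    a = np.zeros(n, dtype=int)
    a[rng.choice(n, n // 2, replace=False)] = 1
    return a

draws = [imbalance(X, random_assign()) for _ in range(1000)]
typical = float(np.mean(draws))          # what plain randomization gives you
best = min(draws)                        # "optimized" end of the spectrum
print(f"typical random imbalance: {typical:.4f}, optimized: {best:.4f}")
```

The optimized allocation is far more homogeneous, which is exactly why its worst-case (adversarial) behavior, rather than its average behavior, is the interesting criterion.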
· Liran Katzir, financial algorithms researcher at Final Ltd.
Social network size estimation via sampling
This presentation addresses the problem of estimating the number of users in online social networks. While such networks occasionally publish user numbers, there are good reasons to validate their reports. The proposed algorithm can also estimate the cardinality of network subpopulations. Since this information is seldom voluntarily divulged, algorithms must limit themselves to the social networks' public APIs. No other external information can be assumed. Additionally, due to obvious traffic and privacy concerns, the number of API requests must also be severely limited. Thus, the main focus is on minimizing the number of API requests needed to achieve good estimates. Our approach is to view a social network as an undirected graph and use the public interface to produce a random walk. By counting the number of collisions, an estimate is produced using a non-uniform-samples version of the birthday paradox. The algorithms are validated on several publicly available social network datasets.
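A minimal sketch of the collision idea in its simplest special case, with uniform i.i.d. samples standing in for API responses (the talk's algorithm handles the harder non-uniform random-walk case). With k samples and C colliding pairs, E[C] is approximately k(k-1)/(2n), which can be inverted for n.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(42)

def estimate_size(samples):
    """Estimate population size from uniform samples via birthday collisions.

    n_hat = k(k-1) / (2C), the uniform-sampling special case; non-uniform
    samples (e.g. from a random walk) require a weighted correction.
    """
    k = len(samples)
    counts = Counter(samples)
    collisions = sum(c * (c - 1) // 2 for c in counts.values())
    if collisions == 0:
        raise ValueError("no collisions observed; draw more samples")
    return k * (k - 1) / (2 * collisions)

n_true = 10_000                                 # hidden "user count" to recover
samples = rng.integers(0, n_true, size=1_000)   # stand-in for sampled user IDs
n_hat = estimate_size(samples.tolist())
print(f"estimated size: {n_hat:.0f}")
```

Note that only about 1,000 "API requests" were needed to estimate a population of 10,000, which is the point: accuracy per request.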
· Bella Vakulenko-Lagun, Harvard
Some methods to recover from selection bias in survival data
We consider several study designs resulting in truncated survival data. First, we look at a study with delayed entry, where the left-truncation time and the lifetime of interest are dependent. The critical assumption in using standard methods for truncated data is the assumption of quasi-independence, or factorization. If this condition does not hold, the standard methods cannot be used. We address one specific scenario that can result in dependence between truncation and event times: covariate-induced dependent truncation. While in regression models for time-to-event data this type of dependence does not present any problem, in nonparametric estimation of the lifetime distribution P(X), ignoring the dependence might cause bias. We propose two methods that are able to account for this dependence and allow consistent estimation of P(X).
Our estimators for dependently truncated data will be inefficient if we use them when there is no dependence between truncation and event times. Therefore it is important to test for independence. The common wisdom is that we can test for quasi-independence, that is, "independence in the observable region". We derived two other conditions, called factorization conditions, which are indistinguishable from quasi-independence given the data at hand. This means that in the standard analysis of truncated data, when we assume quasi-independence, we ultimately make an untestable assumption in order to estimate the distribution of the target lifetime. This non-identifiability problem has not been recognized before.
Finally, we consider retrospectively ascertained time-to-event data resulting in right truncation, and discuss estimation of regression coefficients in the Cox model. We suggest an approach that incorporates external information in order to solve the problem of non-positivity that often arises with right-truncated data.
· Meir Feder, TAU
Universal Learning for Individual Data
Universal learning is considered from an information-theoretic point of view, following the universal prediction approach originated by Solomonoff, Kolmogorov, Rissanen, Cover, Ziv and others, and developed in the 90's by Feder & Merhav. Interestingly, the extension to learning is not straightforward. In previous works we considered online learning and supervised learning in a stochastic setting. Yet the most challenging case is batch learning, where prediction is done on a test sample once the entire training data is observed, in the individual setting where the features and labels, of both the training and test data, are specific individual quantities.
Our results provide schemes that, for any individual data, compete with a "genie" (or reference) that knows the true test label. We suggest design criteria and develop the corresponding universal learning schemes, where the main proposed scheme is termed Predictive Normalized Maximum Likelihood (pNML). We demonstrate that pNML learning and its variations provide robust, "stable" learning solutions that outperform the current leading approach based on Empirical Risk Minimization (ERM). Furthermore, the pNML construction provides a pointwise indication of learnability. This measures the uncertainty in learning the specific test challenge from the given training examples, letting the learner know when it does not know.
Joint work with Yaniv Fogel and Koby Bibas
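As a toy illustration of the pNML idea (not taken from the talk), consider the simplest possible model class, Bernoulli distributions: for each candidate test label, refit the maximum-likelihood model on the training data plus that label, evaluate the probability the refitted model assigns to the label, and normalize across candidates. The log of the normalizer is the pointwise regret, i.e. the "learnability" measure.

```python
import numpy as np

def pnml_bernoulli(train_labels):
    """pNML for the Bernoulli model class (toy illustration).

    For each candidate label y in {0, 1}: append y to the training labels,
    take the MLE, and record the probability that MLE gives to y itself.
    Normalizing these yields the pNML distribution; log of the normalizer
    is the pointwise regret ("does the learner know that it knows?").
    """
    n, k = len(train_labels), sum(train_labels)
    q1 = (k + 1) / (n + 1)          # MLE after appending y=1, evaluated at 1
    q0 = (n - k + 1) / (n + 1)      # MLE after appending y=0, evaluated at 0
    norm = q0 + q1
    return {0: q0 / norm, 1: q1 / norm}, float(np.log(norm))

# Informative training set: mostly ones -> confident prediction, low regret.
probs, regret = pnml_bernoulli([1] * 9 + [0])
print(probs, regret)

# No training data at all -> uniform prediction, maximal regret log(2).
probs0, regret0 = pnml_bernoulli([])
print(probs0, regret0)
```

The regret shrinks as training data accumulates, which is exactly the "pointwise indication of learnability" described above.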
· Adi Berliner Senderey, Clalit
Effective implementation of evidence-based medicine in healthcare
Two projects illustrating use of data for determining effective treatment policies are presented.
1. Machine Learning in Healthcare – Shifting the Focus to Fairness – by Noam Barda
This project deals with an algorithm for improving fairness in predictive models. The method is meant to address concerns regarding potential unfairness of prediction models towards groups which are underrepresented in the training dataset and thus might receive uncalibrated scores. The algorithm was implemented on widely used risk models, including the ACC/AHA 2013 model for cardiovascular events and the FRAX model for osteoporotic fractures, and tested on a large real-world sample. Based on joint work with Noa Dagan, Guy Rothblum, Gal Yona, Ran Balicer and Eitan Bachmat.
2. Rates of Ischemic Stroke, Death and Bleeding in Men and Women with Non-Valvular Atrial Fibrillation – by Adi Berliner Senderey
Data regarding the thromboembolic risk and differences in outcomes in men and women with non-valvular atrial fibrillation (NVAF) are inconsistent. The aim of the present study is to evaluate differences in treatment strategies and in the risk of ischemic stroke, death, and bleeding between men and women in a large, population-based cohort of individuals with NVAF. Based on joint work with Yoav Arnson, Moshe Hoshen, Adi Berliner Senderey, Orna Reges, Ran Balicer, Morton Leibowitz, Meytal Avgil Tsadok, Moti Haim.
· James A. Evans, University of Chicago
Centralized Scientific Communities More Likely Generate Nonreplicable Results
Growing concern that published results, including those widely agreed upon, may lack replicability is often modeled, but rarely empirically examined amidst the rapid increase of biomedical publications. We introduce a novel, high-throughput replication strategy aligning 64,412 published findings about 51,292 distinct drug-gene interaction claims (e.g., Benzo(a)pyrene decreases expression of SLC22A3) with high-throughput experiments performed through the NIH LINCS L1000 program. We show (1) that claims reported in a single paper replicate 19.0% (95% confidence interval [CI], 16.9% to 21.2%) more frequently than expected, while those reported in multiple papers and widely agreed upon replicate 45.5% (95% CI, 21.8% to 74.2%) more frequently, manifesting collective correction in science. Nevertheless (2), among the 2,493 interactions reported in two or more papers, centralized scientific communities perpetuate less replicable claims, demonstrating how centralized collaborations weaken collective inquiry. Decentralized, disconnected research communities involve more independent teams, use more diverse methodologies, and draw on more diverse prior knowledge, generating the most robust, replicable results. Our findings highlight the importance of policies that foster decentralized collaboration to promote robust biomedical advance. Our large-scale approach holds promise for identifying reliable biomedical results out of numerous published experiments.
· Nalini Ravishanker, University of Connecticut, Storrs
Modeling Interevent Durations in High-Frequency Time Series
This talk will discuss statistical analysis of durations between events for high-frequency financial time series obtained from the Trade and Quotes (TAQ) database. The class of logarithmic autoregressive conditional duration (Log ACD) models provides a rich framework for analyzing durations, and recent research is focused on developing fast and accurate methods for fitting these models to long time series of durations under the least restrictive assumptions. This talk will describe the use of Godambe-Durbin martingale estimating functions, and will discuss three approaches for parameter estimation: solution of nonlinear estimating equations, recursive formulas for the vector-valued parameter estimates, and iterated component-wise scalar recursions. I will further show how penalizing the estimating functions can achieve sparsity. This is joint work with Yaohua Zhang, Jian Zou, and A. Thavaneswaran.
· Stacey Cherny, Tel Aviv University
Longitudinal Heritability of Childhood Aggression: Twin Modelling using SEM
Twin studies are a powerful tool for partitioning trait variance and covariance into genetic and environmental components. I discuss the basics of the methodology and why the twin study is an important tool in psychology and medicine, and apply it to the study of childhood aggression using two of the largest twin cohorts ever collected, the Netherlands Twin Register (NTR) and the Twins Early Development Study (TEDS; United Kingdom). In NTR, maternal ratings on aggression from the Child Behavior Checklist (CBCL) were available for 10,765 twin pairs at age 7, for 8,557 twin pairs at age 9/10, and for 7,176 twin pairs at age 12. In TEDS, parental ratings of conduct disorder from the Strengths and Difficulties Questionnaire (SDQ) were available for 6,897 twin pairs at age 7, 3,028 twin pairs at age 9, and 5,716 twin pairs at age 12. In both studies, stability and heritability of aggressive behavioral problems were high. Heritability was on average somewhat, but significantly, lower in TEDS (around 60%) than in NTR (between 50% and 80%), and sex differences were slightly larger in the NTR sample. In both studies, the influence of shared environment was similar: in boys, shared environment explained around 20% of the variation in aggression across all ages, while in girls its influence was absent around age 7 and only came into play at later ages. Longitudinal genetic correlations were the main reason for stability of aggressive behavior.
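The logic of the variance partition can be seen in the classic Falconer back-of-envelope version (the talk itself uses full SEM twin models, which also give standard errors and longitudinal structure). The twin correlations below are hypothetical values in the range typical for childhood aggression, not figures from the study.

```python
def ace_from_twin_correlations(r_mz, r_dz):
    """Falconer decomposition of trait variance from twin correlations.

    MZ twins share ~100% of genes, DZ twins ~50%, so:
      A (additive genetic)   = 2 * (r_MZ - r_DZ)
      C (shared environment) = 2 * r_DZ - r_MZ
      E (unique environment) = 1 - r_MZ
    """
    a2 = 2 * (r_mz - r_dz)
    c2 = 2 * r_dz - r_mz
    e2 = 1 - r_mz
    return a2, c2, e2

# Hypothetical correlations: MZ pairs 0.80, DZ pairs 0.50.
a2, c2, e2 = ace_from_twin_correlations(r_mz=0.80, r_dz=0.50)
print(f"A={a2:.2f}  C={c2:.2f}  E={e2:.2f}")
```

SEM twin modelling estimates the same A, C and E components, but by maximum likelihood over the full covariance structure rather than by this simple moment arithmetic.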
· Uri Shalit, Technion
Predicting individual-level treatment effects in patients: challenges and proposed best practices
One of the most inspiring promises of using machine learning in healthcare is learning how to optimally treat individual patients based on data from past patients. I will discuss the challenges that come up when addressing this task, and why standard machine learning methods can catastrophically fail. I will then propose best practices based on ideas from causal inference, along with the necessary identification assumptions for learning treatment recommendations. I will present two case studies: one dealing with treatment of chronic disease using data from a large health provider, and one dealing with acute care using data from a university hospital.
· Lucas Janson, Harvard University
Modeling X in High-Dimensional Inference
For answering questions about the relationship between a response variable Y and a set of explanatory variables X, most statistical methods focus their assumptions on the conditional distribution of Y given X (or Y | X for short). I will describe some benefits of shifting those assumptions from the conditional distribution Y | X to the distribution of X instead, especially when X is high-dimensional. I will briefly review my recent methodological work on knockoffs and the conditional randomization test for high-dimensional controlled variable selection, and explain how the model-X framework endows them with desirable properties like finite-sample error control, power, modeling flexibility, and robustness. Then I will discuss two new papers: the first is a general computational framework using tools from Markov chain Monte Carlo for exactly sampling model-X knockoffs for arbitrary distributions of X, even when the normalizing constant is unknown. The second is how to perform exact high-dimensional controlled variable selection while only assuming a flexible parametric model for X.
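A minimal sketch of the conditional randomization test in the easiest setting, where the model of X is known and the columns are independent standard Gaussians (both assumptions for illustration): resample the tested column from its known distribution, recompute a statistic, and rank the observed value among the resamples.

```python
import numpy as np

rng = np.random.default_rng(7)

def crt_pvalue(X, y, j, mu, sigma, n_resample=199):
    """Conditional randomization test, toy Gaussian version.

    Assumes column j of X is N(mu, sigma^2) independently of the other
    columns (our stand-in for a known model of X). Resample X_j from that
    model, recompute |corr(X_j, y)|, and rank the observed statistic.
    """
    def stat(xj):
        return abs(np.corrcoef(xj, y)[0, 1])
    t_obs = stat(X[:, j])
    t_null = [stat(rng.normal(mu, sigma, size=len(y)))
              for _ in range(n_resample)]
    return (1 + sum(t >= t_obs for t in t_null)) / (1 + n_resample)

n = 300
X = rng.normal(size=(n, 2))
y = X[:, 0] + 0.3 * rng.normal(size=n)      # y depends on column 0 only

p_signal = crt_pvalue(X, y, j=0, mu=0.0, sigma=1.0)
p_null = crt_pvalue(X, y, j=1, mu=0.0, sigma=1.0)
print(f"p (real signal) = {p_signal:.3f}, p (null feature) = {p_null:.3f}")
```

Note that the validity of the p-value rests entirely on the assumed distribution of X_j given the rest of X, and not at all on any model for Y | X, which is the point of the model-X shift.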
· Giles Hooker, Cornell
Random Forests and Inference: U and V Statistics
Breiman's "Two Cultures" essay presents a dichotomy between model-based and algorithmic statistics. This talk presents results that start to bridge this gap. We show that when the bootstrap procedure in random forests is replaced with subsampling, the resulting learner can be represented as an infinite-order, incomplete, random-kernel U-statistic for which a limiting normal distribution can be derived. Moreover, the limiting variance can be estimated at no additional cost. Very recent work shows that subsampling with replacement, falling into the category of V-statistics, yields improved variance estimates.
Using this result, we can compare the predictions made by a model learned with a feature of interest to those made by a model learned without it, and ask whether the differences between these could have arisen by chance. By evaluating the model at a structured set of points we can also ask whether it differs significantly from an additive model. We demonstrate these results in an application to citizen-science data collected by Cornell's Laboratory of Ornithology.
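A heavily simplified sketch of the internal variance estimate for a subsampled ensemble, in the spirit of Mentch and Hooker's anchor-point construction: trees built on subsamples sharing a fixed "anchor" observation give a between-anchor variance component, combined with the overall between-tree variance. The base learner here is a one-split regression stump rather than a full tree, and the plug-in variance formula is a rough illustration, not the paper's exact estimator.

```python
import numpy as np

rng = np.random.default_rng(3)

def fit_stump(x, y):
    """One-split regression tree: threshold minimizing squared error."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best = (np.inf, None)
    for i in range(5, len(xs) - 5):          # keep a few points per leaf
        left, right = ys[:i].mean(), ys[i:].mean()
        sse = ((ys[:i] - left) ** 2).sum() + ((ys[i:] - right) ** 2).sum()
        if sse < best[0]:
            best = (sse, (xs[i], left, right))
    t, left, right = best[1]
    return lambda q: left if q < t else right

n, k = 500, 50                   # sample size, subsample size
x = rng.uniform(-2, 2, n)
y = np.sin(x) + 0.3 * rng.normal(size=n)
x_star = 1.0                     # point at which to quantify uncertainty

n_anchor, n_mc = 25, 40
group_means, all_preds = [], []
for _ in range(n_anchor):
    anchor = rng.integers(n)     # fixed observation shared within the group
    preds = []
    for _ in range(n_mc):
        idx = np.append(rng.choice(n, k - 1, replace=False), anchor)
        preds.append(fit_stump(x[idx], y[idx])(x_star))
    group_means.append(np.mean(preds))
    all_preds.extend(preds)

zeta1 = np.var(group_means, ddof=1)      # between-anchor component
zetak = np.var(all_preds, ddof=1)        # between-tree component
B = n_anchor * n_mc
var_hat = (k ** 2 / n) * zeta1 + zetak / B
print(f"prediction: {np.mean(all_preds):.3f}, variance estimate: {var_hat:.5f}")
```

The resulting normal approximation is what allows the feature-comparison and additivity tests described above.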
· Judith Somekh, Haifa University
Batch correction evaluation framework using a priori gene-gene associations: applied to the GTEx dataset
Correcting a heterogeneous dataset that presents artefacts from several confounders is often an essential bioinformatics task. Attempting to remove these batch effects will result in some biologically meaningful signals being lost. Thus, a central challenge is assessing whether the removal of unwanted technical variation harms the biological signal that is of interest to the researcher. I will describe a novel framework to evaluate the effectiveness of batch correction methods and their tendency toward over- or under-correction. The approach is based on comparing co-expression of adjusted gene-gene pairs to a priori knowledge of highly confident gene-gene associations, based on thousands of unrelated experiments derived from an external reference. The framework includes three steps: (1) data adjustment with the desired methods; (2) calculating gene-gene co-expression measurements for the adjusted datasets; (3) evaluating the performance of the co-expression measurements against a gold standard. Using the framework, five batch correction methods were evaluated and applied to RNA-seq data of six representative tissue datasets derived from the GTEx project.
· Dot Dumuid, University of South Australia
Statistical Adventures in the Emerging Field of TimeUse Epidemiology
How we spend our time influences our health. Previously, researchers have been concerned with relationships between individual activities (e.g., sleep or physical activity) and health. Yet these activities are all interrelated: you can't change one without changing the others. This is because each day we get a limited amount of time – 24 hours. We can spend this time in three ways: sleeping, being sedentary or being active. These three activities are mutually exclusive (we can only do one at a time) and exhaustive (there is never a time when we are not doing one of them). Spending more time in one activity can only be at the expense of the other behaviours. Thus time-use data are a special kind of data – they are compositional data, and should be analysed using statistical methods that respect their unique properties.
In this talk, we explore how compositional data analysis can be used to model time-use data. I will present examples of compositional data analysis used to (1) predict how a health outcome may change when a set duration (e.g. 30 minutes) is reallocated from one activity to another; (2) explore how activity composition is predicted to change across chronic disease progression; and (3) determine whether there are differences in activity composition between groups of people (e.g., socioeconomic groups, control/treatment groups in a clinical trial). Finally, I will share preliminary findings on the optimisation of time spent across sleep, sedentary behaviour and physical activity, in order to achieve the best overall health.
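The standard first step in compositional data analysis is a log-ratio transform, which maps the constrained 24-hour composition to unconstrained coordinates that ordinary statistical models can handle. A minimal sketch using the centred log-ratio (clr) transform, with hypothetical hours of sleep, sedentary time and physical activity:

```python
import numpy as np

def clr(composition):
    """Centred log-ratio transform.

    Maps a composition (positive parts summing to a constant, e.g. 24 h)
    to real-valued coordinates; the constraint shows up as the coordinates
    summing to zero.
    """
    x = np.asarray(composition, dtype=float)
    g = np.exp(np.log(x).mean())         # geometric mean of the parts
    return np.log(x / g)

# Hypothetical day: 8 h sleep, 13 h sedentary, 3 h physical activity.
day = [8.0, 13.0, 3.0]
z = clr(day)
print(z)                                  # sums to zero by construction
```

Reallocating 30 minutes from one activity to another moves the point in this transformed space, which is how the health-outcome predictions in (1) above can be framed as an ordinary regression problem.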