5 March | James A. Evans, University of Chicago | Centralized Scientific Communities More Likely Generate Non-replicable Results
19 March | Nalini Ravishanker, University of Connecticut | Modeling Inter-event Durations in High-Frequency Time Series
26 March | Stacey Cherny, Tel Aviv University | Longitudinal Heritability of Childhood Aggression: Twin Modelling using SEM
16 April | Uri Shalit, Technion | Predicting individual-level treatment effects in patients: challenges and proposed best practices
30 April | Lucas Janson, Harvard University | Modeling X in High-Dimensional Inference
4 June | Giles Hooker, Cornell | Random Forests and Inference: U and V Statistics
11 June | Judith Somekh, Haifa University | Batch correction evaluation framework using a-priori gene-gene associations: applied to the GTEx dataset
18 June | Dorothea Dumuid, University of South Australia | Statistical Adventures in the Emerging Field of Time-Use Epidemiology
23 October | Adam Kapelner, City University of New York | Harmonizing Fully Optimal Designs with Classic Randomization in Fixed Trial Experiments
6 November | Daniel Nevo, TAU | LAGO: The adaptive Learn-As-you-GO design for multi-stage intervention studies
27 November | Liran Katzir, Final Ltd. | Social network size estimation via sampling
25 December | Bella Vakulenko-Lagun, Harvard | Some methods to recover from selection bias in survival data
1 January | Meir Feder, TAU | Universal Learning for Individual Data
8 January | Adi Berliner Senderey, Clalit | Effective implementation of evidence based medicine in Healthcare
Seminars are held on Tuesdays at 10:30 am in Schreiber Building, Room 309 (see the TAU map). The seminar organizer is Daniel Yekutieli.
To join the seminar mailing list, or for any other inquiries, please call (03)-6409612 or email 12345yekutiel@post.tau.ac.il54321 (remove the numbers unless you are a spammer…)
Seminars from previous years
ABSTRACTS
· Daniel Nevo, TAU
LAGO: The adaptive Learn-As-you-GO design for multi-stage intervention studies
In large-scale public-health intervention studies, the intervention is a package consisting of multiple components. The intervention package is chosen in a small pilot study and then implemented in a large-scale setup. However, for various reasons I will discuss, this approach can lead to implementation failure.
In this talk, I will present a new design, called the learn-as-you-go (LAGO) adaptive design. In the LAGO design, the intervention package is adapted in stages during the study based on past outcomes. Typically, an effective intervention package is sought while minimizing cost. The main complication when analyzing data from a LAGO study is that interventions in later stages depend upon the outcomes in the previous stages. Under the setup of logistic regression, I will present asymptotic theory for LAGO studies and tools that can be used by researchers in practice. The LAGO design will be illustrated via application to the BetterBirth Study, which aimed to improve maternal and neonatal outcomes in India.
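As a toy illustration of the learn-as-you-go idea (not the estimator or the BetterBirth analysis from the talk), the sketch below randomizes two hypothetical package components in stage 1, fits a logistic regression of a binary outcome on the components, and then chooses the cheapest stage-2 package whose predicted success clears a target; the component names, costs, and the 0.80 target are invented for illustration.
```python
# Toy learn-as-you-go sketch (hypothetical components, costs, and target).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stage 1: randomize two hypothetical components (e.g., training hours, supply level)
n1 = 500
X1 = np.column_stack([rng.uniform(0, 10, n1), rng.uniform(0, 5, n1)])
true_beta = np.array([0.35, 0.6])
p = 1 / (1 + np.exp(-(-3.0 + X1 @ true_beta)))
y1 = rng.binomial(1, p)

# Learn: estimate the dose-response of the package components from stage-1 outcomes
model = LogisticRegression(C=1e6).fit(X1, y1)

# Go: choose the cheapest stage-2 package on a grid with predicted success >= 0.80
grid = np.array([(a, b) for a in np.linspace(0, 10, 41) for b in np.linspace(0, 5, 21)])
cost = grid @ np.array([2.0, 5.0])                 # hypothetical per-unit costs
success = model.predict_proba(grid)[:, 1]
feasible = success >= 0.80
best = grid[feasible][np.argmin(cost[feasible])]
print("stage-2 package:", best, "cost:", round(float(cost[feasible].min()), 2))
```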
· Adam Kapelner, City University of New York
Harmonizing Fully Optimal Designs with Classic Randomization in Fixed Trial Experiments
There is a movement in design of experiments away from the classic randomization put forward by Fisher, Cochran and others to one based on optimization. In fixed-sample trials comparing two groups, measurements of subjects are known in advance and subjects can be divided optimally into two groups based on a criterion of homogeneity or "imbalance" between the two groups. These designs are far from random. This talk seeks to understand the benefits and the costs over classic randomization in the context of different performance criteria such as Efron's worst-case analysis. Under the criterion that we motivate, randomization beats optimization. However, the optimal design is shown to lie between these two extremes. Much-needed further work will provide a procedure to find this optimal design in different practical scenarios. Until then, it is best to randomize.
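As a rough numerical illustration of the randomization-versus-optimization tension (assuming a Mahalanobis imbalance criterion, a common choice that is not necessarily the criterion motivated in the talk), the sketch below compares the imbalance of a single random split with the best split found among many random candidates.
```python
# Minimal sketch: covariate imbalance under pure randomization vs. a near-optimal allocation.
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 5
X = rng.normal(size=(n, p))                 # covariates known before assignment
Sinv = np.linalg.inv(np.cov(X.T))

def imbalance(w):
    d = X[w == 1].mean(axis=0) - X[w == 0].mean(axis=0)
    return float(d @ Sinv @ d)              # Mahalanobis distance between group means

def random_allocation():
    w = np.zeros(n, dtype=int)
    w[rng.choice(n, n // 2, replace=False)] = 1
    return w

# Classic randomization: a single random split
w_rand = random_allocation()

# Optimization flavour: search many random splits and keep the most balanced one
candidates = [random_allocation() for _ in range(10_000)]
w_opt = min(candidates, key=imbalance)

print("imbalance, randomized:", round(imbalance(w_rand), 3))
print("imbalance, optimized :", round(imbalance(w_opt), 3))
```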
· Liran Katzir, financial algorithms researcher at Final Ltd.
Social network size estimation via sampling
This presentation addresses the problem of estimating the number of users in online social networks. While such networks occasionally publish user numbers, there are good reasons to validate their reports. The proposed algorithm can also estimate the cardinality of network sub-populations. Since this information is seldom voluntarily divulged, algorithms must limit themselves to the social networks' public APIs. No other external information can be assumed. Additionally, due to obvious traffic and privacy concerns, the number of API requests must also be severely limited. Thus, the main focus is on minimizing the number of API requests needed to achieve good estimates. Our approach is to view a social network as an undirected graph and use the public interface to produce a random walk. By counting the number of collisions, an estimate is produced using a non-uniform-sampling version of the birthday paradox. The algorithms are validated on several publicly available social network datasets.
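A stripped-down sketch of the collision idea, using uniform node sampling; the talk's algorithm instead works with the non-uniform, degree-biased samples produced by a random walk on the graph, so the estimator below is only the classical birthday-paradox special case.
```python
# Birthday-paradox size estimator from uniform samples with replacement.
import numpy as np

rng = np.random.default_rng(2)
N_true = 1_000_000                       # unknown population size to estimate
r = 5_000                                # number of uniform samples with replacement
sample = rng.integers(0, N_true, size=r)

# Count pairwise collisions via the multiplicity of each sampled id
_, counts = np.unique(sample, return_counts=True)
collisions = int((counts * (counts - 1) // 2).sum())

# E[#collisions] ~= r(r-1)/(2N)  =>  N_hat = r(r-1)/(2 * collisions)
if collisions > 0:
    N_hat = r * (r - 1) / (2 * collisions)
    print(f"estimated size: {N_hat:,.0f}   (true size: {N_true:,})")
else:
    print("no collisions observed; draw more samples")
```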
· Bella Vakulenko-Lagun, Harvard
Some methods to recover from selection bias in survival data
We consider several study designs resulting in truncated survival data. First, we look at a study with delayed entry, where the left truncation time and the lifetime of interest are dependent. The critical assumption in using standard methods for truncated data is the assumption of quasi-independence or factorization. If this condition does not hold, the standard methods cannot be used. We address one specific scenario that can result in dependence between truncation and event times - covariate-induced dependent truncation. While in regression models for time-to-event data this type of dependence does not present any problem, in nonparametric estimation of the lifetime distribution P(X), ignoring the dependence might cause bias. We propose two methods that are able to account for this dependence and allow consistent estimation of P(X).
Our estimators for dependently truncated data will be inefficient if we use them when there is no dependence between truncation and event times. Therefore it is important to test for independence. It is commonly held that we can test for quasi-independence, that is, "independence in the observable region". We derived two other conditions, called factorization conditions, which are indistinguishable from quasi-independence given the data at hand. This means that in the standard analysis of truncated data, when we assume quasi-independence, we ultimately make an untestable assumption in order to estimate the distribution of the target lifetime. This non-identifiability problem has not been recognized before.
Finally, we consider retrospectively ascertained time-to-event data resulting in right truncation, and discuss estimation of regression coefficients in the Cox model. We suggest an approach that incorporates external information in order to solve the problem of non-positivity that often arises with right-truncated data.
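For orientation, the sketch below implements the standard product-limit estimator with delayed entry on simulated data; it is consistent under quasi-independence, and it is this standard analysis whose hidden assumptions the talk examines, not the proposed dependence-adjusted estimators.
```python
# Standard product-limit (Kaplan-Meier-type) estimator with left truncation, simulated data.
import numpy as np

rng = np.random.default_rng(3)
n = 5_000
X = rng.exponential(10.0, size=n)                      # lifetimes of interest
T = np.where(rng.random(n) < 0.5, 0.0,                 # half the subjects enter at time 0,
             rng.uniform(0.0, 15.0, size=n))           # the rest have delayed entry
obs = X > T                                            # left truncation: observe only X > T
x, t = X[obs], T[obs]

# Risk set at time u: entered before u (t < u) and still event-free (x >= u)
event_times = np.sort(np.unique(x))
surv, S = [], 1.0
for u in event_times:
    at_risk = np.sum((t < u) & (x >= u))
    d = np.sum(x == u)
    if at_risk > 0:
        S *= 1.0 - d / at_risk
    surv.append(S)
surv = np.array(surv)

# Compare with the true exponential survival at a few time points
for u0 in (5.0, 10.0, 15.0):
    i = np.searchsorted(event_times, u0, side="right") - 1
    print(f"S_hat({u0}) = {surv[i]:.3f}   true = {np.exp(-u0 / 10.0):.3f}")
```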
· Meir Feder, TAU
Universal Learning for Individual Data
Universal learning is considered from an information-theoretic point of view, following the universal prediction approach originated by Solomonoff, Kolmogorov, Rissanen, Cover, Ziv and others and developed in the 90's by Feder and Merhav. Interestingly, the extension to learning is not straightforward. In previous works we considered on-line learning and supervised learning in a stochastic setting. Yet the most challenging case is batch learning, where prediction is done on a test sample once the entire training data is observed, in the individual setting where the features and labels, both of the training and test, are specific individual quantities.
Our results provide schemes that for any individual data compete with a "genie" (or reference) that knows the true test label. We suggest design criteria and develop the corresponding universal learning schemes, where the main proposed scheme is termed Predictive Normalized Maximum Likelihood (pNML). We demonstrate that pNML learning and its variations provide robust, "stable" learning solutions that outperform the current leading approach based on Empirical Risk Minimization (ERM). Furthermore, the pNML construction provides a pointwise indication of learnability: it measures the uncertainty in learning the specific test challenge from the given training examples, letting the learner know when it does not know.
Joint work with Yaniv Fogel and Koby Bibas
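A minimal sketch of the pNML assignment for a single binary test point, using logistic regression as the hypothesis class (an illustrative choice; C is set large to approximate unregularized maximum likelihood): refit the model with each candidate test label, normalize the "genie" probabilities, and read off the log-normalizer as the pointwise regret.
```python
# Minimal pNML sketch for a binary test point with a logistic-regression hypothesis class.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)

# Hypothetical training data and a single test point
n, d = 200, 5
X_train = rng.normal(size=(n, d))
w = rng.normal(size=d)
y_train = (X_train @ w + 0.5 * rng.normal(size=n) > 0).astype(int)
x_test = rng.normal(size=(1, d))

# For each candidate test label, refit by (approximate) maximum likelihood on
# train + (x_test, y) and record the probability that model assigns to y at x_test.
genie_probs = []
for y in (0, 1):
    Xy = np.vstack([X_train, x_test])
    yy = np.append(y_train, y)
    clf = LogisticRegression(C=1e6, max_iter=1000).fit(Xy, yy)
    genie_probs.append(clf.predict_proba(x_test)[0, y])

# Normalize to get the pNML assignment; log of the normalizer is the pointwise regret
# ("learnability") -- a large regret signals that the learner "does not know".
Z = sum(genie_probs)
pnml = [p / Z for p in genie_probs]
regret = np.log2(Z)
print("pNML P(y=1 | x_test) =", round(pnml[1], 3), "  regret =", round(regret, 3))
```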
· Adi Berliner Senderey, Clalit
Effective implementation of evidence based medicine in Healthcare
Two projects illustrating the use of data for determining effective treatment policies are presented.
1. Machine Learning in Healthcare – Shifting the Focus to Fairness – by Noam Barda
This project deals with an algorithm for improving fairness in predictive models. The method is meant to address concerns regarding potential unfairness of prediction models towards groups which are underrepresented in the training dataset and thus might receive uncalibrated scores. The algorithm was implemented on widely used risk models, including the ACC/AHA 2013 model for cardiovascular events and the FRAX model for osteoporotic fractures, and tested on a large real-world sample. Based on joint work with Noa Dagan, Guy Rothblum, Gal Yona, Ran Balicer and Eitan Bachmat.
2. Rates of Ischemic Stroke, Death and Bleeding in Men and Women with Non-Valvular Atrial Fibrillation – by Adi Berliner Senderey
Data regarding the thromboembolic risk and differences in outcomes in men and women with non-valvular atrial fibrillation (NVAF) are inconsistent. The aim of the present study is to evaluate differences in treatment strategies and risk of ischemic stroke, death, and bleeding between men and women in a large, population-based cohort of individuals with NVAF. Based on joint work with Yoav Arnson, Moshe Hoshen, Adi Berliner Senderey, Orna Reges, Ran Balicer, Morton Leibowitz, Meytal Avgil Tsadok, and Moti Haim.
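The sketch below is not the project's algorithm; it only illustrates the diagnostic that motivates it, namely checking calibration (mean predicted versus observed risk) separately within subgroups that may be underrepresented in the training data. All names are placeholders.
```python
# Calibration-by-subgroup diagnostic (illustrative only; not the project's method).
import numpy as np

def calibration_by_group(y_true, y_prob, group, n_bins=10):
    """Mean predicted vs. observed outcome rate per risk bin, within each subgroup."""
    report = {}
    bins = np.quantile(y_prob, np.linspace(0, 1, n_bins + 1))
    for g in np.unique(group):
        m = group == g
        idx = np.clip(np.digitize(y_prob[m], bins[1:-1]), 0, n_bins - 1)
        rows = []
        for b in range(n_bins):
            sel = idx == b
            if sel.any():
                # (mean predicted risk, observed event rate, bin size)
                rows.append((y_prob[m][sel].mean(), y_true[m][sel].mean(), int(sel.sum())))
        report[g] = rows
    return report

# Hypothetical usage: report = calibration_by_group(y, risk_scores, sex_or_ethnicity_labels)
```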
· James A. Evans, University of Chicago
Centralized Scientific Communities More Likely Generate Non-replicable Results
Growing concern that published results, including those widely agreed upon, may lack replicability is often modeled, but rarely empirically examined amidst the rapid increase of biomedical publications. We introduce a novel, high-throughput replication strategy aligning 64,412 published findings about 51,292 distinct drug-gene interaction claims (e.g., Benzo(a)pyrene decreases expression of SLC22A3) with high-throughput experiments performed through the NIH LINCS L1000 program. We show (1) that claims reported in a single paper replicate 19.0% (95% confidence interval [CI], 16.9% to 21.2%) more frequently than expected, while those reported in multiple papers and widely agreed upon replicate 45.5% (95% CI, 21.8% to 74.2%) more frequently, manifesting collective correction in science. Nevertheless (2), among the 2,493 interactions reported in two or more papers, centralized scientific communities perpetuate less replicable claims, demonstrating how centralized collaborations weaken collective inquiry. Decentralized, disconnected research communities involve more independent teams, use more diverse methodologies, and draw on more diverse prior knowledge, generating the most robust, replicable results. Our findings highlight the importance of policies that foster decentralized collaboration to promote robust biomedical advance. Our large-scale approach holds promise for identifying reliable biomedical results out of numerous published experiments.
· Nalini Ravishanker, University of Connecticut, Storrs
Modeling Inter-event Durations in High-Frequency Time Series
This talk will discuss statistical analysis of durations between events for high-frequency financial time series obtained from the Trade and Quotes (TAQ) database. The class of logarithmic autoregressive conditional duration (Log ACD) models provides a rich framework for analyzing durations, and recent research has focused on developing fast and accurate methods for fitting these models to long time series of durations under the least restrictive assumptions. This talk will describe the use of Godambe-Durbin martingale estimating functions and will discuss three approaches for parameter estimation: solution of nonlinear estimating equations, recursive formulas for the vector-valued parameter estimates, and iterated component-wise scalar recursions. It will further show how penalizing the estimating functions can achieve sparsity. This is joint work with Yaohua Zhang, Jian Zou, and A. Thavaneswaran.
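For concreteness, the sketch below simulates a Log-ACD(1,1) duration series with unit-mean exponential innovations (one common specification; the parameter values are arbitrary), the kind of model the estimating-function methods above are designed to fit.
```python
# Simulate a Log-ACD(1,1) duration series with unit-mean exponential errors.
import numpy as np

rng = np.random.default_rng(5)

def simulate_log_acd(n, omega=0.1, alpha=0.1, beta=0.8):
    """x_i = psi_i * eps_i,   log psi_i = omega + alpha*log x_{i-1} + beta*log psi_{i-1}."""
    x = np.empty(n)
    log_psi = omega / (1 - alpha - beta)        # start near the unconditional level
    for i in range(n):
        psi = np.exp(log_psi)
        x[i] = psi * rng.exponential(1.0)       # unit-mean exponential innovation
        log_psi = omega + alpha * np.log(x[i]) + beta * log_psi
    return x

durations = simulate_log_acd(50_000)
print("mean duration:", round(durations.mean(), 3),
      " lag-1 autocorrelation:", round(np.corrcoef(durations[:-1], durations[1:])[0, 1], 3))
```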
· Stacey Cherny, Tel Aviv University
Longitudinal Heritability of Childhood Aggression: Twin Modelling using SEM
Twin studies are a powerful tool for partitioning trait variance and covariance into genetic and environmental components. I discuss the basics of the methodology and why the twin study is an important tool in psychology and medicine, and apply this methodology to the study of childhood aggression using two of the largest twin cohorts ever collected, the Netherlands Twin Register (NTR) and the Twins Early Development Study (TEDS; United Kingdom). In NTR, maternal ratings on aggression from the Child Behavior Checklist (CBCL) were available for 10,765 twin pairs at age 7, for 8,557 twin pairs at age 9/10, and for 7,176 twin pairs at age 12. In TEDS, parental ratings of conduct disorder from the Strength and Difficulty Questionnaire (SDQ) were available for 6,897 twin pairs at age 7, 3,028 twin pairs at age 9, and 5,716 twin pairs at age 12. In both studies, stability and heritability of aggressive behavioral problems were high. Heritability was on average somewhat, but significantly, lower in TEDS (around 60%) than in NTR (between 50% and 80%), and sex differences were slightly larger in the NTR sample. In both studies, the influence of shared environment was similar: in boys, shared environment explained around 20% of the variation in aggression across all ages, while in girls its influence was absent around age 7 and only came into play at later ages. Longitudinal genetic correlations were the main reason for stability of aggressive behavior.
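The talk fits full structural equation twin models; as a quick orientation to the variance decomposition, Falconer's formulas recover ACE components from MZ and DZ twin correlations. The correlations below are hypothetical, chosen only so the heritability lands near the 60% figure mentioned above.
```python
# Falconer's formulas for an ACE decomposition (hypothetical twin correlations).
r_mz, r_dz = 0.75, 0.45        # hypothetical MZ and DZ twin correlations

a2 = 2 * (r_mz - r_dz)         # A: additive genetic variance (heritability)
c2 = 2 * r_dz - r_mz           # C: shared environment
e2 = 1 - r_mz                  # E: non-shared environment (plus measurement error)
print(f"A = {a2:.2f}, C = {c2:.2f}, E = {e2:.2f}")   # A = 0.60, C = 0.15, E = 0.25
```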
· Uri Shalit, Technion
Predicting individual-level treatment effects in patients: challenges and proposed best practices
One of the most inspiring promises of using machine learning in healthcare is learning how to optimally treat individual patients based on data from past patients. I will discuss the challenges that come up when addressing this task, and why standard machine learning methods can catastrophically fail. I will then propose best-practices based on ideas from causal inference, along with the necessary identification assumptions for learning treatment recommendations. I will present two case studies: one dealing with treatment of chronic disease using data from a large health provider, and one dealing with acute care using data from a university hospital.
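One simple estimator consistent with this theme (not necessarily the method proposed in the talk) is a "T-learner" that fits separate outcome models in the treated and control arms and differences their predictions; it is valid only under identification assumptions of the kind the talk discusses, such as no unmeasured confounding and overlap. The sketch uses simulated data.
```python
# T-learner sketch for individual-level treatment effects on simulated confounded data.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(6)

# Hypothetical observational data: X covariates, t treatment, y outcome
n, d = 2000, 5
X = rng.normal(size=(n, d))
t = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))        # confounded treatment assignment
tau = 1.0 + X[:, 1]                                    # true individual treatment effect
y = X[:, 0] + tau * t + rng.normal(size=n)

# Fit separate outcome models per arm and difference the predicted potential outcomes
m1 = GradientBoostingRegressor().fit(X[t == 1], y[t == 1])
m0 = GradientBoostingRegressor().fit(X[t == 0], y[t == 0])
cate_hat = m1.predict(X) - m0.predict(X)

print("corr(estimated effect, true effect):", round(np.corrcoef(cate_hat, tau)[0, 1], 2))
```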
· Lucas Janson, Harvard University
Modeling X in High-Dimensional Inference
For answering questions about the relationship between a response variable Y and a set of explanatory variables X, most statistical methods focus their assumptions on the conditional distribution of Y given X (or Y | X for short). I will describe some benefits of shifting those assumptions from the conditional distribution Y | X to the distribution of X instead, especially when X is high-dimensional. I will briefly review my recent methodological work on knockoffs and the conditional randomization test for high-dimensional controlled variable selection, and explain how the model-X framework endows them with desirable properties like finite-sample error control, power, modeling flexibility, and robustness. Then I will discuss two new papers: The first is a general computational framework using tools from Markov chain Monte Carlo for exactly sampling model-X knockoffs for arbitrary distributions for X, even when the normalizing constant is unknown. The second is how to perform exact high-dimensional controlled variable selection while only assuming a flexible parametric model for X.
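A minimal sketch of the conditional randomization test for a single variable, under the model-X assumption that X_j given the remaining covariates is a known Gaussian linear regression; the test statistic is a simple absolute correlation, chosen for brevity rather than power.
```python
# Conditional randomization test (CRT) for one variable under a known Gaussian model for X.
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical data: X ~ N(0, Sigma) with known Sigma; Y depends on X_1 but not X_0
n, p = 300, 5
Sigma = 0.5 * np.ones((p, p)) + 0.5 * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
y = X[:, 1] + rng.normal(size=n)

j = 0                                           # test H0: Y independent of X_j given X_-j
idx = [k for k in range(p) if k != j]

# Known conditional X_j | X_-j = N(X_-j @ b, s2), derived from Sigma
b = np.linalg.solve(Sigma[np.ix_(idx, idx)], Sigma[idx, j])
s2 = Sigma[j, j] - Sigma[j, idx] @ b
mu = X[:, idx] @ b

stat = lambda xj: abs(np.corrcoef(xj, y)[0, 1])
T_obs = stat(X[:, j])
K = 999
T_null = [stat(mu + np.sqrt(s2) * rng.normal(size=n)) for _ in range(K)]
p_value = (1 + sum(Tk >= T_obs for Tk in T_null)) / (K + 1)
print("CRT p-value for X_0:", round(p_value, 3))
```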
· Giles Hooker, Cornell
Random Forests and Inference: U and V Statistics
Breiman's "Two Cultures" essay presents a dichotomy between model-based and algorithmic statistics. This talk presents results that start to bridge this gap. We show that when the bootstrap procedure in Random Forests is replaced with subsampling, the resulting learner can be represented as an infinite-order, incomplete, random-kernel U-statistic for which a limiting normal distribution can be derived. Moreover, the limiting variance can be estimated with no additional work. Very recent work shows that subsampling with replacement, falling into the category of V-statistics, yields improved variance estimates.
Using this result, we can compare the predictions made by a model learned with a feature of interest, to those made by a model learned without it and ask whether the differences between these could have arisen by chance. By evaluating the model at a structured set of points we can also ask whether it differs significantly from an additive model. We demonstrate these results in an application to citizen-science data collected by Cornell's Laboratory of Ornithology.
· Judith Somekh, Haifa University
Batch correction evaluation framework using a-priori gene-gene associations: applied to the GTEx dataset
Correcting a heterogeneous dataset that presents artefacts from several confounders is often an essential bioinformatics task. Attempting to remove these batch effects will result in some biologically meaningful signals being lost. Thus, a central challenge is assessing whether the removal of unwanted technical variation harms the biological signal that is of interest to the researcher. I will describe a novel framework to evaluate the effectiveness of batch correction methods and their tendency toward over- or under-correction. The approach is based on comparing co-expression of adjusted gene-gene pairs to a-priori knowledge of highly confident gene-gene associations based on thousands of unrelated experiments derived from an external reference. The framework includes three steps: (1) adjusting the data with the desired methods; (2) calculating gene-gene co-expression measurements for the adjusted datasets; and (3) evaluating the performance of the co-expression measurements against a gold standard. Using the framework, five batch correction methods were evaluated and applied to RNA-seq data of six representative tissue datasets derived from the GTEx project.
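A sketch of step (3) under stated assumptions: score an adjusted expression matrix by how well its gene-gene correlations recover a gold-standard set of known associated pairs, summarized as an AUC. The expression matrices and gold-standard pairs in the usage comment are placeholders, and absolute Pearson correlation is only one possible co-expression measure.
```python
# Evaluate an adjusted dataset against a gold standard of known gene-gene associations.
import numpy as np
from itertools import combinations
from sklearn.metrics import roc_auc_score

def coexpression_auc(expr, gold_pairs):
    """expr: genes x samples matrix (already batch-adjusted); gold_pairs: set of (i, j) index pairs."""
    corr = np.corrcoef(expr)                      # gene-gene co-expression
    n_genes = expr.shape[0]
    scores, labels = [], []
    for i, j in combinations(range(n_genes), 2):
        scores.append(abs(corr[i, j]))
        labels.append(1 if (i, j) in gold_pairs or (j, i) in gold_pairs else 0)
    return roc_auc_score(labels, scores)          # higher AUC = biological signal preserved

# Hypothetical usage: compare several correction methods on the same tissue dataset, e.g.
# for name, adjusted in {"combat": expr_combat, "pca_resid": expr_pca}.items():
#     print(name, coexpression_auc(adjusted, gold_pairs))
```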
· Dot Dumuid, University of South Australia
Statistical Adventures in the Emerging Field of Time-Use Epidemiology
How we spend our time influences our health. Previously, researchers have been concerned with relationships between individual activities (e.g., sleep or physical activity) and health. Yet these activities are all inter-related: you can't change one without changing the others. This is because each day we get a limited amount of time – 24 hours. We can spend this time in three ways: sleeping, being sedentary or being active. These three activities are mutually exclusive (we can only do one at a time) and exhaustive (there is never a time when we are not doing one of them). Spending more time in one activity can only be at the expense of the other behaviours. Thus time-use data are a special kind of data – they are compositional data, and should be analysed using statistical methods that respect their unique properties.
In this talk, we explore how compositional data analysis can be used to model time-use data. I will present examples of compositional data analysis used to (1) predict how a health outcome may change when a set duration (e.g. 30 minutes) is reallocated from one activity to another; (2) explore how activity composition is predicted to change across chronic disease progression; and (3) determine whether there are differences in activity composition between groups of people (e.g., socioeconomic groups, control/treatment groups in a clinical trial). Finally, I will share preliminary findings on the optimisation of time spent across sleep, sedentary behaviour and physical activity, in order to achieve the best overall health.
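A sketch of the compositional workflow on hypothetical data: express the three-part day as isometric log-ratio (ilr) coordinates, regress a health outcome on them with an ordinary linear model, and predict the change when 30 minutes is reallocated from sedentary time to physical activity. The outcome model and Dirichlet-generated time-use data are invented for illustration.
```python
# Compositional (ilr) regression of a health outcome on a 3-part time-use composition.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(8)

def ilr(parts):
    """ilr coordinates for a 3-part composition (sleep, sedentary, active); rows sum to 1."""
    x1, x2, x3 = parts[:, 0], parts[:, 1], parts[:, 2]
    z1 = np.sqrt(1 / 2) * np.log(x1 / x2)
    z2 = np.sqrt(2 / 3) * np.log(np.sqrt(x1 * x2) / x3)
    return np.column_stack([z1, z2])

# Hypothetical daily compositions in minutes and a health outcome score
minutes = rng.dirichlet([8, 10, 2], size=1000) * 1440
outcome = 25 - 0.01 * minutes[:, 2] + rng.normal(0, 1, 1000)   # more activity, lower score

model = LinearRegression().fit(ilr(minutes / 1440), outcome)

# Reallocate 30 minutes from sedentary to active for an average day and compare predictions
base = minutes.mean(axis=0)
realloc = base + np.array([0.0, -30.0, 30.0])
pred = model.predict(ilr(np.vstack([base, realloc]) / 1440))
print("predicted change from moving 30 min sedentary -> active:", round(pred[1] - pred[0], 2))
```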