Ruth Heller - Publications

Some Technical Reports

Frostig, T., Heller, R., and Benjamini, Y.
Direction Preferring Confidence Intervals (2024)
arXiv:2404.00319.
- abstract: Confidence intervals (CIs) are instrumental in statistical analysis, providing a range estimate of the parameters. In modern statistics, selective inference is common, where only certain parameters are highlighted. However, this selective approach can bias the inference, leading some to advocate for the use of CIs over p-values. To increase the flexibility of confidence intervals, we introduce direction-preferring CIs, enabling analysts to focus on parameters trending in a particular direction. We present these types of CIs in two settings: First, when there is no selection of parameters; and second, for situations involving parameter selection, where we offer a conditional version of the direction-preferring CIs. Both of these methods build upon the foundations of Modified Pratt CIs, which rely on non-equivariant acceptance regions to achieve longer intervals in exchange for improved sign exclusions. We show that for selected parameters out of m > 1 initial parameters of interest, CIs aimed at controlling the false coverage rate, have higher power to determine the sign compared to conditional CIs. We also show that conditional confidence intervals control the marginal false coverage rate (mFCR) under any dependency.
- Paper: Link.
Heller, Y. and Heller, R.
Computing the Bergsma Dassios sign-covariance (2016)
arXiv:1605.08732.
- abstract: Bergsma and Dassios (2014) introduced an independence measure which is zero if and only if two random variables are independent. This measure can be naively calculated in $O(n^4)$. Weihs et al. (2015) showed that it can be calculated in $O(n^2\log n)$. In this note we will show that using the methods described in Heller et al. (2016), the measure can easily be calculated in only $O(n^2)$.
- Paper: Link.
Gorfine, M. and Heller, R. and Heller, Y.
Comment on "Detecting Novel Associations in Large Data Sets" (2012)
- abstract: Reshef et al. presented a novel measure of dependence - the maximal information coefficient (MIC) aimed to capture a wide range of associations between pairs of variables, and a statistical test for independence based on MIC. They defined a concept of equitability and claim that non-equitable methods are less practical for data exploration. By simple power comparisons, we show that this conclusion is wrong.
- Paper: PDF file.

Publications

Samrat, R. and Bogomolov, M. and Heller, R. and Claridge, A. and Beeson, T. and Small, D.
Exploration, Confirmation, and Replication in the Same Observational Study: A Two Team Cross-Screening Approach to Studying the Effect of Unwanted Pregnancy on Mothers' Later Life Outcomes (2026)
Journal of the American Statistical Association.
- abstract: The long-term consequences of unwanted pregnancies carried to term on mothers have not been explored much. We use data from the Wisconsin Longitudinal Study (WLS) and propose a novel approach, namely two team cross-screening, to study the possible effects of unwanted pregnancies carried to term on various aspects of mothers' later life mental health, physical health, economic well-being, and life satisfaction. Our approach, unlike existing approaches to observational studies, enables investigators to perform exploratory data analysis, confirmatory data analysis, and replication in the same study. This is a valuable property when there is only one data set available with unique strengths. In two team cross-screening, the investigators split themselves into two teams and the data is split as well according to a meaningful covariate. Each team then performs an exploratory data analysis on its part of the data to design an analysis plan for the other part of the data. The complete freedom of the teams in designing the analysis has the potential to generate new unanticipated hypotheses in addition to a prefixed set of hypotheses. Moreover, only the hypotheses that looked promising in the data each team explored are forwarded for analysis (thus alleviating the multiple testing problem). These advantages are demonstrated in our study of the effects of unwanted pregnancies on mothers' later life outcomes.
- Paper: Link.
Frostig, T. and Heller, R.
Inferring on joint associations from marginal associations and a reference sample (2025)
Biometrical Journal.
- abstract: We present a method to infer on joint regression coefficients obtained from marginal regressions using a reference panel. This type of scenario is common in genetic fine-mapping, where the estimated marginal associations are reported in genomewide association studies (GWAS), and a reference panel is used for inference on the association in a joint regression model. We show that ignoring the uncertainty due to the use of a reference panel instead of the original design matrix can lead to a severe inflation of false discoveries and a lack of replicable findings. We derive the asymptotic distribution of the estimated coefficients in the joint regression model, and show how it can be used to produce valid inference. We address two settings: inference within regions that are pre-selected, as well as within regions that are selected based on the same data. By means of real data examples and simulations we demonstrate the usefulness of our suggested methodology.
- Paper: Link.
Karmakar, R., Heller, R., Rosset, S.
Inference with approximate local false discovery rates (2025)
Biometrics.
- abstract: Efron's two-group model is widely used in large scale multiple testing. This model assumes that test statistics are mutually independent, however in realistic settings they are typically dependent, and taking the dependence into account can boost power. The general two-group model takes the dependence between the test statistics into account. Optimal policies in the general two-group model require calculation, for each hypothesis, of the probability that it is a true null given all test statistics, denoted local false discovery rate (locFDR). Unfortunately, calculating locFDRs under realistic dependence structures can be computationally prohibitive. We propose calculating approximate locFDRs based on a properly defined N-neighborhood for each hypothesis. We prove that by thresholding the approximate locFDRs with a fixed threshold, the marginal false discovery rate is controlled for any dependence structure. Furthermore, we prove that this is the optimal procedure in a restricted class of decision rules, where decision for each hypothesis is only guided by its N-neighborhood. We show through extensive simulations that our proposed method achieves substantial power gains compared to alternative practical approaches, while maintaining conceptual simplicity and computational feasibility. We demonstrate the utility of our method on a genome wide association study of height.
- Paper: Link.
Gazin, U., Heller, R., Roquain, E., and Solari, A.
Powerful batch conformal prediction for classification (2025)
AISTATS 2025.
- abstract: In a split conformal framework with $K$ classes, a calibration sample of $n$ labeled examples is observed for inference on the label of a new unlabeled example. We explore the setting where a `batch' of $m$ independent such unlabeled examples is given, and the goal is to construct a batch prediction set with $1-\alpha$ coverage. Unlike individual prediction sets, the batch prediction set is a collection of label vectors of size $m$, while the calibration sample consists of univariate labels. A natural approach is to apply the Bonferroni correction, which concatenates individual prediction sets at level $1-\alpha/m$. We propose a uniformly more powerful solution, based on specific combinations of conformal $p$-values that exploit the Simes inequality. We provide a general recipe for valid inference with any combinations of conformal $p$-values, and compare the performance of several useful choices. Intuitively, the pooled evidence of relatively `easy' examples within the batch can help provide narrower batch prediction sets. Additionally, we introduce a more computationally intensive method that aggregates batch scores and can be even more powerful. The theoretical guarantees are established when all examples are independent and identically distributed (iid), as well as more generally when iid is assumed only conditionally within each class. Notably, our results remain valid under label distribution shift, since the distribution of the labels need not be the same in the calibration sample and in the new batch. The effectiveness of the methods is highlighted through illustrative synthetic and real data examples.
- Paper: Link.
Dickhaus, T., Heller, R., Hoang, A., and Rinott, Y.
A procedure for multiple testing of partial conjunction hypotheses based on a hazard rate inequality (2025)
Bernoulli.
- abstract: The partial conjunction null hypothesis is tested in order to discover a signal that is present in multiple studies. The standard approach of carrying out a multiple test procedure on the partial conjunction (PC) p-values can be extremely conservative. We suggest alleviating this conservativeness, by eliminating many of the conservative PC p-values prior to the application of a multiple test procedure. This leads to the following two step procedure: first, select the set with PC p-values below a selection threshold; second, within the selected set only, apply a family-wise error rate or false discovery rate controlling procedure on the conditional PC p-values. The conditional PC p-values are valid if the null p-values are uniform and the combining method is Fisher. The proof of their validity is based on a novel inequality in hazard rate order of partial sums of order statistics which may be of independent interest. We also provide the conditions for which the false discovery rate controlling procedures considered will be below the nominal level. We demonstrate the potential usefulness of our novel method, CoFilter (conditional testing after filtering), for analyzing multiple genome wide association studies of Crohn's disease.
- Paper: Link.
Gazin, U., Heller, R., Marandon, A., and Roquain, E.
Selecting informative conformal prediction sets with false coverage rate control (2024)
Journal of the Royal Statistical Society (JRSS), series B.
- abstract: In supervised learning, including regression and classification, conformal methods provide prediction sets for the outcome/label with finite sample coverage for any machine learning predictors. We consider here the case where such prediction sets come after a selection process. The selection process requires that the selected prediction sets be `informative' in a well defined sense. We consider both the classification and regression settings where the analyst may consider as informative only the sample with prediction label sets or prediction intervals small enough, excluding null values, or obeying other appropriate `monotone' constraints. Our framework encompasses many additional notions of informativeness of possible interest in various applications. We develop a unified framework for building such informative conformal prediction sets while controlling the false coverage rate (FCR) on the selected sample. While conformal prediction sets after selection have been the focus of much recent literature in the field, the new introduced procedures, called InfoSP and InfoSCOP, are to our knowledge the first ones providing FCR control for informative prediction sets. We show the usefulness of our resulting procedures on real and simulated data.
- Paper: Link.
Heller, R., and Solari, A.
Simultaneous directional inference (2023)
Journal of the Royal Statistical Society (JRSS), series B.
- abstract: We consider the problem of inference on the signs of n>1 parameters. Within a simultaneous inference framework, we aim to: identify as many of the signs of the individual parameters as possible; provide confidence bounds on the number of positive (or negative) parameters on subsets of interest. Our suggestion is as follows: start by using the data to select the direction of the hypothesis test for each parameter; then, adjust the one-sided p-values for the selection, and use them for simultaneous inference on the selected n one-sided hypotheses. The adjustment is straightforward assuming that the one-sided p-values are conditionally valid and mutually independent. Such assumptions are commonly satisfied in a meta-analysis, and we can apply our approach following a test of the global null hypothesis that all parameters are zero, or of the hypothesis of no qualitative interaction. We consider the use of two multiple testing principles: closed testing and partitioning. The novel procedure based on partitioning is more powerful, but slightly less informative: it only infers on positive and non-positive signs. The procedure takes at most a polynomial time, and we show its usefulness on a subgroup analysis of a medical intervention, and on a meta-analysis of an educational intervention.
- Paper: Link.
Bogomolov, M. and Heller, R.
Replicability Across Multiple Studies (2023)
Statistical Science.
- abstract: Meta-analysis is routinely performed in many scientific disciplines. This analysis is attractive since discoveries are possible even when all the individual studies are underpowered. However, the meta-analytic discoveries may be entirely driven by signal in a single study, and thus non-replicable. Although the great majority of meta-analyses carried out to date do not infer on the replicability of their findings, it is possible to do so. We provide a selective overview of analyses that can be carried out towards establishing replicability of the scientific findings. We describe methods for the setting where a single outcome is examined in multiple studies (as is common in systematic reviews of medical interventions), as well as for the setting where multiple studies each examine multiple features (as in genomics applications). We also discuss some of the current shortcomings and future directions.
- Paper: Link.
Heller, R., Krieger, A., and Rosset, S.
Optimal multiple testing and design in clinical trials (2022)
Biometrics.
- abstract: A central goal in designing clinical trials is to find the test that maximizes power (or equivalently minimizes required sample size) for finding a false null hypothesis subject to the constraint of type I error. When there is more than one test, such as in clinical trials with multiple endpoints, the issues of optimal design and optimal procedures become more complex. In this paper, we address the question of how such optimal tests should be defined and how they can be found. We review different notions of power and how they relate to study goals, and also consider the requirements of type I error control and the nature of the procedures. This leads us to an explicit optimization problem with objective and constraints that describe its specific desiderata. We present a complete solution for deriving optimal procedures for two hypotheses, which have desired monotonicity properties, and are computationally simple. For some of the optimization formulations this yields optimal procedures that are identical to existing procedures, such as Hommel's procedure or the procedure of Bittman et al. (2009), while for other cases it yields completely novel and more powerful procedures than existing ones. We demonstrate the nature of our novel procedures and their improved power extensively in a simulation and on the APEX study (Cohen et al., 2016).
- Paper: Link.
- YoungStatS blog entry: Generalizing the Neyman-Pearson Lemma for multiple hypothesis testing problems
Jaljuli, I., Benjamini, Y., Shenhav, L., Panagiotou, O., and Heller, R.
Quantifying replicability and consistency in systematic reviews (2022)
Statistics in biopharmaceutical research.
- abstract: Systematic reviews and meta-analyses are important tools for synthesizing evidence from multiple studies. They serve to increase power and improve precision, in the same way that large studies can do, but also to establish the consistency of effects and replicability of results across studies. In this work we propose statistical tools to quantify replicability of effect signs (or directions) and their consistency. We suggest that these tools accompany the fixed-effect or random-effects meta-analysis, and we show that they convey important information for the assessment of the intervention under investigation. We motivate and demonstrate our approach and its implications by examples from systematic reviews from the Cochrane Library. Our tools make no assumptions on the distribution of the true effect sizes, so their inferential guarantees continue to hold even if the assumptions of the fixed-effect or random-effects models do not hold. We also develop a version of this tool under the fixed-effect assumption for cases where it is crucial and justified.
- Paper (pre-submission version): PDF file.
- R Package: metarep
Brill, B., Amir, A., and Heller, R.
Testing for differential abundance in compositional counts data, with application to microbiome studies (2022)
The Annals of Applied Statistics.
- abstract: Identifying which taxa in our microbiota are associated with traits of interest is important for advancing science and health. However, the identification is challenging because the measured vector of taxa counts (by amplicon sequencing) is compositional, so a change in the abundance of one taxon in the microbiota induces a change in the number of sequenced counts across all taxa. The data are typically sparse, with many zero counts present either due to biological variance or limited sequencing depth. We examine the case of Crohn's disease, where the microbial load changes substantially with the disease. For this representative example of a highly compositional setting, we show existing methods designed to identify differentially abundant taxa may have an inflated number of false positives. We introduce a novel non-parametric approach that provides valid inference even when the fraction of zero counts is substantial. Our approach uses a set of reference taxa that are non-differentially abundant, which can be estimated from the data or from outside information. Our approach also allows for a novel type of testing: multivariate tests of differential abundance over a focused subset of the taxa. Genera level multivariate testing discovers additional genera as differentially abundant by avoiding agglomeration of taxa.
- Paper: PDF file.
- R Package: DACOMP
Haroush, M., Frostig, T., Heller, R., and Soudry, D.
A statistical framework for efficient out of distribution detection in deep neural networks (2022)
The tenth International Conference on Learning Representations (ICLR).
- abstract: Commonly, Deep Neural Networks (DNNs) generalize well on samples drawn from a distribution similar to that of the training set. However, DNNs' predictions are brittle and unreliable when the test samples are drawn from a dissimilar distribution. This is a major concern for deployment in real world applications, where such behavior may come at a considerable cost, such as industrial production lines, autonomous vehicles, or health care applications. We frame Out Of Distribution (OOD) detection in DNNs as a statistical hypothesis testing problem. Tests generated within our proposed framework combine evidence from the entire network. Unlike previous OOD detection heuristics, this framework returns a p-value for each test sample. It is guaranteed to maintain the Type I Error (T1E - incorrectly predicting OOD for an actual in-distribution sample) for test data. Moreover, this allows to combine several detectors while maintaining the T1E. Building on this framework, we suggest a novel OOD procedure based on low-order statistics. Our method achieves comparable or better results than state-of-the-art methods on well-accepted OOD benchmarks, without retraining the network parameters or assuming prior knowledge on the test distribution, and at a fraction of the computational cost.
- Paper: PDF file.
Rosset, S., Heller, R., Painsky, A., and Aharoni, E.
Optimal and maximin procedures for multiple testing problems (2022)
Journal of the Royal Statistical Society (JRSS), series B.
- abstract: Multiple testing problems are a staple of modern statistical analysis. The fundamental objective of multiple testing procedures is to reject as many false null hypotheses as possible (that is, maximize some notion of power), subject to controlling an overall measure of false discovery, like family-wise error rate (FWER) or false discovery rate (FDR). In this paper we formulate multiple testing of simple hypotheses as an infinite-dimensional optimization problem, seeking the most powerful rejection policy which guarantees strong control of the selected measure. In that sense, our approach is a generalization of the optimal Neyman-Pearson test for a single hypothesis. We show that for exchangeable hypotheses, for both FWER and FDR and relevant notions of power, these problems can be formulated as infinite linear programs and can in principle be solved for any number of hypotheses. We apply our results to derive explicit optimal tests for FWER or FDR control for three independent normal means. We find that the power gain over natural competitors is substantial in all settings examined. We also characterize maximin rules for complex alternatives, and demonstrate that such rules can be found in practice, leading to improved practical procedures compared to existing alternatives.
- Paper: Link.
- RMD file: Example
- HTML file: Example
Panagiotou, O.A. and Heller, R.
Inferential Challenges for Real-world Evidence in the Era of Routinely Collected Health Data: Many Researchers, Many More Hypotheses, a Single Database (2021)
JAMA Oncology.
- Paper: Link.
Heifetz, A., Heller, R., and Ostreiher, R.
Do Arabian babblers play mixed strategies in a "volunteer's dilemma"? (2021)
Journal of Behavioral and Experimental Economics.
- Abstract: When group-living Arabian babbler songbirds hear a sentinel alarm call that indicates a raptor approach, they should instantaneously choose whether to flee to shelter, or rather to expose themselves while calling towards the raptor to communicate to it its detection. If enough group members thus signal to the raptor their vigilance, the raptor is likely to be dissuaded from attacking the group. Groupmates thus engage in a variant of the "volunteer's dilemma" game (Diekmann, 1985), whose symmetric equilibrium is in mixed strategies. In a field experiment, we check whether in nature Arabian babblers indeed make independent randomized choices upon hearing alarm calls, in natural conditions as well as in a controlled experiment in which recorded alarm calls were broadcast to group members. We use a resampling method to check for independence across group members in their reactions to sentinel alarm calls. In natural conditions independent mixed-strategy behaviour was refuted, and not refuted in the artificial conditions of the experiment. This is the first real-world test of mixed-strategy behaviour in games with more than two players.
- Paper: Link.
Heller, R. and Rosset, S.
Optimal control of false discovery criteria in the two-group model (2020)
Journal of the Royal Statistical Society (JRSS), series B.
- Abstract: The highly influential two-group model in testing a large number of statistical hypotheses assumes that the test statistics are drawn independently from a mixture of a high probability null distribution and a low probability alternative. Optimal control of the marginal false discovery rate (mFDR), in the sense that it provides maximal power (expected true discoveries) subject to mFDR control, is known to be achieved by thresholding the local false discovery rate (locFDR), i.e., the probability of the hypothesis being null given the set of test statistics, with a fixed threshold. We address the challenge of controlling optimally the popular false discovery rate (FDR) or positive FDR (pFDR) rather than mFDR in the general two-group model, which also allows for dependence between the test statistics. These criteria are less conservative than the mFDR criterion, so they make more rejections in expectation. We derive their optimal multiple testing (OMT) policies, which turn out to be thresholding the locFDR with a threshold that is a function of the entire set of statistics. We develop an efficient algorithm for finding these policies, and use it for problems with thousands of hypotheses. We illustrate these procedures on gene expression studies.
- Paper (pre-submission version): Link.
Panagiotou, O.A., Jaljuli, I., and Heller, R.
Replicability of Treatment Effect in Study of Blood Pressure Lowering With Dementia (2020)
JAMA.
- Paper: Link.
Heller, R.
Comments on: Hierarchical inference for genome-wide association studies: a view on methodology with software (2020)
Computational Statistics.
- Paper: Link.
Heller, R., Meir, A., and Chatterjee, N.
Post-selection estimation and testing following aggregated association tests (2019)
Journal of the Royal Statistical Society (JRSS), series B.
- Abstract: The practice of pooling several individual test statistics to form aggregate tests is common in many statistical applications where individual tests may be underpowered. While selection by aggregate tests can serve to increase power, the selection process invalidates the inference based on the individual test-statistics, making it difficult to identify the ones that drive the signal in follow-up inference. Here, we develop a general approach for valid inference following selection by aggregate testing. We present novel powerful post-selection tests for the individual null hypotheses which are exact for the normal model and asymptotically justified otherwise. Our approach relies on the ability to characterize the distribution of the individual test statistics after conditioning on the event of selection. We provide efficient algorithms for computation of the post-selection maximum-likelihood estimates and suggest confidence intervals which rely on a novel switching regime for good coverage guarantees. We validate our methods via comprehensive simulation studies and apply them to data from the Dallas Heart Study, demonstrating that single variant association discovery following selection by an aggregate test is indeed possible in practice.
- Paper (pre-submission version): Link.
- R Package: PSAT
Bogomolov, M., Heller, R.
Assessing replicability of findings across two studies of multiple features (2018)
Biometrika.
- Abstract: Replicability analysis aims to identify the overlapping signals across independent studies that examine the same features. For this purpose we develop hypothesis testing procedures that first select the promising features from each study separately. Only those features selected in both studies are then tested. The proposed procedures have theoretical guarantees regarding their control of the familywise error rate or false discovery rate on the replicability claims. They can also be used for signal discovery in each study separately, with the desired error control. Their power for detecting truly replicable findings is compared to alternatives. We illustrate the procedures on behavioural genetics data.
- Paper: HTML.
- R Package: radjust
Brill, B., Heller, Y., and Heller, R.
Nonparametric independence tests and K-sample tests for large sample sizes, using package HHG (2018)
R Journal.
- Abstract: Nonparametric tests of independence and K-sample tests are ubiquitous in modern applications, but they are typically computationally expensive. We present a family of nonparametric tests that are computationally efficient and powerful for detecting any type of dependence between a pair of univariate random variables. The computational complexity of the suggested tests is sub-quadratic in sample size, allowing calculation of test statistics for millions of observations. We survey both algorithms and the HHG package in which they are implemented, with usage examples showing the implementation of the proposed tests for both the independence case and the K-sample problem. The tests are compared to existing nonparametric tests via several simulation studies comparing both runtime and power. Special focus is given to the design of data structures used in implementation of the tests. These data structures can be useful for developers of nonparametric distribution-free tests.
- Paper: PDF file.
- R Package: HHG
Sampson, J., Boca, S., Moore, S., Heller, R.
FWER and FDR control when testing multiple mediators (2018)
Bioinformatics.
- Abstract: The biological pathways linking exposures and disease risk are often poorly understood. To gain insight into these pathways, studies may try to identify biomarkers that mediate the exposure/disease relationship. Such studies often simultaneously test hundreds or thousands of biomarkers. We consider a set of m biomarkers and a corresponding set of null hypotheses, where the jth null hypothesis states that biomarker j does not mediate the exposure/disease relationship. We propose a Multiple Comparison Procedure (MCP) that rejects a set of null hypotheses or, equivalently, identifies a set of mediators, while asymptotically controlling the Family-Wise Error Rate (FWER) or False Discovery Rate (FDR). We use simulations to show that, compared to currently available methods, our proposed method has higher statistical power to detect true mediators. We then apply our method to a breast cancer study and identify nine metabolites that may mediate the known relationship between an increased BMI and an increased risk of breast cancer.
- Paper: HTML.
Karmakar, B., Heller, R. and Small, D.
False discovery rate control for effect modification in observational studies (2018)
Electronic Journal of Statistics.
- Abstract: In an observational study, a difference between the treatment and control group's outcome might reflect the bias in treatment assignment rather than a true treatment effect. A sensitivity analysis determines the magnitude of this bias that would be needed to explain away as noncausal a significant treatment effect from a naive analysis that assumed no bias. Effect modification is the interaction between a treatment and a pretreatment covariate. In an observational study, there are often many possible effect modifiers and it is desirable to be able to look at the data to identify the effect modifiers that will be tested. For observational studies, we address simultaneously the problem of accounting for the multiplicity involved in choosing effect modifiers to test among many possible effect modifiers by looking at the data and conducting a proper sensitivity analysis. We develop an approach that provides finite sample false discovery rate control for a collection of adaptive hypotheses identified from the data on matched-pairs design. Along with simulation studies, an empirical study is presented on the effect of cigarette smoking on lead level in the blood using data from the U.S. National Health and Nutrition Examination Survey. Other applications of the suggested method are briefly discussed.
- Paper: Link.
Heller, R., Chatterjee, N., Krieger, A., and Shi, J.
Post-selection Inference Following Aggregate Level Hypothesis Testing in Large Scale Genomic Data (2017)
Journal of the American Statistical Association.
- Abstract: In many genomic applications, hypotheses tests are performed by aggregating test-statistics across units within naturally defined classes for powerful identification of signals. Following class-level testing, it is naturally of interest to identify the lower level units which contain true signals. Testing the individual units within a class without taking into account the fact that the class was selected using an aggregate-level test-statistic, will produce biased inference. We develop a hypothesis testing framework that guarantees control for false positive rates conditional on the fact that the class was selected. Specifically, we develop procedures for calculating unit level p-values that allow rejection of null hypotheses controlling for two types of conditional error rates, one relating to family-wise error rate and the other relating to false discovery rate. We use simulation studies to illustrate validity and power of the proposed procedure in comparison to several possible alternatives. We illustrate the power of the method in a natural application involving whole-genome expression quantitative trait loci (eQTL) analysis across 17 tissue types using data from The Cancer Genome Atlas (TCGA) Project.
- Paper: Link.
- R Package: PSAT
Jiang L., Amir A., Morton J., Heller R., Arias-Castro E., and Knight R.
Discrete False-Discovery Rate Improves Identification of Differentially Abundant Microbes (2017)
mSystems, doi: 10.1128/mSystems.00092-17.
- Abstract: Differential abundance testing is a critical task in microbiome studies that is complicated by the sparsity of data matrices. Here we adapt for microbiome studies a solution from the field of gene expression analysis to produce a new method, discrete false-discovery rate (DS-FDR), that greatly improves the power to detect differential taxa by exploiting the discreteness of the data. Additionally, DS-FDR is relatively robust to the number of noninformative features, and thus removes the problem of filtering taxonomy tables by an arbitrary abundance threshold. We show by using a combination of simulations and reanalysis of nine real-world microbiome data sets that this new method outperforms existing methods at the differential abundance testing task, producing a false-discovery rate that is up to threefold more accurate, and halves the number of samples required to find a given difference (thus increasing the efficiency of microbiome experiments considerably). We therefore expect DS-FDR to be widely applied in microbiome studies. IMPORTANCE: DS-FDR can achieve higher statistical power to detect significant findings in sparse and noisy microbiome data compared to the commonly used Benjamini-Hochberg procedure and other FDR-controlling procedures.
- Paper: Link.
Sun L., Subar A.F., Bosire C., Dawsey S.M., Kahle L.L., Zimmerman T.P., Abnet C.C., Heller R., Graubard B.I., Cook M.B., and Petrick J.L.
Dietary Flavonoid Intake Reduces the Risk of Head and Neck but Not Esophageal or Gastric Cancer in US Men and Women.(2017)
J Nutr, pii: jn251579. doi: 10.3945/jn.117.251579.
- Abstract: Background: Flavonoids are bioactive polyphenolic compounds found in fruits, vegetables, and beverages of plant origin. Previous studies have shown that flavonoid intake reduces the risk of certain cancers; however, few studies to date have examined associations of flavonoids with upper gastrointestinal cancers or used prospective cohorts. Objective: Our study examined the association between intake of flavonoids (anthocyanidins, flavan-3-ols, flavanones, flavones, flavonols, and isoflavones) and risk of head and neck, esophageal, and gastric cancers. Methods: The NIH-AARP Diet and Health Study is a prospective cohort study that consists of 469,008 participants. Over a mean 12-y follow-up, 2453 head and neck (including 1078 oral cavity, 424 pharyngeal, and 817 laryngeal), 1165 esophageal (890 adenocarcinoma and 275 squamous cell carcinoma), and 1297 gastric (625 cardia and 672 noncardia) cancer cases were identified. We used Cox proportional hazards regression models to estimate HRs and CIs for the associations between flavonoid intake assessed at study baseline and cancer outcomes. For 56 hypotheses examined, P-trend values were adjusted using the Benjamini-Hochberg (BH) procedure for false discovery rate control. Results: The highest quintile of total flavonoid intake was associated with a 24% lower risk of head and neck cancer (HR: 0.76; 95% CI: 0.66, 0.86; BH-adjusted 95% CI: 0.63, 0.91; P-trend = 0.02) compared with the lowest quintile. Notably, anthocyanidins were associated with a 28% lower risk of head and neck cancer (HR: 0.72; 95% CI: 0.62, 0.82; BH-adjusted 95% CI: 0.59, 0.87; P-trend = 0.0005), and flavanones were associated with a 22% lower risk of head and neck cancer (HR: 0.78; 95% CI: 0.68, 0.89; BH-adjusted 95% CI: 0.64, 0.94; P-trend = 0.02). No associations between flavonoid intake and risk of esophageal or gastric cancers were found. Conclusions: Our results indicate that flavonoid intake is associated with lower head and neck cancer risk. These associations suggest a protective effect of dietary flavonoids on head and neck cancer risk, and thus potential as a risk reduction strategy.
- Paper: Link.
Eilenberg, R. and Heller, R.
On the use of balancing scores and matching in testing for exposure effect in case-control studies(2017)
Statistics and Its Interface, Vol. 11, No. 1, pp.51-60.
- Abstract: Balancing scores, especially the propensity score, are widely used to adjust for measured confounders in prospective studies. In case-control studies, the distribution of the exposure and outcome given the covariates is distorted when there is an exposure effect, due to the selection process. Therefore, it is less obvious how to estimate balancing scores. Extensive simulations revealed several interesting findings on the use of estimated balancing scores in testing for exposure effect. First, obtaining matched sets with a low absolute standardized difference in covariate means was far easier with the aid of an estimated balancing score than without it. Second, the estimation approach matters, and several potential approaches result in an inflation of the type I error probability. Third, using full matching on covariates and on the estimated balancing score for testing for exposure effect is preferred over covariate adjustment (which has reduced power) and over stratification (which is sensitive to the number of strata, and does not make full use of the observed covariates). We show the usefulness of full matching with the aid of our recommended approach to estimating the balancing score in a case-control study.
- Paper: Link.
Sofer, T., Heller, R., Bogomolov, M., Avery, C., Graff, M., North, K., Reiner, A., Thornton, T., Rice, K., Benjamini, Y., Laurie, C., and Kerr, K.
A Powerful Statistical Framework for Generalization Testing in GWAS, with Application to the HCHS/SOL(2017)
Genetic Epidemiology.
- Abstract: In GWAS, "generalization" is the replication of genotype-phenotype association in a population with different ancestry than the population in which it was first identified. The standard for reporting findings from a GWAS requires a two-stage design, in which discovered associations are replicated in an independent follow-up study. Current practices for declaring generalizations rely on testing associations while controlling the Family Wise Error Rate (FWER) in the discovery study, then separately controlling error measures in the follow-up study. While this approach limits false generalizations, we show that it does not guarantee control over the FWER or False Discovery Rate (FDR) of the generalization null hypotheses. In addition, it fails to leverage the two-stage design to increase power for detecting generalized associations. We develop a formal statistical framework for quantifying the evidence of generalization that accounts for the (in)consistency between the directions of associations in the discovery and follow-up studies. We develop the directional generalization FWER (FWERg) and FDR (FDRg) controlling r-values, which are used to declare associations as generalized. This framework extends to generalization testing when applied to a published list of SNP-trait associations. We show that our framework accommodates various SNP selection rules for generalization testing based on p-values in the discovery study, and still controls FWERg or FDRg. A key finding is that it is often beneficial to use a more lenient p-value threshold than the genome-wide significance threshold. For instance, in a GWAS of Total Cholesterol (TC) in the Hispanic Community Health Study/Study of Latinos (HCHS/SOL), when testing all SNPs with p-values < 5 × 10^(-8) (15 genomic regions) for generalization in a large GWAS of whites, we generalized SNPs from 15 regions. But when testing all SNPs with p-values < 6.6 × 10^(-5) (89 regions), we generalized SNPs from 27 regions.
- Paper: Link.
- Web Applet: ReplicabilityFDR
Karp, N.A., Heller, R., Yaacoby, S., White, J.K., and Benjamini, Y.
Improving the Identification of Phenotypic Abnormalities and Sexual Dimorphism in Mice When Studying Rare Event Categorical Characteristics(2016)
Genetics, DOI: 10.1534/genetics.116.195388.
- Abstract: Biological research frequently involves the study of phenotyping data. Many of these studies focus on rare event categorical data, and in functional genomics typically study the presence or absence of an abnormal phenotype. With the growing interest in the role of sex, there is a need to assess the phenotype for sexual dimorphism. The identification of abnormal phenotypes for downstream research is challenged by the small sample size, the rare event nature, and the multiple testing problem, as many variables are monitored simultaneously. Here we develop a statistical pipeline to assess statistical and biological significance whilst managing the multiple testing problem. We propose a two-step pipeline to initially assess for a treatment effect, in our case example genotype, and then test for an interaction with sex. We compare multiple statistical methods and use simulations to investigate the control of the type one error rate and power. To maximize the power whilst addressing the multiple testing issue we implement filters to remove datasets where the hypotheses to be tested cannot achieve significance. A motivating case study utilizing a large scale high throughput mouse phenotyping dataset from the Wellcome Trust Sanger Institute Mouse Genetics Project, where the treatment is a gene ablation, demonstrates the benefits of the new pipeline on the downstream biological calls.
- Paper: Link.
- Software: Link.
Heller, R. and Heller, Y.
Multivariate tests of association based on univariate tests(2016)
Neural Information Processing Systems (NIPS) 2016, Barcelona, Spain.
- Abstract: For testing two random vectors for independence, we consider testing whether the distance of one vector from a center point is independent from the distance of the other vector from a center point by a univariate test. In this paper we provide conditions under which it is enough to have a consistent univariate test of independence on the distances to guarantee that the power to detect dependence between the random vectors increases to one, as the sample size increases. These conditions turn out to be minimal. If the univariate test is distribution-free, the multivariate test will also be distribution-free. If we consider multiple center points and aggregate the center-specific univariate tests, the power may be further improved, and the resulting multivariate test may be distribution-free for specific aggregation methods (if the univariate test is distribution-free). We show that several multivariate tests recently proposed in the literature can be viewed as instances of this general approach.
- Paper: Link.
Heller, R., Heller, Y., Kaufman, S., Brill, B. and Gorfine, M.
Consistent distribution-free K-sample and independence tests for univariate random variables(2016)
Journal of Machine Learning Research, Vol. 17.
- Abstract: A popular approach for testing if two univariate random variables are statistically independent consists of partitioning the sample space into bins, and evaluating a test statistic on the binned data. The partition size matters, and the optimal partition size is data dependent. While for detecting simple relationships coarse partitions may be best, for detecting complex relationships a great gain in power can be achieved by considering finer partitions. We suggest novel consistent distribution-free tests that are based on summation or maximization aggregation of scores over all partitions of a fixed size. We show that our test statistics based on summation can serve as good estimators of the mutual information. Moreover, we suggest regularized tests that aggregate over all partition sizes, and prove those are consistent too. We provide polynomial-time algorithms, which are critical for computing the suggested test statistics efficiently. We show that the power of the regularized tests is excellent compared to existing tests, and almost as powerful as the tests based on the optimal (yet unknown in practice) partition size, in simulations as well as on a real data example.
- Paper: Link.
- R Package: HHG
Angelini, C., Heller, R., Volkinshtein, R., and Yekutieli, D.
Is this the right normalization? A diagnostic tool for ChIP-seq normalization.(2015)
BMC Bioinformatics, Vol. 16, No. 150.
- Abstract: Background: ChIP-seq experiments are becoming a standard approach for genome-wide profiling of protein-DNA interactions, such as detecting transcription factor binding sites, histone modification marks and RNA Polymerase II occupancy. However, when comparing a ChIP sample versus a control sample, such as Input DNA, normalization procedures have to be applied in order to remove experimental sources of bias. Despite the substantial impact that the choice of the normalization method can have on the results of a ChIP-seq data analysis, their assessment is not fully explored in the literature. In particular, there are no diagnostic tools that show whether the applied normalization is indeed appropriate for the data being analyzed. Results: In this work we propose a novel diagnostic tool to examine the appropriateness of the estimated normalization procedure. By plotting the empirical densities of log relative risks in bins of equal read count, along with the estimated normalization constant, after logarithmic transformation, the researcher is able to assess the appropriateness of the estimated normalization constant. We use the diagnostic plot to evaluate the appropriateness of the estimates obtained by CisGenome, NCIS and CCAT on several real data examples. Moreover, we show the impact that the choice of the normalization constant can have on standard tools for peak calling such as MACS or SICER. Finally, we propose a novel procedure for controlling the FDR using sample swapping. This procedure makes use of the estimated normalization constant in order to gain power over the naive choice of constant (used in MACS and SICER), which is the ratio of the total number of reads in the ChIP and Input samples. Conclusions: Linear normalization approaches aim to estimate a scale factor, r, to adjust for different sequencing depths when comparing ChIP versus Input samples. The estimated scaling factor can easily be incorporated in many peak caller algorithms to improve the accuracy of the peak identification. The diagnostic plot proposed in this paper can be used to assess how adequate ChIP/Input normalization constants are, and thus it allows the user to choose the most adequate estimate for the analysis.
- Paper: Link.
M. Gorfine, B. Goldstein, A. Fishman, R. Heller, Y. Heller, A. Lamm
Function of cancer associated genes revealed by modern univariate and multivariate association tests(2015)
PLOS ONE, DOI: 10.1371/journal.pone.0126544.
- Abstract: Copy number variation (CNV) plays a role in pathogenesis of many human diseases, especially cancer. Several whole genome CNV association studies have been performed for the purpose of identifying cancer associated CNVs. Here we undertook a novel approach to whole genome CNV analysis, with the goal being identification of associations between CNV of different genes (CNV-CNV) across 60 human cancer cell lines. We hypothesize that these associations point to the roles of the associated genes in cancer, and can be indicators of their position in gene networks of cancer-driving processes. Recent studies show that gene associations are often non-linear and non-monotone. In order to obtain a more complete picture of all CNV associations, we performed omnibus univariate analysis by utilizing dCov, MIC, and HHG association tests, which are capable of detecting any type of associations, including non-monotone relationships. For comparison we used Spearman and Pearson association tests, which detect only linear or monotone relationships. Application of dCov, MIC and HHG tests resulted in identification of twice as many associations compared to those found by Spearman and Pearson alone. Interestingly, most of the new associations were detected by the HHG test. Next, we utilized dCov's and HHG's ability to perform multivariate analysis. We tested for association between genes of unknown function and known cancer-related pathways. Our results indicate that multivariate analysis is much more effective than univariate analysis for the purpose of ascribing biological roles to genes of unknown function. We conclude that a combination of multivariate and univariate omnibus association tests can reveal significant information about gene networks of disease-driving processes. These methods can be applied to any large gene or pathway dataset, allowing more comprehensive analysis of biological processes.
- Paper: Link.
Heller, R., Bogomolov, M., and Benjamini, Y.
Deciding whether follow-up studies have replicated findings in a preliminary large-scale "omics" study (2014)
Proceedings of the National Academy of Sciences (PNAS), Vol. 111, Pp. 16262-16267.
- Abstract: We propose a formal method to declare that findings from a primary study have been replicated in a follow-up study. Our proposal is appropriate for primary studies that involve large-scale searches for rare true positives (i.e. needles in a haystack). Our proposal assigns an r-value to each finding; this is the lowest false discovery rate at which the finding can be called replicated. Examples are given and software is available.
- Paper: PDF file.
- Web Applet: ReplicabilityFDR
- R Script: ReplicabilityFDR
Heller R., Yaacoby S., and Yekutieli D.
repfdr: A tool for replicability analysis for genome-wide association studies (2014)
Bioinformatics, doi: 10.1093/bioinformatics/btu434
- Abstract: Identification of SNPs that are associated with a phenotype in more than one study is of great scientific interest in GWAS research. The empirical Bayes approach for discovering whether results have been replicated across studies was shown to be a reliable method, and close to optimal in terms of power. The R package repfdr provides a flexible implementation of the empirical Bayes approach for replicability analysis and meta-analysis, to be used when several studies examine the same set of null hypotheses. The usefulness of the package for the GWAS community is discussed.
- Paper: PDF file.
- R Package: repfdr
Heller, R. and Yekutieli, D.
Replicability analysis for genome-wide association studies (2014)
Annals of Applied Statistics, Vol. 8, No. 1, Pp. 481-498.
- Abstract: The paramount importance of replicating associations is well recognized in the genome-wide association (GWA) research community, yet methods for assessing replicability of associations are scarce. Published GWA studies often combine separately the results of primary studies and of the follow-up studies. Informally, reporting the two separate meta-analyses, that of the primary studies and follow-up studies, gives a sense of the replicability of the results. We suggest a formal empirical Bayes approach for discovering whether results have been replicated across studies, in which we estimate the optimal rejection region for discovering replicated results. We demonstrate, using realistic simulations, that the average false discovery proportion of our method remains small. We apply our method to six type two diabetes (T2D) GWA studies. Out of 803 SNPs discovered to be associated with T2D using a typical meta-analysis, we discovered 219 SNPs with replicated associations with T2D. We recommend complementing a meta-analysis with a replicability analysis for GWA studies.
- Paper: PDF file.
- R Package: repfdr
Bogomolov, M. and Heller, R.
Discovering findings that replicate from a primary study of high dimension to a follow-up study (2013)
Journal of the American Statistical Association, Vol. 108, No. 504, Pp. 1480-1492.
- Abstract: We consider the problem of identifying whether findings replicate from one study of high dimension to another, when the primary study guides the selection of hypotheses to be examined in the follow-up study as well as when there is no division of roles into the primary and the follow-up study. We show that existing meta-analysis methods are not appropriate for this problem, and suggest novel methods instead. We prove that our multiple testing procedures control for appropriate error-rates. The suggested FWER controlling procedure is valid for arbitrary dependence among the test statistics within each study. A more powerful procedure is suggested for FDR control. We prove that this procedure controls the FDR if the test statistics are independent within the primary study, and independent or have dependence of type PRDS in the follow-up study. For arbitrary dependence within the primary study, and either arbitrary dependence or dependence of type PRDS in the follow-up study, simple conservative modifications of the procedure control the FDR. We demonstrate the usefulness of these procedures via simulations and real data examples.
- Paper: PDF file.
Heller, R. and Heller, Y. and Gorfine, M.
A consistent multivariate test of association based on ranks of distances(2013)
Biometrika, Vol. 100, No. 2, Pp. 503-510.
- Abstract: We consider the detection of associations between random vectors of any dimension. Few tests of independence exist that are consistent against all dependent alternatives. We propose a powerful test that is applicable in all dimensions and is consistent against all alternatives. The test has a simple form, is easy to implement, and has good power.
- Paper: PDF file.
- R Package: HHG
Heller, R. and Gorfine, M. and Heller Y.
A class of multivariate distribution-free tests of independence based on graphs (2012)
Journal of Statistical Planning and Inference, Vol. 142, No. 12, Pp. 3097-3106.
- Abstract: A class of distribution-free tests is proposed for the independence of two subsets of response coordinates. The tests are based on the pairwise distances across subjects within each subset of the response. A complete graph is induced by each subset of response coordinates, with the sample points as nodes and the pairwise distances as the edge weights. The proposed test statistic depends only on the rank order of edges in these complete graphs. The response vector may be of any dimensions. In particular, the number of samples may be smaller than the dimensions of the response. The test statistic is shown to have a normal limiting distribution with known expectation and variance under the null hypothesis of independence. The exact distribution free null distribution of the test statistic is given for a sample of size 14, and its Monte-Carlo approximation is considered for larger sample sizes. We demonstrate in simulations that this new class of tests has good power properties for very general alternatives.
- Paper: PDF file.
Heller, R.
Discussion of "Multiple Testing for Exploratory Research" by J. J. Goeman and A. Solari (2012)
Statistical Science, Vol. 26, No. 4, Pp. 598-600.
- Abstract: Goeman and Solari [Statist. Sci. 26 (2011) 584-597] have addressed the interesting topic of multiple testing for exploratory research, and provided us with nice suggestions for exploratory analysis. They defined properties that an inferential procedure should have for exploratory analysis: the procedure should be mild, flexible and post hoc. Their inferential procedure gives a lower bound on the number of false hypotheses among the selected hypotheses, and moreover whenever possible identifies elementary hypotheses that are false. The need to estimate a lower bound on the number of false hypotheses arises in various applications, and the partial conjunction approach was developed for this purpose in Biometrics 64 (2008) 1215-1222 (see also Philos. Trans. R. Soc. Lond. Ser. A 367 (2009) 4255-4271 for more details). For example, in a combined analysis of several studies that examine the same problem, it is of interest to give a lower bound on the number of studies in which the finding was reproduced. I will first address the relation between the method of Goeman and Solari and the partial conjunction approach. Then I will discuss possible extensions and address the issue of exploration in more general settings, where the local test may not be defined in advance or where the candidate hypotheses may not be known to begin with.
- Paper: PDF file.

Heller, R.
Comment: Correlated z-values and the accuracy of large scale statistical estimates (2010)
Journal of the American Statistical Association, Vol. 105, No. 491, Pp. 1057-1059.
- Abstract: Professor Efron has given us an interesting article on how to quantify the uncertainty in summary statistics of interest in large scale problems, when the summary statistics are based on correlated normal variates. It is shown that the inflation in the accuracy estimate due to correlation among the normal variates cannot be ignored (except possibly at the very far tails of distributions). Using a series of simplifications of the covariance formula, a simple formula is derived and it is shown in a numerical example that the approximation is indeed very close to the truth. In particular it is shown that the entire correlation structure is captured by one parameter α, the rms correlation. Several methods of estimating α, as well as the other unknown parameters, are suggested. In what follows I will discuss several topics in large scale significance testing that are related to the results of this paper.
- Paper: PDF file.
Heller, R. and Rosenbaum, P.R. and Small, D.S.
Using the cross-match test to appraise covariate balance in matched pairs(2010)
The American Statistician, Vol. 64, No. 4, Pp. 299-309
- Abstract: Having created a tentative matched design for an observational study, diagnostic checks are performed to see whether observed covariates exhibit reasonable balance, or alternatively whether further effort is required to improve the match. We illustrate the use of the cross-match test as an aid to appraising balance on high-dimensional covariates, and we discuss its close logical connections to the techniques used to construct matched samples. In particular, in addition to a significance level, the cross-match test provides an interpretable measure of high-dimensional covariate balance, specifically a measure defined in terms of the propensity score. An example from the economics of education is used to illustrate. In the example, imbalances in an initial match guide the construction of a better match. The better match uses a recently proposed technique, optimal tapered matching, that leaves certain possibly innocuous covariates imbalanced in one match but not in another, and yields a test of whether the imbalances are actually innocuous.
- Paper: PDF file.
- R Package: Crossmatch
Heller, R. and Jensen, S.T. and Rosenbaum, P.R. and Small, D.S.
Sensitivity Analysis for the Cross-Match Test, With Applications in Genomics(2010)
Journal of the American Statistical Association, Vol. 105, No. 491, Pp. 1005-1013.
- Abstract: The cross-match test is an exact, distribution free test of no treatment effect on a high dimensional outcome in a randomized experiment. The test uses optimal nonbipartite matching to pair 2I subjects into I pairs based on similar outcomes, and the cross-match statistic A is the number of times a treated subject was paired with a control, rejecting for small values of A. If the test is applied in an observational study in which treatments are not randomly assigned, it may be comparing treated and control subjects who are not comparable, and may therefore falsely reject a true null hypothesis of no treatment effect. We develop a sensitivity analysis for the cross-match test, and apply it in an observational study of the effects of smoking on gene expression levels. In addition, we develop a sensitivity analysis for several multiple testing procedures using the cross-match test and apply it to 1627 molecular function categories in Gene Ontology.
- Paper: PDF file.
- R Package: Crossmatch
Benjamini, Y. and Heller, R. and Yekutieli, D.
Selective Inference in Complex Research(2009)
Philosophical Transactions of the Royal Society A, Vol. 367, No. 1906, Pp. 4255-4271.
- Abstract: We explain the problem of selective inference in complex research using a recently published study: a replicability study of the associations in order to reveal and establish risk loci for type 2 diabetes. The false discovery rate approach to such problems will be reviewed, and we further address two problems: (i) setting confidence intervals on the size of the risk at the selected locations and (ii) selecting the replicable results.
- Paper: PDF file.
Heller, R. and Manduchi, E. and Grant, G.R. and Ewens, W.J.
A flexible two-stage procedure for identifying gene sets that are differentially expressed(2009)
Bioinformatics, Vol. 25, No. 8, Pp. 1019-1025.
- Abstract: Motivation: Microarray data analysis has expanded from testing individual genes for differential expression to testing gene sets for differential expression. The tests at the gene set level may focus on multivariate expression changes or on the differential expression of at least one gene in the gene set. These tests may be powerful at detecting subtle changes in expression, but findings at the gene set level need to be examined further to understand whether they are informative and if so how. Results: We propose to first test for differential expression at the gene set level but then proceed to test for differential expression of individual genes within discovered gene sets. We introduce the overall FDR (OFDR) as an appropriate error rate to control when testing multiple gene sets and genes. We illustrate the advantage of this procedure over procedures that only test gene sets or individual genes. Availability: R code (www.r-project.org) for implementing our approach is included as supplementary material.
- Paper: PDF file.
- R code: in Software section.
Heller, R. and Rosenbaum, P.R. and Small, D.S.
Split samples and design sensitivity in observational studies(2009)
Journal of the American Statistical Association, Vol. 104, No. 487, Pp. 1090-1101.
- Abstract: An observational or nonrandomized study of treatment effects may be biased by failure to control for some relevant covariate that was not measured. The design of an observational study is known to strongly affect its sensitivity to biases from covariates that were not observed. For instance, the choice of an outcome to study, or the decision to combine several outcomes in a test for coherence can materially affect the sensitivity to unobserved biases. Decisions that shape the design are, therefore, critically important, but they are also difficult decisions to make in the absence of data. We consider the possibility of randomly splitting the data from an observational study into a smaller planning sample and a larger analysis sample, where the planning sample is used to guide decisions about design. After reviewing the concept of design sensitivity, we evaluate sample splitting in theory, by numerical computation, and by simulation, comparing it to several methods that use all of the data. Sample splitting is remarkably effective, much more so in observational studies than in randomized experiments: splitting 1000 matched pairs into 100 planning pairs and 900 analysis pairs often materially improves the design sensitivity. An example from genetic toxicology is used to illustrate the method.
- Paper: PDF file.
Heller, R. and Manduchi, E. and Small, D.S.
Matching methods for observational microarray studies(2009)
Bioinformatics, Vol. 25, No. 7, Pp. 904-909.
- Abstract: Motivation: We address the problem of identifying differentially expressed genes between two conditions in the scenario where the data arise from an observational study, in which confounding factors are likely to be present. Results: We suggest using matching methods to balance two groups of observed cases on measured covariates, and to identify differentially expressed genes using a test suited to matched data. We illustrate this approach on 2 microarray studies: the first study consists of data from patients with two cancer subtypes, and the second study consists of data from AMKL patients with and without Down syndrome. Availability: R code (www.r-project.org) for implementing our approach is included as supplementary material.
- Paper: PDF file.
- R code: in Software section.
Benjamini, Y. and Heller, R.
Screening for partial conjunction hypotheses (2008)
Biometrics, Vol. 64, No. 4, Pp. 1215-1222.
- Abstract: We consider the problem of testing for partial conjunction of hypotheses, which asserts that at least u out of n tested hypotheses are false. It offers an in-between approach to the testing of the conjunction of null hypotheses against the alternative that at least one is not null, and the testing of the disjunction of null hypotheses against the alternative that all hypotheses are not null. We suggest powerful test statistics for testing such a partial conjunction hypothesis that are valid under dependence between the test statistics as well as under independence. We then address the problem of testing many partial conjunction hypotheses simultaneously using the false discovery rate (FDR) approach. We prove that if the FDR controlling procedure in Benjamini and Hochberg (1995) is used for this purpose, the FDR is controlled under various dependency structures. Moreover, we can screen at all levels simultaneously in order to display the findings on a superimposed map and still control an appropriate FDR measure. We apply the method to examples from microarray analysis and functional Magnetic Resonance Imaging (fMRI), two application areas where the need for partial conjunction analysis has been identified.
- Paper: PDF file.
- Supplementary Material: PDF file.
- Matlab code: in Software section.
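To make the partial conjunction idea in this abstract concrete, here is a minimal sketch of one simple way to combine n p-values into a p-value for "at least u out of n hypotheses are false". It uses a Bonferroni-style combination, (n - u + 1) times the u-th smallest p-value, which is valid under arbitrary dependence. The function name is hypothetical, and this is an illustration rather than the Matlab code referenced above. The abstract does not say which test statistics the paper recommends.

```python
def partial_conjunction_pvalue(pvalues, u):
    """Bonferroni-style p-value for 'at least u of the n hypotheses are false'.

    Illustrative sketch (hypothetical helper): takes the u-th smallest p-value
    and scales it by the number of p-values ranked u or higher.
    """
    n = len(pvalues)
    if not 1 <= u <= n:
        raise ValueError("u must satisfy 1 <= u <= n")
    if any(not 0.0 <= p <= 1.0 for p in pvalues):
        raise ValueError("p-values must lie in [0, 1]")
    ordered = sorted(pvalues)
    # ordered[u - 1] is the u-th smallest p-value; cap the result at 1.
    return min(1.0, (n - u + 1) * ordered[u - 1])


pvals = [0.001, 0.01, 0.2, 0.5]
p_at_least_1 = partial_conjunction_pvalue(pvals, 1)  # Bonferroni: n * min p
p_at_least_2 = partial_conjunction_pvalue(pvals, 2)
p_all_4 = partial_conjunction_pvalue(pvals, 4)       # maximum p-value
```

With u = 1 the sketch reduces to the usual Bonferroni test of the global null, and with u = n it reduces to the maximum p-value. Many such p-values, one per gene or voxel, could then be passed to an FDR controlling procedure.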
Benjamini, Y. and Heller, R.
False Discovery Rates for Spatial Signals (2007)
Journal of the American Statistical Association, Vol. 102, No. 480, Pp. 1272-1281.
- Abstract: The problem of multiple testing for the presence of signal in spatial data can involve a large number of locations. Traditionally, each location is tested separately for signal presence but then the findings are reported in terms of clusters of nearby locations. This is an indication that the units of interest for testing are clusters rather than individual locations. The investigator may know a priori these more natural units or an approximation to them. We suggest testing these cluster units rather than individual locations, thus increasing the signal to noise ratio within the unit tested as well as reducing the number of hypothesis tests conducted. Since the signal may be absent from part of each cluster, we define a cluster as containing signal if the signal is present somewhere within the cluster. We suggest controlling the false discovery rate (FDR) on clusters, i.e. the expected proportion of clusters rejected erroneously out of all clusters rejected, or its extension to general weights (WFDR). We introduce a powerful two-stage testing procedure and show that it controls the WFDR. Once the cluster discoveries have been made, we suggest "cleaning" locations in which the signal is absent. For this purpose we develop a hierarchical testing procedure that tests clusters first, then locations within rejected clusters. We show formally that this procedure controls the desired location error rate asymptotically, and conjecture, based on extensive simulations, that this also holds in realistic settings. We discuss an application to functional neuroimaging which motivated this research and demonstrate the advantages of the proposed methodology on an example.
- Paper: PDF file.
Heller, R. and Golland, Y. and Malach, R. and Benjamini, Y.
Conjunction group analysis: An alternative to mixed/random effect analysis (2007)
NeuroImage, Vol. 37, No. 4, Pp. 1178-1185.
- Abstract: We address the problem of testing in every brain voxel v whether at least u out of n conditions (or subjects) considered show a real effect. The only statistic suggested so far, the maximum p-value method, fails under dependency (unless u=n) and in particular under positive dependency that arises if all stimuli are compared to the same control stimulus. Moreover, it tends to have low power under independence. For testing that at least u out of n conditions show a real effect, we suggest powerful test statistics that are valid under dependence between the individual condition p-values as well as under independence, and other test statistics that are valid under independence. We use the above approach, replacing conditions by subjects, to produce informative group maps and thereby offer an alternative to mixed/random effect analysis.
- Paper: PDF file.
Heller, R. and Stanley, D. and Yekutieli, D. and Rubin, N. and Benjamini, Y.
Cluster-based analysis of FMRI data (2006)
NeuroImage, Vol. 33, No. 2, Pp. 599-608.
- Abstract: We propose a method for the statistical analysis of fMRI data that tests cluster units rather than voxel units for activation. The advantages of this analysis over previous ones are both conceptual and statistical. Recognizing that the fundamental units of interest are the spatially contiguous clusters of voxels that are activated together, we set out to approximate these cluster units from the data by a clustering algorithm especially tailored for fMRI data. Testing the cluster units has a two-fold statistical advantage over testing each voxel separately: the signal to noise ratio within the unit tested is higher, and the number of hypothesis tests performed is smaller. We suggest controlling FDR on clusters, i.e., the proportion of clusters rejected erroneously out of all clusters rejected, and explain the meaning of controlling this error rate. We introduce the powerful adaptive procedure to control the FDR on clusters. We apply our cluster-based analysis (CBA) to both an event-related and a block design fMRI vision experiment and demonstrate its increased power over voxel-by-voxel analysis in these examples as well as in simulations.
- Paper: PDF file.
- Matlab code: in Software section.

Abramovich, F. and Heller, R.
Local functional hypothesis testing (2005)
Mathematical Methods of Statistics, Vol. 14, No. 3.
- Abstract: We consider a standard "signal+white noise" model on the unit interval and want to test whether the signal is present on a subinterval $\Omega_\Delta \subseteq [0,1]$ of length $\Delta$. The composite alternative is that the unknown signal $f$ is separated away from zero in terms of its average power $\gamma(f) = \|f\|_\Delta^2 / \Delta$ on $\Omega_\Delta$ and also possesses some regularity properties. We evaluate the asymptotically optimal (minimax) rates for testing the presence of a signal on $\Omega_\Delta$, where both the noise level and the interval length tend to zero. We derive corresponding rate-optimal tests for local signal detection.
- Paper: PDF file.
- R package: in Software section.