
Bogomolov, M. and Heller, R.
Replicability Across Multiple Studies (2023)
Statistical Science.

abstract: Meta-analysis is routinely performed in many scientific disciplines. This analysis is attractive since discoveries are possible even when all the individual studies are underpowered. However, the meta-analytic discoveries may be entirely driven by signal in a single study, and thus non-replicable. Although the great majority of meta-analyses carried out to date do not infer on the replicability of their findings, it is possible to do so. We provide a selective overview of analyses that can be carried out towards establishing replicability of the scientific findings. We describe methods for the setting where a single outcome is examined in multiple studies (as is common in systematic reviews of medical interventions), as well as for the setting where multiple studies each examine multiple features (as in genomics applications). We also discuss some of the current shortcomings and future directions.
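A common building block for such replicability analyses is the partial conjunction p-value, which tests whether at least u of the n studies contain signal (u = 2 corresponds to a replicability claim). As an illustrative sketch, not code from the paper, Fisher's combination applied to the n - u + 1 largest p-values yields a valid partial conjunction p-value; only the Python standard library is used:

```python
import math

def fisher_sf(stat, k):
    # Survival function of a chi-square with 2k df (closed form for even df)
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= (stat / 2) / i
        total += term
    return math.exp(-stat / 2) * total

def partial_conjunction_pvalue(pvals, u):
    """P-value for H0: fewer than u of the n studies have signal.
    Combines the n - u + 1 largest p-values with Fisher's method."""
    largest = sorted(pvals)[u - 1:]  # drop the u - 1 smallest p-values
    stat = -2.0 * sum(math.log(p) for p in largest)
    return fisher_sf(stat, len(largest))

# Replicability (u = 2) across four studies: the evidence cannot be
# driven by the single most significant study alone.
print(partial_conjunction_pvalue([1e-6, 0.003, 0.4, 0.7], u=2))
```

Note that with u = 2 the very small p-value 1e-6 is discarded, so a meta-analytic finding driven by one study alone cannot produce a small partial conjunction p-value.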

Paper: Link.

Heller, R., Krieger, A., and Rosset, S.
Optimal multiple testing and design in clinical trials (2022)
Biometrics.

abstract: A central goal in designing clinical trials is to find the test that maximizes power (or equivalently minimizes required sample size) for finding a false null hypothesis subject to the constraint of type I error. When there is more than one test, such as in clinical trials with multiple endpoints, the issues of optimal design and optimal procedures become more complex. In this paper, we address the question of how such optimal tests should be defined and how they can be found. We review different notions of power and how they relate to study goals, and also consider the requirements of type I error control and the nature of the procedures. This leads us to an explicit optimization problem with objective and constraints that describe its specific desiderata. We present a complete solution for deriving optimal procedures for two hypotheses; these procedures have desired monotonicity properties and are computationally simple. For some of the optimization formulations this yields optimal procedures that are identical to existing procedures, such as Hommel's procedure or the procedure of Bittman et al. (2009), while for other cases it yields completely novel and more powerful procedures than existing ones. We demonstrate the nature of our novel procedures and their improved power extensively in a simulation and on the APEX study (Cohen et al., 2016).

Paper: Link.

YoungStatS blog entry: Generalizing the Neyman-Pearson Lemma for multiple hypothesis testing problems

Jaljuli, I., Benjamini, Y., Shenhav, L., Panagiotou, O., and Heller, R.
Quantifying replicability and consistency in systematic reviews (2022)
Statistics in Biopharmaceutical Research.

abstract: Systematic reviews and meta-analyses are important tools for synthesizing evidence from multiple studies. They serve to increase power and improve precision, in the same way that large studies can do, but also to establish the consistency of effects and replicability of results across studies. In this work we propose statistical tools to quantify replicability of effect signs (or directions) and their consistency. We suggest that these tools accompany the fixed-effect or random-effects meta-analysis, and we show that they convey important information for the assessment of the intervention under investigation. We motivate and demonstrate our approach and its implications by examples from systematic reviews from the Cochrane Library. Our tools make no assumptions on the distribution of the true effect sizes, so their inferential guarantees continue to hold even if the assumptions of the fixed-effect or random-effects models do not hold. We also develop a version of this tool under the fixed-effect assumption, for cases where it is crucial and justified.

Paper (pre-submission version): PDF file.

R Package: metarep

Brill, B., Amir, A., and Heller, R.
Testing for differential abundance in compositional counts data, with application to microbiome studies (2022)
The Annals of Applied Statistics.

abstract: Identifying which taxa in our microbiota are associated with traits of interest is important for advancing science and health. However, the identification is challenging because the measured vector of taxa counts (by amplicon sequencing) is compositional, so a change in the abundance of one taxon in the microbiota induces a change in the number of sequenced counts across all taxa. The data are typically sparse, with many zero counts present either due to biological variance or limited sequencing depth. We examine the case of Crohn's disease, where the microbial load changes substantially with the disease. For this representative example of a highly compositional setting, we show that existing methods designed to identify differentially abundant taxa may have an inflated number of false positives. We introduce a novel nonparametric approach that provides valid inference even when the fraction of zero counts is substantial. Our approach uses a set of reference taxa that are non-differentially abundant, which can be estimated from the data or from outside information. Our approach also allows for a novel type of testing: multivariate tests of differential abundance over a focused subset of the taxa. Genus-level multivariate testing discovers additional genera as differentially abundant by avoiding agglomeration of taxa.

Paper: PDF file.

R Package: DACOMP

Haroush, M., Frostig, T., Heller, R., and Soudry, D.
A statistical framework for efficient out of distribution detection in deep neural networks (2022)
The Tenth International Conference on Learning Representations (ICLR).

abstract: Commonly, Deep Neural Networks (DNNs) generalize well on samples drawn from a distribution similar to that of the training set. However, DNNs' predictions are brittle and unreliable when the test samples are drawn from a dissimilar distribution. This is a major concern for deployment in real-world applications, where such behavior may come at a considerable cost, such as industrial production lines, autonomous vehicles, or health care applications. We frame Out-Of-Distribution (OOD) detection in DNNs as a statistical hypothesis testing problem. Tests generated within our proposed framework combine evidence from the entire network. Unlike previous OOD detection heuristics, this framework returns a p-value for each test sample. It is guaranteed to maintain the Type I Error (T1E; incorrectly predicting OOD for an actual in-distribution sample) for test data. Moreover, this allows combining several detectors while maintaining the T1E. Building on this framework, we suggest a novel OOD procedure based on low-order statistics. Our method achieves comparable or better results than state-of-the-art methods on well-accepted OOD benchmarks, without retraining the network parameters or assuming prior knowledge of the test distribution, and at a fraction of the computational cost.
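The paper's detector is built from low-order statistics of network activations; as a simplified illustration of how any OOD score can be turned into a valid per-sample p-value with guaranteed T1E control, here is a conformal-style empirical p-value computed against held-out in-distribution calibration scores (the score function and data below are hypothetical, not the paper's method):

```python
import random

def ood_pvalue(test_score, calib_scores):
    """Empirical (conformal-style) p-value: the probability that an
    in-distribution sample scores at least as 'OOD-like' as the test sample.
    Super-uniform under the null whenever the calibration scores are
    exchangeable with in-distribution test scores, which guarantees T1E control."""
    ge = sum(1 for s in calib_scores if s >= test_score)
    return (1 + ge) / (1 + len(calib_scores))

random.seed(0)
# Hypothetical scores of held-out in-distribution samples (higher = more OOD-like)
calib = [random.gauss(0, 1) for _ in range(999)]
p_far = ood_pvalue(10.0, calib)  # a wildly out-of-distribution score
p_typ = ood_pvalue(0.0, calib)   # a typical in-distribution score
print(p_far, p_typ)
```

Thresholding these p-values at alpha rejects (declares OOD) at most a fraction alpha of in-distribution samples in expectation, and p-values from several detectors can be combined while preserving this guarantee.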

Paper: PDF file.

Rosset, S., Heller, R., Painsky, A., and Aharoni, E.
Optimal and maximin procedures for multiple testing problems (2022)
Journal of the Royal Statistical Society (JRSS), series B.

abstract: Multiple testing problems are a staple of modern statistical analysis. The fundamental objective of multiple testing procedures is to reject as many false null hypotheses as possible (that is, maximize some notion of power), subject to controlling an overall measure of false discovery, like familywise error rate (FWER) or false discovery rate (FDR). In this paper we formulate multiple testing of simple hypotheses as an infinite-dimensional optimization problem, seeking the most powerful rejection policy which guarantees strong control of the selected measure. In that sense, our approach is a generalization of the optimal Neyman-Pearson test for a single hypothesis. We show that for exchangeable hypotheses, for both FWER and FDR and relevant notions of power, these problems can be formulated as infinite linear programs and can in principle be solved for any number of hypotheses. We apply our results to derive explicit optimal tests for FWER or FDR control for three independent normal means. We find that the power gain over natural competitors is substantial in all settings examined. We also characterize maximin rules for complex alternatives, and demonstrate that such rules can be found in practice, leading to improved practical procedures compared to existing alternatives.

Paper: Link.

RMD file: Example

HTML file: Example

Panagiotou, O.A. and Heller, R.
Inferential Challenges for Real-world Evidence in the Era of Routinely Collected Health Data:
Many Researchers, Many More Hypotheses, a Single Database
(2021)
JAMA Oncology.

Heifetz, A., Heller, R., and Ostreiher, R.
Do Arabian babblers play mixed strategies in a "volunteer's dilemma"? (2021)
Journal of Behavioral and Experimental Economics.

Abstract: When group-living Arabian babbler songbirds hear a sentinel alarm call that indicates a raptor approach, they must instantaneously choose whether to flee to shelter, or rather to expose themselves while calling towards the raptor to communicate its detection. If enough group members thus signal to the raptor their vigilance, the raptor is likely to be dissuaded from attacking the group. Group-mates thus engage in a variant of the "volunteer's dilemma" game (Diekmann, 1985), whose symmetric equilibrium is in mixed strategies. In a field experiment, we check whether Arabian babblers indeed make independent randomized choices upon hearing alarm calls, both in natural conditions and in a controlled experiment in which recorded alarm calls were broadcast to group members. We use a resampling method to check for independence across group members in their reactions to sentinel alarm calls. Independent mixed-strategy behaviour was refuted in natural conditions, but not refuted in the artificial conditions of the experiment. This is the first real-world test of mixed-strategy behaviour in games with more than two players.

Paper: Link.

Heller, R. and Rosset, S.
Optimal control of false discovery criteria in the two-group model (2020)
Journal of the Royal Statistical Society (JRSS), Series B.

Abstract: The highly influential two-group model in testing a large number of statistical hypotheses assumes that the test statistics are drawn independently from a mixture of a high probability null distribution and a low probability alternative. Optimal control of the marginal false discovery rate (mFDR), in the sense that it provides maximal power (expected true discoveries) subject to mFDR control, is known to be achieved by thresholding the local false discovery rate (locFDR), i.e., the probability of the hypothesis being null given the set of test statistics, with a fixed threshold. We address the challenge of optimally controlling the popular false discovery rate (FDR) or positive FDR (pFDR), rather than mFDR, in the general two-group model, which also allows for dependence between the test statistics. These criteria are less conservative than the mFDR criterion, so they make more rejections in expectation. We derive their optimal multiple testing (OMT) policies, which turn out to threshold the locFDR with a threshold that is a function of the entire set of statistics. We develop an efficient algorithm for finding these policies, and use it for problems with thousands of hypotheses. We illustrate these procedures on gene expression studies.
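To make the locFDR thresholding concrete, here is a minimal sketch of the independent two-group model with illustrative (assumed) parameters: null fraction pi0 = 0.9, N(0,1) null, and N(2,1) alternative. The mFDR-optimal rule described above rejects hypotheses whose locFDR falls below a fixed threshold:

```python
import math

def norm_pdf(z, mu=0.0, sd=1.0):
    # Density of N(mu, sd^2) at z
    return math.exp(-0.5 * ((z - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def loc_fdr(z, pi0=0.9, mu1=2.0):
    """Local FDR in the independent two-group model: P(null | z) when
    z ~ pi0 * N(0,1) + (1 - pi0) * N(mu1, 1). Parameter values are illustrative."""
    f0 = norm_pdf(z)
    f1 = norm_pdf(z, mu=mu1)
    return pi0 * f0 / (pi0 * f0 + (1 - pi0) * f1)

# The fixed-threshold rule: reject hypotheses with locFDR below 0.2
zs = [0.5, 2.0, 3.5, 4.5]
rejected = [z for z in zs if loc_fdr(z) < 0.2]
print(rejected)
```

The paper's point is that the FDR- and pFDR-optimal policies replace this fixed threshold with one that depends on the entire set of test statistics; that computation is beyond this sketch.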

Paper (pre-submission version): Link.

Panagiotou, O.A., Jaljuli, I., and Heller, R.
Replicability of Treatment Effect in Study of Blood Pressure Lowering With Dementia (2020)
JAMA.

Heller, R.
Comments on: Hierarchical inference for genome-wide association studies: a view on methodology with software (2020)
Computational Statistics.

Heller, R., Meir, A., and Chatterjee, N.
Post-selection estimation and testing following aggregated association tests (2019)
Journal of the Royal Statistical Society (JRSS), Series B.

Abstract: The practice of pooling several individual test statistics to form aggregate tests is common in many statistical applications where individual tests may be underpowered. While selection by aggregate tests can serve to increase power, the selection process invalidates the inference based on the individual test statistics, making it difficult to identify the ones that drive the signal in follow-up inference. Here, we develop a general approach for valid inference following selection by aggregate testing. We present novel powerful post-selection tests for the individual null hypotheses which are exact for the normal model and asymptotically justified otherwise. Our approach relies on the ability to characterize the distribution of the individual test statistics after conditioning on the event of selection. We provide efficient algorithms for computation of the post-selection maximum-likelihood estimates and suggest confidence intervals which rely on a novel switching regime for good coverage guarantees. We validate our methods via comprehensive simulation studies and apply them to data from the Dallas Heart Study, demonstrating that single-variant association discovery following selection by an aggregate test is indeed possible in practice.

Paper (pre-submission version): Link.

R Package: PSAT

Bogomolov, M. and Heller, R.
Assessing replicability of findings across two studies of multiple features (2018)
Biometrika.

Abstract: Replicability analysis aims to identify the overlapping signals across independent studies that examine the same features. For this purpose we develop hypothesis testing procedures that first select the promising features from each study separately. Only those features selected in both studies are then tested. The proposed procedures have theoretical guarantees regarding their control of the familywise error rate or false discovery rate on the replicability claims. They can also be used for signal discovery in each study separately, with the desired error control. Their power for detecting truly replicable findings is compared to alternatives. We illustrate the procedures on behavioural genetics data.

Paper: HTML.

R Package: radjust

Brill, B., Heller, Y., and Heller, R.
Nonparametric independence tests and K-sample tests for large sample sizes, using package HHG (2018)
R Journal.

Abstract: Nonparametric tests of independence and K-sample tests are ubiquitous in modern applications, but they are typically computationally expensive. We present a family of nonparametric tests that are computationally efficient and powerful for detecting any type of dependence between a pair of univariate random variables. The computational complexity of the suggested tests is sub-quadratic in sample size, allowing calculation of test statistics for millions of observations. We survey both the algorithms and the HHG package in which they are implemented, with usage examples showing the implementation of the proposed tests for both the independence case and the K-sample problem. The tests are compared to existing nonparametric tests via several simulation studies comparing both runtime and power. Special focus is given to the design of the data structures used in the implementation of the tests. These data structures can be useful for developers of nonparametric distribution-free tests.

Paper: PDF file.

R Package: HHG

Sampson, J., Boca, S., Moore, S., Heller, R.
FWER and FDR control when testing multiple mediators (2018)
Bioinformatics.

Abstract:
The biological pathways linking exposures and disease risk are often poorly understood. To gain insight into these pathways, studies may try to identify biomarkers that mediate the exposure/disease relationship. Such studies often simultaneously test hundreds or thousands of biomarkers. We consider a set of m biomarkers and a corresponding set of null hypotheses, where the jth null hypothesis states that biomarker j does not mediate the exposure/disease relationship. We propose a Multiple Comparison Procedure (MCP) that rejects a set of null hypotheses or, equivalently, identifies a set of mediators, while asymptotically controlling the Family-Wise Error Rate (FWER) or False Discovery Rate (FDR). We use simulations to show that, compared to currently available methods, our proposed method has higher statistical power to detect true mediators. We then apply our method to a breast cancer study and identify nine metabolites that may mediate the known relationship between an increased BMI and an increased risk of breast cancer.

Paper: HTML.

Karmakar, B., Heller, R. and Small, D.
False discovery rate control for effect modification in observational studies (2018)
Electronic Journal of Statistics.

Abstract:
In an observational study, a difference between the treatment and control groups' outcomes might reflect bias in treatment assignment rather than a true treatment effect. A sensitivity analysis determines the magnitude of this bias that would be needed to explain away as non-causal a significant treatment effect from a naive analysis that assumed no bias. Effect modification is the interaction between a treatment and a pretreatment covariate. In an observational study, there are often many possible effect modifiers, and it is desirable to be able to look at the data to identify the effect modifiers that will be tested. For observational studies, we simultaneously address the problem of accounting for the multiplicity involved in choosing effect modifiers to test among many possible effect modifiers by looking at the data, and of conducting a proper sensitivity analysis. We develop an approach that provides finite-sample false discovery rate control for a collection of adaptive hypotheses identified from the data in a matched-pairs design. Along with simulation studies, an empirical study is presented on the effect of cigarette smoking on lead levels in the blood, using data from the U.S. National Health and Nutrition Examination Survey. Other applications of the suggested method are briefly discussed.

Paper: Link.

Heller, R., Chatterjee, N., Krieger, A., and Shi, J.
Post-selection Inference Following Aggregate-Level Hypothesis Testing in Large-Scale Genomic Data (2017)
Journal of the American Statistical Association.

Abstract: In many genomic applications, hypothesis tests are performed by aggregating test statistics across units within naturally defined classes for powerful identification of signals. Following class-level testing, it is naturally of interest to identify the lower-level units which contain true signals. Testing the individual units within a class without taking into account the fact that the class was selected using an aggregate-level test statistic will produce biased inference. We develop a hypothesis testing framework that guarantees control of false positive rates conditional on the fact that the class was selected. Specifically, we develop procedures for calculating unit-level p-values that allow rejection of null hypotheses controlling for two types of conditional error rates, one relating to the familywise error rate and the other to the false discovery rate. We use simulation studies to illustrate the validity and power of the proposed procedure in comparison to several possible alternatives. We illustrate the power of the method in a natural application involving whole-genome expression quantitative trait loci (eQTL) analysis across 17 tissue types, using data from The Cancer Genome Atlas (TCGA) Project.

Paper: Link.

R Package: PSAT

Jiang L., Amir A., Morton J., Heller R., Arias-Castro E., and Knight R.
Discrete False-Discovery Rate Improves Identification of Differentially Abundant Microbes (2017)
mSystems, doi: 10.1128/mSystems.00092-17.

Abstract: Differential abundance testing is a critical task in microbiome studies that is complicated by the sparsity of data matrices. Here we adapt for microbiome studies a solution from the field of gene expression analysis to produce a new method, discrete false-discovery rate (DS-FDR), that greatly improves the power to detect differential taxa by exploiting the discreteness of the data. Additionally, DS-FDR is relatively robust to the number of non-informative features, and thus removes the problem of filtering taxonomy tables by an arbitrary abundance threshold. We show, by using a combination of simulations and reanalysis of nine real-world microbiome data sets, that this new method outperforms existing methods at the differential abundance testing task, producing a false-discovery rate that is up to threefold more accurate, and halves the number of samples required to find a given difference (thus increasing the efficiency of microbiome experiments considerably). We therefore expect DS-FDR to be widely applied in microbiome studies.
IMPORTANCE: DS-FDR can achieve higher statistical power to detect significant findings in sparse and noisy microbiome data compared to the commonly used Benjamini-Hochberg procedure and other FDR-controlling procedures.
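For reference, the Benjamini-Hochberg step-up procedure that DS-FDR is compared against can be sketched in a few lines (this is the standard BH procedure, not the DS-FDR method itself):

```python
def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure: returns the indices of rejected
    hypotheses, controlling the FDR at level q under independence."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    # Find the largest rank whose p-value is below its step-up threshold q*rank/m
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            k = rank
    return sorted(order[:k])  # reject the k smallest p-values

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6], q=0.05))  # -> [0, 1]
```

DS-FDR gains power over BH on sparse count data by using the discrete permutation null of each feature rather than these fixed uniform thresholds.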

Paper: Link.

Sun L., Subar A.F., Bosire C., Dawsey S.M., Kahle L.L., Zimmerman T.P., Abnet C.C., Heller R., Graubard B.I., Cook M.B., and Petrick J.L.
Dietary Flavonoid Intake Reduces the Risk of Head and Neck but Not Esophageal or Gastric Cancer in US Men and Women (2017)
J Nutr, pii: jn251579. doi: 10.3945/jn.117.251579.

Abstract: Background: Flavonoids are bioactive polyphenolic compounds found in fruits, vegetables, and beverages of plant origin. Previous studies have shown that flavonoid intake reduces the risk of certain cancers; however, few studies to date have examined associations of flavonoids with upper gastrointestinal cancers or used prospective cohorts. Objective: Our study examined the association between intake of flavonoids (anthocyanidins, flavan-3-ols, flavanones, flavones, flavonols, and isoflavones) and risk of head and neck, esophageal, and gastric cancers. Methods: The NIH-AARP Diet and Health Study is a prospective cohort study that consists of 469,008 participants. Over a mean 12-y follow-up, 2453 head and neck (including 1078 oral cavity, 424 pharyngeal, and 817 laryngeal), 1165 esophageal (890 adenocarcinoma and 275 squamous cell carcinoma), and 1297 gastric (625 cardia and 672 noncardia) cancer cases were identified. We used Cox proportional hazards regression models to estimate HRs and CIs for the associations between flavonoid intake assessed at study baseline and cancer outcomes. For the 56 hypotheses examined, P-trend values were adjusted using the Benjamini-Hochberg (BH) procedure for false discovery rate control. Results: The highest quintile of total flavonoid intake was associated with a 24% lower risk of head and neck cancer (HR: 0.76; 95% CI: 0.66, 0.86; BH-adjusted 95% CI: 0.63, 0.91; P-trend = 0.02) compared with the lowest quintile. Notably, anthocyanidins were associated with a 28% lower risk of head and neck cancer (HR: 0.72; 95% CI: 0.62, 0.82; BH-adjusted 95% CI: 0.59, 0.87; P-trend = 0.0005), and flavanones were associated with a 22% lower risk of head and neck cancer (HR: 0.78; 95% CI: 0.68, 0.89; BH-adjusted 95% CI: 0.64, 0.94; P-trend = 0.02). No associations between flavonoid intake and risk of esophageal or gastric cancers were found. Conclusions: Our results indicate that flavonoid intake is associated with lower head and neck cancer risk. These associations suggest a protective effect of dietary flavonoids on head and neck cancer risk, and thus potential as a risk-reduction strategy.

Paper: Link.

Eilenberg, R. and Heller, R.
On the use of balancing scores and matching in testing for exposure effect in case-control studies (2017)
Statistics and Its Interface, Vol. 11, No. 1, pp. 51-60.

Abstract: Balancing scores, especially the propensity score, are widely used to adjust for measured confounders in prospective studies. In case-control studies, the distribution of the exposure and outcome given the covariates is distorted when there is an exposure effect, due to the selection process. Therefore, it is less obvious how to estimate balancing scores. Extensive simulations revealed several interesting findings on the use of estimated balancing scores in testing for exposure effect. First, with the aid of an estimated balancing score, obtaining matched sets with a low absolute standardized difference in covariate means was far easier than without it. Second, the estimation approach matters, and several potential approaches result in an inflation of the type I error probability. Third, using full matching on covariates and on the estimated balancing score for testing for exposure effect is preferred over covariate adjustment (which has reduced power) and over stratification (which is sensitive to the number of strata, and does not make full use of the observed covariates). We show the usefulness of full matching with our recommended approach to estimating the balancing score in a case-control study.

Paper: Link.

Sofer, T., Heller, R., Bogomolov, M., Avery, C., Graff, M., North, K., Reiner, A., Thornton, T., Rice, K., Benjamini, Y., Laurie, C., and Kerr, K.
A Powerful Statistical Framework for Generalization Testing in GWAS, with Application to the HCHS/SOL (2017)
Genetic Epidemiology.

Abstract: In GWAS, "generalization" is the replication of a genotype-phenotype association in a population with different ancestry than the population in which it was first identified. The standard for reporting findings from a GWAS requires a two-stage design, in which discovered associations are replicated in an independent follow-up study. Current practices for declaring generalizations rely on testing associations while controlling the Family-Wise Error Rate (FWER) in the discovery study, then separately controlling error measures in the follow-up study. While this approach limits false generalizations, we show that it does not guarantee control over the FWER or False Discovery Rate (FDR) of the generalization null hypotheses. In addition, it fails to leverage the two-stage design to increase power for detecting generalized associations. We develop a formal statistical framework for quantifying the evidence of generalization that accounts for the (in)consistency between the directions of associations in the discovery and follow-up studies. We develop the directional generalization FWER (FWERg) and FDR (FDRg) controlling r-values, which are used to declare associations as generalized. This framework extends to generalization testing when applied to a published list of SNP-trait associations. We show that our framework accommodates various SNP selection rules for generalization testing based on p-values in the discovery study, and still controls FWERg or FDRg. A key finding is that it is often beneficial to use a more lenient p-value threshold than the genome-wide significance threshold. For instance, in a GWAS of Total Cholesterol (TC) in the Hispanic Community Health Study/Study of Latinos (HCHS/SOL), when testing all SNPs with p-values < 5 × 10^(-8) (15 genomic regions) for generalization in a large GWAS of whites, we generalized SNPs from 15 regions. But when testing all SNPs with p-values < 6.6 × 10^(-5) (89 regions), we generalized SNPs from 27 regions.

Paper: Link.

Web Applet: ReplicabilityFDR

Karp, N.A., Heller, R., Yaacoby, S., White, J.K., and Benjamini, Y.
Improving the Identification of Phenotypic Abnormalities and Sexual Dimorphism in Mice When Studying Rare Event Categorical Characteristics (2016)
Genetics, DOI: 10.1534/genetics.116.195388.

Abstract: Biological research frequently involves the study of phenotyping data. Many of these studies focus on rare-event categorical data, and in functional genomics typically study the presence or absence of an abnormal phenotype. With the growing interest in the role of sex, there is a need to assess the phenotype for sexual dimorphism. The identification of abnormal phenotypes for downstream research is challenged by the small sample size, the rare-event nature of the data, and the multiple testing problem, as many variables are monitored simultaneously. Here we develop a statistical pipeline to assess statistical and biological significance whilst managing the multiple testing problem. We propose a two-step pipeline to initially assess for a treatment effect (in our example, genotype) and then test for an interaction with sex. We compare multiple statistical methods and use simulations to investigate the control of the type I error rate and power. To maximize power whilst addressing the multiple testing issue, we implement filters to remove datasets where the hypotheses to be tested cannot achieve significance. A motivating case study utilizing a large-scale high-throughput mouse phenotyping dataset from the Wellcome Trust Sanger Institute Mouse Genetics Project, where the treatment is a gene ablation, demonstrates the benefits of the new pipeline on the downstream biological calls.

Paper: Link.

Software: Link.

Heller, R. and Heller, Y.
Multivariate tests of association based on univariate tests (2016)
Neural Information Processing Systems (NIPS) 2016, Barcelona, Spain.

Abstract: For testing two random vectors for independence, we consider testing whether the distance of one vector from a center point is independent of the distance of the other vector from a center point, by a univariate test. In this paper we provide conditions under which it is enough to have a consistent univariate test of independence on the distances to guarantee that the power to detect dependence between the random vectors increases to one as the sample size increases. These conditions turn out to be minimal. If the univariate test is distribution-free, the multivariate test will also be distribution-free. If we consider multiple center points and aggregate the center-specific univariate tests, the power may be further improved, and the resulting multivariate test may be distribution-free for specific aggregation methods (if the univariate test is distribution-free). We show that several multivariate tests recently proposed in the literature can be viewed as instances of this general approach.
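A minimal sketch of the general approach described above: reduce each sample's two vectors to their distances from chosen center points, then apply a univariate test of independence to the distance lists. Here the univariate test is a rank-correlation permutation test, one of many valid choices; the data and center points are illustrative, not from the paper:

```python
import random

def ranks(xs):
    # Ranks 1..n (assumes no ties, as with continuous data)
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank + 1
    return r

def spearman(x, y):
    # Pearson correlation of the ranks
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    m = (n + 1) / 2
    cov = sum((a - m) * (b - m) for a, b in zip(rx, ry))
    var = sum((a - m) ** 2 for a in rx)
    return cov / var

def distance_association_test(X, Y, center_x, center_y, n_perm=999, seed=1):
    """Reduce each vector to its Euclidean distance from a center point, then
    run a univariate rank-based permutation test of independence on the distances."""
    dist = lambda v, c: sum((vi - ci) ** 2 for vi, ci in zip(v, c)) ** 0.5
    dx = [dist(v, center_x) for v in X]
    dy = [dist(v, center_y) for v in Y]
    obs = abs(spearman(dx, dy))
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_perm)
               if abs(spearman(dx, rng.sample(dy, len(dy)))) >= obs)
    return (1 + hits) / (1 + n_perm)

# Dependent pairs: each Y vector is a noisy function of the matching X vector
rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(60)]
Y = [[x1 * x1 + 0.1 * rng.gauss(0, 1), x2 * x2] for x1, x2 in X]
p = distance_association_test(X, Y, [0.0, 0.0], [0.0, 0.0])
print(p)
```

Because only ranks of the distances are used, the test is distribution-free, matching the property highlighted in the abstract.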

Paper: Link.

Heller, R., Heller, Y., Kaufman, S., Brill, B. and Gorfine, M.
Consistent distribution-free K-sample and independence tests for univariate random variables (2016)
Journal of Machine Learning Research, Vol. 17.

Abstract: A popular approach for testing if two univariate random variables are statistically independent consists of partitioning the sample space into bins, and evaluating a test statistic on the binned data. The partition size matters, and the optimal partition size is data dependent. While for detecting simple relationships coarse partitions may be best, for detecting complex relationships a great gain in power can be achieved by considering finer partitions. We suggest novel consistent distribution-free tests that are based on summation or maximization aggregation of scores over all partitions of a fixed size. We show that our test statistics based on summation can serve as good estimators of the mutual information. Moreover, we suggest regularized tests that aggregate over all partition sizes, and prove those are consistent too. We provide polynomial-time algorithms, which are critical for computing the suggested test statistics efficiently. We show that the power of the regularized tests is excellent compared to existing tests, and almost as powerful as the tests based on the optimal (yet unknown in practice) partition size, in simulations as well as on a real data example.

Paper: Link.

R Package: HHG

Angelini, C., Heller, R., Volkinshtein, R., and Yekutieli, D.
Is this the right normalization? A diagnostic tool for ChIP-seq normalization (2015)
BMC Bioinformatics, Vol. 16, No. 150.

Abstract: Background: ChIP-seq experiments are becoming a standard approach for genome-wide profiling of protein-DNA interactions, such as detecting transcription factor binding sites, histone modification marks, and RNA Polymerase II occupancy. However, when comparing a ChIP sample versus a control sample, such as Input DNA, normalization procedures have to be applied in order to remove experimental sources of bias. Despite the substantial impact that the choice of the normalization method can have on the results of a ChIP-seq data analysis, its assessment is not fully explored in the literature. In particular, there are no diagnostic tools that show whether the applied normalization is indeed appropriate for the data being analyzed.
Results: In this work we propose a novel diagnostic tool to examine the appropriateness of the estimated normalization procedure. By plotting the empirical densities of log relative risks in bins of equal read count, along with the estimated normalization constant, after logarithmic transformation, the researcher is able to assess the appropriateness of the estimated normalization constant. We use the diagnostic plot to evaluate the appropriateness of the estimates obtained by CisGenome, NCIS, and CCAT on several real data examples. Moreover, we show the impact that the choice of the normalization constant can have on standard tools for peak calling such as MACS or SICER. Finally, we propose a novel procedure for controlling the FDR using sample swapping. This procedure makes use of the estimated normalization constant in order to gain power over the naive choice of constant (used in MACS and SICER), which is the ratio of the total number of reads in the ChIP and Input samples.
Conclusions: Linear normalization approaches aim to estimate a scale factor, r, to adjust for different sequencing depths when comparing ChIP versus Input samples. The estimated scaling factor can easily be incorporated in many peak-caller algorithms to improve the accuracy of the peak identification. The diagnostic plot proposed in this paper can be used to assess how adequate ChIP/Input normalization constants are, and thus it allows the user to choose the most adequate estimate for the analysis.

Paper: Link.

M. Gorfine, B. Goldstein, A. Fishman, R. Heller, Y. Heller, A. Lamm
Function of cancer-associated genes revealed by modern univariate and multivariate association tests (2015)
PLOS ONE, DOI: 10.1371/journal.pone.0126544.

Abstract: Copy number variation (CNV) plays a role in the pathogenesis of many human diseases, especially cancer. Several whole genome CNV association studies have been performed for the purpose of identifying cancer-associated CNVs. Here we undertook a novel approach to whole genome CNV analysis, with the goal being identification of associations between CNVs of different genes (CNV–CNV) across 60 human cancer cell lines. We hypothesize that these associations point to the roles of the associated genes in cancer, and can be indicators of their position in gene networks of cancer-driving processes.
Recent studies show that gene associations are often nonlinear and nonmonotone. In order to obtain a more complete picture of all CNV associations, we performed omnibus univariate analysis by utilizing the dCov, MIC, and HHG association tests, which are capable of detecting any type of association, including nonmonotone relationships. For comparison we used the Spearman and Pearson association tests, which detect only linear or monotone relationships. Application of the dCov, MIC and HHG tests resulted in identification of twice as many associations compared to those found by Spearman and Pearson alone. Interestingly, most of the new associations were detected by the HHG test.
Next, we utilized dCov's and HHG's ability to perform multivariate analysis. We tested for association between genes of unknown function and known cancer-related pathways. Our results indicate that multivariate analysis is much more effective than univariate analysis for the purpose of ascribing biological roles to genes of unknown function. We conclude that a combination of multivariate and univariate omnibus association tests can reveal significant information about gene networks of disease-driving processes. These methods can be applied to any large gene or pathway dataset, allowing more comprehensive analysis of biological processes.

Paper: link

Heller, R., Bogomolov, M., and Benjamini, Y.
Deciding whether follow-up studies have replicated findings in a preliminary large-scale "omics" study (2014)
Proceedings of the National Academy of Sciences (PNAS), Vol. 111, Pp. 16262–16267.

Heller R., Yaacoby S., and Yekutieli D.
repfdr: A tool for replicability analysis for genome-wide association studies (2014)
Bioinformatics, doi: 10.1093/bioinformatics/btu434

Abstract: Identification of SNPs that are associated with a phenotype in more than one study is of great scientific interest in GWAS research. The empirical Bayes approach for discovering whether results have been replicated across studies was shown to be a reliable method, and close to optimal in terms of power. The R package repfdr provides a flexible implementation of the empirical Bayes approach for replicability analysis and meta-analysis, to be used when several studies examine the same set of null hypotheses. The usefulness of the package for the GWAS community is discussed.

Paper: PDF file.

R Package: repfdr

Heller, R. and Yekutieli, D.
Replicability analysis for genome-wide association studies (2014)
Annals of Applied Statistics, Vol. 8, No. 1, Pp. 481–498.

Abstract: The paramount importance of replicating associations is well recognized in the genome-wide association (GWA) research community, yet methods for assessing replicability of associations are scarce. Published GWA studies often combine separately the results of primary studies and of the follow-up studies. Informally, reporting the two separate meta-analyses, that of the primary studies and that of the follow-up studies, gives a sense of the replicability of the results. We suggest a formal empirical Bayes approach for discovering whether results have been replicated across studies, in which we estimate the optimal rejection region for discovering replicated results. We demonstrate, using realistic simulations, that the average false discovery proportion of our method remains small. We apply our method to six type 2 diabetes (T2D) GWA studies. Out of 803 SNPs discovered to be associated with T2D using a typical meta-analysis, we discovered 219 SNPs with replicated associations with T2D. We recommend complementing a meta-analysis with a replicability analysis for GWA studies.

Paper: PDF file.

R Package: repfdr

Bogomolov, M. and Heller, R.
Discovering findings that replicate from a primary study of high dimension to a follow-up study (2013)
Journal of the American Statistical Association, Vol. 108, No. 504, Pp. 1480–1492.

Abstract: We consider the problem of identifying whether findings replicate from one study of high dimension to another, when the primary study guides the selection of hypotheses to be examined in the follow-up study, as well as when there is no division of roles into the primary and the follow-up study. We show that existing meta-analysis methods are not appropriate for this problem, and suggest novel methods instead. We prove that our multiple testing procedures control appropriate error rates. The suggested FWER-controlling procedure is valid for arbitrary dependence among the test statistics within each study. A more powerful procedure is suggested for FDR control. We prove that this procedure controls the FDR if the test statistics are independent within the primary study, and independent or with dependence of type PRDS in the follow-up study. For arbitrary dependence within the primary study, and either arbitrary dependence or dependence of type PRDS in the follow-up study, simple conservative modifications of the procedure control the FDR. We demonstrate the usefulness of these procedures via simulations and real data examples.

Paper: PDF file.

Heller, R. and Heller, Y. and Gorfine, M.
A consistent multivariate test of association based on ranks of distances (2013)
Biometrika, Vol. 100, No. 2, Pp. 503–510.

Abstract: We consider the detection of associations between random vectors of
any dimension. Few tests of independence exist that are consistent
against all dependent alternatives. We propose a powerful test that
is applicable in all dimensions and is consistent against all
alternatives. The test has a simple form, is easy to implement, and
has good power.

Paper: PDF file.

R Package: HHG

Heller, R. and Gorfine, M. and Heller Y.
A class of multivariate distribution-free tests of independence based on graphs (2012)
Journal of Statistical Planning and Inference, Vol. 142, No. 12, Pp. 3097–3106.

Abstract: A class of distribution-free tests is proposed for the independence of two subsets of response coordinates. The tests are based on the pairwise distances across subjects within each subset of the response. A complete graph is induced by each subset of response coordinates, with the sample points as nodes and the pairwise distances as the edge weights. The proposed test statistic depends only on the rank order of edges in these complete graphs. The response vector may be of any dimension. In particular, the number of samples may be smaller than the dimension of the response. The test statistic is shown to have a normal limiting distribution with known expectation and variance under the null hypothesis of independence. The exact distribution-free null distribution of the test statistic is given for a sample of size 14, and its Monte Carlo approximation is considered for larger sample sizes. We demonstrate in simulations that this new class of tests has good power properties for very general alternatives.

Paper: PDF file.

Heller, R.
Discussion of "Multiple Testing for Exploratory Research" by J. J. Goeman and A. Solari (2012)
Statistical Science, Vol. 26, No. 4, Pp. 598–600.

Abstract: Goeman and Solari [Statist. Sci. 26 (2011) 584–597] have addressed the interesting topic of multiple testing for exploratory research, and provided us with nice suggestions for exploratory analysis. They defined properties that an inferential procedure should have for exploratory analysis: the procedure should be mild, flexible and post hoc. Their inferential procedure gives a lower bound on the number of false hypotheses among the selected hypotheses, and moreover whenever possible identifies elementary hypotheses that are false. The need to estimate a lower bound on the number of false hypotheses arises in various applications, and the partial conjunction approach was developed for this purpose in Biometrics 64 (2008) 1215–1222 (see also Philos. Trans. R. Soc. Lond. Ser. A 367 (2009) 4255–4271 for more details). For example, in a combined analysis of several studies that examine the same problem, it is of interest to give a lower bound on the number of studies in which the finding was reproduced. I will first address the relation between the method of Goeman and Solari and the partial conjunction approach. Then I will discuss possible extensions and address the issue of exploration in more general settings, where the local test may not be defined in advance or where the candidate hypotheses may not be known to begin with.

Paper: PDF file.

Heller, R.
Comment: Correlated z-values and the accuracy of large-scale statistical estimates (2010)
Journal of the American Statistical Association, Vol. 105, No. 491, Pp. 1057–1059.

Abstract: Professor Efron has given us an interesting article on how to quantify the uncertainty in summary statistics of interest in large-scale problems, when the summary statistics are based on correlated normal variates. It is shown that the inflation in the accuracy estimate due to correlation among the normal variates cannot be ignored (except possibly at the very far tails of distributions). Using a series of simplifications of the covariance formula, a simple formula is derived, and it is shown in a numerical example that the approximation is indeed very close to the truth. In particular it is shown that the entire correlation structure is captured by a single parameter, the root mean square (rms) correlation. Several methods of estimating this parameter, as well as the other unknown parameters, are suggested. In what follows I will discuss several topics in large-scale significance testing that are related to the results of this paper.

Paper: PDF file.

Heller, R. and Rosenbaum, P.R. and Small, D.S.
Using the cross-match test to appraise covariate balance in matched pairs (2010)
The American Statistician, Vol. 64, No. 4, Pp. 299–309.

Abstract: Having created a tentative matched design for an observational study, diagnostic checks are performed to see whether observed covariates exhibit reasonable balance, or alternatively whether further effort is required to improve the match. We illustrate the use of the cross-match test as an aid to appraising balance on high-dimensional covariates, and we discuss its close logical connections to the techniques used to construct matched samples. In particular, in addition to a significance level, the cross-match test provides an interpretable measure of high-dimensional covariate balance, specifically a measure defined in terms of the propensity score. An example from the economics of education is used to illustrate. In the example, imbalances in an initial match guide the construction of a better match. The better match uses a recently proposed technique, optimal tapered matching, that leaves certain possibly innocuous covariates imbalanced in one match but not in another, and yields a test of whether the imbalances are actually innocuous.

Paper: PDF file.

R Package: Crossmatch

Heller, R. and Jensen, S.T. and Rosenbaum, P.R. and Small, D.S.
Sensitivity Analysis for the Cross-Match Test, With Applications in Genomics (2010)
Journal of the American Statistical Association, Vol. 105, No. 491, Pp. 1005–1013.

Abstract: The cross-match test is an exact, distribution-free test of no treatment effect on a high-dimensional outcome in a randomized experiment. The test uses optimal non-bipartite matching to pair 2I subjects into I pairs based on similar outcomes, and the cross-match statistic A is the number of times a treated subject was paired with a control, rejecting for small values of A. If the test is applied in an observational study in which treatments are not randomly assigned, it may be comparing treated and control subjects who are not comparable, and may therefore falsely reject a true null hypothesis of no treatment effect. We develop a sensitivity analysis for the cross-match test, and apply it in an observational study of the effects of smoking on gene expression levels. In addition, we develop a sensitivity analysis for several multiple testing procedures using the cross-match test and apply it to 1627 molecular function categories in Gene Ontology.

Paper: PDF file.

R Package: Crossmatch
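
To make the statistic in the abstract concrete, here is a small Python sketch (not the Crossmatch package code) of the cross-match statistic A and its exact null distribution when treatment labels are assigned at random to subjects paired without regard to treatment:

```python
from math import comb, factorial

def crossmatch_stat(pairs, treated):
    """A = number of pairs matching a treated subject to a control;
    the test rejects for small values of A."""
    return sum((i in treated) != (j in treated) for i, j in pairs)

def crossmatch_null_pmf(a, n, I):
    """Exact P(A = a) when n treatment labels are assigned at random
    among 2I subjects forming I fixed pairs. With a cross-matched
    pairs there are (n - a)/2 treated-treated pairs and
    (2I - n - a)/2 control-control pairs, so n - a must be even."""
    if a < 0 or n - a < 0 or 2 * I - n - a < 0 or (n - a) % 2:
        return 0.0
    return (2 ** a * factorial(I)
            / (factorial(a) * factorial((n - a) // 2)
               * factorial((2 * I - n - a) // 2) * comb(2 * I, n)))
```

A one-sided p-value is obtained by summing the pmf over values of A at most the observed one.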

Benjamini, Y. and Heller, R. and Yekutieli, D.
Selective Inference in Complex Research (2009)
Philosophical Transactions of the Royal Society A, Vol. 367, No. 1906, Pp. 4255–4271.

Abstract: We explain the problem of selective inference in complex research using a recently published study: a replicability study of associations, aiming to reveal and establish risk loci for type 2 diabetes. The false discovery rate approach to such problems will be reviewed, and we further address two problems: (i) setting confidence intervals on the size of the risk at the selected locations and (ii) selecting the replicable results.

Paper: PDF file.

Heller, R. and Manduchi, E. and Grant, G.R. and Ewens, W.J.
A flexible two-stage procedure for identifying gene sets that are differentially expressed (2009)
Bioinformatics, Vol. 25, No. 8, Pp. 1019–1025.

Abstract: Motivation: Microarray data analysis has expanded from testing individual genes for differential expression to testing gene sets for differential expression. The tests at the gene set level may focus on multivariate expression changes or on the differential expression of at least one gene in the gene set. These tests may be powerful at detecting subtle changes in expression, but findings at the gene set level need to be examined further to understand whether they are informative and, if so, how.
Results: We propose to first test for differential expression at the gene set level but then proceed to test for differential expression of individual genes within discovered gene sets. We introduce the overall FDR (OFDR) as an appropriate error rate to control when testing multiple gene sets and genes. We illustrate the advantage of this procedure over procedures that only test gene sets or individual genes.
Availability: R code (www.r-project.org) for implementing our approach is included as supplementary material.

Paper: PDF file.

R code: in Software section.

Heller, R. and Rosenbaum, P.R. and Small, D.S.
Split samples and design sensitivity in observational studies (2009)
Journal of the American Statistical Association, Vol. 104, No. 487, Pp. 1090–1101.

Abstract: An observational or nonrandomized study of treatment effects may be biased
by failure to control for some relevant covariate that was not measured. The design of
an observational study is known to strongly affect its sensitivity to biases from covariates
that were not observed. For instance, the choice of an outcome to study, or the decision
to combine several outcomes in a test for coherence can materially affect the sensitivity
to unobserved biases. Decisions that shape the design are, therefore, critically important,
but they are also difficult decisions to make in the absence of data. We consider the
possibility of randomly splitting the data from an observational study into a smaller
planning sample and a larger analysis sample, where the planning sample is used to guide
decisions about design. After reviewing the concept of design sensitivity, we evaluate
sample splitting in theory, by numerical computation, and by simulation, comparing it to
several methods that use all of the data. Sample splitting is remarkably effective, much
more so in observational studies than in randomized experiments: splitting 1000 matched
pairs into 100 planning pairs and 900 analysis pairs often materially improves the design
sensitivity. An example from genetic toxicology is used to illustrate the method.

Paper: PDF file.

Heller, R. and Manduchi, E. and Small, D.S.
Matching methods for observational microarray studies (2009)
Bioinformatics, Vol. 25, No. 7, Pp. 904–909.

Abstract: Motivation: We address the problem of identifying differentially expressed genes between two conditions in the scenario where the data arise from an observational study, in which confounding factors are likely to be present.
Results: We suggest using matching methods to balance two groups of observed cases on measured covariates, and to identify differentially expressed genes using a test suited to matched data. We illustrate this approach on two microarray studies: the first study consists of data from patients with two cancer subtypes, and the second study consists of data from AMKL patients with and without Down syndrome.
Availability: R code (www.r-project.org) for implementing our approach is included as supplementary material.

Paper: PDF file.

R code: in Software section.

Benjamini, Y. and Heller, R.
Screening for partial conjunction hypotheses (2008)
Biometrics, Vol. 64, No. 4, Pp. 1215–1222.

Abstract: We consider the problem of testing a partial conjunction of hypotheses, which states that at least u out of n tested hypotheses are false. It offers an in-between approach to the testing of the conjunction of null hypotheses against the alternative that at least one is not null, and the testing of the disjunction of null hypotheses against the alternative that all hypotheses are not null. We suggest powerful test statistics for testing such a partial conjunction hypothesis that are valid under dependence between the test statistics as well as under independence. We then address the problem of testing many partial conjunction hypotheses simultaneously using the false discovery rate (FDR) approach. We prove that if the FDR-controlling procedure in Benjamini and Hochberg (1995) is used for this purpose, the FDR is controlled under various dependency structures. Moreover, we can screen at all levels simultaneously in order to display the findings on a superimposed map and still control an appropriate FDR measure. We apply the method to examples from microarray analysis and functional magnetic resonance imaging (fMRI), two application areas where the need for partial conjunction analysis has been identified.

Paper: PDF file.

Supplementary Material: PDF file.

Matlab code: in Software section.
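
One simple combining test of the kind described above is a Fisher-type statistic built from the n − u + 1 largest p-values, which is valid under independence; the following Python sketch (pure standard library, not the authors' Matlab code) computes that partial conjunction p-value, using the closed-form chi-square survival function for even degrees of freedom:

```python
from math import log, exp, factorial

def fisher_pc_pvalue(pvals, u):
    """Fisher-type partial conjunction p-value for the null that fewer
    than u of the n hypotheses are false (valid under independence).
    Combines the n - u + 1 largest p-values."""
    n = len(pvals)
    largest = sorted(pvals)[u - 1:]           # the n - u + 1 largest p-values
    stat = -2.0 * sum(log(p) for p in largest)
    m = n - u + 1                             # chi-square df is 2m
    # survival function of chi-square with 2m df, via the Poisson sum
    return exp(-stat / 2) * sum((stat / 2) ** k / factorial(k) for k in range(m))
```

For u = 1 this reduces to the ordinary Fisher combination of all p-values, and for u = n it reduces to rejecting based on the largest p-value.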

Benjamini, Y. and Heller, R.
False Discovery Rates for Spatial Signals (2007)
Journal of the American Statistical Association, Vol. 102, No. 480, Pp. 1272–1281.

Abstract: The problem of multiple testing for the presence of signal in spatial data can involve a large number of locations. Traditionally, each location is tested separately for signal presence, but then the findings are reported in terms of clusters of nearby locations. This is an indication that the units of interest for testing are clusters rather than individual locations. The investigator may know a priori these more natural units or an approximation to them. We suggest testing these cluster units rather than individual locations, thus increasing the signal-to-noise ratio within the unit tested as well as reducing the number of hypothesis tests conducted. Since the signal may be absent from part of each cluster, we define a cluster as containing signal if the signal is present somewhere within the cluster. We suggest controlling the false discovery rate (FDR) on clusters, i.e., the expected proportion of clusters rejected erroneously out of all clusters rejected, or its extension to general weights (WFDR). We introduce a powerful two-stage testing procedure and show that it controls the WFDR. Once the cluster discoveries have been made, we suggest 'cleaning' locations in which the signal is absent. For this purpose we develop a hierarchical testing procedure that tests clusters first, then locations within rejected clusters. We show formally that this procedure controls the desired location error rate asymptotically, and support with extensive simulations the conjecture that it does so also in realistic settings. We discuss an application to functional neuroimaging which motivated this research and demonstrate the advantages of the proposed methodology on an example.

Paper: PDF file.
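
The cluster-level error rate above is controlled with machinery in the Benjamini–Hochberg tradition; as a generic reminder of those mechanics (a sketch of the basic unweighted BH step-up procedure applied to a vector of cluster p-values, not the weighted two-stage procedure of the paper):

```python
def benjamini_hochberg(pvals, q):
    """Benjamini-Hochberg step-up procedure at FDR level q.
    Returns the sorted indices of the rejected hypotheses."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])   # ascending p-values
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / n:
            k = rank            # largest rank passing the step-up condition
    return sorted(order[:k])    # reject the k smallest p-values
```

In the cluster setting, each p-value would correspond to one cluster unit rather than one location.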

Heller, R. and Golland, Y. and Malach, R. and Benjamini, Y.
Conjunction group analysis: An alternative to mixed/random effect analysis (2007)
Neuroimage, Vol. 37, No. 4, Pp. 1178–1185.

Abstract: We address the problem of testing in every brain voxel v whether at least u out of n conditions (or subjects) considered show a real effect. The only statistic suggested so far, the maximum p-value method, fails under dependency (unless u = n), and in particular under the positive dependency that arises if all stimuli are compared to the same control stimulus. Moreover, it tends to have low power under independence. For testing that at least u out of n conditions show a real effect, we suggest powerful test statistics that are valid under dependence between the individual condition p-values as well as under independence, and other test statistics that are valid under independence. We use the above approach, replacing conditions by subjects, to produce informative group maps, and thereby offer an alternative to mixed/random effect analysis.

Paper: PDF file.

Heller, R. and Stanley, D. and Yekutieli, D. and Rubin, N. and Benjamini, Y.
Cluster-based analysis of fMRI data (2006)
NeuroImage, Vol. 33, No. 2, Pp. 599–608.

Abstract: We propose a method for the statistical analysis of fMRI data that tests cluster units rather than voxel units for activation. The advantages of this analysis over previous ones are both conceptual and statistical. Recognizing that the fundamental units of interest are the spatially contiguous clusters of voxels that are activated together, we set out to approximate these cluster units from the data by a clustering algorithm especially tailored for fMRI data. Testing the cluster units has a twofold statistical advantage over testing each voxel separately: the signal-to-noise ratio within the unit tested is higher, and the number of hypothesis tests conducted is smaller. We suggest controlling the FDR on clusters, i.e., the proportion of clusters rejected erroneously out of all clusters rejected, and explain the meaning of controlling this error rate. We introduce a powerful adaptive procedure to control the FDR on clusters. We apply our cluster-based analysis (CBA) to both an event-related and a block design fMRI vision experiment, and demonstrate its increased power over voxel-by-voxel analysis in these examples as well as in simulations.

Paper: PDF file.

Matlab code: in Software section.