Combining hypothesis tests#

Intersection-Union Hypotheses#

In many situations, a null hypothesis of interest is the intersection of simpler hypotheses. For instance, the hypothesis that a university does not discriminate in its graduate admissions might be represented as

(does not discriminate in arts and humanities) \(\cap\) (does not discriminate in sciences) \(\cap\) (does not discriminate in engineering) \(\cap\) (does not discriminate in professional schools).

In this example, the alternative hypothesis is a union, viz.,

(discriminates in arts and humanities) \(\cup\) (discriminates in sciences) \(\cup\) (discriminates in engineering) \(\cup\) (discriminates in professional schools).

Framing a test this way leads to an intersection-union test. The null hypothesis is the intersection

\[\begin{equation*} H_0 := \cap_{j=1}^J H_{0j} \end{equation*}\]

and the alternative is the union

\[\begin{equation*} H_1 := \cup_{j=1}^J H_{0j}^c. \end{equation*}\]

There can be good reasons for representing a null hypothesis as such an intersection. In the example just mentioned, the applicant pool might be quite different across disciplines, making it hard to judge at the aggregate level whether there is discrimination, while testing within each discipline is more straightforward (that is, Simpson’s Paradox can be an issue).

Hypotheses about multivariate distributions can sometimes be expressed as the intersection of hypotheses about each dimension separately. For instance, the hypothesis that a \(J\)-dimensional distribution has zero mean could be represented as

(1st component has zero mean) \(\cap\) (2nd component has zero mean) \(\cap\) \(\cdots\) \(\cap\) (\(J\)th component has zero mean).

The alternative is again a union:

(1st component has nonzero mean) \(\cup\) (2nd component has nonzero mean) \(\cup\) \(\cdots\) \(\cup\) (\(J\)th component has nonzero mean).

Combinations of experiments and stratified experiments#

The same kind of issue arises when combining information from different experiments. For instance, imagine testing whether a drug is effective. We might have several randomized, controlled trials in different places, or a large experiment involving a number of centers, each of which performs its own randomization (i.e., the randomization is stratified).

How can we combine the information from the separate (independent) experiments to test the null hypothesis that the drug is ineffective?

Again, the overall null hypothesis is “the drug doesn’t help,” which can be written as an intersection of hypotheses

(drug doesn’t help in experiment 1) \(\cap\) (drug doesn’t help in experiment 2) \(\cap\) \(\cdots\) \(\cap\) (drug doesn’t help in experiment \(J\)),

and the alternative can be written as

(drug helps in experiment 1) \(\cup\) (drug helps in experiment 2) \(\cup\) \(\cdots\) \(\cup\) (drug helps in experiment \(J\)),

a union.

Combining evidence#

Suppose we have a test of each “partial” null hypothesis \(H_{0j}\). Clearly, if the \(P\)-value for one of those tests is sufficiently small, that’s evidence that the overall null \(H_0\) is false. (For instance, we could use Bonferroni’s inequality, aka the union bound.)

But we will not know in advance whether any individual \(P\)-value will turn out to be small. Perhaps none of the individual \(P\)-values turns out to be small, but many are “not large.” Is there a way to combine \(P\)-values across tests to get stronger evidence about \(H_0\)? (Note that our procedure for combining \(P\)-values should be specified before we look at the data, or we are committing selective inference, which requires special methods.)
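For instance, here is a minimal sketch of the Bonferroni combination just mentioned (the helper name bonferroni_reject is ours, for illustration): reject \(H_0\) at level \(\alpha\) whenever some partial \(P\)-value is at most \(\alpha/J\), which by the union bound has chance at most \(\alpha\) if \(H_0\) is true.

# Sketch: Bonferroni (union-bound) combination of partial P-values.
# Reject H_0 at level alpha if any partial P-value is <= alpha/J; under H_0
# the chance of that is at most sum_j alpha/J = alpha.

import numpy as np

def bonferroni_reject(pvalues, alpha=0.05):
    pvalues = np.asarray(pvalues)
    return pvalues.min() <= alpha / len(pvalues)

print(bonferroni_reject([0.03, 0.20, 0.45]))   # False: 0.03 > 0.05/3
print(bonferroni_reject([0.01, 0.20, 0.45]))   # True:  0.01 <= 0.05/3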

Combining functions#

Let \(\lambda\) be a \(J\)-vector of statistics such that the distribution of \(\lambda_j\) is known when hypothesis \(H_{0j}\) is true. We assume that smaller values of \(\lambda_j\) are stronger evidence against \(H_{0j}\). For instance, \(\lambda_j\) might be the \(P\)-value of \(H_{0j}\) for some test.

Consider a function

\[\begin{equation*} \phi: [0, 1]^J \rightarrow \Re; \lambda = (\lambda_1, \ldots, \lambda_J) \mapsto \phi(\lambda) \end{equation*}\]

with the properties:

  • \(\phi\) is non-increasing in every argument, i.e., \(\phi( \ldots, \lambda_j, \ldots) \ge \phi( \ldots, \lambda_j', \ldots)\) if \(\lambda_j \le \lambda_j'\), \(j = 1, \ldots, J\).

  • \(\phi\) attains its maximum if any of its arguments equals 0.

  • \(\phi\) attains its minimum if all of its arguments equal 1.

  • for all \(\alpha > 0\), there exist finite functions \(\phi_-(\alpha)\), \(\phi_+(\alpha)\) such that if every partial null hypothesis \(H_{0j}\) is true,

\[\begin{equation*} \Pr \{\phi_-(\alpha) \le \phi(\lambda) \le \phi_+(\alpha) \} \ge 1-\alpha\end{equation*}\]

and \([\phi_-(\alpha), \phi_+(\alpha)] \subset [\phi_-(\alpha'), \phi_+(\alpha')]\) if \(\alpha \ge \alpha'\).

Then we can use \(\phi(\lambda)\) as the basis of a test of \(H_0 = \cap_{j=1}^J H_{0j}\).

Fisher’s combining function#

\[\begin{equation*} \phi_F(\lambda) := -2 \sum_{j=1}^J \ln(\lambda_j).\end{equation*}\]

Liptak’s combining function#

\[\begin{equation*} \phi_L(\lambda) := \sum_{j=1}^J \Phi^{-1}(1-\lambda_j),\end{equation*}\]

where \(\Phi^{-1}\) is the inverse standard normal CDF.

Tippett’s combining function#

\[\begin{equation*} \phi_T(\lambda) := \max_{j=1}^J (1-\lambda_j).\end{equation*}\]

Direct combination of test statistics#

\[\begin{equation*} \phi_D := \sum_{j=1}^J f_j(\lambda_j), \end{equation*}\]

where \(\{ f_j \}\) are suitable decreasing functions. For instance, if \(\lambda_j\) is the \(P\)-value for \(H_{0j}\) corresponding to some test statistic \(T_j\) for which larger values are stronger evidence against \(H_{0j}\), we could use \(\phi_D = \sum_j T_j\).
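As a concrete illustration, here is a minimal sketch of the three named combining functions in Python (the function names are ours, not a standard API):

# Minimal implementations of the combining functions above.
# Larger values are stronger evidence against the intersection null H_0.

import numpy as np
from scipy.stats import norm

def phi_fisher(lam):
    return -2 * np.sum(np.log(lam))

def phi_liptak(lam):
    return np.sum(norm.ppf(1 - np.asarray(lam)))

def phi_tippett(lam):
    return np.max(1 - np.asarray(lam))

lam = [0.04, 0.10, 0.30]
print(phi_fisher(lam), phi_liptak(lam), phi_tippett(lam))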

Fisher’s combining function for independent \(P\)-values#

Suppose \(H_0\) is true, that \(\lambda_j\) is the \(P\)-value of \(H_{0j}\) for some pre-specified test, that the distribution of \(\lambda_j\) is continuous under \(H_{0j}\), and that \(\{ \lambda_j \}\) are independent if \(H_0\) is true.

Then, if \(H_0\) is true, \(\{ \lambda_j \}\) are IID \(U[0,1]\).

Under \(H_{0j}\), the distribution of \(-\ln \lambda_j\) is exponential(1):

\[\begin{equation*} \Pr \{ -\ln \lambda_j \le x \} = \Pr \{ \ln \lambda_j \ge -x \} = \Pr \{ \lambda_j \ge e^{-x} \} = 1 - e^{-x}. \end{equation*}\]

The distribution of 2 times an exponential is \(\chi_2^2\): the pdf of a chi-square with \(k\) degrees of freedom is

\[\begin{equation*} \frac{1}{2^{k/2}\Gamma(k/2)} x^{k/2-1} e^{-x/2}. \end{equation*}\]

For \(k=2\), this simplifies to \(e^{-x/2}/2\), which is the density of twice an exponential(1) random variable.

Thus, under \(H_0\), \(\phi_F(\lambda)\) is the sum of \(J\) independent \(\chi_2^2\) random variables. The distribution of a sum of independent chi-square random variables is a chi-square random variable with degrees of freedom equal to the sum of the degrees of freedom of the variables that were added.

Hence, under \(H_0\),

\[\begin{equation*} \phi_F(\lambda) \sim \chi_{2J}^2, \end{equation*}\]

the chi-square distribution with \(2J\) degrees of freedom.

Let \(\chi_{k}^2(\alpha)\) denote the \(1-\alpha\) quantile of the chi-square distribution with \(k\) degrees of freedom. If we reject \(H_0\) when

\[\begin{equation*} \phi_F(\lambda) \ge \chi_{2J}^2(\alpha), \end{equation*}\]

that yields a significance level \(\alpha\) test of \(H_0\).
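In code, that test compares \(\phi_F(\lambda)\) to the upper tail of the \(\chi_{2J}^2\) distribution. A minimal sketch (the values of \(\lambda\) are made up for illustration; scipy.stats.combine_pvalues performs the same calculation for Fisher's method):

# Fisher's method for combining independent, continuously distributed P-values.

import numpy as np
from scipy.stats import chi2, combine_pvalues

lam = np.array([0.08, 0.12, 0.20])            # the partial P-values (illustrative)
stat = -2 * np.sum(np.log(lam))               # phi_F(lambda)
p_combined = chi2.sf(stat, df=2 * len(lam))   # upper tail of chi-square, 2J df

stat_sp, p_sp = combine_pvalues(lam, method='fisher')   # scipy's built-in agrees
print(p_combined, p_sp)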

## Simulate distribution of Fisher's combining function when all nulls are true

%matplotlib inline
import matplotlib.pyplot as plt

import numpy as np
from scipy.stats import chi2, uniform

from ipywidgets import interact
import ipywidgets as widgets

def plot_fisher_null(n=5, reps=10000):
    # reps rows of n IID U[0,1] "P-values"; apply Fisher's combining function to each row
    U = uniform.rvs(size=[reps, n])
    vals = -2*np.sum(np.log(U), axis=1)
    fig, ax = plt.subplots(1, 1)
    ax.hist(vals, bins=max(int(reps/40), 5), density=True, label="simulation")
    # overlay the theoretical chi-square density with 2n degrees of freedom
    grid = np.linspace(0, max(vals), 200)
    ax.plot(grid, chi2.pdf(grid, df=2*n), 'r-', lw=3, label='chi-square pdf, df='+str(2*n))
    ax.legend(loc='best')
    plt.show()

interact(plot_fisher_null, n=widgets.IntSlider(min=1, max=200, step=1, value=5))

When \(P\)-values have atoms#

A real random variable \(X\) is first-order stochastically larger than a real random variable \(Y\) if for all \(x \in \Re\),

\[\begin{equation*} \Pr \{ X \ge x \} \ge \Pr \{ Y \ge x \}, \end{equation*}\]

with strict inequality for some \(x \in \Re\).

Suppose the \(P\)-values \(\{\lambda_j \}\) for \(\{ H_{0j}\}\) satisfy, for all \(p \in [0, 1]\),

\[\begin{equation*} \Pr \{ \lambda_j \le p \mid H_{0j} \} \le p. \end{equation*}\]

This takes into account the possibility that \(\lambda_j\) does not have a continuous distribution under \(H_{0j}\), ensuring that \(\lambda_j\) is still a conservative \(P\)-value.

Since \(\ln\) is monotone, it follows that if \(H_{0j}\) is true, then for all \(x \in \Re\),

\[\begin{equation*} \Pr \{ -2 \ln \lambda_j \ge x \} \le \Pr \{ -2 \ln U \ge x \}, \end{equation*}\]

where \(U\) is a \(U[0,1]\) random variable.

That is, if \(\lambda_j\) does not have a continuous distribution, a \(\chi_2^2\) random variable is stochastically larger than \(-2\ln \lambda_j\).

It turns out that \(X\) is stochastically larger than \(Y\) if and only if there is some probability space on which there exist two random variables, \(\tilde{X}\) and \(\tilde{Y}\), such that \(\tilde{X} \sim X\), \(\tilde{Y} \sim Y\), and \(\Pr \{\tilde{X} \ge \tilde{Y} \} = 1\). (See, e.g., Grimmett and Stirzaker, Probability and Random Processes, 3rd edition, Theorem 4.12.3.)

Let \(\{X_j\}_{j=1}^J\) be IID \(\chi_2^2\) random variables, and let \(Y_j := - 2 \ln \lambda_j\), \(j=1, \ldots, J\).

Then there is some probability space for which we can define \(\{\tilde{Y_j}\}\) and \(\{\tilde{X_j}\}\) such that

  • \((\tilde{Y_j})\) has the same joint distribution as \((Y_j)\)

  • \((\tilde{X_j})\) has the same joint distribution as \((X_j)\)

  • \(\tilde{X_j} \ge \tilde{Y_j}\) for all \(j\) with probability one.

Then

  • \(\sum_j \tilde{Y_j}\) has the same distribution as \(\sum_j Y_j = -2 \sum_j \ln \lambda_j\),

  • \(\sum_j \tilde{X_j}\) has the same distribution as \(\sum_j X_j\) (namely, chi-square with \(2J\) degrees of freedom),

  • \(\sum_j \tilde X_j \ge \sum_j \tilde{Y_j}\).

That is,

\[\begin{equation*} \Pr \left \{-2 \sum_j \ln \lambda_j \ge \chi_{2J}^2(\alpha) \right \} \le \alpha. \end{equation*}\]

Thus, we still get a conservative hypothesis test if one or more of the \(P\)-values for the partial tests have atoms under their respective null hypotheses \(\{H_{0j}\}\).
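A quick simulation (a sketch, under the idealization of perfect simulation; the values of \(J\) and \(K\) are ours, for illustration) checks the conservativeness when the partial \(P\)-values are atomic, e.g., rank-based permutation \(P\)-values that take only the values \(k/(K+1)\):

# Check that Fisher's method remains conservative when partial P-values have atoms.

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(12345)
J, K, reps, alpha = 5, 9, 100_000, 0.05

# each lambda_j is uniform on {1/(K+1), ..., (K+1)/(K+1)}, so Pr{lambda_j <= p} <= p
lam = rng.integers(1, K + 2, size=(reps, J)) / (K + 1)
phi = -2 * np.sum(np.log(lam), axis=1)
print(np.mean(phi >= chi2.ppf(1 - alpha, df=2 * J)))   # empirical level; should be <= 0.05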

Accounting for simulation error in stratum-wise \(P\)-values#

Suppose that the \(P\)-value \(\lambda_j\) for \(H_{0j}\) is estimated by \(b_j\) simulations instead of being known exactly. How can we take the uncertainty of the simulation estimate into account?

Here, we will pretend that the simulation itself is perfect: that the PRNG generates true IID \(U[0,1]\) variables, that pseudo-random integers on \(\{0, 1, \ldots, N\}\) really are equally likely, and that pseudo-random samples or permutations really are equally likely, etc.

The error we are accounting for is not the imperfection of the PRNG or other algorithms, just the uncertainty due to approximating a theoretical probability \(\lambda_j\) by an estimate via (perfect) simulation.

A crude approach: simultaneous one-sided upper confidence bounds for every \(\lambda_j\)#

Suppose we find, for each \(j\), an upper confidence bound for \(\lambda_j\) (the “true” \(P\)-value in stratum \(j\)), for instance, by inverting binomial tests based on \(\# \{k > 0: T(\pi_k(x_0)) \ge T(x_0) \}\), the number of random permutations \(\pi_k(x_0)\) of the data \(x_0\) for which the test statistic \(T\) is at least as large as its observed value \(T(x_0)\).

Since \(\phi\) is monotonic in every coordinate, the upper confidence bounds for \(\{ \lambda_j \}\) imply a lower confidence bound for \(\phi(\lambda)\), which translates to an upper confidence bound for the combined \(P\)-value.
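Here is a sketch of that plug-in approach (the helper name and the numbers are ours, for illustration): each upper bound inverts a binomial test, i.e., the Clopper-Pearson construction. The joint confidence level of the resulting bound is taken up next.

# Sketch: upper confidence bounds on stratum-wise simulated P-values,
# plugged into Fisher's combining function to bound the combined P-value.
# Each bound inverts a one-sided binomial test (Clopper-Pearson); x_j counts
# the permutations whose statistic is >= the observed statistic, out of b_j.

import numpy as np
from scipy.stats import beta, chi2

def clopper_pearson_upper(x, b, level):
    # one-sided upper confidence bound at confidence `level` for a binomial p
    return 1.0 if x == b else beta.ppf(level, x + 1, b - x)

x = np.array([3, 10, 7])                 # exceedances in each stratum (illustrative)
b = np.array([1000, 1000, 1000])         # simulations per stratum
lam_up = np.array([clopper_pearson_upper(xi, bi, 0.99) for xi, bi in zip(x, b)])

phi_lower = -2 * np.sum(np.log(lam_up))        # lower confidence bound on phi_F(lambda)
p_up = chi2.sf(phi_lower, df=2 * len(x))       # upper bound on the combined P-value
print(lam_up, p_up)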

What is the confidence level of the bound on the combined \(P\)-value? If the \(P\)-value estimates are independent, the joint coverage probability of \(J\) independent confidence bounds, each at confidence level \(1-\alpha\), is \((1-\alpha)^J\), as we shall show.

Let \(A_j\) denote the event that the upper confidence bound for \(\lambda_j\) is greater than or equal to \(\lambda_j\), and suppose \(\Pr \{A_j\} = 1-\alpha_j\).

Regardless of the dependence among the events \(\{A_j \}\), the chance that all of the confidence bounds cover their corresponding parameters can be bounded using Bonferroni’s inequality:

\[\begin{equation*} \Pr \{ \cap_j A_j \} = 1 - \Pr \{ \cup_j A_j^c \} \ge 1 - \sum_j \Pr \{A_j^c \} = 1 - \sum_j (1- \Pr \{A_j \}) = 1 - \sum_j \alpha_j. \end{equation*}\]

If \(\{A_j\}\) are independent,

\[\begin{equation*} \Pr \{ \cap_j A_j \} = \prod_j \Pr \{ A_j \} = \prod_j (1-\alpha_j). \end{equation*}\]

Both of those expressions tend to get small quickly as \(J\) gets large; bounding \(\phi(\lambda)\) by bounding the components of \(\lambda\) is inefficient.
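A quick calculation shows how fast, assuming every stratum uses the same per-stratum level \(1-\alpha_j = 0.99\):

# Joint coverage of J per-stratum confidence bounds, each at level 1 - 0.01.

alpha_j = 0.01
for J in (5, 10, 50, 100):
    bonferroni = max(0.0, 1 - J * alpha_j)   # guaranteed under any dependence
    independent = (1 - alpha_j) ** J         # exact if the bounds are independent
    print(J, bonferroni, round(independent, 3))
# at J = 100, the Bonferroni guarantee is vacuous; even independence gives only ~0.366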

Let’s look for a different approach.

Dependent tests#

If \(\{ \lambda_j \}_{j=1}^J\) are dependent, the distribution of \(\phi_F(\lambda)\) is in general no longer chi-square when the null hypotheses are true. Nonetheless, one can calibrate a test based on Fisher’s combining function (or any other combining function) by simulation. This is commonly done in multivariate permutation tests, where dependent partial tests are calibrated using “lockstep” permutations.

See, e.g., Pesarin, F., and L. Salmaso, 2010. Permutation Tests for Complex Data: Theory, Applications and Software, Wiley. ISBN 978-0-470-51641-6.

We shall illustrate how the approach can be used to construct nonparametric multivariate tests from univariate tests for the two-sample problem (i.e., is there evidence that two samples come from different populations, or is it plausible that they are a single population randomly divided into two groups?). This is equivalent to testing whether treatment has an effect in a controlled, randomized study in which the subjects who receive treatment are a simple random sample of the study group, using the Neyman model for causal inference. (The null hypothesis is that treatment makes no difference whatsoever: each subject’s response would be the same whether that subject was assigned to treatment or to control, regardless of the assignment of other subjects to treatment or control.)

We have \(N\) subjects, of whom \(N_t\) are treated and \(N_c\) are controls. Each subject has a vector of \(J\) measurements. For each of these \(J\) “dimensions” we have a test statistic \(T_j\) (for instance, the difference between the mean of the treated and the mean of the controls on that dimension, though we could use something else, and we don’t have to use the same test statistic for different dimensions).

Each \(T_j\) takes the responses of the treated and the controls on dimension \(j\) and yields a number. We will assume that larger values of \(T_j\) are stronger evidence against the null hypothesis that for dimension \(j\) treatment doesn’t make any difference.

Let \(T\) denote the whole \(J\)-vector of test statistics. Let \(t(0)\) denote the observed value of the \(J\)-vector of test statistics for the original data.

Now consider randomly re-labelling \(N_t\) of the \(N\) subjects as treated and the remaining \(N_c\) as controls by simple random sampling, so that all subsets of size \(N_t\) are equally likely to be labeled “treated.” Each re-labelling carries the subject’s entire \(J\)-vector of responses with it: the dimensions are randomized “in lockstep.”

Let \(t(k)\) denote the observed \(J\)-vector of test statistics for the \(k\)th random allocation (i.e., the \(k\)th random permutation). We permute the data \(K\) times, each permutation yielding a \(J\)-vector \(t(k)\) of observed values of the test statistics; counting the original data as the zeroth permutation, with vector \(t(0)\), there are \(K+1\) allocations in all. Let \(t_j(k)\) denote the test statistic for dimension \(j\) for the \(k\)th permutation.

We now transform the \(J \times (K+1)\) matrix \([t_j(k)]\), \(j = 1, \ldots, J\), \(k = 0, \ldots, K\), into a corresponding matrix of \(P\)-values for the univariate permutation tests.

For \(j=1, \ldots , J\) and \(k = 0, \ldots , K\), define

\[\begin{equation*} P_j(k) := \frac{\#\{ \ell \in \{0, \ldots, K \} : t_j(\ell) \ge t_j(k) \}}{K+1}. \end{equation*}\]

This is the simulated upper tail probability of the \(k\)th observed value of the \(j\)th test statistic under the null hypothesis.

Think of the values of \(P_j(k)\) as a matrix. Each column corresponds to a random permutation of the original data (the 0th column corresponds to the original data); each row corresponds to a dimension of measurement; each entry is a number between \(1/(K+1)\) and 1.

Now apply Fisher’s combining function \(\phi_F\) (or Tippett’s, or Liptak’s, or anything else) to each column of \(J\) numbers. That gives \(K+1\) numbers, \(f(k)\), \(k=0, \ldots, K\), one for each permutation of the data. The overall “Non-Parametric Combination of tests” (NPC) \(P\)-value is

\[\begin{equation*} P_{\mbox{NPC}} := \frac{\#\{ k \in \{0, \ldots, K\} : f(k) \ge f(0) \}}{K+1}. \end{equation*}\]

This is the simulated upper tail probability of the observed value of the combining function under the randomization.

Ultimately, this whole thing is just a univariate permutation test that uses a complicated test statistic \(\phi_F\) that assigns one number to each permutation of the multivariate data.
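Here is a compact sketch of the whole NPC procedure for the two-sample problem, using the difference in means on each dimension as the partial test statistic (the function and variable names are ours, for illustration; see the permute package below for a maintained implementation):

# Sketch of the nonparametric combination (NPC) of tests for the multivariate
# two-sample problem, with "lockstep" permutations and Fisher's combining function.

import numpy as np

def npc_two_sample(X, treated, K=999, seed=None):
    # X: N x J matrix of responses; treated: boolean vector of length N
    rng = np.random.default_rng(seed)
    treated = np.asarray(treated, dtype=bool)
    N, J = X.shape

    def stats(labels):
        # difference in means (treated minus control) on each dimension
        return X[labels].mean(axis=0) - X[~labels].mean(axis=0)

    # t is J x (K+1); column 0 holds the statistics for the original data
    t = np.empty((J, K + 1))
    t[:, 0] = stats(treated)
    for k in range(1, K + 1):
        t[:, k] = stats(rng.permutation(treated))   # lockstep re-labelling

    # P_j(k) = #{l : t_j(l) >= t_j(k)} / (K+1), the matrix of partial P-values
    P = np.array([[(t[j] >= t[j, k]).mean() for k in range(K + 1)]
                  for j in range(J)])

    f = -2 * np.sum(np.log(P), axis=0)   # Fisher's combining function, per column
    return (f >= f[0]).mean()            # P_NPC

# example with 2-dimensional responses: 20 "treated" of 50 subjects
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
print(npc_two_sample(X, np.arange(50) < 20, K=999, seed=2))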

Also see the permute Python package.

Stratified Permutation Tests#

Two examples:

  • Boring, A., K. Ottoboni, and P.B. Stark, 2016. Student Evaluations of Teaching (Mostly) Do Not Measure Teaching Effectiveness, ScienceOpen, doi 10.14293/S2199-1006.1.SOR-EDU.AETBZC.v1

  • Hessler, M., D.M. Pöpping, H. Hollstein, H. Ohlenburg, P.H. Arnemann, C. Massoth, L.M. Seidel, A. Zarbock & M. Wenk, 2018. Availability of cookies during an academic course session affects evaluation of teaching, Medical Education, 52, 1064–1072. doi 10.1111/medu.13627