The Neyman “potential outcomes” model for causal inference#

There are \(N\) subjects and \(T\) possible treatments. Each subject is represented by a ticket. Ticket \(j\) lists \(T\) numbers, \((x_{j0}, \ldots, x_{jT-1})\). The value \(x_{jt}\) is the response subject \(j\) will have if assigned to treatment \(t\). (Treatment \(0\) is commonly control or placebo.)

This mathematical setup embodies the non-interference assumption: subject \(j\)’s response is assumed to depend only on which treatment subject \(j\) receives, and not on the treatment any other subject receives. (That is not a good assumption in situations like vaccine trials, where whether one subject becomes infected may depend on which other subjects are vaccinated, if subjects come in contact with each other.)

This model is also called the potential outcomes model, because it starts with the potential outcomes each subject will have to each treatment. Assigning a subject to a treatment just reveals the potential outcome that corresponds to that treatment, for that subject. This model was introduced by Jerzy Neyman, the founder of the U.C. Berkeley Department of Statistics, in a 1923 paper in Polish translated into English in 1990. It was popularized by Donald Rubin in the 1970s and 1980s.

There are generalizations of this model, including one in which the “potential outcomes” are random, rather than deterministic, but their distributions are fixed before assignment to treatment: if subject \(j\) is assigned treatment \(t\), a draw from the distribution \(\mathbb{P}_{jt}\) is observed. Draws for different subjects are independent. We shall see an example of this when we discuss nuisance parameters.

Intention-to-treat (ITT) versus per-protocol#

An intention-to-treat analysis compares outcomes in the subjects assigned to treatment to outcomes in the subjects assigned to control, whether or not the subjects assigned to a treatment received it. It can be thought of as measuring the effect of prescribing treatment, in contrast to the effect of receiving treatment. It is common in experiments on humans that some subjects do not receive the treatment they were randomly assigned. Such non-compliance may result from subjects breaking the blinding (e.g., noticing that their “treatment” is a sugar pill) and seeking alternative treatments rather than receiving only the placebo. For instance, this was an issue in the TOGETHER study of the effect of ivermectin on COVID-19, in which a substantial fraction of patients assigned to receive a placebo bought ivermectin over the counter and took it. Non-compliance may also result from subjects simply not doing as they are told. In general, ITT is the preferred analysis, but especially when compliance rates are low, a per-protocol (PP) analysis, which compares outcomes for subjects who actually received the treatment to outcomes for subjects who actually received placebo (and no other treatment), may be more informative.

For instance, in a randomized study of the effect of colonoscopy screenings on colorectal cancer and cancer deaths published in 2022, only 42% of subjects who were “invited” to receive a colonoscopy actually underwent the procedure. The apparent effect of “inviting” subjects to receive a colonoscopy was small: the risk of death from colorectal cancer at 10 years was 0.28% in the invited group and 0.31% in the control group, an ITT relative risk of \(0.28/0.31 \approx 0.9\). This led major news outlets to report that colonoscopies do not prevent cancer deaths. A better description might be that recommending colonoscopies does not appear to have a large effect on deaths. In the same study, the 10-year risk of death among subjects who actually received colonoscopies was 0.15%, a PP relative risk of \(0.15/0.31 \approx 0.5\), substantially stronger evidence that receiving a colonoscopy has value.

However, for PP it is crucial to know why subjects did not receive their assigned treatment, since self-selection and other processes can lead to substantial biases in estimates of the effect of treatment. For example, imagine a trial of a treatment for a disease. Suppose that treatment is effective, and that subjects who were assigned a placebo and get worse break protocol and take an active treatment instead. Then the PP analysis will be biased against finding that the treatment is effective. Conversely, suppose that treatment actually makes some people worse, and that those people stop taking the treatment. Then a PP analysis will tend to hide the fact that treatment hurts some subjects.

Censoring and missing data#

It is common in some kinds of experiments that the outcomes for some subjects cannot be observed, or can be observed only partially. For instance, in experiments on people that involve followup over time, some subjects might simply not show up for one or more followup examinations, resulting in missing data.

In experiments involving how long something takes, for instance, time to “failure” or “death,” some subjects might not have experienced “failure” by the time the experiment ends. The experimenter knows that such subjects are “still alive” when the experiment ended, but they do not know how long it will be until those subjects “die.” That is, rather than observing the failure time itself, the experimenter observes that the failure time is greater than some value. This is an example of (right) censoring. There are special techniques for dealing with missing data (including subjects lost to followup) and censoring.

Below, we assume that compliance is perfect, that no subject drops out of the experiment, and that all outcomes can be observed (no missing data and no censoring).

Null hypotheses for the Neyman model#

The strong null hypothesis is that subject by subject, the effect of all \(T\) treatments is the same. That is,

()#\[\begin{equation} x_{j0} = x_{j1} = \cdots = x_{jT-1}, \;\; j=1, \ldots, N. \end{equation}\]

Different subjects may have different responses (\(x_{jt}\) might not equal \(x_{kt}\) if \(j \ne k\)), but each subject’s response is the same regardless of the treatment assigned to that subject. This is the null hypothesis Fisher considered in The Design of Experiments, and the one he generally regarded as the “correct” null in practice.

The weak null hypothesis is that on average across subjects, all treatments have the same effect. That is,

()#\[\begin{equation} \frac{1}{N} \sum_{j=1}^N x_{j0} = \frac{1}{N} \sum_{j=1}^N x_{j1} = \ldots = \frac{1}{N} \sum_{j=1}^N x_{jT-1}. \end{equation}\]

Much of Neyman’s work on experiments involves this null hypothesis. The statistical theory is more complex for the weak null hypothesis than for the strong null, and most results about the weak null are asymptotic, while the strong null is more amenable to exact or conservative inferences.

The strong null is indeed a stronger hypothesis than the weak null, because if the strong null is true, the weak null must also be true: if \(T\) lists are equal, element by element, then their means are equal. (However, see Ding, P., 2017. A Paradox from Randomization-Based Causal Inference, Statist. Sci. 32, 331-345. 10.1214/16-STS571.)

The converse is not true: the weak null can be true even if the strong null is false. For example, for \(T=2\) and \(N=2\), we might have potential responses \((0, 1)\) for subject 1 and \((1,0)\) for subject 2. The effect of treatment is to increase subject 1’s response from 0 to 1 and to decrease subject 2’s response from 1 to 0. The treatment affects both subjects, but the average effect of treatment is the same: the average response across subjects is 1/2, with or without treatment.

The effect of treatment \(t\) on subject \(j\) (compared to control) is

()#\[\begin{equation} \tau_{jt} := x_{jt} - x_{j0}, \;\;t=1, \ldots, T-1. \end{equation}\]

The average effect of treatment \(t\) (compared to control) is

()#\[\begin{equation} \bar{\tau}_t := \frac{1}{N} \sum_{j=1}^N (x_{jt} - x_{j0}), \;\; t=1, \ldots, T-1. \end{equation}\]

The strong null hypothesis for treatment \(t > 0\) is that \(\tau_{jt} = 0\) for all \(j = 1, \ldots, N\). (Treatment \(t\) has no effect on any subject in the study.) The “overall” or “grand” or “omnibus” strong null hypothesis is that \(\tau_{jt} = 0\) for all \(j = 1, \ldots, N\) and all \(t = 1, \ldots, T-1\). (No treatment makes any difference to any subject in the study.) Equivalently, the omnibus strong null hypothesis is the intersection of the strong null hypotheses for all the individual treatments.

The weak null hypothesis for treatment \(t > 0\) is that \(\bar{\tau}_t = 0\). (On average, treatment \(t\) has no effect on the subjects in the study.) The “overall” or “grand” or “omnibus” weak null hypothesis is that \(\bar{\tau}_t = 0\) for all \(t=1, \ldots, T-1\). (No treatment affects the average response of the subjects in the study.) Equivalently, the omnibus weak null hypothesis is the intersection of the weak null hypotheses for all the individual treatments.

In general we have to make assumptions about how treatment affects responses in order to make rigorous inferences about the average treatment effect.

Testing the strong null hypothesis#

Under the strong null that the treatment makes no difference whatsoever–as if the response had been predetermined before assignment to treatment or control–the null distribution of any test statistic is completely determined once the data have been observed: we know what the data would have been for any other random assignment, namely, the same. And we know the chance that each of those possible datasets would have resulted from the experiment, since we know how subjects were assigned at random to treatments.

For alternatives that allow us to find \(x_{j0}\) from \(x_{jt}\) and vice versa, the alternative also completely determines the probability distribution of any test statistic, once the data have been observed.

Alternative hypotheses in the Neyman model#

Constant shift#

For example, if we assume that the effect of treatment is to shift every subject’s response by the same amount, then we can use a test of the strong null to make inferences about that constant effect. In symbols, this alternative states that there are numbers \(\{\Delta_t\}_{t=1}^{T-1}\) such that \(x_{jt} = x_{j0}+\Delta_t\) for all subjects \(j\) and all treatments \(t\).

Again, once the original data are observed, this hypothesis completely specifies the probability distribution of the data: we know what subject \(j\)’s response would have been had the subject been assigned any other treatment.
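For instance, with \(T=2\) and a hypothesized shift \(\Delta\), we can reconstruct every subject’s control response from the observed data under the alternative, then compare the observed difference in means to its randomization distribution. Here is a minimal simulation sketch, assuming assignment by simple random sampling; the function name shift_null_p and its arguments are illustrative, not from the text.

import numpy as np

def shift_null_p(x, w, delta, reps=10_000, seed=12345):
    '''
    simulated P-value for the hypothesis that treatment adds delta to every
    subject's response (T=2 treatments), for assignment by simple random sampling
    
    x : observed responses, one per subject
    w : 0/1 treatment assignments (1 = active treatment)
    '''
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    w = np.asarray(w).astype(bool)
    x0 = np.where(w, x - delta, x)            # each subject's control response, under the hypothesis
    tau_hat = x[w].mean() - x[~w].mean()      # observed difference in means
    dist = np.empty(reps)
    for r in range(reps):
        wr = rng.permutation(w)                    # hypothetical re-assignment
        xr = np.where(wr, x0 + delta, x0)          # responses that assignment would have produced
        dist[r] = xr[wr].mean() - xr[~wr].mean()
    return np.mean(np.abs(dist - delta) >= np.abs(tau_hat - delta))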

Other tractable alternative hypotheses#

Similarly, it is straightforward to test against any fixed alternative set of values of \(\{\tau_{jt}\}\) using methods for testing the strong null. It is also possible to test against the alternative \(x_{jt} = f_t(x_{j0})\) for some strictly monotonic (and thus invertible) functions \(\{f_t\}\). A simple example is that treatment multiplies the baseline response by a nonzero constant.

In some contexts, it can be reasonable to assume that treatment can only help, that is, that \(x_{jt} \ge x_{j0}\) for all \(t\), without specifying a functional relationship between the potential outcomes.

See also Caughey, D., A. Dafoe, X. Li, and L. Miratrix, 2021. Randomization Inference beyond the Sharp Null: Bounded Null Hypotheses and Quantiles of Individual Treatment Effects https://arxiv.org/abs/2101.09195 and Wu and Ding, 2021. Randomization Tests for Weak Null Hypotheses in Randomized Experiments, JASA, 116. https://www.tandfonline.com/doi/abs/10.1080/01621459.2020.1750415

Binary treatments#

Suppose \(T=2\): we are comparing two treatments (typically, an “active” treatment and a placebo or other control). Let \(\boldsymbol{1}_N\) be the \(N\)-dimensional vector of 1s, and let \(\boldsymbol{0}_N\) be the \(N\)-dimensional zero vector. We will omit the subscript \(N\) where the dimension is clear from context.

Recall that the individual treatment effect for subject \(j\) is

()#\[\begin{equation} \tau_j := x_{j1}-x_{j0}. \end{equation}\]

Let \(\tau := (\tau_1, \ldots \tau_N)\). The average treatment effect for the subjects in the experiment is

()#\[\begin{equation} \bar{\tau} := \frac{1}{N} \sum_{j=1}^N \tau_j = \boldsymbol{1} \cdot \tau/N = \frac{1}{N} \sum_{j=1}^N (x_{j1}-x_{j0}) = \bar{x}_1 - \bar{x}_0. \end{equation}\]

(Whether \(\bar{\tau}\) says anything about any larger population depends on how the subjects in the experiment were selected–i.e., on whether the experimental subjects are representative of that larger population.) The strong null hypothesis is \(\tau = \boldsymbol{0}\), i.e.,

()#\[\begin{equation} \tau_1 = \tau_2 = \ldots = \tau_N = 0. \end{equation}\]

This is a composite hypothesis, because it does not specify the “baseline” responses \(\{x_{j0}\}\) of the subjects, while the distribution of the data depends on \(\{x_{j0}\}\) in addition to the individual treatment effects \(\{\tau_j\}\). The baseline responses are nuisance parameters if the goal is to make inferences about treatment effects.

For any vector \(\xi \in \mathbb{R}^N\), let \(H_\xi\) denote the hypothesis that \(\tau = \xi\). For any set \(\Xi \subset \mathbb{R}^N\), let \(H_\Xi\) denote the hypothesis that \(\tau \in \Xi\). Equivalently, \(H_\Xi = \cup_{\xi \in \Xi} H_\xi\). Then the strong null is \(H_\boldsymbol{0}\) and the weak null is \(H_{\Xi_0}\), where

()#\[\begin{equation} \Xi_0 := \left \{ \xi \in \mathbb{R}^N : \boldsymbol{1}_N \cdot \xi/N = 0 \right \}. \end{equation}\]

Inference about the average treatment effect#

Suppose we assign \(n\) subjects at random to treatment 1 (the active treatment) and the other \(m = N-n\) to treatment 0 (control or placebo).

Then testing the strong null hypothesis is identical to the two-sample problem: under the strong null, each subject’s response would have been the same, regardless of treatment. If we know the mechanism for allocating subjects to treatments (e.g., by simple random sampling, by Bernoulli sampling, using a blocked design, etc.) we can calculate or simulate the distribution of any test statistic whatsoever. The question is which test statistic to use. The answer depends on the alternatives against which we want to have power: if we think a particular set of alternatives is most plausible or most important, we can choose the test statistic to maximize some summary of power over that set of alternatives, such as the minimum or a weighted average (the weights can be thought of as a prior probability distribution on alternatives).

We now consider the weak null hypothesis and the average treatment effect.

Let \(W_j\) denote the treatment assigned to subject \(j\): \(W_j = 0\) if subject \(j\) is assigned to control and \(W_j = 1\) if subject \(j\) is assigned to the active treatment. Assume for now that the assignment is by simple random sampling or by Bernoulli sampling. For Bernoulli sampling, conditioning on the number of subjects assigned to active treatment, \(n\), makes the math the same as it is for simple random sampling, where \(n\) is fixed in advance.

The random variables \(\{W_j\}\) are identically distributed but not independent (given \(n\)), because \(\sum_{j=1}^N W_j = n\). Because \(\{W_j\}\) are identically distributed, they share the same expected value, and thus \(\mathbb{E} \sum_{j=1}^N W_j = N \mathbb{E} W_j\) for each \(j\). Since \(\sum_{j=1}^N W_j = n\) is constant, \(\mathbb{E} \sum_{j=1}^N W_j = n\). Hence, \(\mathbb{E} W_j = n/N\), \(j = 1, \ldots, N\).
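As a quick numerical sanity check (not part of the argument), we can estimate \(\mathbb{E} W_j\) for a single subject by simulating many assignments; the values of \(N\), \(n\), and the number of replications below are arbitrary.

import numpy as np

rng = np.random.default_rng(12345)
N, n, reps = 10, 4, 100_000
# fraction of simulated simple random samples in which subject 0 is assigned to treatment;
# should be close to n/N = 0.4
np.mean([0 in rng.choice(N, n, replace=False) for _ in range(reps)])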

Here is a bit more notation:

()#\[\begin{equation} X_0^* := \sum_{j=1}^N (1-W_j)x_{j0} \end{equation}\]

is the sum of the responses of the subjects assigned to treatment 0 and

()#\[\begin{equation} \bar{X}_0 := X_0^*/m \end{equation}\]

is their mean response.

()#\[\begin{equation} X_1^* := \sum_{j=1}^N W_j x_{j1} \end{equation}\]

is the sum of the responses of the subjects assigned to treatment 1 and

()#\[\begin{equation} \bar{X}_1 := X_1^*/n, \end{equation}\]

is their mean response.

The statistics \(\bar{X}_0\) and \(\bar{X}_1\) are unbiased estimates of the corresponding population parameters,

()#\[\begin{equation} \bar{x}_0 := \frac{1}{N} \sum_{j=1}^N x_{j0} \end{equation}\]

and

()#\[\begin{equation} \bar{x}_1 := \frac{1}{N} \sum_{j=1}^N x_{j1}. \end{equation}\]

Therefore, \(\hat{\tau} := \bar{X}_1 - \bar{X}_0\) is an unbiased estimate of the average treatment effect \(\bar{\tau}\). One could base a permutation test of the strong null using as the test statistic any of \(X_0^*\), \(X_1^*\), \(\bar{X}_0\), \(\bar{X}_1\), or \(\hat{\tau}\): all lead to equivalent tests, in the sense that they reject for the same data.
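Here is a minimal sketch of such a test by simulation, using \(\hat{\tau}\) as the test statistic against a two-sided alternative and assuming assignment by simple random sampling; the function name strong_null_p and its arguments are illustrative, not from a library.

import numpy as np

def strong_null_p(x, w, reps=10_000, seed=12345):
    '''
    simulated P-value for the strong null hypothesis, using the difference in mean
    responses between the two groups as the test statistic
    
    x : observed responses, one per subject
    w : 0/1 treatment assignments (1 = active treatment), made by simple random sampling
    '''
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    w = np.asarray(w).astype(bool)
    tau_hat = x[w].mean() - x[~w].mean()           # observed value of the test statistic
    dist = np.empty(reps)
    for r in range(reps):
        wr = rng.permutation(w)                     # re-randomize the assignment
        dist[r] = x[wr].mean() - x[~wr].mean()      # under the strong null, the responses do not change
    return (1 + np.sum(np.abs(dist) >= np.abs(tau_hat))) / (1 + reps)   # conservative simulated P-value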

Two treatments, binary responses#

Imagine testing whether a vaccine prevents a disease. We assign a random sample of \(n\) of the \(N\) subjects to receive treatment 1; the other \(m = N-n\) receive a placebo, treatment 0. Let \(W_j\) denote the treatment assigned to subject \(j\). We observe

()#\[\begin{equation} X_j := (1-W_j) x_{j0} + W_j x_{j1}, \;\; j=1, \ldots, N. \end{equation}\]

These are random variables, but (in this model) the only source of randomness is \(\{W_j\}\), the treatment assignment variables.

The total number of infections among the vaccinated is

()#\[\begin{equation} X_1^* := \sum_{j=1}^N W_j x_{j1} \end{equation}\]

and the total among the unvaccinated is

()#\[\begin{equation} X_0^* := \sum_{j=1}^N (1-W_j) x_{j0}. \end{equation}\]

Under the strong null that the vaccine makes no difference whatsoever–as if whether a subject would become ill was predetermined before the assignment to treatment or control–the number of infections among the vaccinated has a hypergeometric distribution with parameters \(N\), \(G=X_0^*+X_1^*\), and \(n\). Testing the strong null using this hypergeometric distribution yields Fisher’s Exact Test.
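For instance, here is a sketch of the corresponding one-sided test using scipy.stats.hypergeom; the counts are hypothetical, chosen only to illustrate the calculation.

from scipy.stats import hypergeom

# hypothetical counts, for illustration only
N, n = 200, 100              # subjects in all, number vaccinated
X1_star, X0_star = 4, 16     # infections among the vaccinated and the unvaccinated
G = X0_star + X1_star        # total infections, fixed under the strong null

# chance of X1_star or fewer infections among the vaccinated if the vaccine
# makes no difference whatsoever: the lower tail of the hypergeometric
p_value = hypergeom.cdf(X1_star, N, G, n)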

This approach can be generalized by considering the sample sizes \(n\) and \(m\) and/or the total number of infections \(X_0^*+X_1^*\) to be random, then conditioning on their observed values to get a conditional test.

If responses are binary, each \(x_{jt}\) is \(0\) or \(1\), and each \(\tau_j\) is either \(-1\), \(0\), or \(1\). Thus the only possible values of \(\bar{\tau}\) are multiples of \(1/N\). The largest and smallest possible values are \(\bar{\tau}=-1\) (which occurs if \(x_{j1}=0\) and \(x_{j0}=1\) for all \(j\)) and \(\bar{\tau}=1\) (which occurs if \(x_{j1}=1\) and \(x_{j0}=0\) for all \(j\)), a range of 2.

The core question#

What can we say about the average treatment effect from the data?

The potential responses of the subjects in the experiment can be summarized by four numbers, the number of subjects with each possible combination of potential outcomes. Let \(N_{ik}\) be the number of subjects \(j\) for whom \(x_{j0} = i\) and \(x_{j1} = k\), for \(i, k \in \{0, 1\}\). That is, \(N_{00}\) is the number of subjects whose response is “0” whether they are assigned to treatment or to control, while \(N_{01}\) is the number of subjects whose response would be “0” if assigned to control and “1” if assigned to treatment, etc. Of course, \(N = N_{00} + N_{01} + N_{10} + N_{11}\). (Note: this notation has the indices in the opposite order from that in Li and Ding (2016).) These four numbers tell us everything relevant about the study population. Let \(\mathbf{N} := (N_{ik})_{i, k \in \{0, 1\}}\) be the 2x2 summary potential outcome table.

Let \(N_{1+} := N_{10} + N_{11}\) be the number of subjects whose response would be “1” if assigned to control, and let \(N_{+1} := N_{01} + N_{11}\) be the number of subjects whose response would be “1” if assigned to treatment. (Note: this notation also has the indices in the opposite order from that in Li and Ding (2016).)

Now \(\sum_{j=1}^N x_{j0} = N_{1+}\) and \(\sum_{j=1}^N x_{j1} = N_{+1}\), so the average treatment effect can be written

()#\[\begin{equation} \bar{\tau} = \frac{1}{N} \sum_{j=1}^N (x_{j1}-x_{j0}) = \frac{1}{N} (N_{+1} - N_{1+}), \end{equation}\]

the difference between the average response if every subject were assigned to treatment and the average response if every subject were assigned to control.

Once the data have been observed, we know that \(N_{1+}\) is at least \(X_0^*\) (it was that big for the control group alone, even if none of the \(n\) treated subjects would have had a control response of 1) and at most \(X_0^* + n\) (if all \(n\) of the treated subjects would have had a response of 1 if they had been assigned to control instead of the active treatment). Similarly, we know that \(N_{+1}\) is at least \(X_1^*\) (we saw that many 1s in the active treatment group) and at most \(X_1^* + m\) (if every subject assigned to control would have responded 1 if they had received the active treatment). Their difference is thus between

()#\[\begin{equation} X_1^* - X_0^* - n \end{equation}\]

and

()#\[\begin{equation} X_1^* - X_0^* + m, \end{equation}\]

a range of \(n+m = N\). Thus

()#\[\begin{equation} \bar{\tau} \in \{ (X_1^* - X_0^* - n)/N, (X_1^* - X_0^* - n + 1)/N, \ldots, (X_1^* - X_0^* + m)/N\}, \end{equation}\]

which has range \(1\).

First (conservative) approach to confidence bounds for \(\bar{\tau}\): Bonferroni simultaneous confidence intervals#

A tuple of confidence set procedures \((\mathcal{I}_i )_{i=1}^C\) for a corresponding tuple of parameters \((\theta_i)_{i=1}^C \in \Theta\) has simultaneous coverage probability \(1-\alpha\) if

()#\[\begin{equation} \mathbb{P}_{\theta_1, \ldots, \theta_C} \left ( \bigcap_{i=1}^C \{ \mathcal{I}_i \ni \theta_i \} \right ) \ge 1-\alpha, \end{equation}\]

whatever the true values \((\theta_i) \in \Theta\) are.

Bonferroni’s inequality, also called the union bound, says that for any collection of events \(\{A_i\}\),

()#\[\begin{equation} \mathbb{P} \left ( \bigcup_i A_i \right ) \le \sum_i \mathbb{P}(A_i). \end{equation}\]

It follows that if \(\mathcal{I}_i\) is a confidence interval procedure for \(\theta_i\) with coverage probability \(1-\alpha_i\), \(i=1, \ldots, C\), then the tuple of \(C\) confidence set procedures \((\mathcal{I}_i)_{i=1}^C\) has simultaneous coverage probability not smaller than \(1-\sum_{i=1}^C \alpha_i\).

In the current situation, if we had simultaneous confidence sets for both \(N_{1+}\) and \(N_{+1}\), we could find a confidence set for \(\bar{\tau}\), because \(\bar{\tau} = (N_{+1} - N_{1+})/N\).

Now, \(X_1^*\) is the number of “1”s in a random sample of size \(n\) from a population of size \(N\) of which \(N_{+1}\) are labeled “1” and the rest are labeled “0.” Similarly, \(X_0^*\) is the number of “1”s in a random sample of size \(m\) from a population of size \(N\) of which \(N_{1+}\) are labeled “1” and the rest are labeled “0.” That is, the probability distribution of \(X_1^*\) is hypergeometric with parameters \(N\), \(G=N_{+1}\), and \(n\), and the probability distribution of \(X_0^*\) is hypergeometric with parameters \(N\), \(G=N_{1+}\), and \(m\). We can thus use methods for finding confidence bounds for the hypergeometric \(G\) to find confidence bounds for \(N_{+1}\) and \(N_{1+}\) from \(X_1^*\) and \(X_0^*\). The variables \(X_1^*\) and \(X_0^*\) are dependent, but Bonferroni’s inequality holds for dependent events.

To construct an upper 1-sided \(1-\alpha\) confidence bound for \(\bar{\tau}\), we can find an upper 1-sided \(1-\alpha/2\) confidence bound for \(N_{+1}\) (using the hypergeometric distribution), subtract a lower 1-sided \(1-\alpha/2\) confidence bound for \(N_{1+}\), and divide the result by \(N\).

To construct a lower 1-sided \(1-\alpha\) confidence bound for \(\bar{\tau}\), we can find a lower 1-sided \(1-\alpha/2\) confidence bound for \(N_{+1}\), subtract an upper 1-sided \(1-\alpha/2\) confidence bound for \(N_{1+}\), and divide the result by \(N\).

To construct a 2-sided \(1-\alpha\) confidence interval for \(\bar{\tau}\), we can find a 2-sided \(1-\alpha/2\) confidence interval for \(N_{+1}\) and a 2-sided \(1-\alpha/2\) confidence interval for \(N_{1+}\). The lower endpoint of the \(1-\alpha\) confidence interval for \(\bar{\tau}\) is the lower endpoint of the 2-sided interval for \(N_{+1}\) minus the upper endpoint of the 2-sided interval for \(N_{1+}\), divided by \(N\). The upper endpoint of the \(1-\alpha\) confidence interval for \(\bar{\tau}\) is the upper endpoint of the 2-sided interval for \(N_{+1}\) minus the lower endpoint of the 2-sided interval for \(N_{1+}\), divided by \(N\).
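Here is a sketch of the two-sided construction, assuming assignment by simple random sampling. It inverts one-sided hypergeometric tests to get confidence bounds for \(G\), forming each two-sided \(1-\alpha/2\) interval from two one-sided bounds at level \(1-\alpha/4\) (itself a Bonferroni combination); the function names are illustrative.

import numpy as np
from scipy.stats import hypergeom

def hypergeom_upper(x, N, size, gamma):
    # upper one-sided 1-gamma confidence bound for the number G of 1s in a population
    # of N, having observed x 1s in a simple random sample of `size` items:
    # the largest G that a level-gamma one-sided test does not reject
    G = np.arange(x, N - (size - x) + 1)     # G must allow x 1s and size-x 0s in the sample
    return G[hypergeom.cdf(x, N, G, size) > gamma].max()

def hypergeom_lower(x, N, size, gamma):
    # lower one-sided 1-gamma confidence bound for G
    G = np.arange(x, N - (size - x) + 1)
    return G[hypergeom.sf(x - 1, N, G, size) > gamma].min()

def bonferroni_tau_ci(N, n, m, X1_star, X0_star, alpha=0.05):
    # conservative two-sided 1-alpha confidence interval for the average treatment
    # effect, combining intervals for N_{+1} (from the treated) and N_{1+} (from
    # the controls) via Bonferroni
    gamma = alpha/4
    Nplus1_lo, Nplus1_hi = hypergeom_lower(X1_star, N, n, gamma), hypergeom_upper(X1_star, N, n, gamma)
    N1plus_lo, N1plus_hi = hypergeom_lower(X0_star, N, m, gamma), hypergeom_upper(X0_star, N, m, gamma)
    return (Nplus1_lo - N1plus_hi)/N, (Nplus1_hi - N1plus_lo)/N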

This approach has a number of advantages: it is conceptually simple, conservative, and only requires the ability to compute confidence intervals for \(G\) for hypergeometric distributions. But the intervals will in general be quite conservative, i.e., unnecessarily wide. (See Table 1 of Li and Ding (2016).) We now examine a sharper approach.

Second approach: testing all tables of potential outcomes#

After the randomization, for each subject \(j\), we observe either \(x_{j0}\) or \(x_{j1}\). At that point, we know \(N\) of the \(2N\) numbers \(\{x_{jt} : j=1, \ldots, N; \; t \in \{0, 1\}\}\). The other \(N\) numbers–the responses that were not observed–can be any combination of 0s and 1s: there are \(2^N\) possibilities in all. But not all of those yield distinguishable sets of responses: what matters is how many subjects have each possible pattern of potential outcomes: (0, 0), (0, 1), (1, 0), and (1, 1).

The average treatment effect \(\bar{\tau}\) is determined by the summary potential outcome table \(\mathbf{N}\), i.e., the four numbers \(N_{00}\), \(N_{01}\), \(N_{10}\), and \(N_{11}\), which sum to \(N\). How many possible values are there for those four numbers? The number of ways of partitioning \(N\) items into 4 groups can be found by Feller’s “bars and stars” argument (see the notes on nuisance parameters); the answer is \(\binom{N+3}{3} = (N+3)(N+2)(N+1)/6\). This is \(O(N^3)\) tables (see Rigdon and Hudgens (2015)).
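For example, a quick numerical check of that count for a couple of (arbitrary) values of \(N\):

import math

# number of 2x2 summary tables of potential outcomes for N subjects: (N+3 choose 3)
math.comb(10 + 3, 3)       # 286 tables for N = 10
math.comb(1000 + 3, 3)     # about 1.7e8 tables for N = 1000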

But many of those tables are incompatible with the observed data. For instance, we know that \( N_{1+} \ge X_0^*\) and \(N_{+1} \ge X_1^*\). Li and Ding (2016) show that taking into account the observed data reduces the number of tables that must be considered to \(O(N^2)\), greatly speeding the computation.

More recent work cuts the number of tables from \(O(N^2)\) to \(O(N \ln N)\) (Aronow, P.M., H. Chang, and P. Lopatto, 2022?. Fast computation of exact confidence intervals for randomized experiments with binary outcomes. https://lopat.to/permutation.pdf).

A permutation approach#

Here is the Li and Ding (2016) approach. Together, \(N_{00}\), \(N_{01}\), \(N_{10}\), and \(N_{11}\) determine the sampling distribution of any statistic, through the random allocation of subjects to treatments. (This is true whether the allocation to treatment is by simple random sampling or Bernoulli sampling; for a blocked design, the distribution also depends on how many subjects of each kind are in each block.) To test the null hypothesis \(H_0: \mathbf{N} = \mathbf{N}_0\), we can define some function \(T\) of \(X_0^*\) and \(X_1^*\) and reject \(H_0\) if the observed value of \(T\) is in the tail of the probability distribution corresponding to \(H_0\).

What should we use for \(T\)? Since we are interested in \(\bar{\tau}\), we might base the test on

()#\[\begin{equation} \hat{\tau} := \bar{X}_1 - \bar{X}_0 = \frac{X_1^*}{n} - \frac{X_0^*}{m}. \end{equation}\]

This is an unbiased estimator of \(\bar{\tau}\): \(X_1^*/n\) is the sample mean of a simple random sample of size \(n\) from \(\{x_{j1}\}_{j=1}^N\), so it is unbiased for \(N_{+1}/N\) and \(X_0^*/m\) is the sample mean of a simple random sample of size \(m\) from \(\{x_{j0}\}_{j=1}^N\), so it is unbiased for \(N_{1+}/N\). Their difference is thus unbiased for \(N_{+1}/N - N_{1+}/N = \bar{\tau}\).

Rigdon and Hudgens (2015) and Li and Ding (2016) take the test statistic to be \(T=|\hat{\tau} - \bar{\tau}|\) (they look at other things, too).

Recall that \(N_{1+} := N_{10} + N_{11}\), \(N_{+1} := N_{01} + N_{11}\), and

()#\[\begin{equation} \bar{\tau} = \frac{N_{+1} - N_{1+}}{N} = \frac{N_{01}-N_{10}}{N}. \end{equation}\]

For any 2x2 summary table \(\mathbf{M}\) of counts of potential outcomes, define

()#\[\begin{equation} \bar{\tau}(\mathbf{M}) := \frac{M_{01} - M_{10}}{\sum_{j,k} M_{jk}}. \end{equation}\]

If we do not reject the simple hypothesis \(\mathbf{N} = \mathbf{N}_0\), then we cannot reject the hypothesis \(\bar{\tau} = \bar{\tau}(\mathbf{N}_0) := \bar{\tau}_0\). We can reject the composite hypothesis \(\bar{\tau} = \bar{\tau}_0\) if we can reject the simple hypothesis \(\mathbf{N} = \mathbf{M}\) for every summary table \(\mathbf{M}\) for which \(\bar{\tau}(\mathbf{M}) = \bar{\tau}_0\).

We can calibrate the test of the hypothesis \(\mathbf{N} = \mathbf{N}_0\) by finding out the probability distribution of \(T\) under \(H_0\): Given the summary potential outcome table \(\mathbf{N}_0\) (the counts of subjects with each possible combination of potential outcomes), we can construct a full table of potential outcomes that is consistent with that table. For each of the \(\binom{N}{n}\) equally likely possible treatment assignments, we can find the corresponding value of \(T\) by allocating the subjects accordingly, finding \(\hat{\tau}\), subtracting \(\bar{\tau}(\mathbf{N}_0)\), and taking the absolute value.

When \(N\) is large and \(n\) is not close to \(0\) or \(N\), it is impractical to construct all \(\binom{N}{n}\) possible treatment assignments. Then, we might approximate the probability distribution by simulation: actually drawing pseudo-random samples of size \(n\) from the subjects, considering them to be the treatment group, calculating \(\hat{\tau}\), etc. As discussed in the chapter on testing, that can be viewed as an approximation to a theoretical \(P\)-value or as the exact \(P\)-value for a randomized test.

Enumerating all 2x2 tables arithmetically consistent with the data#

The table \(\mathbf{N}\) summarizing potential outcomes is constrained by the data. For instance, \(N_{10} + N_{11} \ge X_0^*\) and \(N_{01} + N_{11} \ge X_1^*\). There are other constraints, too, as we shall see.

Define

()#\[\begin{equation} n_{wk} := \sum_{j=1}^N 1_{W_j=w, x_{jw}=k} \end{equation}\]

for \(w \in \{0, 1\}\) and \(k \in \{0, 1\}\). That is, \(n_{00} = m-X_0^*\), \(n_{01} = X_0^*\), \(n_{10} = n-X_1^*\), and \(n_{11} = X_1^*\). Clearly, \(\sum_{w,k} n_{wk} = N\): these numbers count the elements in a partition of the subjects. Using this notation, \(\hat{\tau} = n_{11}/n - n_{01}/m\). Let \(\mathbf{n} := (n_{wk})_{w, k \in \{0, 1\}}\) be these four numbers as a matrix, the observed outcome matrix, the number of subjects assigned to treatment \(w\) whose response was \(k\), for \(w, k \in \{0, 1\}\).

Rigdon and Hudgens (2015) show that it suffices to examine \(n_{RH} := (n_{11} + 1)(n_{10} + 1)(n_{01} + 1)(n_{00} + 1)\) 2x2 tables to find confidence bounds for \(\bar{\tau}\) using \(\hat{\tau}\) as the test statistic.

Their argument is as follows: Consider the \(n_{11}\) subjects who were assigned to treatment \(w=1\) and whose response was \(x_{j1}=1\). Fix the unobserved values of the remaining \(N-n_{11}\) subjects. As the unobserved responses of those \(n_{11}\) subjects vary, the value of \(\bar{\tau}\) and the probability distribution of \(T\) depend only on how many of them have \(x_{j0}=0\) and how many have \(x_{j0}=1\). The number with \(x_{j0}=0\) could be 0, 1, \(\ldots\), or \(n_{11}\); the rest have \(x_{j0}=1\). There are thus only \(n_{11}+1\) tables that need to be examined, given the unobserved values in the other three groups. By the same argument, as the unobserved values of the \(n_{01}\) subjects vary, holding constant the unobserved values for the rest, there are at most \(n_{01}+1\) distinct values of \(\bar{\tau}\) and distinct probability distributions for \(T\). The same goes for the \(n_{10}\) and \(n_{00}\) subjects. By the fundamental rule of counting, the total number of tables that give rise to distinguishable probability distributions for \(T\) is thus at most \(n_{RH}\).

Li and Ding (2016) improve on that result. They prove that a table \(\mathbf{N} = (N_{00}, N_{01}, N_{10}, N_{11})\) is consistent with the observed values \((n_{wk})\) iff

()#\[\begin{equation} \max \{0,n_{11}-N_{01}, N_{11}-n_{01}, N_{10}+N_{11} - n_{10} - n_{01} \} \le \min \{N_{11}, n_{11}, N_{10}+N_{11}-n_{01}, N - N_{01} - n_{01} - n_{10} \}. \end{equation}\]

The algorithm#

The overall approach of Li and Ding (2016) is as follows:

The values \(N\), \((n_{wk})\), and \(\alpha\) are given. Set

()#\[\begin{equation} \hat{\tau}^* := \frac{n_{11}}{n} - \frac{n_{01}}{m}, \end{equation}\]

the observed value of the unbiased estimate of \(\bar{\tau}\).

  • for each table \(\mathbf{N}\) that is algebraically consistent with the observed values \((n_{wk})\):

    • find \(\bar{\tau}(\mathbf{N}) = \frac{N_{01}-N_{10}}{N}\).

    • calculate \(t = |\hat{\tau}^* - \bar{\tau}(\mathbf{N})|\)

    • create a “full” list of potential outcomes \((x_{ik})\) consistent with the summary table \(\mathbf{N}\).

    • find or simulate the sampling distribution of \(|\hat{\tau} - \bar{\tau}|\) using \((x_{ik})\)

    • if \(t\) is above the \(1-\alpha\) percentile of the sampling distribution, pass; otherwise include \(\bar{\tau}(\mathbf{N})\) in the confidence set

  • report the smallest and largest values of \(\bar{\tau}\) in the confidence set as the endpoints of the confidence interval

Below, the main step–generating tables that are algebraically consistent with the data–is implemented in python (N_generator()), as is the step of creating a table of potential outcomes consistent with \(\mathbf{N}\) (potential_outcomes()).

import math
import numpy as np
import scipy as sp
from itertools import filterfalse
def N_generator(N: int, n00: int, n01: int, n10: int, n11: int) -> tuple:
    ''' 
    generate tables algebraically consistent with data from an experiment with binary outcomes
    
    Parameters
    ----------
    N : int
        number of subjects
    n00 : int
        number of subjects assigned to treatment 0 who had outcome 0
    n01 : int
        number of subjects assigned to treatment 0 who had outcome 1
    n10 : int
        number of subjects assigned to treatment 1 who had outcome 0
    n11 : int
        number of subjects assigned to treatment 1 who had outcome 1
    
    Yields
    ------
    Nt : list of 4 ints 
        N00, subjects with potential outcome 0 under treatments 0 and 1
        N01, subjects with potential outcome 0 under treatment 0 and 1 under treatment 1
        N10, subjects with potential outcome 1 under treatment 0 and 0 under treatment 1
        N11, subjects with potential outcome 1 under treatments 0 and 1
    '''
    for i in range(min(N-n00, N-n10)+1):               # allocate space for the observed 0 outcomes, n00 and n10
        N11 = i                                           
        for j in range(max(0, n01-N11), N-n00-N11+1):    # N11+N10 >= n01; N11+N10+n00 <= N
            N10 = j                                        
            for k in range(max(0, n11-N11), min(N-n10-N11, N-N11-N10)+1): 
                                                       # N11+N01 >= n11; N11+N01+n10 <= N; no more than N subjects
                N01 = k                                  
                N00 = N-N11-N10-N01                  
                if filter_table([N00, N01, N10, N11], n00, n01, n10, n11):
                    yield [N00, N01, N10, N11]
                else:
                    pass
def filter_table(Nt: list, n00: int, n01: int, n10: int, n11: int) -> bool:
    r'''
    check whether summary table Nt of binary outcomes is consistent with observed counts
    
    implements the test in Theorem 1 of Li and Ding (2016):
    
       \max \{0,n_{11}-N_{01}, N_{11}-n_{01}, N_{10}+N_{11}-n_{10}-n_{01} \}
           \le
       \min \{N_{11}, n_{11}, N_{10}+N_{11}-n_{01}, N-N_{01}-n_{01}-n_{10} \}
    
    Parameters:
    ----------
    Nt : list of four ints
        the table of counts of subjects with each combination of potential outcomes, in the order
        N_00, N_01, N_10, N_11
       
    n00 : int
        number of subjects assigned to control whose observed response was 0

    n01 : int
        number of subjects assigned to control whose observed response was 1

    n10 : int
        number of subjects assigned to treatment whose observed response was 0

    n11 : int
        number of subjects assigned to treatment whose observed response was 1
        
    Returns:
    --------
    ok : bool
        True if table is consistent with the data
    '''
    N = np.sum(Nt)   # total subjects
    return max(0,n11-Nt[1], Nt[3]-n01, Nt[2]+Nt[3]-n10-n01) <= min(Nt[3], n11, Nt[2]+Nt[3]-n01, N-Nt[1]-n01-n10)
def potential_outcomes(Nt: list) -> np.array:
    '''
    make an Nx2 table of potential outcomes from the 2x2 summary table Nt
    
    Parameters
    ----------   
    Nt : list of 4 ints
        N00, N01, N10, N11
    
    Returns
    -------
    po : Nx2 table of potential outcomes consistent with Nt
    '''
    return np.reshape(np.array([0,0]*Nt[0]+[0,1]*Nt[1]+[1,0]*Nt[2]+[1,1]*Nt[3]), [-1,2])
# test
Nt = [5, 4, 3, 2]
potential_outcomes(Nt)
array([[0, 0],
       [0, 0],
       [0, 0],
       [0, 0],
       [0, 0],
       [0, 1],
       [0, 1],
       [0, 1],
       [0, 1],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 1],
       [1, 1]])
# Hypothetical experiment
N = 10
n = 5
n00 = 3
n01 = 2
n10 = 1
n11 = 4
[Nt for Nt in N_generator(N, n00, n01, n10, n11)]
[[4, 4, 2, 0],
 [3, 5, 2, 0],
 [2, 6, 2, 0],
 [1, 7, 2, 0],
 [3, 4, 3, 0],
 [2, 5, 3, 0],
 [1, 6, 3, 0],
 [0, 7, 3, 0],
 [4, 4, 1, 1],
 [3, 5, 1, 1],
 [2, 6, 1, 1],
 [1, 7, 1, 1],
 [4, 3, 2, 1],
 [3, 4, 2, 1],
 [2, 5, 2, 1],
 [1, 6, 2, 1],
 [0, 7, 2, 1],
 [3, 3, 3, 1],
 [2, 4, 3, 1],
 [1, 5, 3, 1],
 [0, 6, 3, 1],
 [4, 4, 0, 2],
 [3, 5, 0, 2],
 [2, 6, 0, 2],
 [1, 7, 0, 2],
 [4, 3, 1, 2],
 [3, 4, 1, 2],
 [2, 5, 1, 2],
 [1, 6, 1, 2],
 [0, 7, 1, 2],
 [4, 2, 2, 2],
 [3, 3, 2, 2],
 [2, 4, 2, 2],
 [1, 5, 2, 2],
 [0, 6, 2, 2],
 [3, 2, 3, 2],
 [2, 3, 3, 2],
 [1, 4, 3, 2],
 [0, 5, 3, 2],
 [4, 3, 0, 3],
 [3, 4, 0, 3],
 [2, 5, 0, 3],
 [1, 6, 0, 3],
 [4, 2, 1, 3],
 [3, 3, 1, 3],
 [2, 4, 1, 3],
 [1, 5, 1, 3],
 [0, 6, 1, 3],
 [4, 1, 2, 3],
 [3, 2, 2, 3],
 [2, 3, 2, 3],
 [1, 4, 2, 3],
 [0, 5, 2, 3],
 [3, 1, 3, 3],
 [2, 2, 3, 3],
 [1, 3, 3, 3],
 [0, 4, 3, 3],
 [4, 2, 0, 4],
 [3, 3, 0, 4],
 [2, 4, 0, 4],
 [1, 5, 0, 4],
 [4, 1, 1, 4],
 [3, 2, 1, 4],
 [2, 3, 1, 4],
 [1, 4, 1, 4],
 [0, 5, 1, 4],
 [4, 0, 2, 4],
 [3, 1, 2, 4],
 [2, 2, 2, 4],
 [1, 3, 2, 4],
 [0, 4, 2, 4],
 [3, 0, 3, 4],
 [2, 1, 3, 4],
 [1, 2, 3, 4],
 [0, 3, 3, 4],
 [4, 1, 0, 5],
 [3, 2, 0, 5],
 [2, 3, 0, 5],
 [1, 4, 0, 5],
 [4, 0, 1, 5],
 [3, 1, 1, 5],
 [2, 2, 1, 5],
 [1, 3, 1, 5],
 [0, 4, 1, 5],
 [3, 0, 2, 5],
 [2, 1, 2, 5],
 [1, 2, 2, 5],
 [0, 3, 2, 5],
 [4, 0, 0, 6],
 [3, 1, 0, 6],
 [2, 2, 0, 6],
 [1, 3, 0, 6],
 [3, 0, 1, 6],
 [2, 1, 1, 6],
 [1, 2, 1, 6],
 [0, 3, 1, 6]]
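Here is an illustrative brute-force implementation of the remaining steps (not Li and Ding’s optimized algorithm): for each table from N_generator(), it builds a full set of potential outcomes with potential_outcomes(), approximates the randomization distribution of \(|\hat{\tau} - \bar{\tau}|\) by simulation, and keeps the values of \(\bar{\tau}\) that are not rejected. The function name tau_ci is illustrative; for realistic sample sizes one would need the optimizations in Li and Ding (2016) or Aronow et al.

import numpy as np

def tau_ci(N, n, n00, n01, n10, n11, alpha=0.05, reps=10_000, seed=12345):
    '''
    simulated 1-alpha confidence interval for the average treatment effect tau_bar,
    found by inverting randomization tests of every summary table algebraically
    consistent with the observed counts, using |tau_hat - tau_bar| as the test statistic
    '''
    rng = np.random.default_rng(seed)
    m = N - n
    tau_hat = n11/n - n01/m                      # observed unbiased estimate of tau_bar
    taus = []                                    # values of tau_bar that are not rejected
    for Nt in N_generator(N, n00, n01, n10, n11):
        tau_bar = (Nt[1] - Nt[2])/N              # (N01 - N10)/N for this table
        po = potential_outcomes(Nt)              # a full set of potential outcomes consistent with Nt
        dist = np.empty(reps)
        for r in range(reps):
            treat = rng.choice(N, n, replace=False)   # simulate a random assignment
            w = np.zeros(N, dtype=bool)
            w[treat] = True
            dist[r] = abs(po[w, 1].mean() - po[~w, 0].mean() - tau_bar)
        if abs(tau_hat - tau_bar) <= np.quantile(dist, 1 - alpha):
            taus.append(tau_bar)                 # the table is not rejected
    return min(taus), max(taus)

# the hypothetical experiment above
tau_ci(10, 5, 3, 2, 1, 4, reps=1_000)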
# Regeneron data from 
# https://investor.regeneron.com/news-releases/news-release-details/phase-3-prevention-trial-showed-81-reduced-risk-symptomatic-sars
n=753
m=752
N=n+m
n01 = 59     # number assigned to placebo (treatment 0) whose observed response was 1 (infected)
n11 = 11     # number assigned to active treatment (treatment 1) whose observed response was 1 (infected)
n00 = m-n01
n10 = n-n11

Other measures of the effect of treatment#

The average treatment effect is one of many ways of quantifying the effect of treatment. For vaccines, a common measure of effect size is vaccine efficacy, defined as

()#\[\begin{equation} \mbox{VE} := \frac{\mbox{risk among unvaccinated} - \mbox{risk among vaccinated}}{\mbox{risk among unvaccinated}} = 1 - \frac{\mbox{risk among vaccinated}}{\mbox{risk among unvaccinated}}. \end{equation}\]

This is the naive estimate of the fraction of infections that vaccinating everyone would prevent.

Here, risk is the fraction of subjects who become infected. In the notation we’ve been using,

()#\[\begin{equation} \mbox{VE}(\mathbf{N}) = \frac{N_{1+}/N - N_{+1}/N}{N_{1+}/N} = \frac{N_{1+} - N_{+1}}{N_{1+}} = 1 - \frac{ N_{+1}}{N_{1+}} \end{equation}\]

if \(N_{1+} > 0\).

We can make confidence intervals for vaccine efficacy in the same way as for the average treatment effect:

  • Choose a reasonable test statistic \(T\). Here, we might use the plug-in estimator of VE, which is biased, but still a sensible choice:

()#\[\begin{equation} \widehat{\mbox{VE}} := \left \{ \begin{array}{ll} 1 - \bar{X}_1/\bar{X}_0, & \bar{X}_0 > 0 \\ 0, & \bar{X}_0 = 0. \end{array} \right . \end{equation}\]
  • For each summary table of potential outcomes, \(\mathbf{N}\), test the hypothesis that the observations came from that table using the randomization distribution of \(T\): reject if the observed value of \(T\) exceeds the \(1-\alpha\) percentile of that distribution.

  • If the hypothesis is not rejected, include \(\mbox{VE}(\mathbf{N})\) in the confidence set.

A similar approach works with any measure of effect size in a randomized experiment with binary outcomes.
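As a quick check of the plug-in estimator applied to the Regeneron counts entered above (this is just the point estimate, not a confidence interval; VE_hat is an illustrative name):

# plug-in estimate of vaccine efficacy from the observed counts:
# risk among vaccinated is n11/n; risk among unvaccinated is n01/m
VE_hat = 1 - (n11/n)/(n01/m)
VE_hat   # roughly 0.81, consistent with the "81% reduced risk" in the press release title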

What’s special about binary outcomes?#

The approach above uses the assumption that the potential outcomes can be only 0 or 1. If the potential outcomes take values in a known finite set, a similar approach works, but if there are more than 2 possible values, constructing all tables algebraically consistent with the data is in general harder, and the number of such tables will be larger.

If the potential outcomes are bounded with known bounds (but not necessarily discrete with known support), the approach can be generalized using nonparametric methods for estimating the mean of bounded, finite populations (see other chapters in the notes for such methods, including Confidence bounds via the Chebychev and Hoeffding Inequalities, Lower confidence bounds for the mean via Markov’s Inequality and methods based on the Empirical Distribution, Penny sampling, Wald’s Sequential Probability Ratio Test (SPRT), Kaplan-Wald Confidence Bound for a Nonnegative Mean, and Method shootout).