Fisher’s exact test#

A special case of the permutation test based on the sample sum occurs when the only possible responses are 0 and 1. The distribution of the test statistic under the strong null hypothesis is then hypergeometric, which leads to Fisher’s exact test. The following material is adapted from SticiGui.

Suppose we own a start-up company that offers e-tailers a service for targeting their web advertising. Consumers register with our service by filling out a form indicating their likes and dislikes, gender, age, etc. We put “cookies” on their computers to keep track of who they are. When they get to the website of any of our clients, we use their likes and dislikes to select (from a collection of the client’s ads) the ad we think they are most likely to respond to. The service is free to consumers; we charge the e-tailers.

We can raise venture capital if we can show that targeting makes e-tailers’ advertisements more effective. To measure the effectiveness, we offer our service for free to a large e-tailer. The e-tailer has a collection of web ads that it usually uses in rotation: each time a consumer arrives at the site, the e-tailer’s server selects the next ad in the sequence to show to the consumer, then starts over when it runs out of ads.

We install our software on the e-tailer’s server, in the following way: each time a consumer arrives at the site, with probability 50% the server shows the consumer the ad our targeting software suggests, and with probability 50% the server shows the consumer the next ad in the rotation, the way the e-tailer used to choose which ad to show. For each consumer, the software records which strategy was used (target or rotation), and whether the consumer buys anything. We call the consumers who were shown the targeted ad the _treatment group_; we call the other consumers the _control group_. If a consumer visits the site more than once during the trial period, we ignore all of that consumer’s visits but the first.

Suppose that \(N\) consumers visit the site during the trial, that \(n\) of them are assigned to the treatment group, that \(m\) of them are assigned to the control group, and that \(N_S\) of the consumers buy something. In essence, we want to know whether there would have been more purchases if everyone had been shown the targeted ad than if everyone had been shown the control ad. Only some of the consumers saw the targeted ad, and only some saw the control ad, so answering this question involves extrapolating from the data to an hypothetical counterfactual situation. Of course, we really want to extrapolate further, to people who have not yet visited the site, to decide whether more product would be sold if those people are shown the targeted ad.

We can think of the experiment in the following way: the \(i\)th consumer has a ticket with two numbers on it: the first number (\(x_i\)) is 1 if the consumer would have bought something if shown the control ad, and 0 if not. The second number (\(y_i\)) is 1 if the consumer would have bought something if shown the targeted ad, and 0 if not. There are \(N\) tickets in all.

For each consumer \(i\) who visits the site, we observe either \(x_i\) or \(y_i\), but not both. The fraction of consumers who would have made purchases if every consumer had been shown the control ads is

()#\[\begin{equation} p_c := \frac{1}{N}\sum_{i=1}^N x_i. \end{equation}\]

Similarly, the fraction of consumers who would have made purchases if every consumer had been shown the targeted ads is

()#\[\begin{equation} p_t := \frac{1}{N}\sum_{i=1}^N y_i. \end{equation}\]

Let \(\mu := p_t - p_c\) be the difference between the rate at which consumers would have bought had all of them been shown the targeted ad, and the rate at which consumers would have bought had all of them been in the control group. The null hypothesis, that targeting does not make a difference, is that \(\mu = 0\). (The strong null hypothesis is that \(x_i = y_i\), for \(i =1, 2, \ldots , N\).) The alternative hypothesis, that targeting helps, is that \(\mu > 0\). We would like to test the null hypothesis at significance level 5%.
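To make the notation concrete, here is a minimal simulation of the ticket model under the strong null hypothesis. This is only a sketch: the number of consumers, the assumed baseline purchase rate, and the random seed are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(12345)  # arbitrary seed, for reproducibility

# Hypothetical potential outcomes for N consumers: x[i] is 1 if consumer i
# would buy after seeing the control ad, y[i] is 1 if consumer i would buy
# after seeing the targeted ad.  Under the strong null hypothesis, x == y.
N = 1000
x = rng.binomial(1, 0.05, size=N)   # assumed 5% baseline purchase rate
y = x.copy()                        # strong null: targeting changes nothing

p_c = x.mean()   # fraction who would buy if everyone saw the control ad
p_t = y.mean()   # fraction who would buy if everyone saw the targeted ad
mu = p_t - p_c   # the null hypothesis asserts mu == 0

# Each consumer is independently assigned to treatment with probability 1/2.
treat = rng.random(N) < 0.5
n, m = treat.sum(), N - treat.sum()   # treatment and control group sizes
Y = y[treat].sum()                    # sales observed in the treatment group
N_S = Y + x[~treat].sum()             # total observed sales
```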

Let \(m\) be the number of consumers in the control group, and let \(n\) be the number of consumers in the treatment group, so \(N = n+m\).

Let \(N_S\) be the total number of sales to the treatment and control groups. Let \(Y\) be the number of sales to consumers in the treatment group. Under the strong null hypothesis (which implies that \(\mu = 0\)), for any fixed value of \(N_S\), \(Y\) has an hypergeometric distribution with parameters \(n\), \(N_S\), and \(N\) (we consider \(n\) to be fixed, i.e., we condition on the attained value of \(n\)):

()#\[\begin{equation} \mathbb{P} \{Y=y \} = \frac{{{N_S}\choose{y}} {{N-N_S}\choose{n-y}}}{{N}\choose{n}}. \end{equation}\]
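As a numerical check, the formula can be compared with `scipy.stats.hypergeom`. The counts below are hypothetical, and scipy’s parameters `M`, `n`, `N` do not correspond to the notation used here (see the comments):

```python
from math import comb
from scipy.stats import hypergeom

# Check the formula above against scipy's hypergeometric pmf.
# Hypothetical counts: N consumers, n in treatment, N_S total sales.
N, n, N_S, y = 1000, 498, 61, 30
p_formula = comb(N_S, y) * comb(N - N_S, n - y) / comb(N, n)
p_scipy = hypergeom.pmf(y, M=N, n=N_S, N=n)   # scipy: M = population size,
                                              # n = successes, N = draws
assert abs(p_formula - p_scipy) < 1e-12
```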

This is a situation with many tied data, because there are only two possible responses for each subject: 0 if the subject does not buy anything, and 1 if the subject buys something. If the alternative hypothesis is true, \(Y\) will tend to be larger than it would if the null hypothesis is true, so we should design our test to reject the null hypothesis for large values of \(Y\). (Technically, it is stochastically larger: for every \(y\), the chance that \(Y > y\) is at least as large if the alternative hypothesis is true as if the null hypothesis is true.) That is, our rejection region should contain all values above some threshold value \(y\); we will reject the null hypothesis if \(Y > y\).

We cannot calculate the critical value \(y\) until we know \(N\), \(n\), and \(N_S\). Once we observe them, we can find the smallest value \(y\) so that the probability that \(Y\) is larger than \(y\) if the null hypothesis is true is at most 5%, the significance level we chose for the test. Our rule for testing the null hypothesis then would be to reject the null hypothesis if \(Y > y\), and not to reject the null hypothesis otherwise. This is called Fisher’s exact test for the equality of two percentages (against the one-sided alternative that treatment increases the response). It is a permutation test, and it is also essentially a (mid-) rank test, because there are only two possible values for each response.
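A sketch of the test, with hypothetical counts: find the smallest threshold \(y\) whose tail probability under the null is at most 5%, and compute the exact \(P\)-value for an observed value of \(Y\).

```python
from scipy.stats import hypergeom

N, n, N_S, Y_obs = 1000, 498, 61, 40     # hypothetical counts
alpha = 0.05

null_dist = hypergeom(M=N, n=N_S, N=n)   # scipy: M = pop. size, n = successes, N = draws

# sf(y) = P{Y > y}; the critical value is the smallest y with tail prob <= alpha
y_crit = next(y for y in range(min(n, N_S) + 1) if null_dist.sf(y) <= alpha)

p_value = null_dist.sf(Y_obs - 1)        # P{Y >= Y_obs}
reject = Y_obs > y_crit
print(f"critical value {y_crit}, P-value {p_value:.4f}, reject: {reject}")
```

Equivalently, `scipy.stats.fisher_exact`, applied with `alternative='greater'` to the 2×2 table of treatment/control versus bought/did not buy, should give the same one-sided \(P\)-value.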

The Normal Approximation to Fisher’s Exact Test#

If \(N\) is large and \(n\) is neither close to zero nor close to \(N\), computing the hypergeometric probabilities may be difficult, but the normal approximation to the hypergeometric distribution should be accurate provided \(N_S\) is neither too close to zero nor too close to \(N\). To use the normal approximation, we need to convert to standard units, which requires that we know the expected value and standard error of \(Y\). The expected value of \(Y\) is \(\mathbb{E} Y = n \cdot N_S/N\), and the standard error of \(Y\) is

()#\[\begin{equation} SE(Y) = f n^{1/2} SD, \end{equation}\]

where \(f\) is the finite population correction

()#\[\begin{equation} f := \sqrt{\frac{N-n}{N-1}} \end{equation}\]

and \(SD\) is the standard deviation of a list of \(N\) values of which \(N_S\) equal one and \((N - N_S)\) equal zero:

()#\[\begin{equation} SD = \sqrt{N_S/N \times (1 - N_S/N)} \end{equation}\]

In standard units, \(Y\) is

()#\[\begin{equation} Z = (Y - \mathbb{E}Y)/SE(Y) = (Y - n N_S/N)/(f n^{1/2} SD). \end{equation}\]

The area under the normal curve to the right of 1.645 standard units is 5%, which corresponds to the threshold value

()#\[\begin{equation} y = \mathbb{E}Y + 1.645 SE(Y) = n(N_S/N) + 1.645 f n^{1/2} SD \end{equation}\]

in the original units, so if we reject the null hypothesis when \(Z > 1.645\) or, equivalently,

()#\[\begin{equation} Y > n (N_S/N) + 1.645 f n^{1/2} SD, \end{equation}\]

we have an (approximate) 5% significance level test of the null hypothesis that ad targeting and ad rotation are equally effective. This is the normal approximation to Fisher’s exact test; \(Z\) is called the \(Z\) statistic, and the observed value of \(Z\) is called the \(z\) score.
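As a check on the algebra, the short sketch below (again with hypothetical counts) computes the \(z\) score and the approximate threshold, and compares the normal tail area to the exact hypergeometric one:

```python
import numpy as np
from scipy.stats import hypergeom, norm

N, n, N_S, Y_obs = 1000, 498, 61, 40     # hypothetical counts

EY = n * N_S / N                         # expected value of Y
f = np.sqrt((N - n) / (N - 1))           # finite population correction
SD = np.sqrt((N_S / N) * (1 - N_S / N))  # SD of the N zeros and ones
SE = f * np.sqrt(n) * SD                 # standard error of Y

z = (Y_obs - EY) / SE                    # observed z score
y_threshold = EY + norm.ppf(0.95) * SE   # approximate 5%-level critical value

p_normal = norm.sf(z)                               # normal tail area
p_exact = hypergeom(M=N, n=N_S, N=n).sf(Y_obs - 1)  # exact P{Y >= Y_obs}
print(z, y_threshold, p_normal, p_exact)
```

A continuity correction (replacing \(Y\) by \(Y - 1/2\) in the numerator of \(Z\)) generally improves the approximation.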

Fisher’s Lady Tasting Tea Experiment#

In his 1935 book, The Design of Experiments (London, Oliver and Boyd, 260pp.), Sir R.A. Fisher writes:

> A LADY declares that by tasting a cup of tea made with milk she can discriminate whether the milk or the tea infusion was first added to the cup. We will consider the problem of designing an experiment by means of which this assertion can be tested. …
>
> Our experiment consists in mixing eight cups of tea, four in one way and four in the other, and presenting them to the subject for judgment in a random order. The subject has been told in advance of what the test will consist, namely that she will be asked to taste eight cups, that these shall be four of each kind, and that they shall be presented to her in a random order, that is in an order not determined arbitrarily by human choice, but by the actual manipulation of the physical apparatus used in games of chance, cards, dice, roulettes, etc., … Her task is to divide the 8 cups into two sets of 4, agreeing, if possible, with the treatments received.

There are \({{8}\choose{4}} = 70\) ways to distribute the four “milk-first” cups among the 8. Under the null hypothesis that the lady cannot taste any difference, her labeling of the 8 cups—4 milk-first and 4 tea-infusion-first—can be thought of as fixed in advance. The probability that her labeling exactly matches the truth is thus 1/70. A test that rejects the null hypothesis only when she matches all 8 cups has significance level \(1/70 = 1.4\)%. If she misses one cup, she must in fact miss at least two, because she will have mislabeled a milk-first as tea-first, and vice versa. The possible numbers of “hits” are 0, 2, 4, 6, and 8. To get 6 hits, she must label as milk-first three of the four true milk-first cups, and must mislabel as milk-first one of the four tea-first cups. The number of arrangements that give exactly 6 hits is thus

()#\[\begin{equation} {{4}\choose{3}} {{4}\choose{1}} = 16 \end{equation}\]

Thus if we reject the null hypothesis when she correctly identifies 6 of the 8 cups or all 8 of the 8 cups, the significance level of the test is \((16+1)/70 = 24.3\)%. Such good performance is pretty likely to occur by chance—about as likely as getting two heads in two tosses of a fair coin—even if the lady and the tea are in different parlors.
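These numbers can be reproduced with the hypergeometric distribution of the number of true milk-first cups she labels milk-first (the number of “hits” is twice that count):

```python
from scipy.stats import hypergeom

# Under the null that she cannot discriminate: 8 cups, 4 milk-first,
# 4 labeled milk-first.  k correct milk-first labels means 2k "hits".
null_dist = hypergeom(M=8, n=4, N=4)

p_eight_hits = null_dist.pmf(4)                      # 1/70  ~ 1.4%
p_six_or_more = null_dist.pmf(3) + null_dist.pmf(4)  # 17/70 ~ 24.3%
print(p_eight_hits, p_six_or_more)
```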

There are other experiments we might construct to test this hypothesis. Lindley (1984, A Bayesian Lady Tasting Tea, in Statistics, an Appraisal, H.A. David and H.T. David, eds., Iowa State Univ. Press) lists two, which he attributes to Jerzy Neyman:

  1. Present the lady with \(N\) pairs of cups of tea with milk, where one cup in each pair (determined randomly) has milk added first and one has tea added first. Tell the lady that each pair has one of each kind of cup; ask her to identify which member of each pair had the milk added first. Count the number of pairs she categorizes correctly. (In the psychometric literature on perception, this design is called _two-alternative forced choice_: each trial has two alternative responses, and the subject is forced to pick one.)

  2. Present the lady with \(N\) cups of tea with milk, each of which has probability 1/2 of having the milk added first and probability 1/2 of having the tea added first. Do not tell the lady how many of each sort of cup there are. Ask her to identify for each cup whether milk or tea was added first. Count the number of cups she categorizes correctly.

These experiments lead to different tests. A permutation test for the first of them is straightforward; to base a permutation test on the second requires conditioning on the number of cups she categorizes each way and on the number of cups that had the milk added first.
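A sketch of how the two designs differ, with hypothetical counts throughout: for the first design, the within-pair randomization makes each of the \(N\) pairs a fair coin toss under the null, so the number of correctly classified pairs is binomial; for the second, the conditioning described above leads back to a hypergeometric null distribution.

```python
from scipy.stats import binom, hypergeom

# Design 1: N pairs, one cup of each kind per pair.  Under the null, each
# pair is classified correctly with probability 1/2, independently.
n_pairs, n_correct = 10, 9                   # hypothetical results
p1 = binom.sf(n_correct - 1, n_pairs, 0.5)   # P{at least n_correct right}

# Design 2: condition on K, the number of cups actually poured milk-first,
# and on the number of cups she labels milk-first.  Given both, the number
# of milk-first cups she labels correctly is hypergeometric.
n_cups, K, labeled_milk_first, correct_milk_first = 20, 11, 9, 8
p2 = hypergeom(M=n_cups, n=K, N=labeled_milk_first).sf(correct_milk_first - 1)

print(p1, p2)
```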