Tests of randomness and independence
References:
Lehmann, E.L., 1998. Nonparametrics: Statistical Methods Based on Ranks, Prentice Hall, Upper Saddle River, NJ.
Walther, G., 1997. Absence of correlation between the solar neutrino flux and the sunspot number, Phys. Rev. Lett., 79, 4522–4524.
Walther, G., 1999. On the solar-cycle modulation of the Homestake solar neutrino capture rate and the shuffle test, Astrophys. J., 513, 990–996.
The null hypothesis; models
Suppose a treatment has \(N\) ordered levels, and one response is observed per treatment level, with \(N\) subjects assigned at random, one to each treatment level. Let \(T_j\) denote the rank among the responses of the response of the subject assigned to the treatment level with rank \(j\). Then we have a set of pairs of ranks of treatments and responses:
\[
\{(j, T_j)\}_{j=1}^N.
\]
There are \(N!\) possible sets of pairs, depending on which treatment rank goes with which response rank. If treatment has no effect, all \(N!\) pairings are equally likely; each has probability \(1/N!\). This is the Hypothesis of Randomness.
Now consider drawing a random sample of \(N\) subjects from a population and assigning one to each of the \(N\) treatment levels. Let \(Z_j\) denote the response of the subject assigned treatment \(j\), and define \(\mathbb{P}\{Z_j \le z \} := F_j(z)\), \(j = 1, 2, \ldots, N\). If treatment has no effect, \(F_1 = F_2 = \cdots = F_N\). If the population is much larger than \(N\), we can ignore the dependence among the subjects. That leads to the population model, in which \(\{Z_j \}_{j=1}^N\) are independent. Under the null hypothesis, they are thus IID, and so the null distribution of the pairs \(\{(j, T_j)\}\) is the same as it is under the hypothesis of randomness.
Consider a third statistical model: we make two measurements \((X_j, Y_j)\) for each of \(N\) subjects. Under the null hypothesis, the \(\{X_j\}\) are IID, \(\{Y_j\}\) are IID, and \(X\) and \(Y\) are independent. Let \(T_j\) be the rank of the \(Y\) measurement for the subject whose \(X\) measurement has rank \(j\) among the \(X\)s. Then this experiment also leads to the same joint distribution for the pairs \(\{(j, T_j)\}\) under the null.
The key to all three models is that one set of measurements is exchangeable given the other: all pairings are equally likely.
Tests for Trend
We are going to start with tests for a trend, the alternative hypothesis that increasing levels of treatment \(j\) are associated with increasing (or decreasing) levels of response \(T_j\).
The run test
For many alternatives, high values tend to cluster and low values tend to cluster, leading to “runs.” We can use this to test the hypothesis of randomness by looking at the lengths and numbers of runs. For example, suppose we toss a (possibly biased) coin \(N\) times. We can think of this as \(N\) trials. We consider the outcome of the \(j\)th trial to be 1 if the coin lands heads on the \(j\)th toss and 0 otherwise. Under the null hypothesis, the \(N\) outcomes are IID Bernoulli(\(p\)) random variables, where \(p\) is the chance that the coin lands heads.
Let \(R\) denote the number of runs. For example, the sequence HHTHTTTHTH has 7 runs: HH, T, H, TTT, H, T and H. Condition on the number \(n\) of heads in the \(N\) tosses. If the null hypothesis is true, each arrangement of the \(n\) heads among the \(N\) tosses has probability \({{N}\choose{n}}^{-1}\). We will compute the (conditional) probability distribution of \(R\) given \(n\), under the null hypothesis of independence.
If \(n=N\) or if \(n = 0\), then \(R = 1\). If \(R = 2\), there are only two possibilities: first all the heads, then all the tails, or first all the tails, then all the heads. I.e., the sequence is either (HH … HTT … T) or (TT … THH … H), where there are \(n\) heads and \(m := N-n\) tails in all. The probability that \(R=2\) is thus \(2{{N}\choose{n}}^{-1}\) if the null hypothesis is true.
More generally, how can \(R\) be even, i.e., \(R = 2k\) for some integer \(k \ge 1\)? If the sequence starts with a head, it has to end with a tail for \(R\) to be even. We need to choose where to break the sequence of heads to insert tails, then where to break that sequence of tails to insert heads, etc. If the sequence starts with a tail, it has to end with a head for \(R\) to be even. We need to choose where to break the sequence of tails to insert heads, then where to break that sequence of heads to insert tails, etc.
We need to break the \(n\) heads into \(k\) groups, which means picking \(k - 1\) breakpoints, but the first breakpoint needs to come after the first head, and the last breakpoint needs to come before the \(n\)th head, so there are only \(n - 1\) places those \(k - 1\) breakpoints can be. And we need to break the \(m\) tails into \(k\) groups, which means picking \(k - 1\) breakpoints, but the first needs to be after the first tail and the last needs to be before the \(m\)th tail, so there are only \(m - 1\) places those \(k - 1\) breakpoints can be. The number of sequences with \(R = 2k\) that start with heads is thus
\[
{{n-1}\choose{k-1}} {{m-1}\choose{k-1}}.
\]
The number of sequences with \(R = 2k\) that start with tails is the same (just read each sequence right-to-left instead of left-to-right).
Thus, if the null hypothesis is true,
\[
\mathbb{P}\{R = 2k \,|\, n\} = \frac{2 \, {{n-1}\choose{k-1}} {{m-1}\choose{k-1}}}{{{N}\choose{n}}}.
\]
Of course, there need to be enough heads and enough tails! So, there’s an implicit restriction that \(n \ge k\) and \(N-n \ge k\). We will just define \({{n} \choose {k}}\) to be 0 if \(k > n\).
How can \(R\) be odd, \(R = 2k+1\)? Either the sequence starts and ends with heads or it starts and ends with tails. Suppose it starts (and ends) with heads. Then we need to break the string of \(n\) heads in \(k\) places to form \(k + 1\) groups, separated by \(k\) groups of tails formed by breaking the \(m\) tails in \(k-1\) places. If the sequence starts with tails, we need to break the \(m\) tails in \(k\) places to form \(k + 1\) groups, separated by \(k\) groups of heads formed by breaking the \(n\) heads in \(k-1\) places. Thus, under the null hypothesis,
\[
\mathbb{P}\{R = 2k+1 \,|\, n\} = \frac{{{n-1}\choose{k}} {{m-1}\choose{k-1}} + {{n-1}\choose{k-1}} {{m-1}\choose{k}}}{{{N}\choose{n}}}.
\]
Again, there are implicit restrictions on the relative sizes of \(n\), \(k\), and \(N-n\).
Nothing in the derivation of the null distribution of \(R\) used the probability of heads. The conditional distribution of the test statistic under the null hypothesis depends only on the fact that the tosses are IID, not that the coin is fair.
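To illustrate that point, here is a small simulation of our own (the function name cond_run_dist and the choices \(N=10\), \(n=5\) are illustrative, not part of the original analysis). It estimates the conditional distribution of \(R\) given \(n\) for a fair coin and for a biased coin; the two agree up to simulation error.

import numpy as np

rng = np.random.default_rng(2024)

def cond_run_dist(p, N=10, n=5, keep=2000):
    '''simulate the distribution of R among sequences of N tosses that have exactly n heads'''
    counts = {}
    kept = 0
    while kept < keep:
        tosses = rng.random(N) < p           # heads with probability p
        if tosses.sum() != n:                # condition on exactly n heads
            continue
        R = 1 + int(np.sum(tosses[:-1] != tosses[1:]))
        counts[R] = counts.get(R, 0) + 1
        kept += 1
    return {r: c/kept for r, c in sorted(counts.items())}

print(cond_run_dist(0.5))
print(cond_run_dist(0.8))   # same conditional distribution as for p = 0.5, up to simulation error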
Let \(I_j\) be the indicator of the event that the outcome of the \((j+1)\)st toss differs from the outcome of the \(j\)th toss, \(j=1, \ldots, N-1\). Then \(R = 1 + \sum_{j=1}^{N-1} I_j\). Under the null, conditional on \(n\),
\[
\mathbb{P}\{I_j = 1 \,|\, n\} = \frac{2mn}{N(N-1)}.
\]
The indicators \(\{I_j\}\) are identically distributed under the null hypothesis, so if the null holds,
\[
\mathbb{E}(R \,|\, n) = 1 + (N-1)\frac{2mn}{N(N-1)} = 1 + \frac{2mn}{N}.
\]
Example
Air temperature is measured at noon in a climate-controlled room for 20 days in a row. We want to test the null hypothesis that temperatures on different days are independent and identically distributed.
Let \(T_j\) be the temperature on day \(j\), \(j = 1, \ldots, 20\). If the measurements were IID, whether each day’s temperature is above or below a given temperature \(t\) is like a toss of a possibly biased coin, with tosses on different days independent of each other. We could consider a temperature above \(t\) to be a head and a temperature below \(t\) to be a tail.
Let’s take \(t\) to be the median of the 20 measurements. In this example, \(n=10\), \(m=10\), \(N=20\). We will suppose that there are no ties among the measured temperatures. Under the null hypothesis, the expected number of runs is \(\mathbb{E}R = 1+2mn/N = 11\).
The minimum possible number of runs is 2 and the maximum is 20. Since we expect temperature on successive days to have positive serial correlation (why?), we might expect to see fewer runs than if temperatures on different days were independent. So, let’s do a one-sided test that rejects if there are too few runs. We will aim for a test with significance level \(\alpha = 0.05\).
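One way to find the rejection threshold is to tabulate the exact conditional null distribution of \(R\) from the formulas above. Here is a sketch (the function name exact_null_R is ours) for \(n = m = 10\):

from math import comb

def exact_null_R(n, m):
    '''exact conditional null pmf of the number of runs R, given n heads and m tails'''
    N = n + m
    total = comb(N, n)
    pmf = {}
    for k in range(1, min(n, m) + 1):
        # R = 2k: k groups of heads and k groups of tails
        pmf[2*k] = 2 * comb(n-1, k-1) * comb(m-1, k-1) / total
        # R = 2k+1: k+1 groups of one kind separated by k groups of the other
        pmf[2*k + 1] = (comb(n-1, k) * comb(m-1, k-1)
                        + comb(n-1, k-1) * comb(m-1, k)) / total
    return pmf

pmf = exact_null_R(10, 10)
print(sum(p for r, p in pmf.items() if r <= 6))   # about 0.019
print(sum(p for r, p in pmf.items() if r <= 7))   # about 0.051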
So, we should reject the null hypothesis if \(R \le 6\), which gives a significance level of 1.9%. Including 7 in the rejection region would make the significance level 5.1%, slightly too big.
When \(N\), \(n\), and \(m\) are large, the combinatorics can be difficult to evaluate numerically. There are at least two options: asymptotic approximation and simulation. The null distribution of \(R\) is asymptotically normal: as \(n\) and \(m \rightarrow \infty\) and \(m/n \rightarrow \gamma\),
\[
\frac{R - (1 + 2mn/N)}{\sqrt{2mn(2mn-N)/\left(N^2(N-1)\right)}} \rightarrow N(0,1)
\]
in distribution.
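As a rough check on the normal approximation (our own illustration, using the standard conditional mean and variance of \(R\) given \(n\)), here is the approximate lower-tail probability for the example above with \(n = m = 10\), including a continuity correction:

from scipy import stats

n = m = 10
N = n + m
mean_R = 1 + 2*m*n/N                                 # E(R | n)
var_R = 2*m*n*(2*m*n - N) / (N**2 * (N - 1))         # Var(R | n)
# normal approximation to P{R <= 6}, with a continuity correction
print(stats.norm.cdf((6.5 - mean_R) / var_R**0.5))   # about 0.019; the exact value is about 0.0185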
Below is a Python function to simulate the null distribution of the number \(R\) of runs and to evaluate the \(P\)-value of the observed value of \(R\) conditional on \(n\), for a one-sided test against the alternative that the process produces fewer runs than independent trials would tend to produce. The input is a vector of length \(N\); each element is either +1 (for heads) or -1 (for tails). The test statistic is \(1 + \sum_{j=1}^{N-1} I_j\), as in the calculation of \(\mathbb{E}R\) above.
import numpy as np
import scipy as sp
import math
prng = np.random.default_rng(seed=12345678)
def simRunTest(x: np.ndarray, reps: int = 10**5):
    '''
    simulate the P-value of the runs test, for the alternative of positive serial correlation

    Parameters
    ----------
    x : np.ndarray
        array of +1 and -1
    reps : int
        number of replications

    Returns
    -------
    tuple : (estimated P-value, observed number of runs)
        the P-value estimate is (hits+1)/(reps+1), where hits is the number of shuffles
        for which the number of runs was less than or equal to its observed value
    '''
    mis_match = lambda v: 1 + np.sum(v[:-1] != v[1:])   # number of runs in v
    ts = mis_match(x)                                    # observed number of runs
    hits = 0
    for _ in range(reps):
        prng.shuffle(x)                                  # shuffles x in place
        hits += (mis_match(x) <= ts)
    return (hits+1)/(reps+1), ts
As an example, suppose the observed sequence is \(x = (-1, -1, 1, 1, 1, -1, 1)\), for which \(N = 7\), \(n = 4\), \(m = 3\) and \(R = 4\).
x = np.array([-1, -1, 1, 1, 1, -1, 1])
simRunTest(x)
(0.5394846051539485, 4)
The simulated \(P\)-value using simRunTest was about 0.54. Exact calculation gives
\[
\mathbb{P}\{R \le 4 \,|\, n=4\} = \frac{2 + 5 + 12}{35} = \frac{19}{35} \approx 0.5429.
\]
The standard error of the estimated probability is thus roughly
\[
\sqrt{\hat{p}(1-\hat{p})/\mbox{reps}} \approx \sqrt{0.54 \times 0.46/10^5} \approx 0.0016.
\]
The simulation was off by about \((0.5395 - 0.5429)/0.0016 \approx -2.2\) standard errors.
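As a check (ours, not part of the original text), the exact conditional probability can also be found by brute force, enumerating all \({7 \choose 4} = 35\) placements of the four +1s among the seven positions:

from itertools import combinations
import numpy as np

x = np.array([-1, -1, 1, 1, 1, -1, 1])
N, n = len(x), int(np.sum(x == 1))
runs = lambda v: 1 + int(np.sum(v[:-1] != v[1:]))      # number of runs
R_obs = runs(x)
hits = total = 0
for pos in combinations(range(N), n):                  # all placements of the +1s
    v = -np.ones(N, dtype=int)
    v[list(pos)] = 1
    total += 1
    hits += (runs(v) <= R_obs)
print(hits, total, hits/total)                         # 19 35 0.542857...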
Independence of two time series
The crucial element of the null hypothesis for the Spearman rank correlation test is that one of the sets of measurements is (conditionally) exchangeable given the other. That implies that all \(N!\) pairings \((X_i, Y_j)\) are equally likely. However, time series data often have serial dependence or trends, which break exchangeability and thus violate that null hypothesis.
Suppose we have two independent time series, each of which has a trend. For example, we might observe the pairs \((X_j, Y_j)_{j=1}^N\), where
and the \(2N\) “noise” terms \(\{\epsilon_j\}\) and \(\{\nu_j\}\) are IID with zero mean and finite variance. Even though the time series \(\{X_j\}\) and \(\{Y_j\}\) are independent, neither \(\{X_j\}\) nor \(\{Y_j\}\) is exchangeable. As a result, the Spearman rank correlation test will tend to reject the hypothesis of independence, not because the two sets of measurements are dependent, but because a different feature of the null hypothesis is false. The trend makes pairings \((X_i, Y_j)\) with both terms larger than average or both smaller than average more likely than pairings where one is relatively large and the other is relatively small.
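To see the effect, here is a small simulation of our own. The linear trends, the slope, and the noise level are arbitrary illustrative choices, not Walther's setup:

import numpy as np
from scipy import stats

rng = np.random.default_rng(20240101)
reps, N = 2000, 100
t = np.arange(N)
rejections = 0
for _ in range(reps):
    x = 0.05*t + rng.standard_normal(N)    # independent series, each with a linear trend
    y = 0.05*t + rng.standard_normal(N)
    rejections += (stats.spearmanr(x, y)[1] <= 0.05)   # spearmanr's two-sided P-value
print(rejections/reps)   # far larger than 0.05, even though x and y are independent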
Walther’s Examples
The following examples are from Walther (1997, 1999). Suppose that \(\{X_j\}_{j=1}^{100}\) and \(\{Y_j\}_{j=1}^{100}\) are IID N(0,1). Define
Then
where \(c_{0.01}\) is the Spearman critical value for a one-sided level 0.01 test against the alternative of positive association. That is, even though the two series are independent, the probability that the Spearman rank correlation coefficient exceeds the 0.01 critical value is more than 2/3. The reason is that the series \(S\) and \(T\) each have serial correlation, so not all pairings \((S_i, T_j)\) are equally likely.
Serial correlation is not the only way that exchangeability can break down. For example, if the mean or the noise level varies with time, that violates the null hypothesis. Here is an example of the latter. Take \(X = (1, 2, 3, 4)\) to be fixed. Let \((Y_1, \ldots, Y_4)\) be independent, jointly Gaussian with zero mean, \(\mbox{SD}(Y_j) = 1\) for \(j\) = 1, 2, 3, and \(\mbox{SD}(Y_4) = 4\). Then
\[
\mathbb{P}\{ r_S(X, Y) = 1 \} > \frac{1}{4!} \approx 4.2\%,
\]
the value it would have if all \(4!\) pairings were equally likely.
Note that in this example, \(r_S = 1\) if and only if \(Y_1 < Y_2 < Y_3 < Y_4\). Simulation shows that in fact \(\mathbb{P}\{ r_S(X, Y)=1 \}\) is about 7%:
reps = int(10**6)
s = np.array([1,1,1,4]) # standard deviations
mu = np.zeros(len(s))
x = np.random.default_rng(seed=12345678).normal(loc=mu, scale=s, size=(reps, len(s)) )
is_sorted = lambda x: np.all(x[:-1] <= x[1:]) # True if the values are in increasing order
hits = np.sum(np.apply_along_axis(is_sorted, 1, x))
print(f'{100*hits/reps:.2f}% of the realizations are in increasing order')
6.97% of the realizations are in increasing order
In a similar vein, take \(X = (1, 2, 3, 4, 5)\) to be fixed, and let \((Y_1, Y_2, Y_3, Y_4, Y_5)\) be independent, jointly Gaussian with zero mean and SDs 1, 1, 1, 3, and 5, respectively. Then, under the (false) null hypothesis that every \((X, Y)\) pairing is equally likely,
\[
\mathbb{P}\{ r_S(X, Y) = 1 \} = \frac{1}{5!} \approx 0.83\%,
\]
but simulation shows that the actual probability is about 2%:
reps=int(10**6)
x0 = np.arange(1,6) # the constant X vector (not used)
s = np.array([1,1,1,3,5]) # standard deviations
x = prng.normal(loc=np.zeros(len(s)), scale=s, size=(reps, len(s)) )
hits = np.sum(np.apply_along_axis(is_sorted, 1, x))
print(f'{100*hits/reps:.2f}% of the realizations have Spearman correlation 1')
1.94% of the realizations have Spearman correlation 1
In these examples, the null hypothesis for the Spearman correlation test is false, but not because \(X\) and \(Y\) are dependent: it is false because not all pairings \((X_i, Y_j)\) are equally likely under the null. The “identically distributed” part of the null hypothesis fails.
Walther recommends using the block bootstrap instead of the Spearman correlation test. Unfortunately, the bootstrap test is only approximate, neither conservative nor exact, and there is no way to tell how far its nominal significance level is from its true significance level.
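For concreteness, here is a rough sketch of one way such a block-resampling test might look. It is our own illustration, not Walther's procedure; the function name, the circular moving-block scheme, and the block length are all choices that matter and are made here only for the sake of example.

import numpy as np
from scipy import stats

def block_resample_pvalue(x, y, block_len=10, reps=2000, rng=None):
    '''
    Approximate P-value for the hypothesis that the series x and y are independent:
    resample y in circular moving blocks (to mimic its serial dependence) and compare
    the observed absolute Spearman correlation to the resampled values.
    '''
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    obs = abs(stats.spearmanr(x, y)[0])                      # observed statistic
    n_blocks = int(np.ceil(n / block_len))
    hits = 0
    for _ in range(reps):
        starts = rng.integers(0, n, size=n_blocks)           # random block starting points
        idx = (starts[:, None] + np.arange(block_len)) % n   # circular blocks of indices
        y_star = y[idx.ravel()][:n]                          # block-resampled copy of y
        hits += (abs(stats.spearmanr(x, y_star)[0]) >= obs)
    return (hits + 1) / (reps + 1)

As the text notes, the nominal level of any such test is only approximate, and its accuracy depends on tuning choices such as the block length.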