Inference about binary populations from stratified samples#

Many (if not most) books on survey sampling recommend stratified sampling, but do not address how to construct exact or conservative confidence sets from stratified samples. Instead, they recommend approximate confidence intervals based on the normal approximation. As we have seen, confidence intervals based on the normal approximation can have terrible coverage probability in practice–much lower than they claim to have–even for large sample sizes, depending on the population distribution.

This chapter constructs conservative and exact tests and confidence bounds for the population total (or mean) of a binary population from a stratified random sample, where the sample from each stratum is a simple random sample.

Three approaches are discussed:

Wright’s method, which finds simultaneous confidence bounds for the number of 1s in each stratum using Šidák’s method, then adds the bounds.
Wendell & Schmee’s method (1996, https://www.tandfonline.com/doi/abs/10.1080/01621459.1996.10476950), which is based on inverting tests of hypotheses that specify the number of 1s in each stratum. The test statistic is $\hat{p}$, the unbiased estimate of the overall population proportion. The $P$-value for the hypothesis that the total number of 1s in the population is $g$ is the largest $P$-value across all ways of allocating $g$ ones across the strata. Finding the largest $P$-value involves examining a combinatorial number of allocations, so the method becomes infeasible as the number of strata grows (beyond approximately 3).
A new method based on inverting tests of hypotheses that specify the number of 1s in each stratum, where such hypotheses are viewed as an intersection hypothesis: the intersection of separate hypotheses about the number of 1s in each stratum by itself. The test statistic is Fisher’s combining function (or any other $P$-value combining function) applied to the $P$-values from the individual stratumwise tests. The $P$-value for the hypothesis that the total number of 1s in the population is $g$ is the largest $P$-value across all ways of allocating $g$ ones across the strata. In contrast to the Wendell-Schmee method, that maximization can be performed constructively rather than requiring a combinatorial search. That makes the method applicable even when the number of strata is large.

This chapter derives and illustrates specific methods for a particular problem, but along the way it illustrates a number of broadly useful things:

Simultaneous confidence intervals based on independent observations via Šidák’s inequality.
Finding a confidence set for a multidimensional parameter by finding simultaneously valid confidence sets for the individual components of the parameter.
Finding a confidence set for a function of a (multidimensional) parameter by looking at the range of the function over a confidence set for the (multidimensional) parameter.
Maximizing $P$-values over nuisance parameters, a general approach that can be applied widely to test composite hypotheses, including situations with nuisance parameters.
Representing composite nulls as a union of intersection nulls.
Combining independent $P$ values using Fisher’s combining function (see also [combining tests](./
Different test statistics that produce valid tests in a given problem can have vastly different statistical performance (power) and computational performance (speed).
Some combinatorial problems can be solved extremely quicly by greedy algorithms.

Maximizing the $P$-value over all allocations of $G$ ones across $S$ strata is combinatorially complex: Feller’s “bars and stars” argument shows that there are $\binom{G+S-1}{S-1}$ ways to allocate $G$ indistinguishable objects among $S$ strata. (Some of those can be ruled out, for instance if $G$ exceeds the size of any stratum.) For $S=10$ strata of size $N_s = 400$ and $G = 300$, there are roughly 6.3e+16 allocations: impractical by any standard.

This chapter replicates Wendell & Schmee’s method in python and introduces a different conservative strategy for stratified inference, also based on maximizing the $P$-value over the nuisance parameters.

The test statistic Wendell & Schmee use is sensible, but only one of infinitely many. A new method developed here uses a different test statistic: a $P$-value combining function (e.g., Fisher’s combining function) applied to the tail probabilities of individual hypergeometric counts for the separate strata. A naive approach to maximizing this $P$-value over the nuisance parameters would also involve a search over a combinatorial number of possible allocations. However:

No combinatorial search is necessary: an allocation that yields the largest $P$-values and corresponding confidence bounds can be constructed by a simple algorithm in order $N \log N$ operations, where $N$ is the number of items in the population. Runtime can be reduced further to $O( N \log S )$. The number of strata has little effect on the complexity of the calculation.
The resulting tests and confidence intervals are in some cases sharper than those of Wendell and Schmee, in particular when the strata are heterogeneous–which is often the justification for drawing a stratified sample in the first place.

The code below implements Wright’s method; the brute-force approach (enumerate all ways of allocating a given number of ones across the strata, find the maximum $P$-value across those allocations); the new, more efficient approach, which exploits special structure of the problem; and the Wendell & Schmee approach.

It replicates a number of calculations in the Wendell & Schmee paper. Several algorithms for finding confidence intervals are implemented, including a line search, a bisection-like method that takes advantage of the fact that the possible values are integers, and a fast constructive algorithm for the new method.

The problem#

A population of $N$ items of which $G$ are labeled “1” and $N-G$ are labeled “0” is allocationed into $S$ strata. Stratum $s$ contains $N_s$ items, of which $G_s$ are labeled “1.” Thus $N = \sum_{s=1}^S N_s$ and $G := \sum_{s=1}^S G_s$. We draw a simple random sample of size $n_s$ from stratum $s$, independently across strata. (I.e., from stratum $s$ we draw a sample of size $n_s$ in such a way that every subset of $n_s$ distinct items of the $N_s$ items is equally likely; and the $S$ samples are drawn independently.)

Let $Y_s$ denote the number of items labeled “1” in the sample from stratum $s$. The variables $\{Y_s \}_{s=1}^S$ are independent. The observed value of $Y_s$ is $y_s$.

We seek hypothesis tests and confidence bounds for $G$. We first consider one-sided tests of the hypothesis $G = g$ against the alternative $G > g$, and corresponding lower confidence bounds for $G$; reversing the roles of “0” and “1” gives upper confidence bounds, mutatis mutandis.

The general strategy for testing the hypothesis $G=g$ is to find the largest $P$-value among all ways of allocating $g$ items labeled “1” among the $S$ strata (honoring the stratum sizes $\{N_s\}$). That is a $P$-value for the composite hypothesis $G=g$. The maximum can be found by examining all such allocations and calculating the $P$-value for each.

Wright’s Method: Sum of Šidák Intervals#

One easy way to get a lower confidence bound for the sum is to take the sum of simultaneous lower confidence bounds for each stratum. Because the samples from different strata are independent, Šidák’s adjustment works. Wright (1991) suggests this approach.

A confidence bound for $G_s$ can be constructed from $Y$ by inverting hypergeometric tests.

To have joint confidence level $1-\alpha$, make each confidence interval at $(1-\alpha)^{1/S}$.

This is an example of a much more general approach: make a joint $1-\alpha$ confidence set for all the parameters $\{G_j\}_{j=1}^S$, then find a lower bound on a functional of interest (here, their sum) over the joint set. Whenever the joint confidence set covers the parameter, the lower bound does not exceed the true value of the functional of the parameter.

The Wendell-Schmee Test#

The test statistic for the Wendell-Schmee test is the unbiased estimate of the population proportion, $G/N$:

\[ \hat{p} := \frac{1}{N} \sum_{s=1}^S N_s y_s/n_s. \]

The $P$-value of the hypothesis $G_s=g_s$, $s=1, \ldots, S$, is the “lower tail probability” of $\hat{p}$ computed on the assumption that $G_s=g_s$, $s=1, \ldots, S$.

Wendell and Schmee consider maximizing this lower tail probability over all allocations of $g$ ones across strata, either by exhaustive search, or by numerical optimization using a descent method from some number of random starting points. (They show graphically that the tail probability is not convex in the original parametrization.)

Constructive maximization#

For some test statistics, there is a much more efficient approach, developed here.

Define $$ p_s(g_s) := \mathbb{P} \{ Y_s \ge y_s || G_s = g_s \} = \sum_{y = y_s}^{g_s} \frac{\binom{g_s}{y} \binom{N_s-g_s}{n_s - y}}{\binom{N_s}{n_s}}, $$ where $\binom{a}{b} := 0$ if $a \le 0$ or $b > a$. (The double vertical bars denote "computed on the assumption that.") This upper tail probability is a $P$-value for the most powerful family of tests of the hypothesis $G_s = g_s$ against the alternative $G_s > g_s$.

Aside: Suppose $U \sim U[0, 1]$. Then $-2\ln U$ has an exponential distribution, which is also the chi-square distribution with 2 degrees of freedom. The sum of independent random variables with chi-square distributions has a chi-square distribution with degrees of freedom equal to the sum of the degrees of freedom of the variables in the sum. Hence, if $\{U_j\}_{j=1}^S$ are IID uniform, then $-2 \sum_j \ln U_j$ has a chi-square distribution with $2S$ degrees of freedom.

A test of the conjunction hypothesis $G_s = g_s$, $s=1, \ldots, S$ can be constructed using Fisher’s combining function: if all $S$ hypotheses are true, the distribution of $$ X^2(\vec{g}) := -2 \sum_{s=1}^S \log p_s(g_s) $$ is dominated by the chi-square distribution with $2S$ degrees of freedom. Let $\chi_d(z)$ denote the chance that a random variable with the chi-square distribution with $d$ degrees of freedom is greater than or equal to $z$. Then a conservative $P$-value for the allocation $\vec{g}$ is $$ P(\vec{g}) = \chi_{2S}(X^2(\vec{g})). $$ The allocation $\vec{g}$ of $g$ ones across strata that maximizes the $P$-value is the allocation that minimizes $X^2(\vec{g})$ and satisfies $\sum_s g_s = g$. Equivalently, it is the allocation that maximizes $\sum_{s=1}^S \log p_s(g_s)$.

Let $$ a_s(j) := \left \{ \begin{array}{ll} \log p_s(y_s), & j = y_s \\ \log \left (p_s(j)/p_s(j-1) \right ), & j = y_s+1, \ldots N_s-(n_s-y_s). \end{array} \right . $$

Then $\log p_s(g_s) = \sum_{j=y_s}^{g_s} a_s(j)$ if $y_s \le g_s \le N-(n_s-y_s)$, and $\log p_s(g_s) = -\infty$ otherwise. Moreover, $$ X^2(\vec{g}) = -2\sum_{s=1}^S a_s(y_s) -2\sum_{s=1}^S \sum_{j=y_s+1}^{g_s} a_s(j) $$ provided $y_s \le g_s \le N-(n_s-y_s)$, $s=1, \ldots, S$; otherwise, it is infinite.

An allocation of $g$ ones across strata is inconsistent with the data unless $g_s \ge y_s$, $s=1, \ldots, S$. Thus, in considering how to allocate $g$ ones to maximize the $P$-value, the first sum above, accounting for $\sum_s y_s$ ones, is “mandatory,” or the $P$-value will be zero. The question is how to allocate the remaining $g - \sum_s y_s$ ones to maximize the $P$-value (equivalently, to minimize $X^2(\vec{g})$).

Let $b_k$ denote the $k$th largest element of the set

\[ \{a_s(j): j=y_s+1, \ldots, N_s-(n_s-y_s), \;\; s=1, \ldots, S \}, \]

with ties broken arbitrarily. Define $\tilde{g}_y := g - \sum_{s=1}^S y_s$.

Proposition. For every $\vec{g}$ with $\sum_s g_s = g$, $$ X^2(\vec{g}) \ge X_*^2(g) := \left \{ \begin{array}{ll} -2 \left ( \sum_{s=1}^S a_s(y_s) + \sum_{k=1}^{\tilde{g}_y} b_k \right ), & \sum_s y_s \le g \le N - \sum_s (n_s-y_s) \\ \infty, & \mbox{ otherwise. } \end{array} \right . $$

Proof. Any $\vec{g}$ for which $X^2(\vec{g})$ is finite includes the first sum and a sum of $\tilde{g}_y$ elements of $\{b_k\}$; the latter is at most the sum of the $\tilde{g}_y$ largest elements of $\{b_k\}$. $\Box$

Moreover, the bound is sharp, because $a_s(j)$ decrease monotonically with $j$ for $j = y_s+1, \ldots, N_s-(n_s-y_s)$. Thus, if $a_s(i)$ is a term in the second sum for some $i > y_s+1$, so is every $a_s(j)$, $y_s \le j \le i-1$: the second sum indeed corresponds to a particular allocation $\vec{g}$ of $g$ ones across the $S$ strata, with $y_s \le g_s \le N_s-(n_s-y_s)$. Among all allocations of $g$ items labeled “1,” this one minimizes has the smallest tail probability, because it corresponds to the exponentiation of the smallest sum of logs (the largest negative sum of logs). $\Box$

Proposition: For $j \in y_s+1, \ldots, N_s-(n_s-y_s)$, $a_s(j)$ is monotone decreasing in $j$.

Theorem: If $\sum_s y_s \le g \le N - \sum_s (n_s-y_s)$,

\[ P(g) \le \chi_d(X_*^2(g)). \]

Proof: Immediate from the definitions.

The theorem shows that a “greedy” approach finds a conservative $P$-value: Construct the values $a_s(j)$ and the set $\{b_k\}$. Add the $S$ values $\{a_s(x_k)$ to the $g-g_y$ largest elements of $\{b_k\}$ and multiply the sum by $-2$. The upper tail probability of the chi-square distribution with $2S$ degrees of freedom is a conservative $P$-value for the hypothesis $G=g$.

A conservative upper $1-\alpha$ confidence bound for $G$ is the largest $g$ for which $P(g) \ge \alpha$.

Changing the direction of the test#

The test of the hypothesis $G=g$ given above is a one-sided test against the alternative $G > g$: it rejects if the chance of observing “so few” good objects is small.

To test against the alternative $G < g$ (i.e., to reject if the chance of observing “so many” good objects is small), exchange the role of “good” and “bad.” The hypothesis $G < g$ is equivalent to the hypothesis $(N-G) > (N-g)$.

The resulting null hypothesis is $G = N-g$, and the data are $n_s - Y_s$.

Sampling with replacement#

The same approaches work for sampling with replacement.

For simplicity, assume that the population proportion $\pi = G/N$ and the stratum proportions $\pi_s G_s/N_s$ can be any numbers between $0$ and $1$, not just multiples of $1/N$ or $1/N_s$.

We observe $Y_s \sim \mathrm{Bin}(n_s, \pi_s)$, $s = 1, \ldots, S$. The observations are independent. The population proportion is $\pi := G/N = N^{-1}\sum_s \pi_s n_s$.

Define $$ p_s(\mu_s) := \mathbb{P} \{ Y_s \ge y_s || \pi_s = \mu_s \} = \sum_{y = y_s}^{n_s} \binom{n_s}{y} \mu_s^y (1-\mu_s)^{n_s-y}. $$ This is the $P$-value of the hypothesis $\pi_s = \mu_s$ tested against the alternative $\pi_s > \mu_s$.

A test of the conjunction hypothesis $\pi_s = \mu_s$, $s=1, \ldots, S$ can be constructed using Fisher’s combining function: Let $g_s = N_s\mu_s$, $s = 1, \ldots, S$. If all $S$ hypotheses are true, the distribution of $$ X^2(\vec{g}) := -2 \sum_{s=1}^S \log p_s(\mu_s) $$ is dominated by the chi-square distribution with $2S$ degrees of freedom. Let $\chi_d(z)$ denote the survival function for the chi-sqare distribution with $d$ degrees of freedom, i.e., the chance that a random variable with the chi-square distribution with $d$ degrees of freedom is greater than or equal to $z$. Then a conservative $P$-value for the allocation $\vec{g}$ is $$ P(\vec{g}) = \chi_{2S}(X^2(\vec{\mu})). $$ The allocation $\vec{\mu}$ that maximizes the $P$-value is the allocation that minimizes $X^2(\vec{g})$ and satisfies $\sum_s \mu_s = g$. Equivalently, it is the allocation that maximizes $\sum_{s=1}^S \log p_s(g_s)$.

Lemma. The Binomial pdf is log concave in $p$.

Proof.

\[\begin{split} \begin{align*} \frac{d^2}{dp^2} \log \left [ \binom{n}{k}p^k(1-p)^{n-k} \right ] & = \frac{d^2}{dp^2} \left [ C + k\log p + (n-k)\log(1-p) \right ] \\ &= \frac{d}{dp} (k/p - (n-k)/(1-p)) \\ &= -k/p^2 - (n-k)/(1-p)^2 < 0. \Box \end{align*} \end{split}\]

Open questions#

The Wendell-Schmee test coincides exactly with the unadjusted new test (i.e., using the joint probability without calibrating it with Fisher’s combining function) for small observed counts. The optimal parameter values are identical, and the Wendell-Schmee $P$-value is equal to the tail probability in the new test before Fisher’s adjustment. Why?

For what observations is the optimal allocation the same for the two tests?

Can a similar constructive/greedy approach find the optimizer (or a bound) for the Wendell-Schmee test? (Implausible because of the lack of convexity, at least in the original parametrization.)

What is the empirical coverage of the two methods? What’s the worst-case?

When is the new method sharper than Wendell-Schmee? (Seems to be when the strata are heterogeneous.)

# Install permute and cryptorandom in the current kernel if needed
import sys
!{sys.executable} -m pip install permute --user

Requirement already satisfied: permute in /Users/stark/.local/lib/python3.10/site-packages (0.2)
Requirement already satisfied: scipy>=1.6 in /opt/anaconda3/lib/python3.10/site-packages (from permute) (1.7.3)
Requirement already satisfied: cryptorandom>=0.3 in /opt/anaconda3/lib/python3.10/site-packages (from permute) (0.3)
Requirement already satisfied: numpy>=1.20 in /opt/anaconda3/lib/python3.10/site-packages (from permute) (1.21.5)

import math
from scipy.stats import binom, hypergeom, chi2
import scipy as sp
import numpy as np
import itertools
import warnings
from permute.utils import binom_conf_interval, hypergeom_conf_interval

# New utility functions

def fisher_log(log_p, **kwargs) -> float:
    '''
    Fisher's combining function for the log of independent P-values
    
    Parameter
    ---------
    log_p : np.array or float
        vector of logarithms of independent P-values or sum of the logs
    df : int
        twice the number of log P-values in the sum. 
        Required if log_p is a scalar; otherwise, inferred from len(log_p)
        
    Returns
    -------
    P : float
        combined P-value (not on a log scale)
    '''
    if not isinstance(log_p, (list, tuple, np.ndarray)):  # log_p is already the sum; need sensible df
        df = kwargs.get('df',0)
        assert df >= 2, f'{df=} incorrect or not set'
    else: # there's a vector of log P-values; df is twice its length
        df = 2*len(log_p)
    return sp.stats.chi2.sf(-2*np.sum(log_p), df=df)

def bars_stars(strata: list, found: list, good: int) -> object:
    '''
    Generate all allocations of `good` 1s across the strata
    
    Parameters
    ----------
    strata : list of ints
        stratum sizes
    found : list of ints
        number of 1s found in the sampl from each stratum
    good : int
        number of 1s to distribute across the strata
    
    Returns
    -------
    generator that iterates over all allocations
    '''
    n_strata = len(strata)
    barsNstars = good + n_strata
    bars = [0]*n_strata + [barsNstars]
    return ([bars[j+1] - bars[j] - 1 for j in range(n_strata)]
            for bars[1:-1] in itertools.combinations(range(1, barsNstars), n_strata-1)
            if all(((bars[j+1] - bars[j] - 1 <= strata[j]) and \
            (bars[j+1] - bars[j] -1 >= found[j])) for j in range(n_strata)))            


class StratifiedBinary:
    '''
    allocation of 1s to strata

    Parameters
    ----------
    strata : numpy array of ints
        stratum sizes
    sams : numpy array of ints
        sample sizes
    found : numpy array of ints
        number of 1s in the sample from each stratum
    alloc : numpy array of ints
        initial allocation of 1s to strata (found)
    log_p : numpy array of floats
        tail probabilities for the allocation
    next_up : numpy array of floats
        log of the probability multipliers for including an additional 1 in each stratum
    '''
    
    def __init__(self, strata=None, sams=None, found=None, alloc=None, log_p=None, next_up=None):
        self.strata = strata
        self.sams = sams
        self.found = found
        self.alloc = alloc
        self.log_p = log_p
        self.next_up = next_up
    
    def __str__(self):
        return f'{strata=}\n{sams=}\n{found=}\n{alloc=}\n{log_p=}\n{next_up=}'        
    
    def allocate_first(self):
        '''
        initialize the allocation of 1s to strata and ingredients for constructive max
        
        Returns
        -------
        True
        
        Side effects
        ------------
        initializes the allocation to its minimal feasible values
        
        '''
        self.alloc = self.found.copy()                    # allocation that maximizes the P-value so far
        self.log_p = np.array([sp.stats.hypergeom.logsf(self.found[s]-1, self.strata[s], self.alloc[s], \
                             self.sams[s]) for s in range(len(self.strata))])  # stratumwise log tail probabilities
        self.next_up = np.array([np.NINF if self.alloc[s]+1 > self.strata[s]-self.sams[s]+self.found[s] \
                                 else sp.stats.hypergeom.logsf(self.found[s]-1, self.strata[s], self.alloc[s]+1,\
                                self.sams[s]) - self.log_p[s] for s in range(len(self.strata))])
        return True

    def allocate_next(self) -> bool:
        '''
        allocate an additional 1 to the stratum that gives largest tail probability

        updates alloc and next_up in place
        
        Returns
        -------
        success: bool
            true unless there's nothing left to allocate
        
        Side effects
        ------------
        updates alloc and next_up
        
        '''
        big = np.argmax(self.next_up)
        self.alloc[big] += 1
        self.log_p[big] = sp.stats.hypergeom.logsf(self.found[big]-1, self.strata[big], self.alloc[big], 
                                                   self.sams[big])
        self.next_up[big] = (np.NINF if self.alloc[big]+1 > self.strata[big]-self.sams[big]+self.found[big]
                             else sp.stats.hypergeom.logsf(self.found[big]-1, self.strata[big], self.alloc[big]+1, 
                             self.sams[big]) - self.log_p[big])
        return (True if np.max(self.next_up) > np.NINF else False)
    
    def fisher_p(self) -> float:
        '''
        Fisher P-value
        '''
        return fisher_log(self.log_p)
    
    def total(self) -> float:
        '''
        total 1s allocated
        '''
        return np.sum(self.alloc)
    
    def n_strata(self) -> int:
        '''
        number of strata
        '''
        return len(self.strata)

# Maximum P-values over allocations 
            
def strat_test_brute(strata: list, sams: list, found: list, good: int, **kwargs) -> tuple:
    '''
    p-value of the hypothesis that the number of 1s in a binary population is 
    less than or equal to `good`, from a stratified random sample.
    
    Assumes that a simple random sample of size sams[s] was drawn from stratum s, 
    which contains strata[s] objects in all.
    
    The P-value is the maximum Fisher combined P-value across strata
    over all allocations of good 1s among the strata. The allocations are
    enumerated using Feller's "bars and stars" construction, constrained to honor the
    stratum sizes and the data (each stratum can contain no more 1s than it has items in all
    minus the observed number of 0s, nor fewer "good" items than the sample contains).
    
    The number of allocations grows combinatorially: there can be as many as
    [(#strata + #1s) choose (#strata-1)] allocations, making the brute-force approach computationally 
    infeasible when the number of strata and/or the number of 1s is large.
    
    The test is a union-intersection test: the null hypothesis is the union over allocations
    of the intersection across strata of the hypothesis that the number of 1s
    in the stratum is less than or equal to a constant.
    
    Parameters:
    -----------
    strata : list of ints
        sizes of the strata
    sams : list of ints
        sample sizes from the strata
    found : list of ints
        the numbers of 1s found in the samples from the strata
    good : int
        the hypothesized total number of 1s in the population
    kwargs : keyword arguments for this function and the functions it calls
        alternative : string {'lower', 'upper'}
            test against the alternative that the true number of 1s is less than good (lower)
            or greater than good ('upper'). Default 'lower'
        combining_function : callable 
            combining function; default is fisher_log. 
            kwarg is also passed to combining_function
        cheap_combiner : callable
            monotone increasing function of the combining function. 
            Default np.sum if combining_function == fisher_log
            kwarg also passed to cheap_combiner
        warn : int
            warn if the number of allocations exceeds this. Default 10**7        
    
    Returns:
    --------
    p : float
        maximum combined p-value over all ways of allocating good "good" objects
        among the strata, honoring the stratum sizes.        
    alloc : list
        an allocation that attains the maximum p-value
    '''
    alternative = kwargs.get('alternative','lower')
    assert alternative in ['lower','upper'], f'alternative {alternative} not implemented'
    alloc = None
    sams = np.array(sams, dtype=int)
    found = np.array(found, dtype=int)
    strata = np.array(strata, dtype=int)
    if good < np.sum(found):     
        p = 0 if alternative == 'lower' else 1 
    elif good > np.sum(strata) - np.sum(sams) + np.sum(found):
        p = 1 if alternative == 'lower' else 0
    else:  
        if alternative == 'upper':                   # exchange roles of 1s and 0s
            compl = sams - found                     # 0s found 
            bad = np.sum(strata) - good              # total 0s hypothesized
            kwargs_c = kwargs.copy()
            kwargs_c['alternative'] = 'lower'
            p, alloc_c = strat_test_brute(strata, sams, compl, bad, **kwargs_c)
            alloc = None if alloc_c is None else list(strata-np.array(alloc_c, dtype=int))
        else:
            p = np.NINF   # initial value for the max
            n_strata = len(strata)
            parts = sp.special.binom(good+n_strata-1, n_strata-1)
            combining_function = kwargs.get('combining_function', fisher_log)
            if combining_function == fisher_log:
                kwargs['df'] = 2*n_strata
                cheap_combiner = lambda p_vec, **kwargs: np.sum(p_vec)
            else:
                cheap_combiner = kwargs.get('cheap_combiner', combining_function)
            warn = kwargs.get('warn',10**7)
            if parts >= warn:
                print(f'warning--large number of allocations: {parts}')
            alloc = found.copy()
            for part in bars_stars(strata, found, good):
                p_new = cheap_combiner( 
                            np.array([sp.stats.hypergeom.logsf(found[s]-1, strata[s], part[s], sams[s])
                            for s in range(n_strata)]),
                            **kwargs)
                if p_new > p:
                    alloc = part
                    p = p_new
            p = combining_function(p, **kwargs)
    return p, (None if alloc is None else list(alloc))
    
def strat_test(strata: list, sams: list, found: list, good: int, **kwargs) -> tuple:
    """
    P-value for the hypothesis that the number of 1s in a binary population is not 
    greater than (or not less than) a hypothesized value, based on a stratified 
    random sample without replacement.
    
    Uses the fast algorithm to find the P-value constructively.
    
    Uses Fisher's combining function to combine stratum-level P-values.
    
    Parameters:
    -----------    
    strata : list of ints
        stratum sizes
    sams : list of ints
        sample sizes in the strata
    found : list of ints
        number of ones found in each stratum in each sample
    good : int
        hypothesized number of ones in the population
    kwargs : dict
        alternative : string {'lower', 'upper'} default 'lower'
            test against the alternative that the true number of 1s is less than (lower) 
            or greater than (upper) the hypothesized number, good
    
    Returns:
    --------
    p : float
        P-value
    alloc : list
        an allocation that attains the maximum p-value
    """
    alternative = kwargs.get('alternative','lower')
    assert alternative in ['lower', 'upper'], f'alternative {alternative} not implemented'
    strata = np.array(strata, dtype=int)
    sams = np.array(sams, dtype=int)
    found = np.array(found, dtype=int)
    good = int(good)
    alloc = None
    if good < np.sum(found):     
        p = 0 if alternative == 'lower' else 1 
    elif good > np.sum(strata) - np.sum(sams) + np.sum(found):
        p = 1 if alternative == 'lower' else 0
    else:
        if alternative == 'upper':                  # exchange roles of "good" and "bad"
            compl = sams - found                    # bad items found 
            bad = np.sum(strata) - good             # total bad items hypothesized
            kwargs_c = kwargs.copy()
            kwargs_c['alternative'] = 'lower'
            p, alloc_c = strat_test(strata, sams, compl, bad, **kwargs_c)
            alloc = (None if alloc_c is None 
                          else list(strata-np.array(alloc_c, dtype=int)))
        else:  
            if good < np.sum(found) or good > np.sum(strata - sams + found): # impossible
                p = 0  
            elif good == np.sum(strata-sams+found): # the "packed" allocation guarantees this outcome or more 1s
                p = 1
                alloc = strata-sams+found      
            else:                                   # outcome is possible but not certain under the composite null 
                optimal = StratifiedBinary(strata=strata, sams=sams, found=found)
                optimal.allocate_first()
                while optimal.total() < good:
                    optimal.allocate_next()
                p = optimal.fisher_p()
                alloc = optimal.alloc
    return p, (None if alloc is None else list(alloc))

# Confidence intervals

def strat_ci_bisect(strata: list, sams: list, found: list, **kwargs) -> tuple:
    """
    Confidence bound on the number of ones in a stratified binary population,
    based on a stratified random sample without replacement
    
    If alternative=='lower', finds an upper confidence bound.
    If alternative=='upper', finds a lower confidence bound.

    Uses an integer bisection search to find an exact confidence bound.
    The starting upper endpoint for the search is the unbiased estimate
    of the number of ones in the population. That could be refined in various
    ways to improve efficiency.
    
    The lower endpoint for the search is the Šidák joint lower confidence bounds,
    which should be more conservative than the exact bound.
    
    Parameters:
    -----------    
    strata : list of ints
        stratum sizes
    sams : list of ints
        sample sizes in the strata
    found : list of ints
        number of ones found in each stratum in each sample
    kwargs:
        alternative : string in {'lower', 'upper'}
            if alternative=='lower', finds an upper confidence bound.
            if alternative=='upper', finds a lower confidence bound.
            While this is not mnemonic, it corresponds to the sidedness of the tests
            that are inverted to get the confidence bound.
        cl : float
            confidence level. Assumed to be at least 0.5. Default 0.95.
        p_value : callable
            method for computing the p-value
        kwargs is also passed to p_value
    
    Returns:
    --------
    b : int
        confidence bound
    alloc : list of ints
        allocation that attains the confidence bound
    """
    cl = kwargs.get('cl',0.95)
    p_value = kwargs.get('p_value', strat_test_brute)
    alternative = kwargs.get('alternative','lower')
    strata = np.array(strata, dtype=int)
    sams = np.array(sams, dtype=int)
    found = np.array(found, dtype=int)
    assert alternative in ['lower', 'upper'], f'alternative {alternative} not implemented'    
    if alternative == 'upper':  # interchange good and bad
        compl = sams-found      # bad items found
        kwargs_c = kwargs.copy()
        kwargs_c['alternative'] = 'lower'
        cb, alloc_c = strat_ci_bisect(strata, sams, compl, **kwargs_c)
        b = np.sum(strata) - cb    # good from bad
        alloc = strata-np.array(alloc_c, dtype=int)
    else:
        cl_sidak = math.pow(cl, 1/len(strata))  # Šidák adjustment
        tail = 1-cl
        a = sum((hypergeom_conf_interval( \
                sams[s], found[s], strata[s], cl=cl_sidak, alternative="lower")[0] \
                for s in range(len(strata)))) # Šidák should give a lower bound
        b = int(np.sum(np.array(strata)*np.array(found)/np.array(sams)))-1 # expected good
        p_a, alloc_a = p_value(strata, sams, found, a, alternative=alternative)
        p_b, alloc = p_value(strata, sams, found, b, alternative=alternative)
        tot_found = np.sum(found)
        while p_a > tail and a > tot_found:
            a = math.floor(a/2)
            p_a, alloc_a = p_value(strata, sams, found, a, **kwargs)
        if p_a > tail:
            b = a
            alloc = alloc_a
        else:
            while b-a > 1:
                c = int((a+b)/2)
                p_c, alloc_c = p_value(strata, sams, found, c, **kwargs)
                if p_c > tail:
                    b, p_b, alloc = c, p_c, alloc_c
                elif p_c < tail:
                    a, p_a, alloc_a = c, p_c, alloc_c
                elif p_c == tail:
                    b, p_b, alloc = c, p_c, alloc_c
                    break
    return b, list(alloc)

def strat_ci_search(strata: list, sams: list, found: list, **kwargs) -> tuple:
    """
    Confidence bound on the number of ones in a stratified population,
    based on a stratified random sample (without replacement) from
    the population.
        
    If alternative=='lower', finds an upper confidence bound.
    If alternative=='upper', finds a lower confidence bound.
    
    Searches for the allocation of items that attains the confidence bound
    by increasing the number of ones from the minimum consistent
    with the data (total found in the sample) until the P-value is greater
    than 1-cl.
    
    Uses the fast method for finding the maximum P-value for Fisher's combining function
    
    Parameters:
    -----------    
    strata : list of ints
        stratum sizes
    sams : list of ints
        sample sizes in the strata
    found : list of ints
        number of ones found in each stratum in each sample
    kwargs : dict
        alternative : string {'lower', 'upper'} Default 'lower'
            if alternative=='lower', finds an upper confidence bound.
            if alternative=='upper', finds a lower confidence bound.
            While this is not mnemonic, it corresponds to the sidedness of the tests
            that are inverted to get the confidence bound.
        cl : float Default 0.95
            confidence level. Assumed to be at least 50%.
    
    Returns:
    --------
    cb : int
        confidence bound
    alloc : list of ints
        allocation that attains the confidence bound (give or take one item)
    """
    cl = kwargs.get('cl',0.95)
    alternative = kwargs.get('alternative','lower')
    assert alternative in ['lower', 'upper'], f'alternative {alternative} not implemented'
    strata = np.array(strata, dtype=int)
    sams = np.array(sams, dtype=int)
    found = np.array(found, dtype=int)
    if alternative == 'upper':  # interchange good and bad
        kwargs_c = kwargs.copy()
        kwargs_c['alternative'] = 'lower'
        compl = sams-found  # bad items found
        cb, alloc_c = strat_ci(strata, sams, compl, **kwargs_c)
        cb = np.sum(strata) - cb    # good from bad
        alloc = strata - alloc_c
    else:
        cb = int(np.sum(strata*found/sams))-1 # expected good
        p_attained, alloc = strat_test(strata, sams, found, cb, alternative=alternative)
        while p_attained >= 1-cl:
            cb -= 1
            p_attained, alloc = strat_test(strata, sams, found, cb, alternative=alternative)
        cb += 1
        p_attained, alloc = strat_test(strata, sams, found, cb, alternative=alternative)
    return cb, list(alloc)

def strat_ci(strata: list, sams: list, found: list, **kwargs) -> tuple:
    """
    Confidence bound on the number of ones in a population,
    based on a stratified random sample (without replacement) from
    the population.
    
    If alternative=='lower', finds an upper confidence bound.
    If alternative=='upper', finds a lower confidence bound.
    
    Constructs the confidence bound directly by constructing the
    allocation of the maximum number of ones that would not be
    rejected at (conservative) level 1-cl.
    
    Parameters:
    -----------    
    strata : list of ints
        stratum sizes
    sams : list of ints
        sample sizes in the strata
    found : list of ints
        number of ones found in each stratum in each sample
    kwargs : dict
        alternative : string {'lower', 'upper'} default 'lower'
            if alternative=='lower', finds an upper confidence bound.
            if alternative=='upper', finds a lower confidence bound.
            While this is not mnemonic, it corresponds to the sidedness of the tests
            that are inverted to get the confidence bound.
        cl : float default 0.95
            confidence level

    Returns:
    --------
    cb : int
        confidence bound
    alloc : list of ints
        allocation that attains the confidence bound (give or take one item)
    """
    cl = kwargs.get('cl',0.95)
    alternative = kwargs.get('alternative','lower')
    strata = np.array(strata, dtype=int)
    sams = np.array(sams, dtype=int)
    found = np.array(found, dtype=int)
    assert alternative in ['lower', 'upper'], f'alternative {alternative} not implemented'
    if alternative == 'upper':  # interchange role of good and bad
        compl = sams - found  # bad found
        kwargs_c = kwargs.copy()
        kwargs_c['alternative'] = 'lower'
        cb, alloc = strat_ci(strata, sams, compl, **kwargs_c)
        alloc = strata - alloc
    else:                
        threshold = -sp.stats.chi2.ppf(cl, df=2*len(strata))/2
        # g is in the confidence set if 
        #          chi2.sf(-2*log(p), df=2*len(strata)) >= 1-cl
        #  i.e.,   -2*log(p) <=  chi2.ppf(cl, df)
        #  i.e.,   log(p) >= -chi2.ppf(cl, df)/2
        optimal = StratifiedBinary(strata=strata, sams=sams, found=found)
        optimal.allocate_first()
        while np.sum(optimal.log_p) < threshold:
            optimal.allocate_next()
        alloc = optimal.alloc
    return np.sum(alloc), list(alloc)

# older methods

def strat_p_ws(strata: list, sams: list, found: list, hypo: list, **kwargs) -> float:
    """
    Finds Wendell-Schmee P-value for the hypothesized population counts `hypo` for 
    simple random samples of sizes `sams` from strata of sizes `strata` if 
    `found` 1s are found in the samples from the strata.
        
    Parameters:
    -----------    
    strata : list of ints
        stratum sizes
    sams : list of ints
        sample sizes from the strata
    found : list of ints
        number of ones found in each stratum in each sample
    hypo : list of ints
        hypothesized numbers of ones the strata
    
    Returns:
    --------
    p : float
        tail probability
    """
    alternative = kwargs.get('alternative','lower')
    assert alternative in ['lower', 'upper']
    if alternative == 'lower':                     # exchange roles of "good" and "bad"
        kwargs_c = kwargs.copy()
        kwargs_c['alternative'] = 'upper'
        compl = np.array(sams) - np.array(found)   # bad items found 
        hypo_c = np.array(strata) - np.array(hypo) # total bad items hypothesized
        p = strat_p_ws(strata, sams, compl, hypo_c, **kwargs_c)
    else:    
        p_hat = lambda f, st=strata, sa=sams: np.sum(np.array(st)*np.array(f)/np.array(sa))/np.sum(st) # pooled estimate
        p_hat_0 = p_hat(found)
        per_strat = (np.array(strata)/np.array(sams))/np.sum(strata)
        strat_max = np.floor(p_hat_0/per_strat)
        lo_t = (t for t in itertools.product(*[range(int(s+1)) for s in strat_max])
                    if p_hat(t) <= p_hat_0)
        p = sum(np.prod(sp.stats.hypergeom.pmf(t, strata, hypo, sams))
                    for t in lo_t)
    return p

def strat_test_ws(strata: list, sams: list, found: list, good: list, **kwargs) -> tuple:
    """
    Find p-value of the hypothesis that the number G of "good" objects in a 
    stratified population is less than or equal to good, using a stratified
    random sample.
    
    Assumes that a simple random sample of size sams[s] was drawn from stratum s, 
    which contains strata[s] objects in all.
    
    The P-value is the maximum Windell-Schmee P-value over all allocations of 
    good objects among the strata. The allocations are enumerated using Feller's 
    "bars and stars" construction, constrained to honor the stratum sizes (each 
    stratum can contain no more "good" items than it has items in all, nor fewer 
    "good" items than the sample contains).
    
    The number of allocations grows combinatorially: there can be as many as
    [(#strata + #good items) choose (#strata-1)] allocations, making the brute-force
    approach computationally infeasible when the number of strata and/or the number of
    good items is large.
    
    Parameters:
    -----------
    strata : list of ints
        sizes of the strata. One int per stratum.
    sams : list of ints
        the sample sizes from each stratum
    found : list of ints
        the numbers of "good" items found in the samples from the strata
    good : int
        the hypothesized total number of "good" objects in the population
    alternative : string {'lower', 'upper'}
        test against the alternative that the true value is less than good (lower)
        or greater than good (upper)
    
    Returns:
    --------
    p : float
        maximum combined p-value over all ways of allocating good "good" objects
        among the strata, honoring the stratum sizes.        
    alloc : list
        the allocation that attained the maximum p-value
    """
    alternative = kwargs.get('alternative','lower')
    assert alternative in ['lower', 'upper']
    if alternative == 'lower':                   # exchange roles of "good" and "bad"
        compl = np.array(sams) - np.array(found) # bad items found 
        bad = np.sum(strata) - good              # total bad items hypothesized
        kwargs_c = kwargs.copy()
        kwargs_c['alternative'] = 'upper'
        res = strat_test_ws(strata, sams, compl, bad, **kwargs_c)
        return res[0], list(np.array(strata, dtype=int)-np.array(res[1], dtype=int))        
    alloc = found # start with what you see
    good = int(good)
    if good < np.sum(found):     
        p = 0 if alternative == 'lower' else 1 
    elif good > np.sum(strata) - np.sum(sams) + np.sum(found):
        p = 1 if alternative == 'lower' else 0
    else:  # use Feller's "bars and stars" enumeration of combinations, constrained
        p_hat = lambda f, st=strata, sa=sams: np.sum(np.array(st)*np.array(f)/np.array(sa))/np.sum(st) # pooled estimate
        p_hat_0 = p_hat(found)
        per_strat = (np.array(strata)/np.array(sams))/np.sum(strata)
        strat_max = np.floor(p_hat_0/per_strat)
        p = 0   # initial value for the max
        n_strata = len(strata)
        parts = sp.special.binom(good+n_strata-1, n_strata-1)
        for part in bars_stars(strata, found, good):
            lo_t = (t for t in itertools.product(*[range(int(s+1)) for s in strat_max]) \
                    if p_hat(t) <= p_hat_0)
            p_new = 0
            for t in lo_t:
                p_temp = 1
                for s in range(len(strata)):
                    p_temp *= sp.stats.hypergeom.pmf(t[s], strata[s], part[s], sams[s])
                p_new += p_temp
            if p_new > p:
                alloc = part
                p = p_new
    return p, list(alloc)

def strat_ci_wright(strata: list, sams: list, found: list, **kwargs) -> int:
    """
    Confidence bound on the number of ones in a stratified population,
    based on a stratified random sample (without replacement) from
    the population.
    
    If alternative=='lower', finds an upper confidence bound.
    If alternative=='upper', finds a lower confidence bound.
    
    Constructs the confidence bound by finding Šidák multiplicity-adjusted
    joint lower confidence bounds for the number of ones in each stratum.
    
    This approach is mentioned in Wright, 1991.
    
    Parameters:
    -----------    
    strata : list of ints
        stratum sizes
    sams : list of ints
        sample sizes in the strata
    found : list of ints
        number of ones found in each stratum in each sample
    alternative : string {'lower', 'upper'}
        if alternative=='lower', finds an upper confidence bound.
        if alternative=='upper', finds a lower confidence bound.
        While this is not mnemonic, it corresponds to the sidedness of the tests
        that are inverted to get the confidence bound.
    cl : float
        confidence level
        
    Returns:
    --------
    cb : int
        confidence bound
    """
    alternative = kwargs.get('alternative','lower')
    assert alternative in ['lower', 'upper']
    inx = 0 if alternative == 'lower' else 1
    cl = kwargs.get('cl',0.95)
    cl_sidak = math.pow(cl, 1/len(strata))  # Šidák-adjusted confidence level per stratum
    cb = sum((hypergeom_conf_interval(
                sams[s], found[s], strata[s], cl=cl_sidak, alternative=alternative)[inx] 
                for s in range(len(strata))))
    return cb

def test_strat_test(verbose = False):
    strata = [[10, 20, 30, 40], [20, 20, 20]]
    sams = [[2, 3, 4, 5], [5, 5, 10]]
    found = [[1, 2, 3, 4], [0, 3, 2]]
    for i in range(len(strata)):
        compl = np.array(sams[i]) - np.array(found[i])
        for j in list(range(4, 20)) + [60]:
            for alternative in ['lower', 'upper']:
                if verbose:
                    print(f'{i=} {j=} {alternative=}')
                p_exact = strat_test_brute(strata[i], sams[i], found[i], j, alternative=alternative)
                p_exact_c = strat_test_brute(strata[i], sams[i], compl, j, alternative=alternative)
                p_fast = strat_test(strata[i], sams[i], found[i], j, alternative=alternative)
                p_fast_c  = strat_test(strata[i], sams[i], compl, j, alternative=alternative)
                np.testing.assert_allclose(p_exact[0], p_fast[0])
                np.testing.assert_allclose(p_exact_c[0], p_fast_c[0])

def test_strat_ci(verbose = False):
    strata = [[10, 20], [10, 20, 30, 40], [10, 20, 20, 30]]
    sams = [[5, 5], [2, 3, 4, 5], [2, 4, 5, 6]]
    found = [[2, 2], [1, 2, 3, 4], [0, 1, 2, 3]]
    for i in range(len(strata)):
        for alternative in ['lower','upper']:
            brute = strat_ci_bisect(strata[i], sams[i], found[i], alternative=alternative)
            fast = strat_ci(strata[i], sams[i], found[i], alternative=alternative)
            fast_s = strat_ci_search(strata[i], sams[i], found[i], alternative=alternative)
            if verbose:
                print(f'{i}-{alternative}')
                print(f'i:{i} brute:{brute[0]} fast:{fast[0]} fast_s:{fast_s[0]}\nbest_brute: {brute[1]} best_fast:{fast[1]} best_fast_s: {fast_s[1]}')
            np.testing.assert_allclose(brute[0], fast[0])
            np.testing.assert_allclose(brute[0], fast_s[0])
            np.testing.assert_allclose(brute[1], fast[1])
            np.testing.assert_allclose(brute[1], fast_s[1])

test_strat_test(verbose = False)

test_strat_ci(verbose = False)

Comparison with Wendell & Schmee (1996) $P$-values and upper confidence bounds#

def test_strat_test_ws(verbose = False):
    strata = [[10, 20, 30, 40], [20, 20, 20]]
    sams = [[2, 3, 4, 5], [5, 5, 10]]
    found = [[1, 2, 3, 4], [0, 3, 2]]
    for i in range(len(strata)):
        compl = np.array(sams[i]) - np.array(found[i])
        alternative = 'upper'
        for j in list(range(4, 20)) + [60]:
            if verbose:
                print(f'{i=} {j=} {alternative=}')
            p_ws = strat_test_ws(strata[i], sams[i], found[i], j, alternative=alternative)
            p_ws_c = strat_test_ws(strata[i], sams[i], compl, j, alternative=alternative)
            p_fast = strat_test(strata[i], sams[i], found[i], j, alternative=alternative)
            p_fast_c = strat_test(strata[i], sams[i], compl, j, alternative=alternative)
            print(f'ws: {p_ws[0]} ws_c: {p_ws_c[0]} best: {p_ws[1]} best_c: {p_ws_c[1]}')
            print(f'fast: {p_fast[0]} fast_c: {p_fast_c[0]} best: {p_fast[1]} best_c: {p_fast_c[1]}')

# time-consuming!
test_strat_test_ws(verbose = False)

ws: 1 ws_c: 1.0000000000000002 best: [1, 2, 3, 4] best_c: [1, 1, 1, 1]
fast: 1 fast_c: 1 best: None best_c: [1, 1, 1, 1]
ws: 1 ws_c: 0.9999444444444434 best: [1, 2, 3, 4] best_c: [2, 1, 1, 1]
fast: 1 fast_c: 0.9999999988567956 best: None best_c: [1, 1, 1, 2]
ws: 1 ws_c: 0.999833333333334 best: [1, 2, 3, 4] best_c: [3, 1, 1, 1]
fast: 1 fast_c: 0.9999999789845687 best: None best_c: [1, 1, 2, 2]
ws: 1 ws_c: 0.9996666666666645 best: [1, 2, 3, 4] best_c: [4, 1, 1, 1]
fast: 1 fast_c: 0.9999998660332918 best: None best_c: [1, 2, 2, 2]
ws: 1 ws_c: 0.999444444444445 best: [1, 2, 3, 4] best_c: [5, 1, 1, 1]
fast: 1 fast_c: 0.9999992860845038 best: None best_c: [2, 2, 2, 2]
ws: 1 ws_c: 0.9991666666666645 best: [1, 2, 3, 4] best_c: [6, 1, 1, 1]
fast: 1 fast_c: 0.999997522402079 best: None best_c: [2, 2, 2, 3]
ws: 0.9999999999999987 ws_c: 0.9988333333333342 best: [1, 2, 3, 4] best_c: [7, 1, 1, 1]
fast: 1 fast_c: 0.9999931894258484 best: [1, 2, 3, 4] best_c: [2, 2, 3, 3]
ws: 0.9999999999810852 ws_c: 0.9984444444444441 best: [2, 2, 3, 4] best_c: [8, 1, 1, 1]
fast: 1.0 fast_c: 0.999982932037665 best: [1, 2, 3, 5] best_c: [2, 3, 3, 3]
ws: 0.9999999999432605 ws_c: 0.9980000000000002 best: [3, 2, 3, 4] best_c: [9, 1, 1, 1]
fast: 1.0 fast_c: 0.9999614134317659 best: [1, 2, 3, 6] best_c: [2, 3, 3, 4]
ws: 0.9999999998865183 ws_c: 0.9974999999999993 best: [4, 2, 3, 4] best_c: [10, 1, 1, 1]
fast: 1.0 fast_c: 0.9999200792544627 best: [1, 2, 3, 7] best_c: [2, 3, 4, 4]
ws: 0.9999999998108691 ws_c: 0.9921367521367547 best: [5, 2, 3, 4] best_c: [10, 1, 1, 2]
fast: 1.0 fast_c: 0.9998400267645521 best: [1, 2, 4, 7] best_c: [2, 3, 4, 5]
ws: 0.9999999997163019 ws_c: 0.9836369770580294 best: [6, 2, 3, 4] best_c: [10, 1, 1, 3]
fast: 1.0 fast_c: 0.9997087317388961 best: [1, 2, 4, 8] best_c: [2, 4, 4, 5]
ws: 0.9999999996028268 ws_c: 0.9718094612831454 best: [7, 2, 3, 4] best_c: [10, 1, 1, 4]
fast: 0.9999999999999999 fast_c: 0.9995066807966366 best: [1, 2, 4, 9] best_c: [3, 4, 4, 5]
ws: 0.9999999994704348 ws_c: 0.9565397887766312 best: [8, 2, 3, 4] best_c: [10, 1, 1, 5]
fast: 0.9999999999999992 fast_c: 0.9992064404749985 best: [1, 2, 5, 9] best_c: [3, 4, 5, 5]
ws: 0.9999999993191306 ws_c: 0.9377850725219145 best: [9, 2, 3, 4] best_c: [10, 1, 1, 6]
fast: 0.9999999999999958 fast_c: 0.9987578291088032 best: [1, 2, 5, 10] best_c: [3, 4, 5, 6]
ws: 0.9999999991489137 ws_c: 0.915568686095002 best: [10, 2, 3, 4] best_c: [10, 1, 1, 7]
fast: 0.9999999999999745 fast_c: 0.9980704752694506 best: [1, 2, 5, 11] best_c: [3, 4, 6, 6]
ws: 0.9809423942999241 ws_c: 0.005696151392875078 best: [10, 2, 8, 40] best_c: [7, 13, 18, 22]
fast: 0.9984282967061362 fast_c: 0.16107135673314985 best: [2, 10, 19, 29] best_c: [7, 12, 18, 23]
ws: 1 ws_c: 1 best: [0, 3, 2] best_c: [5, 2, 8]
fast: 1 fast_c: 1 best: None best_c: None
ws: 0.9999999999999991 ws_c: 1 best: [0, 3, 2] best_c: [5, 2, 8]
fast: 1 fast_c: 1 best: [0, 3, 2] best_c: None
ws: 0.9992124273532111 ws_c: 1 best: [0, 4, 2] best_c: [5, 2, 8]
fast: 0.9999999998166784 fast_c: 1 best: [0, 4, 2] best_c: None
ws: 0.9974471783173097 ws_c: 1 best: [0, 3, 4] best_c: [5, 2, 8]
fast: 0.9999999802958488 fast_c: 1 best: [0, 5, 2] best_c: None
ws: 0.9934753679865279 ws_c: 1 best: [0, 3, 5] best_c: [5, 2, 8]
fast: 0.9999995445513106 fast_c: 1 best: [0, 6, 2] best_c: None
ws: 0.9847170170007047 ws_c: 1 best: [0, 3, 6] best_c: [5, 2, 8]
fast: 0.9999950631446368 fast_c: 1 best: [0, 7, 2] best_c: None
ws: 0.9690334582586491 ws_c: 1 best: [0, 3, 7] best_c: [5, 2, 8]
fast: 0.9999663745711513 fast_c: 1 best: [0, 8, 2] best_c: None
ws: 0.9446636208286848 ws_c: 1 best: [0, 3, 8] best_c: [5, 2, 8]
fast: 0.9998332328548011 fast_c: 1 best: [0, 9, 2] best_c: None
ws: 0.9096534575901524 ws_c: 1 best: [0, 3, 9] best_c: [5, 2, 8]
fast: 0.9993435952582238 fast_c: 1 best: [0, 10, 2] best_c: None
ws: 0.8615152873035397 ws_c: 1 best: [0, 3, 10] best_c: [5, 2, 8]
fast: 0.9978334049132793 fast_c: 1 best: [0, 11, 2] best_c: None
ws: 0.7977494641581321 ws_c: 1 best: [0, 3, 11] best_c: [5, 2, 8]
fast: 0.9939841431087776 fast_c: 1 best: [0, 11, 3] best_c: None
ws: 0.7172269504435058 ws_c: 0.9999999999999961 best: [0, 3, 12] best_c: [5, 2, 8]
fast: 0.9872797224870071 fast_c: 1 best: [0, 12, 3] best_c: [5, 2, 8]
ws: 0.6217241051545268 ws_c: 0.9999999999999971 best: [0, 3, 13] best_c: [6, 2, 8]
fast: 0.9736250862338577 fast_c: 1.0 best: [0, 13, 3] best_c: [6, 2, 8]
ws: 0.5199586389671542 ws_c: 0.999999999999997 best: [2, 3, 12] best_c: [7, 2, 8]
fast: 0.946965994360294 fast_c: 1.0 best: [0, 14, 3] best_c: [7, 2, 8]
ws: 0.43870488123980333 ws_c: 0.9999999999999979 best: [7, 7, 4] best_c: [8, 2, 8]
fast: 0.9062534551334417 fast_c: 1.0 best: [0, 14, 4] best_c: [8, 2, 8]
ws: 0.37086615075619256 ws_c: 0.9999999999999977 best: [8, 7, 4] best_c: [9, 2, 8]
fast: 0.8457350377876189 fast_c: 1.0 best: [0, 15, 4] best_c: [9, 2, 8]
ws: 0 ws_c: 0 best: [0, 3, 2] best_c: [5, 2, 8]
fast: 0 fast_c: 0 best: None best_c: None

Reproduce tables from Wendell & Schmee#

# Wendell & Schmee lead example, p. 827
strata = [100, 100]
sams = [60, 40]
found = [1, 1]
good = 10 
p = .067081
alternative = 'upper'
print(p, strat_test(strata, sams, found, good, alternative=alternative),\
      strat_test_ws(strata, sams, found, good, alternative=alternative))

0.067081 (0.23485578913926813, [2, 8]) (0.06708103400254271, [2, 8])

# Table 2, row 1
strata = [200, 100]
sams = [50, 25] 
found = [0,0]
good = 15 
p = .01194
alternative = 'upper'
print(p, strat_test(strata, sams, found, good, alternative=alternative), \
      strat_test_ws(strata, sams, found, good, alternative=alternative))

0.01194 (0.06481873546070493, [10, 5]) (0.011942274979969247, [10, 5])

# Table 2, row 2
strata = [200, 100]
sams = [50,50] 
found = [1,0]
good = 15 
p = .07232
alternative = 'upper'
print(p, strat_test(strata, sams, found, good, alternative=alternative), \
      strat_test_ws(strata, sams, found, good, alternative=alternative))

0.07232 (0.26226855041248087, [15, 0]) (0.07231577125487271, [15, 0])

# Table 2, row 3
strata = [2000, 1000]
sams = [50,50]
found = [1,0]
good = 150 
p = .09958
alternative = 'upper'
print(p, strat_test(strata, sams, found, good, alternative=alternative), \
      strat_test_ws(strata, sams, found, good, alternative=alternative))

0.09958 (0.3292825209349363, [150, 0]) (0.09957652360481824, [150, 0])

# Table 2, row 9. 
strata = [300, 200, 100]
sams = [50,50,50]
found = [1, 1, 0]
good = 30 
p = .03775
alternative = 'upper'
print(p, strat_test(strata, sams, found, good, alternative=alternative),\
      strat_test_ws(strata, sams, found, good, alternative=alternative))

0.03775 (0.349110173834897, [25, 5, 0]) (0.037749475267350535, [24, 6, 0])

# Table 2, row 10. 
strata = [3000, 2000, 1000]
sams = [50,50,50]
found = [1, 1, 0]
good = 300 
p = .04902
alternative = 'upper'
print(p, strat_test(strata, sams, found, good, alternative=alternative),\
      strat_test_ws(strata, sams, found, good, alternative=alternative))

0.04902 (0.4006985692204492, [247, 53, 0]) (0.04902098371080246, [228, 72, 0])

# compare with Wendell & Schmee's R code.
# 
# > pmax.function(c(500,300,200),c(75,50,25),c(2,1,0),50)
#            [,1] [,2] [,3] [,4]
# [1,] 0.02767569   28   21    1
#
# indicating a maximum p-value of 0.02768 occurring under the null of 50 errors
# with those errors distributed as (28,21,1).
print(strat_test([500,300,200],[75,50,25],[2,1,0],50, alternative='upper'),\
      strat_test_ws([500,300,200],[75,50,25],[2,1,0],50, alternative='upper'))

(0.25567352840348917, [38, 12, 0]) (0.02767568706972813, [28, 21, 1])

# Table 3, Row 1
strata = [200, 100]
sams = [50,25]
found = [0, 0]
ub = 10
alternative = 'upper'
print(ub,
      strat_ci_bisect(strata, sams, found, alternative=alternative, p_value=strat_test_ws),
      strat_ci(strata, sams, found, alternative=alternative),
      strat_ci_wright(strata, sams, found, alternative=alternative))

10 (10, [6, 4]) (16, [11, 5]) 23

# Table 3, Row 7.
strata = [5000, 5000]
sams = [100,50]
found = [2,1] 
ub = 599
alternative = 'upper'
print(ub,
      strat_ci_bisect(strata, sams, found, alternative=alternative, p_value=strat_test_ws),
      strat_ci(strata, sams, found, alternative=alternative),
      strat_ci_wright(strata, sams, found, alternative=alternative))

599 (599, [2, 597]) (701, [124, 577]) 876

# Table 3, Row 10. WARNING: LONG RUN TIME
strata = [3000, 2000, 1000] 
sams = [50,50,50]
found = [1, 1, 0]
ub = 298
alternative = 'upper'
print(ub,
      strat_ci_bisect(strata, sams, found, alternative=alternative, p_value=strat_test_ws),
      strat_ci(strata, sams, found, alternative=alternative),
      strat_ci_wright(strata, sams, found, alternative=alternative))

298 (298, [226, 72, 0]) (499, [423, 76, 0]) 643

# Table 3, last row. LONG RUN TIME
strata = [5000, 3000, 2000]
sams = [75,50,25]
found = [2, 1, 0] 
ub = 471
alternative = 'upper'
print(ub, strat_ci(strata, sams, found, alternative=alternative),
     strat_ci_bisect(strata, sams, found, alternative=alternative, p_value=strat_test_ws),
     strat_ci_wright(strata, sams, found, alternative=alternative))

471 (716, [503, 162, 51]) (471, [258, 213, 0]) 1133

When is the new method sharper than Wendell & Schmee?#

In general, when there is large heterogeneity of rates across strata.

strata = [100, 100] sams = [30,30] found = [15,0] (68, [34, 34]) (68, [67, 1]) 74

strata = [100, 100] sams = [30,30] found = [20,0] (85, [43, 42]) (83, [79, 4]) 89

strata = [100, 100, 100] sams = [25, 25, 25] found = [20, 0, 0] (105, [35, 35, 35]) (102, [88, 7, 7]) 118

strata = [100, 100, 100, 100] sams = [25, 25, 25, 25] found = [10, 0, 0, 0]

strata = [100, 100, 100, 100] sams = [25,25,25,25] found = [20,0,0,0] Ran for a week without completing. Fast method took under a second. —- (107, [88, 6, 6, 7]) 131

strata = [100, 100, 100, 100]
sams = [25,25,25,25]
found = [20,0,0,0] 
alternative = 'upper'
print(
#      strat_ci_bisect(strata, sams, found, alternative=alternative, p_value=strat_test_ws), # takes forever
      strat_ci(strata, sams, found, alternative=alternative),
      strat_ci_wright(strata, sams, found, alternative=alternative))

(107, [88, 6, 6, 7]) 131

strata = [100, 100, 100]
sams = [25, 25, 25]
found = [10, 0, 0]
alternative = 'upper'
print(strat_ci_bisect(strata, sams, found, alternative=alternative, p_value=strat_test_ws),
      strat_ci(strata, sams, found, alternative=alternative),
      strat_ci_wright(strata, sams, found, alternative=alternative))

(61, [20, 21, 20]) (67, [60, 3, 4]) 86

strata = [100]*100
sams = list(range(5,55))*2
found = [1]*100

alternative = 'upper'
print(\
      strat_ci(strata, sams, found, alternative=alternative),\
      strat_ci_wright(strata, sams, found, alternative=alternative))

(1384, [76, 70, 64, 58, 53, 47, 42, 37, 32, 28, 24, 20, 17, 14, 12, 10, 9, 8, 7, 6, 5, 4, 4, 4, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 76, 70, 64, 58, 53, 47, 42, 37, 32, 28, 24, 20, 17, 14, 12, 10, 9, 8, 7, 6, 5, 4, 4, 4, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]) 3198

# test empirical coverage
reps = 1000
cl = 0.95
alternative = 'upper'
strata = [5000, 3000, 2000]
sams = [75,50,25]
good = [100, 100, 500]
g = np.sum(good)
cover = 0
verb = False
for i in range(reps):
    found = sp.stats.hypergeom.rvs(strata, good, sams)
    ub = strat_ci(strata, sams, found, alternative=alternative, cl=cl)
    if verb: 
        print("f: {} ub: {} best: {}".format(found, ub[0], ub[1]))
    cover = cover+1 if ub[0] >= g else cover
    if i % 100 == 1:
        print(i+1, cover, cover/(i+1))
cover/reps

2 1.0
102 1.0
202 1.0
302 1.0
401 0.9975124378109452
501 0.99800796812749
601 0.9983388704318937
701 0.9985754985754985
801 0.9987531172069826
901 0.9988913525498891

0.999

An Idiosyncratic Subset of Statistics

Inference about binary populations from stratified samples

Contents