Supermartingale-based Tests#
Martingales, supermartingales, submartingales#
Let \((X_j)_{j \in \mathbb{N}} = X_1, X_2, \ldots\) and \((Y_j)_{j \in \mathbb{N}} = Y_1, Y_2, \ldots\) be sequences of random variables (stochastic processes). Let \(Y^{i} := (Y_1, \ldots, Y_i)\) be the first \(i\) elements of the sequence \((Y_j)_{j \in \mathbb{N}}\). Suppose \(\mathbb{E}|X_j| < \infty\) for all \(j \in \mathbb{N}\).
Then \((X_j)_{j \in \mathbb{N}}\) is a martingale with respect to \((Y_j)_{j \in \mathbb{N}}\) if

\[ \mathbb{E}(X_{j+1} | Y^j) = X_j \;\; \mbox{for all } j \in \mathbb{N}. \]

It is a supermartingale if

\[ \mathbb{E}(X_{j+1} | Y^j) \le X_j \;\; \mbox{for all } j \in \mathbb{N}, \]

and a submartingale if

\[ \mathbb{E}(X_{j+1} | Y^j) \ge X_j \;\; \mbox{for all } j \in \mathbb{N}. \]
A martingale, supermartingale, or submartingale is nonnegative if \(\mathbb{P} \{X_j \ge 0 \} = 1\) for all \(j \in \mathbb{N}\).
Terminology: A sequence \((\lambda_i)_{i \in \mathbb{N}}\) is predictable with respect to \((Y_j)_{j \in \mathbb{N}}\) if \(\lambda_j\) does not depend on \((Y_i)_{i \ge j}\), i.e., \(\lambda_j\) is determined by \(Y^{j-1}\).
Examples#
Product of independent random variables with expected value 1.#
Suppose \((Y_j)_{j \in \mathbb{N}}\) are independent and have expected value 1. Define \(X_j := \prod_{i=1}^j Y_i\), \(j \in \mathbb{N}\). Then

\[ \mathbb{E}(X_{j+1} | Y^j) = \mathbb{E}(X_j Y_{j+1} | Y^j) = X_j \, \mathbb{E} Y_{j+1} = X_j, \]

so \((X_j)_{j \in \mathbb{N}}\) is a martingale with respect to \((Y_j)_{j \in \mathbb{N}}\).
Note that if \((Z_i)\) are independent random variables with \(\mathbb{E}Z_i \ne 0\), then \(\prod_{i=1}^j Z_i/\mathbb{E}Z_i\) is a martingale.
Sum of independent random variables with expected value 0.#
Suppose \((Y_j)_{j \in \mathbb{N}}\) are independent and have expected value 0. Define \(X_j := \sum_{i=1}^j Y_i\), \(j \in \mathbb{N}\). Then

\[ \mathbb{E}(X_{j+1} | Y^j) = X_j + \mathbb{E} Y_{j+1} = X_j, \]

so \((X_j)_{j \in \mathbb{N}}\) is a martingale with respect to \((Y_j)_{j \in \mathbb{N}}\).
Gambler’s fortune in repeated bets on independent fair coin tosses.#
Suppose we bet on the outcomes of a sequence of independent fair coin tosses \(Y_1, Y_2, \ldots\), i.e., \(\{Y_j\}_{j \in \mathbb{N}}\) are IID \(\mbox{Bernoulli}(1/2)\). Our initial fortune is \(X_0\); our fortune after the \(j\)th wager is \(X_j\). We are not allowed to wager more than our current fortune on the next bet. If \(X_i\) is zero then \(X_k = 0\) for all \(k > i\): we can’t bet if we go broke. The wager on the \(j\)th toss is \(b_j \in [0, X_{j-1}]\); \(b_j\) may depend on \(Y^{j-1}\). If \(Y_j=1\), we win the \(j\)th bet and our fortune increases by \(b_j\); if \(Y_j=0\), we lose the \(j\)th bet and our fortune decreases by \(b_j\). Then \(X_j = X_{j-1}+b_jY_j - b_j(1-Y_j)\).
Since \(b_j\) is determined by \(Y^{j-1}\) and \(\mathbb{E}(Y_j) = 1/2\), \(\mathbb{E}(X_j | Y^{j-1}) = X_{j-1} + b_j/2 - b_j/2 = X_{j-1}\). Thus the fortune \((X_j)_{j \in \mathbb{N}}\) is a nonnegative martingale with respect to the Bernoulli coin tosses \((Y_j)_{j \in \mathbb{N}}\).
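The martingale property can be checked numerically. Below is a minimal simulation (an illustration, not part of the original notebook; the rule “bet half the current fortune” is an arbitrary choice): the average fortune across many gamblers stays near the initial fortune.

```python
import numpy as np

rng = np.random.default_rng(12345)

def gambler_fortune(n_tosses: int, x0: float = 1.0) -> np.ndarray:
    """Fortune after each of n_tosses fair-coin bets, wagering half the current fortune."""
    fortune = np.empty(n_tosses + 1)
    fortune[0] = x0
    for j in range(1, n_tosses + 1):
        b = fortune[j - 1] / 2                 # wager depends only on the past
        win = rng.integers(0, 2) == 1          # fair coin toss
        fortune[j] = fortune[j - 1] + (b if win else -b)
    return fortune

paths = np.array([gambler_fortune(20) for _ in range(10_000)])
print(paths[:, -1].mean())                     # close to 1.0: expected fortune equals the initial fortune
```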
Likelihood ratios#
We have a family \(\mathcal{P} := \{ \mathbb{P}_\theta : \theta \in \Theta \}\) of probability distributions on a measurable space \(\mathcal{X}\), dominated by a common \(\sigma\)-finite measure \(\mu\). Let \(p_\theta(x)\), \(x \in \mathcal{X}\), denote the density of \(\mathbb{P}_\theta\) with respect to \(\mu\). For fixed \(x \in \mathcal{X}\), the function

\[ \mathcal{L}(\eta) := p_\eta(x), \;\; \eta \in \Theta, \]
is the likelihood. If we observe \(X=x\), then for \(\eta \in \Theta\), \(\mathcal{L}(\eta) = \mathcal{L}(\eta | X=x)\) is the likelihood of \(\eta\) given \(X=x\). The likelihood ratio of \(\eta\) to \(\nu\) given \(X=x\) is \(\mathcal{L}(\eta | X=x)/\mathcal{L}(\nu | X=x)\).
Suppose \(\mathbb{P}_1\) and \(\mathbb{P}_2\) are distributions on some outcome space and that they have densities with respect to some dominating measure \(\mu\). Let \(f\) and \(g\) denote those densities. Suppose \((Y_j)_{j \in \mathbb{N}}\) are IID with density \(f\). Define the likelihood ratio process

\[ X_j := \prod_{i=1}^j \frac{g(Y_i)}{f(Y_i)}, \;\; j \in \mathbb{N}. \]
Now

\[ \mathbb{E}(X_{j+1} | Y^j) = X_j \, \mathbb{E}\frac{g(Y_{j+1})}{f(Y_{j+1})} = X_j \int g \, d\mu = X_j, \]

since \(g\) is a probability density with respect to \(\mu\) (assuming \(f > 0\) wherever \(g > 0\)).
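Here is a short numerical sketch (illustrative; the two Bernoulli densities are arbitrary choices): when the data are drawn from \(f\), the likelihood ratio process is a nonnegative martingale with expected value 1.

```python
import numpy as np

rng = np.random.default_rng(2024)

# Two Bernoulli densities with respect to counting measure on {0, 1}
f = lambda y: 0.5                            # density of the data-generating distribution
g = lambda y: 0.7 if y == 1 else 0.3         # density of the alternative

y = rng.integers(0, 2, size=1000)            # IID draws with density f
lr = np.cumprod([g(yi) / f(yi) for yi in y]) # likelihood ratio martingale under f
print(lr[-1], lr.max())                      # the process tends to decay; large excursions are rare
```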
Prior-posterior ratio (PPR) martingale#
See Waudby-Smith and Ramdas (2021).
Again, we have a family of distributions: let \(\Theta\) be countable and suppose \(\{\mathbb{P}_\theta : \theta \in \Theta\}\) are dominated by a common \(\sigma\)-finite measure \(\mu\), where \(p_\theta\) is the density of \(\mathbb{P}_\theta\) with respect to \(\mu\). Let \(\pi\) (the prior) be a probability distribution on \(\Theta\) that assigns positive probability to every \(\eta \in \Theta\). We observe \((X_i)_{i \in \mathbb{N}}\) sequentially, with \(X_1 \sim p_\theta\) and \(X_{j+1} \sim p_\theta(x | X^j)\), for all \(j \in \mathbb{N}\), for some fixed \(\theta \in \Theta\). The predictive distribution of \(x^j = (x_1, \ldots, x_j)\) is

\[ m(x^j) := \sum_{\eta \in \Theta} \pi(\eta) \prod_{i=1}^j p_\eta(x_i | x^{i-1}). \]
The posterior distribution of \(\theta\) given \(X^j=x^j\) is

\[ \pi(\eta | x^j) := \frac{\pi(\eta) \prod_{i=1}^j p_\eta(x_i | x^{i-1})}{m(x^j)}, \;\; \eta \in \Theta. \]
Consider the sequence of prior-posterior ratios

\[ Y_j(\eta) := \frac{\pi(\eta)}{\pi(\eta | X^j)}, \;\; j \in \mathbb{N}. \]
Then \(Y_j\) is nonnegative. Suppose the true value of \(\theta\) is \(\theta^*\) and that \(\pi\) assigns positive mass to \(\theta^*\). Then \((Y_j(\theta^*))\) is a nonnegative martingale with expected value 1.
Proof: [To do]
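A minimal numerical sketch of the PPR martingale (not from the original text; the three-point parameter set, uniform prior, and IID Bernoulli model are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(7)

thetas = np.array([0.3, 0.5, 0.7])       # a countable (here, finite) parameter set
prior = np.full(3, 1 / 3)                # prior assigning positive mass to every theta
true_idx = 1                             # data generated from theta = 0.5

x = rng.binomial(1, thetas[true_idx], size=500)

posterior = prior.copy()
ppr = np.empty(len(x))
for j, xj in enumerate(x):
    lik = thetas**xj * (1 - thetas)**(1 - xj)       # p_theta(x_j) for each theta
    posterior = posterior * lik
    posterior /= posterior.sum()                    # posterior after j+1 observations
    ppr[j] = prior[true_idx] / posterior[true_idx]  # prior-posterior ratio at the true theta

print(ppr[:5], ppr[-1])   # a nonnegative martingale with expected value 1 under the true theta
```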
General betting martingales#
Waudby-Smith and Ramdas (2021) consider betting martingales of the form

\[ Y_j := \prod_{i=1}^j \left( 1 + \lambda_i (X_i - \eta) \right), \;\; j \in \mathbb{N}, \]

with \(Y_0 := 1\), \(X_i \in [0, 1]\), and \(\lambda_i \in [-1/(1-\eta), 1/\eta]\) predictable.
The gambler starts with a fortune of 1 and bets on the value of \(\theta_i = \mathbb{E}X_i\): a positive \(\lambda_i\) pays off if \(X_i > \eta\), and a negative \(\lambda_i\) pays off if \(X_i < \eta\). The constraint on \(\lambda_i\) ensures the gambler can’t go into debt: the fortune is nonnegative.
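A sketch of such a betting martingale (illustrative; the fixed bet \(\lambda_i = 0.5\), the hypothesized mean \(\eta = 0.5\), and the uniform data are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(99)

def betting_martingale(x: np.ndarray, eta: float, lam: np.ndarray) -> np.ndarray:
    """Y_j = prod_{i<=j} (1 + lam_i (x_i - eta)); lam_i must lie in [-1/(1-eta), 1/eta]."""
    return np.cumprod(1 + lam * (x - eta))

x = rng.uniform(0.4, 1.0, size=1000)        # [0, 1]-valued data with mean 0.7
lam = np.full(len(x), 0.5)                  # a fixed (hence predictable) bet
y = betting_martingale(x, eta=0.5, lam=lam)
print(y[-1])                                # grows when the true mean exceeds eta
```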
Waudby-Smith and Ramdas Proposition 3:
For any arbitrary set of (possibly unbounded) distributions \(\Theta\), every \(\theta\)-nonnegative martingale is necessarily of the form

\[ M_t = \prod_{i=1}^t \left( 1 + \lambda_i Z_i \right) \]

for some predictable \(\lambda_i \le 1\) and some \(Z_i \ge -1\), \(\mathbb{P}_\theta\)-almost surely, with \(\mathbb{E}_{\theta}(Z_i | X^{i-1}) = 0\) for every \(\theta \in \Theta\). The same statement holds for nonnegative supermartingales, with \(= 0\) replaced by \(\le 0\).
The empirical Bernstein martingale#
Sampling with or without replacement from a finite population#
Suppose we have a population \(\{x_i\}_{i=1}^N\). Let \(\theta = \bar{x} := \frac{1}{N} \sum_{i=1}^N x_i\) be the population mean. Let \(X_k\) be the value of the item selected on the \(k\)th draw. Let \(X^j := (X_1, \ldots, X_j)\). Assume \(X_j \in [0, u_j]\) for some known \(u_j\).
Let \(\theta_j := \mathbb{E}(X_j | X^{j-1})\). For sampling with replacement, \(\theta_j = \theta\). Define \(S_j := \sum_{i=1}^j X_i\). For sampling without replacement, \(\theta_j = (N\theta - S_{j-1})/(N-j+1)\): the population total was originally \(N\theta\), from which items totalling \(S_{j-1}\) have been removed, leaving \(N-j+1\) items in the population just before the \(j\)th draw.
Then \(\sum_{i=1}^j (X_i-\theta_i)\) and \(\prod_{i=1}^j X_i/\theta_i\) (provided \(\theta_i \ne 0\)) are martingales.
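Here is a small sketch of those two martingales for sampling without replacement (illustrative; the six-item population is an arbitrary choice), using the conditional means \(\theta_j\) just defined:

```python
import numpy as np

rng = np.random.default_rng(3)

pop = np.array([0.1, 0.2, 0.4, 0.6, 0.8, 1.0])   # a small population of bounded values
N, theta = len(pop), pop.mean()

x = rng.permutation(pop)                          # draws without replacement, in random order
S = np.insert(np.cumsum(x), 0, 0)[:-1]            # S_{j-1} = x_1 + ... + x_{j-1}
j = np.arange(1, N + 1)
theta_j = (N * theta - S) / (N - j + 1)           # conditional mean of X_j given the past

sum_mart = np.cumsum(x - theta_j)                 # martingale with expected value 0 at every j
prod_mart = np.cumprod(x / theta_j)               # martingale with expected value 1 at every j (theta_j != 0 here)
print(sum_mart)
print(prod_mart)
```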
Properties of martingales, supermartingales, and submartingales#
If \((X_j)_{j \in \mathbb{N}}\) is a martingale, then \(\mathbb{E} X_j = \mathbb{E} X_1\).
If \((X_j)_{j \in \mathbb{N}}\) is a supermartingale, then \(\mathbb{E} X_j \le \mathbb{E} X_1\).
If \((X_j)_{j \in \mathbb{N}}\) is a submartingale, then \(\mathbb{E} X_j \ge \mathbb{E} X_1\).
Proof. Suppose \((X_j)_{j \in \mathbb{N}}\) is a martingale w.r.t. \((Y_j)_{j \in \mathbb{N}}\). Then, by the law of total expectation,

\[ \mathbb{E} X_j = \mathbb{E} \left( \mathbb{E} (X_j | Y^{j-1}) \right) = \mathbb{E} X_{j-1}. \]
Continuing, we end up with \(\mathbb{E}X_1\).
For supermartingales and submartingales, \(=\) is replaced with \(\le\) or \(\ge\), respectively. \(\Box\).
A concave (convex) function of a martingale is a super- (sub-)martingale. (Proof by Jensen’s inequality.)
Stopping times#
Let \(\tau\) be a measurable mapping from \(X_1, X_2, \ldots\) to \(\mathbb{N} \cup \{\infty\}\) and suppose that for \(j=1, 2, \ldots\), the event \(\{\tau = j\}\) depends only on \(X^j = (X_1, \ldots, X_j)\), and not on \(X_k\) for \(k>j\). (The usual definition of stopping times involves filtrations, which are beyond the scope of these materials.) Then \(\tau\) is a stopping time for \((X_j)_{j \in \mathbb{N}}\).
In the example where \(X_j\) represents a sequence of wagers, some stopping rules are:
stop after 10 bets
stop when you go broke
stop when you double your initial stake
stop if you have won three bets in a row
These are not stopping rules:
stop when your fortune is as high as it will ever get
stop if you would lose the next three bets
If \(X_1, X_2, \ldots\) is a (super | sub)martingale and \(\tau\) is a stopping time for it, what can we say about \(\mathbb{E} X_\tau\)?
Doob’s Optional Stopping Theorem (1953)#
Suppose \(\tau\) is a stopping time for the stochastic process \((X_j)_{j \in \mathbb{N}}\) and that any of the following conditions holds:
\(\mathbb{P} \{\tau \le c \} = 1\) for some \(c \in \mathbb{N}\)
\(\mathbb{E} \tau < \infty\) and \( | X_j - X_{j-1}| \le c\) for all \(j\) and some \(c\)
\(|X_{j}| \le c \) for all \(j \in \mathbb{N}\) for some \(c\) and \(\mathbb{P} \{ \tau < \infty \}=1\).
Then if \((X_j)_{j \in \mathbb{N}}\) is a martingale, \(\mathbb{E} X_\tau = \mathbb{E} X_1\); if it is a supermartingale, \(\mathbb{E} X_\tau \le \mathbb{E} X_1\).
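A quick Monte Carlo sanity check of optional stopping (illustrative; the “double or go broke” rule and the horizon of 50 bets are arbitrary choices). The stopping time is bounded by 50, so the first condition of the theorem holds and the expected stopped fortune equals the initial fortune:

```python
import numpy as np

rng = np.random.default_rng(11)

def stopped_fortune(x0: float = 1.0, horizon: int = 50) -> float:
    """Bet half the current fortune on fair coin tosses; stop on doubling, (nearly) going broke, or at the horizon."""
    fortune = x0
    for _ in range(horizon):
        b = fortune / 2
        fortune += b if rng.integers(0, 2) == 1 else -b
        if fortune >= 2 * x0 or fortune < 1e-6:    # the stopping rule depends only on the past
            break
    return fortune

samples = np.array([stopped_fortune() for _ in range(20_000)])
print(samples.mean())                              # close to 1.0, as optional stopping predicts
```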
Doob’s Martingale Convergence Theorem#
Suppose \((X_j)_{j \in \mathbb{N}}\) is a martingale and \(\sup_{j \in \mathbb{N}} \mathbb{E}\left[ (-X_j) \vee 0 \right] < \infty\), i.e., the expected negative part of \(X_j\) is uniformly bounded. Then \((X_j)\) converges almost surely to a random variable \(X_\infty\) with \(\mathbb{E} |X_\infty| < \infty\).
Ville’s Inequality (1939)#
Let \((X_j)_{j \in \mathbb{N}}\) be a nonnegative (super)martingale with respect to \((Y_j)_{j \in \mathbb{N}}\). Then for any \(x > 0\),

\[ \mathbb{P} \left\{ \sup_{j \in \mathbb{N}} X_j \ge x \right\} \le \frac{\mathbb{E} X_1}{x}. \]
Proof. (informal: can be made rigorous using Doob’s martingale convergence theorem)
Define the stopping time \(\tau(k)\) to be the first \(j \le k\) such that \(X_j \ge x > 0\), or \(k\) if there is no such \(j\). For each \(k\), \(\tau(k)\) is a bounded stopping time, so \(\mathbb{E} X_{\tau(k)} \le \mathbb{E} X_1\) (with equality for a martingale) by optional stopping. Apply Markov’s inequality to \(X_{\tau(k)}\) and let \(k \rightarrow \infty\).
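A Monte Carlo check of the bound (illustrative; it reuses the coin-tossing fortune from the earlier example, with the arbitrary threshold \(x = 10\)):

```python
import numpy as np

rng = np.random.default_rng(5)

def max_fortune(n_tosses: int = 200) -> float:
    """Running maximum of a nonnegative martingale: bet half the stake on fair coins, starting at 1."""
    wins = rng.integers(0, 2, size=n_tosses)
    fortune = np.cumprod(np.where(wins == 1, 1.5, 0.5))
    return fortune.max()

x = 10.0
crossing_freq = np.mean([max_fortune() >= x for _ in range(20_000)])
print(crossing_freq, "<=", 1 / x)    # Ville: P{sup_j X_j >= 10} <= E X_1 / 10 = 0.1
```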
The Azuma-Hoeffding inequality#
Let \((X_j)_{j \in \mathbb{N}}\) be a supermartingale and suppose that there are finite constants \(\{c_k\}_{k \in \mathbb{N}}\) such that \(|X_k - X_{k-1}| \le c_k\) for all \(k > 1\).
Then for all \(t \in \mathbb{N}\) and all \(\epsilon > 0\),

\[ \mathbb{P} \left\{ X_t - X_1 \ge \epsilon \right\} \le \exp \left( \frac{-\epsilon^2}{2 \sum_{k=2}^t c_k^2} \right). \]
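For concreteness, a small numerical evaluation of the bound (illustrative; the \(\pm 1\) fair random walk and the choices \(t = 100\), \(\epsilon = 20\) are arbitrary):

```python
import numpy as np

# Fair +/-1 random walk: a martingale whose increments are bounded by c_k = 1
t, eps = 100, 20.0
bound = np.exp(-eps**2 / (2 * (t - 1)))       # Azuma-Hoeffding tail bound for X_t - X_1
print(bound)                                   # about 0.13

rng = np.random.default_rng(8)
increments = rng.choice([-1, 1], size=(50_000, t - 1)).sum(axis=1)   # realizations of X_t - X_1
print((increments >= eps).mean())              # empirical tail probability, below the bound
```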
Supermartingale tests#
Wald’s SPRT#
See SPRT
ALPHA martingales#
Analogous to the SPRT for the Bernoulli \(p\).
Bernoulli SPRT: Observe \(\{X_j \}_{j \in \mathbb{N}}\) IID Bernoulli(\(p\)). The null hypothesis is that \(p = \mu\); the alternative hypothesis is that \(p = \eta\), with \(\eta > \mu\). After \(j\) observations, the likelihood ratio of \(\eta\) to \(\mu\) is

\[ T_j := \prod_{i=1}^j \frac{\eta^{X_i}(1-\eta)^{1-X_i}}{\mu^{X_i}(1-\mu)^{1-X_i}} = \prod_{i=1}^j \left( X_i \frac{\eta}{\mu} + (1 - X_i) \frac{1-\eta}{1-\mu} \right). \]
Claim: if \(p \le \mu\), \((T_j)_{j\in\mathbb{N}}\) is a nonnegative supermartingale.
We will prove something more general. We start by developing a one-sided test of the simple hypothesis \(\theta = \mu\), then show that the \(P\)-value is monotone in \(\mu\), so the test is valid for the composite hypothesis \(\theta \le \mu\). Let \(X^j := (X_1, \ldots, X_j)\). Assume \(X_i \in [0, u]\) for some known \(u\).
Let \(\mu_j := \mathbb{E}(X_j | X^{j-1})\) computed under the null hypothesis \(\theta = \mu\). Let \(\eta_j = \eta_j(X^{j-1}) \in (\mu_j, u]\), \(j=1, \ldots\), be a predictable sequence in the sense that \(\eta_j\) may depend on \(X^{j-1}\), but not on \(X_k\) for \(k \ge j\).
Let \(T_0 := 1\) and

\[ T_j := T_{j-1} \cdot \frac{1}{u} \left( X_j \frac{\eta_j}{\mu_j} + (u - X_j) \frac{u - \eta_j}{u - \mu_j} \right), \;\; j \in \mathbb{N}. \]
This can be rearranged to yield

\[ T_j = T_{j-1} \left( \frac{X_j}{\mu_j} \cdot \frac{\eta_j - \mu_j}{u - \mu_j} + \frac{u - \eta_j}{u - \mu_j} \right). \]
Equivalently,

\[ T_j = T_{j-1} \left( 1 + \frac{\eta_j - \mu_j}{\mu_j (u - \mu_j)} \left( X_j - \mu_j \right) \right), \]

a betting martingale with \(\lambda_j = \frac{\eta_j - \mu_j}{\mu_j (u - \mu_j)}\).
Under the null hypothesis that \(\theta_j = \mu_j\), \(T_j\) is nonnegative since \(X_j\), \(\mu_j\), and \(\eta_j\) are all in \([0, u]\). Also,

\[ \mathbb{E}(T_j | X^{j-1}) = T_{j-1} \left( \frac{\mu_j}{\mu_j} \cdot \frac{\eta_j - \mu_j}{u - \mu_j} + \frac{u - \eta_j}{u - \mu_j} \right) = T_{j-1}. \]
Thus if \(\theta = \mu\), \((T_j)_{j \in \mathbb{N}}\) is a nonnegative martingale with respect to \((X_j)_{j \in \mathbb{N}}\), starting at \(1\).
If \(\theta < \mu\), then \(\mathbb{E} (X_j | X^{j-1}) < \mu_j\) and \(r_j = \frac{\mathbb{E} (X_j | X^{j-1})}{\mu_j} < 1\), so

\[ \mathbb{E}(T_j | X^{j-1}) = T_{j-1} \left( r_j \frac{\eta_j - \mu_j}{u - \mu_j} + \frac{u - \eta_j}{u - \mu_j} \right) \le T_{j-1}, \]

since \(\eta_j > \mu_j\).
Thus \((T_j)\) is a nonnegative supermartingale starting at 1 if \(\theta \le \mu\). It follows from Ville’s inequality that if \(\theta \le \mu\), then for any \(\alpha \in (0, 1)\),

\[ \mathbb{P} \left\{ \sup_{j \in \mathbb{N}} T_j \ge 1/\alpha \right\} \le \alpha. \]
That is, \(\min(1, 1/T_j)\) is an “anytime \(P\)-value” for the composite null hypothesis \(\theta \le \mu\).
Note that the derivation did not use any information about \(\{x_i\}\) other than \(x_i \in [0, u]\): it applies to populations \(\{x_i\}\) that are nonnegative and bounded, not merely binary populations.
Sampling without replacement#
To use ALPHA with a sample drawn without replacement, we need \(\mathbb{E}(X_j | X^{j-1})\) computed on the assumption that \(\theta := \frac{1}{N} \sum_{i=1}^N x_i = \mu\). For sampling without replacement from a population with mean \(\mu\), after draw \(j-1\), the mean of the remaining numbers is \((N\mu - \sum_{k=1}^{j-1}X_k)/(N-j+1)\). Thus the conditional expectation of \(X_j\) given \(X^{j-1}\) under the null is \((N\mu - \sum_{k=1}^{j-1}X_k)/(N-j+1)\). If \(N\mu - \sum_{k=1}^{j-1}X_k < 0\) for any \(j\), the null hypothesis \(\theta = \mu\) is certainly false.
The Empirical Bernstein martingale test#
See Howard et al. (2021), Theorem 4.1, and Waudby-Smith and Ramdas (2021).
Suppose \(X_t \in [a, b]\) almost surely for all \(t\). Define
and
Let \((\hat{X}_t)\) be any \([a, b]\)-valued predictable sequence, and let \(\omega > 0\) be any sub-exponential uniform boundary with crossing probability \(\alpha\) for scale \(c = b − a\).
\[ M_t^{\mathrm{PrPl-EB}}(m) := \prod_{i=1}^t \exp \left\{ \lambda_i (X_i - m) - v_i \psi_E(\lambda_i) \right\}, \]

where, following Howard et al. (2021), \(v_i := 4 (X_i - \hat{\mu}_{i-1})^2\) and \(\psi_E(\lambda) := (-\log(1 - \lambda) - \lambda)/4\) for \(\lambda \in [0, 1)\).
Using \(M_t^{\mathrm{PrPl-EB}}(m)\) in Step (b) of their Theorem 1, and applying the same procedure with \((X_t)_{t=1}^\infty\) and \(m\) replaced by \((-X_t)_{t=1}^\infty\) and \(-m\), combined with a union bound over the resulting confidence sequences, gives the following.

Theorem 2 (Predictable plug-in empirical Bernstein CS [PrPl-EB]). Suppose \((X_t)_{t=1}^\infty \sim P\) for some \(P \in \mathcal{P}_\mu\). For any \((0, 1)\)-valued predictable \((\lambda_t)_{t=1}^\infty\),

\[ C_t^{\mathrm{PrPl-EB}} := \frac{\sum_{i=1}^t \lambda_i X_i}{\sum_{i=1}^t \lambda_i} \pm \frac{\log(2/\alpha) + \sum_{i=1}^t v_i \psi_E(\lambda_i)}{\sum_{i=1}^t \lambda_i} \]

forms a \((1 - \alpha)\)-CS for \(\mu\), as does its running intersection, \(\bigcap_{i \le t} C_i^{\mathrm{PrPl-EB}}\). In particular, Waudby-Smith and Ramdas recommend the predictable plug-in \((\lambda_t^{\mathrm{PrPl-EB}})_{t=1}^\infty\) given by

\[ \lambda_t^{\mathrm{PrPl-EB}} := \sqrt{\frac{2 \log(2/\alpha)}{\hat{\sigma}_{t-1}^2 \, t \log(1 + t)}} \wedge c, \qquad \hat{\sigma}_t^2 := \frac{1/4 + \sum_{i=1}^t (X_i - \hat{\mu}_i)^2}{t + 1}, \qquad \hat{\mu}_t := \frac{1/2 + \sum_{i=1}^t X_i}{t + 1}. \]
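Below is a sketch of this predictable plug-in empirical Bernstein confidence sequence, coded directly from the formulas above (an illustration, not the implementation in the confseq or rilacs packages; data in \([0, 1]\), the cap \(c = 1/2\), and \(\alpha = 0.05\) are assumptions):

```python
import numpy as np

def prpl_eb_cs(x: np.ndarray, alpha: float = 0.05, c: float = 1/2):
    """Predictable plug-in empirical Bernstein CS for the mean of [0, 1]-valued data."""
    t = np.arange(1, len(x) + 1)
    mu_hat = (1/2 + np.cumsum(x)) / (t + 1)                    # \hat{mu}_t
    sigma2_hat = (1/4 + np.cumsum((x - mu_hat)**2)) / (t + 1)  # \hat{sigma}^2_t
    sigma2_prev = np.insert(sigma2_hat[:-1], 0, 1/4)           # \hat{sigma}^2_{t-1}
    lam = np.minimum(np.sqrt(2 * np.log(2/alpha) / (sigma2_prev * t * np.log(1 + t))), c)
    mu_prev = np.insert(mu_hat[:-1], 0, 1/2)                   # \hat{mu}_{i-1}
    v = 4 * (x - mu_prev)**2
    psi = (-np.log(1 - lam) - lam) / 4                         # psi_E(lambda)
    center = np.cumsum(lam * x) / np.cumsum(lam)
    width = (np.log(2/alpha) + np.cumsum(v * psi)) / np.cumsum(lam)
    return center - width, center + width

rng = np.random.default_rng(42)
x = rng.beta(2, 5, size=2000)            # [0, 1]-valued data with mean 2/7
lo, hi = prpl_eb_cs(x)
print(lo[-1], hi[-1])                    # the interval should bracket 2/7
```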
import math
import scipy as sp
from scipy.stats import bernoulli, uniform
import numpy as np
from numpy.testing import assert_allclose
import json
np.random.seed(123456789)
# !pip install rilacs
from rilacs.strategies import linear_gamma_dist
import pytest
from rilacs.martingales import (
apriori_Kelly_martingale,
distKelly_martingale,
sqKelly_martingale,
dKelly_martingale,
)
import itertools
def sprt_mart(x : np.array, N : int, mu : float=1/2, eta: float=1-np.finfo(float).eps, \
u: float=1, random_order = True):
    '''
    Finds the SPRT-like supermartingale sequence for the hypothesis that the population
    mean is less than or equal to mu, against the alternative that it is eta,
    for a population of size N of values in the interval [0, u].

    Generalizes Wald's SPRT for the Bernoulli to sampling without replacement and to bounded
    values rather than binary values.

    If N is finite, assumes the sample is drawn without replacement.
    If N is infinite, assumes the sample is drawn with replacement.

    Data are assumed to be in random order. If not, the calculation for sampling
    without replacement is incorrect.

    Parameters
    ----------
    x : np.array
        values in [0, u], one element per draw
    N : int
        population size for sampling without replacement, or np.inf for
        sampling with replacement
    mu : float in (0, u)
        hypothesized population mean
    eta : float in (0, u)
        alternative hypothesized population mean
    u : float
        known upper bound on the values
    random_order : Boolean
        True if the data are in random order; must be True for sampling without replacement

    Returns
    -------
    terms : np.array
        sequence T_j that is a nonnegative supermartingale under the null;
        min(1, 1/T_j) is an anytime-valid P-value
    '''
if any((xx < 0 or xx > u) for xx in x):
raise ValueError(f'Data out of range [0,{u}]')
if np.isfinite(N):
if not random_order:
raise ValueError("data must be in random order for samples without replacement")
S = np.insert(np.cumsum(x),0,0)[0:-1] # 0, x_1, x_1+x_2, ...,
j = np.arange(1,len(x)+1) # 1, 2, 3, ..., len(x)
m = (N*mu-S)/(N-j+1) # mean of population after (j-1)st draw, if null is true
    else:
        m = np.full(len(x), mu)   # constant conditional mean for sampling with replacement
    with np.errstate(divide='ignore', invalid='ignore'):
        terms = np.cumprod((x*eta/m + (u-x)*(u-eta)/(u-m))/u) # generalization of Bernoulli SPRT
    terms[m < 0] = np.inf         # the null is certainly false once the remaining population mean is negative
return terms
def shrink_trunc(x: np.array, N: int, mu: float=1/2, nu: float=1-np.finfo(float).eps, u: float=1, c: float=1/2,
d: float=100) -> np.array:
    '''
    Apply the shrinkage and truncation estimator to an array to get the sequence eta_j for ALPHA.

    The sample mean is shrunk towards nu, with relative weight d compared to a single observation.
    The estimate is truncated above at u*(1-eps) and below at m_j + e_j(c, j):

        S_1 = 0
        S_j = \sum_{i=1}^{j-1} x_i, j > 1
        m_j = (N*mu-S_j)/(N-j+1) if np.isfinite(N) else mu
        e_j = c/sqrt(d+j-1)
        eta_j = ( (d*nu + S_j)/(d+j-1) \vee (m_j+e_j) ) \wedge u*(1-eps)

    Parameters
    ----------
    x : np.array
        input data
    N : int
        population size for sampling without replacement, or np.inf for sampling with replacement
    mu : float in (0, u)
        hypothesized population mean
    nu : float in (mu, u)
        initial alternative hypothesized value for the population mean
    u : float
        known upper bound on the values
    c : positive float
        scale factor for allowing the estimated mean to approach mu from above
    d : positive float
        relative weight of nu compared to an observation, in updating the alternative for each term
    '''
S = np.insert(np.cumsum(x),0,0)[0:-1] # 0, x_1, x_1+x_2, ...,
j = np.arange(1,len(x)+1) # 1, 2, 3, ..., len(x)
m = (N*mu-S)/(N-j+1) if np.isfinite(N) else mu # mean of population after (j-1)st draw, if null is true
return np.minimum(u*(1-np.finfo(float).eps), np.maximum((d*nu+S)/(d+j-1),m+c/np.sqrt(d+j-1)))
def test_shrink_trunc():
epsj = lambda c, d, j: c/math.sqrt(d+j-1)
Sj = lambda x, j: 0 if j==1 else np.sum(x[0:j-1])
muj = lambda N, mu, x, j: (N*mu - Sj(x, j))/(N-j+1) if np.isfinite(N) else mu
nus = [.51, .55, .6]
mu = 1/2
u = 1
d = 10
vrand = sp.stats.bernoulli.rvs(1/2, size=20)
v = [
np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1]),
np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0]),
vrand
]
for nu in nus:
c = (nu-mu)/2
for x in v:
N = len(x)
xinf = shrink_trunc(x, np.inf, mu, nu, c=c, d=d)
xfin = shrink_trunc(x, len(x), mu, nu, c=c, d=d)
yinf = np.zeros(len(x))
yfin = np.zeros(len(x))
for j in range(1,len(x)+1):
est = (d*nu + Sj(x,j))/(d+j-1)
most = u*(1-np.finfo(float).eps)
yinf[j-1] = np.minimum(np.maximum(mu+epsj(c,d,j), est), most)
yfin[j-1] = np.minimum(np.maximum(muj(N,mu,x,j)+epsj(c,d,j), est), most)
np.testing.assert_allclose(xinf, yinf)
np.testing.assert_allclose(xfin, yfin)
test_shrink_trunc()
def alpha_mart(x: np.array, N: int, mu: float=1/2, eta: float=1-np.finfo(float).eps, u: float=1, \
estim: callable=shrink_trunc) -> np.array :
    '''
    Finds the ALPHA martingale for the hypothesis that the population
    mean is less than or equal to mu, for a population of size N,
    based on a series of draws x.

    The draws must be in random order, or the sequence is not a martingale under the null.

    If N is finite, assumes the sample is drawn without replacement.
    If N is infinite, assumes the sample is drawn with replacement.

    Parameters
    ----------
    x : np.array
        values in [0, u], one element per draw
    N : int
        population size for sampling without replacement, or np.inf for sampling with replacement
    mu : float in (0, u)
        hypothesized population mean
    eta : float in (mu, u)
        alternative hypothesized population mean
    u : float
        known upper bound on the values
    estim : callable
        estim(x, N, mu, eta, u) -> np.array of length len(x), the sequence of values of eta_j for ALPHA

    Returns
    -------
    terms : np.array
        sequence of terms that would be a nonnegative martingale under the null;
        min(1, 1/terms) is an anytime-valid P-value
    '''
S = np.insert(np.cumsum(x),0,0)[0:-1] # 0, x_1, x_1+x_2, ...,
j = np.arange(1,len(x)+1) # 1, 2, 3, ..., len(x)
    m = (N*mu-S)/(N-j+1) if np.isfinite(N) else np.full(len(x), mu) # mean of remaining population after (j-1)st draw, under the null
    etaj = estim(x, N, mu, eta, u)
    with np.errstate(divide='ignore', invalid='ignore'):
        terms = np.cumprod((x*etaj/m + (u-x)*(u-etaj)/(u-m))/u)     # ALPHA terms for values in [0, u]
    terms[m < 0] = np.inf        # the null is certainly false once the remaining population mean is negative
return terms
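A hypothetical usage example (not from the original notebook): simulated draws without replacement from a binary population whose mean is about 0.6, so both martingales should grow and the implied anytime \(P\)-values should shrink.

```python
# Hypothetical demonstration of sprt_mart and alpha_mart on simulated data.
N = 999
population = np.random.choice([0, 1], size=N, p=[0.4, 0.6])    # population mean near 0.6
sample = np.random.permutation(population)                      # draws without replacement, in random order

T_sprt = sprt_mart(sample, N=N, mu=1/2, eta=0.6, u=1)
T_alpha = alpha_mart(sample, N=N, mu=1/2, eta=0.6, u=1)

p_sprt = np.minimum(1, 1/np.maximum.accumulate(T_sprt))         # anytime-valid P-values
p_alpha = np.minimum(1, 1/np.maximum.accumulate(T_alpha))
print(p_sprt[-1], p_alpha[-1])
```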