Confidence Intervals for Proportions: General Information

Introduction

The methods to use for calculating confidence intervals (CIs) for binomial proportions depend on the situation:

  • 1 sample proportion, p (1 proportion calculated from 1 group of subjects)

  • 2 sample proportions (\(p_1\) and \(p_2\)) and you want a CI for a contrast parameter, such as the difference in the 2 proportions (risk difference, \(RD = p_1 - p_2\)), the ratio (relative risk, \(RR = p_1 / p_2\)), or the odds ratio (\(OR = p_1 (1-p_2) / (p_2 (1-p_1))\)).

    • The 2 samples are independent (different subjects in each of the 2 treatment groups).

    • The 2 samples are matched (i.e. the same subject has 2 results, one from each treatment [paired data]).
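As a minimal illustration of the three contrast parameters for independent samples, the following Python sketch computes the point estimates from two observed proportions. The function name and the example counts (56/70 vs 48/80) are hypothetical, chosen only for illustration, and boundary cases such as zero counts are not handled.

```python
def contrasts(x1, n1, x2, n2):
    """Point estimates of RD, RR and OR from two observed proportions.

    Illustrative only: raises ZeroDivisionError for boundary cases
    (e.g. x2 == 0 or x1 == n1).
    """
    p1, p2 = x1 / n1, x2 / n2
    rd = p1 - p2                                    # risk difference
    rr = p1 / p2                                    # relative risk
    odds_ratio = (p1 * (1 - p2)) / (p2 * (1 - p1))  # odds ratio
    return rd, rr, odds_ratio


rd, rr, odds_ratio = contrasts(56, 70, 48, 80)
print(round(rd, 4), round(rr, 4), round(odds_ratio, 4))
```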

Reporting a CI for such contrasts is useful for giving a clinically interpretable estimate of the magnitude of a treatment effect. Moreover, it is particularly relevant for non-inferiority (NI) or equivalence trials, where the hypothesis test is an assessment of whether the CI spans a pre-specified NI margin or not.

Some sources suggest selecting a different method depending on your sample size and whether your proportion takes an extreme value (i.e. close to 0 or 1). Similarly, for a 2x2 table, there is a widely held belief that a so-called ‘exact’ method should be used if expected cell counts are less than 5. These conventions originated at a time when the Wald method was the only available alternative, but CI methods are now available that have appropriate coverage properties across the whole parameter space. These are easily obtained with modern computing software, so there is no need for a data-driven approach to method selection or reliance on overly conservative methods.

The poor performance of the approximate normal (‘Wald’) methods is well documented: they can fail to achieve the nominal confidence level even with large sample sizes, and there is a consensus in the literature that they should be avoided [1]. Wald intervals remain a default output component in SAS, and so continue to be widely used, but alternative methods are strongly recommended. [Note in particular that most other methods become conservative for very small cell counts, which suggests there is no need to resort to ‘exact’ methods.]

Strictly conservative vs proximate coverage

Because of the discrete nature of binomial data, it is impossible for any CI method to cover the true proportion exactly 95% of the time for all values of p. Some CI methods are designed to achieve the nominal confidence level as a minimum, but many researchers find such a criterion to be excessive, producing CIs that are unnecessarily wide. The alternative position is to aim for coverage probability that is close to the nominal confidence level on average. These two opposing positions are described as “aiming to align either the minimum or the mean coverage with the nominal \((1-\alpha)\)” [2], or in other words aiming for coverage to be either ‘strictly conservative’ or ‘proximate’ [3]. It has been pointed out that the proximate stance is consistent with most other types of statistical models, which are based on approximate assumptions [1]. The debate makes it difficult to state in simple objective terms whether one method is ‘better’ than another.
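To make the idea of coverage probability concrete, the sketch below computes the exact coverage of an interval method at a given true p by summing binomial probabilities over every outcome whose interval contains p. The Wald interval is used only as a convenient example; the function names and the choice of n = 30 are illustrative assumptions.

```python
from math import comb, sqrt
from statistics import NormalDist


def wald_ci(x, n, alpha=0.05):
    """Wald interval for a single proportion (used here only as an example)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = x / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def coverage(ci_func, n, p, alpha=0.05):
    """Exact probability that the interval from ci_func contains the true p."""
    total = 0.0
    for x in range(n + 1):
        lower, upper = ci_func(x, n, alpha)
        if lower <= p <= upper:
            total += comb(n, x) * p**x * (1 - p) ** (n - x)
    return total


# Coverage varies with p because only n + 1 distinct intervals are possible.
for p in (0.05, 0.1, 0.3, 0.5):
    print(p, round(coverage(wald_ci, 30, p), 4))
```

Evaluating `coverage` over a fine grid of p values gives the familiar sawtooth pattern; its minimum and its mean over the grid correspond to the ‘strictly conservative’ and ‘proximate’ criteria respectively.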

Many CI methods are designed to achieve proximate coverage, but some employ computationally intensive approaches (often labelled as ‘exact’ methods) to guarantee strictly conservative coverage, while others include an optional adjustment (‘continuity correction’) to emulate the same effect using asymptotic formulae. The term ‘adjustment’ may be used instead of ‘correction’ to avoid the implication that the non-adjusted methods are inferior [4, 5].

It is worth noting that although one might assume that regulatory authorities would insist on strictly conservative coverage (noting that one-sided non-coverage equates to the type 1 error of a NI test), in recent years it appears that the FDA’s preferred method for RD is the Miettinen-Nurminen (MN) method (e.g. see https://www.accessdata.fda.gov/drugsatfda_docs/nda/2010/200327orig1s000statr.pdf). In contrast, for a single proportion, the ‘exact’ Clopper-Pearson method seems (anecdotally) to be more commonly preferred.

Consistency with hypothesis tests, and one-sided coverage

It is desirable (and in some analysis contexts, essential) for a reported CI to be consistent with the result of a hypothesis test, so that for example the 95% CI will exclude the null hypothesis value if and only if the test p-value is less than 0.05 (or 0.025 one-sided). This applies for the special case of a test for association (where the null hypothesis is of 0 difference between the groups), but also the more general case of a non-zero null hypothesis for NI or equivalence testing. It should be noted that not all CI methods satisfy this criterion.

A related issue is whether it is sufficient for two-sided coverage probability to be aligned with the nominal \((1-\alpha)\), or whether the tail probabilities (i.e. one-sided non-coverage at each end) should also be aligned with the nominal \(\alpha/2\). Clearly the latter ‘equal-tailed’ or symmetric coverage criterion (described as ‘central interval location’ by Newcombe) is essential for non-inferiority testing, but it is also a desirable general property for any CI.

Overarching methods

There are some methods that apply (or have variants which apply) across all contrast parameters (i.e. \(\theta=p\) or \(\theta=RD\) etc.), which are outlined below.

Normal Approximation Methods (Also known as the Wald Methods)

The Wald confidence interval is constructed simply from the point estimate plus and minus a multiple of its estimated standard error: \((L_{Wald}, U_{Wald}) = \hat \theta \pm z_{\alpha/2} \sqrt{\hat V(\hat\theta)}\) where \(z_{\alpha/2}\) is the \(1-\alpha/2\) quantile of a standard normal distribution corresponding to the confidence level \((1-\alpha)\).
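As a sketch of this formula for the RD contrast with independent samples, the variance is taken to be the usual unpooled estimate \(\hat p_1(1-\hat p_1)/n_1 + \hat p_2(1-\hat p_2)/n_2\) (an assumption, as the variance is not spelled out above). The function name and example counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wald_ci_rd(x1, n1, x2, n2, alpha=0.05):
    """Wald CI for RD = p1 - p2 from two independent samples."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    p1, p2 = x1 / n1, x2 / n2
    rd = p1 - p2
    # Unpooled variance estimate of the difference (assumed form).
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return rd - z * se, rd + z * se


lower, upper = wald_ci_rd(56, 70, 48, 80)
print(round(lower, 4), round(upper, 4))  # approximately 0.0575 0.3425
```

The interval is symmetric about the point estimate by construction, which is one reason it behaves poorly near the boundaries of the parameter space.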

For RR and OR, \(\theta\) is first log-transformed to improve the normal approximation, and then \((L_{Wald}, U_{Wald}) = \exp[ \ln(\hat \theta) \pm z_{\alpha/2} \sqrt{\hat V(\ln(\hat\theta))}]\), which requires a modification for boundary cases.
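The log-scale construction might be sketched as follows, assuming the standard delta-method variance estimates for \(\ln(RR)\) and \(\ln(OR)\) (these are not given above). The function names are hypothetical, and the modification for boundary cases is deliberately not implemented, so zero cell counts raise an error.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def wald_ci_log(estimate, var_log, alpha=0.05):
    """Back-transformed Wald interval, given the variance of ln(estimate)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * sqrt(var_log)
    return exp(log(estimate) - half), exp(log(estimate) + half)


def wald_ci_rr(x1, n1, x2, n2, alpha=0.05):
    rr = (x1 / n1) / (x2 / n2)
    # Delta-method variance of ln(RR); undefined if x1 or x2 is zero.
    var_log = 1 / x1 - 1 / n1 + 1 / x2 - 1 / n2
    return wald_ci_log(rr, var_log, alpha)


def wald_ci_or(x1, n1, x2, n2, alpha=0.05):
    a, b, c, d = x1, n1 - x1, x2, n2 - x2
    odds_ratio = (a * d) / (b * c)
    # Delta-method variance of ln(OR); undefined if any cell is zero.
    var_log = 1 / a + 1 / b + 1 / c + 1 / d
    return wald_ci_log(odds_ratio, var_log, alpha)


print(wald_ci_rr(56, 70, 48, 80))
print(wald_ci_or(56, 70, 48, 80))
```

Because the interval is symmetric on the log scale, the point estimate is the geometric mean of the two limits.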

Asymptotic Score Methods

The Score confidence interval is also based on an asymptotic normal approximation, but uses a score statistic \(Z(\theta) = S(\theta)/\sqrt {\tilde V}\). This statistic is based on a contrast function \(S(\theta)\) (e.g. for the single proportion, \(S(p) = \hat p - p\), or for RD, \(S(\theta) = \hat p_1 - \hat p_2 - \theta\)), and a maximum likelihood estimate of its variance \(\tilde V = V(S(\theta))\), which is also a function of \(\theta\).

\(Z(\theta)\) can be evaluated at any value of \(\theta\) to find the range of values for which \(|Z(\theta)|< z_{\alpha/2}\). For the single proportion (Wilson score interval), this process simplifies to solving a quadratic equation in \(p\).
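For the single proportion, solving \(|Z(p)| = z_{\alpha/2}\) with \(\tilde V = p(1-p)/n\) gives the closed-form Wilson limits. The sketch below uses that closed form; the function name and example counts are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def wilson_ci(x, n, alpha=0.05):
    """Wilson score interval: the two roots of (p_hat - p)^2 = z^2 p(1-p)/n."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = x / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


lower, upper = wilson_ci(15, 50)
print(round(lower, 4), round(upper, 4))
```

Unlike the Wald interval, the limits always stay within [0, 1], and the interval is not symmetric about \(\hat p\) unless \(\hat p = 0.5\).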

For the contrast parameters, score methods were derived for independent proportions by Miettinen & Nurminen and others in the 1980s [6, 7, 8], and related methods for paired data followed later [9, 10, 11, 12]. In each case, the variance is a function of both \(p_1\) and \(p_2\), or of \(\theta\) and a ‘nuisance parameter’, which is eliminated using the maximum likelihood estimate for \(p_2\) (or for the paired case, the cell probability \(p_{21}\)) for the given value of \(\theta\).
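The following is a rough numerical sketch of a score interval for RD with independent samples. It is not the published closed-form algorithm: the restricted maximum likelihood estimate of \(p_2\) is found by a simple ternary search, and the limits by bisection on \(\theta\). The \(N/(N-1)\) variance factor usually associated with the Miettinen-Nurminen version is included as an assumption, all function names are hypothetical, and extreme data (e.g. \(\hat p_1 - \hat p_2 = \pm 1\)) are not specially handled.

```python
from math import copysign, inf, log, sqrt
from statistics import NormalDist


def _loglik(p2, theta, x1, n1, x2, n2):
    """Log-likelihood of two binomials with p1 constrained to p2 + theta."""
    p1 = p2 + theta
    terms = ((x1, p1), (n1 - x1, 1 - p1), (x2, p2), (n2 - x2, 1 - p2))
    # The max(...) guard avoids log(0) from rounding at the edge of the range.
    return sum(k * log(max(q, 1e-300)) for k, q in terms if k > 0)


def _restricted_p2(theta, x1, n1, x2, n2):
    """Restricted MLE of p2 for a given theta (ternary search, concave target)."""
    lo, hi = max(0.0, -theta), min(1.0, 1.0 - theta)
    for _ in range(100):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if _loglik(m1, theta, x1, n1, x2, n2) < _loglik(m2, theta, x1, n1, x2, n2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2


def score_z(theta, x1, n1, x2, n2):
    """Score statistic Z(theta) for RD, with the assumed N/(N-1) factor."""
    p2 = _restricted_p2(theta, x1, n1, x2, n2)
    p1 = p2 + theta
    n = n1 + n2
    var = (p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2) * n / (n - 1)
    s = x1 / n1 - x2 / n2 - theta
    if var <= 0:
        return 0.0 if s == 0 else copysign(inf, s)
    return s / sqrt(var)


def score_ci_rd(x1, n1, x2, n2, alpha=0.05):
    z = NormalDist().inv_cdf(1 - alpha / 2)
    rd = x1 / n1 - x2 / n2

    def solve(target, lo, hi):
        # Z(theta) decreases as theta increases, so bisect on theta.
        for _ in range(60):
            mid = (lo + hi) / 2
            if score_z(mid, x1, n1, x2, n2) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    return solve(z, -1 + 1e-9, rd), solve(-z, rd, 1 - 1e-9)


print(score_ci_rd(56, 70, 48, 80))
```

Because the limits are defined by inverting \(Z(\theta)\), the interval excludes a hypothesised value \(\theta_0\) exactly when the corresponding score test rejects it, including non-zero \(\theta_0\) such as an NI margin.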

SCAS Methods

Originating from papers published in 1985-1990 by Gart and Nam, the skewness-corrected asymptotic score (‘SCAS’) methods introduce a further term within the score statistic that involves \(\tilde \mu_3\), the third central moment of \(S(\theta)\):

\[ Z(\theta) = \frac{S(\theta)}{\sqrt {\tilde V}} - \frac{(Z(\theta)^2 - 1)\tilde \mu_3}{6 \tilde V^{3/2}} \]

Hence these are an extension of the asymptotic score methods, designed to address asymmetric coverage which was observed for the score method, particularly for the RR contrast. The same principle is applied to the Wilson score interval for a single proportion, which has also been noted to have a systematic bias in one-sided coverage [13]. A unified family of methods covering the single proportion and all independent contrasts was described by Laud [3], with the addition of a bias correction for the OR case [14], and a further publication currently under review for paired data.

SCAS intervals generally achieve one-sided coverage that is very close to the nominal \(\alpha/2\), which naturally also leads to excellent two-sided coverage. Symmetric strictly conservative coverage can be achieved by adding a continuity adjustment.
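Because \(Z(\theta)\) appears on both sides of the SCAS equation, it can help to see it rearranged as a quadratic. As a sketch, using the shorthand \(A = \tilde \mu_3 / (6 \tilde V^{3/2})\) and \(z_0 = S(\theta)/\sqrt{\tilde V}\) (both introduced here for illustration only):

\[ A\, Z(\theta)^2 + Z(\theta) - (z_0 + A) = 0 \]

Taking the root that reduces to the ordinary score statistic when the skewness term vanishes gives, for \(A \ne 0\):

\[ Z(\theta) = \frac{-1 + \sqrt{1 + 4A(z_0 + A)}}{2A} \]

As \(A \to 0\) this tends to \(z_0\). Values of \(\theta\) for which the term under the square root is negative need separate handling, which is not covered here.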

The SCAS method is not implemented in SAS PROC FREQ, but can be obtained using the %SCORECI and %PAIRBINCI macros from https://github.com/petelaud/ratesci-sas.

MOVER Methods (Also known as the Newcombe Method)

This class of methods was popularised by Newcombe for the RD contrast, and is therefore labelled as the Newcombe method in current SAS PROC FREQ documentation. It has since been generalised to other contrasts and for paired data (though not implemented in SAS), and renamed as the Method of Variance Estimates Recovery (MOVER). It is also referred to as the ‘Square-and-add’ approach.

MOVER intervals are produced by combining CIs for each of the two separate proportions involved (and an estimate of their correlation, in the paired data case). Originally Wilson Score intervals were used (hence earlier versions of SAS (e.g. v9.3) labelled this method as WILSON or SCORE), but improved performance may be obtained by using Jeffreys intervals instead.
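A minimal sketch of the square-and-add calculation for RD with independent samples is shown below, using Wilson score intervals for the two separate proportions (the original Newcombe form, rather than the Jeffreys variant mentioned above). Function names and example counts are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def wilson_ci(x, n, alpha=0.05):
    """Wilson score interval for a single proportion."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = x / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def mover_ci_rd(x1, n1, x2, n2, alpha=0.05):
    """MOVER (square-and-add) interval for RD = p1 - p2, independent samples."""
    p1, p2 = x1 / n1, x2 / n2
    l1, u1 = wilson_ci(x1, n1, alpha)
    l2, u2 = wilson_ci(x2, n2, alpha)
    rd = p1 - p2
    # Lower limit pairs the lower limit for p1 with the upper limit for p2,
    # and vice versa for the upper limit.
    lower = rd - sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = rd + sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return lower, upper


lower, upper = mover_ci_rd(56, 70, 48, 80)
print(round(lower, 4), round(upper, 4))  # approximately 0.0524 0.3339
```

Swapping `wilson_ci` for another single-proportion method with good coverage changes only the two inner calls, which is what makes the approach easy to generalise.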

‘Exact’ Methods

So-called ‘exact’ methods are designed to guarantee strictly conservative coverage. (Newcombe points out some problems with the use of the term ‘exact’ here, hence the quotation marks.) For the single proportion, the Clopper-Pearson confidence interval is the range of p for which \(P(X \ge x | p) \ge \alpha / 2\) and \(P(X \le x | p) \ge \alpha / 2\), so the lower and upper limits are the values of p at which the respective tail probability equals \(\alpha / 2\). More complex calculations are involved for the contrast parameters, which eliminate the nuisance parameter by taking the supremum of all p-values across the range of combinations of \(p_1\) and \(p_2\) for a given \(\theta\).
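For the single proportion, the Clopper-Pearson limits can be sketched by solving the two tail-probability equations numerically. Production software typically uses beta distribution quantiles instead; plain bisection is used here only to keep the example self-contained, and the function names are hypothetical.

```python
from math import comb


def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_ge(x, n, p):
    """P(X >= x | p); increases with p."""
    return sum(binom_pmf(k, n, p) for k in range(x, n + 1))


def prob_le(x, n, p):
    """P(X <= x | p); decreases with p."""
    return sum(binom_pmf(k, n, p) for k in range(0, x + 1))


def bisect_increasing(f, target, iters=100):
    """Solve f(p) = target on [0, 1] for a function f that increases with p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson_ci(x, n, alpha=0.05):
    # Lower limit: P(X >= x | p) = alpha/2 (taken as 0 when x = 0).
    lower = 0.0 if x == 0 else bisect_increasing(
        lambda p: prob_ge(x, n, p), alpha / 2)
    # Upper limit: P(X <= x | p) = alpha/2 (taken as 1 when x = n).
    upper = 1.0 if x == n else bisect_increasing(
        lambda p: 1 - prob_le(x, n, p), 1 - alpha / 2)
    return lower, upper


print(clopper_pearson_ci(15, 50))
```

When x = 0 the upper limit has the closed form \(1 - (\alpha/2)^{1/n}\), which provides a convenient check on the numerical solution.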

There are actually a few different versions of ‘exact’ methods for each contrast, with alternatives aimed at reducing the conservatism of coverage. For example, Chan-Zhang is an improvement on Santner-Snell. However, it should be noted that some methods (e.g. Blaker, Agresti-Min) achieve reduced conservatism by inverting a two-sided exact test, but as a result they do not satisfy the strictly conservative criterion for one-sided coverage (and therefore would produce inflated type 1 error rates if used for a non-inferiority test).

Several asymptotic methods offer a continuity adjustment, which aims to emulate ‘exact’ methods and achieve almost conservative coverage, for example by adding a quantity of the order \(0.5/n\) to the formula. Generally this is not a successful modification to the Wald method, but it can produce good results for the score methods. It has been suggested that an adjustment of smaller magnitude such as \(3/(8n)\) [15], or \(0.25/n\) [3, Appendix S2] (or any other selected value on a ‘sliding scale’ from 0 to 0.5) could be used to achieve coverage that is at least \((1-\alpha)\) nearly all of the time.
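One way to picture the ‘sliding scale’ is a score interval for the single proportion in which a quantity \(\gamma/n\) is subtracted from \(|\hat p - p|\) before comparing with \(z_{\alpha/2}\sqrt{p(1-p)/n}\). This form of the adjustment, the parameter name `gamma`, and the function name are assumptions made for illustration; the limits are found by bisection rather than a closed form.

```python
from math import sqrt
from statistics import NormalDist


def adjusted_score_ci(x, n, alpha=0.05, gamma=0.5):
    """Score interval for p with a continuity adjustment of gamma / n.

    gamma = 0 gives the unadjusted (Wilson) interval; gamma = 0.5 is the
    conventional full adjustment. Intermediate values lie in between.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = x / n
    adj = gamma / n

    def se(p):
        return sqrt(p * (1 - p) / n)

    def solve(stat, far, near):
        # stat(p) exceeds z at `far` and is 0 at `near`; bisect for stat(p) = z.
        for _ in range(100):
            mid = (far + near) / 2
            if stat(mid) > z:
                far = mid
            else:
                near = mid
        return (far + near) / 2

    tiny = 1e-12
    lo_centre, up_centre = p_hat - adj, p_hat + adj
    lower = 0.0 if lo_centre <= 0 else solve(
        lambda p: (lo_centre - p) / se(p), tiny, lo_centre)
    upper = 1.0 if up_centre >= 1 else solve(
        lambda p: (p - up_centre) / se(p), 1 - tiny, up_centre)
    return lower, upper


for g in (0.0, 0.25, 0.5):
    print(g, adjusted_score_ci(15, 50, gamma=g))
```

Larger values of `gamma` widen the interval at both ends, trading interval width for coverage that is closer to strictly conservative.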

‘Mid-P’ Methods

‘Exact’ methods may be adapted to be less conservative (i.e. to achieve proximate coverage) by applying a mid-P adjustment, essentially including half the probability of the observed data in the calculations. This is like a reversal of the continuity adjustment for asymptotic methods.

Although theoretically possible for any contrast, the mid-P method is only implemented in SAS for CIs for the single proportion and the odds ratio contrast.
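For the single proportion, the mid-P adjustment can be sketched by counting only half of \(P(X = x | p)\) in each tail probability and then solving for the limits as for Clopper-Pearson. Bisection and the function names below are illustrative choices, not a description of any particular software implementation.

```python
from math import comb


def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def bisect_increasing(f, target, iters=100):
    """Solve f(p) = target on [0, 1] for a function f that increases with p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def midp_ci(x, n, alpha=0.05):
    def upper_tail(p):
        # P(X > x | p) + 0.5 * P(X = x | p); increases with p.
        above = sum(binom_pmf(k, n, p) for k in range(x + 1, n + 1))
        return above + 0.5 * binom_pmf(x, n, p)

    def lower_tail(p):
        # P(X < x | p) + 0.5 * P(X = x | p); decreases with p.
        below = sum(binom_pmf(k, n, p) for k in range(0, x))
        return below + 0.5 * binom_pmf(x, n, p)

    lower = 0.0 if x == 0 else bisect_increasing(upper_tail, alpha / 2)
    upper = 1.0 if x == n else bisect_increasing(
        lambda p: 1 - lower_tail(p), 1 - alpha / 2)
    return lower, upper


print(midp_ci(15, 50))
```

When x = 0 the mid-P upper limit is \(1 - \alpha^{1/n}\), which is smaller than the Clopper-Pearson value \(1 - (\alpha/2)^{1/n}\) and so illustrates the reduced conservatism.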

References

1. Brown, L. D., Cai, T. T. & DasGupta, A. Interval estimation for a binomial proportion. Statistical Science 16, (2001).
2. Newcombe, R. G. Confidence Intervals for Proportions and Related Measures of Effect Size. (CRC Press, 2012). doi:10.1201/b12670.
3. Laud, P. J. Equal-tailed confidence intervals for comparison of rates. Pharmaceutical Statistics 16, 334–348 (2017).
4. Newcombe, R. G. Two-sided confidence intervals for the single proportion: comparison of seven methods. Statistics in Medicine 17, 857–872 (1998).
5. Campbell, I. Chi-squared and Fisher–Irwin tests of two-by-two tables with small sample recommendations. Statistics in Medicine 26, 3661–3675 (2007).
6. Miettinen, O. & Nurminen, M. Comparative analysis of two rates. Statistics in Medicine 4, 213–226 (1985).
7. Mee, R. W. & Anbar, D. Confidence bounds for the difference between two probabilities. Biometrics 40, 1175–1176 (1984).
8. Koopman, P. A. R. Confidence intervals for the ratio of two binomial proportions. Biometrics 40, 513 (1984).
9.
10.
11. Nam, J. & Blackwelder, W. C. Analysis of the ratio of marginal probabilities in a matched-pair setting. Statistics in Medicine 21, 689–699 (2002).
12. Tang, N.-S., Tang, M.-L. & Chan, I. S. F. On tests of equivalence via non-unity relative risk for matched-pair design. Statistics in Medicine 22, 1217–1233 (2003).
13. Cai, T. T. One-sided confidence intervals in discrete distributions. Journal of Statistical Planning and Inference 131, 63–88 (2005).
14. Laud, P. J. Equal-tailed confidence intervals for comparison of rates. Pharmaceutical Statistics 17, 290–293 (2018).
15. Mehrotra, D. V. & Railkar, R. Minimum risk weights for comparing treatments in stratified binomial trials. Statistics in Medicine 19, 811–825 (2000).