When conducting large-scale trials on samples of the population, it can be necessary to use a more complex sampling design than a simple random sample.
Weighting – If smaller populations are sampled more heavily to increase precision, then it is necessary to weight these observations in the analysis.
Finite population correction – Larger samples of populations result in lower variability in comparison to smaller samples.
Stratification – Dividing a population into sub-groups and sampling from each group. This protects from obtaining a very poor sample (e.g. under or over-represented groups), can give samples of a known precision, and gives more precise estimates for population means and totals.
Clustering – Dividing a population into sub-groups, and only sampling certain groups. This gives a lower precision, however can be much more convenient and cheaper - for example if surveying school children you may only sample a subset of schools to avoid travelling to a school to interview a single child.
All of these designs need to be taken into account when calculating statistics, and when producing models. Only summary statistics are discussed in this document, and variances are calculated using the default Taylor series linearisation methods. For a more detailed introduction to survey statistics in R, see (Lohr 2022) or (Lumley 2004).
We will use the {survey} package, which is the standard for survey statistics in R. Note that for those who prefer the tidyverse, the {srvyr} package is a wrapper around {survey} with {dplyr} like syntax.
Simple Survey Designs
We will use the API dataset (“API Data Files” 2006), which contains a number of datasets based on different samples from a dataset of academic performance. Initially we will just cover the methodology with a simple random sample and a finite population correction to demonstrate functionality.
If we want to calculate a mean of a variable in a dataset which has been obtained from a simple random sample such as apisrs, in R we can create a design object using the survey::svydesign function (specifying that there is no PSU using id = ~1 and the finite population correction using fpc=~fpc).
srs_design <-svydesign(id =~1, fpc =~fpc, data = apisrs)
This design object stores all metadata about the sample alongside the data, and is used by all subsequent functions in the {survey} package. To calculate the mean, standard error, and confidence intervals of the growth variable, we can use the survey::svymean and confint functions:
# Calculate mean and SE of growth. The standard error will be corrected by the finite population correction specified in the designsrs_means <-svymean(~growth, srs_design)srs_means
mean SE
growth 31.9 2.0905
# Use degf() to get the degrees of freedomconfint(srs_means, df=degf(srs_design))
2.5 % 97.5 %
growth 27.77764 36.02236
Note that to obtain correct results, we had to specify the degrees of freedom using the design object.
Total
Calculating population totals can be done using the survey::svytotal function in R.
svytotal(~growth, srs_design)
total SE
growth 197589 12949
Ratios
To perform ratio analysis for means or proportions of analysis variables in R, we can survey::svyratio, here requesting that we do not separate the ratio estimation per Strata as this design is not stratified.
For proportions close to 0, it can be that survey::svyciprop is more accurate at producing confidence intervals than confint.
Quantiles
To calculate quantiles in R, we can use the survey::svyquantile function. Note that this function was reworked in version 4.1 of {survey}, and prior to this had different arguments and results. The current version of svyquantile has an qrule which is similar to the type argument in quantile, and can be used to change how the quantiles are calculated. For more information, see vignette("qrule", package="survey").
Much of the previous examples and notes still stand for more complex survey designs, here we will demonstrate using a dataset from NHANES (“National Health and Nutrition Examination Survey Data” 2010), which uses both stratification and clustering:
data("nhanes")head(nhanes) |> gt::gt()
SDMVPSU
SDMVSTRA
WTMEC2YR
HI_CHOL
race
agecat
RIAGENDR
1
83
81528.77
0
2
(19,39]
1
1
84
14509.28
0
3
(0,19]
1
2
86
12041.64
0
3
(0,19]
1
2
75
21000.34
0
3
(59,Inf]
2
1
88
22633.58
0
1
(19,39]
1
2
85
74112.49
1
2
(39,59]
2
To produce means and standard quartiles for this sample, taking account of sample design, we can use the following:
nhanes_design <-svydesign(data = nhanes,id =~SDMVPSU, # Specify the PSU/cluster columnstrata =~SDMVSTRA, # The stratification columnweights =~WTMEC2YR, # The weighting columnnest =TRUE# Allows for PSUs with the same name nested within different strata)svymean(~HI_CHOL, nhanes_design, na.rm=TRUE)
In R, we can perform domain estimations of different sub-populations by passing our required survey function to the svyby function. svyby can also take additional options to pass to the function, for example here we pass na.rm=TRUE and deff=TRUE to svymean:
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.4.2 (2024-10-31)
os Ubuntu 24.04.1 LTS
system x86_64, linux-gnu
ui X11
language (EN)
collate C.UTF-8
ctype C.UTF-8
tz UTC
date 2025-01-20
pandoc 3.4 @ /opt/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
! package * version date (UTC) lib source
P survey * 4.4-2 2024-03-20 [?] RSPM (R 4.4.0)
[1] /home/runner/work/CAMIS/CAMIS/renv/library/linux-ubuntu-noble/R-4.4/x86_64-pc-linux-gnu
[2] /opt/R/4.4.2/lib/R/library
P ── Loaded and on-disk path mismatch.
──────────────────────────────────────────────────────────────────────────────