Survey Summary Statistics using R

When conducting large-scale trials on samples of the population, it can be necessary to use a more complex sampling design than a simple random sample.

Weighting – If smaller populations are sampled more heavily to increase precision, then it is necessary to weight these observations in the analysis.
Finite population correction – Larger samples of populations result in lower variability in comparison to smaller samples.
Stratification – Dividing a population into sub-groups and sampling from each group. This protects from obtaining a very poor sample (e.g. under or over-represented groups), can give samples of a known precision, and gives more precise estimates for population means and totals.
Clustering – Dividing a population into sub-groups, and only sampling certain groups. This gives a lower precision, however can be much more convenient and cheaper - for example if surveying school children you may only sample a subset of schools to avoid travelling to a school to interview a single child.

All of these designs need to be taken into account when calculating statistics, and when producing models. Only summary statistics are discussed in this document, and variances are calculated using the default Taylor series linearisation methods. For a more detailed introduction to survey statistics in R, see (Lohr 2022) or (Lumley 2004).

We will use the {survey} package, which is the standard for survey statistics in R. Note that for those who prefer the tidyverse, the {srvyr} package is a wrapper around {survey} with {dplyr} like syntax.

Simple Survey Designs

We will use the API dataset (“API Data Files” 2006), which contains a number of datasets based on different samples from a dataset of academic performance. Initially we will just cover the methodology with a simple random sample and a finite population correction to demonstrate functionality.

library(survey)

data("api")

head(apisrs) |> gt::gt()

cds	stype	name	sname	snum	dname	dnum	cname	cnum	flag	pcttest	api00	api99	target	growth	sch.wide	comp.imp	both	awards	meals	ell	yr.rnd	mobility	acs.k3	acs.46	acs.core	pct.resp	not.hsg	hsg	some.col	col.grad	grad.sch	avg.ed	full	emer	enroll	api.stu	pw	fpc
15739081534155	H	McFarland High	McFarland High	1039	McFarland Unified	432	Kern	14	NA	98	462	448	18	14	No	Yes	No	No	44	31	NA	6	NA	NA	24	82	44	34	12	7	3	1.91	71	35	477	429	30.97	6194
19642126066716	E	Stowers (Cecil	Stowers (Cecil B.) Elementary	1124	ABC Unified	1	Los Angeles	18	NA	100	878	831	NA	47	Yes	Yes	Yes	Yes	8	25	NA	15	19	30	NA	97	4	10	23	43	21	3.66	90	10	478	420	30.97	6194
30664493030640	H	Brea-Olinda Hig	Brea-Olinda High	2868	Brea-Olinda Unified	79	Orange	29	NA	98	734	742	3	-8	No	No	No	No	10	10	NA	7	NA	NA	28	95	5	9	21	41	24	3.71	83	18	1410	1287	30.97	6194
19644516012744	E	Alameda Element	Alameda Elementary	1273	Downey Unified	187	Los Angeles	18	NA	99	772	657	7	115	Yes	Yes	Yes	Yes	70	25	NA	23	23	NA	NA	100	37	40	14	8	1	1.96	85	18	342	291	30.97	6194
40688096043293	E	Sunnyside Eleme	Sunnyside Elementary	4926	San Luis Coastal Unified	640	San Luis Obispo	39	NA	99	739	719	4	20	Yes	Yes	Yes	Yes	43	12	NA	12	20	29	NA	91	8	21	27	34	10	3.17	100	0	217	189	30.97	6194
19734456014278	E	Los Molinos Ele	Los Molinos Elementary	2463	Hacienda la Puente Unif	284	Los Angeles	18	NA	93	835	822	NA	13	Yes	Yes	Yes	No	16	19	NA	13	19	29	NA	71	1	8	20	38	34	3.96	75	20	258	211	30.97	6194

Mean

If we want to calculate a mean of a variable in a dataset which has been obtained from a simple random sample such as apisrs, in R we can create a design object using the survey::svydesign function (specifying that there is no PSU using id = ~1 and the finite population correction using fpc=~fpc).

srs_design <- svydesign(id = ~1, fpc = ~fpc, data = apisrs)

This design object stores all metadata about the sample alongside the data, and is used by all subsequent functions in the {survey} package. To calculate the mean, standard error, and confidence intervals of the growth variable, we can use the survey::svymean and confint functions:

# Calculate mean and SE of growth. The standard error will be corrected by the finite population correction specified in the design
srs_means <- svymean(~growth, srs_design)

srs_means

       mean     SE
growth 31.9 2.0905

# Use degf() to get the degrees of freedom
confint(srs_means, df=degf(srs_design))

          2.5 %   97.5 %
growth 27.77764 36.02236

Note that to obtain correct results, we had to specify the degrees of freedom using the design object.

Total

Calculating population totals can be done using the survey::svytotal function in R.

svytotal(~growth, srs_design)

        total    SE
growth 197589 12949

Ratios

To perform ratio analysis for means or proportions of analysis variables in R, we can survey::svyratio, here requesting that we do not separate the ratio estimation per Strata as this design is not stratified.

svy_ratio <- svyratio(
  ~api00,
  ~api99,
  srs_design,
  se=TRUE,
  df=degf(srs_design),
  separate=FALSE
)

svy_ratio

Ratio estimator: svyratio.survey.design2(~api00, ~api99, srs_design, se = TRUE, 
    df = degf(srs_design), separate = FALSE)
Ratios=
         api99
api00 1.051066
SEs=
            api99
api00 0.003603991

confint(svy_ratio, df=degf(srs_design))

               2.5 %   97.5 %
api00/api99 1.043959 1.058173

Proportions

To calculate a proportion in R, we use the svymean function on a factor or character column:

props <- svymean(~sch.wide, srs_design)

props

             mean     SE
sch.wideNo  0.185 0.0271
sch.wideYes 0.815 0.0271

confint(props, df=degf(srs_design))

                2.5 %    97.5 %
sch.wideNo  0.1316041 0.2383959
sch.wideYes 0.7616041 0.8683959

For proportions close to 0, it can be that survey::svyciprop is more accurate at producing confidence intervals than confint.

Quantiles

To calculate quantiles in R, we can use the survey::svyquantile function. Note that this function was reworked in version 4.1 of {survey}, and prior to this had different arguments and results. The current version of svyquantile has an qrule which is similar to the type argument in quantile, and can be used to change how the quantiles are calculated. For more information, see vignette("qrule", package="survey").

svyquantile(
  ~growth,
  srs_design,
  quantiles = c(0.025, 0.5, 0.975),
  ci=TRUE,
  se=TRUE
)

$growth
      quantile ci.2.5 ci.97.5        se
0.025      -16    -21     -12  2.281998
0.5         27     24      31  1.774887
0.975       99     84     189 26.623305

attr(,"hasci")
[1] TRUE
attr(,"class")
[1] "newsvyquantile"

Summary Statistics on Complex Survey Designs

Much of the previous examples and notes still stand for more complex survey designs, here we will demonstrate using a dataset from NHANES (“National Health and Nutrition Examination Survey Data” 2010), which uses both stratification and clustering:

data("nhanes")

head(nhanes) |> gt::gt()

SDMVPSU	SDMVSTRA	WTMEC2YR	HI_CHOL	race	agecat	RIAGENDR
1	83	81528.77	0	2	(19,39]	1
1	84	14509.28	0	3	(0,19]	1
2	86	12041.64	0	3	(0,19]	1
2	75	21000.34	0	3	(59,Inf]	2
1	88	22633.58	0	1	(19,39]	1
2	85	74112.49	1	2	(39,59]	2

To produce means and standard quartiles for this sample, taking account of sample design, we can use the following:

nhanes_design <- svydesign(
  data = nhanes,
  id = ~SDMVPSU, # Specify the PSU/cluster column
  strata = ~SDMVSTRA,  # The stratification column
  weights = ~WTMEC2YR,  # The weighting column
  nest = TRUE  # Allows for PSUs with the same name nested within different strata
)

svymean(~HI_CHOL, nhanes_design, na.rm=TRUE)

           mean     SE
HI_CHOL 0.11214 0.0054

svyquantile(
  ~HI_CHOL,
  nhanes_design,
  quantiles=c(0.25, 0.5, 0.75),
  na.rm=TRUE,
  ci=TRUE
)

$HI_CHOL
     quantile ci.2.5 ci.97.5        se
0.25        0      0       1 0.2358596
0.5         0      0       1 0.2358596
0.75        0      0       1 0.2358596

attr(,"hasci")
[1] TRUE
attr(,"class")
[1] "newsvyquantile"

In R, we can perform domain estimations of different sub-populations by passing our required survey function to the svyby function. svyby can also take additional options to pass to the function, for example here we pass na.rm=TRUE and deff=TRUE to svymean:

svyby(~HI_CHOL, ~race, nhanes_design, svymean, na.rm=TRUE, deff=TRUE)

  race    HI_CHOL          se DEff.HI_CHOL
1    1 0.10149167 0.006245843     1.082805
2    2 0.12164921 0.006604134     1.407850
3    3 0.07864006 0.010384645     2.091258
4    4 0.09967861 0.024666227     3.098368

Session Info

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.4.3 (2025-02-28)
 os       Ubuntu 24.04.2 LTS
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  C.UTF-8
 ctype    C.UTF-8
 tz       Europe/London
 date     2025-03-13
 pandoc   NA (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 ! package * version date (UTC) lib source
 P survey  * 4.4-2   2024-03-20 [?] RSPM (R 4.4.0)

 [1] /home/michael/source/CAMIS/renv/library/linux-ubuntu-noble/R-4.4/x86_64-pc-linux-gnu
 [2] /opt/R/4.4.3/lib/R/library

 P ── Loaded and on-disk path mismatch.

──────────────────────────────────────────────────────────────────────────────

References

“API Data Files.” 2006. California Department of Education. https://web.archive.org/web/20060813165101/http://api.cde.ca.gov/datafiles.asp.

Lohr, Sharon L. 2022. Sampling: Design and Analysis. 3rd ed. CRC Press, Taylor & Francis Group.

Lumley, Thomas. 2004. “Analysis of Complex Survey Samples.” Journal of Statistical Software 9 (8): 1–19. https://doi.org/10.18637/jss.v009.i08.

“National Health and Nutrition Examination Survey Data.” 2010. Centers for Disease Control; Prevention (CDC). https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=laboratory&CycleBeginYear=2009.