Survey Summary Statistics using Python

When conducting large-scale trials on samples of the population, it can be necessary to use a more complex sampling design than a simple random sample. Common features of such designs include:

Weighting – If smaller populations are sampled more heavily to increase precision, then it is necessary to weight these observations in the analysis.
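Finite population correction – When the sample is a large fraction of the population, estimates vary less than the usual formulas (which assume an effectively infinite population) suggest, so variances are scaled down to account for this.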
Stratification – Dividing a population into sub-groups and sampling from each group. This protects against obtaining a very poor sample (e.g. one with under- or over-represented groups), can give samples of a known precision, and gives more precise estimates of population means and totals.
Clustering – Dividing a population into sub-groups and sampling only some of those groups. This gives lower precision, but can be much more convenient and cheaper. For example, when surveying school children you might sample only a subset of schools, to avoid travelling to a school to interview a single child.
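
To make the finite population correction concrete, the sketch below compares the standard error of a sample mean with and without the correction, for a simple random sample drawn without replacement. The function name and the numbers are made up for illustration and are not part of samplics.

```python
import math
from typing import Optional


def srs_standard_error(sample_var: float, n: int, N: Optional[int] = None) -> float:
    """Standard error of a sample mean under simple random sampling.

    If the population size N is given, the finite population correction
    (1 - n/N) is applied; otherwise the population is treated as infinite.
    """
    fpc = 1.0 if N is None else 1.0 - n / N
    return math.sqrt(fpc * sample_var / n)


# Hypothetical numbers: sample variance 25, sample of 100 from a population of 400.
se_without_fpc = srs_standard_error(25.0, 100)
se_with_fpc = srs_standard_error(25.0, 100, N=400)
print(se_without_fpc, se_with_fpc)
```

Because a quarter of the population is sampled here, the corrected standard error is noticeably smaller than the uncorrected one.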

All of these designs need to be taken into account when calculating statistics, and when producing models. Only summary statistics are discussed in this document, and variances are calculated using Taylor series linearisation methods. For a more detailed introduction to calculating survey statistics using statistical software, see (Lohr 2022).
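
As a rough illustration of what Taylor series linearisation does for a weighted mean, the sketch below implements the textbook formula directly with pandas: each observation gets a linearised score, scores are summed within PSUs, and the between-PSU variance is added up across strata. It assumes with-replacement sampling of PSUs (no finite population correction) and is not the samplics implementation; the function name, column names and toy data are all hypothetical.

```python
import numpy as np
import pandas as pd


def linearised_mean(df, y, weight, stratum, psu):
    """Weighted mean and its Taylor-linearised standard error (no fpc)."""
    w = df[weight].to_numpy(dtype=float)
    yv = df[y].to_numpy(dtype=float)
    w_total = w.sum()
    estimate = (w * yv).sum() / w_total

    # Linearised score for the ratio sum(w * y) / sum(w).
    scores = df[[stratum, psu]].copy()
    scores["z"] = w * (yv - estimate) / w_total

    # Sum scores within each PSU, then take the between-PSU
    # variance within each stratum and add the strata together.
    psu_totals = scores.groupby([stratum, psu])["z"].sum()
    variance = 0.0
    for _, z in psu_totals.groupby(level=0):
        n_h = len(z)
        if n_h > 1:  # single-PSU strata contribute nothing in this sketch
            variance += n_h / (n_h - 1) * ((z - z.mean()) ** 2).sum()
    return estimate, float(np.sqrt(variance))


# Toy data: two strata, two PSUs per stratum, two observations per PSU.
toy = pd.DataFrame({
    "stratum": [1, 1, 1, 1, 2, 2, 2, 2],
    "psu":     [1, 1, 2, 2, 1, 1, 2, 2],
    "weight":  [2.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0, 3.0],
    "y":       [1, 0, 1, 1, 0, 0, 1, 0],
})
estimate, std_error = linearised_mean(toy, "y", "weight", "stratum", "psu")
print(estimate, std_error)
```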

The ecosystem of survey statistics packages is less mature in Python than in R or SAS. However, the samplics package provides a subset of the functionality.

Complex Survey Designs

For R and SAS, we give examples of summary statistics on a simple survey design which has only a finite population correction. Unfortunately, samplics cannot use an fpc on its own, with no PSU or strata, so we will instead demonstrate with a more complete (and realistic) survey design, using the NHANES (“National Health and Nutrition Examination Survey Data” 2010) dataset:

SDMVPSU	SDMVSTRA	WTMEC2YR	HI_CHOL	race	agecat	RIAGENDR
1	83	81528.77	0	2	(19,39]	1
1	84	14509.28	0	3	(0,19]	1
2	86	12041.64	0	3	(0,19]	1
2	75	21000.34	0	3	(59,Inf]	2
1	88	22633.58	0	1	(19,39]	1
2	85	74112.49	1	2	(39,59]	2

Summary Statistics

Mean

If we want to calculate a mean of a variable in a dataset using samplics, we need to create an estimator object using the estimation method we will use - here Taylor Series estimation - and the parameter we are estimating. Then, we can specify the survey design by passing columns which define our strata and PSUs, and a column to estimate:

import pandas as pd

from samplics import TaylorEstimator
from samplics.utils.types import PopParam

nhanes = pd.read_csv("../data/nhanes.csv")

mean_estimator = TaylorEstimator(PopParam.mean)

mean_estimator.estimate(
    y=nhanes["HI_CHOL"],
    samp_weight=nhanes["WTMEC2YR"],
    psu=nhanes["SDMVPSU"],
    stratum=nhanes["SDMVSTRA"],
    remove_nan=True,
)
print(mean_estimator.to_dataframe())

          _param  _estimate  _stderror      _lci      _uci       _cv
0  PopParam.mean   0.112143   0.005446  0.100598  0.123688  0.048562

Total

Calculating population totals can be done by changing the TaylorEstimator parameter to PopParam.total:

total_estimator = TaylorEstimator(PopParam.total)

total_estimator.estimate(
    y=nhanes["HI_CHOL"],
    samp_weight=nhanes["WTMEC2YR"],
    psu=nhanes["SDMVPSU"],
    stratum=nhanes["SDMVSTRA"],
    remove_nan=True,
)
print(total_estimator.to_dataframe())

           _param     _estimate  ...          _uci       _cv
0  PopParam.total  2.863525e+07  ...  3.291896e+07  0.070567

[1 rows x 6 columns]
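
The `...` in the output above is pandas shortening a wide frame to fit the console. One way to see every column, assuming `to_dataframe()` returns an ordinary pandas DataFrame (as the printed output suggests), is `DataFrame.to_string()`. The frame below is a stand-in with placeholder values so that the snippet runs on its own; it does not reproduce the real estimates.

```python
import pandas as pd

# Stand-in frame with the same column names as the samplics output.
# The values are placeholders, not real estimates.
wide = pd.DataFrame({
    "_param": ["PopParam.total"],
    "_estimate": [1.0],
    "_stderror": [1.0],
    "_lci": [1.0],
    "_uci": [1.0],
    "_cv": [1.0],
})

# to_string() renders the whole frame without eliding columns.
full_text = wide.to_string()
print(full_text)
```

With the real estimator, the equivalent call would be `print(total_estimator.to_dataframe().to_string())`.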

Ratios

Calculating population ratios can be done by changing the TaylorEstimator parameter to PopParam.ratio, and additionally specifying an x parameter in the estimate method:

ratio_estimator = TaylorEstimator(PopParam.ratio)

ratio_estimator.estimate(
    y=nhanes["HI_CHOL"],
    x=nhanes["RIAGENDR"],
    samp_weight=nhanes["WTMEC2YR"],
    psu=nhanes["SDMVPSU"],
    stratum=nhanes["SDMVSTRA"],
    remove_nan=True,
)
print(ratio_estimator.to_dataframe())

           _param  _estimate  _stderror      _lci      _uci       _cv
0  PopParam.ratio   0.074222   0.003715  0.066347  0.082097  0.050049

Proportions

Calculating proportions can be done by changing the TaylorEstimator parameter to PopParam.prop:

prop_estimator = TaylorEstimator(PopParam.prop)

prop_estimator.estimate(
    y=nhanes["agecat"],
    samp_weight=nhanes["WTMEC2YR"],
    psu=nhanes["SDMVPSU"],
    stratum=nhanes["SDMVSTRA"],
    remove_nan=True,
)
print(prop_estimator.to_dataframe())

          _param    _level  _estimate  _stderror      _lci      _uci       _cv
0  PopParam.prop    (0,19]   0.207749   0.006130  0.195054  0.221044  0.029506
1  PopParam.prop   (19,39]   0.293408   0.009561  0.273557  0.314077  0.032585
2  PopParam.prop   (39,59]   0.303290   0.004519  0.293795  0.312955  0.014901
3  PopParam.prop  (59,Inf]   0.195553   0.008093  0.178965  0.213280  0.041383

Quantiles

samplics currently does not have a method to calculate quantiles.
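
If only a point estimate is needed, a weighted quantile can be computed by hand from the weighted empirical distribution. The sketch below uses one common definition (the smallest value whose cumulative share of the weights reaches the requested level); other definitions interpolate between values and can give slightly different answers. It produces no standard error or confidence interval, and the function name and data are hypothetical rather than part of samplics.

```python
import numpy as np


def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative weight share is at least q."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Sort by value and accumulate the weights in that order.
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cum_share = np.cumsum(weights) / weights.sum()

    # First position where the cumulative share reaches q
    # (clamped to guard against floating point round-off at q = 1).
    idx = int(np.searchsorted(cum_share, q, side="left"))
    return float(values[min(idx, len(values) - 1)])


# Toy data: the largest value carries most of the weight.
toy_values = [10, 20, 30, 40]
weighted_median = weighted_quantile(toy_values, [1, 1, 1, 5], 0.5)
equal_weight_median = weighted_quantile(toy_values, [1, 1, 1, 1], 0.5)
print(weighted_median, equal_weight_median)
```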

Domain Estimations

We can perform domain estimations of different sub-populations by passing our domain column as a parameter to the estimate method:

mean_estimator = TaylorEstimator(PopParam.mean)

mean_estimator.estimate(
    y=nhanes["HI_CHOL"],
    samp_weight=nhanes["WTMEC2YR"],
    psu=nhanes["SDMVPSU"],
    stratum=nhanes["SDMVSTRA"],
    domain=nhanes["race"],
    remove_nan=True,
)
print(mean_estimator.to_dataframe())

          _param  _domain  _estimate  _stderror      _lci      _uci       _cv
0  PopParam.mean        1   0.101492   0.006246  0.088251  0.114732  0.061540
1  PopParam.mean        2   0.121649   0.006604  0.107649  0.135649  0.054288
2  PopParam.mean        3   0.078640   0.010385  0.056626  0.100655  0.132053
3  PopParam.mean        4   0.099679   0.024666  0.047389  0.151969  0.247458
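
The point estimate for each domain is a weighted mean taken over the observations in that domain, which can be checked by hand with a pandas group-by. The sketch below uses hypothetical toy data and column names and reproduces only the point estimates. Standard errors for domains should still come from the full survey design, which is why the domain is passed to `estimate` rather than the data being filtered beforehand.

```python
import pandas as pd

# Toy data: two domains with unequal weights.
toy = pd.DataFrame({
    "domain": ["a", "a", "b", "b"],
    "weight": [1.0, 3.0, 2.0, 2.0],
    "y":      [1, 0, 1, 0],
})

# Weighted mean per domain: sum(w * y) / sum(w) within each group.
weighted_sums = (toy["y"] * toy["weight"]).groupby(toy["domain"]).sum()
weight_totals = toy["weight"].groupby(toy["domain"]).sum()
domain_means = weighted_sums / weight_totals
print(domain_means)
```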

Session Info

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.4.3 (2025-02-28)
 os       Ubuntu 24.04.2 LTS
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  C.UTF-8
 ctype    C.UTF-8
 tz       Europe/London
 date     2025-03-13
 pandoc   NA (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 ! package * version date (UTC) lib source
 P survey  * 4.4-2   2024-03-20 [?] RSPM (R 4.4.0)

 [1] /home/michael/source/CAMIS/renv/library/linux-ubuntu-noble/R-4.4/x86_64-pc-linux-gnu
 [2] /opt/R/4.4.3/lib/R/library

 P ── Loaded and on-disk path mismatch.

──────────────────────────────────────────────────────────────────────────────

─ Python configuration ────────────────────────────────────────────────────────
 Python    3.12.3 (main, Feb  4 2025, 14:48:35) [GCC 13.3.0]
 samplics  0.4.22

References

Lohr, Sharon L. 2022. Sampling: Design and Analysis. 3rd ed. CRC Press, Taylor & Francis Group.

“National Health and Nutrition Examination Survey Data.” 2010. Centers for Disease Control and Prevention (CDC). https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=laboratory&CycleBeginYear=2009.