Survey Summary Statistics using Python

When conducting large-scale trials on samples of the population, it can be necessary to use a more complex sampling design than a simple random sample.

Whatever the sampling design, it needs to be taken into account when calculating statistics and when fitting models. This document discusses summary statistics only, with variances calculated using Taylor series linearisation. For a more detailed introduction to calculating survey statistics using statistical software, see (Lohr 2022).
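
To make the idea of Taylor series linearisation concrete, the following is a minimal sketch of the variance of a weighted mean under a stratified, clustered design. It uses the common with-replacement ("ultimate cluster") form with no finite population correction, and a small made-up sample; the data, the function name `linearised_mean`, and the tuple layout are illustrative assumptions, not part of samplics.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical toy sample: (stratum, psu, weight, y). Not NHANES data.
sample = [
    ("A", 1, 10.0, 1), ("A", 1, 12.0, 0),
    ("A", 2, 11.0, 0), ("A", 2, 9.0, 1),
    ("B", 1, 20.0, 0), ("B", 1, 18.0, 0),
    ("B", 2, 22.0, 1), ("B", 2, 19.0, 0),
]


def linearised_mean(rows):
    """Return the weighted mean and its Taylor-linearised standard error."""
    w_total = sum(w for _, _, w, _ in rows)
    mean = sum(w * y for _, _, w, y in rows) / w_total

    # Linearised variable for a ratio of totals, summed within each PSU.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for stratum, psu, w, y in rows:
        psu_totals[stratum][psu] += w * (y - mean) / w_total

    # Between-PSU variability within each stratum, added across strata.
    # Assumes at least two PSUs per stratum (otherwise n_h - 1 is zero).
    variance = 0.0
    for totals in psu_totals.values():
        n_h = len(totals)
        z_bar = sum(totals.values()) / n_h
        variance += n_h / (n_h - 1) * sum((z - z_bar) ** 2 for z in totals.values())
    return mean, sqrt(variance)


mean, stderr = linearised_mean(sample)
print(f"mean={mean:.4f}, stderr={stderr:.4f}")
```

The point of the sketch is that only the PSU-level totals within each stratum enter the variance, which is why the strata and PSU columns have to be supplied alongside the weights.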

The ecosystem of survey statistics packages is less mature in Python than in R or SAS; however, there is a package that provides a subset of the functionality: samplics.

Complex Survey Designs

For R and SAS, we give examples of summary statistics on a simple survey design that has only a finite population correction (fpc). samplics cannot use an fpc without PSUs or strata, so we instead demonstrate with a more complete (and realistic) survey design, using the NHANES (“National Health and Nutrition Examination Survey Data” 2010) dataset:

SDMVPSU  SDMVSTRA  WTMEC2YR  HI_CHOL  race  agecat    RIAGENDR
1        83        81528.77  0        2     (19,39]   1
1        84        14509.28  0        3     (0,19]    1
2        86        12041.64  0        3     (0,19]    1
2        75        21000.34  0        3     (59,Inf]  2
1        88        22633.58  0        1     (19,39]   1
2        85        74112.49  1        2     (39,59]   2

Summary Statistics

Mean

To calculate the mean of a variable with samplics, we first create an estimator object for the estimation method (here, Taylor series linearisation) and the parameter we are estimating. We then specify the survey design by passing the columns that define our strata and PSUs, together with the sampling weights and the column to estimate:

import pandas as pd

from samplics import TaylorEstimator
from samplics.utils.types import PopParam

nhanes = pd.read_csv("../data/nhanes.csv")

# Taylor series linearisation estimator for a population mean
mean_estimator = TaylorEstimator(PopParam.mean)

mean_estimator.estimate(
    y=nhanes["HI_CHOL"],             # variable to estimate
    samp_weight=nhanes["WTMEC2YR"],  # sampling weights
    psu=nhanes["SDMVPSU"],           # primary sampling units
    stratum=nhanes["SDMVSTRA"],      # strata
    remove_nan=True,                 # drop observations with missing values
)
print(mean_estimator.to_dataframe())
          _param  _estimate  _stderror      _lci      _uci       _cv
0  PopParam.mean   0.112143   0.005446  0.100598  0.123688  0.048562
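
The extra columns in this output can be checked by hand. The short sketch below, using only the figures printed above, recomputes the coefficient of variation as the standard error divided by the estimate, and backs out the multiplier implied by the confidence limits. The multiplier comes out a little above 2.1 rather than 1.96, which is consistent with a Student t critical value on a limited number of design degrees of freedom; that reading is an inference from the numbers, not something stated by the package output.

```python
# Values copied from the printed output above.
estimate, stderror = 0.112143, 0.005446
lci, uci = 0.100598, 0.123688

cv = stderror / estimate            # coefficient of variation
half_width = (uci - lci) / 2        # half-width of the confidence interval
multiplier = half_width / stderror  # critical value implied by the interval

print(f"cv={cv:.6f}, multiplier={multiplier:.3f}")
```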

Total

Calculating population totals can be done by changing the TaylorEstimator parameter to PopParam.total:

total_estimator = TaylorEstimator(PopParam.total)

total_estimator.estimate(
    y=nhanes["HI_CHOL"],
    samp_weight=nhanes["WTMEC2YR"],
    psu=nhanes["SDMVPSU"],
    stratum=nhanes["SDMVSTRA"],
    remove_nan=True,
)
print(total_estimator.to_dataframe())
           _param     _estimate  ...          _uci       _cv
0  PopParam.total  2.863525e+07  ...  3.291896e+07  0.070567

[1 rows x 6 columns]

Ratios

Calculating population ratios can be done by changing the TaylorEstimator parameter to PopParam.ratio, and additionally specifying an x parameter in the estimate method:

ratio_estimator = TaylorEstimator(PopParam.ratio)

ratio_estimator.estimate(
    y=nhanes["HI_CHOL"],
    x=nhanes["RIAGENDR"],
    samp_weight=nhanes["WTMEC2YR"],
    psu=nhanes["SDMVPSU"],
    stratum=nhanes["SDMVSTRA"],
    remove_nan=True,
)
print(ratio_estimator.to_dataframe())
           _param  _estimate  _stderror      _lci      _uci       _cv
0  PopParam.ratio   0.074222   0.003715  0.066347  0.082097  0.050049
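
For intuition, the point estimates behind the total, mean, and ratio are all built from weighted totals. The sketch below shows this on made-up values (not NHANES data); the helper `weighted_total` is a hypothetical name introduced for illustration and is not a samplics function. The mean is the special case of a ratio in which `x` is 1 for every unit.

```python
# Hypothetical toy values, not NHANES data.
weights = [10.0, 12.0, 11.0, 9.0]
y = [1, 0, 0, 1]
x = [1, 2, 2, 1]


def weighted_total(w, v):
    """Estimated population total: sum of weight * value."""
    return sum(wi * vi for wi, vi in zip(w, v))


total_y = weighted_total(weights, y)
ratio = total_y / weighted_total(weights, x)
mean = total_y / weighted_total(weights, [1] * len(y))  # ratio with x = 1

print(f"total={total_y}, ratio={ratio:.4f}, mean={mean:.4f}")
```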

Proportions

Calculating proportions can be done by changing the TaylorEstimator parameter to PopParam.prop:

prop_estimator = TaylorEstimator(PopParam.prop)

prop_estimator.estimate(
    y=nhanes["agecat"],
    samp_weight=nhanes["WTMEC2YR"],
    psu=nhanes["SDMVPSU"],
    stratum=nhanes["SDMVSTRA"],
    remove_nan=True,
)
print(prop_estimator.to_dataframe())
          _param    _level  _estimate  _stderror      _lci      _uci       _cv
0  PopParam.prop    (0,19]   0.207749   0.006130  0.195054  0.221044  0.029506
1  PopParam.prop   (19,39]   0.293408   0.009561  0.273557  0.314077  0.032585
2  PopParam.prop   (39,59]   0.303290   0.004519  0.293795  0.312955  0.014901
3  PopParam.prop  (59,Inf]   0.195553   0.008093  0.178965  0.213280  0.041383
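
The point estimate for each level is the share of the total weight that falls in that level. A minimal sketch on made-up values (not NHANES data) follows. It covers the point estimates only; note that the intervals printed above are not symmetric about the estimates, so they are evidently not a plain estimate plus or minus a multiple of the standard error, and this sketch does not attempt to reproduce them.

```python
from collections import defaultdict

# Hypothetical toy values, not NHANES data.
weights = [10.0, 12.0, 11.0, 9.0, 8.0]
levels = ["(0,19]", "(19,39]", "(0,19]", "(39,59]", "(19,39]"]

# Accumulate the sampling weight observed in each level.
level_weight = defaultdict(float)
for w, level in zip(weights, levels):
    level_weight[level] += w

# Each proportion is the level's weight divided by the total weight.
total = sum(weights)
proportions = {level: w / total for level, w in level_weight.items()}

for level, p in sorted(proportions.items()):
    print(level, round(p, 4))
```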

Quantiles

samplics currently does not have a method to calculate quantiles.
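
If only a point estimate is needed, a weighted quantile can be computed by hand from the weighted empirical distribution. The sketch below is one simple convention (the smallest value whose cumulative weight share reaches `q`, with no interpolation); the function `weighted_quantile` is a hypothetical helper, not part of samplics, and it does not provide a design-based standard error or confidence interval.

```python
def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative weight share reaches q (0 < q <= 1)."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    cumulative = 0.0
    for value, w in pairs:
        cumulative += w
        if cumulative >= q * total:
            return value
    return pairs[-1][0]  # guard against floating-point shortfall


# Hypothetical toy values: the unit with value 4.0 carries most of the weight.
values = [5.0, 1.0, 3.0, 2.0, 4.0]
weights = [1.0, 1.0, 1.0, 1.0, 6.0]
median = weighted_quantile(values, weights, 0.5)
print(median)
```

Other quantile definitions interpolate between adjacent values, so results from R or SAS may differ slightly from this convention.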

Domain Estimations

We can perform domain estimations of different sub-populations by passing our domain column as a parameter to the estimate method:

mean_estimator = TaylorEstimator(PopParam.mean)

mean_estimator.estimate(
    y=nhanes["HI_CHOL"],
    samp_weight=nhanes["WTMEC2YR"],
    psu=nhanes["SDMVPSU"],
    stratum=nhanes["SDMVSTRA"],
    domain=nhanes["race"],
    remove_nan=True,
)
print(mean_estimator.to_dataframe())
          _param  _domain  _estimate  _stderror      _lci      _uci       _cv
0  PopParam.mean        1   0.101492   0.006246  0.088251  0.114732  0.061540
1  PopParam.mean        2   0.121649   0.006604  0.107649  0.135649  0.054288
2  PopParam.mean        3   0.078640   0.010385  0.056626  0.100655  0.132053
3  PopParam.mean        4   0.099679   0.024666  0.047389  0.151969  0.247458
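
Conceptually, a domain mean is a ratio of two weighted totals in which units outside the domain contribute zero. The sketch below shows the point estimate on made-up values (not NHANES data); `domain_mean` is a hypothetical helper, not a samplics function. Keeping every row and zeroing out units with an indicator, rather than filtering the data first, leaves the strata and PSUs intact, which is what a design-based variance calculation needs.

```python
# Hypothetical toy sample, not NHANES data.
weights = [10.0, 12.0, 11.0, 9.0]
y = [1, 0, 0, 1]
domain = [1, 2, 1, 2]


def domain_mean(w, v, d, level):
    """Weighted mean of v among units whose domain equals level."""
    indicator = [1.0 if di == level else 0.0 for di in d]
    numerator = sum(wi * vi * ii for wi, vi, ii in zip(w, v, indicator))
    denominator = sum(wi * ii for wi, ii in zip(w, indicator))
    return numerator / denominator


means = {level: domain_mean(weights, y, domain, level) for level in sorted(set(domain))}
print(means)
```
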

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.4.0 (2024-04-24)
 os       Ubuntu 22.04.5 LTS
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  C.UTF-8
 ctype    C.UTF-8
 tz       UTC
 date     2024-10-10
 pandoc   3.2 @ /opt/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 ! package * version date (UTC) lib source
 P survey  * 4.4-2   2024-03-20 [?] RSPM (R 4.4.0)

 [1] /home/runner/work/CAMIS/CAMIS/renv/library/linux-ubuntu-jammy/R-4.4/x86_64-pc-linux-gnu
 [2] /opt/R/4.4.0/lib/R/library

 P ── Loaded and on-disk path mismatch.

──────────────────────────────────────────────────────────────────────────────
─ Python configuration ────────────────────────────────────────────────────────
 Python    3.12.7 (main, Oct  1 2024, 15:17:55) [GCC 11.4.0]
 samplics  0.4.22

References

Lohr, Sharon L. 2022. Sampling: Design and Analysis. 3rd ed. CRC Press, Taylor & Francis Group.
“National Health and Nutrition Examination Survey Data.” 2010. Centers for Disease Control and Prevention (CDC). https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=laboratory&CycleBeginYear=2009.