SDMVPSU | SDMVSTRA | WTMEC2YR | HI_CHOL | race | agecat | RIAGENDR |
---|---|---|---|---|---|---|
1 | 83 | 81528.77 | 0 | 2 | (19,39] | 1 |
1 | 84 | 14509.28 | 0 | 3 | (0,19] | 1 |
2 | 86 | 12041.64 | 0 | 3 | (0,19] | 1 |
2 | 75 | 21000.34 | 0 | 3 | (59,Inf] | 2 |
1 | 88 | 22633.58 | 0 | 1 | (19,39] | 1 |
2 | 85 | 74112.49 | 1 | 2 | (39,59] | 2 |
Survey Summary Statistics using Python
When conducting large-scale trials on samples of the population, it can be necessary to use a more complex sampling design than a simple random sample. Such designs commonly involve one or more of the following:
- Weighting – If smaller sub-populations are sampled more heavily to increase precision, these observations must be weighted in the analysis so that each group counts in proportion to its share of the population.
- Finite population correction – When the sample makes up a sizeable fraction of a finite population, estimates vary less than they would for a small sample from the same population, and the variance is corrected downwards accordingly.
- Stratification – Dividing a population into sub-groups and sampling from each group. This protects against obtaining a very poor sample (e.g. one with under- or over-represented groups), can give samples of a known precision, and gives more precise estimates of population means and totals.
- Clustering – Dividing a population into sub-groups and sampling only some of those groups. This gives lower precision, but can be much more convenient and cheaper. For example, when surveying school children you may sample only a subset of schools, to avoid travelling to a school to interview a single child.
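As a rough illustration of the first two ideas, the sketch below computes an unweighted and a weighted mean of HI_CHOL from the six NHANES rows shown in the table at the top of this page, and then applies a finite population correction to the standard error of a simple random sample. The function names and the numbers in the fpc example (n = 100 sampled from N = 400, with standard deviation 2) are made up for illustration; samplics is not used here.

```python
import math

# (WTMEC2YR, HI_CHOL) pairs copied from the six example rows of the NHANES table
rows = [
    (81528.77, 0),
    (14509.28, 0),
    (12041.64, 0),
    (21000.34, 0),
    (22633.58, 0),
    (74112.49, 1),
]


def unweighted_mean(rows):
    # Every observation counts equally
    return sum(y for _, y in rows) / len(rows)


def weighted_mean(rows):
    # Each observation counts in proportion to its sampling weight
    return sum(w * y for w, y in rows) / sum(w for w, _ in rows)


def srs_standard_error(s, n, N=None):
    """Standard error of a mean under simple random sampling.

    If the population size N is given, the finite population
    correction sqrt(1 - n / N) is applied.
    """
    se = s / math.sqrt(n)
    if N is not None:
        se *= math.sqrt(1 - n / N)
    return se


print(round(unweighted_mean(rows), 4))
print(round(weighted_mean(rows), 4))
print(srs_standard_error(s=2.0, n=100))         # no correction
print(srs_standard_error(s=2.0, n=100, N=400))  # with fpc, smaller
```

With only six rows the two means differ a lot, because the single respondent with HI_CHOL = 1 carries one of the largest weights.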
All of these design features need to be taken into account when calculating statistics and when fitting models. This document discusses only summary statistics, with variances calculated using Taylor series linearisation. For a more detailed introduction to calculating survey statistics using statistical software, see (Lohr 2022).
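To make the variance method more concrete, here is a minimal pure-Python sketch of Taylor series linearisation for a weighted mean under a stratified design, with PSUs treated as sampled with replacement (the usual "ultimate cluster" approximation). It is not the samplics implementation: the function name `linearised_mean` and the toy data are hypothetical, and no finite population correction is applied.

```python
import math
from collections import defaultdict


def linearised_mean(rows):
    """Weighted mean and its Taylor-linearised standard error.

    rows: iterable of (stratum, psu, weight, y) tuples.
    Assumes PSUs are sampled with replacement within strata and that
    every stratum contains at least two PSUs.
    """
    rows = list(rows)
    sum_w = sum(w for _, _, w, _ in rows)
    mean = sum(w * y for _, _, w, y in rows) / sum_w

    # Linearised variable for a ratio of totals,
    # z_i = w_i * (y_i - mean) / sum(w), accumulated to PSU totals
    psu_totals = defaultdict(lambda: defaultdict(float))
    for stratum, psu, w, y in rows:
        psu_totals[stratum][psu] += w * (y - mean) / sum_w

    # Between-PSU variance within each stratum, summed over strata
    variance = 0.0
    for totals in psu_totals.values():
        z = list(totals.values())
        n = len(z)
        z_bar = sum(z) / n
        variance += n / (n - 1) * sum((v - z_bar) ** 2 for v in z)

    return mean, math.sqrt(variance)


# Toy design: two strata, two PSUs each, equal weights
toy = [
    ("A", 1, 1.0, 1), ("A", 1, 1.0, 0),
    ("A", 2, 1.0, 0), ("A", 2, 1.0, 0),
    ("B", 1, 1.0, 1), ("B", 1, 1.0, 1),
    ("B", 2, 1.0, 0), ("B", 2, 1.0, 1),
]
mean, se = linearised_mean(toy)
print(mean, round(se, 4))
```

The key step is that variability is measured between PSU totals within each stratum, not between individual observations.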
The ecosystem of survey statistics packages is less mature in Python than in R or SAS; however, the `samplics` package provides a subset of the functionality.
Complex Survey Designs
For R and SAS, we give examples of summary statistics on a simple survey design that has only a finite population correction. Unfortunately, `samplics` cannot apply an fpc without PSUs or strata also being specified, so we instead demonstrate with a more complete (and realistic) survey design, using the NHANES (“National Health and Nutrition Examination Survey Data” 2010) dataset. A few of its rows are shown in the table at the top of this page: SDMVPSU and SDMVSTRA identify the PSU and stratum of each respondent, and WTMEC2YR is the sampling weight.
Summary Statistics
Mean
If we want to calculate the mean of a variable in a dataset using `samplics`, we need to create an estimator object from the estimation method we will use (here, Taylor series estimation) and the parameter we are estimating. Then, we can specify the survey design by passing the columns which define our sampling weights, strata and PSUs, along with the column to estimate:
import numpy as np
import pandas as pd
from samplics import TaylorEstimator
from samplics.utils.types import PopParam

nhanes = pd.read_csv("../data/nhanes.csv")

# Estimator for a mean, with variances from Taylor series linearisation
mean_estimator = TaylorEstimator(PopParam.mean)

mean_estimator.estimate(
    y=nhanes["HI_CHOL"],             # variable to estimate
    samp_weight=nhanes["WTMEC2YR"],  # sampling weights
    psu=nhanes["SDMVPSU"],           # primary sampling units
    stratum=nhanes["SDMVSTRA"],      # strata
    remove_nan=True,                 # remove missing values
)

print(mean_estimator.to_dataframe())
_param _estimate _stderror _lci _uci _cv
0 PopParam.mean 0.112143 0.005446 0.100598 0.123688 0.048562
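The columns of this output are related in simple ways, which the following sketch checks using only the numbers printed above: `_cv` is the standard error divided by the estimate, and the confidence limits sit symmetrically about the estimate. The implied multiplier of roughly 2.12 (rather than 1.96) suggests a t-based interval; that reading is inferred from the numbers and is not taken from the samplics documentation.

```python
# Values copied from the printed output for PopParam.mean
estimate = 0.112143
stderror = 0.005446
lci = 0.100598
uci = 0.123688

# Coefficient of variation: standard error relative to the estimate
cv = stderror / estimate

# Half-widths of the interval, expressed in standard errors
lower_multiplier = (estimate - lci) / stderror
upper_multiplier = (uci - estimate) / stderror

print(round(cv, 6))
print(round(lower_multiplier, 2), round(upper_multiplier, 2))
```

Small differences in the last decimal place from the printed `_cv` are expected, since the inputs here are already rounded.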
Total
Calculating population totals can be done by changing the `TaylorEstimator` parameter to `PopParam.total`:
total_estimator = TaylorEstimator(PopParam.total)

total_estimator.estimate(
    y=nhanes["HI_CHOL"],
    samp_weight=nhanes["WTMEC2YR"],
    psu=nhanes["SDMVPSU"],
    stratum=nhanes["SDMVSTRA"],
    remove_nan=True,
)

print(total_estimator.to_dataframe())
_param _estimate ... _uci _cv
0 PopParam.total 2.863525e+07 ... 3.291896e+07 0.070567
[1 rows x 6 columns]
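For intuition, the point estimate of a total is the weighted sum of the variable. The sketch below applies this to the six example rows from the NHANES table at the top of this page (not the full dataset), so its result is far smaller than the estimate above. Because HI_CHOL is a 0/1 indicator, the full-data estimate of about 2.86e+07 can be read as roughly 28.6 million people in the population with high cholesterol. The helper name `weighted_total` is hypothetical.

```python
# (WTMEC2YR, HI_CHOL) pairs from the six example rows of the NHANES table
rows = [
    (81528.77, 0),
    (14509.28, 0),
    (12041.64, 0),
    (21000.34, 0),
    (22633.58, 0),
    (74112.49, 1),
]


def weighted_total(rows):
    # Each respondent stands in for `weight` members of the population
    return sum(w * y for w, y in rows)


print(weighted_total(rows))
```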
Ratios
Calculating population ratios can be done by changing the `TaylorEstimator` parameter to `PopParam.ratio`, and additionally specifying an `x` parameter in the `estimate` method:
ratio_estimator = TaylorEstimator(PopParam.ratio)

ratio_estimator.estimate(
    y=nhanes["HI_CHOL"],
    x=nhanes["RIAGENDR"],
    samp_weight=nhanes["WTMEC2YR"],
    psu=nhanes["SDMVPSU"],
    stratum=nhanes["SDMVSTRA"],
    remove_nan=True,
)

print(ratio_estimator.to_dataframe())
_param _estimate _stderror _lci _uci _cv
0 PopParam.ratio 0.074222 0.003715 0.066347 0.082097 0.050049
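The ratio estimate divides one weighted total by another. As a small sketch, the code below does this on the six example rows from the NHANES table at the top of this page (again, not the full dataset), with HI_CHOL as the numerator and RIAGENDR as the denominator to mirror the call above. The helper names are hypothetical. A weighted mean is the special case where x is 1 for every row.

```python
# (WTMEC2YR, HI_CHOL, RIAGENDR) from the six example rows of the NHANES table
rows = [
    (81528.77, 0, 1),
    (14509.28, 0, 1),
    (12041.64, 0, 1),
    (21000.34, 0, 2),
    (22633.58, 0, 1),
    (74112.49, 1, 2),
]


def weighted_ratio(rows):
    # Ratio of the weighted total of y to the weighted total of x
    numerator = sum(w * y for w, y, _ in rows)
    denominator = sum(w * x for w, _, x in rows)
    return numerator / denominator


def weighted_mean(rows):
    # Special case of the ratio with x fixed at 1
    return weighted_ratio([(w, y, 1) for w, y, _ in rows])


print(round(weighted_ratio(rows), 4))
print(round(weighted_mean(rows), 4))
```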
Proportions
Calculating proportions can be done by changing the `TaylorEstimator` parameter to `PopParam.prop`:
prop_estimator = TaylorEstimator(PopParam.prop)

prop_estimator.estimate(
    y=nhanes["agecat"],
    samp_weight=nhanes["WTMEC2YR"],
    psu=nhanes["SDMVPSU"],
    stratum=nhanes["SDMVSTRA"],
    remove_nan=True,
)

prop_estimator.to_dataframe()
_param _level _estimate _stderror _lci _uci _cv
0 PopParam.prop (0,19] 0.207749 0.006130 0.195054 0.221044 0.029506
1 PopParam.prop (19,39] 0.293408 0.009561 0.273557 0.314077 0.032585
2 PopParam.prop (39,59] 0.303290 0.004519 0.293795 0.312955 0.014901
3 PopParam.prop (59,Inf] 0.195553 0.008093 0.178965 0.213280 0.041383
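Unlike the mean output, these confidence limits are not symmetric about the estimate. The numbers are consistent with an interval built on the logit scale and transformed back, which the sketch below reproduces for the first row. Two things here are assumptions, not documented samplics behaviour: that a logit transform is used, and the multiplier 2.1199 (about the 97.5% quantile of a t distribution with 16 degrees of freedom), which was chosen because it matches the interval width in the mean output earlier on this page.

```python
import math


def logit_interval(p, se, multiplier):
    """Confidence interval for a proportion, built on the logit scale."""
    logit = math.log(p / (1 - p))
    # Delta method: the standard error of logit(p) is se / (p * (1 - p))
    half_width = multiplier * se / (p * (1 - p))
    # Transform both limits back to the probability scale
    lower = 1 / (1 + math.exp(-(logit - half_width)))
    upper = 1 / (1 + math.exp(-(logit + half_width)))
    return lower, upper


# First row of the output above: age category (0,19]
lower, upper = logit_interval(p=0.207749, se=0.006130, multiplier=2.1199)
print(round(lower, 4), round(upper, 4))
```

Working on the logit scale keeps both limits inside (0, 1), which a symmetric interval cannot guarantee for proportions near 0 or 1.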
Quantiles
`samplics` currently does not have a method to calculate quantiles.
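If only a point estimate is needed, a weighted quantile can be computed by hand. The sketch below uses one common definition (the smallest value at which the cumulative share of the weights reaches q). It is not part of samplics, the function name `weighted_quantile` is hypothetical, and it does not produce design-based standard errors or confidence intervals.

```python
def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative weight share is at least q (0 < q <= 1)."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= q * total:
            return value
    return pairs[-1][0]  # guard against floating-point round-off


values = [10, 20, 30, 40]
print(weighted_quantile(values, [1, 1, 1, 1], 0.5))  # equal weights
print(weighted_quantile(values, [1, 1, 1, 5], 0.5))  # heavy weight on 40
```

Other quantile definitions (for example, interpolating between neighbouring values) give slightly different answers, so results may not match R or SAS exactly.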
Domain Estimations
We can perform domain estimation for different sub-populations by passing our domain column as a parameter to the `estimate` method:
mean_estimator = TaylorEstimator(PopParam.mean)

mean_estimator.estimate(
    y=nhanes["HI_CHOL"],
    samp_weight=nhanes["WTMEC2YR"],
    psu=nhanes["SDMVPSU"],
    stratum=nhanes["SDMVSTRA"],
    domain=nhanes["race"],
    remove_nan=True,
)

mean_estimator.to_dataframe()
_param _domain _estimate _stderror _lci _uci _cv
0 PopParam.mean 1 0.101492 0.006246 0.088251 0.114732 0.061540
1 PopParam.mean 2 0.121649 0.006604 0.107649 0.135649 0.054288
2 PopParam.mean 3 0.078640 0.010385 0.056626 0.100655 0.132053
3 PopParam.mean 4 0.099679 0.024666 0.047389 0.151969 0.247458
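Domain estimation is not the same as filtering the data to one group first: the full design (all strata and PSUs) is kept, and rows outside the domain simply contribute zero. The sketch below shows this for a domain mean, using Taylor linearisation with PSUs treated as sampled with replacement. It is a pure-Python illustration on made-up data, not the samplics implementation, and `domain_mean` is a hypothetical name.

```python
import math
from collections import defaultdict


def domain_mean(rows, domain):
    """Weighted mean and linearised standard error for one domain.

    rows: iterable of (stratum, psu, weight, y, domain_label) tuples.
    Assumes at least two PSUs in every stratum.
    """
    rows = list(rows)
    in_domain = [(w, y) for _, _, w, y, d in rows if d == domain]
    sum_w = sum(w for w, _ in in_domain)
    mean = sum(w * y for w, y in in_domain) / sum_w

    # Every PSU in the design gets a total; rows outside the domain add zero
    psu_totals = defaultdict(lambda: defaultdict(float))
    for stratum, psu, w, y, d in rows:
        indicator = 1.0 if d == domain else 0.0
        psu_totals[stratum][psu] += indicator * w * (y - mean) / sum_w

    # Between-PSU variance within each stratum, summed over strata
    variance = 0.0
    for totals in psu_totals.values():
        z = list(totals.values())
        n = len(z)
        z_bar = sum(z) / n
        variance += n / (n - 1) * sum((v - z_bar) ** 2 for v in z)
    return mean, math.sqrt(variance)


# Toy design: two strata, two PSUs each; PSU 2 of stratum B has no urban rows
toy = [
    ("A", 1, 2.0, 1, "urban"), ("A", 1, 2.0, 0, "rural"),
    ("A", 2, 2.0, 0, "urban"), ("A", 2, 2.0, 1, "rural"),
    ("B", 1, 1.0, 1, "urban"), ("B", 1, 1.0, 1, "rural"),
    ("B", 2, 1.0, 0, "rural"), ("B", 2, 1.0, 0, "rural"),
]
urban_mean, urban_se = domain_mean(toy, "urban")
print(urban_mean, round(urban_se, 4))
```

Note that PSU 2 of stratum B still enters the variance calculation with a total of zero; dropping it by filtering first would leave that stratum with a single PSU and no way to estimate its variance.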
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.4.0 (2024-04-24)
os Ubuntu 22.04.5 LTS
system x86_64, linux-gnu
ui X11
language (EN)
collate C.UTF-8
ctype C.UTF-8
tz UTC
date 2024-10-25
pandoc 3.2 @ /opt/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
! package * version date (UTC) lib source
P survey * 4.4-2 2024-03-20 [?] RSPM (R 4.4.0)
[1] /home/runner/work/CAMIS/CAMIS/renv/library/linux-ubuntu-jammy/R-4.4/x86_64-pc-linux-gnu
[2] /opt/R/4.4.0/lib/R/library
P ── Loaded and on-disk path mismatch.
──────────────────────────────────────────────────────────────────────────────
─ Python configuration ────────────────────────────────────────────────────────
Python 3.12.7 (main, Oct 1 2024, 15:17:55) [GCC 11.4.0]
samplics 0.4.22