This document will compare the survey summary statistics functionality in SAS (available through SAS/STAT), R (available from the {survey} package), and Python (available from the samplics package), highlighting differences in methods and results. Only the default Taylor series linearisation method for calculating variances is used in all languages. A more detailed comparison between R and SAS for specific methods and use-cases is available in (“Software for Analysis of YRBS Data” 2017), (So et al. 2020), or (Damico 2009). For a general guide to survey statistics, which has companion guides for both R and SAS, see (Lohr 2022).
Result Comparison
The following table shows different survey summary statistics, the capabilities of each language, and whether or not the results match. Each analysis also includes calculating the standard error and confidence intervals.
**For confidence limits of proportions near 0 and 1, survey::svyciprop can be more accurate than confint in R, but does not match other software.
For the full R, SAS, and Python code and results used for this comparison, see below:
Show Code
R
library(survey)data("nhanes")nhanes_design <-svydesign(data = nhanes,id =~SDMVPSU, # Specify the PSU/cluster columnstrata =~SDMVSTRA, # The stratification columnweights =~WTMEC2YR, # The weighting columnnest =TRUE# Allows for PSUs with the same name nested within different strata)# Mean of HI_CHOLhi_chol_mean <-svymean(~HI_CHOL, nhanes_design, na.rm=TRUE)# Sum of HI_CHOLhi_chol_sum <-svytotal(~HI_CHOL, nhanes_design, na.rm=TRUE)# Ratio of HI_CHOL / RIAGENDRhi_chol_ratio <-svyratio(numerator =~HI_CHOL,denominator =~RIAGENDR, nhanes_design,na.rm=TRUE,ci=TRUE,se=TRUE,separate=FALSE)# Proportion of different AGECAT valuesagecat_props <-svymean(~agecat, nhanes_design, na.rm=TRUE)# Quantiles of HI_CHOLhi_chol_quart <-svyquantile(~HI_CHOL, nhanes_design,quantiles=c(0.025, 0.5, 0.975),na.rm=TRUE,ci=TRUE)# Domain analysis of mean of HI_CHOL by race, with design effecthi_chol_mean_by_race <-svyby(~HI_CHOL, ~race, nhanes_design, svymean, na.rm=TRUE, deff="replace")print(list("Mean of HI_CHOL"=coef(hi_chol_mean),"SE of Mean HI_CHOL"=SE(hi_chol_mean),"CL of Mean HI_CHOL"=confint(hi_chol_mean, df=degf(nhanes_design)),"Sum of HI_CHOL"=coef(hi_chol_sum),"SE of Sum HI_CHOL"=SE(hi_chol_sum),"CL of Sum HI_CHOL"=confint(hi_chol_sum, df=degf(nhanes_design)),"Ratio of HI_CHOL / RIAGENDR"=coef(hi_chol_ratio),"SE of Ratio HI_CHOL / RIAGENDR"=SE(hi_chol_ratio),"CL of Ratio HI_CHOL / RIAGENDR"=confint(hi_chol_ratio, df=degf(nhanes_design)),"Proportion of AGECAT"=coef(agecat_props),"SE of Proportion AGECAT"=SE(agecat_props),"CL of Proportion AGECAT"=confint(agecat_props, df=degf(nhanes_design)),"Quantiles of HI_CHOL"=coef(hi_chol_quart),"SE of Quantiles HI_CHOL"=SE(hi_chol_quart),"CL of Quantiles HI_CHOL"=confint(hi_chol_quart, df=degf(nhanes_design)),"Mean of HI_CHOL by race"=coef(hi_chol_mean_by_race),"SE of HI_CHOL by race"=SE(hi_chol_mean_by_race),"CL of HI_CHOL by race"=confint(hi_chol_mean_by_race, df=degf(nhanes_design)),"Design Effect of HI_CHOL by race"= hi_chol_mean_by_race$DEff.HI_CHOL))
$`Mean of HI_CHOL`
HI_CHOL
0.112143
$`SE of Mean HI_CHOL`
HI_CHOL
HI_CHOL 0.00544584
$`CL of Mean HI_CHOL`
2.5 % 97.5 %
HI_CHOL 0.1005983 0.1236876
$`Sum of HI_CHOL`
HI_CHOL
28635245
$`SE of Sum HI_CHOL`
HI_CHOL
HI_CHOL 2020711
$`CL of Sum HI_CHOL`
2.5 % 97.5 %
HI_CHOL 24351530 32918961
$`Ratio of HI_CHOL / RIAGENDR`
HI_CHOL/RIAGENDR
0.07422209
$`SE of Ratio HI_CHOL / RIAGENDR`
HI_CHOL/RIAGENDR
0.003714728
$`CL of Ratio HI_CHOL / RIAGENDR`
2.5 % 97.5 %
HI_CHOL/RIAGENDR 0.06634722 0.08209696
$`Proportion of AGECAT`
agecat(0,19] agecat(19,39] agecat(39,59] agecat(59,Inf]
0.2077495 0.2934079 0.3032896 0.1955530
$`SE of Proportion AGECAT`
agecat(0,19] agecat(19,39] agecat(39,59] agecat(59,Inf]
0.006129950 0.009560692 0.004519463 0.008092578
$`CL of Proportion AGECAT`
2.5 % 97.5 %
agecat(0,19] 0.1947546 0.2207444
agecat(19,39] 0.2731401 0.3136756
agecat(39,59] 0.2937088 0.3128704
agecat(59,Inf] 0.1783975 0.2127085
$`Quantiles of HI_CHOL`
HI_CHOL.0.025 HI_CHOL.0.5 HI_CHOL.0.975
0 0 1
$`SE of Quantiles HI_CHOL`
HI_CHOL.0.025 HI_CHOL.0.5 HI_CHOL.0.975
0.2358596 0.2358596 0.0000000
$`CL of Quantiles HI_CHOL`
l u
HI_CHOL.0.025 0 1
HI_CHOL.0.5 0 1
HI_CHOL.0.975 1 1
$`Mean of HI_CHOL by race`
1 2 3 4
0.10149167 0.12164921 0.07864006 0.09967861
$`SE of HI_CHOL by race`
[1] 0.006245843 0.006604134 0.010384645 0.024666227
$`CL of HI_CHOL by race`
2.5 % 97.5 %
1 0.08825107 0.1147323
2 0.10764907 0.1356493
3 0.05662560 0.1006545
4 0.04738854 0.1519687
$`Design Effect of HI_CHOL by race`
[1] 1.082734 1.407822 2.091156 3.098290
SAS
* Mean, sum quantile of HI_CHOL;proc surveymeans data=nhanes mean sum clm quantile=(0.025 0.5 0.975); cluster SDMVPSU; strata SDMVSTRA; weight WTMEC2YR; var HI_CHOL;run;* Ratio of HI_CHOL / RIAGENDR;proc surveymeans data=nhanes; cluster SDMVPSU; strata SDMVSTRA; weight WTMEC2YR; ratio HI_CHOL / RIAGENDR;run;* Proportions of agecat;proc surveyfreq data=nhanes; cluster SDMVPSU; strata SDMVSTRA; weight WTMEC2YR; table agecat / cl;run;* Mean and DEFF of HI_CHOL by race;proc surveymeans data=nhanes mean deff; cluster SDMVPSU; strata SDMVSTRA; weight WTMEC2YR; domain race; var HI_CHOL;run;
The SURVEYMEANS Procedure
Data Summary
Number of Strata 15
Number of Clusters 31
Number of Observations 8591
Sum of Weights 276536446
Statistics
Std Error Std Error
Variable Mean of Mean 95% CL for Mean Sum of Sum 95% CL for Sum
--------------------------------------------------------------------------------------------------------------------------
HI_CHOL 0.112143 0.005446 0.10059829 0.12368762 28635245 2020711 24351529.8 32918960.7
--------------------------------------------------------------------------------------------------------------------------
Quantiles
Std
Variable Percentile Estimate Error 95% Confidence Limits
---------------------------------------------------------------------------------
HI_CHOL 2.5 0 0.024281 -0.0514730 0.05147298
50 Median 0 0.024281 -0.0514730 0.05147298
97.5 0.777070 0.024281 0.7255973 0.82854324
---------------------------------------------------------------------------------
The SURVEYMEANS Procedure
Data Summary
Number of Strata 15
Number of Clusters 31
Number of Observations 8591
Sum of Weights 276536446
Statistics
Std Error
Variable N Mean of Mean 95% CL for Mean
---------------------------------------------------------------------------------
HI_CHOL 7846 0.112143 0.005446 0.10059829 0.12368762
RIAGENDR 8591 1.512019 0.005302 1.50077977 1.52325807
---------------------------------------------------------------------------------
Ratio Analysis
Std
Numerator Denominator N Ratio Error 95% CL for Ratio
----------------------------------------------------------------------------------------------
HI_CHOL RIAGENDR 7846 0.074222 0.003715 0.06634722 0.08209696
----------------------------------------------------------------------------------------------
The SURVEYFREQ Procedure
Data Summary
Number of Strata 15
Number of Clusters 31
Number of Observations 8591
Sum of Weights 276536446
Table of agecat
Weighted Std Err of Std Err of 95% Confidence Limits
agecat Frequency Frequency Wgt Freq Percent Percent for Percent
-------------------------------------------------------------------------------------------------------
(0,19] 2532 57450307 3043819 20.7749 0.6130 19.4755 22.0744
(19,39] 2033 81137975 3692818 29.3408 0.9561 27.3140 31.3676
(39,59] 2021 83870623 4853936 30.3290 0.4519 29.3709 31.2870
(59,Inf] 2005 54077541 4284296 19.5553 0.8093 17.8398 21.2709
Total 8591 276536446 13935730 100.0000
-------------------------------------------------------------------------------------------------------
The SURVEYMEANS Procedure
Data Summary
Number of Strata 15
Number of Clusters 31
Number of Observations 8591
Sum of Weights 276536446
Statistics
Std Error Design
Variable Mean of Mean Effect
--------------------------------------------------------
HI_CHOL 0.112143 0.005446 2.336725
--------------------------------------------------------
The SURVEYMEANS Procedure
Statistics for race Domains
Std Error Design
race Variable Mean of Mean Effect
------------------------------------------------------------------------
1 HI_CHOL 0.101492 0.006246 1.082734
2 HI_CHOL 0.121649 0.006604 1.407822
3 HI_CHOL 0.078640 0.010385 2.091156
4 HI_CHOL 0.099679 0.024666 3.098290
------------------------------------------------------------------------
Python
import pandas as pdfrom samplics import TaylorEstimatorfrom samplics.utils.types import PopParamnhanes = pd.read_csv("../data/nhanes.csv")nhanes_design_kwargs =dict( psu=nhanes["SDMVPSU"], stratum=nhanes["SDMVSTRA"], samp_weight=nhanes["WTMEC2YR"], remove_nan=True,)# Mean of HI_CHOLmean_estimator = TaylorEstimator(PopParam.mean)mean_estimator.estimate(nhanes["HI_CHOL"], **nhanes_design_kwargs)hi_chol_means = mean_estimator.to_dataframe()# Sum of HI_CHOLtotal_estimator = TaylorEstimator(PopParam.total)total_estimator.estimate(nhanes["HI_CHOL"], **nhanes_design_kwargs)hi_chol_totals = total_estimator.to_dataframe()# Ratio of HI_CHOL / RIAGENDRratio_estimator = TaylorEstimator(PopParam.ratio)ratio_estimator.estimate( y=nhanes["HI_CHOL"], x=nhanes["RIAGENDR"], **nhanes_design_kwargs)hi_chol_ratio = ratio_estimator.to_dataframe()# Proportion of different AGECAT valuesprop_estimator = TaylorEstimator(PopParam.prop)prop_estimator.estimate(nhanes["agecat"], **nhanes_design_kwargs)agecat_prop = prop_estimator.to_dataframe()# Quantiles of HI_CHOL# NA# Domain analysis of mean of HI_CHOL by race, with design effectmean_estimator = TaylorEstimator(PopParam.mean)mean_estimator.estimate( nhanes["HI_CHOL"],**nhanes_design_kwargs, domain=nhanes["race"], deff=True, # Design effect param currently has no effect)hi_chol_domain_means = mean_estimator.to_dataframe()ag_dict = agecat_prop.set_index("_level").to_dict()hc_dict = hi_chol_domain_means.set_index("_domain").to_dict()print(f""" Mean of HI_CHOL: {hi_chol_means["_estimate"][0]} SE of Mean HI_CHOL: {hi_chol_means["_stderror"][0]} CL of Mean HI_CHOL: {(hi_chol_means["_lci"][0], hi_chol_means["_uci"][0])} Sum of HI_CHOL: {hi_chol_totals["_estimate"][0]} SE of Sum HI_CHOL: {hi_chol_totals["_stderror"][0]} CL of Sum HI_CHOL: {(hi_chol_totals["_lci"][0], hi_chol_totals["_uci"][0])} Ratio of HI_CHOL / RIAGENDR: {hi_chol_ratio["_estimate"][0]} SE of Ratio HI_CHOL / RIAGENDR: {hi_chol_ratio["_stderror"][0]} CL of Ratio HI_CHOL / RIAGENDR: {(hi_chol_ratio["_lci"][0], hi_chol_ratio["_uci"][0])} Proportion of AGECAT: {ag_dict["_estimate"]} SE of Proportion AGECAT: {ag_dict["_stderror"]} LCL of Proportion AGECAT: {ag_dict["_lci"]} UCL of Proportion AGECAT: {ag_dict["_uci"]} Quantiles of HI_CHOL: Not available Mean of HI_CHOL by race: {hc_dict["_estimate"]} SE of HI_CHL by race: {hc_dict["_stderror"]} LCL of HI_CHOL by race: {hc_dict["_lci"]} UCL of HI_CHOL by race: {hc_dict["_uci"]} Design Effect of HI_CHOL by race: Not available """)
Mean of HI_CHOL: 0.11214295634969222
SE of Mean HI_CHOL: 0.005445839698954557
CL of Mean HI_CHOL: (np.float64(0.1005982919131703), np.float64(0.12368762078621415))
Sum of HI_CHOL: 28635245.254672
SE of Sum HI_CHOL: 2020710.7436996205
CL of Sum HI_CHOL: (np.float64(24351529.84091034), np.float64(32918960.668433655))
Ratio of HI_CHOL / RIAGENDR: 0.07422209323594066
SE of Ratio HI_CHOL / RIAGENDR: 0.0037147278931070065
CL of Ratio HI_CHOL / RIAGENDR: (np.float64(0.06634722189017901), np.float64(0.0820969645817023))
Proportion of AGECAT: {'(0,19]': 0.2077494937870972, '(19,39]': 0.29340788818591346, '(39,59]': 0.30328958320385285, '(59,Inf]': 0.19555303482313666}
SE of Proportion AGECAT: {'(0,19]': 0.006129950336419631, '(19,39]': 0.009560691634608896, '(39,59]': 0.004519462827363183, '(59,Inf]': 0.008092578243976422}
LCL of Proportion AGECAT: {'(0,19]': 0.19505410930097866, '(19,39]': 0.27355685874096586, '(39,59]': 0.2937950591158628, '(59,Inf]': 0.1789647230500222}
UCL of Proportion AGECAT: {'(0,19]': 0.2210442684297426, '(19,39]': 0.3140766293472951, '(39,59]': 0.31295496708023285, '(59,Inf]': 0.21327950895208636}
Quantiles of HI_CHOL: Not available
Mean of HI_CHOL by race: {1: 0.10149166545397208, 2: 0.12164920535593333, 3: 0.07864006039908408, 4: 0.09967860947712034}
SE of HI_CHL by race: {1: 0.006245843308749599, 2: 0.006604133623532979, 3: 0.010384645000548863, 4: 0.024666226871851268}
LCL of HI_CHOL by race: {1: 0.0882510691256497, 2: 0.10764906749064211, 3: 0.056625596431891564, 4: 0.04738854441969514}
UCL of HI_CHOL by race: {1: 0.11473226178229445, 2: 0.13564934322122454, 3: 0.1006545243662766, 4: 0.15196867453454554}
Design Effect of HI_CHOL by race: Not available
Differences
Quantiles
samplics in Python does not have a method for calculating quantiles, and in R and SAS the available methods lead to different results. To demonstrate the differences in calculating quantiles, we will use the apisrs dataset from the survey package in R (“API Data Files” 2006).
In SAS, PROC SURVEYMEANS will calculate quantiles of specific probabilities as you request them, using Woodruff’s method for intervals and a custom quantile method (SAS/STAT® 15.1 User’s Guide 2018, 9834). The quantile method does not match any of the available qrules in R, and although the default interval.types in the R survey::svyquantile function also uses Woodruff’s method, it is a different implementation.
The method and results from SAS are as follows:
proc surveymeans data=apisrs total=6194 quantile=(0.025 0.5 0.975); var growth;run;
The SURVEYMEANS Procedure
Data Summary
Number of Observations 200
Quantiles
Std
Variable Percentile Estimate Error 95% Confidence Limits
---------------------------------------------------------------------------------
growth 2.5 -16.500000 1.755916 -19.962591 -13.037409
50 Median 26.500000 1.924351 22.705263 30.294737
97.5 99.000000 16.133827 67.184794 130.815206
---------------------------------------------------------------------------------
If in R we use the default qrule="math" (equivalent to qrule="hf1" and matches type=1 in the quantile function for unweighted data) along with the default interval.type="mean", we get the following results:
Here we can see that the quantiles, confidence intervals, and standard errors do not match SAS. From testing, none of the available qrule methods match SAS for the quantile values, so it is recommended to use the default values unless you have need of some of the other properties of different quantile definitions - see vignette("qrule", package="survey") for more detail. If an exact match to SAS is required, then the svyquantile function allows for passing a custom function to the qrule argument to define your own method for calculating quantiles. Below is an example that will match SAS:
sas_qrule <-function(x, w, p) {# Custom qrule to match SAS, based on survey::oldsvyquantile's internal methodif (any(is.na(x))) return(NA* p) w <-rowsum(w, x, reorder =TRUE) x <-sort(unique(x)) cum.w <-cumsum(w)/sum(w) cdf <-approxfun(cum.w, x, method ="linear", f =1, yleft =min(x), yright =max(x), ties = min)cdf(p)}sas_quants <-svyquantile(~growth, srs_design,quantiles =c(0.025, 0.5, 0.975),qrule=sas_qrule,ci=TRUE,se=TRUE)sas_quants
Note that although the quantiles and standard errors match, the confidence intervals still do not match SAS. For this another custom calculation is required, based on the formula used in SAS:
sas_quantile_confint <-function(newsvyquantile, level=0.05, df=Inf) { q <-coef(newsvyquantile) se <-SE(newsvyquantile) ci <-cbind( q, q + se *qt(level/2, df), q - se *qt(1- level/2, df), se )colnames(ci) <-c("quantile", paste0("ci.", c(100* level /2, 100* (1- level /2))), "se") ci}sas_quantile_confint(sas_quants, df=degf(srs_design))
Some of the functions in R require the degrees of freedom to be specified when calculating confidence intervals, otherwise it assumes a normal distribution. This can be done easily by using the survey::degf function, which calculates the degrees of freedom for a survey design object.
Single PSU Strata
Although it was not apparent with the examples used here, if there is only one PSU from a stratum then R will by default error, whereas SAS will remove that stratum from the variance calculation. This can be changed in R by setting the options(survey.lonely.psu="certainty") to match SAS and have it make no contribution to the variance. In samplics, this behaviour can be configured using the single_psu argument to the estimate method, and can be set to to match SAS using SinglePSUEst.certainty. This should be considered carefully however, in R and Python there are additional methods of handling single PSUs that may be more appropriate for your use-case.
Documentation Differences
One key consideration when choosing a statistical package is the documentation available. In this case, both the survey package in R and the survey procedures in SAS have a much more comprehensive set of documentation and examples than samplics in Python. This includes both detailed examples, as well as the underlying theory and methods used in the calculations including references to the literature.
Summary and Recommendations
The {survey} package in R and the survey procedures in SAS/STAT both provide similar functionality for calculating survey summary statistics. In most cases in both our tests and others, the results are identical ((“Software for Analysis of YRBS Data” 2017), (So et al. 2020), (Damico 2009)). Where differences do occur, primarily in calculating quantiles, the methods in R are more varied and well-documented.
In contrast, the samplics package in Python is still early in development, and although it does provide some functionality there are still major limitations in both basic statistics (i.e. quantiles) and in more complex methods that were beyond the scope of this document, and the methods are much less well-documented.
Session Info
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.4.0 (2024-04-24)
os Ubuntu 22.04.5 LTS
system x86_64, linux-gnu
ui X11
language (EN)
collate C.UTF-8
ctype C.UTF-8
tz UTC
date 2024-10-25
pandoc 3.2 @ /opt/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
! package * version date (UTC) lib source
P survey * 4.4-2 2024-03-20 [?] RSPM (R 4.4.0)
[1] /home/runner/work/CAMIS/CAMIS/renv/library/linux-ubuntu-jammy/R-4.4/x86_64-pc-linux-gnu
[2] /opt/R/4.4.0/lib/R/library
P ── Loaded and on-disk path mismatch.
─ External software ──────────────────────────────────────────────────────────
setting value
SAS 9.04.01M7P080520
──────────────────────────────────────────────────────────────────────────────
Lohr, Sharon L. 2022. Sampling: Design and Analysis. 3rd ed. CRC Press, Taylor & Francis Group.
SAS/STAT® 15.1 User’s Guide. 2018. SAS Institute Inc.
So, Hon Yiu, Urun Erbas Oz, Lauren Griffith, Susan Kirkland, Jinhua Ma, Parminder Raina, Nazmul Sohel, Mary E. Thompson, Christina Wolfson, and Changbao Wu. 2020. “Modelling Complex Survey Data Using r, SAS, SPSS and Stata: A Comparison Using CLSA Datasets.”https://arxiv.org/abs/2010.09879.