R vs SAS Survey Summary Statistics

This document will compare the survey summary statistics functionality in SAS (available through SAS/STAT) and R (available from the {survey} package), highlighting differences in methods and results. Only the default Taylor series linearisation method for calculating variances is used in both languages. A more detailed comparison for specific methods and use-cases is available in (“Software for Analysis of YRBS Data” 2017), (So et al. 2020), or (Damico 2009). For a general guide to survey statistics, which has companion guides for both R and SAS, see (Lohr 2022)

Result Comparison

The following table shows different survey summary statistics, the capabilities of each language, and whether or not the results match. Each analysis also includes calculating the standard error and confidence intervals.

Analysis	Supported in R	Supported in SAS	Results Match	Notes
Mean	Yes	Yes	Yes	Must specify degrees of freedom in R for confidence limits
Total	Yes	Yes	Yes	Must specify degrees of freedom in R for confidence limits
Ratios	Yes	Yes	Yes	Must specify degrees of freedom in R for confidence limits
Proportions	Yes	Yes	Yes	For confidence limits of proportions near 0 and 1, `survey::svyciprop` can be more accurate in R.
Quantiles	Yes	Yes	No	Different methods for calculating quantiles
Domain Analysis	Yes	Yes	Yes
Design Effect	Yes	Yes	Yes	Set `deff="replace"` in R to match SAS exactly

For the full R and SAS code and results used for this comparison, see below:

Show Code

library(survey)

data("nhanes")

nhanes_design <- svydesign(
  data = nhanes,
  id = ~SDMVPSU, # Specify the PSU/cluster column
  strata = ~SDMVSTRA,  # The stratification column
  weights = ~WTMEC2YR,  # The weighting column
  nest = TRUE  # Allows for PSUs with the same name nested within different strata
)

# Mean of HI_CHOL
hi_chol_mean <- svymean(~HI_CHOL, nhanes_design, na.rm=TRUE)

# Sum of HI_CHOL
hi_chol_sum <- svytotal(~HI_CHOL, nhanes_design, na.rm=TRUE)

# Ratio of HI_CHOL / RIAGENDR
hi_chol_ratio <- svyratio(
  numerator = ~HI_CHOL,
  denominator = ~RIAGENDR,
  nhanes_design,
  na.rm=TRUE,
  ci=TRUE,
  se=TRUE,
  separate=FALSE
)

# Proportion of different AGECAT values
agecat_props <- svymean(~agecat, nhanes_design, na.rm=TRUE)

# Quantiles of HI_CHOL
hi_chol_quart <- svyquantile(
  ~HI_CHOL,
  nhanes_design,
  quantiles=c(0.025, 0.5, 0.975),
  na.rm=TRUE,
  ci=TRUE
)

# Domain analysis of mean of HI_CHOL by race, with design effect
hi_chol_mean_by_race <- svyby(~HI_CHOL, ~race, nhanes_design, svymean, na.rm=TRUE, deff="replace")

print(list(
  "Mean of HI_CHOL" = coef(hi_chol_mean),
  "SE of Mean HI_CHOL" = SE(hi_chol_mean),
  "CL of Mean HI_CHOL" = confint(hi_chol_mean, df=degf(nhanes_design)),
  "Sum of HI_CHOL" = coef(hi_chol_sum),
  "SE of Sum HI_CHOL" = SE(hi_chol_sum),
  "CL of Sum HI_CHOL" = confint(hi_chol_sum, df=degf(nhanes_design)),
  "Ratio of HI_CHOL / RIAGENDR" = coef(hi_chol_ratio),
  "SE of Ratio HI_CHOL / RIAGENDR" = SE(hi_chol_ratio),
  "CL of Ratio HI_CHOL / RIAGENDR" = confint(hi_chol_ratio, df=degf(nhanes_design)),
  "Proportion of AGECAT" = coef(agecat_props),
  "SE of Proportion AGECAT" = SE(agecat_props),
  "CL of Proportion AGECAT" = confint(agecat_props, df=degf(nhanes_design)),
  "Quantiles of HI_CHOL" = coef(hi_chol_quart),
  "SE of Quantiles HI_CHOL" = SE(hi_chol_quart),
  "CL of Quantiles HI_CHOL" = confint(hi_chol_quart, df=degf(nhanes_design)),
  "Mean of HI_CHOL by race" = coef(hi_chol_mean_by_race),
  "SE of HI_CHOL by race" = SE(hi_chol_mean_by_race),
  "CL of HI_CHOL by race" = confint(hi_chol_mean_by_race, df=degf(nhanes_design)),
  "Design Effect of HI_CHOL by race" = hi_chol_mean_by_race$DEff.HI_CHOL
))

$`Mean of HI_CHOL`
 HI_CHOL 
0.112143 

$`SE of Mean HI_CHOL`
           HI_CHOL
HI_CHOL 0.00544584

$`CL of Mean HI_CHOL`
            2.5 %    97.5 %
HI_CHOL 0.1005983 0.1236876

$`Sum of HI_CHOL`
 HI_CHOL 
28635245 

$`SE of Sum HI_CHOL`
        HI_CHOL
HI_CHOL 2020711

$`CL of Sum HI_CHOL`
           2.5 %   97.5 %
HI_CHOL 24351530 32918961

$`Ratio of HI_CHOL / RIAGENDR`
HI_CHOL/RIAGENDR 
      0.07422209 

$`SE of Ratio HI_CHOL / RIAGENDR`
HI_CHOL/RIAGENDR 
     0.003714728 

$`CL of Ratio HI_CHOL / RIAGENDR`
                      2.5 %     97.5 %
HI_CHOL/RIAGENDR 0.06634722 0.08209696

$`Proportion of AGECAT`
  agecat(0,19]  agecat(19,39]  agecat(39,59] agecat(59,Inf] 
     0.2077495      0.2934079      0.3032896      0.1955530 

$`SE of Proportion AGECAT`
  agecat(0,19]  agecat(19,39]  agecat(39,59] agecat(59,Inf] 
   0.006129950    0.009560692    0.004519463    0.008092578 

$`CL of Proportion AGECAT`
                   2.5 %    97.5 %
agecat(0,19]   0.1947546 0.2207444
agecat(19,39]  0.2731401 0.3136756
agecat(39,59]  0.2937088 0.3128704
agecat(59,Inf] 0.1783975 0.2127085

$`Quantiles of HI_CHOL`
HI_CHOL.0.025   HI_CHOL.0.5 HI_CHOL.0.975 
            0             0             1 

$`SE of Quantiles HI_CHOL`
HI_CHOL.0.025   HI_CHOL.0.5 HI_CHOL.0.975 
    0.2358596     0.2358596     0.0000000 

$`CL of Quantiles HI_CHOL`
              l u
HI_CHOL.0.025 0 1
HI_CHOL.0.5   0 1
HI_CHOL.0.975 1 1

$`Mean of HI_CHOL by race`
         1          2          3          4 
0.10149167 0.12164921 0.07864006 0.09967861 

$`SE of HI_CHOL by race`
[1] 0.006245843 0.006604134 0.010384645 0.024666227

$`CL of HI_CHOL by race`
       2.5 %    97.5 %
1 0.08825107 0.1147323
2 0.10764907 0.1356493
3 0.05662560 0.1006545
4 0.04738854 0.1519687

$`Design Effect of HI_CHOL by race`
[1] 1.082734 1.407822 2.091156 3.098290

* Mean, sum quantile of HI_CHOL;
proc surveymeans data=nhanes mean sum clm quantile=(0.025 0.5 0.975);
    cluster SDMVPSU;
    strata SDMVSTRA;
    weight WTMEC2YR;
    var HI_CHOL;
run;

* Ratio of HI_CHOL / RIAGENDR;
proc surveymeans data=nhanes;
    cluster SDMVPSU;
    strata SDMVSTRA;
    weight WTMEC2YR;
    ratio HI_CHOL / RIAGENDR;
run;

* Proportions of agecat;
proc surveyfreq data=nhanes;
    cluster SDMVPSU;
    strata SDMVSTRA;
    weight WTMEC2YR;
    table agecat / cl;
run;

* Mean and DEFF of HI_CHOL by race;
proc surveymeans data=nhanes mean deff;
    cluster SDMVPSU;
    strata SDMVSTRA;
    weight WTMEC2YR;
    domain race;
    var HI_CHOL;
run;


                                                 The SURVEYMEANS Procedure

                                                        Data Summary

                                            Number of Strata                  15
                                            Number of Clusters                31
                                            Number of Observations          8591
                                            Sum of Weights             276536446


                                                         Statistics

                                Std Error                                                Std Error
 Variable            Mean         of Mean       95% CL for Mean                Sum          of Sum        95% CL for Sum
 --------------------------------------------------------------------------------------------------------------------------
 HI_CHOL         0.112143        0.005446    0.10059829 0.12368762        28635245         2020711    24351529.8 32918960.7
 --------------------------------------------------------------------------------------------------------------------------


                                                         Quantiles

                                                                          Std
                     Variable       Percentile       Estimate           Error    95% Confidence Limits
                     ---------------------------------------------------------------------------------
                     HI_CHOL          2.5                   0        0.024281    -0.0514730 0.05147298
                                       50 Median            0        0.024281    -0.0514730 0.05147298
                                     97.5            0.777070        0.024281     0.7255973 0.82854324
                     ---------------------------------------------------------------------------------

                                                 The SURVEYMEANS Procedure

                                                        Data Summary

                                            Number of Strata                  15
                                            Number of Clusters                31
                                            Number of Observations          8591
                                            Sum of Weights             276536446


                                                        Statistics

                                                                    Std Error
                     Variable               N            Mean         of Mean       95% CL for Mean
                     ---------------------------------------------------------------------------------
                     HI_CHOL             7846        0.112143        0.005446    0.10059829 0.12368762
                     RIAGENDR            8591        1.512019        0.005302    1.50077977 1.52325807
                     ---------------------------------------------------------------------------------


                                                       Ratio Analysis

                                                                              Std
               Numerator Denominator            N           Ratio           Error        95% CL for Ratio
               ----------------------------------------------------------------------------------------------
               HI_CHOL   RIAGENDR            7846        0.074222        0.003715    0.06634722    0.08209696
               ----------------------------------------------------------------------------------------------

                                                  The SURVEYFREQ Procedure

                                                        Data Summary

                                            Number of Strata                  15
                                            Number of Clusters                31
                                            Number of Observations          8591
                                            Sum of Weights             276536446


                                                      Table of agecat

                                        Weighted    Std Err of                Std Err of    95% Confidence Limits
          agecat         Frequency     Frequency      Wgt Freq     Percent       Percent         for Percent
          -------------------------------------------------------------------------------------------------------
          (0,19]              2532      57450307       3043819     20.7749        0.6130     19.4755      22.0744
          (19,39]             2033      81137975       3692818     29.3408        0.9561     27.3140      31.3676
          (39,59]             2021      83870623       4853936     30.3290        0.4519     29.3709      31.2870
          (59,Inf]            2005      54077541       4284296     19.5553        0.8093     17.8398      21.2709

          Total               8591     276536446      13935730    100.0000                                       
          -------------------------------------------------------------------------------------------------------

                                                 The SURVEYMEANS Procedure

                                                        Data Summary

                                            Number of Strata                  15
                                            Number of Clusters                31
                                            Number of Observations          8591
                                            Sum of Weights             276536446


                                                         Statistics

                                                                 Std Error          Design
                                  Variable            Mean         of Mean          Effect
                                  --------------------------------------------------------
                                  HI_CHOL         0.112143        0.005446        2.336725
                                  --------------------------------------------------------

                                                 The SURVEYMEANS Procedure

                                                 Statistics for race Domains

                                                                         Std Error          Design
                                  race    Variable            Mean         of Mean          Effect
                          ------------------------------------------------------------------------
                                     1    HI_CHOL         0.101492        0.006246        1.082734
                                     2    HI_CHOL         0.121649        0.006604        1.407822
                                     3    HI_CHOL         0.078640        0.010385        2.091156
                                     4    HI_CHOL         0.099679        0.024666        3.098290
                          ------------------------------------------------------------------------

Differences

Quantiles

To demonstrate the differences in calculating quantiles, we will use the apisrs dataset from the survey package in R (“API Data Files” 2006).

library(survey)

data("api")

head(apisrs) |> gt::gt()

cds	stype	name	sname	snum	dname	dnum	cname	cnum	flag	pcttest	api00	api99	target	growth	sch.wide	comp.imp	both	awards	meals	ell	yr.rnd	mobility	acs.k3	acs.46	acs.core	pct.resp	not.hsg	hsg	some.col	col.grad	grad.sch	avg.ed	full	emer	enroll	api.stu	pw	fpc
15739081534155	H	McFarland High	McFarland High	1039	McFarland Unified	432	Kern	14	NA	98	462	448	18	14	No	Yes	No	No	44	31	NA	6	NA	NA	24	82	44	34	12	7	3	1.91	71	35	477	429	30.97	6194
19642126066716	E	Stowers (Cecil	Stowers (Cecil B.) Elementary	1124	ABC Unified	1	Los Angeles	18	NA	100	878	831	NA	47	Yes	Yes	Yes	Yes	8	25	NA	15	19	30	NA	97	4	10	23	43	21	3.66	90	10	478	420	30.97	6194
30664493030640	H	Brea-Olinda Hig	Brea-Olinda High	2868	Brea-Olinda Unified	79	Orange	29	NA	98	734	742	3	-8	No	No	No	No	10	10	NA	7	NA	NA	28	95	5	9	21	41	24	3.71	83	18	1410	1287	30.97	6194
19644516012744	E	Alameda Element	Alameda Elementary	1273	Downey Unified	187	Los Angeles	18	NA	99	772	657	7	115	Yes	Yes	Yes	Yes	70	25	NA	23	23	NA	NA	100	37	40	14	8	1	1.96	85	18	342	291	30.97	6194
40688096043293	E	Sunnyside Eleme	Sunnyside Elementary	4926	San Luis Coastal Unified	640	San Luis Obispo	39	NA	99	739	719	4	20	Yes	Yes	Yes	Yes	43	12	NA	12	20	29	NA	91	8	21	27	34	10	3.17	100	0	217	189	30.97	6194
19734456014278	E	Los Molinos Ele	Los Molinos Elementary	2463	Hacienda la Puente Unif	284	Los Angeles	18	NA	93	835	822	NA	13	Yes	Yes	Yes	No	16	19	NA	13	19	29	NA	71	1	8	20	38	34	3.96	75	20	258	211	30.97	6194

In SAS, PROC SURVEYMEANS will calculate quantiles of specific probabilities as you request them, using Woodruff’s method for intervals and a custom quantile method (SAS/STAT® 15.1 User’s Guide 2018, 9834). The quantile method does not match any of the available qrules in R, and although the default interval.types in the R survey::svyquantile function also uses Woddruff’s method, it is a different implementation.

The method and results from SAS are as follows:

proc surveymeans data=apisrs total=6194 quantile=(0.025 0.5 0.975);
    var growth;
run;

                             The SURVEYMEANS Procedure

                                    Data Summary

                        Number of Observations           200




                                     Quantiles

                                                      Std
 Variable       Percentile       Estimate           Error    95% Confidence Limits
 ---------------------------------------------------------------------------------
 growth           2.5          -16.500000        1.755916    -19.962591 -13.037409
                   50 Median    26.500000        1.924351     22.705263  30.294737
                 97.5           99.000000       16.133827     67.184794 130.815206
 ---------------------------------------------------------------------------------

If in R we use the default qrule="math" (equivalent to qrule="hf1" and matches type=1 in the quantile function for unweighted data) along with the default interval.type="mean", we get the following results:

srs_design <- svydesign(data = apisrs,id = ~1,fpc = ~fpc,)

svyquantile(
  ~growth,
  srs_design,
  quantiles = c(0.025, 0.5, 0.975),
  ci=TRUE,
  se=TRUE
)

$growth
      quantile ci.2.5 ci.97.5        se
0.025      -16    -21     -12  2.281998
0.5         27     24      31  1.774887
0.975       99     84     189 26.623305

attr(,"hasci")
[1] TRUE
attr(,"class")
[1] "newsvyquantile"

Here we can see that the quantiles, confidence intervals, and standard errors do not match SAS. From testing, none of the available qrule methods match SAS for the quantile values, so it is recommended to use the default values unless you have need of some of the other properties of different quantile definitions. If an exact match to SAS is required, then the svyquantile function allows for passing a custom function to the qrule argument to define your own method for calculating quantiles. Below is an example that will match SAS:

sas_qrule <- function(x, w, p) {
  # Custom qrule to match SAS, based on survey::oldsvyquantile's internal method
  if (any(is.na(x))) 
    return(NA * p)
  w <- rowsum(w, x, reorder = TRUE)
  x <- sort(unique(x))
  cum.w <- cumsum(w)/sum(w)
  cdf <- approxfun(cum.w, x, method = "linear", f = 1, 
    yleft = min(x), yright = max(x), ties = min)
  cdf(p)
}


sas_quants <- svyquantile(
  ~growth,
  srs_design,
  quantiles = c(0.025, 0.5, 0.975),
  qrule=sas_qrule,
  ci=TRUE,
  se=TRUE
)

sas_quants

$growth
      quantile    ci.2.5   ci.97.5        se
0.025    -16.5 -22.00000 -15.07482  1.755916
0.5       26.5  23.03563  30.62510  1.924351
0.975     99.0  83.70616 147.33657 16.133827

attr(,"hasci")
[1] TRUE
attr(,"class")
[1] "newsvyquantile"

Note that although the quantiles and standard errors match, the confidence intervals still do not match SAS. For this another custom calculation is required, based on the formula used in SAS:

sas_quantile_confint <- function(newsvyquantile, level=0.05, df=Inf) {
  q <- coef(newsvyquantile)
  se <- SE(newsvyquantile)
  ci <- cbind(
    q,
    q + se * qt(level/2, df),
    q - se * qt(1 - level/2, df),
    se
  )
  colnames(ci) <- c("quantile", paste0("ci.", c(100 * level / 2, 100 * (1 - level / 2))), "se")

  ci
}

sas_quantile_confint(sas_quants, df=degf(srs_design))

             quantile    ci.2.5   ci.97.5        se
growth.0.025    -16.5 -19.96259 -19.96259  1.755916
growth.0.5       26.5  22.70526  22.70526  1.924351
growth.0.975     99.0  67.18479  67.18479 16.133827

Other considerations

Degrees of Freedom

Some of the functions in R require the degrees of freedom to be specified when calculating confidence intervals, otherwise it assumes a normal distribution. This can be done easily by using the survey::degf function, which calculates the degrees of freedom for a survey design object.

Single PSU Strata

Although it was not apparent with the examples used here, if there is only one PSU from a stratum then R will by default error, whereas SAS will remove that stratum from the variance calculation. This can be changed in R by setting the options(survey.lonely.psu="certainty") to match SAS and have it make no contribution to the variance.

Summary and Recommendations

The {survey} package in R and the survey procedures in SAS/STAT both provide similar functionality for calculating survey summary statistics. In most cases in both our tests and others, the results are identical ((“Software for Analysis of YRBS Data” 2017), (So et al. 2020), (Damico 2009)). Where differences do occur, primarily in calculating quantiles, the methods in R are more varied and well-documented.

Session Info

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.3 (2024-02-29)
 os       Ubuntu 22.04.4 LTS
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  C.UTF-8
 ctype    C.UTF-8
 tz       UTC
 date     2024-05-01
 pandoc   3.1.11 @ /opt/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 ! package * version date (UTC) lib source
 P survey  * 4.2-1   2023-05-03 [?] RSPM (R 4.3.0)

 [1] /home/runner/work/CAMIS/CAMIS/renv/library/R-4.3/x86_64-pc-linux-gnu
 [2] /opt/R/4.3.3/lib/R/library

 P ── Loaded and on-disk path mismatch.

─ External software ──────────────────────────────────────────────────────────
 setting value
 SAS     9.04.01M7P080520

──────────────────────────────────────────────────────────────────────────────

References

“API Data Files.” 2006. California Department of Education. https://web.archive.org/web/20060813165101/http://api.cde.ca.gov/datafiles.asp.

Damico, Anthony. 2009. “Transitioning to r: Replicating SAS, Stata, and SUDAAN Analysis Techniques in Health Policy Data.” The R Journal, December. https://journal.r-project.org/archive/2009-2/RJournal_2009-2_Damico.pdf.

Lohr, Sharon L. 2022. Sampling: Design and Analysis. 3rd ed. CRC Press, Taylor & Francis Group.

SAS/STAT® 15.1 User’s Guide. 2018. SAS Institute Inc.

So, Hon Yiu, Urun Erbas Oz, Lauren Griffith, Susan Kirkland, Jinhua Ma, Parminder Raina, Nazmul Sohel, Mary E. Thompson, Christina Wolfson, and Changbao Wu. 2020. “Modelling Complex Survey Data Using r, SAS, SPSS and Stata: A Comparison Using CLSA Datasets.” https://arxiv.org/abs/2010.09879.

“Software for Analysis of YRBS Data.” 2017. Centers for Disease Control; Prevention (CDC). https://www.cdc.gov/healthyyouth/data/yrbs/pdf/2017/2017_YRBS_analysis_software.pdf.