Survey Summary Statistics using SAS

When conducting large-scale trials on samples of the population, it can be necessary to use a more complex sampling design than a simple random sample.

Weighting – If smaller populations are sampled more heavily to increase precision, then it is necessary to weight these observations in the analysis.
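Finite population correction – When the sample makes up a sizeable fraction of a finite population and is drawn without replacement, estimates vary less than they would for the same sample size drawn from an effectively infinite population, and a correction is applied to account for this.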
Stratification – Dividing a population into sub-groups and sampling from each group. This protects from obtaining a very poor sample (e.g. under or over-represented groups), can give samples of a known precision, and gives more precise estimates for population means and totals.
Clustering – Dividing a population into sub-groups, and only sampling certain groups. This gives lower precision, but can be much more convenient and cheaper; for example, when surveying school children you may sample only a subset of schools, to avoid travelling to a school to interview a single child.

All of these design features need to be taken into account when calculating statistics and when fitting models. Only summary statistics are discussed in this document, and variances are calculated using the default Taylor series linearisation method. For a more detailed introduction to survey statistics in SAS, see (Lohr 2022) or (SAS/STAT® 15.1 User’s Guide 2018).

For survey summary statistics in SAS, we can use the SURVEYMEANS and SURVEYFREQ procedures.

Simple Survey Designs

We will use the API dataset (“API Data Files” 2006), which contains a number of datasets based on different samples from a dataset of academic performance. Initially we will just cover the methodology with a simple random sample and a finite population correction to demonstrate functionality.

cds	stype	name	sname	snum	dname	dnum	cname	cnum	flag	pcttest	api00	api99	target	growth	sch.wide	comp.imp	both	awards	meals	ell	yr.rnd	mobility	acs.k3	acs.46	acs.core	pct.resp	not.hsg	hsg	some.col	col.grad	grad.sch	avg.ed	full	emer	enroll	api.stu	pw	fpc
15739081534155	H	McFarland High	McFarland High	1039	McFarland Unified	432	Kern	14	NA	98	462	448	18	14	No	Yes	No	No	44	31	NA	6	NA	NA	24	82	44	34	12	7	3	1.91	71	35	477	429	30.97	6194
19642126066716	E	Stowers (Cecil	Stowers (Cecil B.) Elementary	1124	ABC Unified	1	Los Angeles	18	NA	100	878	831	NA	47	Yes	Yes	Yes	Yes	8	25	NA	15	19	30	NA	97	4	10	23	43	21	3.66	90	10	478	420	30.97	6194
30664493030640	H	Brea-Olinda Hig	Brea-Olinda High	2868	Brea-Olinda Unified	79	Orange	29	NA	98	734	742	3	-8	No	No	No	No	10	10	NA	7	NA	NA	28	95	5	9	21	41	24	3.71	83	18	1410	1287	30.97	6194
19644516012744	E	Alameda Element	Alameda Elementary	1273	Downey Unified	187	Los Angeles	18	NA	99	772	657	7	115	Yes	Yes	Yes	Yes	70	25	NA	23	23	NA	NA	100	37	40	14	8	1	1.96	85	18	342	291	30.97	6194
40688096043293	E	Sunnyside Eleme	Sunnyside Elementary	4926	San Luis Coastal Unified	640	San Luis Obispo	39	NA	99	739	719	4	20	Yes	Yes	Yes	Yes	43	12	NA	12	20	29	NA	91	8	21	27	34	10	3.17	100	0	217	189	30.97	6194
19734456014278	E	Los Molinos Ele	Los Molinos Elementary	2463	Hacienda la Puente Unif	284	Los Angeles	18	NA	93	835	822	NA	13	Yes	Yes	Yes	No	16	19	NA	13	19	29	NA	71	1	8	20	38	34	3.96	75	20	258	211	30.97	6194

Mean

To calculate the mean of a variable in a dataset obtained from a simple random sample, such as apisrs, we can do the following in SAS (note: total=6194 is taken from the constant fpc column and provides the finite population correction):

proc surveymeans data=apisrs total=6194 mean;
    var growth;
run;

                             The SURVEYMEANS Procedure

                                    Data Summary

                        Number of Observations           200


                                    Statistics

                                                Std Error
 Variable               N            Mean         of Mean       95% CL for Mean
 ---------------------------------------------------------------------------------
 growth               200       31.900000        2.090493    27.7776382 36.0223618
 ---------------------------------------------------------------------------------
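
As a rough check on what the finite population correction does here, the sketch below computes a mean and its standard error for a simple random sample drawn without replacement. It assumes the textbook formula sqrt((1 - n/N) * s^2 / n), which is what linearisation reduces to for an equally weighted simple random sample. The function name srs_mean_se and the toy data are hypothetical; only n = 200, N = 6194 and the standard error 2.090493 come from the output above.

```python
import math

def srs_mean_se(values, population_size):
    """Mean and standard error of the mean for a simple random sample
    drawn without replacement, with the finite population correction."""
    n = len(values)
    mean = sum(values) / n
    # Sample variance with the usual n - 1 denominator.
    s2 = sum((v - mean) ** 2 for v in values) / (n - 1)
    fpc = 1 - n / population_size  # finite population correction factor
    return mean, math.sqrt(fpc * s2 / n)

# Hypothetical toy data, not taken from apisrs.
toy_mean, toy_se = srs_mean_se([1, 2, 3, 4], population_size=10)

# Correction factor implied by the example above: 200 schools out of 6194.
fpc_api = 1 - 200 / 6194
# Standard error the printed result would correspond to if the
# correction were ignored (back-calculated from the SAS output).
se_without_fpc = 2.090493 / math.sqrt(fpc_api)

print(toy_mean, toy_se)
print(round(fpc_api, 6), round(se_without_fpc, 4))
```

With only about 3% of the population sampled, the correction shrinks the standard error only slightly; it matters more as the sampling fraction grows.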

Total

To calculate population totals, we can request the sum. However, SAS requires the user to specify the weights, otherwise the totals will be incorrect. In this case, each weight is equal to the total population size divided by the sample size:

data apisrs;
    set apisrs nobs=n;
    weight = fpc / n;
run;

proc surveymeans data=apisrs total=6194 sum;
    var growth;
    weight weight;
run;

       The SURVEYMEANS Procedure

              Data Summary

  Number of Observations           200
  Sum of Weights                  6194


               Statistics

                               Std Error
Variable             Sum          of Sum
----------------------------------------
growth            197589           12949
----------------------------------------
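
The relationship between the weighted total and the earlier mean can be checked by hand. The sketch below assumes equal weights (as is the case for this simple random sample), so that the estimated total is the population size times the sample mean, and likewise for the standard error. All numbers are copied from the SAS output above; the variable names are illustrative only.

```python
population_size = 6194   # value passed to total=, from the fpc column
sample_size = 200
mean_growth = 31.9       # Mean from the SURVEYMEANS output
se_mean = 2.090493       # Std Error of Mean from the same output

# Each sampled school represents N / n schools in the population.
weight = population_size / sample_size

# With equal weights, sum(w * y) = (N / n) * sum(y) = N * mean(y).
total = population_size * mean_growth
se_total = population_size * se_mean

print(weight, round(total), round(se_total))
```

The weight of 30.97 agrees with the pw column in the dataset, and the rounded total and standard error agree with the Sum and Std Error of Sum reported by SAS.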

Ratios

To perform ratio analysis for means or proportions of analysis variables in SAS, we can use the following:

proc surveymeans data=apisrs total=6194;
    ratio api00 / api99;
run;

                             The SURVEYMEANS Procedure

                                    Data Summary

                        Number of Observations           200


                                    Statistics

                                                Std Error
 Variable               N            Mean         of Mean       95% CL for Mean
 ---------------------------------------------------------------------------------
 api00                200      656.585000        9.249722    638.344950 674.825050
 api99                200      624.685000        9.500304    605.950813 643.419187
 ---------------------------------------------------------------------------------


                                   Ratio Analysis

                                                          Std
Numerator Denominator            N           Ratio           Error        95% CL for Ratio
----------------------------------------------------------------------------------------------
api00     api99                200        1.051066        0.003604    1.04395882    1.05817265
----------------------------------------------------------------------------------------------
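
The ratio estimate itself is simply the ratio of the two estimated means (equivalently, of the two estimated totals). The sketch below confirms this from the printed means, and adds a minimal linearisation standard error for a ratio under simple random sampling, using residuals y - R*x. The function ratio_se and the toy data are hypothetical, and the formula is the standard textbook form rather than a transcription of the SAS internals.

```python
import math

# Means copied from the SURVEYMEANS output above.
mean_api00 = 656.585
mean_api99 = 624.685
ratio = mean_api00 / mean_api99

def ratio_se(y, x, population_size):
    """Linearised standard error of sum(y) / sum(x) for a simple random
    sample drawn without replacement (textbook formula, assumed here)."""
    n = len(y)
    r = sum(y) / sum(x)
    x_bar = sum(x) / n
    # Linearised residuals; these always sum to zero.
    e = [yi - r * xi for yi, xi in zip(y, x)]
    s2_e = sum(ei ** 2 for ei in e) / (n - 1)
    fpc = 1 - n / population_size
    return math.sqrt(fpc * s2_e / n) / x_bar

# Hypothetical toy data, not taken from apisrs.
toy_se = ratio_se([1, 2, 3, 6], [1, 2, 3, 4], population_size=10)

print(round(ratio, 6), round(toy_se, 6))
```

The standard error of the ratio is much smaller than that of either mean because api00 and api99 are strongly correlated, so the residuals y - R*x vary far less than y itself.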

Proportions

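To calculate a proportion in SAS, we use PROC SURVEYFREQ; the simplest case is shown below. Because the variable name contains a period, it is written as the name literal 'sch.wide'n, which requires the VALIDVARNAME=ANY system option: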

proc surveyfreq data=apisrs total=6194;
    table 'sch.wide'n / cl;
run;

                          The SURVEYFREQ Procedure

                                Data Summary

                    Number of Observations           200


                             Table of sch.wide

                                       Std Err of    95% Confidence Limits
 sch.wide     Frequency     Percent       Percent         for Percent
 -------------------------------------------------------------------------
 No                  37     18.5000        2.7078     13.1604      23.8396
 Yes                163     81.5000        2.7078     76.1604      86.8396

 Total              200    100.0000
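
The standard error of the percentage can be reproduced from the frequencies alone. The sketch below assumes the usual simple random sampling formula with the finite population correction and an n - 1 denominator, sqrt((1 - n/N) * p * (1 - p) / (n - 1)); the counts are taken from the table above and the variable names are illustrative.

```python
import math

n = 200            # number of observations
N = 6194           # population size passed to total=
yes_count = 163    # frequency of sch.wide = "Yes"

p = yes_count / n
fpc = 1 - n / N    # finite population correction factor
se_p = math.sqrt(fpc * p * (1 - p) / (n - 1))

percent = 100 * p
se_percent = 100 * se_p

print(round(percent, 4), round(se_percent, 4))
```

The same standard error applies to the "No" row, since p * (1 - p) is symmetric in the two categories.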

Quantiles

To calculate quantiles in SAS, we can use the quantile option to request specific quantiles, or can use keywords to request common quantiles (e.g. quartiles or the median). This will use Woodruff’s method for confidence intervals, and a custom quantile method (SAS/STAT® 15.1 User’s Guide 2018, 9834).

proc surveymeans data=apisrs total=6194 quantile=(0.025 0.5 0.975);
    var growth;
run;

                             The SURVEYMEANS Procedure

                                    Data Summary

                        Number of Observations           200




                                     Quantiles

                                                      Std
 Variable       Percentile       Estimate           Error    95% Confidence Limits
 ---------------------------------------------------------------------------------
 growth           2.5          -16.500000        1.755916    -19.962591 -13.037409
                   50 Median    26.500000        1.924351     22.705263  30.294737
                 97.5           99.000000       16.133827     67.184794 130.815206
 ---------------------------------------------------------------------------------

Summary Statistics on Complex Survey Designs

Many of the previous examples and notes still apply to more complex survey designs. Here we demonstrate using a dataset from NHANES (“National Health and Nutrition Examination Survey Data” 2010), which uses both stratification and clustering:

SDMVPSU	SDMVSTRA	WTMEC2YR	HI_CHOL	race	agecat	RIAGENDR
1	83	81528.77	0	2	(19,39]	1
1	84	14509.28	0	3	(0,19]	1
2	86	12041.64	0	3	(0,19]	1
2	75	21000.34	0	3	(59,Inf]	2
1	88	22633.58	0	1	(19,39]	1
2	85	74112.49	1	2	(39,59]	2

To produce means and standard quartiles for this sample, taking account of sample design, we can use the following:

proc surveymeans data=nhanes mean quartiles;
    cluster SDMVPSU;
    strata SDMVSTRA;
    weight WTMEC2YR;
    var HI_CHOL;
run;

                             The SURVEYMEANS Procedure

                                    Data Summary

                        Number of Strata                  15
                        Number of Clusters                31
                        Number of Observations          8591
                        Sum of Weights             276536446


                                     Statistics

                                                     Std Error
                      Variable            Mean         of Mean
                      ----------------------------------------
                      HI_CHOL         0.112143        0.005446
                      ----------------------------------------


                                     Quantiles

                                                      Std
 Variable       Percentile       Estimate           Error    95% Confidence Limits
 ---------------------------------------------------------------------------------
 HI_CHOL           25 Q1                0        0.024281    -0.0514730 0.05147298
                   50 Median            0        0.024281    -0.0514730 0.05147298
                   75 Q3                0        0.024281    -0.0514730 0.05147298
 ---------------------------------------------------------------------------------

Note that HI_CHOL is a 0/1 indicator with a weighted mean of about 0.11, so more than 75% of the weighted observations are 0. All three quartiles are therefore 0, and the mean is the more informative summary for this variable.

To analyse separate subpopulations in SAS, we can use the DOMAIN statement. Note: do not use the BY statement for this, as it analyses each group as a completely separate sample and does not give a statistically valid domain analysis. Here we also request the design effect:

proc surveymeans data=nhanes mean deff;
    cluster SDMVPSU;
    strata SDMVSTRA;
    weight WTMEC2YR;
    var HI_CHOL;
    domain race;
run;


               The SURVEYMEANS Procedure

                      Data Summary

          Number of Strata                  15
          Number of Clusters                31
          Number of Observations          8591
          Sum of Weights             276536446


                       Statistics

                               Std Error          Design
Variable            Mean         of Mean          Effect
--------------------------------------------------------
HI_CHOL         0.112143        0.005446        2.336725
--------------------------------------------------------

               Statistics for race Domains

                                       Std Error          Design
race    Variable            Mean         of Mean          Effect
------------------------------------------------------------------------
   1    HI_CHOL         0.101492        0.006246        1.082734
   2    HI_CHOL         0.121649        0.006604        1.407822
   3    HI_CHOL         0.078640        0.010385        2.091156
   4    HI_CHOL         0.099679        0.024666        3.098290
------------------------------------------------------------------------
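
The design effect is the ratio of the variance under the actual design to the variance that a simple random sample of the same size would give. Under that definition, the sketch below back-calculates the simple random sampling standard error implied by each row of the output above (design standard error divided by the square root of the design effect). The numbers are copied from the SAS output; the helper name implied_srs_se is hypothetical.

```python
import math

def implied_srs_se(se_design, deff):
    """Standard error a simple random sample of the same size would give,
    assuming deff = (design variance) / (SRS variance)."""
    return se_design / math.sqrt(deff)

# Overall estimate for HI_CHOL.
overall_srs_se = implied_srs_se(0.005446, 2.336725)

# (Std Error of Mean, Design Effect) for each race domain.
domains = {
    1: (0.006246, 1.082734),
    2: (0.006604, 1.407822),
    3: (0.010385, 2.091156),
    4: (0.024666, 3.098290),
}
domain_srs_se = {race: implied_srs_se(se, deff)
                 for race, (se, deff) in domains.items()}

print(round(overall_srs_se, 6))
for race, se in domain_srs_se.items():
    print(race, round(se, 6))
```

A design effect above 1, as seen in every row here, means the clustered and weighted design gives less precise estimates than a simple random sample of the same size would.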

Session Info

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.4.3 (2025-02-28)
 os       Ubuntu 24.04.2 LTS
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  C.UTF-8
 ctype    C.UTF-8
 tz       Europe/London
 date     2025-03-13
 pandoc   NA (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 ! package * version date (UTC) lib source
 P survey  * 4.4-2   2024-03-20 [?] RSPM (R 4.4.0)

 [1] /home/michael/source/CAMIS/renv/library/linux-ubuntu-noble/R-4.4/x86_64-pc-linux-gnu
 [2] /opt/R/4.4.3/lib/R/library

 P ── Loaded and on-disk path mismatch.

─ External software ──────────────────────────────────────────────────────────
 setting value
 SAS     9.04.01M7P080520

──────────────────────────────────────────────────────────────────────────────

References

“API Data Files.” 2006. California Department of Education. https://web.archive.org/web/20060813165101/http://api.cde.ca.gov/datafiles.asp.

Lohr, Sharon L. 2022. Sampling: Design and Analysis. 3rd ed. CRC Press, Taylor & Francis Group.

“National Health and Nutrition Examination Survey Data.” 2010. Centers for Disease Control and Prevention (CDC). https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=laboratory&CycleBeginYear=2009.

SAS/STAT® 15.1 User’s Guide. 2018. SAS Institute Inc.