cds | stype | name | sname | snum | dname | dnum | cname | cnum | flag | pcttest | api00 | api99 | target | growth | sch.wide | comp.imp | both | awards | meals | ell | yr.rnd | mobility | acs.k3 | acs.46 | acs.core | pct.resp | not.hsg | hsg | some.col | col.grad | grad.sch | avg.ed | full | emer | enroll | api.stu | pw | fpc |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
15739081534155 | H | McFarland High | McFarland High | 1039 | McFarland Unified | 432 | Kern | 14 | NA | 98 | 462 | 448 | 18 | 14 | No | Yes | No | No | 44 | 31 | NA | 6 | NA | NA | 24 | 82 | 44 | 34 | 12 | 7 | 3 | 1.91 | 71 | 35 | 477 | 429 | 30.97 | 6194 |
19642126066716 | E | Stowers (Cecil | Stowers (Cecil B.) Elementary | 1124 | ABC Unified | 1 | Los Angeles | 18 | NA | 100 | 878 | 831 | NA | 47 | Yes | Yes | Yes | Yes | 8 | 25 | NA | 15 | 19 | 30 | NA | 97 | 4 | 10 | 23 | 43 | 21 | 3.66 | 90 | 10 | 478 | 420 | 30.97 | 6194 |
30664493030640 | H | Brea-Olinda Hig | Brea-Olinda High | 2868 | Brea-Olinda Unified | 79 | Orange | 29 | NA | 98 | 734 | 742 | 3 | -8 | No | No | No | No | 10 | 10 | NA | 7 | NA | NA | 28 | 95 | 5 | 9 | 21 | 41 | 24 | 3.71 | 83 | 18 | 1410 | 1287 | 30.97 | 6194 |
19644516012744 | E | Alameda Element | Alameda Elementary | 1273 | Downey Unified | 187 | Los Angeles | 18 | NA | 99 | 772 | 657 | 7 | 115 | Yes | Yes | Yes | Yes | 70 | 25 | NA | 23 | 23 | NA | NA | 100 | 37 | 40 | 14 | 8 | 1 | 1.96 | 85 | 18 | 342 | 291 | 30.97 | 6194 |
40688096043293 | E | Sunnyside Eleme | Sunnyside Elementary | 4926 | San Luis Coastal Unified | 640 | San Luis Obispo | 39 | NA | 99 | 739 | 719 | 4 | 20 | Yes | Yes | Yes | Yes | 43 | 12 | NA | 12 | 20 | 29 | NA | 91 | 8 | 21 | 27 | 34 | 10 | 3.17 | 100 | 0 | 217 | 189 | 30.97 | 6194 |
19734456014278 | E | Los Molinos Ele | Los Molinos Elementary | 2463 | Hacienda la Puente Unif | 284 | Los Angeles | 18 | NA | 93 | 835 | 822 | NA | 13 | Yes | Yes | Yes | No | 16 | 19 | NA | 13 | 19 | 29 | NA | 71 | 1 | 8 | 20 | 38 | 34 | 3.96 | 75 | 20 | 258 | 211 | 30.97 | 6194 |
Survey Summary Statistics using SAS
When conducting large-scale trials on samples of the population, it can be necessary to use a more complex sampling design than a simple random sample.
Weighting – If smaller populations are sampled more heavily to increase precision, then it is necessary to weight these observations in the analysis.
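Finite population correction – When the sample makes up a non-negligible fraction of a finite population, estimates are less variable than the usual infinite-population formulas suggest, so the variance is reduced accordingly.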
Stratification – Dividing a population into sub-groups and sampling from each group. This protects from obtaining a very poor sample (e.g. under or over-represented groups), can give samples of a known precision, and gives more precise estimates for population means and totals.
Clustering – Dividing a population into sub-groups, and only sampling certain groups. This gives lower precision, but can be much more convenient and cheaper. For example, when surveying school children you might sample only a subset of schools, to avoid travelling to a school to interview a single child.
All of these designs need to be taken into account when calculating statistics, and when producing models. Only summary statistics are discussed in this document, and variances are calculated using the default Taylor series linearisation methods. For a more detailed introduction to survey statistics in SAS, see (Lohr 2022) or (SAS/STAT® 15.1 User’s Guide 2018).
For survey summary statistics in SAS, we can use the SURVEYMEANS and SURVEYFREQ procedures.
Simple Survey Designs
We will use the API dataset (“API Data Files” 2006), which contains a number of datasets based on different samples from a dataset of academic performance. Initially we will just cover the methodology with a simple random sample and a finite population correction to demonstrate functionality.
Mean
If we want to calculate the mean of a variable in a dataset obtained from a simple random sample, such as apisrs, in SAS we can do the following (note: here total=6194 is obtained from the constant fpc column, and provides the finite population correction):
proc surveymeans data=apisrs total=6194 mean;
var growth; run;
The SURVEYMEANS Procedure
Data Summary
Number of Observations 200
Statistics
Std Error
Variable N Mean of Mean 95% CL for Mean
---------------------------------------------------------------------------------
growth 200 31.900000 2.090493 27.7776382 36.0223618
---------------------------------------------------------------------------------
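The effect of total=6194 can be illustrated outside SAS. The following Python sketch is an illustration under stated assumptions, not the procedure's implementation: it applies the textbook simple-random-sample formula, SE = sqrt((1 - n/N) * s^2 / n), to the six growth values shown in the data preview. The function name srs_mean_with_fpc and the six-row toy sample are hypothetical, so the numbers do not reproduce the 200-row output above.

```python
import math
import statistics


def srs_mean_with_fpc(sample, pop_size):
    """Mean and standard error for a simple random sample drawn without
    replacement from a finite population of pop_size units."""
    n = len(sample)
    mean = statistics.fmean(sample)
    s2 = statistics.variance(sample)   # sample variance, n - 1 denominator
    fpc = 1 - n / pop_size             # finite population correction
    return mean, math.sqrt(fpc * s2 / n)


# Hypothetical toy sample: growth values from the six preview rows only.
toy_growth = [14, 47, -8, 115, 20, 13]

mean, se_fpc = srs_mean_with_fpc(toy_growth, pop_size=6194)
_, se_small_pop = srs_mean_with_fpc(toy_growth, pop_size=12)

print(f"mean = {mean:.2f}")
print(f"SE with N = 6194: {se_fpc:.3f}")
print(f"SE with N = 12:   {se_small_pop:.3f}")  # half the population sampled
```

The correction barely matters when 6 units are drawn from 6194, but it shrinks the standard error noticeably once the sample is a large share of the population, and the standard error reaches zero when the whole population is sampled.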
Total
To calculate population totals, we can request the sum. However, SAS requires the user to specify the weights, otherwise the totals will be incorrect. In this case the weights are equal to the total population size divided by the sample size:
data apisrs;
set apisrs nobs=n;
weight = fpc / n;
run;
proc surveymeans data=apisrs total=6194 sum;
var growth;
weight weight; run;
The SURVEYMEANS Procedure
Data Summary
Number of Observations 200
Sum of Weights 6194
Statistics
Std Error
Variable Sum of Sum
----------------------------------------
growth 197589 12949
----------------------------------------
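The weighted total can be checked by hand from values already printed in this document. The Python sketch below is only an arithmetic cross-check (the variable names are illustrative): with equal weights, the estimated total is the population size multiplied by the sample mean, and its standard error scales the same way.

```python
pop_size = 6194          # total= value, taken from the constant fpc column
n_obs = 200              # Number of Observations in the output above
mean_growth = 31.9       # Mean from the earlier SURVEYMEANS output
se_mean = 2.090493       # Std Error of Mean from the earlier output

weight = pop_size / n_obs            # schools represented by each sampled school
total = pop_size * mean_growth       # same as weight * (sample sum of growth)
se_total = pop_size * se_mean        # scaling by a constant scales the SE

print(f"weight   = {weight:.2f}")
print(f"total    = {total:.0f}")
print(f"SE total = {se_total:.0f}")
```

The computed weight agrees with the pw column in the data preview, and the rounded total and standard error agree with the Sum and Std Error of Sum reported above.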
Ratios
To perform ratio analysis for means or proportions of analysis variables in SAS, we can use the following:
proc surveymeans data=apisrs total=6194;
ratio api00 / api99; run;
The SURVEYMEANS Procedure
Data Summary
Number of Observations 200
Statistics
Std Error
Variable N Mean of Mean 95% CL for Mean
---------------------------------------------------------------------------------
api00 200 656.585000 9.249722 638.344950 674.825050
api99 200 624.685000 9.500304 605.950813 643.419187
---------------------------------------------------------------------------------
Ratio Analysis
Std
Numerator Denominator N Ratio Error 95% CL for Ratio
----------------------------------------------------------------------------------------------
api00 api99 200 1.051066 0.003604 1.04395882 1.05817265
----------------------------------------------------------------------------------------------
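The point estimate in the Ratio Analysis table can be reproduced from the two means printed above it. The Python sketch below is a cross-check of that arithmetic only; the standard error of the ratio cannot be recovered from these summary values, because it also depends on how api00 and api99 vary together.

```python
mean_api00 = 656.585   # Mean of api00 from the output above
mean_api99 = 624.685   # Mean of api99 from the output above

ratio = mean_api00 / mean_api99
mean_difference = mean_api00 - mean_api99

print(f"ratio = {ratio:.6f}")
print(f"difference of means = {mean_difference:.3f}")
```

The difference of the two means equals the mean of growth reported in the Mean section, which is consistent with growth being api00 minus api99 in the preview rows.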
Proportions
To calculate a proportion in SAS, we use PROC SURVEYFREQ. The simplest case is shown below:
proc surveyfreq data=apisrs total=6194;
table 'sch.wide'n / cl; run;
The SURVEYFREQ Procedure
Data Summary
Number of Observations 200
Table of sch.wide
Std Err of 95% Confidence Limits
sch.wide Frequency Percent Percent for Percent
-------------------------------------------------------------------------
No 37 18.5000 2.7078 13.1604 23.8396
Yes 163 81.5000 2.7078 76.1604 86.8396
Total 200 100.0000
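The standard error in this table can be cross-checked by hand. The Python sketch below assumes the textbook simple-random-sample formula for a proportion, with an n - 1 denominator and the finite population correction; that formula is an assumption made here for illustration, although it does agree with the printed value.

```python
import math

pop_size = 6194   # total= value used in the procedure call
n_obs = 200       # Number of Observations
n_yes = 163       # Frequency of sch.wide = Yes

p_yes = n_yes / n_obs
fpc = 1 - n_obs / pop_size   # finite population correction
se_pct = 100 * math.sqrt(fpc * p_yes * (1 - p_yes) / (n_obs - 1))

print(f"percent = {100 * p_yes:.4f}")
print(f"SE of percent = {se_pct:.4f}")
```

Because p(1 - p) is symmetric, the same standard error applies to the No row, as the table shows.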
Quantiles
To calculate quantiles in SAS, we can use the quantile option to request specific quantiles, or use keywords to request common quantiles (e.g. quartiles or the median). This will use Woodruff’s method for confidence intervals, and a custom quantile method (SAS/STAT® 15.1 User’s Guide 2018, 9834).
proc surveymeans data=apisrs total=6194 quantile=(0.025 0.5 0.975);
var growth; run;
The SURVEYMEANS Procedure
Data Summary
Number of Observations 200
Quantiles
Std
Variable Percentile Estimate Error 95% Confidence Limits
---------------------------------------------------------------------------------
growth 2.5 -16.500000 1.755916 -19.962591 -13.037409
50 Median 26.500000 1.924351 22.705263 30.294737
97.5 99.000000 16.133827 67.184794 130.815206
---------------------------------------------------------------------------------
Summary Statistics on Complex Survey Designs
Most of the previous examples and notes still apply to more complex survey designs. Here we will demonstrate using a dataset from NHANES (“National Health and Nutrition Examination Survey Data” 2010), which uses both stratification and clustering:
SDMVPSU | SDMVSTRA | WTMEC2YR | HI_CHOL | race | agecat | RIAGENDR |
---|---|---|---|---|---|---|
1 | 83 | 81528.77 | 0 | 2 | (19,39] | 1 |
1 | 84 | 14509.28 | 0 | 3 | (0,19] | 1 |
2 | 86 | 12041.64 | 0 | 3 | (0,19] | 1 |
2 | 75 | 21000.34 | 0 | 3 | (59,Inf] | 2 |
1 | 88 | 22633.58 | 0 | 1 | (19,39] | 1 |
2 | 85 | 74112.49 | 1 | 2 | (39,59] | 2 |
To produce means and standard quartiles for this sample, taking account of sample design, we can use the following:
proc surveymeans data=nhanes mean quartiles;
cluster SDMVPSU;
strata SDMVSTRA;
weight WTMEC2YR;
var HI_CHOL; run;
The SURVEYMEANS Procedure
Data Summary
Number of Strata 15
Number of Clusters 31
Number of Observations 8591
Sum of Weights 276536446
Statistics
Std Error
Variable Mean of Mean
----------------------------------------
HI_CHOL 0.112143 0.005446
----------------------------------------
Quantiles
Std
Variable Percentile Estimate Error 95% Confidence Limits
---------------------------------------------------------------------------------
HI_CHOL 25 Q1 0 0.024281 -0.0514730 0.05147298
50 Median 0 0.024281 -0.0514730 0.05147298
75 Q3 0 0.024281 -0.0514730 0.05147298
---------------------------------------------------------------------------------
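To see what the weight statement contributes to the point estimate, the following Python sketch computes a weighted mean, sum(w * y) / sum(w), on the six NHANES preview rows shown earlier. It is a toy illustration only: six rows cannot reproduce the full-sample estimate, and the strata and cluster information, which affects the standard errors, is ignored here.

```python
# Hypothetical toy data: the six NHANES preview rows (WTMEC2YR, HI_CHOL).
preview = [
    (81528.77, 0),
    (14509.28, 0),
    (12041.64, 0),
    (21000.34, 0),
    (22633.58, 0),
    (74112.49, 1),
]

sum_weights = sum(w for w, _ in preview)
weighted_mean = sum(w * y for w, y in preview) / sum_weights
unweighted_mean = sum(y for _, y in preview) / len(preview)

print(f"weighted mean   = {weighted_mean:.4f}")
print(f"unweighted mean = {unweighted_mean:.4f}")
```

The single row with HI_CHOL = 1 carries a large weight, so the weighted mean is well above the unweighted one. In the full output above, a mean of about 0.11 for a 0/1 variable implies that well over three quarters of the weighted population have the value 0, which is consistent with all three quartiles being estimated as 0.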
To analyse separate subpopulations in SAS we can use the DOMAIN statement (note: do not use the BY statement, as it analyses each group as if it were a separate survey and so does not give statistically valid subpopulation estimates). Here we also request the design effect:
proc surveymeans data=nhanes mean deff;
cluster SDMVPSU;
strata SDMVSTRA;
weight WTMEC2YR;
var HI_CHOL;
domain race; run;
The SURVEYMEANS Procedure
Data Summary
Number of Strata 15
Number of Clusters 31
Number of Observations 8591
Sum of Weights 276536446
Statistics
Std Error Design
Variable Mean of Mean Effect
--------------------------------------------------------
HI_CHOL 0.112143 0.005446 2.336725
--------------------------------------------------------
Statistics for race Domains
Std Error Design
race Variable Mean of Mean Effect
------------------------------------------------------------------------
1 HI_CHOL 0.101492 0.006246 1.082734
2 HI_CHOL 0.121649 0.006604 1.407822
3 HI_CHOL 0.078640 0.010385 2.091156
4 HI_CHOL 0.099679 0.024666 3.098290
------------------------------------------------------------------------
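One way to read the design effects is as inflation factors for the standard error. The Python sketch below assumes the usual definition of the design effect, the ratio of the design-based variance to the variance expected from a simple random sample of the same size, and takes square roots of the values printed above; the dictionary labels are illustrative.

```python
import math

# Design effects copied from the output above (overall, then race 1-4).
design_effects = {
    "overall": 2.336725,
    "race 1": 1.082734,
    "race 2": 1.407822,
    "race 3": 2.091156,
    "race 4": 3.098290,
}

# deff is a ratio of variances, so its square root is the factor by which
# the standard error is inflated relative to a simple random sample.
se_inflation = {k: math.sqrt(v) for k, v in design_effects.items()}

for label, factor in se_inflation.items():
    print(f"{label}: SE inflated by a factor of {factor:.3f}")
```

Under that definition, every value above 1 indicates that the stratified, clustered and weighted design yields a larger standard error for that estimate than a simple random sample of the same size would.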
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.4.0 (2024-04-24)
os Ubuntu 22.04.5 LTS
system x86_64, linux-gnu
ui X11
language (EN)
collate C.UTF-8
ctype C.UTF-8
tz UTC
date 2024-10-10
pandoc 3.2 @ /opt/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
! package * version date (UTC) lib source
P survey * 4.4-2 2024-03-20 [?] RSPM (R 4.4.0)
[1] /home/runner/work/CAMIS/CAMIS/renv/library/linux-ubuntu-jammy/R-4.4/x86_64-pc-linux-gnu
[2] /opt/R/4.4.0/lib/R/library
P ── Loaded and on-disk path mismatch.
─ External software ──────────────────────────────────────────────────────────
setting value
SAS 9.04.01M7P080520
──────────────────────────────────────────────────────────────────────────────