<- lm(y ~ drug + disease + drug*disease, df_disease) lm_model
ANOVA
Introduction
ANOVA (Analysis of Variance) is a statistical method used to compare the means of three or more groups to determine if at least one group mean is significantly different from the others. It helps to test hypotheses about group differences based on sample data.
The key assumptions include:
- Independence of observations
- Normality of the data (the distribution should be approximately normal)
- Homogeneity of variances (similar variances across groups)
Common types include one-way ANOVA (one independent variable) and two-way ANOVA (two independent variables).
One-way ANOVA tests the effect of a single independent variable on a dependent variable (the grouping factor).
Two-way ANOVA tests the effect of two independent variables on a dependent variable and also examines if there is an interaction between the two independent variables.
Getting Started
To demonstrate the various types of sums of squares, we’ll create a data frame called df_disease
taken from the SAS documentation. The corresponding data can be found here.
The Model
For this example, we’re testing for a significant difference in stem_length
using ANOVA. In R, we’re using lm()
to run the ANOVA, and then using broom::glance()
and broom::tidy()
to view the results in a table format.
The glance
function gives us a summary of the model diagnostic values.
%>%
lm_model glance()
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.456 0.326 10.5 3.51 0.00130 11 -212. 450. 477.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
The tidy
function gives a summary of the model results.
%>% tidy() lm_model
# A tibble: 12 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 29.3 4.29 6.84 0.0000000160
2 drug2 -1.33 6.36 -0.210 0.835
3 drug3 -13 7.43 -1.75 0.0869
4 drug4 -15.7 6.36 -2.47 0.0172
5 disease2 -1.08 6.78 -0.160 0.874
6 disease3 -8.93 6.36 -1.40 0.167
7 drug2:disease2 6.58 9.78 0.673 0.504
8 drug3:disease2 -10.9 10.2 -1.06 0.295
9 drug4:disease2 0.317 9.30 0.0340 0.973
10 drug2:disease3 -0.900 9.00 -0.100 0.921
11 drug3:disease3 1.10 10.2 0.107 0.915
12 drug4:disease3 9.53 9.20 1.04 0.306
The Results
You’ll see that R print the individual results for each level of the drug and disease interaction. We can get the combined F table in R using the anova()
function on the model object.
%>%
lm_model anova() %>%
tidy()
# A tibble: 4 × 6
term df sumsq meansq statistic p.value
<chr> <int> <dbl> <dbl> <dbl> <dbl>
1 drug 3 3133. 1044. 9.46 0.0000558
2 disease 2 419. 209. 1.90 0.162
3 drug:disease 6 707. 118. 1.07 0.396
4 Residuals 46 5081. 110. NA NA
We can add a Total
row, by using add_row
and calculating the sum of the degrees of freedom and sum of squares.
%>%
lm_model anova() %>%
tidy() %>%
add_row(term = "Total", df = sum(.$df), sumsq = sum(.$sumsq)) %>%
kable()
term | df | sumsq | meansq | statistic | p.value |
---|---|---|---|---|---|
drug | 3 | 3133.2385 | 1044.4128 | 9.455761 | 0.0000558 |
disease | 2 | 418.8337 | 209.4169 | 1.895990 | 0.1617201 |
drug:disease | 6 | 707.2663 | 117.8777 | 1.067225 | 0.3958458 |
Residuals | 46 | 5080.8167 | 110.4525 | NA | NA |
Total | 57 | 9340.1552 | NA | NA | NA |
Sums of Squares Tables
rstatix
Unfortunately, it is not easy to get the various types of sums of squares calculations in using functions from base R. However, the rstatix
package offers a solution to produce these various sums of squares tables. For each type, you supply the original dataset and model to the. anova_test
function, then specify the type and se detailed = TRUE
.
Type I
%>%
df_disease ::anova_test(
rstatix~ drug + disease + drug*disease,
y type = 1,
detailed = TRUE) %>%
::get_anova_table() %>%
rstatixkable()
Effect | DFn | DFd | SSn | SSd | F | p | p<.05 | ges |
---|---|---|---|---|---|---|---|---|
drug | 3 | 46 | 3133.239 | 5080.817 | 9.456 | 5.58e-05 | * | 0.381 |
disease | 2 | 46 | 418.834 | 5080.817 | 1.896 | 1.62e-01 | 0.076 | |
drug:disease | 6 | 46 | 707.266 | 5080.817 | 1.067 | 3.96e-01 | 0.122 |
Type II
%>%
df_disease ::anova_test(
rstatix~ drug + disease + drug*disease,
y type = 2,
detailed = TRUE) %>%
::get_anova_table() %>%
rstatixkable()
Effect | SSn | SSd | DFn | DFd | F | p | p<.05 | ges |
---|---|---|---|---|---|---|---|---|
drug | 3063.433 | 5080.817 | 3 | 46 | 9.245 | 6.75e-05 | * | 0.376 |
disease | 418.834 | 5080.817 | 2 | 46 | 1.896 | 1.62e-01 | 0.076 | |
drug:disease | 707.266 | 5080.817 | 6 | 46 | 1.067 | 3.96e-01 | 0.122 |
Type III
%>%
df_disease ::anova_test(
rstatix~ drug + disease + drug*disease,
y type = 3,
detailed = TRUE) %>%
::get_anova_table() %>%
rstatixkable()
Effect | SSn | SSd | DFn | DFd | F | p | p<.05 | ges |
---|---|---|---|---|---|---|---|---|
(Intercept) | 20037.613 | 5080.817 | 1 | 46 | 181.414 | 0.00e+00 | * | 0.798 |
drug | 2997.472 | 5080.817 | 3 | 46 | 9.046 | 8.09e-05 | * | 0.371 |
disease | 415.873 | 5080.817 | 2 | 46 | 1.883 | 1.64e-01 | 0.076 | |
drug:disease | 707.266 | 5080.817 | 6 | 46 | 1.067 | 3.96e-01 | 0.122 |
Type IV
In R there is no equivalent operation to the Type IV
sums of squares calculation in SAS.
car
As an alternative to the rstatix
package, you can use the car
package, which is the R package for “An R Companion to Applied Regression, Third Edition, Sage 2019”
Type I
Type I can come directly from stats
.
anova(lm_model)
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
drug 3 3133.2 1044.41 9.4558 5.58e-05 ***
disease 2 418.8 209.42 1.8960 0.1617
drug:disease 6 707.3 117.88 1.0672 0.3958
Residuals 46 5080.8 110.45
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Type II
Anova(lm_model, type = "II")
Anova Table (Type II tests)
Response: y
Sum Sq Df F value Pr(>F)
drug 3063.4 3 9.2451 6.748e-05 ***
disease 418.8 2 1.8960 0.1617
drug:disease 707.3 6 1.0672 0.3958
Residuals 5080.8 46
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Type III
For type III tests the linear model we have used above does not make sense if the contrasts are orthogonal in the row-basis. The default contrast term contr.treatment
are not, so the contrast should be changed. If you want to match the procedure in SAS use contr.sum, contr.poly. This can be either set as an option.
options(contrasts = c("contr.sum", "contr.poly"))
Or, set directly in the model statement using the contrasts
argument
<- lm(y ~ drug + disease + drug*disease, df_disease,
lm_model contrasts = list(drug = "contr.sum", disease = "contr.poly"))
::Anova(lm_model, type = "III") car
Anova Table (Type III tests)
Response: y
Sum Sq Df F value Pr(>F)
(Intercept) 20037.6 1 181.4138 < 2.2e-16 ***
drug 2997.5 3 9.0460 8.086e-05 ***
disease 415.9 2 1.8826 0.1637
drug:disease 707.3 6 1.0672 0.3958
Residuals 5080.8 46
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
rstatix
does this is by default. The rstatix
package uses the car
package to do the anova calculation, but can be nicer to use as it handles the contrasts for you and is more “pipe-able”.
Contrasts
The easiest way to get contrasts in R is by using emmeans
. For looking at contrast we are going to fit a different model on new data, that doesn’t include an interaction term as it is easier to calculate contrasts without an interaction term. For this dataset we have three different drugs A, C, and E.
<- read.csv("../data/drug_trial.csv")
df_trial
lm(formula = post ~ pre + drug, data = df_trial) %>%
emmeans("drug") %>%
contrast(method = list(
"C vs A" = c(-1, 1, 0),
"E vs CA" = c(-1, -1, 2)
))
contrast estimate SE df t.ratio p.value
C vs A 0.109 1.80 26 0.061 0.9521
E vs CA 6.783 3.28 26 2.067 0.0488