lm_model <- lm(y ~ drug + disease + drug*disease, df_disease)
ANOVA
Introduction
ANOVA (Analysis of Variance) is a statistical method used to compare the means of three or more groups to determine if at least one group mean is significantly different from the others. It helps to test hypotheses about group differences based on sample data.
The key assumptions include:
- Independence of observations
- Normality of the residuals (within each group, the response should be approximately normally distributed)
- Homogeneity of variances (similar variances across groups)
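The normality and equal-variance assumptions listed above can be checked informally before relying on the F test. The following is a minimal sketch using only base R and the built-in PlantGrowth data, which is an assumption made for illustration and not the dataset analysed on this page. Independence is a property of the study design and cannot be tested this way.

```r
# Built-in example data, used here only for illustration
# (this is not the df_disease data analysed later on this page)
fit <- aov(weight ~ group, data = PlantGrowth)

# Normality: Shapiro-Wilk test on the model residuals
shapiro_res <- shapiro.test(residuals(fit))

# Homogeneity of variances: Bartlett's test across the groups
bartlett_res <- bartlett.test(weight ~ group, data = PlantGrowth)

# Small p-values would suggest a departure from the assumption
shapiro_res$p.value
bartlett_res$p.value
```

Residual plots such as plot(fit) are a common complement to these tests, since formal tests can be insensitive in small samples.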
Common types include one-way ANOVA (one independent variable) and two-way ANOVA (two independent variables).
One-way ANOVA tests the effect of a single independent variable (the grouping factor) on a dependent variable.
Two-way ANOVA tests the effect of two independent variables on a dependent variable and also examines if there is an interaction between the two independent variables.
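As a rough illustration of the difference between the two designs, the sketch below fits each with base R's aov(). The built-in PlantGrowth and warpbreaks datasets are assumptions chosen for convenience; they are unrelated to the data used in the rest of this page.

```r
# One-way ANOVA: one grouping factor (group) and one response (weight)
one_way <- aov(weight ~ group, data = PlantGrowth)
summary(one_way)

# Two-way ANOVA: two factors plus their interaction.
# wool * tension expands to wool + tension + wool:tension
two_way <- aov(breaks ~ wool * tension, data = warpbreaks)
summary(two_way)
```

The one-way table has a single effect row plus residuals, while the two-way table has one row per main effect, one for the interaction, and one for residuals.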
Getting Started
To demonstrate the various types of sums of squares, we’ll create a data frame called df_disease taken from the SAS documentation. The corresponding data can be found here.
The Model
For this example, we’re testing for a significant difference in the response y across the drug and disease groups using ANOVA. In R, we’re using lm() to run the ANOVA, storing the fit as lm_model, and then using broom::glance() and broom::tidy() to view the results in a table format.
The glance function gives us a summary of the model diagnostic values.
lm_model %>%
  glance() %>%
  pivot_longer(everything())
# A tibble: 12 × 2
name value
<chr> <dbl>
1 r.squared 0.456
2 adj.r.squared 0.326
3 sigma 10.5
4 statistic 3.51
5 p.value 0.00130
6 df 11
7 logLik -212.
8 AIC 450.
9 BIC 477.
10 deviance 5081.
11 df.residual 46
12 nobs 58
The tidy function gives a summary of the model results.
lm_model %>% tidy()
# A tibble: 12 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 29.3 4.29 6.84 0.0000000160
2 drug2 -1.33 6.36 -0.210 0.835
3 drug3 -13 7.43 -1.75 0.0869
4 drug4 -15.7 6.36 -2.47 0.0172
5 disease2 -1.08 6.78 -0.160 0.874
6 disease3 -8.93 6.36 -1.40 0.167
7 drug2:disease2 6.58 9.78 0.673 0.504
8 drug3:disease2 -10.9 10.2 -1.06 0.295
9 drug4:disease2 0.317 9.30 0.0340 0.973
10 drug2:disease3 -0.900 9.00 -0.100 0.921
11 drug3:disease3 1.10 10.2 0.107 0.915
12 drug4:disease3 9.53 9.20 1.04 0.306
The Results
You’ll see that R prints the individual results for each level of the drug and disease interaction. We can get the combined F table in R using the anova() function on the model object.
lm_model %>%
  anova() %>%
  tidy() %>%
  kable()
term | df | sumsq | meansq | statistic | p.value |
---|---|---|---|---|---|
drug | 3 | 3133.2385 | 1044.4128 | 9.455761 | 0.0000558 |
disease | 2 | 418.8337 | 209.4169 | 1.895990 | 0.1617201 |
drug:disease | 6 | 707.2663 | 117.8777 | 1.067225 | 0.3958458 |
Residuals | 46 | 5080.8167 | 110.4525 | NA | NA |
We can add a Total row by using add_row and calculating the sum of the degrees of freedom and sum of squares.
lm_model %>%
  anova() %>%
  tidy() %>%
  add_row(term = "Total", df = sum(.$df), sumsq = sum(.$sumsq)) %>%
  kable()
term | df | sumsq | meansq | statistic | p.value |
---|---|---|---|---|---|
drug | 3 | 3133.2385 | 1044.4128 | 9.455761 | 0.0000558 |
disease | 2 | 418.8337 | 209.4169 | 1.895990 | 0.1617201 |
drug:disease | 6 | 707.2663 | 117.8777 | 1.067225 | 0.3958458 |
Residuals | 46 | 5080.8167 | 110.4525 | NA | NA |
Total | 57 | 9340.1552 | NA | NA | NA |
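The Total row above can be sanity-checked: the sequential sums of squares from anova() add up to the corrected total sum of squares of the response, and the degrees of freedom add up to n - 1 (57 for the 58 observations here). A minimal sketch of that check follows, using the built-in warpbreaks data as a stand-in because df_disease is not reproduced on this page.

```r
# Stand-in data and model (assumption: warpbreaks replaces df_disease)
fit <- lm(breaks ~ wool * tension, data = warpbreaks)
tab <- anova(fit)

# Total from the ANOVA table: sum over all rows, including Residuals
total_ss_table <- sum(tab[["Sum Sq"]])
total_df_table <- sum(tab[["Df"]])

# Total computed directly from the response
y <- warpbreaks$breaks
total_ss_direct <- sum((y - mean(y))^2)
total_df_direct <- length(y) - 1

all.equal(total_ss_table, total_ss_direct)
total_df_table == total_df_direct
```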
Sums of Squares Tables
Unfortunately, it is not easy to get the various types of sums of squares using only functions from base R. However, the rstatix package offers a solution to produce these various sums of squares tables. For each type, you supply the original dataset and model to the anova_test function, then specify the type and set detailed = TRUE.
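The type only matters when the design is unbalanced. Type I sums of squares are sequential, so each term is adjusted only for the terms entered before it, and the result for a factor can change with the order of the formula. The sketch below illustrates this with base R; the unbalanced data frame wb is a hypothetical example built by dropping a few rows from the built-in warpbreaks data.

```r
# Hypothetical unbalanced data: remove five rows from warpbreaks
wb <- warpbreaks[-c(1, 2, 3, 10, 20), ]

# Sequential (Type I) sum of squares for wool when it enters first ...
ss_wool_first <- anova(lm(breaks ~ wool + tension, data = wb))["wool", "Sum Sq"]

# ... and when it enters after tension
ss_wool_second <- anova(lm(breaks ~ tension + wool, data = wb))["wool", "Sum Sq"]

# The two values differ because the cell counts are unequal
c(wool_first = ss_wool_first, wool_second = ss_wool_second)
```

With balanced data the two values would coincide, which is why the choice of type is mainly a concern for unbalanced designs such as the one on this page.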
Type I
df_disease %>%
  rstatix::anova_test(
    y ~ drug + disease + drug*disease,
    type = 1,
    detailed = TRUE) %>%
  rstatix::get_anova_table() %>%
  kable()
Effect | DFn | DFd | SSn | SSd | F | p | p<.05 | ges |
---|---|---|---|---|---|---|---|---|
drug | 3 | 46 | 3133.239 | 5080.817 | 9.456 | 5.58e-05 | * | 0.381 |
disease | 2 | 46 | 418.834 | 5080.817 | 1.896 | 1.62e-01 | | 0.076 |
drug:disease | 6 | 46 | 707.266 | 5080.817 | 1.067 | 3.96e-01 | | 0.122 |
Type II
df_disease %>%
  rstatix::anova_test(
    y ~ drug + disease + drug*disease,
    type = 2,
    detailed = TRUE) %>%
  rstatix::get_anova_table() %>%
  kable()
Effect | SSn | SSd | DFn | DFd | F | p | p<.05 | ges |
---|---|---|---|---|---|---|---|---|
drug | 3063.433 | 5080.817 | 3 | 46 | 9.245 | 6.75e-05 | * | 0.376 |
disease | 418.834 | 5080.817 | 2 | 46 | 1.896 | 1.62e-01 | | 0.076 |
drug:disease | 707.266 | 5080.817 | 6 | 46 | 1.067 | 3.96e-01 | | 0.122 |
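Type II sums of squares adjust each main effect for the other main effects but not for the interaction, which is why the disease and drug:disease rows above match the Type I table (both were the last term at their level) while the drug row does not. One way to see what the Type II value represents is a base R model comparison. This is only a sketch on the hypothetical unbalanced wb data derived from warpbreaks, not the computation rstatix performs internally.

```r
# Hypothetical unbalanced data: remove five rows from warpbreaks
wb <- warpbreaks[-c(1, 2, 3, 10, 20), ]

# Type II SS for wool: improvement from adding wool to a model
# that already contains the other main effect (tension)
without_wool <- lm(breaks ~ tension, data = wb)
with_wool <- lm(breaks ~ tension + wool, data = wb)
ss_wool_type2 <- anova(without_wool, with_wool)[2, "Sum of Sq"]

# Same quantity read from a sequential table where wool enters last
ss_wool_last <- anova(with_wool)["wool", "Sum Sq"]

all.equal(ss_wool_type2, ss_wool_last)
```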
Type III
df_disease %>%
  rstatix::anova_test(
    y ~ drug + disease + drug*disease,
    type = 3,
    detailed = TRUE) %>%
  rstatix::get_anova_table() %>%
  kable()
Effect | SSn | SSd | DFn | DFd | F | p | p<.05 | ges |
---|---|---|---|---|---|---|---|---|
(Intercept) | 20037.613 | 5080.817 | 1 | 46 | 181.414 | 0.00e+00 | * | 0.798 |
drug | 2997.472 | 5080.817 | 3 | 46 | 9.046 | 8.09e-05 | * | 0.371 |
disease | 415.873 | 5080.817 | 2 | 46 | 1.883 | 1.64e-01 | | 0.076 |
drug:disease | 707.266 | 5080.817 | 6 | 46 | 1.067 | 3.96e-01 | | 0.122 |
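Type III sums of squares adjust every term for all other terms in the model, including the interaction, so the main-effect values depend on how the factors are coded. A commonly used base R sketch is to fit the model with sum-to-zero contrasts and drop each term in turn with drop1(). The example below uses the hypothetical unbalanced wb data derived from warpbreaks and is an illustration of the idea, not a claim about how rstatix computes the table above.

```r
# Hypothetical unbalanced data: remove five rows from warpbreaks
wb <- warpbreaks[-c(1, 2, 3, 10, 20), ]

# Sum-to-zero contrasts so that main effects are meaningful
# in the presence of the interaction
fit_sum <- lm(
  breaks ~ wool * tension,
  data = wb,
  contrasts = list(wool = "contr.sum", tension = "contr.sum")
)

# Drop every term in turn, ignoring marginality, with an F test each time
type3 <- drop1(fit_sum, scope = . ~ ., test = "F")
type3
```

The interaction row is the same under every type, as in the tables on this page, because the interaction is always adjusted for both main effects.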
Type IV
In R there is no equivalent operation to the Type IV sums of squares calculation in SAS.