ANOVA

Introduction

ANOVA (Analysis of Variance) is a statistical method used to compare the means of three or more groups to determine if at least one group mean is significantly different from the others. It helps to test hypotheses about group differences based on sample data.

The key assumptions include:

  • Independence of observations
  • Normality of the data (the distribution should be approximately normal)
  • Homogeneity of variances (similar variances across groups)

Common types include one-way ANOVA (one independent variable) and two-way ANOVA (two independent variables).

One-way ANOVA tests the effect of a single independent variable on a dependent variable (the grouping factor).

Two-way ANOVA tests the effect of two independent variables on a dependent variable and also examines if there is an interaction between the two independent variables.

Getting Started

To demonstrate the various types of sums of squares, we’ll create a data frame called df_disease taken from the SAS documentation. The corresponding data can be found here.

The Model

For this example, we’re testing for a significant difference in stem_length using ANOVA. In R, we’re using lm() to run the ANOVA, and then using broom::glance() and broom::tidy() to view the results in a table format.

lm_model <- lm(y ~ drug + disease + drug*disease, df_disease)

The glance function gives us a summary of the model diagnostic values.

lm_model %>% 
  glance() %>% 
  pivot_longer(everything())
# A tibble: 12 × 2
   name               value
   <chr>              <dbl>
 1 r.squared        0.456  
 2 adj.r.squared    0.326  
 3 sigma           10.5    
 4 statistic        3.51   
 5 p.value          0.00130
 6 df              11      
 7 logLik        -212.     
 8 AIC            450.     
 9 BIC            477.     
10 deviance      5081.     
11 df.residual     46      
12 nobs            58      

The tidy function gives a summary of the model results.

lm_model %>% tidy()
# A tibble: 12 × 5
   term           estimate std.error statistic      p.value
   <chr>             <dbl>     <dbl>     <dbl>        <dbl>
 1 (Intercept)      29.3        4.29    6.84   0.0000000160
 2 drug2            -1.33       6.36   -0.210  0.835       
 3 drug3           -13          7.43   -1.75   0.0869      
 4 drug4           -15.7        6.36   -2.47   0.0172      
 5 disease2         -1.08       6.78   -0.160  0.874       
 6 disease3         -8.93       6.36   -1.40   0.167       
 7 drug2:disease2    6.58       9.78    0.673  0.504       
 8 drug3:disease2  -10.9       10.2    -1.06   0.295       
 9 drug4:disease2    0.317      9.30    0.0340 0.973       
10 drug2:disease3   -0.900      9.00   -0.100  0.921       
11 drug3:disease3    1.10      10.2     0.107  0.915       
12 drug4:disease3    9.53       9.20    1.04   0.306       

The Results

You’ll see that R print the individual results for each level of the drug and disease interaction. We can get the combined F table in R using the anova() function on the model object.

lm_model %>% 
  anova() %>% 
  tidy() %>% 
  kable()
term df sumsq meansq statistic p.value
drug 3 3133.2385 1044.4128 9.455761 0.0000558
disease 2 418.8337 209.4169 1.895990 0.1617201
drug:disease 6 707.2663 117.8777 1.067225 0.3958458
Residuals 46 5080.8167 110.4525 NA NA

We can add a Total row, by using add_row and calculating the sum of the degrees of freedom and sum of squares.

lm_model %>%
  anova() %>%
  tidy() %>%
  add_row(term = "Total", df = sum(.$df), sumsq = sum(.$sumsq)) %>% 
  kable()
term df sumsq meansq statistic p.value
drug 3 3133.2385 1044.4128 9.455761 0.0000558
disease 2 418.8337 209.4169 1.895990 0.1617201
drug:disease 6 707.2663 117.8777 1.067225 0.3958458
Residuals 46 5080.8167 110.4525 NA NA
Total 57 9340.1552 NA NA NA

Sums of Squares Tables

Unfortunately, it is not easy to get the various types of sums of squares calculations in using functions from base R. However, the rstatix package offers a solution to produce these various sums of squares tables. For each type, you supply the original dataset and model to the. anova_test function, then specify the ttype and se detailed = TRUE.

Type I

df_disease %>% 
  rstatix::anova_test(
    y ~ drug + disease + drug*disease, 
    type = 1, 
    detailed = TRUE) %>% 
  rstatix::get_anova_table() %>% 
  kable()
Effect DFn DFd SSn SSd F p p<.05 ges
drug 3 46 3133.239 5080.817 9.456 5.58e-05 * 0.381
disease 2 46 418.834 5080.817 1.896 1.62e-01 0.076
drug:disease 6 46 707.266 5080.817 1.067 3.96e-01 0.122

Type II

df_disease %>% 
  rstatix::anova_test(
    y ~ drug + disease + drug*disease, 
    type = 2, 
    detailed = TRUE) %>% 
  rstatix::get_anova_table() %>% 
  kable()
Effect SSn SSd DFn DFd F p p<.05 ges
drug 3063.433 5080.817 3 46 9.245 6.75e-05 * 0.376
disease 418.834 5080.817 2 46 1.896 1.62e-01 0.076
drug:disease 707.266 5080.817 6 46 1.067 3.96e-01 0.122

Type III

df_disease %>% 
  rstatix::anova_test(
    y ~ drug + disease + drug*disease, 
    type = 3, 
    detailed = TRUE) %>% 
  rstatix::get_anova_table() %>% 
  kable()
Effect SSn SSd DFn DFd F p p<.05 ges
(Intercept) 20037.613 5080.817 1 46 181.414 0.00e+00 * 0.798
drug 2997.472 5080.817 3 46 9.046 8.09e-05 * 0.371
disease 415.873 5080.817 2 46 1.883 1.64e-01 0.076
drug:disease 707.266 5080.817 6 46 1.067 3.96e-01 0.122

Type IV

In R there is no equivalent operation to the Type IV sums of squares calculation in SAS.