ANOVA

Introduction

ANOVA (Analysis of Variance) is a statistical method used to compare the means of three or more groups to determine if at least one group mean is significantly different from the others. It helps to test hypotheses about group differences based on sample data.

The key assumptions include:

  • Independence of observations
  • Normality of the data (the distribution should be approximately normal)
  • Homogeneity of variances (similar variances across groups)

Common types include one-way ANOVA (one independent variable) and two-way ANOVA (two independent variables).

One-way ANOVA tests the effect of a single independent variable on a dependent variable (the grouping factor).

Two-way ANOVA tests the effect of two independent variables on a dependent variable and also examines if there is an interaction between the two independent variables.

Getting Started

To demonstrate the various types of sums of squares, we’ll create a data frame called df_disease taken from the SAS documentation. The corresponding data can be found here.

The Model

For this example, we’re testing for a significant difference in stem_length using ANOVA. In R, we’re using lm() to run the ANOVA, and then using broom::glance() and broom::tidy() to view the results in a table format.

lm_model <- lm(y ~ drug + disease + drug*disease, df_disease)

The glance function gives us a summary of the model diagnostic values.

lm_model %>% 
  glance() 
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.456         0.326  10.5      3.51 0.00130    11  -212.  450.  477.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

The tidy function gives a summary of the model results.

lm_model %>% tidy()
# A tibble: 12 × 5
   term           estimate std.error statistic      p.value
   <chr>             <dbl>     <dbl>     <dbl>        <dbl>
 1 (Intercept)      29.3        4.29    6.84   0.0000000160
 2 drug2            -1.33       6.36   -0.210  0.835       
 3 drug3           -13          7.43   -1.75   0.0869      
 4 drug4           -15.7        6.36   -2.47   0.0172      
 5 disease2         -1.08       6.78   -0.160  0.874       
 6 disease3         -8.93       6.36   -1.40   0.167       
 7 drug2:disease2    6.58       9.78    0.673  0.504       
 8 drug3:disease2  -10.9       10.2    -1.06   0.295       
 9 drug4:disease2    0.317      9.30    0.0340 0.973       
10 drug2:disease3   -0.900      9.00   -0.100  0.921       
11 drug3:disease3    1.10      10.2     0.107  0.915       
12 drug4:disease3    9.53       9.20    1.04   0.306       

The Results

You’ll see that R print the individual results for each level of the drug and disease interaction. We can get the combined F table in R using the anova() function on the model object.

lm_model %>% 
  anova() %>% 
  tidy() 
# A tibble: 4 × 6
  term            df sumsq meansq statistic    p.value
  <chr>        <int> <dbl>  <dbl>     <dbl>      <dbl>
1 drug             3 3133.  1044.      9.46  0.0000558
2 disease          2  419.   209.      1.90  0.162    
3 drug:disease     6  707.   118.      1.07  0.396    
4 Residuals       46 5081.   110.     NA    NA        

We can add a Total row, by using add_row and calculating the sum of the degrees of freedom and sum of squares.

lm_model %>%
  anova() %>%
  tidy() %>%
  add_row(term = "Total", df = sum(.$df), sumsq = sum(.$sumsq)) %>% 
  kable()
term df sumsq meansq statistic p.value
drug 3 3133.2385 1044.4128 9.455761 0.0000558
disease 2 418.8337 209.4169 1.895990 0.1617201
drug:disease 6 707.2663 117.8777 1.067225 0.3958458
Residuals 46 5080.8167 110.4525 NA NA
Total 57 9340.1552 NA NA NA

Sums of Squares Tables

rstatix

Unfortunately, it is not easy to get the various types of sums of squares calculations in using functions from base R. However, the rstatix package offers a solution to produce these various sums of squares tables. For each type, you supply the original dataset and model to the. anova_test function, then specify the type and se detailed = TRUE.

Type I

df_disease %>% 
  rstatix::anova_test(
    y ~ drug + disease + drug*disease, 
    type = 1, 
    detailed = TRUE) %>% 
  rstatix::get_anova_table() %>% 
  kable()
Effect DFn DFd SSn SSd F p p<.05 ges
drug 3 46 3133.239 5080.817 9.456 5.58e-05 * 0.381
disease 2 46 418.834 5080.817 1.896 1.62e-01 0.076
drug:disease 6 46 707.266 5080.817 1.067 3.96e-01 0.122

Type II

df_disease %>% 
  rstatix::anova_test(
    y ~ drug + disease + drug*disease, 
    type = 2, 
    detailed = TRUE) %>% 
  rstatix::get_anova_table() %>% 
  kable()
Effect SSn SSd DFn DFd F p p<.05 ges
drug 3063.433 5080.817 3 46 9.245 6.75e-05 * 0.376
disease 418.834 5080.817 2 46 1.896 1.62e-01 0.076
drug:disease 707.266 5080.817 6 46 1.067 3.96e-01 0.122

Type III

df_disease %>% 
  rstatix::anova_test(
    y ~ drug + disease + drug*disease, 
    type = 3, 
    detailed = TRUE) %>% 
  rstatix::get_anova_table() %>% 
  kable()
Effect SSn SSd DFn DFd F p p<.05 ges
(Intercept) 20037.613 5080.817 1 46 181.414 0.00e+00 * 0.798
drug 2997.472 5080.817 3 46 9.046 8.09e-05 * 0.371
disease 415.873 5080.817 2 46 1.883 1.64e-01 0.076
drug:disease 707.266 5080.817 6 46 1.067 3.96e-01 0.122

Type IV

In R there is no equivalent operation to the Type IV sums of squares calculation in SAS.

car

As an alternative to the rstatix package, you can use the car package, which is the R package for “An R Companion to Applied Regression, Third Edition, Sage 2019”

Type I

Type I can come directly from stats.

anova(lm_model)
Analysis of Variance Table

Response: y
             Df Sum Sq Mean Sq F value   Pr(>F)    
drug          3 3133.2 1044.41  9.4558 5.58e-05 ***
disease       2  418.8  209.42  1.8960   0.1617    
drug:disease  6  707.3  117.88  1.0672   0.3958    
Residuals    46 5080.8  110.45                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Type II

Anova(lm_model, type = "II")
Anova Table (Type II tests)

Response: y
             Sum Sq Df F value    Pr(>F)    
drug         3063.4  3  9.2451 6.748e-05 ***
disease       418.8  2  1.8960    0.1617    
drug:disease  707.3  6  1.0672    0.3958    
Residuals    5080.8 46                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Type III

For type III tests the linear model we have used above does not make sense if the contrasts are orthogonal in the row-basis. The default contrast term contr.treatment are not, so the contrast should be changed. If you want to match the procedure in SAS use contr.sum, contr.poly. This can be either set as an option.

options(contrasts = c("contr.sum", "contr.poly"))

Or, set directly in the model statement using the contrasts argument

lm_model <- lm(y ~ drug + disease + drug*disease, df_disease, 
              contrasts = list(drug = "contr.sum", disease = "contr.poly"))
car::Anova(lm_model, type = "III")
Anova Table (Type III tests)

Response: y
              Sum Sq Df  F value    Pr(>F)    
(Intercept)  20037.6  1 181.4138 < 2.2e-16 ***
drug          2997.5  3   9.0460 8.086e-05 ***
disease        415.9  2   1.8826    0.1637    
drug:disease   707.3  6   1.0672    0.3958    
Residuals     5080.8 46                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

rstatix does this is by default. The rstatix package uses the car package to do the anova calculation, but can be nicer to use as it handles the contrasts for you and is more “pipe-able”.

Contrasts

The easiest way to get contrasts in R is by using emmeans. For looking at contrast we are going to fit a different model on new data, that doesn’t include an interaction term as it is easier to calculate contrasts without an interaction term. For this dataset we have three different drugs A, C, and E.

df_trial<- read.csv("../data/drug_trial.csv")

lm(formula = post ~ pre + drug, data = df_trial) %>% 
  emmeans("drug") %>% 
  contrast(method = list(
    "C vs A"  = c(-1,  1, 0),
    "E vs CA" = c(-1, -1, 2)
  ))
 contrast estimate   SE df t.ratio p.value
 C vs A      0.109 1.80 26   0.061  0.9521
 E vs CA     6.783 3.28 26   2.067  0.0488