ANOVA

Introduction

ANOVA (Analysis of Variance) is a statistical method used to compare the means of three or more groups to determine if at least one group mean is significantly different from the others. It helps to test hypotheses about group differences based on sample data.

The key assumptions include:

Independence of observations
Normality of the residuals (within each group, the response should be approximately normally distributed)
Homogeneity of variances (similar variances across groups)
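As a rough illustration of how the last two assumptions might be checked, the sketch below fits a one-way model to simulated data and applies two tests from base R. The data frame df_sim and its columns are made up for this example and are not part of the disease data used later. Independence is usually a matter of study design and is not something a test on the data alone can confirm.

```r
# Simulated data (an assumption for illustration): three groups of 20
set.seed(42)
df_sim <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 20)),
  y     = rnorm(60, mean = rep(c(10, 12, 15), each = 20), sd = 2)
)

fit <- lm(y ~ group, data = df_sim)

# Normality: Shapiro-Wilk test on the model residuals
normality <- shapiro.test(residuals(fit))

# Homogeneity of variances: Bartlett's test across the groups
homogeneity <- bartlett.test(y ~ group, data = df_sim)

normality$p.value
homogeneity$p.value
```

Small p-values would suggest a departure from the corresponding assumption. Residual plots such as plot(fit) are a common complement to these tests.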

Common types include one-way ANOVA (one independent variable) and two-way ANOVA (two independent variables).

One-way ANOVA tests the effect of a single independent variable (the grouping factor) on a dependent variable.

Two-way ANOVA tests the effect of two independent variables on a dependent variable and also examines if there is an interaction between the two independent variables.
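To make the distinction concrete, here is a minimal sketch using two datasets that ship with R (PlantGrowth and warpbreaks). They are chosen only for illustration and are unrelated to the disease data used in the rest of this page.

```r
# One-way ANOVA: plant weight by treatment group (one factor, three levels)
one_way <- aov(weight ~ group, data = PlantGrowth)
summary(one_way)

# Two-way ANOVA with interaction: number of breaks by wool type and tension
# wool * tension expands to wool + tension + wool:tension
two_way <- aov(breaks ~ wool * tension, data = warpbreaks)
summary(two_way)
```

The one-way table has a single row for the factor, while the two-way table has a row for each main effect plus a row for the interaction.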

Getting Started

To demonstrate the various types of sums of squares, we’ll create a data frame called df_disease taken from the SAS documentation. The corresponding data can be found here.

The Model

For this example, we’re testing for a significant difference in the response y across the levels of drug and disease using ANOVA. Before getting the sums of squares and associated p-values from the ANOVA, we need to fit a linear model. In R, we’re using lm() to fit the model, and then using broom::glance() and broom::tidy() to view the results in a table format.

lm_model <- lm(y ~ drug + disease + drug * disease, df_disease)

The glance function gives us a one-row summary of the model fit statistics.

lm_model |>
  broom::glance()

# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.456         0.326  10.5      3.51 0.00130    11  -212.  450.  477.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

The tidy function gives one row per model coefficient, with its estimate, standard error, test statistic and p-value.

lm_model |>
  broom::tidy()

# A tibble: 12 × 5
   term           estimate std.error statistic      p.value
   <chr>             <dbl>     <dbl>     <dbl>        <dbl>
 1 (Intercept)      29.3        4.29    6.84   0.0000000160
 2 drug2            -1.33       6.36   -0.210  0.835       
 3 drug3           -13          7.43   -1.75   0.0869      
 4 drug4           -15.7        6.36   -2.47   0.0172      
 5 disease2         -1.08       6.78   -0.160  0.874       
 6 disease3         -8.93       6.36   -1.40   0.167       
 7 drug2:disease2    6.58       9.78    0.673  0.504       
 8 drug3:disease2  -10.8       10.2    -1.06   0.295       
 9 drug4:disease2    0.317      9.30    0.0340 0.973       
10 drug2:disease3   -0.900      9.00   -0.100  0.921       
11 drug3:disease3    1.10      10.2     0.107  0.915       
12 drug4:disease3    9.53       9.20    1.04   0.306

Sums of Squares Tables

Type I

Type I sums of squares, also known as sequential sums of squares, assess the model terms sequentially. The contribution of each factor is evaluated in the order in which the terms are specified, with each factor adjusted only for the terms that precede it. This means that the significance of a factor can depend on its position in the model formula. Type I ANOVA is useful for hierarchical models, where the order in which factors enter the model is meaningful or based on theoretical considerations. It can be used on unbalanced designs, but there it often does not test the hypothesis of interest.

For a model with two factors, A and B (in that order), the sums of squares will be tested like this:

- SS(A) for factor A.
- SS(B | A) for factor B.
- SS(AB | B, A) for interaction AB.
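The sequential definition above can be reproduced by hand as differences in residual sums of squares between nested models. The sketch below uses a small simulated, unbalanced data set (df_sim, with factors a and b) and a helper named rss(). Both are assumptions made for this illustration.

```r
# Simulated unbalanced two-factor data (an assumption for illustration)
set.seed(1)
df_sim <- expand.grid(a = factor(1:2), b = factor(1:3))
df_sim <- df_sim[rep(seq_len(nrow(df_sim)), times = c(3, 5, 4, 2, 6, 4)), ]
df_sim$y <- rnorm(
  nrow(df_sim),
  mean = as.integer(df_sim$a) + 2 * as.integer(df_sim$b)
)

# Helper: residual sum of squares of a model fitted to df_sim
rss <- function(formula) deviance(lm(formula, data = df_sim))

# Sequential (Type I) sums of squares from nested fits
ss_a   <- rss(y ~ 1) - rss(y ~ a)                # SS(A)
ss_b_a <- rss(y ~ a) - rss(y ~ a + b)            # SS(B | A)
ss_ab  <- rss(y ~ a + b) - rss(y ~ a + b + a:b)  # SS(AB | B, A)

# Compare with the table produced by anova()
type1 <- anova(lm(y ~ a * b, data = df_sim))
all.equal(c(ss_a, ss_b_a, ss_ab), type1[["Sum Sq"]][1:3])

# Reversing the order of the factors changes the main-effect rows
anova(lm(y ~ b * a, data = df_sim))
```

Because the design is unbalanced, the main-effect rows of the second table differ from those of the first, which is the order dependence described above.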

This can be calculated using the base R {stats} package or the {rstatix} package. Both give the same result.

stats

stats::anova(lm_model)

Analysis of Variance Table

Response: y
             Df Sum Sq Mean Sq F value   Pr(>F)    
drug          3 3133.2 1044.41  9.4558 5.58e-05 ***
disease       2  418.8  209.42  1.8960   0.1617    
drug:disease  6  707.3  117.88  1.0672   0.3958    
Residuals    46 5080.8  110.45                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

rstatix

df_disease |>
  rstatix::anova_test(
    y ~ drug + disease + drug * disease,
    type = 1,
    detailed = TRUE
  )

Warning: NA detected in rows: 8,10,15,20,25,29,37,38,41,43,51,54,56,72.
Removing this rows before the analysis.

ANOVA Table (type I tests)

        Effect DFn DFd      SSn      SSd     F        p p<.05   ges
1         drug   3  46 3133.239 5080.817 9.456 5.58e-05     * 0.381
2      disease   2  46  418.834 5080.817 1.896 1.62e-01       0.076
3 drug:disease   6  46  707.266 5080.817 1.067 3.96e-01       0.122

Type II

Type II sums of squares, also known as hierarchical or partially sequential sums of squares, test the effect of adding a factor to the model after all other factors have been added. The significance of each main effect is therefore assessed while controlling for the other main effects in the model, but not for the interactions that contain it. Type II ANOVA is particularly useful when there are no interactions in the model or when the focus is on main effects only. It is often used in unbalanced designs, where the number of observations varies across groups.

For a model with two factors, A and B (in that order), the sums of squares will be tested like this:

- SS(A | B) for factor A.
- SS(B | A) for factor B.
- SS(AB | A, B) for interaction AB.
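As a sketch of what the main-effect quantities mean, each can be written as a difference in residual sums of squares between the additive model and the model without that factor. The simulated data frame df_sim and the helper rss() are assumptions made for this illustration.

```r
# Simulated unbalanced two-factor data (an assumption for illustration)
set.seed(1)
df_sim <- expand.grid(a = factor(1:2), b = factor(1:3))
df_sim <- df_sim[rep(seq_len(nrow(df_sim)), times = c(3, 5, 4, 2, 6, 4)), ]
df_sim$y <- rnorm(
  nrow(df_sim),
  mean = as.integer(df_sim$a) + 2 * as.integer(df_sim$b)
)

# Helper: residual sum of squares of a model fitted to df_sim
rss <- function(formula) deviance(lm(formula, data = df_sim))

# Each main effect adjusted for the other main effect only
ss_a_b <- rss(y ~ b) - rss(y ~ a + b)  # SS(A | B)
ss_b_a <- rss(y ~ a) - rss(y ~ a + b)  # SS(B | A)

# drop1() on the additive model removes each main effect in turn
# and reports the same two sums of squares
additive <- drop1(lm(y ~ a + b, data = df_sim), test = "F")
all.equal(c(ss_a_b, ss_b_a), additive[["Sum of Sq"]][-1])
```

Only the sums of squares are compared here. The F statistics from drop1() on the additive model use the additive model's residual, so they need not match a Type II table computed from a model that also contains the interaction.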

This can be calculated using the {car} package or the {rstatix} package. Both give the same result.

car

car::Anova(lm_model, type = "II")

Anova Table (Type II tests)

Response: y
             Sum Sq Df F value    Pr(>F)    
drug         3063.4  3  9.2451 6.748e-05 ***
disease       418.8  2  1.8960    0.1617    
drug:disease  707.3  6  1.0672    0.3958    
Residuals    5080.8 46                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

rstatix

df_disease |>
  rstatix::anova_test(
    y ~ drug + disease + drug * disease,
    type = 2,
    detailed = TRUE
  )

Warning: NA detected in rows: 8,10,15,20,25,29,37,38,41,43,51,54,56,72.
Removing this rows before the analysis.

ANOVA Table (type II tests)

        Effect      SSn      SSd DFn DFd     F        p p<.05   ges
1         drug 3063.433 5080.817   3  46 9.245 6.75e-05     * 0.376
2      disease  418.834 5080.817   2  46 1.896 1.62e-01       0.076
3 drug:disease  707.266 5080.817   6  46 1.067 3.96e-01       0.122

Type III

Type III sums of squares are calculated such that every effect is adjusted for all other effects. This means testing for the presence of a main effect after adjusting for the other main effects and the interactions.

For a model with two factors, A and B (in that order), the sums of squares will be tested like this:

- SS(A | B, AB) for factor A.
- SS(B | A, AB) for factor B.
- SS(AB | A, B) for interaction AB.

This can be calculated using the base R {stats} package, the {car} package or the {rstatix} package. All give the same result.

Note: Calculating type III sums of squares in R is a bit tricky, because the multi-way ANOVA model is over-parameterised. When fitting the linear model we therefore need to choose contrasts whose design-matrix columns sum to zero. In R the options are "contr.sum" or "contr.poly".

# Drug design matrix
contr.sum(4) # Using 4 here as we have 4 levels of drug

  [,1] [,2] [,3]
1    1    0    0
2    0    1    0
3    0    0    1
4   -1   -1   -1

# Disease design matrix
contr.sum(3)

  [,1] [,2]
1    1    0
2    0    1
3   -1   -1

Although it is not relevant for this example, because the disease variable isn’t ordinal, the polynomial design matrix would look like this:

contr.poly(3)

                .L         .Q
[1,] -7.071068e-01  0.4082483
[2,] -9.073800e-17 -0.8164966
[3,]  7.071068e-01  0.4082483

lm_model <- lm(
  y ~ drug + disease + drug * disease,
  df_disease,
  contrasts = list(drug = "contr.sum", disease = "contr.sum")
)
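To illustrate why the choice of contrasts matters, the following sketch fits the same interaction model twice to simulated data, once with R's default treatment contrasts and once with sum-to-zero contrasts, and compares the single-term deletions. The data frame df_sim and the object names are assumptions made for this example.

```r
# Simulated unbalanced two-factor data (an assumption for illustration)
set.seed(1)
df_sim <- expand.grid(a = factor(1:2), b = factor(1:3))
df_sim <- df_sim[rep(seq_len(nrow(df_sim)), times = c(3, 5, 4, 2, 6, 4)), ]
df_sim$y <- rnorm(
  nrow(df_sim),
  mean = as.integer(df_sim$a) + 2 * as.integer(df_sim$b)
)

# Each column of a sum-to-zero contrast matrix adds up to zero
colSums(contr.sum(3))

# Same model, two codings of the factors
fit_treatment <- lm(y ~ a * b, data = df_sim)  # default contr.treatment
fit_sum <- lm(
  y ~ a * b,
  data = df_sim,
  contrasts = list(a = "contr.sum", b = "contr.sum")
)

d_treatment <- drop1(fit_treatment, scope = . ~ ., test = "F")
d_sum       <- drop1(fit_sum, scope = . ~ ., test = "F")

# The interaction row agrees between the two codings
all.equal(d_treatment["a:b", "Sum of Sq"], d_sum["a:b", "Sum of Sq"])

# The main-effect rows generally do not
d_treatment[c("a", "b"), "Sum of Sq"]
d_sum[c("a", "b"), "Sum of Sq"]
```

With treatment contrasts, dropping a main effect while keeping the interaction tests that factor only at the reference level of the other factor. With sum-to-zero contrasts it tests the effect averaged over the levels of the other factor, which is the Type III hypothesis.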

stats

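Using the base stats package, you can use the drop1() function, which drops each single term from the model in turn. The scope argument specifies which terms are candidates for dropping; . ~ . makes every term in the model a candidate, including the main effects that are contained in the interaction.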

stats::drop1(lm_model, scope = . ~ ., test = "F")

Single term deletions

Model:
y ~ drug + disease + drug * disease
             Df Sum of Sq    RSS    AIC F value    Pr(>F)    
<none>                    5080.8 283.42                      
drug          3   2997.47 8078.3 304.32  9.0460 8.086e-05 ***
disease       2    415.87 5496.7 283.99  1.8826    0.1637    
drug:disease  6    707.27 5788.1 278.98  1.0672    0.3958    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

car

car::Anova(lm_model, type = "III")

Anova Table (Type III tests)

Response: y
              Sum Sq Df  F value    Pr(>F)    
(Intercept)  20037.6  1 181.4138 < 2.2e-16 ***
drug          2997.5  3   9.0460 8.086e-05 ***
disease        415.9  2   1.8826    0.1637    
drug:disease   707.3  6   1.0672    0.3958    
Residuals     5080.8 46                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

rstatix

The rstatix package uses the car package to do the ANOVA calculation, but can be nicer to use as it handles the contrasts for you and is more “pipe-able”.

df_disease |>
  rstatix::anova_test(
    y ~ drug + disease + drug * disease,
    type = 3,
    detailed = TRUE
  )

Warning: NA detected in rows: 8,10,15,20,25,29,37,38,41,43,51,54,56,72.
Removing this rows before the analysis.

ANOVA Table (type III tests)

        Effect       SSn      SSd DFn DFd       F        p p<.05   ges
1  (Intercept) 20037.613 5080.817   1  46 181.414 1.42e-17     * 0.798
2         drug  2997.472 5080.817   3  46   9.046 8.09e-05     * 0.371
3      disease   415.873 5080.817   2  46   1.883 1.64e-01       0.076
4 drug:disease   707.266 5080.817   6  46   1.067 3.96e-01       0.122

Type IV

In R there is no equivalent operation to the Type IV sums of squares calculation in SAS.

Contrasts

The easiest way to get contrasts in R is by using emmeans. To look at contrasts, we are going to fit a different model on new data. The model doesn’t include an interaction term, as it is easier to calculate contrasts without one. This dataset has three different drugs: A, C, and E.

df_trial <- read.csv("../data/drug_trial.csv")

lm(formula = post ~ pre + drug, data = df_trial) |>
  emmeans::emmeans("drug") |>
  emmeans::contrast(
    method = list(
      # Weights follow the level order A, C, E
      "C vs A" = c(-1, 1, 0),
      "E vs CA" = c(-1, -1, 2)
    )
  )

 contrast estimate   SE df t.ratio p.value
 C vs A      0.109 1.80 26   0.061  0.9521
 E vs CA     6.783 3.28 26   2.067  0.0488
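For readers who want to see what such a contrast computes, here is a base R sketch of the "E vs CA" weights applied by hand. It uses simulated data rather than the drug trial data, so the numbers will not match the table above, and the names df_sim, grid, X and w are assumptions made for this illustration. The adjusted means are evaluated at the mean of pre, which is how emmeans treats a numeric covariate by default.

```r
# Simulated data (an assumption for illustration): three drugs, baseline "pre"
set.seed(7)
df_sim <- data.frame(
  drug = factor(rep(c("A", "C", "E"), each = 10)),
  pre  = rnorm(30, mean = 10, sd = 3)
)
drug_effect <- c(0, 0.5, 3)[as.integer(df_sim$drug)]
df_sim$post <- 2 + 0.8 * df_sim$pre + drug_effect + rnorm(30)

fit <- lm(post ~ pre + drug, data = df_sim)

# Adjusted means: predictions for each drug at the mean of the covariate
grid <- data.frame(
  drug = factor(levels(df_sim$drug), levels = levels(df_sim$drug)),
  pre  = mean(df_sim$pre)
)
X <- model.matrix(~ pre + drug, data = grid)
adj_means <- drop(X %*% coef(fit))

# Contrast weights in the level order A, C, E
w <- c(-1, -1, 2)

estimate <- sum(w * adj_means)
se       <- sqrt(drop(t(w) %*% X %*% vcov(fit) %*% t(X) %*% w))
t_ratio  <- estimate / se
p_value  <- 2 * pt(abs(t_ratio), df = df.residual(fit), lower.tail = FALSE)

c(estimate = estimate, SE = se, t.ratio = t_ratio, p.value = p_value)
```

Because the weights sum to zero, the intercept and covariate terms cancel, and the estimate reduces to twice the drugE coefficient minus the drugC coefficient.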
