ANOVA

Introduction

Analysis of Variance (ANOVA) is a statistical test for differences between the means of several groups, typically three or more. It is best suited to data whose residuals are approximately normally distributed. By partitioning the total variance into components, ANOVA separates the variation attributable to each factor from the residual variation. It can handle multiple factors and their interactions, which makes it a useful way to study more intricate relationships between variables.
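The variance partition described above can be sketched by hand for the one-way case. This is a minimal illustration, assuming three small made-up groups (the values are hypothetical and not taken from the disease data set): the total sum of squares splits into a between-group and a within-group part, and the F statistic is the ratio of their mean squares.

```python
import numpy as np

# Hypothetical measurements for three groups (illustrative values only)
groups = [
    np.array([4.0, 5.0, 6.0]),
    np.array([7.0, 8.0, 9.0]),
    np.array([1.0, 2.0, 3.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total variation of every observation around the grand mean
ss_total = ((all_values - grand_mean) ** 2).sum()

# Variation of the group means around the grand mean (weighted by group size)
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Variation of the observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Degrees of freedom and the F statistic as a ratio of mean squares
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)

print(ss_total, ss_between, ss_within, f_stat)
```

For these values the total sum of squares (60) equals the between-group part (54) plus the within-group part (6), which gives F = 27 on 2 and 6 degrees of freedom.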

ANOVA Test in Python

To perform a one-way ANOVA test in Python, we can use the f_oneway() function from the SciPy library. Similarly, the anova_lm() function from the statsmodels library is frequently used to perform a two-way ANOVA test.
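As a quick illustration of the one-way case, the following sketch calls f_oneway() on three small samples. The sample values and the names group_a, group_b and group_c are hypothetical and only serve to show the call; they are not part of the disease data set.

```python
from scipy.stats import f_oneway

# Hypothetical measurements for three independent groups (illustrative only)
group_a = [10.0, 12.0, 11.0, 13.0]
group_b = [14.0, 15.0, 16.0, 15.0]
group_c = [9.0, 8.0, 10.0, 9.0]

# f_oneway takes one sequence per group and returns the F statistic and p-value
result = f_oneway(group_a, group_b, group_c)

print(f"F = {result.statistic:.3f}, p = {result.pvalue:.5f}")
```

A small p-value suggests that at least one group mean differs from the others; it does not say which one, so a post-hoc comparison would be needed for that.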

For this test, we’ll create a data frame called df from the disease data set used in the SAS documentation. The corresponding data can be found here. In this experiment, we are trying to find the impact of the different drug and disease groups on the stem length (the response y).

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Read the sample data
df = pd.read_csv("../data/sas_disease.csv")


# Perform a two-way ANOVA with interaction (Type II sums of squares)
model = ols('y ~ C(drug) + C(disease) + C(drug):C(disease)', data=df).fit()
sm.stats.anova_lm(model, typ=2)
                         sum_sq    df         F    PR(>F)
C(drug)             3063.432863   3.0  9.245096  0.000067
C(disease)           418.833741   2.0  1.895990  0.161720
C(drug):C(disease)   707.266259   6.0  1.067225  0.395846
Residual            5080.816667  46.0       NaN       NaN

Sum of Squares Tables

Type I

model = ols('y ~ C(drug) + C(disease) + C(drug):C(disease)', data=df).fit()
sm.stats.anova_lm(model)
                      df       sum_sq      mean_sq         F    PR(>F)
C(drug)              3.0  3133.238506  1044.412835  9.455761  0.000056
C(disease)           2.0   418.833741   209.416870  1.895990  0.161720
C(drug):C(disease)   6.0   707.266259   117.877710  1.067225  0.395846
Residual            46.0  5080.816667   110.452536       NaN       NaN

Type II

model = ols('y ~ C(drug) + C(disease) + C(drug):C(disease)', data=df).fit()
sm.stats.anova_lm(model, typ=2)
                         sum_sq    df         F    PR(>F)
C(drug)             3063.432863   3.0  9.245096  0.000067
C(disease)           418.833741   2.0  1.895990  0.161720
C(drug):C(disease)   707.266259   6.0  1.067225  0.395846
Residual            5080.816667  46.0       NaN       NaN

Type III

model = ols('y ~ C(drug,Sum) + C(disease,Sum) + C(drug,Sum):C(disease,Sum)', data=df).fit()
sm.stats.anova_lm(model, typ=3)
                                    sum_sq    df           F        PR(>F)
Intercept                     20037.613011   1.0  181.413788  1.417921e-17
C(drug, Sum)                   2997.471860   3.0    9.046033  8.086388e-05
C(disease, Sum)                 415.873046   2.0    1.882587  1.637355e-01
C(drug, Sum):C(disease, Sum)    707.266259   6.0    1.067225  3.958458e-01
Residual                       5080.816667  46.0         NaN           NaN

Type IV

Python has no Type IV sum of squares calculation comparable to the one in SAS.