ANOVA

Introduction

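Analysis of variance (ANOVA) is a statistical test for differences among the means of two or more groups; it is most often used with three or more, since two groups can be compared with a t-test. It works best when the observations are independent and the residuals within each group are approximately normally distributed with similar variances. By partitioning the total variance into components attributable to each factor and to residual error, ANOVA shows which sources of variation are larger than chance alone would explain. It can also handle multiple factors and their interactions, which makes it useful for studying how several variables jointly affect a response.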
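To make the idea of partitioning variance concrete, here is a minimal sketch that computes the between-group and within-group sums of squares by hand for a one-way layout. The three groups and their values are hypothetical and chosen only to keep the arithmetic easy to follow; only the Python standard library is used.

```python
from statistics import mean

# Hypothetical measurements for three groups (illustrative values only)
groups = {
    "a": [1.0, 2.0, 3.0],
    "b": [2.0, 3.0, 4.0],
    "c": [5.0, 6.0, 7.0],
}

all_values = [v for values in groups.values() for v in values]
grand_mean = mean(all_values)

# Total variation: every observation compared with the grand mean
ss_total = sum((v - grand_mean) ** 2 for v in all_values)

# Between-group variation: each group mean compared with the grand mean
ss_between = sum(
    len(values) * (mean(values) - grand_mean) ** 2 for values in groups.values()
)

# Within-group (residual) variation: each observation compared with its group mean
ss_within = sum(
    (v - mean(values)) ** 2 for values in groups.values() for v in values
)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

# F statistic: ratio of the between-group to the within-group mean square
f_stat = (ss_between / df_between) / (ss_within / df_within)

print(round(ss_total, 6), round(ss_between + ss_within, 6))  # both 32.0
print(round(f_stat, 6))  # 13.0
```

The two sums of squares add up to the total, and the F statistic is the ratio of their mean squares. A large F suggests that the group means differ by more than the within-group scatter would explain.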

ANOVA Test in Python

To perform a one-way ANOVA test in Python, we can use the f_oneway() function from the SciPy library. Similarly, the anova_lm() function from the statsmodels library is frequently used to perform a two-way ANOVA test.
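As a quick illustration of the one-way case, the sketch below passes three samples to f_oneway(). The group values are hypothetical and are not part of the SAS data set used in the rest of this page.

```python
from scipy.stats import f_oneway

# Hypothetical measurements for three groups (illustrative values only)
group_a = [1.0, 2.0, 3.0]
group_b = [2.0, 3.0, 4.0]
group_c = [5.0, 6.0, 7.0]

# f_oneway takes one sequence per group and returns the F statistic and p-value
result = f_oneway(group_a, group_b, group_c)

print(result.statistic, result.pvalue)
```

A small p-value indicates that at least one group mean differs from the others; it does not say which one, so a follow-up comparison is usually needed.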

For the two-way test, we’ll read a data set taken from the SAS documentation into a data frame called df. The corresponding data can be found here. In this experiment, we are trying to find the impact of drug and disease group on the stem-length.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Read the sample data
df = pd.read_csv("../data/sas_disease.csv")


# Perform two-way ANOVA with an interaction term (Type II sums of squares)
model = ols('y ~ C(drug) + C(disease) + C(drug):C(disease)', data=df).fit()
sm.stats.anova_lm(model, typ=2)

                         sum_sq    df         F    PR(>F)
C(drug)             3063.432863   3.0  9.245096  0.000067
C(disease)           418.833741   2.0  1.895990  0.161720
C(drug):C(disease)   707.266259   6.0  1.067225  0.395846
Residual            5080.816667  46.0       NaN       NaN
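Because anova_lm() returns a pandas DataFrame, the result can be filtered like any other data frame. The sketch below re-creates the printed table by hand, so that it runs without the CSV file, and selects the terms whose p-value falls below a significance level. The 0.05 threshold is an assumption made for illustration.

```python
import pandas as pd

nan = float("nan")

# Values copied from the printed Type II table, so no CSV file is needed
table = pd.DataFrame(
    {
        "sum_sq": [3063.432863, 418.833741, 707.266259, 5080.816667],
        "df": [3.0, 2.0, 6.0, 46.0],
        "F": [9.245096, 1.895990, 1.067225, nan],
        "PR(>F)": [0.000067, 0.161720, 0.395846, nan],
    },
    index=["C(drug)", "C(disease)", "C(drug):C(disease)", "Residual"],
)

alpha = 0.05  # assumed significance level, not taken from the source

# Comparisons with NaN are False, so the Residual row is dropped automatically
significant = table[table["PR(>F)"] < alpha]

print(significant.index.tolist())  # ['C(drug)']
```

With these values, only the drug main effect is significant at the assumed level; the disease main effect and the interaction are not.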

Sum of Squares Tables

Type I

model = ols('y ~ C(drug) + C(disease) + C(drug):C(disease)', data=df).fit()
sm.stats.anova_lm(model)  # typ=1 is the default

                      df       sum_sq      mean_sq         F    PR(>F)
C(drug)              3.0  3133.238506  1044.412835  9.455761  0.000056
C(disease)           2.0   418.833741   209.416870  1.895990  0.161720
C(drug):C(disease)   6.0   707.266259   117.877710  1.067225  0.395846
Residual            46.0  5080.816667   110.452536       NaN       NaN

Type II

model = ols('y ~ C(drug) + C(disease) + C(drug):C(disease)', data=df).fit()
sm.stats.anova_lm(model, typ=2)

                         sum_sq    df         F    PR(>F)
C(drug)             3063.432863   3.0  9.245096  0.000067
C(disease)           418.833741   2.0  1.895990  0.161720
C(drug):C(disease)   707.266259   6.0  1.067225  0.395846
Residual            5080.816667  46.0       NaN       NaN

Type III

# Sum (deviation) contrasts are used so that the Type III main-effect tests are meaningful
model = ols('y ~ C(drug, Sum) + C(disease, Sum) + C(drug, Sum):C(disease, Sum)', data=df).fit()
sm.stats.anova_lm(model, typ=3)

                                    sum_sq    df           F        PR(>F)
Intercept                     20037.613011   1.0  181.413788  1.417921e-17
C(drug, Sum)                   2997.471860   3.0    9.046033  8.086388e-05
C(disease, Sum)                 415.873046   2.0    1.882587  1.637355e-01
C(drug, Sum):C(disease, Sum)    707.266259   6.0    1.067225  3.958458e-01
Residual                       5080.816667  46.0         NaN           NaN

Type IV

Unlike SAS, Python has no Type IV sum of squares calculation; anova_lm() supports only Types I, II and III.