Analysis of Variance (ANOVA) is a statistical test for differences between the means of more than two groups. It is best suited to data that are approximately normally distributed. By partitioning the total variance into components, ANOVA attributes the variation in the response to each explanatory variable and to residual error. It can handle multiple factors and their interactions, which makes it useful for studying how several variables jointly affect a response.
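To make the idea of partitioning variance concrete, here is a minimal sketch in plain Python. The three groups and their values are made up for illustration and are not taken from the data used later in this article. It splits the total sum of squares into a between-group part and a within-group part and forms the F statistic from them.

```python
# Hypothetical measurements for three groups (illustrative values only).
groups = {
    "a": [4.0, 5.0, 6.0],
    "b": [7.0, 8.0, 9.0],
    "c": [10.0, 11.0, 12.0],
}

all_values = [v for values in groups.values() for v in values]
n_total = len(all_values)
n_groups = len(groups)
grand_mean = sum(all_values) / n_total

# Between-group variation: how far each group mean sits from the grand mean.
ss_between = sum(
    len(values) * (sum(values) / len(values) - grand_mean) ** 2
    for values in groups.values()
)

# Within-group variation: how far each observation sits from its own group mean.
ss_within = sum(
    (v - sum(values) / len(values)) ** 2
    for values in groups.values()
    for v in values
)

# Total variation around the grand mean; it equals ss_between + ss_within.
ss_total = sum((v - grand_mean) ** 2 for v in all_values)

# F statistic: ratio of the between-group to the within-group mean square.
f_stat = (ss_between / (n_groups - 1)) / (ss_within / (n_total - n_groups))

print(ss_between, ss_within, ss_total, f_stat)
```

A large F value means that the group means differ by more than the spread inside the groups would suggest.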
ANOVA Test in Python
To perform a one-way ANOVA test in Python, we can use the f_oneway() function from the SciPy library. Similarly, the anova_lm() function from the statsmodels library is frequently used to perform a two-way ANOVA test.
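As a quick illustration of the one-way case, the sketch below calls scipy.stats.f_oneway() on three small samples. The sample names and values are hypothetical and serve only to show the call; they are unrelated to the SAS data used in the rest of this article.

```python
from scipy.stats import f_oneway

# Hypothetical responses under three dose levels (illustrative values only).
low_dose = [20.0, 21.0, 19.0, 22.0]
mid_dose = [24.0, 25.0, 23.0, 26.0]
high_dose = [30.0, 29.0, 31.0, 28.0]

# f_oneway takes one sequence per group and returns the F statistic and p-value.
f_stat, p_value = f_oneway(low_dose, mid_dose, high_dose)

print(f"F = {f_stat:.3f}, p = {p_value:.6f}")
```

A small p-value suggests that at least one group mean differs from the others. It does not say which one, so a post-hoc comparison is usually the next step.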
For this test, we'll create a data frame called df from the disease data set in the SAS documentation. The corresponding data can be found here. In this experiment, we are trying to find the impact of different drug and disease groups on the stem-length.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Read the sample data
df = pd.read_csv("../data/sas_disease.csv")

# Perform two-way ANOVA
model = ols('y ~ C(drug) + C(disease) + C(drug):C(disease)', data=df).fit()
sm.stats.anova_lm(model, typ=2)
                         sum_sq    df         F    PR(>F)
C(drug)             3063.432863   3.0  9.245096  0.000067
C(disease)           418.833741   2.0  1.895990  0.161720
C(drug):C(disease)   707.266259   6.0  1.067225  0.395846
Residual            5080.816667  46.0       NaN       NaN
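The F column can be recovered from the other columns: each term's mean square (sum_sq divided by df) is divided by the residual mean square. The sketch below redoes that arithmetic in plain Python with the values copied from the table above; the variable names are only illustrative.

```python
# (sum_sq, df) pairs copied from the ANOVA table above.
terms = {
    "C(drug)": (3063.432863, 3.0),
    "C(disease)": (418.833741, 2.0),
    "C(drug):C(disease)": (707.266259, 6.0),
}
resid_sum_sq, resid_df = 5080.816667, 46.0

# The residual mean square is the denominator of every F ratio.
ms_resid = resid_sum_sq / resid_df

# F for each term = (term mean square) / (residual mean square).
f_values = {
    name: (sum_sq / dof) / ms_resid
    for name, (sum_sq, dof) in terms.items()
}

for name, f_value in f_values.items():
    print(f"{name}: F = {f_value:.6f}")
```

Reading the PR(>F) column at the usual 0.05 level, only the drug effect is significant in this table; the disease effect and the drug-by-disease interaction are not.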
Sum of Squares Tables
Type I
model = ols('y ~ C(drug) + C(disease) + C(drug):C(disease)', data=df).fit()
sm.stats.anova_lm(model)