import pandas as pd
# Input data
data = {
    'drug': ["A", "A", "A", "A", "A", "A", "A", "A", "A", "A",
             "D", "D", "D", "D", "D", "D", "D", "D", "D", "D",
             "F", "F", "F", "F", "F", "F", "F", "F", "F", "F"],
    'pre': [11, 8, 5, 14, 19, 6, 10, 6, 11, 3,
            6, 6, 7, 8, 18, 8, 19, 8, 5, 15,
            16, 13, 11, 9, 21, 16, 12, 12, 7, 12],
    'post': [6, 0, 2, 8, 11, 4, 13, 1, 8, 0,
             0, 2, 3, 1, 18, 4, 14, 9, 1, 9,
             13, 10, 18, 5, 23, 12, 5, 16, 1, 20]
}

df = pd.DataFrame(data)
Ancova
Introduction
In this example, we’re looking at Analysis of Covariance (ANCOVA). ANCOVA is typically used to analyse treatment differences while adjusting for a baseline covariate; to see examples of prediction models, go to the simple linear regression page.
Data Summary
# Descriptive statistics
summary_stats = df.describe()

# Calculate median
median_pre = df['pre'].median()
median_post = df['post'].median()

# Add median to the summary statistics
summary_stats.loc['median'] = [median_pre, median_post]
print(summary_stats)
pre post
count 30.000000 30.000000
mean 10.733333 7.900000
std 4.791755 6.666178
min 3.000000 0.000000
25% 7.000000 2.000000
50% 10.500000 7.000000
75% 13.750000 12.750000
max 21.000000 23.000000
median 10.500000 7.000000
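Since the data contain three treatment groups, it can also help to summarise pre and post by drug. The following is a small pandas sketch (using only the libraries already imported above) that reports the mean, standard deviation and median per group.
# Group-wise summary of the baseline and follow-up scores by drug
group_summary = df.groupby('drug')[['pre', 'post']].agg(['mean', 'std', 'median'])
print(group_summary)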
Ancova in Python
In Python, ANCOVA can be performed using the statsmodels library, which provides an R-style formula interface on top of NumPy and SciPy.
import statsmodels.api as sm
import statsmodels.formula.api as smf
from tabulate import tabulate
# Fit the ANCOVA model
model_ancova = smf.ols('post ~ drug + pre', data=df).fit()

# Summary of the model
model_summary = model_ancova.summary()
print(model_summary)

# Extracting glance (summary) information
model_glance = {
    'r_squared': model_ancova.rsquared,
    'adj_r_squared': model_ancova.rsquared_adj,
    'f_statistic': model_ancova.fvalue,
    'f_pvalue': model_ancova.f_pvalue,
    'aic': model_ancova.aic,
    'bic': model_ancova.bic
}
model_glance_df = pd.DataFrame([model_glance])
print(tabulate(model_glance_df, headers='keys', tablefmt='grid'))

# Extracting tidy (coefficients) information
model_tidy = model_ancova.summary2().tables[1]
print(tabulate(model_tidy, headers='keys', tablefmt='grid'))
OLS Regression Results
==============================================================================
Dep. Variable: post R-squared: 0.676
Model: OLS Adj. R-squared: 0.639
Method: Least Squares F-statistic: 18.10
Date: Thu, 10 Oct 2024 Prob (F-statistic): 1.50e-06
Time: 16:53:10 Log-Likelihood: -82.054
No. Observations: 30 AIC: 172.1
Df Residuals: 26 BIC: 177.7
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -3.8808 1.986 -1.954 0.062 -7.964 0.202
drug[T.D] 0.1090 1.795 0.061 0.952 -3.581 3.799
drug[T.F] 3.4461 1.887 1.826 0.079 -0.432 7.324
pre 0.9872 0.164 6.001 0.000 0.649 1.325
==============================================================================
Omnibus: 2.609 Durbin-Watson: 2.526
Prob(Omnibus): 0.271 Jarque-Bera (JB): 2.148
Skew: 0.645 Prob(JB): 0.342
Kurtosis: 2.765 Cond. No. 39.8
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
+----+-------------+-----------------+---------------+-------------+---------+---------+
| | r_squared | adj_r_squared | f_statistic | f_pvalue | aic | bic |
+====+=============+=================+===============+=============+=========+=========+
| 0 | 0.676261 | 0.638906 | 18.1039 | 1.50137e-06 | 172.108 | 177.712 |
+----+-------------+-----------------+---------------+-------------+---------+---------+
+-----------+-----------+------------+------------+-------------+-----------+----------+
| | Coef. | Std.Err. | t | P>|t| | [0.025 | 0.975] |
+===========+===========+============+============+=============+===========+==========+
| Intercept | -3.88081 | 1.9862 | -1.95388 | 0.0615519 | -7.96351 | 0.201887 |
+-----------+-----------+------------+------------+-------------+-----------+----------+
| drug[T.D] | 0.108971 | 1.79514 | 0.0607037 | 0.952059 | -3.58098 | 3.79892 |
+-----------+-----------+------------+------------+-------------+-----------+----------+
| drug[T.F] | 3.44614 | 1.88678 | 1.82646 | 0.0792846 | -0.432195 | 7.32447 |
+-----------+-----------+------------+------------+-------------+-----------+----------+
| pre | 0.987184 | 0.164498 | 6.00121 | 2.45433e-06 | 0.649054 | 1.32531 |
+-----------+-----------+------------+------------+-------------+-----------+----------+
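Since drug level 'A' is the reference, the drug[T.D] and drug[T.F] rows above are the covariate-adjusted differences of D and F versus A. Other adjusted contrasts can be requested from the fitted model; the following is a minimal sketch using statsmodels' t_test with a string constraint (the term names are taken from the summary above).
# 95% confidence intervals for all coefficients
print(model_ancova.conf_int())

# Adjusted difference between drugs F and D (neither is the reference level)
print(model_ancova.t_test('drug[T.F] - drug[T.D] = 0'))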
Please note that all values match the corresponding R version, except for the AIC and BIC values, which differ slightly; this should be acceptable for most practical purposes. Currently, there are ongoing discussions in the statsmodels community regarding the computational details of AIC and BIC.
The following code can be used to enforce complete consistency of AIC and BIC values with R outputs by adding 1 to the number of parameters:
import numpy as np
# Manual calculation of AIC and BIC to ensure consistency with R
n = df.shape[0]                 # number of observations
k = model_ancova.df_model + 1   # number of parameters (including intercept)
log_lik = model_ancova.llf      # log-likelihood

# Adjusted number of parameters (including scale parameter)
k_adjusted = k + 1

# Manually calculate AIC and BIC to match R's behavior
aic_adjusted = 2 * k_adjusted - 2 * log_lik
bic_adjusted = np.log(n) * k_adjusted - 2 * log_lik
print(f"Number of observations (n): {n}")
print(f"Number of parameters (k_adjusted): {k_adjusted}")
print(f"Log-likelihood: {log_lik}")
print(f"AIC (adjusted): {aic_adjusted}")
print(f"BIC (adjusted): {bic_adjusted}")
Number of observations (n): 30
Number of parameters (k_adjusted): 5.0
Log-likelihood: -82.0537744890265
AIC (adjusted): 174.107548978053
BIC (adjusted): 181.11353588636376
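For comparison, the values reported by statsmodels out of the box can be printed alongside the adjusted ones; the gap is exactly 2 for AIC and log(n) for BIC, reflecting the additional scale parameter counted above.
# Compare statsmodels' built-in AIC/BIC with the R-style adjusted values
print(f"AIC (statsmodels): {model_ancova.aic:.3f}  vs adjusted: {aic_adjusted:.3f}")
print(f"BIC (statsmodels): {model_ancova.bic:.3f}  vs adjusted: {bic_adjusted:.3f}")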
There are different types of ANOVA computations (Type I, II and III sums of squares). The statsmodels.stats.anova.anova_lm function supports types 1, 2 and 3. The code to compute these types is shown below:
import statsmodels.formula.api as smf
import statsmodels.stats.anova as ssa
# Center the predictor for Type III anova
#df['pre_centered'] = df['pre'] - df['pre'].mean()
# Fit the model for types I and II anova
model = smf.ols('post ~ C(drug) + pre', data=df).fit()

# Perform anova for types I and II
ancova_table_type_1 = ssa.anova_lm(model, typ=1)
ancova_table_type_2 = ssa.anova_lm(model, typ=2)

# Fit the model for Type III anova
model_type_3 = smf.ols('post ~ C(drug) + pre', data=df).fit()
ancova_table_type_3 = ssa.anova_lm(model_type_3, typ=3)

# Calculate SSd (sum of squares for residuals)
ssd_type1 = ancova_table_type_1['sum_sq'].loc['Residual']
ssd_type2 = ancova_table_type_2['sum_sq'].loc['Residual']
ssd_type3 = ancova_table_type_3['sum_sq'].loc['Residual']

# Calculate ges (generalised eta squared)
ancova_table_type_1['ges'] = ancova_table_type_1['sum_sq'] / (ancova_table_type_1['sum_sq'] + ssd_type1)
ancova_table_type_2['ges'] = ancova_table_type_2['sum_sq'] / (ancova_table_type_2['sum_sq'] + ssd_type2)
ancova_table_type_3['ges'] = ancova_table_type_3['sum_sq'] / (ancova_table_type_3['sum_sq'] + ssd_type3)

# Add SSd column
ancova_table_type_1['SSd'] = ssd_type1
ancova_table_type_2['SSd'] = ssd_type2
ancova_table_type_3['SSd'] = ssd_type3

# Add significance column
ancova_table_type_1['p<0.05'] = ancova_table_type_1['PR(>F)'] < 0.05
ancova_table_type_2['p<0.05'] = ancova_table_type_2['PR(>F)'] < 0.05
ancova_table_type_3['p<0.05'] = ancova_table_type_3['PR(>F)'] < 0.05

# Rename columns to match the R output
ancova_table_type_1.rename(columns={'sum_sq': 'SSn', 'df': 'DFn', 'F': 'F', 'PR(>F)': 'p'}, inplace=True)
ancova_table_type_1.reset_index(inplace=True)
ancova_table_type_1.rename(columns={'index': 'Effect'}, inplace=True)

ancova_table_type_2.rename(columns={'sum_sq': 'SSn', 'df': 'DFn', 'F': 'F', 'PR(>F)': 'p'}, inplace=True)
ancova_table_type_2.reset_index(inplace=True)
ancova_table_type_2.rename(columns={'index': 'Effect'}, inplace=True)

ancova_table_type_3.rename(columns={'sum_sq': 'SSn', 'df': 'DFn', 'F': 'F', 'PR(>F)': 'p'}, inplace=True)
ancova_table_type_3.reset_index(inplace=True)
ancova_table_type_3.rename(columns={'index': 'Effect'}, inplace=True)

# Calculate DFd (degrees of freedom for residuals)
dfd_type1 = ancova_table_type_1.loc[ancova_table_type_1['Effect'] == 'Residual', 'DFn'].values[0]
dfd_type2 = ancova_table_type_2.loc[ancova_table_type_2['Effect'] == 'Residual', 'DFn'].values[0]
dfd_type3 = ancova_table_type_3.loc[ancova_table_type_3['Effect'] == 'Residual', 'DFn'].values[0]
ancova_table_type_1['DFd'] = dfd_type1
ancova_table_type_2['DFd'] = dfd_type2
ancova_table_type_3['DFd'] = dfd_type3

# Filter out the Residual row
ancova_table_type_1 = ancova_table_type_1[ancova_table_type_1['Effect'] != 'Residual']
ancova_table_type_2 = ancova_table_type_2[ancova_table_type_2['Effect'] != 'Residual']
ancova_table_type_3 = ancova_table_type_3[ancova_table_type_3['Effect'] != 'Residual']

# Select and reorder columns to match the R output
ancova_table_type_1 = ancova_table_type_1[['Effect', 'DFn', 'DFd', 'SSn', 'SSd', 'F', 'p', 'p<0.05', 'ges']]
ancova_table_type_2 = ancova_table_type_2[['Effect', 'DFn', 'DFd', 'SSn', 'SSd', 'F', 'p', 'p<0.05', 'ges']]
ancova_table_type_3 = ancova_table_type_3[['Effect', 'DFn', 'DFd', 'SSn', 'SSd', 'F', 'p', 'p<0.05', 'ges']]
Type 1 Ancova in Python
print(tabulate(ancova_table_type_1, headers='keys', tablefmt='grid'))
+----+----------+-------+-------+---------+---------+----------+-------------+----------+----------+
| | Effect | DFn | DFd | SSn | SSd | F | p | p<0.05 | ges |
+====+==========+=======+=======+=========+=========+==========+=============+==========+==========+
| 0 | C(drug) | 2 | 26 | 293.6 | 417.203 | 9.14855 | 0.000981237 | True | 0.413054 |
+----+----------+-------+-------+---------+---------+----------+-------------+----------+----------+
| 1 | pre | 1 | 26 | 577.897 | 417.203 | 36.0145 | 2.45433e-06 | True | 0.580743 |
+----+----------+-------+-------+---------+---------+----------+-------------+----------+----------+
Type 2 Ancova in Python
print(tabulate(ancova_table_type_2, headers='keys', tablefmt='grid'))
+----+----------+-------+-------+----------+---------+----------+-------------+----------+----------+
| | Effect | DFn | DFd | SSn | SSd | F | p | p<0.05 | ges |
+====+==========+=======+=======+==========+=========+==========+=============+==========+==========+
| 0 | C(drug) | 2 | 26 | 68.5537 | 417.203 | 2.13613 | 0.138379 | False | 0.141128 |
+----+----------+-------+-------+----------+---------+----------+-------------+----------+----------+
| 1 | pre | 1 | 26 | 577.897 | 417.203 | 36.0145 | 2.45433e-06 | True | 0.580743 |
+----+----------+-------+-------+----------+---------+----------+-------------+----------+----------+
Type 3 Ancova in Python
print(tabulate(ancova_table_type_3, headers='keys', tablefmt='grid'))
+----+-----------+-------+-------+----------+---------+----------+-------------+----------+----------+
| | Effect | DFn | DFd | SSn | SSd | F | p | p<0.05 | ges |
+====+===========+=======+=======+==========+=========+==========+=============+==========+==========+
| 0 | Intercept | 1 | 26 | 61.2592 | 417.203 | 3.81767 | 0.0615519 | False | 0.128034 |
+----+-----------+-------+-------+----------+---------+----------+-------------+----------+----------+
| 1 | C(drug) | 2 | 26 | 68.5537 | 417.203 | 2.13613 | 0.138379 | False | 0.141128 |
+----+-----------+-------+-------+----------+---------+----------+-------------+----------+----------+
| 2 | pre | 1 | 26 | 577.897 | 417.203 | 36.0145 | 2.45433e-06 | True | 0.580743 |
+----+-----------+-------+-------+----------+---------+----------+-------------+----------+----------+
Please note that the results are consistent with those achieved in R, except for the first row of the Type 3 table, which contains the intercept.
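The difference in the intercept row comes from the contrast coding: the model above uses treatment (dummy) coding, whereas R's Type III conventions (for example car::Anova or afex) are usually applied with sum-to-zero contrasts. The snippet below is a hedged sketch rather than a validated reproduction of the R output: it refits the model with patsy's Sum coding, which changes the meaning (and test) of the intercept while leaving the C(drug) and pre rows unchanged in a model without interactions.
# Refit with sum-to-zero contrasts before requesting Type III sums of squares
model_sum = smf.ols('post ~ C(drug, Sum) + pre', data=df).fit()
ancova_table_type_3_sum = ssa.anova_lm(model_sum, typ=3)
print(tabulate(ancova_table_type_3_sum, headers='keys', tablefmt='grid'))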