import pandas as pd
# Input data
data = {
'drug': ["A", "A", "A", "A", "A", "A", "A", "A", "A", "A",
"D", "D", "D", "D", "D", "D", "D", "D", "D", "D",
"F", "F", "F", "F", "F", "F", "F", "F", "F", "F"],
'pre': [11, 8, 5, 14, 19, 6, 10, 6, 11, 3,
6, 6, 7, 8, 18, 8, 19, 8, 5, 15,
16, 13, 11, 9, 21, 16, 12, 12, 7, 12],
'post': [6, 0, 2, 8, 11, 4, 13, 1, 8, 0,
0, 2, 3, 1, 18, 4, 14, 9, 1, 9,
13, 10, 18, 5, 23, 12, 5, 16, 1, 20]
}
df = pd.DataFrame(data)Ancova
Introduction
In this example, we’re looking at Analysis of Covariance. ANCOVA is typically used to analyse treatment differences, to see examples of prediction models go to the simple linear regression page.
Data Summary
# Descriptive statistics
summary_stats = df.describe()
# Calculate median
median_pre = df['pre'].median()
median_post = df['post'].median()
# Add median to the summary statistics
summary_stats.loc['median'] = [median_pre, median_post]
print(summary_stats) pre post
count 30.000000 30.000000
mean 10.733333 7.900000
std 4.791755 6.666178
min 3.000000 0.000000
25% 7.000000 2.000000
50% 10.500000 7.000000
75% 13.750000 12.750000
max 21.000000 23.000000
median 10.500000 7.000000
Ancova in Python
In Python, Ancova can be performed using the statsmodels library from the scipy package.
import statsmodels.api as sm
import statsmodels.formula.api as smf
from tabulate import tabulate
# Fit the ANCOVA model
model_ancova = smf.ols('post ~ drug + pre', data=df).fit()
# Summary of the model
model_summary = model_ancova.summary()
print(model_summary)
# Extracting glance (summary) information
model_glance = {
'r_squared': model_ancova.rsquared,
'adj_r_squared': model_ancova.rsquared_adj,
'f_statistic': model_ancova.fvalue,
'f_pvalue': model_ancova.f_pvalue,
'aic': model_ancova.aic,
'bic': model_ancova.bic
}
model_glance_df = pd.DataFrame([model_glance])
print(tabulate(model_glance_df, headers='keys', tablefmt='grid'))
# Extracting tidy (coefficients) information
model_tidy = model_ancova.summary2().tables[1]
print(tabulate(model_tidy, headers='keys', tablefmt='grid')) OLS Regression Results
==============================================================================
Dep. Variable: post R-squared: 0.676
Model: OLS Adj. R-squared: 0.639
Method: Least Squares F-statistic: 18.10
Date: Wed, 20 Aug 2025 Prob (F-statistic): 1.50e-06
Time: 14:01:49 Log-Likelihood: -82.054
No. Observations: 30 AIC: 172.1
Df Residuals: 26 BIC: 177.7
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -3.8808 1.986 -1.954 0.062 -7.964 0.202
drug[T.D] 0.1090 1.795 0.061 0.952 -3.581 3.799
drug[T.F] 3.4461 1.887 1.826 0.079 -0.432 7.324
pre 0.9872 0.164 6.001 0.000 0.649 1.325
==============================================================================
Omnibus: 2.609 Durbin-Watson: 2.526
Prob(Omnibus): 0.271 Jarque-Bera (JB): 2.148
Skew: 0.645 Prob(JB): 0.342
Kurtosis: 2.765 Cond. No. 39.8
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
+----+-------------+-----------------+---------------+-------------+---------+---------+
| | r_squared | adj_r_squared | f_statistic | f_pvalue | aic | bic |
+====+=============+=================+===============+=============+=========+=========+
| 0 | 0.676261 | 0.638906 | 18.1039 | 1.50137e-06 | 172.108 | 177.712 |
+----+-------------+-----------------+---------------+-------------+---------+---------+
+-----------+-----------+------------+------------+-------------+-----------+----------+
| | Coef. | Std.Err. | t | P>|t| | [0.025 | 0.975] |
+===========+===========+============+============+=============+===========+==========+
| Intercept | -3.88081 | 1.9862 | -1.95388 | 0.0615519 | -7.96351 | 0.201887 |
+-----------+-----------+------------+------------+-------------+-----------+----------+
| drug[T.D] | 0.108971 | 1.79514 | 0.0607037 | 0.952059 | -3.58098 | 3.79892 |
+-----------+-----------+------------+------------+-------------+-----------+----------+
| drug[T.F] | 3.44614 | 1.88678 | 1.82646 | 0.0792846 | -0.432195 | 7.32447 |
+-----------+-----------+------------+------------+-------------+-----------+----------+
| pre | 0.987184 | 0.164498 | 6.00121 | 2.45433e-06 | 0.649054 | 1.32531 |
+-----------+-----------+------------+------------+-------------+-----------+----------+
Please note that all values match with the corresponding R version, except for the AIC and BIC values, which differ slightly. This should be acceptable for most practical purposes in statistical analysis. Currently, there are ongoing discussions in the statsmodels community regarding the computational details of AIC and BIC.
The following code can be used to enforce complete consistency of AIC and BIC values with R outputs by adding 1 to the number of parameters:
import numpy as np
# Manual calculation of AIC and BIC to ensure consistency with R
n = df.shape[0] # number of observations
k = model_ancova.df_model + 1 # number of parameters (including intercept)
log_lik = model_ancova.llf # log-likelihood
# Adjusted number of parameters (including scale parameter)
k_adjusted = k + 1
# Manually calculate AIC and BIC to match R's behavior
aic_adjusted = 2 * k_adjusted - 2 * log_lik
bic_adjusted = np.log(n) * k_adjusted - 2 * log_lik
print(f"Number of observations (n): {n}")
print(f"Number of parameters (k_adjusted): {k_adjusted}")
print(f"Log-likelihood: {log_lik}")
print(f"AIC (adjusted): {aic_adjusted}")
print(f"BIC (adjusted): {bic_adjusted}")Number of observations (n): 30
Number of parameters (k_adjusted): 5.0
Log-likelihood: -82.0537744890265
AIC (adjusted): 174.107548978053
BIC (adjusted): 181.11353588636376
There are different types of anova computations. The statsmodels.stats.anova.anova_lm function allows the types 1, 2 and 3. The code to compute these types is depicted below:
import statsmodels.formula.api as smf
import statsmodels.stats.anova as ssa
# Center the predictor for Type III anova
#df['pre_centered'] = df['pre'] - df['pre'].mean()
# Fit the model for types I and II anova
model = smf.ols('post ~ C(drug) + pre', data=df).fit()
# Perform anova for types I and II
ancova_table_type_1 = ssa.anova_lm(model, typ=1)
ancova_table_type_2 = ssa.anova_lm(model, typ=2)
# Fit the model for Type III anova with centered predictors
model_type_3 = smf.ols('post ~ C(drug) + pre', data=df).fit()
ancova_table_type_3 = ssa.anova_lm(model_type_3, typ=3)
# Calculate SSd (sum of squares for residuals)
ssd_type1 = ancova_table_type_1['sum_sq'].loc['Residual']
ssd_type2 = ancova_table_type_2['sum_sq'].loc['Residual']
ssd_type3 = ancova_table_type_3['sum_sq'].loc['Residual']
# Calculate ges
ancova_table_type_1['ges'] = ancova_table_type_1['sum_sq'] / (ancova_table_type_1['sum_sq'] + ssd_type1)
ancova_table_type_2['ges'] = ancova_table_type_2['sum_sq'] / (ancova_table_type_2['sum_sq'] + ssd_type2)
ancova_table_type_3['ges'] = ancova_table_type_3['sum_sq'] / (ancova_table_type_3['sum_sq'] + ssd_type3)
# Add SSd column
ancova_table_type_1['SSd'] = ssd_type1
ancova_table_type_2['SSd'] = ssd_type2
ancova_table_type_3['SSd'] = ssd_type3
# Add significance column
ancova_table_type_1['p<0.05'] = ancova_table_type_1['PR(>F)'] < 0.05
ancova_table_type_2['p<0.05'] = ancova_table_type_2['PR(>F)'] < 0.05
ancova_table_type_3['p<0.05'] = ancova_table_type_3['PR(>F)'] < 0.05
# Rename columns to match the R output
ancova_table_type_1.rename(columns={'sum_sq': 'SSn', 'df': 'DFn', 'F': 'F', 'PR(>F)': 'p'}, inplace=True)
ancova_table_type_1.reset_index(inplace=True)
ancova_table_type_1.rename(columns={'index': 'Effect'}, inplace=True)
ancova_table_type_2.rename(columns={'sum_sq': 'SSn', 'df': 'DFn', 'F': 'F', 'PR(>F)': 'p'}, inplace=True)
ancova_table_type_2.reset_index(inplace=True)
ancova_table_type_2.rename(columns={'index': 'Effect'}, inplace=True)
ancova_table_type_3.rename(columns={'sum_sq': 'SSn', 'df': 'DFn', 'F': 'F', 'PR(>F)': 'p'}, inplace=True)
ancova_table_type_3.reset_index(inplace=True)
ancova_table_type_3.rename(columns={'index': 'Effect'}, inplace=True)
# Calculate DFd (degrees of freedom for residuals)
dfd_type1 = ancova_table_type_1.loc[ancova_table_type_1['Effect'] == 'Residual', 'DFn'].values[0]
dfd_type2 = ancova_table_type_2.loc[ancova_table_type_2['Effect'] == 'Residual', 'DFn'].values[0]
dfd_type3 = ancova_table_type_3.loc[ancova_table_type_3['Effect'] == 'Residual', 'DFn'].values[0]
ancova_table_type_1['DFd'] = dfd_type1
ancova_table_type_2['DFd'] = dfd_type2
ancova_table_type_3['DFd'] = dfd_type3
# Filter out the Residual row
ancova_table_type_1 = ancova_table_type_1[ancova_table_type_1['Effect'] != 'Residual']
ancova_table_type_2 = ancova_table_type_2[ancova_table_type_2['Effect'] != 'Residual']
ancova_table_type_3 = ancova_table_type_3[ancova_table_type_3['Effect'] != 'Residual']
# Select and reorder columns to match the R output
ancova_table_type_1 = ancova_table_type_1[['Effect', 'DFn', 'DFd', 'SSn', 'SSd', 'F', 'p', 'p<0.05', 'ges']]
ancova_table_type_2 = ancova_table_type_2[['Effect', 'DFn', 'DFd', 'SSn', 'SSd', 'F', 'p', 'p<0.05', 'ges']]
ancova_table_type_3 = ancova_table_type_3[['Effect', 'DFn', 'DFd', 'SSn', 'SSd', 'F', 'p', 'p<0.05', 'ges']]Type 1 Ancova in Python
print(tabulate(ancova_table_type_1, headers='keys', tablefmt='grid'))+----+----------+-------+-------+---------+---------+----------+-------------+----------+----------+
| | Effect | DFn | DFd | SSn | SSd | F | p | p<0.05 | ges |
+====+==========+=======+=======+=========+=========+==========+=============+==========+==========+
| 0 | C(drug) | 2 | 26 | 293.6 | 417.203 | 9.14855 | 0.000981237 | True | 0.413054 |
+----+----------+-------+-------+---------+---------+----------+-------------+----------+----------+
| 1 | pre | 1 | 26 | 577.897 | 417.203 | 36.0145 | 2.45433e-06 | True | 0.580743 |
+----+----------+-------+-------+---------+---------+----------+-------------+----------+----------+
Type 2 Ancova in Python
print(tabulate(ancova_table_type_2, headers='keys', tablefmt='grid'))+----+----------+-------+-------+----------+---------+----------+-------------+----------+----------+
| | Effect | DFn | DFd | SSn | SSd | F | p | p<0.05 | ges |
+====+==========+=======+=======+==========+=========+==========+=============+==========+==========+
| 0 | C(drug) | 2 | 26 | 68.5537 | 417.203 | 2.13613 | 0.138379 | False | 0.141128 |
+----+----------+-------+-------+----------+---------+----------+-------------+----------+----------+
| 1 | pre | 1 | 26 | 577.897 | 417.203 | 36.0145 | 2.45433e-06 | True | 0.580743 |
+----+----------+-------+-------+----------+---------+----------+-------------+----------+----------+
Type 3 Ancova in Python
print(tabulate(ancova_table_type_3, headers='keys', tablefmt='grid'))+----+-----------+-------+-------+----------+---------+----------+-------------+----------+----------+
| | Effect | DFn | DFd | SSn | SSd | F | p | p<0.05 | ges |
+====+===========+=======+=======+==========+=========+==========+=============+==========+==========+
| 0 | Intercept | 1 | 26 | 61.2592 | 417.203 | 3.81767 | 0.0615519 | False | 0.128034 |
+----+-----------+-------+-------+----------+---------+----------+-------------+----------+----------+
| 1 | C(drug) | 2 | 26 | 68.5537 | 417.203 | 2.13613 | 0.138379 | False | 0.141128 |
+----+-----------+-------+-------+----------+---------+----------+-------------+----------+----------+
| 2 | pre | 1 | 26 | 577.897 | 417.203 | 36.0145 | 2.45433e-06 | True | 0.580743 |
+----+-----------+-------+-------+----------+---------+----------+-------------+----------+----------+
Please note that the results are consistent with the results achieved with R, except for the first row of the type 3 table featuring the intercept.