Ancova

Introduction

In this example, we’re looking at Analysis of Covariance. ANCOVA is typically used to analyse treatment differences, to see examples of prediction models go to the simple linear regression page.

Data Summary

import pandas as pd

# Input data
data = {
    'drug': ["A", "A", "A", "A", "A", "A", "A", "A", "A", "A",
             "D", "D", "D", "D", "D", "D", "D", "D", "D", "D",
             "F", "F", "F", "F", "F", "F", "F", "F", "F", "F"],
    'pre': [11, 8, 5, 14, 19, 6, 10, 6, 11, 3,
            6, 6, 7, 8, 18, 8, 19, 8, 5, 15,
            16, 13, 11, 9, 21, 16, 12, 12, 7, 12],
    'post': [6, 0, 2, 8, 11, 4, 13, 1, 8, 0,
             0, 2, 3, 1, 18, 4, 14, 9, 1, 9,
             13, 10, 18, 5, 23, 12, 5, 16, 1, 20]
}

df = pd.DataFrame(data)
# Descriptive statistics
summary_stats = df.describe()

# Calculate median
median_pre = df['pre'].median()
median_post = df['post'].median()

# Add median to the summary statistics
summary_stats.loc['median'] = [median_pre, median_post]

print(summary_stats)
              pre       post
count   30.000000  30.000000
mean    10.733333   7.900000
std      4.791755   6.666178
min      3.000000   0.000000
25%      7.000000   2.000000
50%     10.500000   7.000000
75%     13.750000  12.750000
max     21.000000  23.000000
median  10.500000   7.000000

Ancova in Python

In Python, Ancova can be performed using the statsmodels library from the scipy package.

import statsmodels.api as sm
import statsmodels.formula.api as smf
from tabulate import tabulate

# Fit the ANCOVA model
model_ancova = smf.ols('post ~ drug + pre', data=df).fit()

# Summary of the model
model_summary = model_ancova.summary()
print(model_summary)

# Extracting glance (summary) information
model_glance = {
    'r_squared': model_ancova.rsquared,
    'adj_r_squared': model_ancova.rsquared_adj,
    'f_statistic': model_ancova.fvalue,
    'f_pvalue': model_ancova.f_pvalue,
    'aic': model_ancova.aic,
    'bic': model_ancova.bic
}
model_glance_df = pd.DataFrame([model_glance])
print(tabulate(model_glance_df, headers='keys', tablefmt='grid'))

# Extracting tidy (coefficients) information
model_tidy = model_ancova.summary2().tables[1]
print(tabulate(model_tidy, headers='keys', tablefmt='grid'))
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   post   R-squared:                       0.676
Model:                            OLS   Adj. R-squared:                  0.639
Method:                 Least Squares   F-statistic:                     18.10
Date:                Fri, 25 Oct 2024   Prob (F-statistic):           1.50e-06
Time:                        08:32:37   Log-Likelihood:                -82.054
No. Observations:                  30   AIC:                             172.1
Df Residuals:                      26   BIC:                             177.7
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -3.8808      1.986     -1.954      0.062      -7.964       0.202
drug[T.D]      0.1090      1.795      0.061      0.952      -3.581       3.799
drug[T.F]      3.4461      1.887      1.826      0.079      -0.432       7.324
pre            0.9872      0.164      6.001      0.000       0.649       1.325
==============================================================================
Omnibus:                        2.609   Durbin-Watson:                   2.526
Prob(Omnibus):                  0.271   Jarque-Bera (JB):                2.148
Skew:                           0.645   Prob(JB):                        0.342
Kurtosis:                       2.765   Cond. No.                         39.8
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
+----+-------------+-----------------+---------------+-------------+---------+---------+
|    |   r_squared |   adj_r_squared |   f_statistic |    f_pvalue |     aic |     bic |
+====+=============+=================+===============+=============+=========+=========+
|  0 |    0.676261 |        0.638906 |       18.1039 | 1.50137e-06 | 172.108 | 177.712 |
+----+-------------+-----------------+---------------+-------------+---------+---------+
+-----------+-----------+------------+------------+-------------+-----------+----------+
|           |     Coef. |   Std.Err. |          t |       P>|t| |    [0.025 |   0.975] |
+===========+===========+============+============+=============+===========+==========+
| Intercept | -3.88081  |   1.9862   | -1.95388   | 0.0615519   | -7.96351  | 0.201887 |
+-----------+-----------+------------+------------+-------------+-----------+----------+
| drug[T.D] |  0.108971 |   1.79514  |  0.0607037 | 0.952059    | -3.58098  | 3.79892  |
+-----------+-----------+------------+------------+-------------+-----------+----------+
| drug[T.F] |  3.44614  |   1.88678  |  1.82646   | 0.0792846   | -0.432195 | 7.32447  |
+-----------+-----------+------------+------------+-------------+-----------+----------+
| pre       |  0.987184 |   0.164498 |  6.00121   | 2.45433e-06 |  0.649054 | 1.32531  |
+-----------+-----------+------------+------------+-------------+-----------+----------+

Please note that all values match with the corresponding R version, except for the AIC and BIC values, which differ slightly. This should be acceptable for most practical purposes in statistical analysis. Currently, there are ongoing discussions in the statsmodels community regarding the computational details of AIC and BIC.

The following code can be used to enforce complete consistency of AIC and BIC values with R outputs by adding 1 to the number of parameters:

import numpy as np

# Manual calculation of AIC and BIC to ensure consistency with R
n = df.shape[0]  # number of observations
k = model_ancova.df_model + 1  # number of parameters (including intercept)
log_lik = model_ancova.llf  # log-likelihood

# Adjusted number of parameters (including scale parameter)
k_adjusted = k + 1

# Manually calculate AIC and BIC to match R's behavior
aic_adjusted = 2 * k_adjusted - 2 * log_lik
bic_adjusted = np.log(n) * k_adjusted - 2 * log_lik

print(f"Number of observations (n): {n}")
print(f"Number of parameters (k_adjusted): {k_adjusted}")
print(f"Log-likelihood: {log_lik}")
print(f"AIC (adjusted): {aic_adjusted}")
print(f"BIC (adjusted): {bic_adjusted}")
Number of observations (n): 30
Number of parameters (k_adjusted): 5.0
Log-likelihood: -82.0537744890265
AIC (adjusted): 174.107548978053
BIC (adjusted): 181.11353588636376

There are different types of anova computations. The statsmodels.stats.anova.anova_lm function allows the types 1, 2 and 3. The code to compute these types is depicted below:

import statsmodels.formula.api as smf
import statsmodels.stats.anova as ssa

# Center the predictor for Type III anova
#df['pre_centered'] = df['pre'] - df['pre'].mean()

# Fit the model for types I and II anova
model = smf.ols('post ~ C(drug) + pre', data=df).fit()

# Perform anova for types I and II
ancova_table_type_1 = ssa.anova_lm(model, typ=1)
ancova_table_type_2 = ssa.anova_lm(model, typ=2)

# Fit the model for Type III anova with centered predictors
model_type_3 = smf.ols('post ~ C(drug) + pre', data=df).fit()
ancova_table_type_3 = ssa.anova_lm(model_type_3, typ=3)

# Calculate SSd (sum of squares for residuals)
ssd_type1 = ancova_table_type_1['sum_sq'].loc['Residual']
ssd_type2 = ancova_table_type_2['sum_sq'].loc['Residual']
ssd_type3 = ancova_table_type_3['sum_sq'].loc['Residual']

# Calculate ges
ancova_table_type_1['ges'] = ancova_table_type_1['sum_sq'] / (ancova_table_type_1['sum_sq'] + ssd_type1)
ancova_table_type_2['ges'] = ancova_table_type_2['sum_sq'] / (ancova_table_type_2['sum_sq'] + ssd_type2)
ancova_table_type_3['ges'] = ancova_table_type_3['sum_sq'] / (ancova_table_type_3['sum_sq'] + ssd_type3)

# Add SSd column
ancova_table_type_1['SSd'] = ssd_type1
ancova_table_type_2['SSd'] = ssd_type2
ancova_table_type_3['SSd'] = ssd_type3

# Add significance column
ancova_table_type_1['p<0.05'] = ancova_table_type_1['PR(>F)'] < 0.05
ancova_table_type_2['p<0.05'] = ancova_table_type_2['PR(>F)'] < 0.05
ancova_table_type_3['p<0.05'] = ancova_table_type_3['PR(>F)'] < 0.05

# Rename columns to match the R output
ancova_table_type_1.rename(columns={'sum_sq': 'SSn', 'df': 'DFn', 'F': 'F', 'PR(>F)': 'p'}, inplace=True)
ancova_table_type_1.reset_index(inplace=True)
ancova_table_type_1.rename(columns={'index': 'Effect'}, inplace=True)

ancova_table_type_2.rename(columns={'sum_sq': 'SSn', 'df': 'DFn', 'F': 'F', 'PR(>F)': 'p'}, inplace=True)
ancova_table_type_2.reset_index(inplace=True)
ancova_table_type_2.rename(columns={'index': 'Effect'}, inplace=True)

ancova_table_type_3.rename(columns={'sum_sq': 'SSn', 'df': 'DFn', 'F': 'F', 'PR(>F)': 'p'}, inplace=True)
ancova_table_type_3.reset_index(inplace=True)
ancova_table_type_3.rename(columns={'index': 'Effect'}, inplace=True)

# Calculate DFd (degrees of freedom for residuals)
dfd_type1 = ancova_table_type_1.loc[ancova_table_type_1['Effect'] == 'Residual', 'DFn'].values[0]
dfd_type2 = ancova_table_type_2.loc[ancova_table_type_2['Effect'] == 'Residual', 'DFn'].values[0]
dfd_type3 = ancova_table_type_3.loc[ancova_table_type_3['Effect'] == 'Residual', 'DFn'].values[0]
ancova_table_type_1['DFd'] = dfd_type1
ancova_table_type_2['DFd'] = dfd_type2
ancova_table_type_3['DFd'] = dfd_type3

# Filter out the Residual row
ancova_table_type_1 = ancova_table_type_1[ancova_table_type_1['Effect'] != 'Residual']
ancova_table_type_2 = ancova_table_type_2[ancova_table_type_2['Effect'] != 'Residual']
ancova_table_type_3 = ancova_table_type_3[ancova_table_type_3['Effect'] != 'Residual']

# Select and reorder columns to match the R output
ancova_table_type_1 = ancova_table_type_1[['Effect', 'DFn', 'DFd', 'SSn', 'SSd', 'F', 'p', 'p<0.05', 'ges']]
ancova_table_type_2 = ancova_table_type_2[['Effect', 'DFn', 'DFd', 'SSn', 'SSd', 'F', 'p', 'p<0.05', 'ges']]
ancova_table_type_3 = ancova_table_type_3[['Effect', 'DFn', 'DFd', 'SSn', 'SSd', 'F', 'p', 'p<0.05', 'ges']]

Type 1 Ancova in Python

print(tabulate(ancova_table_type_1, headers='keys', tablefmt='grid'))
+----+----------+-------+-------+---------+---------+----------+-------------+----------+----------+
|    | Effect   |   DFn |   DFd |     SSn |     SSd |        F |           p | p<0.05   |      ges |
+====+==========+=======+=======+=========+=========+==========+=============+==========+==========+
|  0 | C(drug)  |     2 |    26 | 293.6   | 417.203 |  9.14855 | 0.000981237 | True     | 0.413054 |
+----+----------+-------+-------+---------+---------+----------+-------------+----------+----------+
|  1 | pre      |     1 |    26 | 577.897 | 417.203 | 36.0145  | 2.45433e-06 | True     | 0.580743 |
+----+----------+-------+-------+---------+---------+----------+-------------+----------+----------+

Type 2 Ancova in Python

print(tabulate(ancova_table_type_2, headers='keys', tablefmt='grid'))
+----+----------+-------+-------+----------+---------+----------+-------------+----------+----------+
|    | Effect   |   DFn |   DFd |      SSn |     SSd |        F |           p | p<0.05   |      ges |
+====+==========+=======+=======+==========+=========+==========+=============+==========+==========+
|  0 | C(drug)  |     2 |    26 |  68.5537 | 417.203 |  2.13613 | 0.138379    | False    | 0.141128 |
+----+----------+-------+-------+----------+---------+----------+-------------+----------+----------+
|  1 | pre      |     1 |    26 | 577.897  | 417.203 | 36.0145  | 2.45433e-06 | True     | 0.580743 |
+----+----------+-------+-------+----------+---------+----------+-------------+----------+----------+

Type 3 Ancova in Python

print(tabulate(ancova_table_type_3, headers='keys', tablefmt='grid'))
+----+-----------+-------+-------+----------+---------+----------+-------------+----------+----------+
|    | Effect    |   DFn |   DFd |      SSn |     SSd |        F |           p | p<0.05   |      ges |
+====+===========+=======+=======+==========+=========+==========+=============+==========+==========+
|  0 | Intercept |     1 |    26 |  61.2592 | 417.203 |  3.81767 | 0.0615519   | False    | 0.128034 |
+----+-----------+-------+-------+----------+---------+----------+-------------+----------+----------+
|  1 | C(drug)   |     2 |    26 |  68.5537 | 417.203 |  2.13613 | 0.138379    | False    | 0.141128 |
+----+-----------+-------+-------+----------+---------+----------+-------------+----------+----------+
|  2 | pre       |     1 |    26 | 577.897  | 417.203 | 36.0145  | 2.45433e-06 | True     | 0.580743 |
+----+-----------+-------+-------+----------+---------+----------+-------------+----------+----------+

Please note that the results are consistent with the results achieved with R, except for the first row of the type 3 table featuring the intercept.