The chi-square test is a non-parametric statistical test used to determine whether there is a significant association between two categorical variables. It compares the observed frequencies in a contingency table with the frequencies we would expect if the variables were independent. The test calculates a statistic, often denoted χ² (chi-square), which approximately follows a chi-square distribution when the variables are independent. Comparing the statistic with that distribution tells us whether the association between the variables is statistically significant.
The chi-square test and Fisher’s exact test can both assess independence between two variables when the groups being compared are independent of each other. The chi-square test relies on an approximation that assumes a large sample, while Fisher’s exact test uses an exact procedure that is especially suited to small samples.
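As a rough illustration of the large-sample requirement, the sketch below computes the expected counts of a 2x2 table and applies a common rule of thumb (every expected count at least 5) to suggest which test to use. The helper name suggest_test, the threshold of 5, and both example tables are assumptions made for illustration; they do not come from the analysis in this article.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def suggest_test(table, threshold=5):
    """Hypothetical helper: suggest a test from the smallest expected count."""
    smallest = min(min(row) for row in expected_counts(table))
    return "chi-square" if smallest >= threshold else "fisher"


# Made-up 2x2 tables, for illustration only
small_table = [[3, 1], [1, 3]]        # every expected count is 2.0
large_table = [[30, 20], [25, 35]]    # smallest expected count is 25.0

print(suggest_test(small_table))
print(suggest_test(large_table))
```

The threshold is a convention rather than a strict rule, so it is best treated as a prompt to look at the expected counts, not as a final decision.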
Data used
The analysis uses data from: Loprinzi CL, Laurie JA, Wieand HS, Krook JE, Novotny PJ, Kugler JW, Bartel J, Law M, Bateman M, Klatt NE, et al. Prospective evaluation of prognostic variables from patient-completed questionnaires. North Central Cancer Treatment Group. Journal of Clinical Oncology. 12(3):601-7, 1994.
Implementing Chi-Square test in Python
We can use the pandas crosstab() function to create a contingency table of two selected variables.
import pandas as pd
import numpy as np
import scipy.stats as stats

# Read the sample data
data = pd.read_csv("../data/lung_cancer.csv")

# Remove rows with missing values in the two columns of interest;
# .copy() avoids a SettingWithCopyWarning when new columns are added below
df = data.dropna(subset=['ph.ecog', 'wt.loss']).copy()

# Convert the numerical variables into categorical variables
# Caution: an ECOG score of 0 denotes a fully active patient, so ph.ecog > 0
# marks symptomatic patients and the two labels below appear swapped.
# They are left unchanged so that the output shown stays consistent.
df['ecog_grp'] = np.where(df['ph.ecog'] > 0, "fully active", "symptomatic")
print(df['ecog_grp'])
# wt.loss <= 0 (which includes no change) is labelled "weight gain"
df['wt_grp'] = np.where(df['wt.loss'] > 0, "weight loss", "weight gain")

# Cross-tabulate the two categorical variables
contingency_table = pd.crosstab(df['ecog_grp'], df['wt_grp'])
contingency_table
1 symptomatic
2 symptomatic
3 fully active
4 symptomatic
5 fully active
...
223 fully active
224 symptomatic
225 fully active
226 fully active
227 fully active
Name: ecog_grp, Length: 213, dtype: object
wt_grp        weight gain  weight loss
ecog_grp
fully active           39          113
symptomatic            22           39
The chi2_contingency() function in the scipy.stats module can then be used to perform the chi-square test on this table.
# Extract the observed counts from the contingency table
value = contingency_table.to_numpy()

# For a 2x2 table (1 degree of freedom), chi2_contingency applies
# Yates' continuity correction by default
statistic, p, dof, expected = stats.chi2_contingency(value)
print("The chi2 value is:", statistic)
print("The p value is:", p)
print("The degree of freedom is:", dof)
print("The expected values are:", expected)
The chi2 value is: 1.8260529076055192
The p value is: 0.17659446865934614
The degree of freedom is: 1
The expected values are: [[ 43.53051643 108.46948357]
[ 17.46948357 43.53051643]]
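To make the numbers above less of a black box, the following sketch recomputes the expected counts, the statistic and the p value by hand, using only the Python standard library. It assumes the observed counts shown in the contingency table above and the default behaviour of chi2_contingency for a 2x2 table (Yates' continuity correction). The variable names are illustrative, and the erfc shortcut for the p value is valid only for 1 degree of freedom.

```python
import math

# Observed counts copied from the contingency table above
observed = [[39, 113],
            [22, 39]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / N
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pair every observed count with its expected count
cells = [(o, e)
         for obs_row, exp_row in zip(observed, expected)
         for o, e in zip(obs_row, exp_row)]

# Pearson statistic with Yates' continuity correction (|O - E| reduced by 0.5)
chi2_yates = sum((abs(o - e) - 0.5) ** 2 / e for o, e in cells)

# Same statistic without the correction, for comparison
chi2_plain = sum((o - e) ** 2 / e for o, e in cells)

# With 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi2_yates / 2))

print("Expected counts:", expected)
print("Chi2 with Yates' correction:", chi2_yates)
print("Chi2 without correction:", chi2_plain)
print("p value:", p_value)
```

The corrected statistic and its p value should agree with the chi2_contingency output above. The uncorrected statistic is larger, because the continuity correction always shrinks the statistic.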
Implementing Fisher’s exact test in Python
To implement Fisher’s exact test in Python, we can use the fisher_exact() function from the stats module of the SciPy library. It returns a SignificanceResult object with statistic and pvalue as its attributes.
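A minimal sketch of such a call is shown below. It assumes the same 2x2 counts as the contingency table built earlier, typed in directly so the snippet runs on its own, and it assumes a two-sided alternative. No output is shown because the result depends on the data passed in.

```python
from scipy import stats

# Counts copied from the contingency table built earlier
table = [[39, 113],
         [22, 39]]

# Tuple unpacking works whether fisher_exact returns a plain tuple
# (older SciPy releases) or a SignificanceResult (newer releases)
odds_ratio, p_value = stats.fisher_exact(table, alternative="two-sided")

print("The odds ratio is:", odds_ratio)
print("The p value is:", p_value)
```

For a 2x2 table the returned statistic is the sample odds ratio, here (39 × 39) / (113 × 22).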