### Correlation between two categorical features

# Details of the chi-square test

## Using the chi-square statistics to determine if two categorical variables are correlated

## Steps to Perform Chi-Square Test

**Using the chi-square test on the Titanic dataset**

### Loading the Dataset

### Data Cleansing and Feature Engineering

### Visualizing the correlations between features and target

### Visualizing the correlations between each feature

## Defining the Hypotheses

## Calculating χ2 manually

## Calculating the p-value

## Trying out the other features

## Summary Chi-square

# Details of ANOVA

## What is ANOVA?

## Performing ANOVA by hand

### Sample Dataset

## Visualizing the dataset

## Pivoting the dataframe

## Defining the Hypotheses

### Step 1 — Calculating the means for all groups

### Step 2 — Calculate the Sum of Squares

## Sum of squares of all observations — **SS_total**

## Sum of squares within — SS_within

### Sum of Squares between — SS_between

## Creating the ANOVA Table

## Using the Stats module to calculate f-score

## Using the statsmodels module to calculate f-score

## Summary of ANOVA

This scenario can happen when we are doing regression or classification in machine learning.

- **Regression**: The target variable is numeric and one of the predictors is categorical.
- **Classification**: The target variable is categorical and one of the predictors is numeric.

In both these cases, the strength of the correlation between the variables can be measured using the ANOVA test. ANOVA stands for Analysis Of Variance. Actually, this test measures if there are any significant differences between the means of the values of the numeric variable for each categorical value. This is something that you can visualize using a box plot as well.

Null hypothesis (H0) of the ANOVA test: the mean of the numeric variable is the same for every category, i.e. the variables are not correlated with each other.

In the example below, we measure whether there is any correlation between FuelType and CarPrice. Here FuelType is a categorical predictor and CarPrice is the numeric target variable.

```
# Generating sample data
import pandas as pd
ColumnNames=['FuelType','CarPrice']
DataValues= [[ 'Petrol', 2000],
[ 'Petrol', 2100],
[ 'Petrol', 1900],
[ 'Petrol', 2150],
[ 'Petrol', 2100],
[ 'Petrol', 2200],
[ 'Petrol', 1950],
[ 'Diesel', 2500],
[ 'Diesel', 2700],
[ 'Diesel', 2900],
[ 'Diesel', 2850],
[ 'Diesel', 2600],
[ 'Diesel', 2500],
[ 'Diesel', 2700],
[ 'CNG', 1500],
[ 'CNG', 1400],
[ 'CNG', 1600],
[ 'CNG', 1650],
[ 'CNG', 1600],
[ 'CNG', 1500],
[ 'CNG', 1500]
]
#Create the Data Frame
CarData=pd.DataFrame(data=DataValues,columns=ColumnNames)
print(CarData.head())
########################################################
# f_oneway() function takes the group data as input and
# returns F-statistic and P-value
from scipy.stats import f_oneway
# Running the one-way anova test between CarPrice and FuelTypes
# Assumption(H0) is that FuelType and CarPrices are NOT correlated
# Finds out the Prices data for each FuelType as a list
CategoryGroupLists=CarData.groupby('FuelType')['CarPrice'].apply(list)
# Performing the ANOVA test
# We accept the Assumption(H0) only when P-Value > 0.05
AnovaResults = f_oneway(*CategoryGroupLists)
print('P-Value for Anova is: ', AnovaResults[1])
```

**Sample Output**

Since the p-value is almost zero, we reject H0. This means the variables are correlated with each other.
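To see what the test is comparing, the mean price of each fuel type can be computed directly. The following is a minimal sketch using only the standard library and the same sample prices as above; the names `prices` and `group_means` are illustrative and not part of the original example.

```python
from statistics import mean

# Same sample prices as in the example above, grouped by fuel type
prices = {
    'Petrol': [2000, 2100, 1900, 2150, 2100, 2200, 1950],
    'Diesel': [2500, 2700, 2900, 2850, 2600, 2500, 2700],
    'CNG':    [1500, 1400, 1600, 1650, 1600, 1500, 1500],
}

# ANOVA asks whether these group means differ by more than chance would explain
group_means = {fuel: mean(values) for fuel, values in prices.items()}
for fuel, m in group_means.items():
    print(f'{fuel}: {m:.2f}')
```

The three means are far apart relative to the spread within each group, which is consistent with the very small p-value.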

This is a situation that arises often during classification machine learning. The target variable is categorical and the predictors can be either continuous or categorical, so when both of them are categorical, then the strength of the relationship between them can be measured using a **Chi-square test**.

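The Chi-square test measures how compatible the data are with a null hypothesis (H0).

- Assumption (H0): The two columns are NOT related to each other
- Result of Chi-Sq Test: The p-value, i.e. the probability of seeing an association at least as strong as the observed one if H0 were true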

It helps us understand whether two categorical variables are correlated with each other. In the scenario below, we measure the correlation between GENDER and APPROVE_LOAN.

```
# Creating a sample data frame
import pandas as pd
ColumnNames=['CIBIL','AGE','GENDER' ,'SALARY', 'APPROVE_LOAN']
DataValues=[ [480, 28, 'M', 610000, 'Yes'],
[480, 42, 'M',140000, 'No'],
[480, 29, 'F',420000, 'No'],
[490, 30, 'M',420000, 'No'],
[500, 27, 'M',420000, 'No'],
[510, 34, 'F',190000, 'No'],
[550, 24, 'M',330000, 'Yes'],
[560, 34, 'M',160000, 'Yes'],
[560, 25, 'F',300000, 'Yes'],
[570, 34, 'M',450000, 'Yes'],
[590, 30, 'F',140000, 'Yes'],
[600, 33, 'M',600000, 'Yes'],
[600, 22, 'M',400000, 'Yes'],
[600, 25, 'F',490000, 'Yes'],
[610, 32, 'M',120000, 'Yes'],
[630, 29, 'F',360000, 'Yes'],
[630, 30, 'M',480000, 'Yes'],
[660, 29, 'F',460000, 'Yes'],
[700, 32, 'M',470000, 'Yes'],
[740, 28, 'M',400000, 'Yes']]
#Create the Data Frame
LoanData=pd.DataFrame(data=DataValues,columns=ColumnNames)
print(LoanData.head())
#########################################################
# Cross tabulation between GENDER and APPROVE_LOAN
CrosstabResult=pd.crosstab(index=LoanData['GENDER'],columns=LoanData['APPROVE_LOAN'])
print(CrosstabResult)
# importing the required function
from scipy.stats import chi2_contingency
# Performing Chi-sq test
ChiSqResult = chi2_contingency(CrosstabResult)
# The P-Value is the probability of seeing an association at least this strong if H0 were true
# Only if P-Value > 0.05 do we keep the assumption (H0)
print('The P-Value of the ChiSq Test is:', ChiSqResult[1])
```

**Sample Output:**

**H0**: The variables are not correlated with each other. This is the H0 used in the Chi-square test.

In the above example, the p-value is higher than 0.05, so we fail to reject H0: the data give no evidence that the variables are correlated with each other. Conversely, when two variables are strongly correlated, the p-value will be very close to zero.

In this section, I will explain how we can test two categorical columns in a dataset to determine if they are dependent on each other (i.e. correlated). We will use a statistical test known as **chi-square** (commonly written as **χ2**). Before we start our discussion on chi-square, here is a quick summary of the test methods that can be used for testing the various types of variables:

The **chi-square (χ2) statistic** is a way to check the relationship between two categorical variables.

The key idea behind the chi-square test is to compare the observed values in the data to the expected values and see if they are related or not. In particular, it is a useful way to check if two categorical nominal variables are correlated. This is particularly important in machine learning where we only want features that are correlated to the target to be used for training.

There are two types of chi-square tests:

- **Chi-Square Goodness of Fit Test**: tests if one variable is likely to come from a given distribution.
- **Chi-Square Test of Independence**: tests if two variables might be correlated or not.

Check out https://www.jmp.com/en_us/statistics-knowledge-portal/chi-square-test.html for a more detailed discussion of the above two chi-square tests. When comparing to see if two categorical variables are correlated, we will use the **Chi-Square Test of Independence.**

To use the chi-square test, we need to perform the following steps:

1. Define the *null hypothesis* and *alternate hypothesis*. They are:

   - **H₀** (*Null Hypothesis*): the 2 categorical variables being compared are **independent** of each other.
   - **H₁** (*Alternate Hypothesis*): the 2 categorical variables being compared are **dependent** on each other.

2. Decide on the **α** value. This is the risk that we are willing to take in drawing the wrong conclusion. As an example, say we set **α** = 0.05 when testing for independence. This means we accept a 5% risk of concluding that two variables are dependent when in reality they are independent.

3. Calculate the **chi-square score** using the two categorical variables and use it to calculate the **p-value**. A **low** p-value means there is strong evidence that the two categorical variables are dependent.

In a chi-square analysis, the p-value is the probability of obtaining a chi-square as large or larger than that in the current experiment and yet the data will still support the hypothesis. It is the probability of deviations from what was expected being due to mere chance. In general a p-value of 0.05 or greater is considered critical, anything less means the deviations are significant and the hypothesis being tested must be rejected.

Source: https://passel2.unl.edu/view/lesson/9beaa382bf7e/8

To calculate the p-value, we need two pieces of information:

- **Degrees of freedom**: for a contingency table, (number of rows - 1) * (number of columns - 1)
- **Chi-square score**
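For small degrees of freedom the p-value has a simple closed form, which makes the link between these two inputs concrete. The sketch below is an illustration using only the standard library and covers just 1 and 2 degrees of freedom; the function name `chi2_p_value` is hypothetical, and for other degrees of freedom a statistics library would be needed.

```python
import math

def chi2_p_value(score, dof):
    """Right-tail probability of a chi-square score (dof 1 or 2 only)."""
    if dof == 1:
        # chi-square with 1 dof is the square of a standard normal variable
        return math.erfc(math.sqrt(score / 2))
    if dof == 2:
        # chi-square with 2 dof is an exponential distribution with mean 2
        return math.exp(-score / 2)
    raise ValueError('this sketch only handles 1 or 2 degrees of freedom')

print(chi2_p_value(3.84, 1))   # close to 0.05
print(chi2_p_value(5.99, 2))   # close to 0.05
```

The larger the chi-square score, the smaller the p-value for a given number of degrees of freedom.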

If the p-value obtained is:

- p < 0.05 (the **α** value we have chosen), we reject **H₀** (*Null Hypothesis*) and accept **H₁** (*Alternate Hypothesis*). This means the two categorical variables are *dependent*.
- p > 0.05, we accept **H₀** (*Null Hypothesis*) and reject **H₁** (*Alternate Hypothesis*). This means the two categorical variables are *independent*.

In the case of feature selection for machine learning, we would want the feature that is being compared to the target to have a **low** p-value (less than 0.05), as this means that the feature is dependent on (correlated to) the target.

With the chi-square score that is calculated, we can also refer to a **chi-square table** to see if the score falls within the rejection region or the acceptance region. In the next part, I will use the Titanic dataset and apply the chi-square test on a few of the features to see if they are correlated to the target.

A good way to understand a new topic is to go through the concepts using an example. For this, I am going to use the classic Titanic dataset (https://www.kaggle.com/tedllh/titanic-train).

The Titanic dataset is often used in machine learning to demonstrate how to build a machine-learning model and use it to make predictions. In particular, the dataset contains several features (**Pclass**, **Sex**, **Age**, **Embarked**, etc) and one target (**Survived**). Several features in the dataset are categorical variables:

- **Pclass**: the class of cabin that the passenger was in
- **Sex**: the sex of the passenger
- **Embarked**: the port of embarkation
- **Survived**: if the passenger survived the disaster

Because this section explores the relationships between categorical features and targets, we are only interested in those columns that contain categorical values.

Let’s load the dataset in a Pandas DataFrame:

```
import pandas as pd
import numpy as np
df = pd.read_csv('titanic_train.csv')
df.sample(5)
```

There are some columns that are not really useful and hence we will proceed to drop them. Also, there are some missing values so let’s drop all those rows with empty values:

```
df.drop(columns=['PassengerId','Name', 'Ticket','Fare','Cabin'],
inplace=True)
df.dropna(inplace=True)
df
```

We will also add one more column named **Alone**, based on the **Parch** (parents or children) and **SibSp** (siblings or spouse) columns. The idea we want to explore is if being alone affects the survivability of the passenger. So **Alone** is 1 if both **Parch** and **SibSp** are 0, else it is 0:

```
df['Alone'] = (df['Parch'] + df['SibSp']).apply(
lambda x: 1 if x == 0 else 0)
df
```

Now that the data is cleaned, let’s try to visualize how the sex of passengers is related to their survival in the accident:

```
import seaborn as sns
sns.barplot(x='Sex', y='Survived', data=df, ci=None)
```

The Sex column contains nominal data (i.e. ranking is not important).

From the above figure, we can see that of all the female passengers, more than 70% survived; of all the men, about 20% survived. There seems to be a very strong relationship between the **Sex** and **Survived** features. We will confirm this later on using the chi-square test.

How about **Pclass** and **Survived**? Are they related?

```
sns.barplot(x='Pclass', y='Survived', data=df, ci=None)
```

Perhaps unsurprisingly, it shows that the higher the class the passenger was in (i.e. the lower the **Pclass** number), the higher the survival rate of the passenger. The next feature of interest is whether the place of embarkation determines who survives and who doesn’t:

```
sns.barplot(x='Embarked', y='Survived', data=df, ci=None)
```

From the chart, it seems like more people who embarked from **C** (Cherbourg) survived.

C = Cherbourg; Q = Queenstown; S = Southampton

We also want to know if being alone on the trip affects one’s chance of survival:

```
ax = sns.barplot(x='Alone', y='Survived', data=df, ci=None)
ax.set_xticklabels(['Not Alone','Alone'])
```

We can see that passengers who were with their family had a higher chance of survival.

Now that we have visualized the relationships between the categorical features against the target (**Survived**), we want to now visualize the relationships between each feature. Before we can do that, we need to convert the label values in the **Sex** and **Embarked** columns to numeric. To do that, we can make use of the **LabelEncoder** class in **sklearn**:

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df['Sex'])
df['Sex'] = le.transform(df['Sex'])
sex_labels = dict(zip(le.classes_, le.transform(le.classes_)))
print(sex_labels)
le.fit(df['Embarked'])
df['Embarked'] = le.transform(df['Embarked'])
embarked_labels = dict(zip(le.classes_,
le.transform(le.classes_)))
print(embarked_labels)
```

The above code snippet label-encodes the **Sex** and **Embarked** columns. The output shows the mappings of the values for each column, which is very useful later when performing predictions:

```
{'female': 0, 'male': 1}
{'C': 0, 'Q': 1, 'S': 2}
```

The following statements show the relationship between **Embarked** and **Sex**:

```
ax = sns.barplot(x='Embarked', y='Sex', data=df, ci=None)
ax.set_xticklabels(embarked_labels.keys())
```

It seems that a higher proportion of the passengers who boarded at Southampton (**S**) were male, compared with Queenstown (**Q**) and Cherbourg (**C**).

How about **Embarked** and **Alone**?

```
ax = sns.barplot(x='Embarked', y='Alone', data=df, ci=None)
ax.set_xticklabels(embarked_labels.keys())
```

Seems like a large proportion of those who embarked from Queenstown are alone.

And finally, let’s see the relationship between **Sex** and **Alone**:

```
ax = sns.barplot(x='Sex', y='Alone', data=df, ci=None)
ax.set_xticklabels(sex_labels.keys())
```

As we can see, a larger proportion of males than females were alone on the trip.

We now define the *null hypothesis* and *alternate hypothesis*. As explained earlier, they are:

- **H₀** (*Null Hypothesis*): the 2 categorical variables to be compared are **independent** of each other.
- **H₁** (*Alternate Hypothesis*): the 2 categorical variables being compared are **dependent** on each other.

And we draw conclusions based on the following p-value conditions:

- p < 0.05: this means the two categorical variables are *correlated*.
- p > 0.05: this means the two categorical variables are *not correlated*.

Let’s manually go through the steps in calculating the χ2 values. The first step is to create a *contingency table*. Using the **Sex** and **Survived** columns as an example, we first create a contingency table:

The contingency table above displays the frequency distribution of the two categorical columns — **Sex** and **Survived**.

The **Degrees of Freedom** is next calculated as **(number of rows - 1) * (number of columns - 1)**. In this example, the degrees of freedom is (2 - 1) * (2 - 1) = 1.

Once the contingency table is created, sum up all the rows and columns, like this:

The above is the *Observed values*.

Next, we are going to calculate the *Expected values*. Here is how they are calculated:

- Replace each value in the observed table with the product of the sum of its column and the sum of its row, divided by the total sum.

The following figure shows how the first value is calculated:

The next figure shows how the second value is calculated:

Here is the result for the *Expected* values:

Then, calculate the **chi-square value** for each cell using the formula for ** χ2**:

Applying this formula to the *Observed* and *Expected* values, we get the chi-square values:

The **chi-square score** is the grand total of the chi-square values:
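The steps above can be traced on a small table. The counts below are hypothetical, chosen only so that the arithmetic is easy to follow; they are not taken from the Titanic dataset.

```python
# Hypothetical 2x2 observed counts (rows: two groups, columns: two outcomes)
observed = [[30, 10],
            [20, 40]]

row_totals = [sum(row) for row in observed]          # [40, 60]
col_totals = [sum(col) for col in zip(*observed)]    # [50, 50]
grand_total = sum(row_totals)                        # 100

# Expected value of a cell = (row total * column total) / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square value of a cell = (observed - expected)^2 / expected
chi2_cells = [[(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
              for obs_row, exp_row in zip(observed, expected)]

# The chi-square score is the sum over all cells
chi2_score = sum(sum(row) for row in chi2_cells)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

print(expected)              # [[20.0, 20.0], [30.0, 30.0]]
print(round(chi2_score, 4))  # 16.6667
print(degrees_of_freedom)    # 1
```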

We can use the following websites to verify if the numbers are correct:

**Chi-Square Calculator**— https://www.mathsisfun.com/data/chi-square-calculator.html

The Python implementation for the above steps is contained within the following **chi2_by_hand()** function:

```
import numpy as np
import pandas as pd
from IPython.display import display  # display() is built into Jupyter notebooks

def chi2_by_hand(df, col1, col2):
    #---create the contingency table---
    df_cont = pd.crosstab(index = df[col1], columns = df[col2])
    display(df_cont)

    #---calculate degree of freedom---
    degree_f = (df_cont.shape[0]-1) * (df_cont.shape[1]-1)

    #---sum up the totals for row and columns---
    df_cont.loc[:,'Total']= df_cont.sum(axis=1)
    df_cont.loc['Total']= df_cont.sum()
    print('---Observed (O)---')
    display(df_cont)

    #---create the expected value dataframe---
    # the totals are included in the sums, so each factor is doubled;
    # dividing by df_cont.sum().sum() (4x the grand total) cancels that out
    df_exp = df_cont.copy().astype(float)
    df_exp.iloc[:,:] = (np.multiply.outer(
        df_cont.sum(1).values, df_cont.sum().values) /
        df_cont.sum().sum())
    print('---Expected (E)---')
    display(df_exp)

    # calculate chi-square values
    df_chi2 = ((df_cont - df_exp)**2) / df_exp
    df_chi2.loc[:,'Total']= df_chi2.sum(axis=1)
    df_chi2.loc['Total']= df_chi2.sum()
    print('---Chi-Square---')
    display(df_chi2)

    #---get chi-square score---
    chi_square_score = df_chi2.iloc[:-1,:-1].sum().sum()
    return chi_square_score, degree_f
```

The **chi2_by_hand()** function takes in three arguments — the dataframe containing all columns, followed by two strings containing the names of the two columns we are comparing against. It returns a tuple — the chi-square score, plus the degrees of freedom.

Let’s now test the above function using the Titanic dataset. First, let’s compare the **Sex** and the **Survived** columns:

```
chi_score, degree_f = chi2_by_hand(df,'Sex','Survived')
print(f'Chi2_score: {chi_score}, Degrees of freedom: {degree_f}')
```

```
Chi2_score: 205.1364846934008, Degrees of freedom: 1
```

Using the chi-square score, we can now decide if we will accept or reject the null hypothesis using the **chi-square distribution curve**:

The x-axis represents the *χ*2 score. The area that is to the right of the **critical chi-square region** is known as the **rejection region**. The area to the left of it is known as the **acceptance region**. If the chi-square score that we have obtained falls in the acceptance region, the null hypothesis is accepted; else the alternate hypothesis is accepted.

So how do we obtain the **critical chi-square region**? For this, we have to check the **chi-square table**:

We can check out the **Chi-Square Table** at https://www.mathsisfun.com/data/chi-square-table.html

This is how we use the chi-square table. With the **α** set to be 0.05, and **1** degree of freedom, the critical chi-square region is **3.84** (refer to the chart above). Putting this value into the chi-square distribution curve, we can conclude that:

- As the calculated chi-square value (**205**) is greater than **3.84**, it falls in the *rejection region*, and hence the null hypothesis is rejected and the **alternate hypothesis is accepted**.
- Recall our alternate hypothesis: **H₁** (*Alternate Hypothesis*), that the 2 categorical variables being compared are **dependent** on each other.

This means that the **Sex** and **Survived** columns are dependent on each other. We can use the chi2_by_hand() function on the other features.
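The critical value read from the table can also be reproduced in code. As a sketch under a stated assumption: with 1 degree of freedom, a chi-square variable is the square of a standard normal variable, so the critical value is the square of the two-sided normal quantile. This shortcut only applies to 1 degree of freedom, and the variable names are illustrative.

```python
from statistics import NormalDist

alpha = 0.05

# Two-sided standard normal quantile for alpha = 0.05 (about 1.96)
z = NormalDist().inv_cdf(1 - alpha / 2)

# With 1 degree of freedom, the chi-square critical value is z squared
critical_value = z ** 2
print(round(critical_value, 2))    # 3.84

# Score reported above for Sex vs Survived
chi_score = 205.1364846934008
print(chi_score > critical_value)  # True, so the null hypothesis is rejected
```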

The previous section shows how we can accept or reject the null hypothesis by examining the chi-square score and comparing it with the chi-square distribution curve. An alternative way to accept or reject the null hypothesis is by using the *p-value*. Remember, the p-value can be calculated using the chi-square score and the degrees of freedom. For simplicity, we shall not go into the details of how to calculate the p-value by hand.

In Python, we can calculate the p-value using the **stats** module’s **sf()** function:

```
def chi2_by_hand(df, col1, col2):
    #---create the contingency table---
    df_cont = pd.crosstab(index = df[col1], columns = df[col2])
    display(df_cont)
    ...
    chi_square_score = df_chi2.iloc[:-1,:-1].sum().sum()

    #---calculate the p-value---
    from scipy import stats
    p = stats.distributions.chi2.sf(chi_square_score, degree_f)
    return chi_square_score, degree_f, p
```

We can now call the **chi2_by_hand()** function and get the chi-square score, degrees of freedom, and p-value:

```
chi_score, degree_f, p = chi2_by_hand(df,'Sex','Survived')
print(f'Chi2_score: {chi_score}, Degrees of freedom: {degree_f}, p-value: {p}')
```

The above code results in the following p-value:

```
Chi2_score: 205.1364846934008, Degrees of freedom: 1, p-value: 1.581266384342472e-46
```

As a quick recap, we accept or reject the hypotheses and form the conclusion based on the following p-value conditions:

- p < 0.05: this means the two categorical variables are *correlated*.
- p > 0.05: this means the two categorical variables are *not correlated*.

And since **p < 0.05**, this means the two categorical variables are **correlated**.

Let’s try out the categorical columns that contain nominal values:

```
chi_score, degree_f, p = chi2_by_hand(df,'Embarked','Survived')
print(f'Chi2_score: {chi_score}, Degrees of freedom: {degree_f}, p-value: {p}')
# Chi2_score: 27.918691003688615, Degrees of freedom: 2,
# p-value: 8.660306799267924e-07
chi_score, degree_f, p = chi2_by_hand(df,'Alone','Survived')
print(f'Chi2_score: {chi_score}, Degrees of freedom: {degree_f}, p-value: {p}')
# Chi2_score: 28.406341862069905, Degrees of freedom: 1,
# p-value: 9.834262807301776e-08
```

Since the p-values for both **Embarked** and **Alone** are < 0.05, we can conclude that both the **Embarked** and **Alone** features are correlated to the **Survived** target, and should be included for training in our model.

A few notes of caution would be useful here:

- While **Pearson’s coefficient** and **Spearman’s rank coefficient** measure the *strength* of an association between two variables, the **chi-square** test measures the *significance* of the association between two variables. What it tells us is whether the relationship we found in the sample is likely to exist in the population, or how likely it is to have arisen by chance due to sampling error.
- The **chi-square** test is sensitive to small frequencies in the contingency table. Generally, *if a cell in the contingency table has a frequency of 5 or less, the chi-square test will lead to errors in the conclusion*. Also, the chi-square test should not be used if the sample size is less than 50.
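These two rules of thumb are easy to check before running the test. The helper below is a hypothetical sketch, not part of the original code; it takes a contingency table as a list of rows and uses the thresholds quoted above.

```python
def chi2_warnings(table, min_cell=5, min_sample=50):
    """Return warnings for a contingency table given as a list of rows."""
    warnings = []
    # Rule 1: flag any cell whose frequency is at or below the threshold
    small = [(i, j) for i, row in enumerate(table)
             for j, count in enumerate(row) if count <= min_cell]
    if small:
        warnings.append(f'cells with frequency <= {min_cell}: {small}')
    # Rule 2: flag a total sample size below the minimum
    total = sum(sum(row) for row in table)
    if total < min_sample:
        warnings.append(f'sample size {total} is less than {min_sample}')
    return warnings

# Hypothetical tables for illustration
print(chi2_warnings([[30, 10], [20, 40]]))  # []
print(chi2_warnings([[2, 5], [3, 10]]))     # two warnings
```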

The chi-square test is used when both the independent and dependent variables are *categorical* variables. However, what if the independent variable is *categorical* and the dependent variable is *numerical*? In this case, we have to use another statistical test known as ANOVA (**An**alysis **o**f **Va**riance).

And so in this section, our discussion will revolve around ANOVA and how we use it in machine learning for feature selection. Before we get started, it is useful to summarize the different methods that we have discussed so far:

ANOVA is used for testing two variables, where:

- one is a *categorical* variable
- another is a *numerical* variable

ANOVA is used when the categorical variable has *at least 3 groups* (i.e three different unique values). If we want to compare just two groups, we use the t-test. ANOVA lets us know if a numerical variable changes according to the level of the categorical variable. **ANOVA uses the f-tests to statistically test the equality of means**. F-tests are named after their test statistic, F, which was named in honor of **Sir Ronald Fisher**.

Here are some examples that make it easier to understand when we can use ANOVA.

We have a dataset containing information about a group of people pertaining to their social media usage and the number of hours they sleep:

We want to find out if the amount of social media usage (categorical variable) has a direct impact on the number of hours of sleep (numerical variable).

We have a dataset containing three different brands of medication and the number of days for the medication to take effect:

We want to find out if there is a direct relationship between a specific brand and its effectiveness.

ANOVA checks whether the groups defined by a categorical feature have equal means with respect to the numerical response, by comparing the variance between the groups with the variance within them. If the group means are equal, it means this feature has no impact on the response and hence it (the categorical variable) should not be considered for model training.

The best way to understand ANOVA is to use an example. In the following example, I use a fictitious dataset where I recorded the reaction time of a group of people when they are given a specific type of drink.

I have a sample dataset named **drinks.csv** containing the following content:

```
team,drink_type,reaction_time
1,water,14
2,water,25
3,water,23
4,water,27
5,water,28
6,water,21
7,water,26
8,water,30
9,water,31
10,water,34
1,coke,25
2,coke,26
3,coke,27
4,coke,29
5,coke,25
6,coke,23
7,coke,22
8,coke,27
9,coke,29
10,coke,21
1,coffee,8
2,coffee,20
3,coffee,26
4,coffee,36
5,coffee,39
6,coffee,23
7,coffee,25
8,coffee,28
9,coffee,27
10,coffee,25
```

There are 10 teams in all, and each team comprises 3 persons. Each person in a team is given a different type of drink: water, coke, or coffee. After consuming the drink, they were asked to perform some activities and their reaction time was recorded. The aim of this experiment is to determine if the drinks have any effect on a person’s reaction time.

Let’s first load the dataset into a Pandas DataFrame:

```
import pandas as pd
df = pd.read_csv('drinks.csv')
```

Record the *observation size*, which we will make use of later:

```
observation_size = df.shape[0] # number of observations
```

It is useful to visualize the distribution of the data using a Boxplot:

```
_ = df.boxplot('reaction_time', by='drink_type')
```

We can see that the three types of drinks have about the same median reaction time.

To facilitate the calculation for ANOVA, we need to pivot the dataframe:

```
df = df.pivot(columns='drink_type', index='team')
display(df)
```

The columns represent the three different types of drinks and the rows represent the 10 teams. We will also use this chance to record the *number of items in each group*, as well as the *number of groups*, which we will make use of later:

```
n = df.shape[0] # 10; number of items in each group
k = df.shape[1] # 3; number of groups
```

We now define the *null hypothesis* and *alternate hypothesis*, just like the chi-square test. They are:

- **H₀** (Null hypothesis): there is no difference among group means.
- **H₁** (Alternate hypothesis): at least one group differs significantly from the overall mean of the dependent variable.

We are now ready to begin our calculations for ANOVA. First, let’s find the mean for each group:

```
df.loc['Group Means'] = df.mean()
df
```

From here, we can now calculate the **overall mean**:

```
overall_mean = df.iloc[-1].mean()
overall_mean # 25.666666666666668
```

Now that we have calculated the *overall mean*, we can proceed to calculate the following:

- Sum of squares of all observations: **SS_total**
- Sum of squares within: **SS_within**
- Sum of squares between: **SS_between**

The *sum of squares of all observations* is calculated by subtracting the *overall mean* from each observation, squaring the differences, and then summing all the squares:

Programmatically, **SS_total** is computed as:

```
SS_total = (((df.iloc[:-1] - overall_mean)**2).sum()).sum()
SS_total # 1002.6666666666667
```

The *sum of squares within* is the sum of squared deviations of scores around their group’s mean:

Programmatically, **SS_within** is computed as:

```
SS_within = (((df.iloc[:-1] - df.iloc[-1])**2).sum()).sum()
SS_within # 1001.4
```

Next, we calculate the sum of squares of the group means from the overall mean:

Programmatically, **SS_between** is computed as:

```
SS_between = (n * (df.iloc[-1] - overall_mean)**2).sum()
SS_between # 1.266666666666667
```

We can verify that:

**SS_total** = **SS_between** + **SS_within**
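As a cross-check that does not depend on the pivoted DataFrame, the same decomposition can be sketched in plain Python using the reaction times from **drinks.csv**. The variable names here are illustrative.

```python
# Reaction times from drinks.csv, grouped by drink type
groups = {
    'water':  [14, 25, 23, 27, 28, 21, 26, 30, 31, 34],
    'coke':   [25, 26, 27, 29, 25, 23, 22, 27, 29, 21],
    'coffee': [8, 20, 26, 36, 39, 23, 25, 28, 27, 25],
}

all_values = [x for values in groups.values() for x in values]
overall_mean = sum(all_values) / len(all_values)
group_means = {g: sum(v) / len(v) for g, v in groups.items()}

# Squared deviations of every observation from the overall mean
ss_total = sum((x - overall_mean) ** 2 for x in all_values)

# Squared deviations of every observation from its own group mean
ss_within = sum((x - group_means[g]) ** 2
                for g, v in groups.items() for x in v)

# Squared deviations of group means from the overall mean, weighted by group size
ss_between = sum(len(v) * (group_means[g] - overall_mean) ** 2
                 for g, v in groups.items())

print(round(ss_total, 4), round(ss_within, 4), round(ss_between, 4))
# 1002.6667 1001.4 1.2667
print(abs(ss_total - (ss_between + ss_within)) < 1e-9)  # True
```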

With all the values computed, we can now complete the ANOVA table. Recall we have the following variables:

We can compute the various *degrees of freedom* as follows:

```
df_total = observation_size - 1 # 29
df_within = observation_size - k # 27
df_between = k - 1 # 2
```

From the above, compute the various *mean squared* values:

```
mean_sq_between = SS_between / (k - 1) # 0.6333333333333335
mean_sq_within = \
SS_within / (observation_size - k) # 37.08888888888889
```

Finally, we can calculate the **F-value**, which is the ratio of two variances:

```
F = mean_sq_between / mean_sq_within # 0.017076093469143204
```

Recall earlier that I mentioned ANOVA uses the f-tests to statistically test the equality of means.

Once the F-value is obtained, we now have to refer to the *f-distribution table* (see http://www.socr.ucla.edu/Applets.dir/F_Table.html for one example) to obtain the **f-critical value**. The f-distribution table is organized based on the **α** value (usually 0.05). So we need to first locate the table based on **α** = 0.05:

Next, observe that the columns of the f-distribution table are based on **df1** while the rows are based on **df2**. We can get **df1** and **df2** from the previous variables that we have created:

```
df1 = df_between # 2
df2 = df_within # 27
```

Using the values of **df1** and **df2**, we can now locate the **f-critical value** by locating the **df1** column and **df2** row:

From the above figure, we can see that the **f-critical value** is **3.3541**. Using this value, we can now decide if we will accept or reject the null hypothesis using the **F-distribution curve**:

Since the **f-value** (0.0171, which we calculated above) is less than the f-critical value in the f-distribution table, we accept the null hypothesis: **this means there is no significant difference between the groups, and all the means can be treated as the same**. For machine learning, this feature, *drink_type*, should **not** be included for training as it seems the different types of drinks have no effect on the reaction time. We should include a feature for training only if we reject the null hypothesis, as this means that the values of the feature have an effect on the response.
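The table lookup can be double-checked in code. Because this example has exactly 2 degrees of freedom between groups, the right tail of the F-distribution has a closed form, P(F > f) = (1 + 2f/df2)^(-df2/2). The sketch below relies on that special case and uses only the standard library; the function names are hypothetical, and for other values of df1 a statistics library would be needed.

```python
def f_sf_df1_2(f_value, df2):
    """P(F > f_value) for an F-distribution with df1 = 2 (special case)."""
    return (1 + 2 * f_value / df2) ** (-df2 / 2)

def f_critical_df1_2(alpha, df2):
    """Value of F whose right-tail probability equals alpha, for df1 = 2."""
    return (df2 / 2) * (alpha ** (-2 / df2) - 1)

df2 = 27                  # degrees of freedom within groups
F = 0.017076093469143204  # F-value calculated earlier

f_critical = f_critical_df1_2(0.05, df2)
print(round(f_critical, 4))          # 3.3541
print(round(f_sf_df1_2(F, df2), 4))  # 0.9831
print(F < f_critical)                # True
```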

In the previous section, we manually calculated the f-value for our dataset. Actually, there is an easier way — use the **stats** module’s **f_oneway()** function to calculate the f-value and p-value:

```
import scipy.stats as stats
fvalue, pvalue = stats.f_oneway(
df.iloc[:-1,0],
df.iloc[:-1,1],
df.iloc[:-1,2])
print(fvalue, pvalue) # 0.0170760934691432 0.9830794846682348
```

The **f_oneway()** function takes the groups as input and returns the ANOVA F and p-value:

In the above, the **f-value** is **0.0170760934691432** (identical to the one we calculated manually) and the **p-value** is **0.9830794846682348**.

Observe that the **f_oneway()** function takes in a variable number of arguments:

If we have many groups, it would be quite tedious to pass in the values of all the groups one by one. So, there is an easier way:

```
fvalue, pvalue = stats.f_oneway(
*df.iloc[:-1,0:3].T.values
)
```

Another way to calculate the f-value is to use the **statsmodels** module. We first build the model using the **ols()** function, and then call the **fit()** function on the instance of the model. Finally, we call the **anova_lm()** function on the fitted model and specify the type of ANOVA test to perform on it. There are 3 types of ANOVA tests to perform, but their discussion is beyond the scope of this article.

```
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
df = pd.read_csv('drinks.csv')
model = ols('reaction_time ~ drink_type', data=df).fit()
sm.stats.anova_lm(model, typ=2)
```

The above code snippet produces the following result, which is the same as the f-value that we calculated earlier (**0.017076**):

The **anova_lm()** function also returns the p-value (**0.983079**). We can make use of the following rules to determine if the categorical variable has any influence on the numerical variable:

- if p < 0.05, this means that the categorical variable has a significant influence on the numerical variable
- if p > 0.05, this means that the categorical variable has no significant influence on the numerical variable

Since the p-value is now 0.983079 (>0.05), this means that the **drink_type** has no significant influence on the **reaction_time**.

ANOVA helps to determine if a categorical variable has an influence on a numerical variable. So far the ANOVA test that we have discussed is known as the **one-way ANOVA** test. There are a few variations of ANOVA:

- **One-way ANOVA**: used to check how a numerical variable responds to the levels of one independent categorical variable
- **Two-way ANOVA**: used to check how a numerical variable responds to the levels of *two* independent categorical variables
- **Multi-way ANOVA**: used to check how a numerical variable responds to the levels of *multiple* independent categorical variables

Using a **two-way ANOVA** or **multi-way ANOVA**, we can investigate the combined impact of two (or more) independent categorical variables on one dependent numerical variable.

Resources:

https://thinkingneuron.com/how-to-measure-the-correlation-between-a-numeric-and-a-categorical-variable-in-python/

https://towardsdatascience.com/statistics-in-python-using-anova-for-feature-selection-b4dc876ef4f0