
Understanding and discovering multicollinearity in regression analysis with Python code


In this post, I will explain the concepts of collinearity and multicollinearity, why it is important to understand them, and how to take appropriate action when preparing your data.

Correlation vs. Collinearity vs. Multicollinearity

Correlation measures the strength and direction of the linear relationship between two columns in a dataset. Correlation is often used to find the relationship between a feature and the target:

For example, if one of the features has a high correlation with the target, it tells us that this particular feature heavily influences the target and should be included when we are training the model.

Collinearity, on the other hand, is a situation where two features are linearly associated (high correlation), and they are used as predictors for the target.

Multicollinearity is a special case of collinearity where a feature exhibits a linear relationship with two or more other features.

Problem with collinearity and multicollinearity

Recall the formula for multiple linear regression:

y = w₀ + w₁x₁ + w₂x₂ + … + wₙxₙ

One important assumption of linear regression is that a linear relationship should exist between each of the predictors (x₁, x₂, etc.) and the outcome y. However, if there is a correlation between the predictors (e.g. x₁ and x₂ are highly correlated), we can no longer determine the effect of one while holding the other constant, since the two predictors change together. The end result is that the coefficients (w₁ and w₂) are now less precise and hence less interpretable.
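
To see why this matters, here is a small simulation (my own sketch, not part of the original article) in which x₂ is nearly a copy of x₁. Fitting the same model on two different bootstrap samples produces very different coefficients, even though the underlying relationship never changes:

import numpy as np
from sklearn.linear_model import LinearRegression

n = 200
rng = np.random.default_rng(42)
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)            # x2 is almost identical to x1
y = 2 * x1 + 3 * x2 + rng.normal(scale=0.5, size=n)

# fit the same model on two different bootstrap resamples of the data
for seed in (0, 1):
    idx = np.random.default_rng(seed).integers(0, n, size=n)
    X = np.column_stack([x1[idx], x2[idx]])
    model = LinearRegression().fit(X, y[idx])
    print(model.coef_)   # w1 and w2 swing around, but their sum stays near 5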

Fixing Multicollinearity

When training a machine learning model, it is important that during the data preprocessing stage we sieve out the features in the dataset that exhibit multicollinearity. We can do so using a method known as VIF — Variance Inflation Factor.

VIF allows us to determine the strength of the correlation between the various independent variables. It is calculated by taking a variable and regressing it against every other variable.

VIF calculates how much the variance of a coefficient is inflated because of its linear dependencies with other predictors. Hence its name.

Here is how VIF works:

  • Assume we have a list of features: x₁, x₂, x₃, and x₄.
  • We first take the first feature, x₁, and regress it against the other features:
x₁ ~ x₂ + x₃ + x₄

In fact, we are performing a multiple regression above. Multiple regression generally explains the relationship between multiple independent or predictor variables and one dependent or criterion variable.

  • In the multiple regression above, we extract the R² value (between 0 and 1). If R² is large, x₁ can be predicted from the other three features and is therefore highly correlated with x₂, x₃, and x₄. If R² is small, x₁ cannot be predicted from them and is therefore not correlated with x₂, x₃, and x₄.
  • Based on the R² value calculated for x₁, we can now calculate its VIF using the following formula:

    VIF(x₁) = 1 / (1 - R²)

  • A large R² value (close to 1) makes the denominator small (1 minus a value close to 1 gives a number close to 0), which results in a large VIF. A large VIF indicates that the feature exhibits multicollinearity with the other features.
  • Conversely, a small R² value (close to 0) makes the denominator large (1 minus a value close to 0 gives a number close to 1), which results in a small VIF. A small VIF indicates that the feature exhibits little multicollinearity with the other features.
  • (1 - R²) is also known as the tolerance.
  • We repeat the process above for the other features and calculate the VIF for each feature:
x₂ ~ x₁ + x₃ + x₄   # regress x₂ against the rest of the features
x₃ ~ x₁ + x₂ + x₄ # regress x₃ against the rest of the features
x₄ ~ x₁ + x₂ + x₃ # regress x₄ against the rest of the features
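
As a quick numerical illustration (my own example, not tied to the datasets used below): if regressing x₁ on the other features gives an R² of 0.9, the tolerance is 0.1 and the VIF is 10, whereas an R² of 0.2 gives a VIF of only 1.25:

R² = 0.9  →  tolerance = 1 - 0.9 = 0.1  →  VIF = 1 / 0.1 = 10
R² = 0.2  →  tolerance = 1 - 0.2 = 0.8  →  VIF = 1 / 0.8 = 1.25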

While a correlation matrix and scatter plots can be used to find multicollinearity, they only show the bivariate relationship between pairs of independent variables. VIF, on the other hand, shows the correlation of a variable with a whole group of other variables.

Implementing VIF using Python

Now that we know how VIF is calculated, we can implement it using Python, with a little help from sklearn:

import pandas as pd
from sklearn.linear_model import LinearRegression

def calculate_vif(df, features):
    vif, tolerance = {}, {}

    # all the features that we want to examine
    for feature in features:
        # extract all the other features we will regress against
        X = [f for f in features if f != feature]
        X, y = df[X], df[feature]

        # extract r-squared from the fit
        r2 = LinearRegression().fit(X, y).score(X, y)

        # calculate tolerance
        tolerance[feature] = 1 - r2

        # calculate VIF
        vif[feature] = 1 / tolerance[feature]

    # return VIF and tolerance as a DataFrame
    return pd.DataFrame({'VIF': vif, 'Tolerance': tolerance})

Let’s Try It Out

To see VIF in action, let’s use a sample dataset named bloodpressure.csv, with the following content:

Pt,BP,Age,Weight,BSA,Dur,Pulse,Stress,
1,105,47,85.4,1.75,5.1,63,33,
2,115,49,94.2,2.1,3.8,70,14,
3,116,49,95.3,1.98,8.2,72,10,
4,117,50,94.7,2.01,5.8,73,99,
5,112,51,89.4,1.89,7,72,95,
6,121,48,99.5,2.25,9.3,71,10,
7,121,49,99.8,2.25,2.5,69,42,
8,110,47,90.9,1.9,6.2,66,8,
9,110,49,89.2,1.83,7.1,69,62,
10,114,48,92.7,2.07,5.6,64,35,
11,114,47,94.4,2.07,5.3,74,90,
12,115,49,94.1,1.98,5.6,71,21,
13,114,50,91.6,2.05,10.2,68,47,
14,106,45,87.1,1.92,5.6,67,80,
15,125,52,101.3,2.19,10,76,98,
16,114,46,94.5,1.98,7.4,69,95,
17,106,46,87,1.87,3.6,62,18,
18,113,46,94.5,1.9,4.3,70,12,
19,110,48,90.5,1.88,9,71,99,
20,122,56,95.7,2.09,7,75,99,

The dataset consists of the following fields:

  • Blood pressure (BP), in mm Hg
  • Age, in years
  • Weight, in kg
  • Body surface area (BSA), in m²
  • Duration of hypertension (Dur), in years
  • Basal Pulse (Pulse), in beats per minute
  • Stress index (Stress)

First, load the dataset into a Pandas DataFrame and drop the redundant columns (the patient ID Pt, and the empty Unnamed: 8 column produced by the trailing commas in the CSV):

df = pd.read_csv('bloodpressure.csv')
df = df.drop(['Pt','Unnamed: 8'],axis = 1)
df

Visualizing the relationships between columns

Before we do any cleanup, it would be useful to visualize the relationships between the various columns using a pair plot (using the Seaborn module):

import seaborn as sns
sns.pairplot(df)

From the pair plot, we can identify some pairs of columns that appear to be strongly correlated, such as Weight and BSA.

Calculating Correlation

Next, calculate the correlation between the columns using the corr() function:

df.corr()

Assuming that we are trying to build a model that predicts BP, we can see that the top features correlating with BP are Age, Weight, BSA, and Pulse.
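
If you prefer a visual summary, the same correlation matrix can be rendered as an annotated heatmap (a small optional sketch using Seaborn and Matplotlib, not part of the original walkthrough):

import seaborn as sns
import matplotlib.pyplot as plt

# visualize the correlation matrix as an annotated heatmap
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.show()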

Calculating VIF

Now that we have identified the columns that we want to use for training the model, we need to see which of them exhibit multicollinearity. So let's use the calculate_vif() function that we wrote earlier:

calculate_vif(df=df, features=['Age','Weight','BSA','Pulse'])
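
As a sanity check, you can compare these numbers with statsmodels, which ships a ready-made variance_inflation_factor function (a minimal sketch, assuming statsmodels is installed; it expects a design matrix that includes a constant column, and the results should closely match calculate_vif()):

from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# build a design matrix with an intercept column, then compute the VIF of each column
X = add_constant(df[['Age', 'Weight', 'BSA', 'Pulse']])
vifs = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                 index=X.columns)
print(vifs.drop('const'))   # ignore the intercept term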

Interpreting VIF Values

VIF values range from 1 to infinity. A rule of thumb for interpreting VIF values is:

  • VIF = 1 — the feature is not correlated with the other features
  • 1 < VIF < 5 — the feature is moderately correlated with the other features
  • VIF > 5 — the feature is highly correlated with the other features
  • VIF > 10 — the correlation is very high and is a cause for concern

From the VIF results in the previous section, we can see that Weight and BSA have VIF values greater than 5. This means that Weight and BSA are highly correlated. This is not surprising, as heavier people tend to have a larger body surface area.

So the next thing to do would be to try removing one of the highly correlated features and see if the result for VIF improves. Let’s try removing Weight since it has a higher VIF:

calculate_vif(df=df, features=['Age','BSA','Pulse'])

Next, let's put Weight back and remove BSA instead, to see the VIF of the other features:

calculate_vif(df=df, features=['Age','Weight','Pulse'])

As we observed, removing Weight results in a lower VIF for all other features, compared to removing BSA. So should we remove Weight then? Well, ideally, yes. But for practical reasons, it would make more sense to remove BSA and keep Weight. This is because later on when the model is trained and we use it for prediction, it is easier to get a patient’s weight than his/her body surface area.
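
If you have many candidate features, this trial-and-error process can be automated. Below is a small helper of my own (the name drop_high_vif is hypothetical) that builds on the calculate_vif() function and greedily drops the feature with the highest VIF until every remaining feature falls below a chosen threshold. Note that a purely automatic rule would have dropped Weight rather than BSA here, so domain knowledge should still have the final say:

def drop_high_vif(df, features, threshold=5.0):
    # start from the full candidate list and prune greedily
    features = list(features)
    while len(features) > 1:
        vif_df = calculate_vif(df=df, features=features)
        if vif_df['VIF'].max() <= threshold:
            break   # every remaining feature is below the threshold
        # drop the feature with the largest VIF and recompute
        features.remove(vif_df['VIF'].idxmax())
    return features

drop_high_vif(df=df, features=['Age', 'Weight', 'BSA', 'Pulse'])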

One More Example

Let’s look at one more example. This time we will use the Breast Cancer dataset that comes with sklearn:

from sklearn import datasets
bc = datasets.load_breast_cancer()
df = pd.DataFrame(bc.data, columns=bc.feature_names)
df

This dataset has 30 columns, so let’s only focus on the first 8 columns:

sns.pairplot(df.iloc[:,:8])

We can immediately observe that some features are highly correlated. Can you spot them?
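
If you'd rather not eyeball the pair plot, the most correlated pairs can also be listed programmatically (a small optional sketch, not part of the original walkthrough):

import numpy as np

corr = df.iloc[:, :8].corr().abs()
# keep only the upper triangle so each pair of features appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
print(upper.stack().sort_values(ascending=False).head(5))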

Let’s calculate the VIF for the first 8 columns:

calculate_vif(df=df, features=df.columns[:8])

We can see that several of these features have large VIF values, in particular mean radius, mean perimeter, mean area, mean concavity, and mean concave points.

Let's try to remove these features one by one and observe the new VIF values. First, remove mean perimeter:

calculate_vif(df=df, features=['mean radius', 
                               'mean texture', 
                               'mean area', 
                               'mean smoothness', 
                               'mean compactness', 
                               'mean concavity',
                               'mean concave points'])

Immediately there is a reduction of VIFs across the board. Let’s now remove the mean area:

calculate_vif(df=df, features=['mean radius', 
                               'mean texture',
                             # 'mean area', 
                               'mean smoothness', 
                               'mean compactness', 
                               'mean concavity',
                               'mean concave points'])

Let's now remove mean concave points, which now has the highest VIF:

calculate_vif(df=df, features=['mean radius', 
                               'mean texture',
                             # 'mean area', 
                               'mean smoothness', 
                               'mean compactness', 
                               'mean concavity',
                             # 'mean concave points'
                              ])

Finally, let’s remove mean concavity:

calculate_vif(df=df, features=['mean radius', 
                               'mean texture',
                             # 'mean area', 
                               'mean smoothness', 
                               'mean compactness', 
                             # 'mean concavity',
                             # 'mean concave points'
                              ])

And now all the VIF values are under 5.
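
As an aside, the drop_high_vif() helper sketched earlier would automate this pruning for the breast cancer features as well; since it always removes the single worst offender first, the surviving feature set may differ slightly from the manual choices made above:

drop_high_vif(df=df, features=df.columns[:8])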

Summary

In this article, we learned that multicollinearity happens when a feature exhibits a linear relationship with two or more other features. One way to detect multicollinearity is to calculate the Variance Inflation Factor (VIF). Any feature that has a VIF of more than 5 should be removed from the training dataset. It is important to note that VIF only works on continuous variables, not categorical variables.

Amir Masoud Sefidian
Machine Learning Engineer