
Visual comparison of decision boundaries for different classifiers


There are many debates on how to decide on the best classifier. Measuring performance metrics and computing the area under the ROC curve are common approaches, but there is a lot of useful information to be gleaned from visualizing a decision boundary, information that gives us an intuitive grasp of how learning models behave.

What is a decision boundary?

While training a classifier on a dataset, a classification algorithm defines a set of hyperplanes, called the decision boundary, that separates the data points into specific classes; it is the region where the algorithm switches from one class to another. On one side of a decision boundary, a data point is more likely to be labeled class A; on the other side, it is more likely to be labeled class B.

Let’s take an example of a Logistic Regression.

The goal of logistic regression is to find a way to split the data points so that a given observation’s class can be predicted accurately from the information present in its features.

Let’s suppose we define a line that describes the decision boundary. Then all of the points on one side of the boundary belong to class A, and all of the points on the other side belong to class B.

S(z) = 1 / (1 + e^(-z))

  • S(z) = output between 0 and 1 (probability estimate)
  • z = input to the function (z = mx + b)
  • e = base of the natural logarithm

Our current prediction function returns a probability score between 0 and 1. In order to map this to a discrete class (A/B), we select a threshold value or tipping point above which we will classify values into class A and below which we classify values into class B.

p >= 0.5: class = A

p < 0.5: class = B

If our threshold was 0.5 and our prediction function returned 0.7, we would classify this observation as class A. If our prediction was 0.2, we would classify the observation as class B.

So, the set of points where the predicted probability equals 0.5 forms the decision boundary. (The sigmoid function above is what maps the model’s raw output z to a probability in the first place.)
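
As a minimal sketch in plain NumPy (the slope m and intercept b below are made-up values, just to illustrate the rule):

import numpy as np

def sigmoid(z):
    # S(z) = 1 / (1 + e^(-z)): maps any real-valued score to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_class(x, m, b, threshold=0.5):
    # z = m*x + b is the linear score; the sigmoid turns it into P(class A)
    p = sigmoid(m * x + b)
    return "A" if p >= threshold else "B"

# With p ≈ 0.70 we assign class A; with p ≈ 0.21 we assign class B
print(predict_class(x=2.0, m=1.0, b=-1.15))   # -> "A"
print(predict_class(x=-0.2, m=1.0, b=-1.15))  # -> "B"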

Importance of decision boundaries

At work, I sometimes find myself trying to explain the difference between different modeling or machine learning techniques—for example, the difference between regression models and decision trees. But simply describing how they work algorithmically rarely gives somebody that satisfying feeling of intuition that is the product of understanding.

Seeking to understand, people will often ask when one technique works better than another. This is a good question and it helps to see how different techniques perform on different problems. One way to visualize this is to compare plots of decision boundaries. Given a supervised learning problem where there are points of two classes (let’s say red and blue), we can train machine learning techniques to predict which class a hypothetical point should belong to.

A decision boundary is a surface that separates data points belonging to different class labels. Decision boundaries are not confined to the data points we have provided; they span the entire feature space we trained on. The model can predict a value for any possible combination of inputs in that feature space. If the data we train on is not ‘diverse’, the model will generalize poorly to new instances. So it is important to analyze which models are best suited to a ‘diverse’ dataset before putting a model into production.

Examining decision boundaries is a great way to learn how the training data we select affects performance and the ability of our model to generalize. Visualization of decision boundaries can illustrate how sensitive models are to each dataset, which is a great way to understand how specific algorithms work, and their limitations for specific datasets.

Here’s an example produced by a little Python script I whipped up. In the first plot is a randomly generated problem – a two-dimensional space with red and blue points. We then train a variety of machine learning techniques on that problem. Their task is to discriminate between the red and blue points. We can illustrate their solution by coloring the space red or blue depending on what class they predict that region belongs to.

To reiterate, the space is colored according to whether the machine learning technique predicts it belongs to the red or the blue class. (For some techniques, we can use shade to represent their strength of belief in the color of the region. For example, if a model predicts a high probability that a region is blue, then we shade that area darker blue). The line between colored regions is called the decision boundary.

The first thing you might look for is how many points have been misclassified by being included in an incorrectly colored region. To perfectly solve this problem, a very complicated decision boundary is required. The capacity of a technique to form really convoluted decision boundaries isn’t necessarily a virtue, since it can lead to overfitting.

The main point of these plots, though, is to compare the decision boundaries that techniques are capable of. Looking at how different techniques try to solve the problem helps to develop an intuition of how they work. For example, regression, which must solve the problem using a linear equation, is only capable of continuous boundaries. Whereas decision trees, which proceed by bisecting continuous variables, can partition the space into isolated “blocks.”

… and so on. Here’s another example:

This example illustrates the problem of overfitting. Simple, first-order logistic regression is only capable of straight-line boundaries, whereas adding interactions and polynomial terms lets it form curved boundaries. The straight-line solution of simple logistic regression will almost certainly generalize better than the convoluted boundary produced when we add polynomial terms and interactions. Regularisation can address the problem of overfitting by constraining the polynomial & interaction coefficients and thus, visually speaking, limiting the flexibility of the boundary.

Here’s another example, with a more difficult problem:

This problem is tricky because the red & blue classes are in opposing quadrants. Clearly, no straight-line decision boundary is going to work and simple regression fails dismally. Problems where the classes form multiple clusters are often suited to k-Nearest Neighbours or tree-based approaches.

Here’s one last example of a difficult problem in which the blue points separate two groups of red points:

Here’s the Python script I used to generate these plots. It’s not the most elegant code, but it’s clear and easy to tinker with. I run it in a Jupyter notebook. You can tweak the make_classification() function to generate different problems.

### Prepare Python
%pylab inline   # pulls numpy and matplotlib names (plt, cm, vstack, c_, ...) into the namespace

import numpy as np
import pandas as pd
import statsmodels.api as sm
import scipy.stats as stats
import neurolab as nl
from sklearn.datasets import make_classification
from sklearn import neighbors, tree, svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB


### Helper functions for plotting
# A function to plot observations on scatterplot
def plotcases(ax):
    ax.scatter(xdf['x1'], xdf['x2'], c=xdf['y'], cmap=cm.coolwarm, alpha=0.6, s=20, lw=0.4)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['left'].set_visible(False)
    ax.spines['bottom'].set_visible(False)
    ax.tick_params(axis='both', which='both', left=False, right=False, top=False, bottom=False,
                   labelleft=False, labelright=False, labeltop=False, labelbottom=False)

# a function to draw decision boundary and colour the regions
def plotboundary(ax, Z):
    ax.pcolormesh(xx, yy, Z, cmap=cm.coolwarm, alpha=0.1)  
    ax.contour(xx, yy, Z, [0.5], linewidths=0.75, colors='k')


### Generate train data
X, y = make_classification(n_samples=100, n_features=2, n_informative=2, n_redundant=0, n_repeated=0,
                           n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.02, class_sep=0.5, 
                           hypercube=True, shift=0.0, scale=0.5, shuffle=True, random_state=5)

xdf = pd.DataFrame(X, columns=['x1','x2'])
xdf['y'] = y


### create plot canvas with 6x4 plots
fig = plt.figure(figsize=(12, 16), dpi=1600)
plt.subplots_adjust(hspace=.5)             

nrows = 7
ncols = 4
gridsize = (nrows, ncols)


### 1. plot the problem ###
ax0 = plt.subplot2grid(gridsize,[0,0])
plotcases(ax0)
ax0.title.set_text("Problem")

# take boundaries from first plot and define mesh of points for plotting decision spaces
x_min, x_max = plt.xlim()
y_min, y_max = plt.ylim()
nx, ny = 100, 100   # this sets the num of points in the mesh
xx, yy = np.meshgrid(np.linspace(x_min, x_max, nx),
                     np.linspace(y_min, y_max, ny))


############# kNN #################
### 2. kNN with k=5
knn5 = neighbors.KNeighborsClassifier(n_neighbors=5)
knn5.fit(xdf[['x1','x2']], xdf['y']) 

Z = knn5.predict_proba(np.c_[xx.ravel(), yy.ravel()])
Z = Z[:,1].reshape(xx.shape)

ax = plt.subplot2grid(gridsize, [1,0])
ax.title.set_text("kNN, k=5")
plotboundary(ax, Z)
plotcases(ax)


### 3. kNN with k=15
knn15 = neighbors.KNeighborsClassifier(n_neighbors=15)
knn15.fit(xdf[['x1','x2']], xdf['y']) 

Z = knn15.predict_proba(np.c_[xx.ravel(), yy.ravel()])
Z = Z[:,1].reshape(xx.shape)

ax = plt.subplot2grid(gridsize,[1,1])
ax.title.set_text("kNN, k=15")
plotboundary(ax, Z)
plotcases(ax)


########### logistic regression ##########
### 4. Logistic regression - simple linear
formula = ('y ~ x1 + x2')
LRm = sm.GLM.from_formula(formula=formula, data=xdf, family=sm.families.Binomial()).fit()

XX = pd.DataFrame(np.vstack([xx.ravel(), yy.ravel()]).T)
XX.columns = xdf.columns[0:2]
Z = LRm.predict(XX)
Z = Z.reshape(xx.shape)

ax = plt.subplot2grid(gridsize, [2,0])
ax.title.set_text("Logistic Regression\n simple")
plotboundary(ax, Z)
plotcases(ax)


### 5. Logistic regression - with basic polynomials
formula = ('y ~ x1 + x2 + I(x1**2) + I(x2**2)')
LRm = sm.GLM.from_formula(formula=formula, data=xdf, family=sm.families.Binomial()).fit()

Z = LRm.predict(XX)
Z = Z.reshape(xx.shape)

ax = plt.subplot2grid(gridsize, [2,1])
ax.title.set_text("Logistic Regression\n basic polynomials")
plotboundary(ax, Z)
plotcases(ax)


### 6. Logistic regression - with polynomials & interactions
formula = ('y ~ x1 + x2 + x1*x2 + I(x1**2) + I(x2**2) + x1*I(x1**2) + x2*I(x2**2)\
                                + I(x1**3) + I(x2**3) + x1*I(x1**3) + x2*I(x2**3)')
LRm = sm.GLM.from_formula(formula=formula, data=xdf, family=sm.families.Binomial()).fit()

Z = LRm.predict(XX)
Z = Z.reshape(xx.shape)

ax = plt.subplot2grid(gridsize, [2,2])
ax.title.set_text("Logistic Regression\n polynomials & interactions")
plotboundary(ax, Z)
plotcases(ax)


### 7. Logistic regression - with polynomials & interactions, regularised
# smaller alpha = weaker regularisation. 
formula = ('y ~ x1 + x2 + x1*x2 + I(x1**2) + I(x2**2) + x1*I(x1**2) + x2*I(x2**2)\
                                + I(x1**3) + I(x2**3) + x1*I(x1**3) + x2*I(x2**3)')
LRm = sm.Logit.from_formula(formula=formula, data=xdf).fit_regularized(alpha=0.5, maxiter = 200)

Z = LRm.predict(XX)
Z = Z.reshape(xx.shape)

ax = plt.subplot2grid(gridsize, [2,3])
ax.title.set_text("Logistic Regression\n polynomials & interactions\n regularised")
plotboundary(ax, Z)
plotcases(ax)


#####  TREE METHODS  ########
### 8. Decision tree
dtree = tree.DecisionTreeClassifier()
dtree = dtree.fit(xdf[['x1','x2']], xdf['y'])

Z = dtree.predict_proba(np.c_[xx.ravel(), yy.ravel()])
Z = Z[:,1].reshape(xx.shape)

ax = plt.subplot2grid(gridsize, [3,0])
ax.title.set_text("Decision tree")
plotboundary(ax, Z)
plotcases(ax)

### 9. Random forest
rf = RandomForestClassifier()
rf.fit(xdf[['x1','x2']], xdf['y'])

Z = rf.predict_proba(np.c_[xx.ravel(), yy.ravel()])
Z = Z[:,1].reshape(xx.shape)

ax = plt.subplot2grid(gridsize, [3,1])
ax.title.set_text("Random forest")
plotboundary(ax, Z)
plotcases(ax)

###### SUPPORT VECTOR MACHINES ###############
## 10. SVM with 4th order polynomials
svc2 = svm.SVC(kernel='poly',degree=4, probability=True)  
svc2.fit(xdf[['x1','x2']], xdf['y']) 

Z = svc2.predict_proba(np.c_[xx.ravel(), yy.ravel()])
Z = Z[:,1].reshape(xx.shape)

ax = plt.subplot2grid(gridsize, [4,0])
ax.title.set_text("SVM\n4th order polynomials")
plotboundary(ax, Z)
plotcases(ax)

## 11. SVM with radial basis function
svc3 = svm.SVC(kernel='rbf', probability=True)  
svc3.fit(xdf[['x1','x2']], xdf['y']) 

Z = svc3.predict_proba(np.c_[xx.ravel(), yy.ravel()])
Z = Z[:,1].reshape(xx.shape)

ax = plt.subplot2grid(gridsize, [4,1])
ax.title.set_text("SVM\n Radial basis function")
plotboundary(ax, Z)
plotcases(ax)


####### Bayesian & Probabilistic methods ############
### 12. Gaussian Naive Bayes
gnb = GaussianNB()
gnb.fit(xdf[['x1','x2']], xdf['y'])

Z = gnb.predict_proba(np.c_[xx.ravel(), yy.ravel()])
Z = Z[:,1].reshape(xx.shape)

ax = plt.subplot2grid(gridsize, [5,0])
ax.title.set_text("Gaussian Naive Bayes")
plotboundary(ax, Z)
plotcases(ax)


### 13. Gaussian Bayes, non-naive (joint distribution)
## Assumes multivariate normality 
# Specify prior: p(class). Assume that this is constant across all regions
pClass1 = len(xdf[xdf['y']==1]) / float(len(xdf))
pClass2 = len(xdf[xdf['y']==0]) / float(len(xdf))

## Create likelihood distribution: p(datapoint | class)
## For each class, determine mean and variance across both variables
# Collect means for both variables, for both classes
Mux1C1 = xdf['x1'][xdf['y']==1].mean()
Mux2C1 = xdf['x2'][xdf['y']==1].mean()
Mux1C2 = xdf['x1'][xdf['y']==0].mean()
Mux2C2 = xdf['x2'][xdf['y']==0].mean()

# Collect vars for both variables, for both classes
Varx1C1 = xdf['x1'][xdf['y']==1].std()
Varx2C1 = xdf['x2'][xdf['y']==1].std()
Varx1C2 = xdf['x1'][xdf['y']==0].std()
Varx2C2 = xdf['x2'][xdf['y']==0].std()

# use to create Normal distributions from these variables
rangex1 = np.linspace(xdf['x1'].min(), xdf['x1'].max(), num=100)
rangex2 = np.linspace(xdf['x2'].min(), xdf['x2'].max(), num=100)

# generate Gaussian distributions for x1 and x2 within both class1 and class2
C1x1 = np.array([stats.norm(Mux1C1, Varx1C1).pdf(i) for i in rangex1])
C1x2 = np.array([stats.norm(Mux2C1, Varx2C1).pdf(i) for i in rangex2])
C2x1 = np.array([stats.norm(Mux1C2, Varx1C2).pdf(i) for i in rangex1])
C2x2 = np.array([stats.norm(Mux2C2, Varx2C2).pdf(i) for i in rangex2])

# use this to create a joint likelihood distributions for class1 and class2: 
# These contain the joint probabilities for any given point whether it is in class1 or class2
pLikelihoodC1Joint = np.dot(C1x2[:, None], C1x1[None, :])
pLikelihoodC2Joint = np.dot(C2x2[:, None], C2x1[None, :])

# Finally, put them together to create the two posterior distributions: p(class | data)
ScoreClass1GivenData = pLikelihoodC1Joint * pClass1 / (sum(pLikelihoodC1Joint * pClass1) + sum(pLikelihoodC2Joint * pClass2))  
ScoreClass2GivenData = pLikelihoodC2Joint * pClass2 / (sum(pLikelihoodC1Joint * pClass1) + sum(pLikelihoodC2Joint * pClass2))

## for a given point, we calculate both likelihoods, and the max is the class we allocate to:
# eg ScoreClass1GivenData > ScoreClass2GivenData
Z = ScoreClass1GivenData > ScoreClass2GivenData
Z = Z.reshape(xx.shape)

ax = plt.subplot2grid(gridsize,[5,1])
ax.title.set_text("Bayes: Full joint\n (Guassian) distribution")
plotboundary(ax, Z)
plotcases(ax)


### 14. Bayes with joint, nonparametric (kernel density estimated) distributions
##  = Does NOT assume normality

# Specify prior: p(class). [same as above]

# Create likelihood distribution: p(datapoint | class) 
# = estimate distribution using Kernel Density 
dens_1 = sm.nonparametric.KDEMultivariate(data=xdf[['x1','x2']][xdf['y']==0], 
                                          var_type='cc', bw='normal_reference')

dens_2 = sm.nonparametric.KDEMultivariate(data=xdf[['x1','x2']][xdf['y']==1], 
                                          var_type='cc', bw='normal_reference')

# generate distributions for x1 and x2 within both class1 and class2
pLikelihoodC1Joint = dens_1.pdf(np.c_[xx.ravel(), yy.ravel()])
pLikelihoodC2Joint = dens_2.pdf(np.c_[xx.ravel(), yy.ravel()])

# Use Bayes rule to get posterior distribution, then convert back to matrix form
ScoreClass1GivenData = pLikelihoodC1Joint * pClass1 / (sum(pLikelihoodC1Joint * pClass1) + sum(pLikelihoodC2Joint * pClass2))
ScoreClass2GivenData = pLikelihoodC2Joint * pClass2 / (sum(pLikelihoodC1Joint * pClass1) + sum(pLikelihoodC2Joint * pClass2))

ScoreClass1GivenData = ScoreClass1GivenData.reshape(xx.shape)
ScoreClass2GivenData = ScoreClass2GivenData.reshape(xx.shape)

# predict most likely class at each point on grid
Z = ScoreClass1GivenData < ScoreClass2GivenData
Z = Z.reshape(xx.shape)

ax = plt.subplot2grid(gridsize, [5,2])
ax.title.set_text("Bayes: Full joint\n KDE distribution")
plotboundary(ax, Z)
plotcases(ax)


####### Neural Networks #########
## Nb. these take a while to train & the neurolab library can be a bit temperamental
## Can delete this chunk of code if desired
yarr = np.matrix(y).T

## Neural Net - 1HL, 4 hidden nodes
net3 = nl.net.newff([[np.min( X[:,0]), np.max( X[:,0])], [np.min( X[:,1]), np.max( X[:,1])]], [4, 1])
net3.trainf = nl.train.train_gd  # use gradient descent, bfgs is too buggy in neurolab
err = net3.train(X, yarr, show=100, goal=0.01)

Z = net3.sim(np.vstack([xx.ravel(), yy.ravel()]).T)
Z = Z.reshape(xx.shape)

ax = plt.subplot2grid(gridsize,[6,0])
ax.title.set_text("Neural Net\n 1HL x 4 nodes")
plotboundary(ax, Z)
plotcases(ax)


## Neural Net - 2HL, 4 hidden nodes each
net44 = nl.net.newff([[np.min( X[:,0]), np.max( X[:,0])], [np.min( X[:,1]), np.max( X[:,1])]], [4, 4, 1])
net44.trainf = nl.train.train_gd  # use gradient descent, bfgs is too buggy in neurolab
err = net44.train(X, yarr, show=100, goal=0.01, lr=0.01)

Z = net44.sim(np.vstack([xx.ravel(), yy.ravel()]).T)
Z = Z.reshape(xx.shape)

ax = plt.subplot2grid(gridsize,[6,1])
ax.title.set_text("Neural Net\n 2HL x 4 nodes each")
plotboundary(ax, Z)
plotcases(ax)

Scikit-Learn Example

A comparison of several classifiers in scikit-learn on synthetic datasets is presented here. The point of this example is to illustrate the nature of decision boundaries of different classifiers. This should be taken with a grain of salt, as the intuition conveyed by these examples does not necessarily carry over to real datasets. Particularly in high-dimensional spaces, data can more easily be separated linearly and the simplicity of classifiers such as naive Bayes and linear SVMs might lead to better generalization than is achieved by other classifiers. The plots show training points in solid colors and testing points semi-transparent. The lower right shows the classification accuracy on the test set.

(Figure: one panel per classifier – Input data, Nearest Neighbors, Linear SVM, RBF SVM, Gaussian Process, Decision Tree, Random Forest, Neural Net, AdaBoost, Naive Bayes, QDA)

# Code source: Gaël Varoquaux
#              Andreas Müller
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.inspection import DecisionBoundaryDisplay

names = [
    "Nearest Neighbors",
    "Linear SVM",
    "RBF SVM",
    "Gaussian Process",
    "Decision Tree",
    "Random Forest",
    "Neural Net",
    "AdaBoost",
    "Naive Bayes",
    "QDA",
]

classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    GaussianProcessClassifier(1.0 * RBF(1.0)),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1, max_iter=1000),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis(),
]

X, y = make_classification(
    n_features=2, n_redundant=0, n_informative=2, random_state=1, n_clusters_per_class=1
)
rng = np.random.RandomState(2)
X += 2 * rng.uniform(size=X.shape)
linearly_separable = (X, y)

datasets = [
    make_moons(noise=0.3, random_state=0),
    make_circles(noise=0.2, factor=0.5, random_state=1),
    linearly_separable,
]

figure = plt.figure(figsize=(27, 9))
i = 1
# iterate over datasets
for ds_cnt, ds in enumerate(datasets):
    # preprocess dataset, split into training and test part
    X, y = ds
    X = StandardScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=42
    )

    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5

    # just plot the dataset first
    cm = plt.cm.RdBu
    cm_bright = ListedColormap(["#FF0000", "#0000FF"])
    ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
    if ds_cnt == 0:
        ax.set_title("Input data")
    # Plot the training points
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright, edgecolors="k")
    # Plot the testing points
    ax.scatter(
        X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.6, edgecolors="k"
    )
    ax.set_xlim(x_min, x_max)
    ax.set_ylim(y_min, y_max)
    ax.set_xticks(())
    ax.set_yticks(())
    i += 1

    # iterate over classifiers
    for name, clf in zip(names, classifiers):
        ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
        clf.fit(X_train, y_train)
        score = clf.score(X_test, y_test)
        DecisionBoundaryDisplay.from_estimator(
            clf, X, cmap=cm, alpha=0.8, ax=ax, eps=0.5
        )

        # Plot the training points
        ax.scatter(
            X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright, edgecolors="k"
        )
        # Plot the testing points
        ax.scatter(
            X_test[:, 0],
            X_test[:, 1],
            c=y_test,
            cmap=cm_bright,
            edgecolors="k",
            alpha=0.6,
        )

        ax.set_xlim(x_min, x_max)
        ax.set_ylim(y_min, y_max)
        ax.set_xticks(())
        ax.set_yticks(())
        if ds_cnt == 0:
            ax.set_title(name)
        ax.text(
            x_max - 0.3,
            y_min + 0.3,
            ("%.2f" % score).lstrip("0"),
            size=15,
            horizontalalignment="right",
        )
        i += 1

plt.tight_layout()
plt.show()

Real-world data example

In this section, we build the decision boundary for various classifiers and decide which algorithm is best for the dataset. The dataset contains users’ information, from which we build a model to predict whether a user will buy a car or not. The dataset is available here.

The Independent variables:

  • Age: Age of the user
  • Estimated Salary: Salary of the user.

The dependent variable: ‘Purchased’ which is 1 if the user purchases the car and 0 otherwise.

Step 1: Import all the required libraries

# Package imports
import matplotlib.pyplot as plt
import numpy as np
import sklearn
import sklearn.datasets
import sklearn.linear_model
import matplotlib
import pandas as pd

Step 2: Import the dataset

from google.colab import files
uploaded=files.upload()
import io
df2=pd.read_csv(io.BytesIO(uploaded['Social_Network_Ads.csv']))

Step 3: Apply StandardScaler to the dataset. The variables ‘Salary’ and ‘Age’ are not on the same scale, so they should be standardized; otherwise the model is unlikely to predict well. Standard scaling also helps to speed up the calculations in many algorithms.

X = df2.iloc[:, :-1].values
y = df2.iloc[:, -1].values
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X = sc.fit_transform(X)

Step 4: Import sklearn libraries for classifiers

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

Step 5: Get the dimension of the dataset.
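
The original does not show code for this step; a minimal sketch, assuming the dataframe is named df2 as in Step 2:

# Number of rows (observations) and columns (features + target)
print(df2.shape)
# Peek at the first records and the column types
print(df2.head())
print(df2.dtypes)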

Step 6: Build a Logistic Regression model and display its decision boundary. The decision boundary can be visualized by dense sampling via meshgrid; however, if the grid resolution is not high enough, the boundary will appear inaccurate. The purpose of meshgrid is to create a rectangular grid out of an array of x values and an array of y values. You can get a complete explanation of how to plot a meshgrid from here.

With the meshgrid we effectively make an image in which each pixel represents a grid cell in the 2D feature space. The classifier assigns a class label to each grid cell, and the classified image is then used as the background for a scatter plot that shows the data points of each class.

Advantage: It classifies the grid points in the 2D feature space.

Disadvantage: the computational cost of making very fine decision boundary maps, since we would have to make the grid finer and finer. A small sketch of the meshgrid idea follows.
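
A minimal sketch of what meshgrid produces (the grid values are illustrative; the helper function in the next block does the same thing at a much finer resolution):

import numpy as np

# 1D arrays of x and y values covering the (scaled) feature ranges
xs = np.linspace(-3, 3, 5)
ys = np.linspace(-3, 3, 5)

# meshgrid turns them into two 2D arrays: each (xx[i, j], yy[i, j]) pair is one grid cell
xx, yy = np.meshgrid(xs, ys)
grid_points = np.c_[xx.ravel(), yy.ravel()]   # shape (25, 2): one row per grid cell
print(xx.shape, grid_points.shape)

# a classifier's predictions on grid_points, reshaped back to xx.shape,
# give the class label of every "pixel" of the background image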

# Display plots inline and change default figure size
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (10.0, 8.0)

# Train the logistic regression classifier
clf = sklearn.linear_model.LogisticRegressionCV()
clf.fit(X, y)

# Helper function to plot a decision boundary.
def plot_decision_boundary(pred_func):
    # Set min and max values and give it some padding
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    h = 0.01
    # Generate a grid of points with distance h between them
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    # Predict the function value for the whole grid
    Z = pred_func(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Plot the contour and training examples
    plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Spectral)

# Plot the decision boundary
plot_decision_boundary(lambda x: clf.predict(x))
plt.title("Logistic Regression")

In logistic regression, the decision boundary is a straight line separating class A and class B. Some points from class A fall into the region of class B because, with a linear model, it is difficult to find a single straight line that exactly separates the two classes.

Step 7: Build a Random Forest model and plot the decision boundary. Being a tree-based ensemble it combines many trees, and the plot tries to capture all the relevant regions of each class. It is a nonlinear classifier.

# Display plots inline and change default figure size
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (10.0, 8.0)
# Train the RandomForestClassifier
clf1 = RandomForestClassifier(random_state=1, n_estimators=100)
clf1.fit(X, y)
# Plot the decision boundary
plot_decision_boundary(lambda x: clf1.predict(x))
plt.title("Random Forest")

The decision surfaces for the Decision Tree and Random Forest are very complex. The Decision Tree is by far the most sensitive, showing only extreme classification probabilities that are heavily influenced by single points. The Random Forest shows lower sensitivity, with isolated points having much less extreme classification probabilities. The SVM is the least sensitive, since it has a very smooth decision boundary.

Step 8: Build a Support Vector Machine model and Plot the decision boundary

# Display plots inline and change default figure size
%matplotlib inline
from sklearn.svm import SVC
matplotlib.rcParams['figure.figsize'] = (10.0, 8.0)
# Train the Support Vector Machine classifier
clf3 = SVC(gamma='auto')
clf3.fit(X, y)
# Plot the decision boundary
plot_decision_boundary(lambda x: clf3.predict(x))
plt.title("Support Vector Machine")

A Support Vector Machine finds a hyperplane that separates the feature space into two classes with the maximum margin. If the problem is not linearly separable in the original space, the kernel trick is used to make it linearly separable by implicitly increasing the number of dimensions; a curved hypersurface in the low-dimensional space corresponds to a hyperplane in a much higher-dimensional space.
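
As a small, hedged illustration of that idea (an explicit feature lift, not the SVM’s actual implicit kernel mapping): adding the squared radius as a third feature makes concentric-circle data separable by a linear SVM.

import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X_c, y_c = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Linear SVM in the original 2D space: the circles cannot be separated by a line
print(SVC(kernel='linear').fit(X_c, y_c).score(X_c, y_c))

# Add a third feature r^2 = x1^2 + x2^2; the two classes become separable by a plane
X_lifted = np.c_[X_c, (X_c ** 2).sum(axis=1)]
print(SVC(kernel='linear').fit(X_lifted, y_c).score(X_lifted, y_c))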

Step 9: Build a Decision Tree model and Plot the decision boundary

# Display plots inline and change default figure size
%matplotlib inline
from sklearn.tree import DecisionTreeClassifier
matplotlib.rcParams['figure.figsize'] = (10.0, 8.0)

# Train the Decision Tree classifier
clf4 = DecisionTreeClassifier()
clf4.fit(X, y)

# Plot the decision boundary
plot_decision_boundary(lambda x: clf4.predict(x))
plt.title("Decision Tree")

Step 10: Build the Gaussian NaiveBayes model and Plot the decision boundary

# Display plots inline and change default figure size
%matplotlib inline
from sklearn.naive_bayes import GaussianNB
matplotlib.rcParams['figure.figsize'] = (10.0, 8.0)
# Train the Gaussian NaiveBayes classifier
clf5 = GaussianNB()
clf5.fit(X, y)

# Plot the decision boundary
plot_decision_boundary(lambda x: clf5.predict(x))
plt.title("GaussianNB Classifier")

Gaussian Naive Bayes also performs well here, producing a smooth, curved boundary.

Decision boundaries for higher dimensional data

Decision boundaries can easily be visualized for 2D and 3D datasets. Beyond 3D, visualization becomes a challenge: the boundary lives in a high-dimensional space and has to be transformed to a lower dimension that can be displayed and interpreted by the experts.

However, a decision boundary can still be plotted by using t-SNE, reducing the dimensionality of the data in several steps. For example, if the data has 150 dimensions, it can first be reduced to 50 dimensions and then to 2.

The classes TSNE from sklearn.manifold and TruncatedSVD from sklearn.decomposition are used for this, as sketched below.
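
A hedged sketch of that two-step reduction (the 150-dimensional input here is random data, used only to show the shapes involved):

import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# Hypothetical high-dimensional data: 500 samples with 150 features
X_high = np.random.RandomState(0).rand(500, 150)

# Step 1: reduce 150 -> 50 dimensions with TruncatedSVD
X_50 = TruncatedSVD(n_components=50, random_state=0).fit_transform(X_high)

# Step 2: reduce 50 -> 2 dimensions with t-SNE for plotting
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X_50)
print(X_2d.shape)   # (500, 2): ready for a 2D scatter / decision-boundary plot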

A very nice research paper describing how to plot decision boundaries for higher-dimensional data is published here.

Visualizing the effect of model parameters on decision boundaries

Model Overview

Below you will see visualizations for four popular models – a logistic regression, support vector machine, decision tree, and random forest. Each chart will feature a plot of the distribution of the data (in two dimensions), colored by its class. In this case, the class refers to the target outcome – for binary classification that is a ‘1’ or ‘0’. By shading the decision boundaries in the background we can see how the different models are actually classifying.

By looking at these predictions we can see the distinct advantages and disadvantages of each model against different distributions. Each row of charts shows a new distribution of data. The 1st row is simplest – it is a generated dataset of two normally distributed 2D blobs where the prediction target is which blob the point belongs to. There is some overlap between the blobs that makes them difficult to classify, but most points are simple. The second row shows a similar distribution but with some correlation between the two dimensions. The third row shows an “opposing moons” shape, where the classifier must predict which moon shape each point belongs to – this is a significantly more challenging task for the algorithm. The fourth row is a concentric circles distribution where the target is which circle the points belong to. This task is also challenging. The final row is from a real dataset – the famous iris dataset, where the X and Y axes correspond to petal length and width respectively, and the prediction target is which of two particular kinds of flower the observation belongs to.

Logistic Regression

Each distribution shows the limitations and challenges of various algorithms. Logistic regression, for example, creates a linear decision boundary in all distributions. This is useful for some distributions – such as in the 1st, 2nd, and 5th rows. However, the 3rd and 4th rows show its limitations. The 3rd row is generated data in the form of opposing arcs – the linear decision boundary has a difficult time separating these points. The 4th row shows concentric circles – the logistic regression struggles even more. On the concentric circles data the logistic regression is clearly unable to properly fit the data.

Support Vector Machine

In opposition to the logistic regression, the support vector machine performs well on most of the distributions. It is able to fit a linear boundary for rows 1, 2, and 5, and also circular and curved boundaries for rows 3 and 4. This is thanks to the kernel trick, a piece of mathematics far too involved for this blog, which allows the SVM to fit its classification boundary in a higher-dimensional space. Do not worry if that explanation does not make sense – SVMs are tricky to understand at a low level. At a high level, the most important insight is that the SVM is able to fit a smooth decision boundary to many non-linear distributions.

Decision Tree

The decision tree seems to fit well to each distribution, but there are two important limitations that should be noted. Firstly, the decision tree only fits decision boundaries that are orthogonal to the axes (axis-aligned splits). Because of this, it has trouble fitting oblique and non-linear relationships. You can see this in its classification of the concentric circles distribution: even though the true boundary is circular, the decision function fits a right-angled, polygon-like shape.

Secondly, the decision tree has a tendency to overfit. For example, its classification of the opposing moons distribution features two thin strips in its boundary – these are fitted to one training instance (or point in the chart). But if the same decision tree were given another dataset with the same distribution, these thin strips would almost certainly not match the new data. Thus we can visualize that the decision tree is overfitting.

Random Forest

The decision boundary for the random forest looks roughly like a smoothed decision tree. That is because a random forest is just an ensemble of decision trees (with some introduced randomness). These decision trees are trained on random subsets of the data, and so learn somewhat different boundaries; their predictions are combined to form the random forest (this is a slight simplification, but a good enough intuition – see the sketch below). Because of this, the random forest generalizes much better than the decision tree does. Although in these distributions it still looks like it is overfitting, some tweaking of parameters can improve it significantly.
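
A rough, hand-rolled sketch of that bagging intuition (this is not how sklearn’s RandomForestClassifier is implemented internally; it reuses the X, y arrays and the plot_decision_boundary helper from the real-world example above):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
trees = []
for _ in range(25):
    # bootstrap sample: a random subset of the data, drawn with replacement
    idx = rng.randint(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(max_features=1).fit(X[idx], y[idx]))

def bagged_predict(points):
    # average the per-tree probabilities: the ensemble boundary is a smoothed
    # blend of many individual, axis-aligned tree boundaries
    probs = np.mean([t.predict_proba(points)[:, 1] for t in trees], axis=0)
    return (probs >= 0.5).astype(int)

plot_decision_boundary(bagged_predict)
plt.title("Hand-rolled bagged trees")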

Visualizing Parameters

Now that I have explained the models’ strengths and weaknesses, I will show some ways to remedy them with parameter tuning. Hopefully, this will give an intuitive understanding of how those parameters affect predictions.

Logistic Regression Parameters

As explained above, a significant weakness of logistic regression was that it could only create linear decision boundaries. One way of remedying this is by transforming the data. Creating polynomial features of the data, for example, can allow the logistic regression to fit to nonlinear relationships. As a demonstration, if we had two variables A and B, a two-degree polynomial transform would result in the creation of variables A, B, A², AB, B². For higher order polynomials this expansion can become very long. While this is not a parameter, it is a common method of improving logistic regressions. Logistic regression on polynomial features is visualized below.
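
A hedged sketch of that transformation with scikit-learn (the degree and C values are illustrative; it reuses X, y and plot_decision_boundary from the earlier steps):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

# degree-3 polynomial expansion of the two features, then a logistic regression on top
poly_logreg = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),
    LogisticRegression(C=1.0, max_iter=1000),
)
poly_logreg.fit(X, y)

# the pipeline predicts on raw 2D points, so it plugs straight into the earlier helper
plot_decision_boundary(lambda pts: poly_logreg.predict(pts))
plt.title("Logistic Regression on degree-3 polynomial features")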

We can see from this visualization that higher-degree polynomials are better able to represent the arc shape of the opposing moons distribution. A degree 3 polynomial fits the distribution well. However it is often difficult to choose the right degree for polynomial transformation. As shown in the bottom right, a degree of 20 will likely overfit the distribution – that is the decision boundary fits too specifically to the training data and will not generalize to other distributions.

The above visualizations also introduce C, which is a regularization parameter. Regularization is a way of combatting overfitting. Here, a lower C means a more regularized model (one that fits less). Another way of explaining this is by the bias-variance trade-off: regularization creates higher bias but lower variance, so the model overfits less (but potentially underfits). As you can see in the charts above, a good value of C helps correct the model even when a high polynomial degree is chosen.

Support Vector Machine Parameters

The most important parameter choice for an SVM is the kernel function. The above chart showed a Gaussian kernel function – this is the most popular choice for kernel function and is the main reason why SVMs are so popular. This is a piece of maths that is far too complicated for me to explain here, but I encourage you to do more research and become acquainted with it. The other parameter that will be visualized is the SVM’s C. This is an SVM regularization parameter that works in a similar way to C for logistic regression (though this similarity is only in effect, not in their mathematical implementation).

As we can see above, the Gaussian kernel is able to fit the shape of the data. It fits particularly well with a medium amount of regularization – C=0.01 clearly underfits, whereas C=100 might be overfitting slightly. C=1 is just right! In this case, the polynomial and sigmoid kernels do not fit the shape of the data well. However, using a polynomial or sigmoid kernel introduces other parameters that aren’t explored here (such as a polynomial degree and constant coefficient), so these models could be fitted better to the data with more parameter tuning.
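
A small sketch of trying those kernel and C choices (the values are illustrative, and it again reuses X, y and the plotting helper from earlier):

from sklearn.svm import SVC

# compare kernels and regularization strengths on the same data
for kernel in ["rbf", "poly", "sigmoid"]:
    for C in [0.01, 1, 100]:
        svm_clf = SVC(kernel=kernel, C=C, gamma="scale").fit(X, y)
        print(f"kernel={kernel:8s} C={C:>6}: train accuracy = {svm_clf.score(X, y):.2f}")

# plot one of them, e.g. the Gaussian (RBF) kernel with moderate regularization
svm_rbf = SVC(kernel="rbf", C=1, gamma="scale").fit(X, y)
plot_decision_boundary(lambda pts: svm_rbf.predict(pts))
plt.title("SVM, RBF kernel, C=1")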

Decision Tree Parameters

The only parameter of the decision tree that I will explore here is the maximum depth of the tree. A decision tree works by branching the data into leaves, each of which holds a portion of the data. The decision tree algorithm aims to make each leaf as ‘pure’ as possible – that is, having the highest possible proportion of a single class. Different algorithms do this in different ways. The maximum depth of the tree defines the maximum number of times the tree will split. This is shown below (note that max_depth=None means the tree has no maximum constraint on its depth).

As we can see, smaller values of maximum depth regularize the decision tree. However, the decision tree tends to underfit when it is not overfitting. No parameters can fix the limitation of its decision boundary being necessarily orthogonal to the axes. Perhaps a better way to fix this is to use a random forest!
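
A hedged sketch of varying max_depth (the depths are illustrative; X, y and plot_decision_boundary are reused from the earlier steps):

from sklearn.tree import DecisionTreeClassifier

# shallower trees = stronger regularization; max_depth=None lets the tree grow until its leaves are pure
for depth in [1, 3, 5, None]:
    tree_clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y)
    print(f"max_depth={depth}: train accuracy = {tree_clf.score(X, y):.2f}")

tree3 = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
plot_decision_boundary(lambda pts: tree3.predict(pts))
plt.title("Decision Tree, max_depth=3")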

Random Forest Parameters

Random forests possess all the parameters of decision trees and more. This is because they are ensembles of decision trees and so the decision tree parameters apply to the trees that will be combined into the random forest. Below I visualize the same max_depth parameter applied to a random forest. I also visualize the number of estimators parameter – this controls how many trees should be ensembled to make the random forest.

While it is a bit more subtle than for the decision tree, the max_depth parameter also helps to regularize the whole random forest – we can see that where max_depth=None there is a small brown section in the upper-left that is likely overfitting. When max_depth=3 this section disappears. The number of trees parameter has a more subtle effect – it does not change the shape of the boundary much, but it introduces more trees that average over the data, creating shapes that look slightly more “rounded”. In this way, the random forest is able to create a decision boundary that looks curved in some places, even though it is created from a combination of orthogonal boundaries! As a note, other ensembles of tree algorithms are very popular and have decision functions that take similar shapes to a random forest – these include AdaBoost and Gradient Boosting algorithms which are considered some of the most powerful machine learning algorithms.
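
A sketch of those two parameters varied together (values illustrative; same reuse of X, y and the plotting helper):

from sklearn.ensemble import RandomForestClassifier

# more trees smooth the boundary; limiting depth regularizes each individual tree
for n_estimators in [10, 100]:
    for max_depth in [3, None]:
        rf_clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth,
                                        random_state=1).fit(X, y)
        print(f"n_estimators={n_estimators:3d} max_depth={max_depth}: "
              f"train accuracy = {rf_clf.score(X, y):.2f}")

rf_small = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=1).fit(X, y)
plot_decision_boundary(lambda pts: rf_small.predict(pts))
plt.title("Random Forest, 100 trees, max_depth=3")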

Bonus Section – Test your understanding

Question

Given below are 6 decision boundaries. The decision boundaries are the violet lines; dots and crosses are two different classes. We have to decide which plot corresponds to each of the following:

  • Linear SVM
  • Kernelized SVM (Polynomial kernel of order 2)
  • Perceptron
  • Logistic Regression
  • Neural Network (1 hidden layer with 10 rectified linear units)
  • Neural Network (1 hidden layer with 10 tanh units)

Solution

The first thing that comes to mind is the division between linear and non-linear classifiers. Three classifiers are linear (linear SVM, perceptron, and logistic regression) and three plots show a linear decision boundary (A, B, C). So let’s start with those.

Linear

The most salient linear plot is plot B because it has a line with a slope. This is odd for logistic regression and SVM because they could reduce their loss further with a flat line (i.e. by being further away from all the points). Thus, plot B is the perceptron. Since the perceptron output is either 0 or 1, all the solutions that separate one class from the other are equally good. That is why it does not improve any further.

The difference between plots A and C is more subtle. The decision boundary is slightly lower in plot A. An SVM has a fixed number of support vectors while the loss function of logistic regression is determined by all the points. Since there are more red crosses than blue dots logistic regression avoids the red crosses more than the blue dots. The linear SVM just tries to be as far away from the red support vectors as from the blue support vectors. That’s why plot A is the decision boundary of logistic regression and plot C is made using a linear SVM.

Non-linear

Let’s continue with the non-linear plots and classifiers. Plot F is probably the ReLU NN since it has the sharpest boundaries. A ReLU unit becomes active as soon as its input exceeds 0, and each activation makes the output follow a different linear segment. If you look really closely you can spot about 8 changes of direction in the line, so probably 2 of the 10 units have little impact on the final outcome. So plot F is the ReLU NN.

On to the last two. Both a tanh NN and the polynomially kernelized SVM can have multiple boundaries. Plot D is obviously the worse classification. A tanh NN could improve on this situation by bending the curves differently and putting more blue or red points in the outer regions. However, this plot is kind of strange. We guess the upper-left part is classified as red and the lower-right part as blue, but how is the middle part classified? It should be red or blue, but then one of the decision boundaries shouldn’t be drawn. The only possible option is that the outer parts are classified as one color and the inner part as the other, which is strange and really bad. So we are not sure about this one.

Let’s look at plot E. It has both curved and straight boundary segments. For a degree-2 kernelized SVM it is difficult (close to impossible) to have a straight-line decision boundary, since the squared term gradually favors one of the two classes. The tanh activation functions, however, can saturate so that most hidden units sit at their extreme values; if only one unit then takes an intermediate value, say 0.5, you can get an approximately straight decision boundary. So we would say that plot E is the tanh NN and thus plot D is the kernelized SVM. Too bad for the poor old SVM though.

Summary

A – Logistic Regression
B – Perceptron
C – Linear SVM
D – Kernelized SVM (Polynomial kernel of order 2)
E – Neural Network (1 hidden layer with 10 tanh units)
F – Neural Network (1 hidden layer with 10 rectified linear units)

Conclusions

In this article, we learned the role of the decision boundary in characterizing a classifier, built several classifier models and plotted their respective decision boundaries to select the best model, and saw that plotting a decision boundary for higher-dimensional data is a more complex task that can be handled with t-SNE by reducing the dimensionality of the data in several steps.

References:

http://www.subsubroutine.com/sub-subroutine/2016/2/15/understanding-machine-learning-techniques-by-the-decision-boundaries-they-are-capable-of

https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html

https://medium.com/@gaurav_bio/creating-visualizations-to-better-understand-your-data-and-models-part-2-28d5c46e956

https://medium.com/analytics-vidhya/decision-boundary-for-classifiers-an-introduction-cc67c6d3da0e

https://stats.stackexchange.com/questions/296375/quiz-tell-the-classifier-by-its-decision-boundary

https://www.thedataschool.com.au/sebastian-van-gerwen/an-intuitive-guide-to-the-differences-between-machine-learning-models/

https://python.plainenglish.io/decision-boundary-in-python-41ab212cb0e7
