Understanding Deep U-Nets for Semantic Segmentation: A salt identification case study with Keras
2022-07-19
Understanding DenseNet architecture with PyTorch code
2022-07-25
Show all

Partial Dependence Plots with Python code

17 mins read

What Are Partial Dependence Plots

Some people complain machine learning models are black boxes. These people will argue we cannot see how these models are working on any given dataset, so we can neither extract insight nor identify problems with the model.

By and large, people making this claim are unfamiliar with partial dependence plots. Partial dependence plots show how each variable or predictor affects the model’s predictions. This is useful for questions like:

  • How much wage differences between men and women are due solely to gender, as opposed to differences in educational backgrounds or work experience?
  • Controlling for house characteristics, what impact do longitude and latitude have on home prices? To restate this, we want to understand how similarly sized houses would be priced in different areas, even if the homes actually at these sites are different sizes.
  • Are health differences between the two groups due to differences in their diets, or due to other factors?

If you are familiar with linear or logistic regression models, partial dependence plots can be interpreted similarly to the coefficients in those models. But partial dependence plots can capture more complex patterns from your data, and they can be used with any model. A Partial Dependence Plot (PDP) is a useful tool for gaining insights into the relationship between features and predictions. It helps us understand how different values of a particular feature impact model’s predictions. In this post, we will learn the very basics of PDPs and familiarise ourselves with a few useful ways to plot them using Scikit-learn.

Data

We will use a subset of the titanic dataset (the data is available through Seaborn under the BSD-3 license) for the first section of this post. Let’s import libraries and load our dataset. Then, we will train a random forest model and evaluate its performance.

import numpy as np
import pandas as pd
# sklearn version: v1.0.1
from sklearn.model_selection import train_test_split
from sklearn.ensemble import (RandomForestClassifier, 
                              AdaBoostClassifier)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score
from sklearn.inspection import (partial_dependence, 
                                PartialDependenceDisplay)
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid', context='talk', palette='Set2')
columns = ['survived', 'pclass', 'age', 'sibsp', 'parch', 'fare', 
           'adult_male']
df = sns.load_dataset('titanic')[columns].dropna()
X = df.drop(columns='survived')
y = df['survived']
X_train, X_test, y_train, y_test =  train_test_split(
    X, y, random_state=42, test_size=.25
)
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
def evaluate(model, X_train, y_train, X_test, y_test):
    name = str(model).split('(')[0]
    print(f"========== {name} ==========")
    y_train_pred = model.predict_proba(X_train)[:,1]
    roc_auc_train = roc_auc_score(y_train, y_train_pred)
    print(f"Train ROC AUC: {roc_auc_train:.4f}")
    
    y_test_pred = model.predict_proba(X_test)[:,1]
    roc_auc_test = roc_auc_score(y_test, y_test_pred)
    print(f"Test ROC AUC: {roc_auc_test:.4f}")
    
evaluate(rf, X_train, y_train, X_test, y_test)

Now, let’s learn the basics of PDPs.

Introduction to Partial Dependence Plots

Let’s start by looking at a PDP as an example. We will plot one for pclass using PartialDependenceDisplay:

var = 'pclass'
PartialDependenceDisplay.from_estimator(rf, X_train, [var]);

PDPs show the average effect on predictions as the value of feature changes.

In the plot above, the vertical axis shows the predicted probability, and the horizontal axis shows pclass values. The green line captures how the average predicted probability changes as the pclass values change. We see that the average survival probability decreases as passenger class increases from 1 to 3.

To better understand PDPs, let’s briefly look into how one can manually build the previous plot. We will first find the unique values of pclass. Then️ for each unique value, we will replace values in pclass column in the training data with it and record how predictions change.

values = X_train[var].sort_values().unique()
print(f"Unique values: {values}")
individual = np.empty((len(X_train), len(values)))
for i, value in enumerate(values):
    X_copy = X_train.copy()
    X_copy[var] = value
    individual[:, i] = rf.predict_proba(X_copy)[:, 1]
individual

Here we see how individual predictions (a.k.a. individual conditional expectation, ICE) for each record in the training dataset will change if we change the value of pclass. By averaging these predictions (a.k.a. partial dependence, PD), we get the inputs for PDPs.

individual.mean(axis=0)

By plotting these values along with the unique values of pclass, we can reproduce the PDP. Plotting from raw values ourselves gives us more flexibility and control over how we visualize the PDP compared to using PartialDependenceDisplay.

sns.lineplot(x=values, y=individual.mean(axis=0), style=0, 
             markers=True, legend=False)
plt.ylim(0.2,0.6)
plt.ylabel("Partial dependence")
plt.xlabel(var);

Manual calculation like the above is great for learning and understanding concepts, however, it’s not practical to continue using the approach in real use cases. In practice, we will use Scikit-learn’s more efficient partial_dependence function to extract raw values.

raw_values = partial_dependence(rf, X_train, var, kind='both')
raw_values

Here, we specified kind='both' to see the individual predictions as well the average predictions. If we are only after the average predictions, we can use kind='average':

partial_dependence(rf, X_train, var, kind='average')

Of note, average predictions from partial_dependence(…, kind='both') and partial_dependence(…, kind='average') may not always match exactly for some machine learning algorithms where more efficient recursion method is available for the latter.

Let’s check if our manually calculated values match the Scikit-learn’s version:

print(np.array_equal(raw_values['individual'][0], individual))
print(np.isclose(raw_values['average'][0], 
                 np.mean(individual, axis=0)))

PartialDependenceDisplay allows us to plot a subset of individual predictions along with the average to get a better sense of the data:

n = 50
PartialDependenceDisplay.from_estimator(
    rf, X_train, ['pclass'], kind="both", n_jobs=3, subsample=n
)
plt.legend(bbox_to_anchor=(1,1));

This provides more context. We can reproduce a similar graph ourselves from the raw values:

sns.lineplot(x=values, y=individual.mean(axis=0), style=0, 
             markers=True, legend=False)
sns.lineplot(data=pd.DataFrame(individual, columns=values)\
                    .sample(n).reset_index().melt('index'), 
             x='variable', y='value', style='index', dashes=False, 
             legend=False, alpha=0.1, size=1, color='#63C1A4')
plt.ylabel("Partial dependence")
plt.xlabel(var);

For discrete variables like pclass, we don’t have to limit ourselves to a line plot and can even use a bar plot and since we have the full freedom to build any chart from the raw values:

raw_df = pd.DataFrame(raw_values['individual'][0], 
                      columns=raw_values['values'])
sns.barplot(data=raw_df.melt(var_name=var), x=var, y='value')
plt.ylabel("Partial dependence");

It’s likely that we would want to look at PDPs for multiple variables. Having learned the basics, let’s look at a few ways to plot PDPs for multiple variables.

PDPs for multiple variables

Since our toy dataset has a handful of features, let’s plot PDPs for each feature. We will first use PartialDependenceDisplay:

n_cols = 2
n_rows = int(len(X_train.columns)/n_cols)
fig, ax = plt.subplots(n_rows, n_cols, figsize=(10, 12))
PartialDependenceDisplay.from_estimator(rf, X_train, X_train.columns, ax=ax, n_cols=n_cols)
fig.suptitle('Partial Dependence Plots')
fig.tight_layout();

From these plots, we can see the type of relationship between a feature and a prediction. Some relationships look linear whereas others are more complex.

Now, let’s plot PDPs from the raw values extracted with partial_dependence:

fig, ax = plt.subplots(n_rows, n_cols, figsize=(10,12), sharey=True)
for i, x in enumerate(X_train.columns):
    raw_values = partial_dependence(rf, X_train, i, kind='average')
    loc = i//n_cols, i%n_cols
    sns.lineplot(x=raw_values['values'][0], 
                 y=raw_values['average'][0], ax=ax[loc], style=0, 
                 markers=True, legend=False)
    ax[loc].set_xlabel(x)
    if i%n_cols==0:
        ax[loc].set_ylabel('Partial dependence')
fig.suptitle('Partial Dependence Plots')
fig.tight_layout()

Alternatively, we can also plot a subset of the individual predictions to give us a bit more context behind the average values:

plt.figure(figsize=(10,12))
for i, x in enumerate(X_train.columns):
    raw_values = partial_dependence(rf, X_train, i, kind='both')
    ax = plt.subplot(n_rows, n_cols, i+1)
    sns.lineplot(x=raw_values['values'][0], y=raw_values['average'][0], 
                 style=0, markers=True, legend=False, ax=ax)
    sns.lineplot(data=pd.DataFrame(raw_values['individual'][0], 
                                   columns=raw_values['values'][0])\
                        .sample(n).reset_index().melt('index'), 
                 x='variable', y='value', style='index', dashes=False, 
                 legend=False, alpha=0.1, size=1, color='#63C1A4')
    ax.set_xlabel(x)
    ax.set_ylabel('Partial dependence')
plt.suptitle('Partial Dependence Plots')
plt.tight_layout()

These plots help us understand the relationship between features and their effect on the target prediction and sense check if the patterns the model learned are sensible and explainable. PDPs can also be used to intuitively assess and compare models. In the next section, we will look at how we can plot PDPs for multiple models.

PDPs for multiple models

Let’s build two more models and extract the raw values for all three models:

pclass_df = pd.DataFrame(columns=values)
pclass_df.loc['rf'] = partial_dependence(
    rf, X_train, var, kind='average'
)['average'][0]
ada = AdaBoostClassifier(random_state=42)
ada.fit(X_train, y_train)
evaluate(ada, X_train, y_train, X_test, y_test)
pclass_df.loc['ada'] = partial_dependence(
    ada, X_train, var, kind='average'
)['average'][0]
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
evaluate(knn, X_train, y_train, X_test, y_test)
pclass_df.loc['knn'] = partial_dependence(
    knn, X_train, var, kind='average'
)['average'][0]
pclass_df

Now, we can plot partial dependence for each model:

pclass_df = pclass_df.reset_index().melt('index')
sns.lineplot(data=pclass_df, x='variable', y='value', 
             hue='index');
sns.scatterplot(data=pclass_df, x='variable', y='value', 
                hue='index', legend=False)
plt.legend(bbox_to_anchor=(1, 1))
plt.ylabel("Partial dependence")
plt.xlabel(var);

For AdaBoost and K-nearest neighbors classifiers, the predicted probabilities are almost indifferent to the passenger class.

Let’s now plot a similar comparison for all variables:

summary = {}
fig, ax = plt.subplots(n_rows, n_cols, figsize=(10,12), sharey=True)
for i, x in enumerate(X_train.columns):
    summary[x] = pd.DataFrame(columns=values)
    raw_values = partial_dependence(rf, X_train, x, kind='average')
    summary[x] = pd.DataFrame(columns=raw_values['values'][0])
    summary[x].loc['rf'] = raw_values['average'][0]
    summary[x].loc['ada'] = partial_dependence(
        ada, X_train, x, kind='average'
    )['average'][0]
    summary[x].loc['knn'] = partial_dependence(
        knn, X_train, x, kind='average'
    )['average'][0]
    
    data = summary[x].reset_index().melt('index')
    loc = i//n_cols, i%n_cols
    if i==1:
        sns.lineplot(data=data, x='variable', y='value', 
                     hue='index',ax=ax[loc]);
        ax[loc].legend(bbox_to_anchor=(1, 1));
    else: 
        sns.lineplot(data=data, x='variable', y='value', 
                     hue='index', ax=ax[loc], legend=False);
    sns.scatterplot(data=data, x='variable', y='value', 
                    hue='index', ax=ax[loc], legend=False)
    ax[loc].set_xlabel(x)
    if i%n_cols==0:
        ax[loc].set_ylabel('Partial dependence')
fig.suptitle('Partial Dependence Plots')
fig.tight_layout()

Looking at PDPs by the different models can help choose a model that is more sensible and explainable.

Another Example: Melbourne Housing dataset

We’ll start with 2 partial dependence plots showing the relationship (according to our model) between Price and a couple of variables from the Melbourne Housing dataset. We’ll walk through how these plots are created and interpreted.

import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
from sklearn.ensemble.partial_dependence import partial_dependence, plot_partial_dependence
from sklearn.preprocessing import Imputer

cols_to_use = ['Distance', 'Landsize', 'BuildingArea']

def get_some_data():
    data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')
    y = data.Price
    X = data[cols_to_use]
    my_imputer = Imputer()
    imputed_X = my_imputer.fit_transform(X)
    return imputed_X, y
    

X, y = get_some_data()
my_model = GradientBoostingRegressor()
my_model.fit(X, y)
my_plots = plot_partial_dependence(my_model, 
                                   features=[0,2], 
                                   X=X, 
                                   feature_names=cols_to_use, 
                                   grid_resolution=10)


The left plot shows the partial dependence between our target, Sales Price, and the distance variable. Distance in this dataset measures the distance to Melbourne’s central business district.

The partial dependence plot is calculated only after the model has been fit. The model is fit on real data. In that real data, houses in different parts of town may differ in myriad ways (different ages, sizes, etc.) But after the model is fit, we could start by taking all the characteristics of a single house. Say, a house with 2 bedrooms, 2 bathrooms, a large lot, an age of 10 years, etc.

We then use the model to predict the price of that house, but we change the distance variable before making a prediction. We first predict the price for that house at when sitting distance to 4. We then predict its price setting distance to 5. Then predict again for 6. And so on. We trace out how predicted price changes (on the vertical axis) as we move from small values of distance to large values (on the horizontal axis).

In this description, we used only a single house. But because of interactions, the partial dependence plot for a single house may be atypical. So, instead, we repeat that mental experiment with multiple houses, and we plot the average predicted price on the vertical axis. You’ll see some negative numbers. That doesn’t mean the price would sell for a negative price. Instead, it means the prices would have been less than the actual average price for that distance.

In the left graph, we see house prices fall as we get further from the central business distract. Though there seems to be a nice suburb about 16 kilometers out, where home prices are higher than many nearer and further suburbs.

The right graph shows the impact of the building area, which is interpreted similarly. A larger building area means higher prices.

These plots are useful both to extract insights, as well as to sanity check that your model is learning something you think is sensible.

We won’t focus on the code to load the data, just the code to make the plot.

def get_some_data():
    cols_to_use = ['Distance', 'Landsize', 'BuildingArea']
    data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')
    y = data.Price
    X = data[cols_to_use]
    my_imputer = Imputer()
    imputed_X = my_imputer.fit_transform(X)
    return imputed_X, y
    
from sklearn.ensemble.partial_dependence import partial_dependence, plot_partial_dependence

# get_some_data is defined in hidden cell above.
X, y = get_some_data()
# scikit-learn originally implemented partial dependence plots only for Gradient Boosting models
# this was due to an implementation detail, and a future release will support all model types.
my_model = GradientBoostingRegressor()
# fit the model as usual
my_model.fit(X, y)
# Here we make the plot
my_plots = plot_partial_dependence(my_model,       
                                   features=[0, 2], # column numbers of plots we want to show
                                   X=X,            # raw predictors data.
                                   feature_names=['Distance', 'Landsize', 'BuildingArea'], # labels on graphs
                                   grid_resolution=10) # number of values to plot on x axis


Some tips related to plot_partial_dependence:

  • The features are the column numbers from the X array or dataframe that you wish to have plotted. This starts to look bad beyond 2 or 3 variables. You could make repeated calls to plot 2 or 3 at a time.
  • There are options to establish what points on the horizontal axis are plotted. The simplest is grid_resolution which we use to determine how many different points are plotted. These plots tend to look jagged as that value increases because you will pick up lots of randomness or noise in your model. It’s best not to take the small or jagged fluctuations too literally. Smaller values of grid_resolution smooth this out. It’s also much less of an issue for datasets with many rows.
  • There is a function called partial_dependence to get the raw data making up this plot, rather than making the visual plot itself. This is useful if you want to control how it is visualized using a plotting package like Seaborn. With moderate effort, you could make much nicer-looking plots.

Advantages

The computation of partial dependence plots is intuitive: The partial dependence function at a particular feature value represents the average prediction if we force all data points to assume that feature value. In my experience, lay people usually understand the idea of PDPs quickly.

If the feature for which you computed the PDP is not correlated with the other features, then the PDPs perfectly represent how the feature influences the prediction on average. In the uncorrelated case, the interpretation is clear: The partial dependence plot shows how the average prediction in your dataset changes when the j-th feature is changed. It is more complicated when features are correlated, see also disadvantages.

Partial dependence plots are easy to implement.

The calculation for the partial dependence plots has a causal interpretation. We intervene in a feature and measure the changes in the predictions. In doing so, we analyze the causal relationship between the feature and the prediction. The relationship is causal for the model – because we explicitly model the outcome as a function of the features – but not necessarily for the real world!

Disadvantages

The realistic maximum number of features in a partial dependence function is two. This is not the fault of PDPs, but of the 2-dimensional representation (paper or screen) and also of our inability to imagine more than 3 dimensions.

Some PD plots do not show the feature distribution. Omitting the distribution can be misleading because you might overinterpret regions with almost no data. This problem is easily solved by showing a rug (indicators for data points on the x-axis) or a histogram.

The assumption of independence is the biggest issue with PD plots. It is assumed that the feature(s) for which the partial dependence is computed are not correlated with other features. For example, suppose you want to predict how fast a person walks, given the person’s weight and height. For the partial dependence of one of the features, e.g. height, we assume that the other features (weight) are not correlated with height, which is obviously a false assumption. For the computation of the PDP at a certain height (e.g. 200 cm), we average over the marginal distribution of weight, which might include a weight below 50 kg, which is unrealistic for a 2-meter person. In other words: When the features are correlated, we create new data points in areas of the feature distribution where the actual probability is very low (for example it is unlikely that someone is 2 meters tall but weighs less than 50 kg). One solution to this problem is Accumulated Local Effect plots or short ALE plots that work with the conditional instead of the marginal distribution.

Heterogeneous effects might be hidden because PD plots only show the average marginal effects. Suppose that for a feature half your data points have a positive association with the prediction – the larger the feature value the larger the prediction – and the other half has a negative association – the smaller the feature value the larger the prediction. The PD curve could be a horizontal line since the effects of both halves of the dataset could cancel each other out. You then conclude that the feature has no effect on the prediction. By plotting the individual conditional expectation curves instead of the aggregated line, we can uncover heterogeneous effects.

Conclusion

Partial dependence plots are a great way (though not the only way) to extract insights from complex models. These can be incredibly powerful for communicating those insights to colleagues or non-technical users. There are a variety of opinions on how to interpret these plots when they come from non-experimental data. Some claim you can conclude nothing about cause-and-effect relationships from data unless it comes from experiments. Others are more positive about what can be learned from non-experimental data (also called observational data). It’s a divisive topic in the data science world, beyond the scope of this tutorial. However, most agree that these are useful to understand your model. Also, given the messiness of most real-world data sources, it’s also a good sanity check that your model is capturing realistic patterns.

Partial dependence plots provide insights into how predictions can be impacted by the change in features. One disadvantage of PDPs is that it assumes that features are independent of each other. While we have looked at classification use-case, PDPs can be also used for regression. In this post, we have focused on the simplest form of PDPs: 1-way PDPs. For keen learners who are curious to learn more about the PDPs, 2-way and/or 3-way PDPs that can provide insights on the interaction between features are interesting topics to look into.

Resources:

https://towardsdatascience.com/partial-dependence-plots-with-scikit-learn-966ace4864fc

https://www.kaggle.com/code/dansbecker/partial-dependence-plots/notebook

https://christophm.github.io/interpretable-ml-book/pdp.html

https://medium.com/dataman-in-ai/how-is-the-partial-dependent-plot-computed-8d2001a0e556

Partial Dependence Plots (Opening the Black Box)

Amir Masoud Sefidian
Amir Masoud Sefidian
Machine Learning Engineer

Comments are closed.