2022-08-14
###### Performing A/B test in Python example – A case study from Udacity Data Scientist Nano Degree
2022-08-24

Regressions are one of the most commonly used tools in a data scientist’s kit. The quality of a regression model is how well its predictions match up against actual values, but how do we actually evaluate quality? Luckily, smart statisticians have developed error metrics to judge the quality of a model and enable us to compare regressions against other regressions with different parameters. These metrics are short and useful summaries of the quality of our data. This article will dive into four common regression metrics and discuss their use cases. There are many types of regression, but this article will focus exclusively on metrics related to linear regression.

Linear regression is the most commonly used model in research and business and is the simplest to understand, so it makes sense to start developing your intuition on how they are assessed. The intuition behind many of the metrics we’ll cover here extends to other types of models and their respective metrics.

## A primer on linear regression

In the context of regression, models refer to mathematical equations used to describe the relationship between two variables. In general, these models deal with the prediction and estimation of values of interest in our data called outputs. Models will look at other aspects of the data called inputs that we believe affect the outputs and use them to generate estimated outputs.

These inputs and outputs have many names that you may have heard before. Inputs can also be called independent variables or predictors, while outputs are also known as responses or dependent variables. Simply speaking, models are just functions where the outputs are some function of the inputs. The linear part of linear regression refers to the fact that a linear regression model is described mathematically in the form:  If that looks too mathematical, take solace in that linear thinking is particularly intuitive. If you’ve ever heard of “practice makes perfect,” then you know that more practice means better skills; there is some linear relationship between practice and perfection. The regression part of linear regression does not refer to some return to a lesser state. Regression here simply refers to the act of estimating the relationship between our inputs and outputs. In particular, regression deals with the modeling of continuous values (think: numbers) as opposed to discrete states (think: categories).

Taken together, a linear regression creates a model that assumes a linear relationship between the inputs and outputs. The higher the inputs are, the higher (or lower, if the relationship was negative) the outputs are. What adjusts how strong the relationship is and what the direction of this relationship is between the inputs and outputs are our coefficients. The first coefficient without input is called the intercept, and it adjusts what the model predicts when all your inputs are 0. We will not delve into how these coefficients are calculated, but know that there exists a method to calculate the optimal coefficients, given which inputs we want to use to predict the output.

Given the coefficients, if we plug in values for the inputs, the linear regression will give us an estimate of what the output should be. As we’ll see, these outputs won’t always be perfect. Unless our data is a perfectly straight line, our model will not precisely hit all of our data points. One of the reasons for this is the ϵ (named “epsilon”) term. This term represents an error that comes from sources out of our control, causing the data to deviate slightly from its true position. Our error metrics will be able to judge the differences between prediction and actual values, but we cannot know how much the error has contributed to the discrepancy. While we cannot ever completely eliminate epsilon, it is useful to retain a term for it in a linear model.

## Comparing model predictions against reality

Since our model will produce an output given any input or set of inputs, we can then check these estimated outputs against the actual values that we tried to predict. We call the difference between the actual value and the model’s estimate a residual. We can calculate the residual for every point in our data set, and each of these residuals will be of use in the assessment. These residuals will play a significant role in judging the usefulness of a model.

If our collection of residuals is small, it implies that the model that produced them does a good job at predicting our output of interest. Conversely, if these residuals are generally large, it implies that the model is a poor estimator. We technically can inspect all of the residuals to judge the model’s accuracy, but unsurprisingly, this does not scale if we have thousands or millions of data points. Thus, statisticians have developed summary measurements that take our collection of residuals and condense them into a single value that represents the predictive ability of our model. There are many of these summary statistics, each with its own advantages and pitfalls. For each, we’ll discuss what each statistic represents, their intuition, and the typical use case. We’ll cover:

• Mean Absolute Error
• Mean Square Error
• Mean Absolute Percentage Error
• Mean Percentage Error

Note: Even though you see the word error here, it does not refer to the epsilon term from above! The error described in these metrics refers to the residuals!

## Staying rooted in real data

In discussing these error metrics, it is easy to get bogged down by the various acronyms and equations used to describe them. To keep ourselves grounded, we’ll use a model that I’ve created using the Video Game Sales Data Set from Kaggle. The specifics of the model I’ve created are shown below.  My regression model takes in two inputs (critic score and user score), so it is a multiple variable linear regression. The model took in my data and found that 0.039 and -0.099 were the best coefficients for the inputs.

For my model, I chose my intercept to be zero since I’d like to imagine there’d be zero sales for scores of zero. Thus, the intercept term is crossed out. Finally, the error term is crossed out because we do not know its true value in practice. I have shown it because it depicts a more detailed description of what information is encoded in the linear regression equation.

## The rationale behind the model

Let’s say that I’m a game developer who just created a new game, and I want to know how much money I will make. I don’t want to wait, so I developed a model that predicts total global sales (my output) based on an expert critic’s judgment of the game and general player judgment (my inputs). If both critics and players love the game, then I should make more money… right? When I actually get the critic and user reviews for my game, I can predict how much glorious money I’ll make. Currently, I don’t know if my model is accurate or not, so I need to calculate my error metrics to check if I should perhaps include more inputs or if my model is even any good!

## Mean Absolute Error (MAE)

The mean absolute error (MAE) is the simplest regression error metric to understand. We’ll calculate the residual for every data point, taking only the absolute value of each so that negative and positive residuals do not cancel out. We then take the average of all these residuals. Effectively, MAE describes the typical magnitude of the residuals. The formal equation is shown below:  The picture below is a graphical description of the MAE. The green line represents our model’s predictions, and the blue points represent our data.

The MAE is also the most intuitive of the metrics since we’re just looking at the absolute difference between the data and the model’s predictions. Because we use the absolute value of the residual, the MAE does not indicate underperformance or overperformance of the model (whether or not the model under or overshoots actual data). Each residual contributes proportionally to the total amount of error, meaning that larger errors will contribute linearly to the overall error. Like we’ve said above, a small MAE suggests the model is great at prediction, while a large MAE suggests that your model may have trouble in certain areas. An MAE of 0 means that your model is a perfect predictor of the outputs (but this will rarely happen).

While the MAE is easily interpretable, using the absolute value of the residual often is not as desirable as squaring this difference. Depending on how you want your model to treat outliers, or extreme values, in your data, you may want to bring more attention to these outliers or downplay them. The issue of outliers can play a major role in which error metric you use.

## Calculating MAE against our model

Calculating MAE is relatively straightforward in Python. In the code below, `sales` contains a list of all the sales numbers, and `X` contains a list of tuples of size 2. Each tuple contains the critic score and user score corresponding to the sale in the same index. The `lm` contains a `LinearRegression` object from scikit-learn, which I used to create the model itself. This object also contains the coefficients. The `predict` method takes in inputs and gives the actual prediction based on those inputs.

```# Perform the intial fitting to get the LinearRegression object
from sklearn import linear_model
lm = linear_model.LinearRegression()
lm.fit(X, sales)

mae_sum = 0
for sale, x in zip(sales, X):
prediction = lm.predict(x)
mae_sum += abs(sale - prediction)
mae = mae_sum / len(sales)

print(mae)
>>> [ 0.7602603 ]
```

Our model’s MAE is 0.760, which is fairly small given that our data’s sales range from 0.01 to about 83 (in millions).

## Mean Square Error (MSE)

The mean square error (MSE) is just like the MAE but squares the difference before summing them all instead of using the absolute value. We can see this difference in the equation below.

## Consequences of the Square Term

Because we are squaring the difference, the MSE will almost always be bigger than the MAE. For this reason, we cannot directly compare the MAE to the MSE. We can only compare our model’s error metrics to those of a competing model. The effect of the square term in the MSE equation is most apparent with the presence of outliers in our data. While each residual in MAE contributes proportionally to the total error, the error grows quadratically in MSE. This ultimately means that outliers in our data will contribute to a much higher total error in the MSE than they would the MAE. Similarly, our model will be penalized more for making predictions that differ greatly from the corresponding actual value. This is to say that large differences between actual and predicted are punished more in MSE than in MAE. The following picture graphically demonstrates what an individual residual in the MSE might look like.  Outliers will produce these exponentially larger differences, and it is our job to judge how we should approach them.

## The problem of outliers

Outliers in our data are a constant source of discussion for the data scientists that try to create models. Do we include the outliers in our model creation or do we ignore them? The answer to this question is dependent on the field of study, the data set on hand, and the consequences of having errors in the first place. For example, I know that some video games achieve superstar status and thus have disproportionately higher earnings. Therefore, it would be foolish of me to ignore these outlier games because they represent a real phenomenon within the data set. I would want to use the MSE to ensure that my model takes these outliers into account more.

If I wanted to downplay their significance, I would use the MAE since the outlier residuals won’t contribute as much to the total error as MSE. Ultimately, the choice between MSE and MAE is application-specific and depends on how you want to treat large errors. Both are still viable error metrics but will describe different nuances about the prediction errors of your model.

## A note on MSE and a close relative

Another error metric you may encounter is the Root Mean Squared Error (RMSE). As the name suggests, it is the square root of the MSE. Because the MSE is squared, its units do not match that of the original output. Researchers will often use RMSE to convert the error metric back into similar units, making interpretation easier. Since the MSE and RMSE both square the residual, they are similarly affected by outliers. The RMSE is analogous to the standard deviation (MSE to variance) and is a measure of how large your residuals are spread out. Both MAE and MSE can range from 0 to positive infinity, so as both of these measures get higher, it becomes harder to interpret how well your model is performing. Another way we can summarize our collection of residuals is by using percentages so that each prediction is scaled against the value it’s supposed to estimate.

## Calculating MSE against our model

Like MAE, we’ll calculate the MSE for our model. Thankfully, the calculation is just as simple as MAE.

```mse_sum = 0
for sale, x in zip(sales, X):
prediction = lm.predict(x)
mse_sum += (sale - prediction)**2
mse = mse_sum / len(sales)

print(mse)
>>> [ 3.53926581 ]
```

With the MSE, we would expect it to be much larger than MAE due to the influence of outliers. We find that this is the case: the MSE is an order of magnitude higher than the MAE. The corresponding RMSE would be about 1.88, indicating that our model misses actual sale values by about \$1.8M.

## Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error (MAPE) is the percentage equivalent of MAE. The equation looks just like that of MAE, but with adjustments to convert everything into percentages.  Just as MAE is the average magnitude of error produced by your model, the MAPE is how far the model’s predictions are off from their corresponding outputs on average. Like MAE, MAPE also has a clear interpretation since percentages are easier for people to conceptualize. Both MAPE and MAE are robust to the effects of outliers thanks to the use of absolute value.

However, for all of its advantages, we are more limited in using MAPE than we are MAE. Many of MAPE’s weaknesses actually stem from the use of the division operation. Now that we have to scale everything by the actual value, MAPE is undefined for data points where the value is 0. Similarly, the MAPE can grow unexpectedly large if the actual values are exceptionally small themselves. Finally, the MAPE is biased towards predictions that are systematically less than the actual values themselves. That is to say, MAPE will be lower when the prediction is lower than the actual compared to a prediction that is higher by the same amount. The quick calculation below demonstrates this point.

We have a measure similar to MAPE in the form of the mean percentage error. While the absolute value in MAPE eliminates any negative values, the mean percentage error incorporates both positive and negative errors into its calculation.

## Should the MAPE be High or Low?

The MAPE is a commonly used measure in machine learning because of how easy it is to interpret. The lower the value for MAPE, the better the machine learning model is at predicting values. Inversely, the higher the value for MAPE, the worse the model is at predicting values.

For example, if we calculate a MAPE value of 20% for a given machine learning model, then the average difference between the predicted value and the actual value is 20%.

As a percentage, the error measurement is more intuitive to understand than other measures such as the mean square error. This is because many other error measurements are relative to the range of values. This requires you to jump through some additional mental hurdles to determine the scope of the error.

## What is a Good MAPE Score?

The MAPE returns a percentage, which can make it intuitive to understand. Because the percentage reflects the average percentage error, the lower the score the better.

Below, you’ll find some general guidelines on what a good MAPE score is:

Different interpretations of MAPE Scores

A MAPE score, like anything else in machine learning, should not be taken at face value. Keep in mind the range of your data (as lower ranges will amplify the MAPE) and the type of data you’re working with.

As you’ll learn in a later section, the MAPE does have some problems with some data, especially lower-volume data. Because of this, make sure you have a good sense of how your data is structured before making decisions using MAPE alone.

## Use Python to Calculate the MAPE Score from Scratch

It’s very simple to create a function for the MAPE using the built-in NumPy library.

Let’s see how we can do this:

```# Creating a Function for MAPE
import numpy as np

def mape(y_test, pred):
y_test, pred = np.array(y_test), np.array(pred)
mape = np.mean(np.abs((y_test - pred) / y_test))
return mape
```

Let’s break down what we did here:

1. We imported NumPy to simplify array operations
2. We defined a function, MAPE, that takes two arrays: the testing array and the predicted array
3. Both these arrays are converted into NumPy arrays
4. The MAPE is calculated using the formula above

Let’s run through a very simple machine learning example using a linear regression model in Scikit-Learn:

```# A practical example of MAPE in machine learning
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

def mape(y_test, pred):
y_test, pred = np.array(y_test), np.array(pred)
mape = np.mean(np.abs((y_test - pred) / y_test))
return mape

X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y)

lnr = LinearRegression()
lnr.fit(X_train, y_train)
predictions = lnr.predict(X_test)

print(mape(y_test, predictions))

# Returns: 0.339
```

In the example above, we created a simple machine learning model. The model predicted some values – these were stored in the `predictions` variable.

We tested the accuracy of our model by passing in our predictions and the actual values, `y_test` into our function, `mape()`This returned a value of 0.339, which is equal to 33.9%.

## Calculating the MAPE Using Sklearn

Scikit-Learn also comes with a function for the MAPE built-in, the `mean_absolute_percentage_error()` function from the `metrics` module.

Like our function above, the function takes the true values and the predicted values as input:

```# Using the mean_absolute_percentage_error function
from sklearn.metrics import mean_absolute_percentage_error

error = mean_absolute_percentage_error(y_true, predictions)
```

Let’s recreate our earlier example using this function:

```# A practical example of MAPE in sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error

X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y)

lnr = LinearRegression()
lnr.fit(X_train, y_train)
predictions = lnr.predict(X_test)

print(mean_absolute_percentage_error(y_test, predictions))

# Returns: 0.339
```

In the next section, you’ll learn about some common problems with the MAPE score.

## Common Problems with the MAPE score

While the MAPE is easy to understand, this simplicity can also lead to some problems. One of the major problems with the MAPE score is how easily it is influenced by values of a low range.

For example, a predicted value of 3 and a true value of 2 indicate an error of 50%. Meanwhile, the data are only 1 off. If the real value was 100 and the predicted value was 101, then the error would only be 1%.

This is where the matter of interpretation comes in. In the example above, a difference between the values of 2 and 3 may be insignificant (in which case the MAPE is a poor metric). However, the difference may actually be incredibly meaningful, in which case the MAPE is a good metric.

Keep in mind the context of your data when interpreting the score.

## Calculating MAPE against our model

```mape_sum = 0
for sale, x in zip(sales, X):
prediction = lm.predict(x)
mape_sum += (abs((sale - prediction))/sale)
mape = mape_sum/len(sales)

print(mape)
>>> [ 5.68377867 ]
```

We know for sure that there are no data points for which there are zero sales, so we are safe to use MAPE. Remember that we must interpret it in terms of percentage points. MAPE states that our model’s predictions are, on average, 5.6% off from the actual value.

• Expressed as a percentage, which is scale-independent and can be used for comparing forecasts on different scales. We should remember though that the values of MAPE may exceed 100%.
• Easy to explain to stakeholders.

## Shortcomings

• MAPE takes undefined values when there are zero values for the actuals, which can happen in, for example, demand forecasting. Additionally, it takes extreme values when the actuals are very close to zero.
• MAPE is asymmetric and it puts a heavier penalty on negative errors (when forecasts are higher than actuals) than on positive errors. This is caused by the fact that the percentage error cannot exceed 100% for forecasts that are too low. While there is no upper limit for the forecasts which are too high. As a result, MAPE will favor models that are under-forecast rather than over-forecast.
• MAPE assumes that the unit of measurement of the variable has a meaningful zero value. So while forecasting demand and using MAPE makes sense, it does not when forecasting temperature expressed on the Celsius scale (and not only that one), as the temperature has an arbitrary zero point.
• MAPE is not everywhere differentiable, which can result in problems while using it as the optimization criterion.

What is a good value for MAPE?

Obviously the lower the value for MAPE the better, but there is no specific value that you can call “good” or “bad.” It depends on a couple of factors:

• The type of industry
• The MAPE value compared to a simple forecasting model

Let’s explore these two factors in depth.

### MAPE Varies by Industry

Often companies create forecasts for the demand of their products and then use MAPE as a way to measure the accuracy of the forecasts.

Unfortunately, there is no “standard” MAPE value because it can vary so much by the type of company.

For example, a company that rarely changes its pricing will likely have steady and predictable demand, which means it may have a model that produces a very low MAPE, perhaps under 3%.

For other companies that constantly run promotions and specials, their demand will vary greatly over time and thus a forecasting model will likely have a harder time predicting demand as accurately which means the models may have a higher value for MAPE.

You should be highly skeptical of “industry standards” for MAPE.

### Compare MAPE to a Simple Forecasting Model

Rather than trying to compare the MAPE of your model with some arbitrary “good” value, you should instead compare it to the MAPE of simple forecasting models.

There are two well-known simple forecasting models:

1. The average forecasting method.

This type of forecast model simply predicts the value for the next upcoming period to be the average of all prior periods. Although this method seems overly simplistic, it actually tends to perform well in practice.

2. The naïve forecasting method.

This type of forecast model predicts the value for the next upcoming period to be equal to the prior period. Again, although this method is quite simple it tends to work surprisingly well.

When developing a new forecasting model, you should compare the MAPE of that model to the MAPE of these two simple forecasting methods.

If the MAPE of your new model is not significantly better than these two methods, then you shouldn’t consider it to be useful.

## symmetric Mean Absolute Percentage Error (sMAPE)

Having discussed the MAPE, we also take a look at one of the suggested alternatives to it — the symmetric MAPE. It was supposed to overcome the asymmetry mentioned above — the boundlessness of the forecasts that are higher than the actuals.

There are a few different versions of sMAPE out there. Another popular and commonly accepted one adds absolute values to both terms in the denominator to account for the sMAPE being undefined when both the actual value and the forecast are equal to 0.

• Expressed as a percentage.
• Fixes the shortcoming of the original MAPE — it has both the lower (0%) and the upper (200%) bounds.

### Shortcomings

• Unstable when both the true value and the forecast are very close to zero. When it happens, we will deal with division by a number very close to zero.
• sMAPE can take negative values, so the interpretation of an “absolute percentage error” can be misleading.
• The range of 0% to 200% is not that intuitive to interpret, hence often the division by the 2 in the denominator of the sMAPE formula is omitted.
• Whenever the actual value or the forecast has the value is 0, sMAPE will automatically hit the upper boundary value.
• Same assumptions as the MAPE regarding the meaningful zero value.
• While fixing the asymmetry of boundlessness, sMAPE introduces another kind of delicate asymmetry caused by the denominator of the formula. Imagine two cases. In the first one, we have A = 100 and F = 120. The sMAPE is 18.2%. Now a very similar case, in which we have A = 100 and F = 80. Here we come out with the sMAPE of 22.2%.

## Mean Percentage Error

The mean percentage error (MPE) equation is exactly like that of MAPE. The only difference is that it lacks the absolute value operation.

Even though the MPE lacks the absolute value operation, it is actually its absence that makes MPE useful. Since positive and negative errors will cancel out, we cannot make any statements about how well the model predictions perform overall. However, if there are more negative or positive errors, this bias will show up in the MPE. Unlike MAE and MAPE, MPE is useful to us because it allows us to see if our model systematically underestimates (more negative error) or overestimates (positive error).

If you’re going to use a relative measure of error like MAPE or MPE rather than an absolute measure of error like MAE or MSE, you’ll most likely use MAPE. MAPE has the advantage of being easily interpretable, but you must be wary of data that will work against the calculation (i.e. zeroes). You can’t use MPE in the same way as MAPE, but it can tell you about systematic errors that your model makes.

## Calculating MPE against our model

```mpe_sum = 0
for sale, x in zip(sales, X):
prediction = lm.predict(x)
mpe_sum += ((sale - prediction)/sale)
mpe = mpe_sum/len(sales)

print(mpe)
>>> [-4.77081497]
```

All the other error metrics have suggested to us that, in general, the model did a fair job at predicting sales based on critic and user scores. However, the MPE indicates to us that it actually systematically underestimates the sales. Knowing this aspect of our model is helpful to us since it allows us to look back at the data and reiterate on which inputs to include that may improve our metrics. Overall, I would say that my assumptions in predicting sales were a good start. The error metrics revealed trends that would have been unclear or unseen otherwise.

## Conclusion

We’ve covered a lot of ground with the four summary statistics, but remembering them all correctly can be confusing. The table below will give a quick summary of the acronyms and their basic characteristics.

All of the above measures deal directly with the residuals produced by our model. For each of them, we use the magnitude of the metric to decide if the model is performing well. Small error metric values point to good predictive ability, while large values suggest otherwise. That being said, it’s important to consider the nature of your data set in choosing which metric to present. Outliers may change your choice in metric, depending on if you’d like to give them more significance to the total error. Some fields may just be more prone to outliers, while others are may not see them so much.

In any field though, having a good idea of what metrics are available to you is always important. We’ve covered a few of the most common error metrics used, but there are others that also see use. The metrics we covered use the mean of the residuals, but the median residual also sees use. As you learn other types of models for your data, remember that intuition we developed behind our metrics and apply them as needed.

## Resources

https://www.dataquest.io/blog/understanding-regression-error-metrics/

https://towardsdatascience.com/choosing-the-correct-error-metric-mape-vs-smape-5328dec53fac

https://stats.stackexchange.com/questions/299712/what-are-the-shortcomings-of-the-mean-absolute-percentage-error-mape

https://www.enjoyalgorithms.com/blog/evaluation-metrics-regression-models

##### Amir Masoud Sefidian
Machine Learning Engineer