We are in an era of personalization. Users want personalized content, and businesses are capitalizing on that demand. Recommendation engines, usually built with machine learning techniques, have become one of the essential tools for serving it. However, developing a recommendation engine has its challenges, and how best to evaluate a recommender system is the topic of this post.
When dealing with ranking tasks, prediction accuracy and decision support metrics fall short. The prediction accuracy metrics include the mean absolute error (MAE) and root mean square error (RMSE). These compare actual vs. predicted ratings and operate at the level of individual rating predictions. If a user rated an item 4.5, these metrics tell us how far off our prediction of, say, 1.2 or 4.3 was.
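For concreteness, here is a minimal sketch of both metrics on a handful of made-up rating pairs:

```python
import numpy as np

# Hypothetical actual vs. predicted ratings for a few user-item pairs.
actual = np.array([4.5, 3.0, 5.0, 2.0])
predicted = np.array([4.3, 3.5, 4.0, 1.2])

mae = np.mean(np.abs(actual - predicted))           # mean absolute error
rmse = np.sqrt(np.mean((actual - predicted) ** 2))  # root mean square error
print(mae, rmse)
```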
Next in line, the decision support metrics include Precision, Recall, and the F1 score. These measure how well a recommender helps users make good decisions: they help the user select “good” items and avoid “bad” items. These metrics start to emphasize what is important for recommendation systems: if we recommend 100 items to a user, what matters most are the items in the first 5, 10, or 20 positions. Precision is the percentage of selected items that are relevant to the user; its focus is recommending mostly useful stuff. Recall is the percentage of relevant items that the system selected; its focus is not missing useful stuff. The F1 score combines the two: the harmonic mean balances precision and recall into a single metric.
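As a minimal sketch with made-up item IDs, treating the recommendation list as a plain, unordered set:

```python
# Decision-support metrics over a recommendation list treated as a set.
recommended = {"A", "B", "C", "D", "E"}   # items we recommended (hypothetical)
relevant = {"B", "E", "F"}                # items the user actually found useful (hypothetical)

hits = recommended & relevant
precision = len(hits) / len(recommended)  # fraction of recommendations that are relevant
recall = len(hits) / len(relevant)        # fraction of relevant items we recovered
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
```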
For our ranking task, these metrics have one major drawback: they cover the entire data set and are not targeted to the “Top-N” recommendations. Both precision and recall are about the entire result set. To address this, precision and recall are usually outfitted with a top-N bound, in the form of Precision@N and Recall@N. Interestingly, I could not find a good source that describes an F1@N score, which would be the harmonic mean of P@N and R@N. Let’s carry on anyway.
The modified Precision@N metric is the percentage of the top-N items that are good, so it incorporates some level of top-N evaluation: it focuses on the top recommended items. However, these modified metrics are still similar to the original Precision, Recall, and F1 measures. They are all primarily concerned with being good at finding things. We need metrics that emphasize being good at finding and ranking things.
I conveniently just learned about these terms in Andrew Ng’s ML course on Coursera, so let’s start there. If we have a binary classifier for predicting having a condition (y=1) vs. not, then we define:

$$\text{precision} = \frac{\#\text{ correct positives}}{\#\text{ predicted positive}}, \qquad \text{recall} = \frac{\#\text{ correct positives}}{\#\text{ with condition}}$$
In maybe more familiar terminology, precision is one minus the false discovery rate (the fraction of your positive predictions that are wrong), and recall is one minus the false negative rate. Actually, I’m baffled as to why two new terms were needed for this.
OK, so how does this map (ding!) to recommender systems? In modeling pretty much all recommendation systems we’re going to have the following quantities with their corresponding ones in the binary classifier:
Terminology in Binary Classifier | Terminology in Recommender System |
---|---|
# with condition | # of all the possible relevant (“correct”) items for a user |
# predicted positive | # of items we recommended (we predict these items as “relevant”) |
# correct positives | # of our recommendations that are relevant |
OK, so now with almost no effort we have:

$$P@N = \frac{\#\text{ of our recommendations that are relevant}}{\#\text{ of items we recommended}}, \qquad R@N = \frac{\#\text{ of our recommendations that are relevant}}{\#\text{ of all possible relevant items}}$$
Let’s say I am asked to recommend N=5 products (this means I predict “positive” for five products), from all the possible products there are only m=3 that are actually relevant to the user, and my successes and failures in my ranked list are [0, 1, 1, 0, 0]. Then:

$$P@5 = \frac{2}{5} = 0.4, \qquad R@5 = \frac{2}{3} \approx 0.67$$
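The same numbers in a few lines of Python, with the ranked list written as a 0/1 hit vector:

```python
hits = [0, 1, 1, 0, 0]   # 1 = the recommendation in that slot was relevant
N = len(hits)            # number of recommendations made
m = 3                    # number of items actually relevant to the user

precision_at_N = sum(hits) / N   # 2/5 = 0.4
recall_at_N = sum(hits) / m      # 2/3 ≈ 0.67
print(precision_at_N, recall_at_N)
```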
Here’s a visual example: we’re being asked to recommend financial “products” to Bank users and we compare our recommendations to the products that a user actually added the following month (those are all the possible “relevant” ones).
Time to level up. Let’s see how rank-aware evaluation metrics can help.
It can be hard to imagine how to evaluate a recommender system: comparing lists of recommended items to lists of relevant items is not intuitive. Traditional tasks, like predicting who died on the Titanic or what breed of dog is in an ImageNet image, do not call for the rank-aware metrics that are central to recommender systems.
Recommender systems have a very particular and primary concern: they need to put relevant items very high up the list of recommendations. Most users will not scroll through 200 items to find their favorite brand of Earl Grey tea. We need rank-aware metrics to select recommenders that address these two primary questions:
1) Where does the recommender place the items it suggests?
2) How good is the recommender at modeling relative preference?
This is where the following metrics can help:
MRR: Mean Reciprocal Rank
MAP: Mean Average Precision
NDCG: Normalized Discounted Cumulative Gain
The three metrics above come from two families. The first family comprises binary relevance metrics, which only care whether an item is good or not in a binary sense. The second family comprises utility-based metrics, which expand the sense of good/bad with a measurement of absolute or relative goodness. Let us describe the characteristics of each metric in the following sections.
The Mean Reciprocal Rank (MRR) is the simplest metric of the three. It tries to answer the question “Where is the first relevant item?” and is closely linked to the binary relevance family of metrics. The algorithm goes as follows: for each user, find the rank of the first relevant item in their recommendation list, take the reciprocal of that rank, and then average these reciprocal ranks over all users.
Suppose we have the following three recommendation lists for three users. We can compute the reciprocal rank of each user by finding the rank of the first relevant item, per list. Then we do a simple averaging over all users.
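A minimal sketch of that computation; the three hit vectors below are hypothetical stand-ins for the example lists:

```python
# Reciprocal rank: 1 / (position of the first relevant item), or 0 if there is none.
def reciprocal_rank(hits):
    for position, hit in enumerate(hits, start=1):
        if hit:
            return 1.0 / position
    return 0.0

# Three hypothetical users' recommendation lists (1 = relevant item).
user_lists = [
    [0, 0, 1, 0, 0],  # first relevant item at rank 3 -> RR = 1/3
    [0, 1, 0, 0, 0],  # first relevant item at rank 2 -> RR = 1/2
    [1, 0, 0, 0, 0],  # first relevant item at rank 1 -> RR = 1
]

mrr = sum(reciprocal_rank(hits) for hits in user_lists) / len(user_lists)
print(mrr)  # (1/3 + 1/2 + 1) / 3 ≈ 0.61
```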
Next is the MAP metric. Let’s say we have a binary relevance data set and want to evaluate the whole list of recommended items up to a specific cut-off N. This cut-off was previously incorporated using the Precision@N metric. The P@N decision support metric calculates the fraction of the N recommendations that are good. The drawback is that P@N does not consider the recommended list as an ordered list; it treats the whole list as a set of items and weights all the errors in it equally.
We want to penalize errors in the first few elements more than errors much later in the list. For this, we need a metric that weights the errors accordingly: heavily at the top of the list, with gradually decreasing significance as we go down to the lower positions.
So that’s nice, but Precision and Recall don’t seem to care about ordering. So instead let’s talk about precision and recall at cutoff k. Imagine taking your list of N recommendations and considering only the first element, then only the first two, then only the first three, etc. These subsets can be indexed by k. Precision and Recall at cutoff k, P(k), and r(k), are simply the precision and recall calculated by considering only the subset of your recommendations from rank 1 through k. Really it would be more intuitive to say “up to cutoff k” rather than “at”.
Sticking with the bank example, here is what I mean:
OK, are you ready for Average Precision now? If we are asked to recommend N items and the number of relevant items in the full space of items is m, then:

$$\text{AP@N} = \frac{1}{m} \sum_{k=1}^{N} P(k) \cdot rel(k)$$
where rel(k) is just an indicator that equals 1 if the kth item was relevant and 0 otherwise. I’d like to point out that instead of recommending N items we could have recommended, say, 2N, but the AP@N metric says we only care about the average precision up to the Nth item.
Let’s imagine recommending N=3 products (AP@3) to a user who actually added a total of m=3 products. Here are some examples of outcomes for our algorithm:
Recommendations | Precision @k’s | AP@3 |
---|---|---|
[0, 0, 1] | [0, 0, **1/3**] | (1/3)(1/3) = 0.11 |
[0, 1, 1] | [0, 1/2, **2/3**] | (1/3)[(1/2) + (2/3)] = 0.38 |
[1, 1, 1] | [1/1, 2/2, **3/3**] | (1/3)[(1) + (2/2) + (3/3)] = 1 |
First notice that the more correct recommendations I have, the larger my AP. This is because the P(k) of the kth subset is included in the AP sum only if you got the kth recommendation correct, so AP rewards you for giving correct recommendations (surprising absolutely no one).
There is more subtlety here though! In each row, I’ve bolded the P(k) term which came from the correct recommendation in the third slot. Notice it is larger when there have been more successes in front of it – that’s because the precision of the kth subset is higher the more correct guesses you’ve had up to point k. Thus, AP rewards you for front-loading the recommendations that are most likely to be correct.
These two features are what make AP a useful metric when your algorithm is returning a ranked ordering of items where each item is either correct or incorrect, and items further down in the list are less likely to be used. One more set of examples might make this second point a little clearer.
Recommendations | Precision @k’s | AP@3 |
---|---|---|
[1, 0, 0] | [1/1, 1/2, 1/3] | (1/3)(1) = 0.33 |
[0, 1, 0] | [0, 1/2, 1/3] | (1/3)(1/2) = 0.17 |
[0, 0, 1] | [0, 0, 1/3] | (1/3)(1/3) = 0.11 |
In all these cases you got just one recommendation correct, but AP was higher the further up the ranking that correct guess fell.
A final point of note is that adding another recommendation can never decrease your AP score, so if you are asked for N recommendations, give all of them, even if you don’t feel very confident about the ones lower down the list! AP will never penalize you for tacking additional recommendations onto your list – just make sure you front-load the best ones.
OK, that was Average Precision, which applies to a single data point (like a single user). What about MAP@N? All that remains is to average the AP@N metric over all your |U| users:

$$\text{MAP@N} = \frac{1}{|U|} \sum_{u=1}^{|U|} \text{AP@N}_u$$

Yes, an average of an average.
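Here is a minimal sketch of AP@N and MAP@N following the definitions above (the helper names are mine, and the hit vectors reuse the table examples); the common 1/min(m, N) normalization variant is discussed a bit further down:

```python
def ap_at_n(hits, m):
    """Average precision for one user's ranked 0/1 hit vector; m = # of relevant items."""
    score, num_hits = 0.0, 0
    for k, hit in enumerate(hits, start=1):
        if hit:
            num_hits += 1
            score += num_hits / k   # P(k), counted only when the kth item is a hit
    return score / m if m > 0 else 0.0


def map_at_n(all_hits, all_m):
    """Mean of AP@N over all users."""
    return sum(ap_at_n(h, m) for h, m in zip(all_hits, all_m)) / len(all_hits)


# Reproducing the table rows above (m = 3 for each user):
print(ap_at_n([0, 0, 1], 3))   # 1/9  ≈ 0.11
print(ap_at_n([0, 1, 1], 3))   # 7/18 ≈ 0.38
print(ap_at_n([1, 1, 1], 3))   # 1.0
print(map_at_n([[0, 0, 1], [0, 1, 1], [1, 1, 1]], [3, 3, 3]))   # 0.5
```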
We can think of P(i) and r(i) as functions of the index i, and we can plot them accordingly, e.g. P(i) vs. i. The resulting plot would of course depend heavily on the particular sequence of correct/incorrect recommendations that we are indexing through. More often what you see is a plot in the P(i) x r(i) plane that traces out the trajectory of both these quantities as you index through the list of recommendations. Analytically we can already imagine what such a trajectory will do: if at the next i we got a correct recommendation then both precision and recall should increase, whereas if we got that recommendation wrong then precision will decrease while recall will be unchanged.
```python
recoms = [0, 1, 0, 1, 0, 1, 1]  # N = 7 ranked recommendations (1 = relevant)
NUM_ACTUAL_ADDED_ACCT = 5       # m, the number of accounts the user actually added

precs = []
recalls = []
for indx, rec in enumerate(recoms):
    precs.append(sum(recoms[:indx + 1]) / (indx + 1))
    recalls.append(sum(recoms[:indx + 1]) / NUM_ACTUAL_ADDED_ACCT)

print(precs)
print(recalls)
```

```
[0.0, 0.5, 0.3333333333333333, 0.5, 0.4, 0.5, 0.5714285714285714]
[0.0, 0.2, 0.2, 0.4, 0.4, 0.6, 0.8]
```
```python
import matplotlib.pyplot as plt
%matplotlib inline

fig, ax = plt.subplots()
ax.plot(recalls, precs, markersize=10, marker="o")
ax.set_xlabel("Recall")
ax.set_ylabel("Precision")
ax.set_title("P(i) vs. r(i) for Increasing $i$ for AP@7")
```
The jaggedness of this type of plot has led to the creation of some “smoothed” or “interpolated” average precision metrics as alternatives, but I’ll not talk about them here.
The Average Precision (AP) metric tries to approximate this sliding scale of weights. It uses a combination of the precision at successive sub-lists, combined with the change in recall in these sub-lists. The calculation goes as follows:
Here is a diagram to help with visualizing the process:
From the figure above, we see that Average Precision is computed at the level of a single recommendation list, i.e. per user. We compute it by sub-dividing the recommendation list: every time we reach a relevant item, we form a new sub-list ending at that item and calculate the precision on it. We do this for every such sub-list until we reach the end of our recommendations. Now that we have a set of precisions, we average them to get the average precision for a single user. Finally, we compute the AP for all users and take the mean to get the mean average precision.
This is primarily an approximation of the original goal of the AP metric: the area under the precision-recall curve, which we get by computing precision as a function of recall. In this excellent lecture, the concept is expanded in great detail; highly recommended. The overall process is to generate a PR curve for every user’s recommended list, generate an interpolated PR curve, and finally average the interpolated PR curves. This is the process visually:
To compare two systems we want the largest possible area under the PR curve. In the above example, we compare systems A, B, and C. We notice that system A is better than system C for all levels of recall. However, systems A and B intersect, with system B doing better at higher levels of recall. The problem with this scenario is that it is hard to determine which system does better overall, and plots are harder to interpret than single metrics. This is why researchers came up with a single metric to approximate the Average Precision (i.e. the area under the precision-recall curve). Here is my annotated approximation, adapted from the Wikipedia page that describes this process:
One last point is realizing what we are actually averaging: noisy signals across many users. Below is a plot of the noise that is common across many users; this presentation goes into more detail about the issue. It is useful to keep in mind when interpreting the MAP score. In the plot below, the bright red line is the average PR curve, and the other individual curves belong to each of the N users. Effectiveness can indeed vary widely across queries, and the MAP averaging will undoubtedly have an effect on the reported performance. Such sample curves can help evaluate the quality of the MAP metric.
I invite you to take a look at further writings around the meaning of the PR-curve. The following works here and here provide nice deep dives into the MAP metric.
Often you see the AP metric modified slightly when there might be more possible correct recommendations than the number of recommendations you are asked to give. Say, a super active user at the bank who adds m=10 accounts the next month, while your algorithm is only supposed to report N=5. In this case, the normalization factor used is 1/min(m,N), which prevents your AP score from being unfairly suppressed when your number of recommendations couldn’t possibly capture all the correct ones.
You also might encounter a somewhat sloppier usage where there is no indicator function rel(k) in the AP@N sum. In this case the precision at cutoff k is implicitly defined to be zero when the kth recommendation was incorrect, so that it still doesn’t contribute to the sum:

$$\text{AP@N} = \frac{1}{m} \sum_{k=1}^{N} P(k), \qquad P(k) \equiv 0 \text{ when the } k\text{th recommendation is incorrect}$$
Finally, if it’s possible for there to be no relevant or correct recommendations possible (m=0) then often the AP is defined to be zero for those points. Note that it will have the effect of dragging the MAP number of an algorithm down the more users there are who didn’t actually add any products. This doesn’t matter for comparing the performance of two algorithms on the same data set, but it does mean that you shouldn’t place any kind of absolute meaning on the final number.
There is an alternative formulation of AP in terms of Precision and Recall, and I didn’t want you to feel left out when people start talking about it at parties:

$$\text{AP@N} = \sum_{k=1}^{N} P(k)\,\Delta r(k)$$

where Δr(k) is the change in recall from the (k−1)th to the kth subset. This formulation is actually kind of nice because we don’t need to “leave out” terms in the sum with an indicator function; instead, the change-in-recall term is zero when the kth recommendation is incorrect, so those guys get wiped out anyway. Hopefully you noticed that the prefactor 1/m is missing too; it turns out that when the kth recommendation is correct, the change in recall is exactly 1/m. OK, let me stop talking and make with the examples. Same as before, recommending N=3 products (AP@3) to a user who actually added a total of m=3 products:
Recs | Prec @k’s | Recall @k’s | Change r @k’s | AP@3 |
---|---|---|---|---|
[0, 0, 1] | [0, 0, 1/3] | [0, 0, 1/3] | [0, 0, 1/3] | (1/3)(1/3) = 0.11 |
[0, 1, 1] | [0, 1/2, 2/3] | [0, 1/3, 2/3] | [0, 1/3, 1/3] | (1/3)(1/2) + (1/3)(2/3) = 0.38 |
[1, 1, 1] | [1, 2/2, 3/3] | [1/3, 2/3, 3/3] | [1/3, 1/3, 1/3] | (1/3)(1) + (1/3)(1) + (1/3)(1) = 1 |
Recs | Prec @k’s | Recall @k’s | Change r @k’s | AP@3 |
---|---|---|---|---|
[1, 0, 0] | [1, 1/2, 1/3] | [1/3, 1/3, 1/3] | [1/3, 0, 0] | (1)(1/3) = 0.33 |
[0, 1, 0] | [0, 1/2, 1/3] | [0, 1/3, 1/3] | [0, 1/3, 0] | (1/2)(1/3) = 0.17 |
[0, 0, 1] | [0, 0, 1/3] | [0, 0, 1/3] | [0, 0, 1/3] | (1/3)(1/3) = 0.11 |
Hopefully, you can convince yourself that you are getting exactly the same result for this formulation of AP as we got before.
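If you would rather let Python do the convincing, here is a small sketch (helper names are mine) that evaluates both formulations on the hit vectors from the tables and shows they agree:

```python
def ap_indicator_form(hits, m):
    # (1/m) * sum over k of P(k) * rel(k)
    total, num_hits = 0.0, 0
    for k, hit in enumerate(hits, start=1):
        num_hits += hit
        if hit:
            total += num_hits / k
    return total / m


def ap_delta_recall_form(hits, m):
    # sum over k of P(k) * (r(k) - r(k-1))
    total, num_hits, prev_recall = 0.0, 0, 0.0
    for k, hit in enumerate(hits, start=1):
        num_hits += hit
        precision_k = num_hits / k
        recall_k = num_hits / m
        total += precision_k * (recall_k - prev_recall)
        prev_recall = recall_k
    return total


for hits in ([0, 0, 1], [0, 1, 1], [1, 1, 1], [1, 0, 0], [0, 1, 0]):
    print(hits, ap_indicator_form(hits, 3), ap_delta_recall_form(hits, 3))
```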
MAP is a very popular evaluation metric for algorithms that do information retrieval, like Google search results, but it can also apply to user-targeted product recommendations. If you have an algorithm that returns a ranked ordering of items, where each item is either a hit or a miss (like relevant vs. irrelevant search results) and items further down the list are less likely to be used or seen (like search results at the bottom of the page), then MAP might be a useful metric.
Using MAP to evaluate a recommender algorithm implies that you are treating the recommendation like a ranking task. This often makes perfect sense since a user has a finite amount of time and attention and we want to show the top recommendations first and maybe market them more aggressively.
In recommendation systems MAP computes the mean of the Average Precision (AP) over all your users. The AP is a measure that takes in a ranked list of your N recommendations and compares it to a list of the true set of “correct” or “relevant” recommendations for that user. AP rewards you for having a lot of “correct” (relevant) recommendations in your list, and rewards you for putting the most likely correct recommendations at the top (you are penalized more when incorrect guesses are higher up in the ranking). So the order of “hits” and “misses” matters a lot in computing an AP score, but once you have front-loaded your best guesses you can never decrease your AP by tacking on more.
To deal with these issues, the recsys community has come up with another, more recent metric that takes into account the fine-grained information included in the ratings. Let’s take a look at the Normalized Discounted Cumulative Gain (NDCG) metric.
The goal of the MAP measure is similar to the goal of the NDCG metric. They both value putting highly relevant documents high up on the recommended lists. However, the NDCG further tunes the recommended list evaluation. It is able to use the fact that some documents are “more” relevant than others. Highly relevant items should come before medium-relevant items, which should come before non-relevant items.
Let us take a look at the theory behind NDCG and how we can use it to evaluate a recommendation engine. To understand NDCG, we need to understand its predecessors: Cumulative Gain (CG) and Discounted Cumulative Gain (DCG). We also need to keep the following assumption in mind:
The highly relevant documents are more useful than moderately relevant documents, which are in turn more useful than irrelevant documents. NDCG is a measure of ranking quality.
Gain is just the relevance score for each item recommended.
Every recommendation has a relevance score associated with it. Cumulative Gain is the sum of all the relevance scores in a recommendation set.
Thus, the CG of an ordered recommendation set of N items with document relevance scores $rel_i$ is:

$$\mathrm{CG} = \sum_{i=1}^{N} rel_i$$
There is a drawback with Cumulative Gain. Consider the following two ordered recommendation sets with relevance scores of individual documents.
We know that Set B is better than Set A, as it recommends in decreasing order of relevance, but as per Cumulative Gain, both sets are equally good. What is lacking is the use of position along with the relevance scores. DCG fills this gap: it discounts each relevance score by dividing it by the log of the corresponding position:

$$\mathrm{DCG} = \sum_{i=1}^{N} \frac{rel_i}{\log_2(i+1)}$$
Alternatively, it can also be computed using the following expression:

$$\mathrm{DCG} = \sum_{i=1}^{N} \frac{2^{rel_i} - 1}{\log_2(i+1)}$$
This second expression penalizes more heavily than the first one when a document with higher relevance is ranked lower. Depending on the application, you can choose either expression to compute DCG and NDCG. If the relevance scores are binary, i.e. either 0 or 1, the two expressions yield the same result.
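Here is a small hand-rolled sketch of both expressions (separate from the scikit-learn code further down), which also shows that they coincide for binary relevance scores:

```python
import numpy as np

def dcg_linear(relevances):
    # sum of rel_i / log2(i + 1), with positions i starting at 1
    rel = np.asarray(relevances, dtype=float)
    discounts = np.log2(np.arange(2, rel.size + 2))
    return float(np.sum(rel / discounts))

def dcg_exponential(relevances):
    # sum of (2^rel_i - 1) / log2(i + 1); boosts highly relevant items
    rel = np.asarray(relevances, dtype=float)
    discounts = np.log2(np.arange(2, rel.size + 2))
    return float(np.sum((2 ** rel - 1) / discounts))

print(dcg_linear([3, 2, 0, 1]), dcg_exponential([3, 2, 0, 1]))  # differ for graded scores
print(dcg_linear([1, 0, 1, 0]), dcg_exponential([1, 0, 1, 0]))  # identical for binary scores
```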
Let us compute the DCG for both ordered sets using the first expression.
The DCG result for the above example is aligned with our intuition. Set B is better than Set A.
DCG seems a good measure at first as it takes position significance into account. However, it is still not complete. Depending on various factors, the number of recommendations served may vary for every user. Thus, the DCG will vary accordingly. We need a score that has proper upper and lower bounds so that we can take a mean across all the recommendations scores to report a final score. NDCG brings in this normalization.
For every recommendation set, to compute NDCG, we need to compute two quantities: the DCG of the recommended order and the DCG of the ideal order (the IDCG).
NDCG is then the ratio of the DCG of the recommended order to the DCG of the ideal order:

$$\mathrm{NDCG} = \frac{\mathrm{DCG}}{\mathrm{IDCG}}$$
This ratio will always be in the range [0,1].
Consider the following ordered recommendation set served to one of the users.
The ideal order for this recommendation set will be:
The corresponding DCG scores using the first expression:
Thus, the NDCG for this recommendation set will be:
To evaluate a recommendation engine, compute the mean of the NDCG over the recommendations served to the users in the test set.
Here is an annotated diagram that shows the stages of calculating the NDCG linearly:
Before NDCG, we had the cumulative gain (CG), a basic measure that accumulates the graded relevances. It does not take into account the position of elements in the ranked list. For ranking tasks, we need to increase the relative impact of position, so the standard Discounted Cumulative Gain (DCG) adds a logarithmic reduction factor that discounts the relevance score proportionally to the position of the item. Furthermore, in industrial applications it is common to boost the relevance scores (the $2^{rel}-1$ numerator above) to emphasize retrieving highly relevant documents; this is what appears in the industrial DCG formula.
We are dealing with dynamic systems: users get a variable number of relevant items recommended, which makes the DCG measure not comparable across users. We need to normalize the metric to lie between 0 and 1. To this end, we determine the ideal ranking for a user and use its DCG as the Ideal Discounted Cumulative Gain (IDCG), which provides a natural normalization factor for computing the Normalized Discounted Cumulative Gain. As this is a per-user metric, we calculate it for all users in the test set and then average across users to get a single number, which is what we use to compare recommendation systems to each other. To visualize this process, the figure below goes through the calculation with the predicted and ideal ranking for a single user.
As I said the primary advantage of the NDCG is that it takes into account the graded relevance values. If your dataset has the right form and you are dealing with graded relevance, then the NDCG measure is your go-to metric.
Code: Python program for Normalized Discounted Cumulative Gain

```python
# import required packages
from sklearn.metrics import ndcg_score, dcg_score
import numpy as np

# Relevance scores in ideal order
true_relevance = np.asarray([[3, 2, 1, 0, 0]])

# Relevance scores in output order
relevance_score = np.asarray([[3, 2, 0, 0, 1]])

# DCG score
dcg = dcg_score(true_relevance, relevance_score)
print("DCG score : ", dcg)

# IDCG score
idcg = dcg_score(true_relevance, true_relevance)
print("IDCG score : ", idcg)

# Normalized DCG score
ndcg = dcg / idcg
print("nDCG score : ", ndcg)

# or we can use the scikit-learn ndcg_score function directly
print("nDCG score (from function) : ", ndcg_score(true_relevance, relevance_score))
```
Output:

```
DCG score :  4.670624189796882
IDCG score :  4.761859507142915
nDCG score :  0.980840401274087
nDCG score (from function) :  0.980840401274087
```
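To turn this into a system-level score, as described earlier, we average the per-user NDCG. Here is a minimal sketch with two users; the second user's relevance lists are made up for illustration:

```python
import numpy as np
from sklearn.metrics import ndcg_score

# Per-user relevance lists: true relevances and the model's ranking scores.
# User 1 reuses the arrays above; user 2 is a hypothetical example.
users_true = [np.asarray([[3, 2, 1, 0, 0]]), np.asarray([[2, 2, 1, 0]])]
users_pred = [np.asarray([[3, 2, 0, 0, 1]]), np.asarray([[1, 2, 0, 3]])]

# Mean NDCG across users is the single number reported for the recommender.
mean_ndcg = np.mean([ndcg_score(t, p) for t, p in zip(users_true, users_pred)])
print("mean nDCG over users : ", mean_ndcg)
```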
Coverage is the percentage of the possible recommendations that an implemented recommendation algorithm can actually produce. Coverage is essential from the perspective of new items being recommended: it gives an idea of how quickly a new item will appear in recommendation lists. New items reduce the coverage metric since they need to be purchased or rated by at least a couple of people first. Depending on the recommendation technology used, the list of items for which the system can produce predictions varies. If the system uses only collaborative filtering, items must have been rated by more users than a decided threshold. Similarly, some systems may apply filters, and only items that satisfy the given filters can be used for prediction. One can denote the list of items that can be used for recommendation as $I_s$, a subset of the possible items $I$ in the inventory. Coverage can then be easily calculated by dividing the number of possible items by the total number of items:

$$\text{Coverage} = \frac{|I_s|}{|I|} \times 100\%$$
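As a minimal sketch of that ratio, with a hypothetical ten-item inventory and a hypothetical set of predictable items $I_s$:

```python
# Catalog coverage: fraction of the inventory the system can produce predictions for.
all_items = {"i1", "i2", "i3", "i4", "i5", "i6", "i7", "i8", "i9", "i10"}

# I_s: items eligible for recommendation, e.g. those with enough ratings (hypothetical).
predictable_items = {"i1", "i2", "i3", "i5", "i8"}

coverage = len(predictable_items) / len(all_items) * 100
print(f"coverage: {coverage:.0f}%")  # 50%
```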
Diversity is a measure of how different the recommended items are from each other. Consider a customer who has just finished watching the first movie of a trilogy on Netflix. A low-diversity recommender would recommend only the next parts of the trilogy or films directed by the same director. On the other hand, high diversity could be achieved by recommending items completely at random. Diversity may seem like a subjective measure, but it can be calculated using the similarity between recommended items: compute the similarity between each pair of recommended items and take the average of the similarity scores. If the average similarity score of the top 10 recommendations is $\text{sim}_{10}$, diversity can be taken as the opposite of this average similarity score:

$$\text{Diversity} = 1 - \text{sim}_{10}$$
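A minimal sketch of that calculation, using cosine similarity over hypothetical embedding vectors for the top 10 recommended items:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical feature/embedding vectors for the top-10 recommended items.
rng = np.random.default_rng(0)
top10_vectors = rng.random((10, 16))

# Average pairwise similarity between distinct recommended items.
sim_matrix = cosine_similarity(top10_vectors)
n = sim_matrix.shape[0]
avg_sim = (sim_matrix.sum() - n) / (n * (n - 1))  # exclude the diagonal of self-similarities

diversity = 1 - avg_sim   # higher value = more diverse top-10 list
print(diversity)
```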
Considering diversity alone as a measure of quality can be misleading, so it should be used as a supplement alongside other measurements.
Resources:
https://towardsdatascience.com/evaluate-your-recommendation-engine-using-ndcg-759a851452d1
https://towardsdatascience.com/ranking-evaluation-metrics-for-recommender-systems-263d0a66ef54
http://sdsawtelle.github.io/blog/output/mean-average-precision-MAP-for-recommender-systems.html