Difference between model-based and model-free reinforcement learning
A tutorial on Motion Estimation with Optical Flow with Python Implementation
Show all

Upper Confidence Bound (UCB) Algorithm Explained with Python code

7 mins read

In this tutorial, I will explain to you the application of the Upper Confidence Bound(UCB) algorithm to solve the Multi Bandit problem and show you the whole coding process in Python.

What is the Multi-Armed Bandit Problem?
Before going to learn the multi-armed bandit problem, first, take a look at the exploration vs. exploitation dilemma. This dilemma exists in many aspects of our life. Well, say you take your lunch at your favorite restaurant every day as you are confident about what you get from there is good. But you may be missing the chances of discovering even a better option. If you explore all the restaurants in your locality one by one, the probability of tasting the worst food in your life would be pretty high. But then again, there is also a probability that you get an even better option! This dilemma is called exploration vs. exploitation dilemma. Now let’s go to the classic example of this dilemma- the Multi-Armed Bandit Problem

Multi-Armed Bandit Problem:

This is a use case of reinforcement learning, where we are given a slot machine called a multi-armed bandit (the slot machines in casinos are called bandits as it turns out all casinos configure these machines in such a way that all gamblers end up losing money!) with each arm having its own rigged probability distribution of success. Pulling any one of these arms gives you a stochastic reward of either 1, for success, or 0, for a failure. Now your task is to find an optimal strategy that will give you the highest rewards in the long run without the prior knowledge of the probability distribution of success of the machines.

This problem can be related to more business examples such as displaying the optimal ads to the viewer.

So in this tutorial, we are going to solve a use case of the multi-armed bandit problem. We take an example of an online advertising campaign dataset where we have 10 different versions of a similar ad. You see the dataset looks similar to that of a multi-armed bandit problem!

Let’s have a quick check on our multi-armed bandit problem


You can download the dataset from here.

UCB Algorithm Intuition:

You can find the best option by using an A/B test. Where you can show a different ad to a user and take his feedback, that is whether the user clicks or not. And summing up all the feedback, you can find the best ad to display. Bat wait, there is a problem, in the A/B test you are incurring more cost and time by doing exploration and exploitation separately. So we need to combine exploration and exploitation and get the result as soon as possible. We do not have enough time and money to actually do this exploitation, so we will do this during the campaign.

The Upper Confidence Bound algorithm is the kind of algorithm that helps us to perform exploitation and exploration together.

How does the UCB Algorithm work?

At the start of the campaign, we don’t know what is the best arm or ad. So we cannot distinguish or discriminate any arm or ad. So the UCB algorithm assumes they all have the same observed average value. Then the algorithm creates a confidence bound for each arm or ad. So it randomly picks any of the arms or ads. Then two things can happen- the user clicks the ad or the arm gives a reward or does not. Let’s say the ad did not have a click or the arm was a failure. So the observed average of the ad or arm will go down. And the confidence bound will also go down. If it had a click, the observed average would go up and the confidence bound also go up. By exploiting the best one we are decreasing the confidence bound. As we are adding more and more samples, the probability of other arms or ads doing well is also going up. This is the fundamental concept of UCB.

Now let’s see what steps are involved to perform this algorithm on our dataset:


UCB in Python:

We will implement the whole algorithm in Python. First of all, we need to import some essential libraries.

# Importing the Essential Libraries 
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Now, let’s import the dataset:

# Importing the Dataset 
dataset = pd.read_csv('Ads_CTR_Optimisation.csv')

Now, implement the UCB algorithm:

# Implementing UCB
import math
N = 10000
d = 10
ads_selected = []
numbers_of_selections = [0] * d
sums_of_rewards = [0] * d
total_reward = 0
for n in range(0, N):
    ad = 0
    max_upper_bound = 0
    for i in range(0, d):
        if (numbers_of_selections[i] > 0):
            average_reward = sums_of_rewards[i] / numbers_of_selections[i]
            delta_i = math.sqrt(3/2 * math.log(n + 1) / numbers_of_selections[i])
            upper_bound = average_reward + delta_i
            upper_bound = 1e400
        if upper_bound > max_upper_bound:
            max_upper_bound = upper_bound
            ad = i
    numbers_of_selections[ad] = numbers_of_selections[ad] + 1
    reward = dataset.values[n, ad]
    sums_of_rewards[ad] = sums_of_rewards[ad] + reward
    total_reward = total_reward + reward

Code Explanation:
Here, three vectors are used.

# ads_selected = [], is used to append the different types of ads selected in each round 
# numbers_of_selections = [0] * d, is used to count the number of time an ad was selected 
# sums_of_rewards = [0] * d, is used to calculate the cumulative sum of rewards at each round

As we don’t have any prior knowledge about the selection of each ad, we will take the first 10 rounds as trial rounds. So, we set the if condition:  if (numbers_of_selections[i] > 0) so that the ads are selected at least once before entering into the main algorithm.

Then we implement the second step of the algorithm, computing the average reward of each ad i up to round n and the upper confidence bound for each ad.

Here we applied a trick in the else condition by taking the variable upper_bound  to a huge number. This is because we want the first 10 rounds as trial rounds where the 10 ads are selected at least once. This trick will help us to do so.

After 10 trial rounds, the algorithm will work as the steps explained earlier.

Visualizing the Result:
Now, it’s time to see what we get from our model. For this, we will visualize the output.

plt.title('Histogram of ads selections')
plt.ylabel('Number of times each ad was selected')

From the above visualization, we can see that the fourth ad got the highest click. So our model advice us to place the fourth version of the ad to the user for getting the highest number of clicks!

Amir Masoud Sefidian
Amir Masoud Sefidian
Machine Learning Engineer

Comments are closed.