17 mins read
## Computing perplexity from sentence probabilities

## Perplexity and Entropy

## A quick recap of language models

## Evaluating language models

## Perplexity as the normalized inverse probability of the test set

### 1 Probability of the test set

### 2 Normalising

### 3 Bringing it all together

## Perplexity as the exponential of the cross-entropy

### Cross-entropy of a language model

### Weighted branching factor: rolling a die

### Weighted branching factor: language models

## Summary

In general, perplexity is a measurement of how well a probability model predicts a sample. In the context of Natural Language Processing, perplexity is one way to evaluate language models. A language model is a probability distribution over sentences: it’s both able to generate plausible human-written sentences (if it’s a good language model) and to evaluate the goodness of already written sentences. Presented with a well-written document, a good language model should be able to give it a higher probability than a badly written document, i.e. it should not be “perplexed” when presented with a well-written document.

Thus, the perplexity metric in NLP is a way to capture the degree of ‘uncertainty’ a model has in predicting (i.e. assigning probabilities to) text. Now, let’s try to compute the probabilities assigned by language models to some example sentences and derive an intuitive explanation of what perplexity is.

Suppose we have trained a small language model over an English corpus. The model is only able to predict the probability of the next word in the sentence from a small subset of six words: *“a”*,* “the”*,* “red”*, *“fox”*,* “dog”*,* *and *“.”*.

Let’s compute the probability of the sentence *W, *which is “*a red fox.*”.

The probability of a generic sentence *W*, made of the words *w1*, *w2*, up to *wn*, can be expressed as the following:

P(W) = P(w1, w2, …, wn)

Using our specific sentence *W*, the probability can be extended as the following:

P(“a red fox.”) =

P(“a”) * P(“red” | “a”) * P(“fox” | “a red”) * P(“.” | “a red fox”)

Suppose these are the probabilities assigned by our language model to a generic first word in a sentence:

As can be seen from the chart, the probability of “*a*” as the first word of a sentence is:

P(“a”) = 0.4

Next, suppose these are the probabilities given by our language model to a generic second word that follows “*a*”:

The probability of “*red*” as the second word in the sentence after “*a*” is:

P(“red” | “a”) = 0.27

Similarly, these are the probabilities of the next words:

Finally, the probability assigned by our language model to the whole sentence “*a red fox.*” is:

P(“a red fox.”) =

P(“a”) * P(“red” | “a”) * P(“fox” | “a red”) * P(“.” | “a red fox”)

= 0.4 * 0.27 * 0.55 * 0.79

= 0.0469

It would be nice to compare the probabilities assigned to different sentences to see which sentences are better predicted by the language model. However, since the probability of a sentence is obtained from a product of probabilities, the longer the sentence the lower will be its probability (since it’s a product of factors with values smaller than one). We should find a way of measuring these sentence probabilities, without the influence of the sentence length.

This can be done by normalizing the sentence probability by the number of words in the sentence. Since the probability of a sentence is obtained by multiplying many factors, we can average them using the geometric mean.

Let’s call *Pnorm(W) *the normalized probability of the sentence *W*. Let *n* be the number of words in *W*. Then, applying the geometric mean:

Pnorm(W) = P(W) ^ (1 / n)

Using our specific sentence “*a red fox.*”:

Pnorm(“a red fox.”) = P(“a red fox”) ^ (1 / 4) = 0.465

Great! This number can now be used to compare the probabilities of sentences with different lengths. The higher this number is over a well-written sentence, the better is the language model.

So, what does this have to do with perplexity? Well, perplexity is just the reciprocal of this number.

Let’s call *PP(W)* the perplexity computed over the sentence *W*. Then:

PP(W) = 1 / Pnorm(W)

= 1 / (P(W) ^ (1 / n))

= (1 / P(W)) ^ (1 / n)

Which is the formula of perplexity. Since perplexity is just the reciprocal of the normalized probability, the lower the perplexity over a well-written sentence the better is the language model. Let’s try computing the perplexity with a second language model that assigns equal probability to each word at each prediction. Since the language models can predict six words only, the probability of each word will be 1/6.

P(“a red fox.”) = (1/6) ^ 4 = 0.00077

Pnorm(“a red fox.”) = P(“a red fox.”) ^ (1/4) = 1/6

PP(“a red fox”) = 1 / Pnorm(“a red fox.”) = 6

…which, as expected, is a higher perplexity than the one produced by the well-trained language model.

Perplexity can be computed also starting from the concept of Shannon entropy. Let’s call *H(W)* the entropy of the language model when predicting a sentence *W*. Then, it turns out that:

PP(W) = 2 ^ (H(W))

This means that, when we optimize our language model, the following sentences are all more or less equivalent:

- We are maximizing the normalized sentence probabilities given by the language model over well-written sentences.
- We are minimizing the perplexity of the language model over well-written sentences.
- We are minimizing the entropy of the language model over well-written sentences.

A **language model** is a statistical model that assigns probabilities to words and sentences. Typically, we might be trying to guess the ** next word** w in a sentence given all previous words, often referred to as the

For example, given the history “For dinner I’m making __”, what’s the probability that the next word is “cement”? What’s the probability that the next word is “fajitas”?

Hopefully, P(fajitas|For dinner I’m making) > P(cement|For dinner I’m making).

We are also often interested in the probability that our model assigns to a full sentence *W* made of the sequence of words (*w_1,w_2,…,w_N*).

For example, we’d like a model to assign higher probabilities to sentences that are **real** and **syntactically correct**.

A **unigram** **model** only works at the level of individual words. Given a sequence of words W, a unigram model would output the probability:

where the individual probabilities P(w_i) could for example be estimated based on the frequency of the words in the training corpus.

An **n-gram model**, instead, looks at the previous (n-1) words to estimate the next one. For example, a trigram model would look at the previous 2 words, so that:

Language models can be **embedded** in more complex systems to aid in performing language tasks such as translation, classification, speech recognition, etc.

**Perplexity** is an **evaluation metric** for language models. But why would we want to use it? Why can’t we just look at the loss/accuracy of our final system on the task we care about?

We can in fact use two different approaches to evaluate and compare language models:

**Extrinsic evaluation**. This involves evaluating the models by employing them in an actual**task**(such as machine translation) and looking at their final loss/accuracy. This is the best option as it’s the only way to tangibly see how different models affect the task we’re interested in. However, it can be computationally expensive and slow as it requires training a full system.**Intrinsic evaluation**. This involves finding some metric to evaluate the language model itself, not taking into account the specific tasks it’s going to be used for. While intrinsic evaluation is not as “good” as extrinsic evaluation as a final metric, it’s a useful way of**quickly**comparing models.**Perplexity is an intrinsic evaluation method.**

This is probably the most frequently seen definition of perplexity. In this section, we’ll see why it makes sense.

First of all, what makes a good language model? As mentioned earlier, we want our model to assign high probabilities to sentences that are real and syntactically correct, and low probabilities to fake, incorrect, or highly infrequent sentences. Assuming our dataset is made of sentences that are in fact real and correct, this means that the best model will be the one that assigns the **highest probability to the test set**. Intuitively, if a model assigns a high probability to the test set, it means that it is **not surprised** to see it (it’s not *perplexed* by it), which means that it has a good understanding of how the language works.

However, it’s worth noting that datasets can have** varying numbers of sentences**, and sentences can have varying numbers of words. Clearly, adding more sentences introduces more uncertainty, so other things being equal a larger test set is likely to have a lower probability than a smaller one. Ideally, we’d like to have a metric that is independent of the size of the dataset. We could obtain this by **normalizing** the probability of the test set **by the total number of words**, which would give us a **per-word measure**.

How do we do this? If what we wanted to normalize was the sum of some terms, we could just divide it by the number of words to get a per-word measure. But the probability of a sequence of words is given by a product.

For example, let’s take a unigram model:

How do we normalize this probability? It’s easier to do it by looking at the log probability, which turns the product into a sum:

We can now normalize this by dividing by N to obtain the **per-word log probability**:

… and then remove the log by exponentiating:

We can see that we’ve obtained **normalization by taking the N-th root**.

Now going back to our original equation for perplexity, we can see that we can interpret it as the **inverse probability of the test set**, **normalized** **by the number of words** in the test set:

Note:

- Since we’re taking the inverse probability, a
**lower perplexity**indicates a**better model** - In this case, W is the test set. It contains the sequence of words of all sentences one after the other, including the start-of-sentence and end-of-sentence tokens, <SOS> and <EOS>.

For example, a test set with two sentences would look like this:

W = (<SOS>, This, is, the, first, sentence, . ,<EOS>,<SOS>, This, is, the, second, one, . ,<EOS>)

N is the count of all tokens in our test set, including SOS/ EOS and punctuation. In the example above N = 16.

If we want, we can also calculate the perplexity of a single sentence, in which case W would simply be that one sentence.

**Note**: if you need a refresher on entropy I heartily recommend this document by Sriram Vajapeyam.

Perplexity can also be defined as the exponential of the cross-entropy:

First of all, we can easily check that this is in fact equivalent to the previous definition:

But how can we explain this definition based on cross-entropy?

We know that entropy can be interpreted as the *average number of bits required to store the information in a variable*, and it’s given by:

We also know that the **cross-entropy** is given by:

which can be interpreted as the average number of bits required to store the information in a variable, if instead of the real probability distribution p we’re using an **estimated distribution** q.

In our case, p is the real distribution of our language, while q is the distribution estimated by our model on the training set. Clearly, we can’t know the real p, but given a long enough sequence of words W (so a large N), we can approximate the per-word cross-entropy using the Shannon-McMillan-Breiman theorem:

Let’s rewrite this to be consistent with the notation used in the previous section. Given a sequence of words W of length N and a trained language model P, we approximate the cross-entropy as:

Let’s look again at our definition of perplexity:

From what we know of cross-entropy we can say that H(W) **is the average number of bits needed to encode each word**. This means that the perplexity

How can we interpret this? We can look at perplexity as to the **weighted branching factor**.

So we’ve said:

For example, if we find that H(W) = 2, it means that on average each word needs 2 bits to be encoded, and using 2 bits we can encode 2² = 4 words.

But what does this mean? For simplicity, let’s forget about language and words for a moment and imagine that our model is actually trying to predict the **outcome of rolling a die**. A regular die has 6 sides, so the **branching factor** of the die is 6. The branching factor simply indicates **how many possible outcomes** there are whenever we roll.

Let’s say we train our model on this fair die, and the model learns that each time we roll there is a 1/6 probability of getting any side. Then let’s say we create a test set by rolling the die 10 more times and we obtain the (highly unimaginative) sequence of outcomes T = {1, 2, 3, 4, 5, 6, 1, 2, 3, 4}. What’s the perplexity of our model on this test set?

So the perplexity matches the branching factor.

Let’s now imagine that we have an **unfair die**, which rolls a 6 with a probability of 7/12, and all the other sides with a probability of 1/12 each. We again train a model on a training set created with this unfair die so that it will learn these probabilities. We then create a new test set T by rolling the die 12 times: we get a 6 on 7 of the rolls, and other numbers on the remaining 5 rolls. What’s the perplexity now?

**The perplexity is lower**. This is because our model now knows that rolling a 6 is more probable than any other number, so it’s less “surprised” to see one, and since there are more 6s in the test set than other numbers, the overall “surprise” associated with the test set is lower. The *branching factor* is still 6, because all 6 numbers are still possible options at any roll. However, the ** weighted branching factor** is now lower, due to one option being a lot more likely than the others. This is like saying that under these new conditions, at each roll our model is

To clarify this further, let’s push it to the extreme. Let’s say we now have an unfair die that gives a 6 with 99% probability, and the other numbers with a probability of 1/500 each. We again train the model on this die and then create a test set with 100 rolls where we get a 6 99 times and another number once. The perplexity is now:

The branching factor is still 6 but the weighted branching factor is now 1, because at each roll the model is almost certain that it’s going to be a 6, and rightfully so. So while *technically* at each roll there are still 6 possible options, there is only 1 option that is a strong favorite.

Let’s tie this back to language models and cross-entropy.

First of all, if we have a language model that’s trying to guess the next word, the branching factor is simply the number of words that are possible at each point, which is just the size of the vocabulary.

We said earlier that perplexity in a language model is* the average number of words that can be encoded using* *H(W)* *bits*. We can now see that this simply represents the **average branching factor** of the model. As we said earlier, if we find a cross-entropy value of 2, this indicates a perplexity of 4, which is the “average number of words that can be encoded”, and that’s simply the average branching factor. All this means is that **when trying to guess the next word, our model is** ** as confused as if it had to pick between 4 different words**.

**Perplexity**is a metric used to judge how good a language model is- We can define perplexity as the
**inverse probability of the test set**,**normalized by the number of words**:

- We can alternatively define perplexity by using the
**cross-entropy**, where the cross-entropy indicates the average number of bits needed to encode one word, and perplexity is the number of words that can be encoded with those bits:

- We can interpret perplexity as to the weighted branching factor. If we have a perplexity of 100, it means that whenever the model is trying to guess the next word it is as confused as if it had to pick between 100 words.

References:

https://towardsdatascience.com/perplexity-in-language-models-87a196019a94