This year, we saw a dazzling application of machine learning. The OpenAI GPT-2 exhibited an impressive ability to write coherent and passionate essays that exceed what we anticipated current language models are able to produce. The GPT-2 wasn’t a particularly novel architecture – its architecture is very similar to the decoder-only transformer. The GPT2 was, however, a very large, transformer-based language model trained on a massive dataset. In this post, we’ll look at the architecture that enabled the model to produce its results. We will go into the depths of its self-attention layer. And then we’ll look at applications for the decoder-only transformer beyond language modeling.
So what exactly is a language model?
A language model is basically a machine learning model that is able to look at parts of a sentence and predict the next word. The most famous language models are smartphone keyboards that suggest the next word based on what you’ve currently typed.
In this sense, we can say that the GPT-2 is basically the next word prediction feature of a keyboard app, but one that is much larger and more sophisticated than what your phone has. The GPT-2 was trained on a massive 40GB dataset called WebText that the OpenAI researchers crawled from the internet as part of the research effort. To compare in terms of storage size, the keyboard app I use, SwiftKey, takes up 78MBs of space. The smallest variant of the trained GPT-2, takes up 500MBs of storage to store all of its parameters. The largest GPT-2 variant is 13 times the size so it could take up more than 6.5 GBs of storage space.
One great way to experiment with GPT-2 is using the AllenAI GPT-2 Explorer. It uses GPT-2 to display ten possible predictions for the next word (alongside their probability score). You can select a word and then see the next list of predictions to continue writing the passage.
The original transformer model is made up of an encoder and decoder – each is a stack of what we can call transformer blocks. That architecture was appropriate because the model tackled machine translation – a problem where encoder-decoder architectures have been successful in the past.
A lot of the subsequent research work saw the architecture shed either the encoder or decoder, and use just one stack of transformer blocks – stacking them up as high as practically possible, feeding them massive amounts of training text, and throwing vast amounts of computing at them (hundreds of thousands of dollars to train some of these language models, likely millions in the case of AlphaStar).
How high can we stack up these blocks? It turns out that’s one of the main distinguishing factors between the different GPT2 model sizes:
The GPT-2 is built using transformer decoder blocks. BERT, on the other hand, uses transformer encoder blocks. We will examine the difference in the following section. But one key difference between the two is that GPT2, like traditional language models, outputs one token at a time. Let’s for example prompt a well-trained GPT-2 to recite the first law of robotics:
The way these models actually work is that after each token is produced, that token is added to the sequence of inputs. And that new sequence becomes the input to the model in its next step. This is an idea called “auto-regression”. This is one of the ideas that made RNNs unreasonably effective.
The GPT2 and some later models like TransformerXL and XLNet are auto-regressive in nature. BERT is not. That is a trade-off. In losing auto-regression, BERT gained the ability to incorporate the context on both sides of a word to gain better results. XLNet brings back autoregression while finding an alternative way to incorporate the context on both sides.
The initial transformer paper introduced two types of transformer blocks:
First is the encoder block:
An encoder block from the original transformer paper can take inputs up to a certain max sequence length (e.g. 512 tokens). It’s okay if an input sequence is shorter than this limit, we can just pad the rest of the sequence.
Second, there’s the decoder block which has a small architectural variation from the encoder block – a layer to allow it to pay attention to specific segments from the encoder:
One key difference in the self-attention layer here is that it masks future tokens – not by changing the word to [mask] like BERT, but by interfering in the self-attention calculation blocking information from tokens that are to the right of the position being calculated.
If, for example, we’re to highlight the path of position #4, we can see that it is only allowed to attend to the present and previous tokens:
It’s important that the distinction between self-attention (what BERT uses) and masked self-attention (what GPT-2 uses) is clear. A normal self-attention block allows a position to peak at tokens to its right. Masked self-attention prevents that from happening:
Subsequent to the original paper, Generating Wikipedia by Summarizing Long Sequences proposed another arrangement of the transformer block that is capable of doing language modeling. This model threw away the Transformer encoder. For that reason, let’s call the model the “Transformer-Decoder”. This early transformer-based language model was made up of a stack of six transformer decoder blocks:
The decoder blocks are identical. I have expanded the first one so you can see its self-attention layer is the masked variant. Notice that the model now can address up to 4,000 tokens in a certain segment — a massive upgrade from the 512 in the original transformer.
These blocks were very similar to the original decoder blocks, except they did away with that second self-attention layer. A similar architecture was examined in Character-Level Language Modeling with Deeper Self-Attention to create a language model that predicts one letter/character at a time.
The OpenAI GPT-2 model uses these decoder-only blocks.
Let’s lay a trained GPT-2 on our surgery table and look at how it works.
The GPT-2 can process 1024 tokens. Each token flows through all the decoder blocks along its own path.
The simplest way to run a trained GPT-2 is to allow it to ramble on its own (which is technically called generating unconditional samples) – alternatively, we can give it a prompt to have it speak about a certain topic (a.k.a generating interactive conditional samples). In the rambling case, we can simply hand it the start token and have it start generating words (the trained model uses
<|endoftext|> as its start token. Let’s call it
The model only has one input token, so that path would be the only active one. The token is processed successively through all the layers, then a vector is produced along that path. That vector can be scored against the model’s vocabulary (all the words the model knows, 50,000 words in the case of GPT-2). In this case, we selected the token with the highest probability, ‘the’. But we can certainly mix things up – you know how if you keep clicking the suggested word in your keyboard app, it sometimes can stuck in repetitive loops where the only way out is if you click the second or third suggested word. The same can happen here. GPT-2 has a parameter called top-k that we can use to have the model consider sampling words other than the top word (which is the case when top-k = 1).
In the next step, we add the output from the first step to our input sequence, and have the model make its next prediction:
Notice that the second path is the only one that’s active in this calculation. Each layer of GPT-2 has retained its own interpretation of the first token and will use it in processing the second token (we’ll get into more detail about this in the following section about self-attention). GPT-2 does not re-interpret the first token in light of the second token.
Let’s look at more details to get to know the model more intimately. Let’s start with the input. As in other NLP models we’ve discussed before, the model looks up the embedding of the input word in its embedding matrix – one of the components we get as part of a trained model.
Each row is a word embedding: a list of numbers representing a word and capturing some of its meaning. The size of that list is different in different GPT2 model sizes. The smallest model uses an embedding size of 768 per word/token.
So in the beginning, we look up the embedding of the start token
<s> in the embedding matrix. Before handing that to the first block in the model, we need to incorporate positional encoding – a signal that indicates the order of the words in the sequence to the transformer blocks. Part of the trained model is a matrix that contains a positional encoding vector for each of the 1024 positions in the input.
With this, we’ve covered how input words are processed before being handed to the first transformer block. We also know two of the weight matrices that constitute the trained GPT-2.
Sending a word to the first transformer block means looking up its embedding and adding up the positional encoding vector for position #1.
The first block can now process the token by first passing it through the self-attention process, then passing it through its neural network layer. Once the first transformer block processes the token, it sends its resulting vector up the stack to be processed by the next block. The process is identical in each block, but each block has its own weights in both self-attention and the neural network sublayers.
Language heavily relies on context. For example, look at the second law:
Second Law of Robotics
A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.
I have highlighted three places in the sentence where the words are referring to other words. There is no way to understand or process these words without incorporating the context they are referring to. When a model processes this sentence, it has to be able to know that:
This is what self-attention does. It bakes in the model’s understanding of relevant and associated words that explain the context of a certain word before processing that word (passing it through a neural network). It does that by assigning scores to how relevant each word in the segment is and adding up their vector representation.
As an example, this self-attention layer in the top block is paying attention to “a robot” when it processes the word “it”. The vector it will pass to its neural network is a sum of the vectors for each of the three words multiplied by their scores.
Self-attention is processed along the path of each token in the segment. The significant components are three vectors:
A crude analogy is to think of it like searching through a filing cabinet. The query is like a sticky note with the topic you’re researching. The keys are like the labels of the folders inside the cabinet. When you match the tag with a sticky note, we take out the contents of that folder, these contents are the value vector. Except you’re not only looking for one value but a blend of values from a blend of folders.
Multiplying the query vector by each key vector produces a score for each folder (technically: dot product followed by softmax).
We multiply each value by its score and sum up – resulting in our self-attention outcome.
This weighted blend of value vectors results in a vector that paid 50% of its “attention” to the word
robot, 30% to the word
a, and 19% to the word
it. Later in the post, we’ll get deeper into self-attention. But first, let’s continue our journey up the stack toward the output of the model.
When the top block in the model produces its output vector (the result of its own self-attention followed by its own neural network), the model multiplies that vector by the embedding matrix.
Recall that each row in the embedding matrix corresponds to the embedding of a word in the model’s vocabulary. The result of this multiplication is interpreted as a score for each word in the model’s vocabulary.
We can simply select the token with the highest score (top_k = 1). But better results are achieved if the model considers other words as well. So a better strategy is to sample a word from the entire list using the score as the probability of selecting that word (so words with a higher score have a higher chance of being selected). A middle ground is setting top_k to 40, and having the model consider the 40 words with the highest scores.
With that, the model has completed an iteration resulting in outputting a single word. The model continues iterating until the entire context is generated (1024 tokens) or until an end-of-sequence token is produced.
And there we have it. A rundown of how the GPT2 works. If you’re curious to know exactly what happens inside the self-attention layer, then the following bonus section is for you. We will describe self-attention in order to make describing later transformer models easier to examine and describe (looking at you, TransformerXL and XLNet).
I’d like to note a few oversimplifications in this post:
Earlier in the post, we showed this image to showcase self-attention being applied in a layer that is processing the word
In this section, we’ll look at the details of how that is done. Note that we’ll look at it in a way to try to make sense of what happens to individual words. That’s why we’ll be showing many single vectors. The actual implementations are done by multiplying giant matrices together. But I want to focus on the intuition of what happens on a word level here.
Let’s start by looking at the original self-attention as it’s calculated in an encoder block. Let’s look at a toy transformer block that can only process four tokens at a time.
Self-attention is applied through three main steps:
Let’s focus on the first path. We’ll take its query, and compare it against all the keys. That produces a score for each key. The first step in self-attention is to calculate the three vectors for each token path (let’s ignore attention heads for now):
Now that we have the vectors, we use the query and key vectors only for step #2. Since we’re focused on the first token, we multiply its query by all the other key vectors resulting in a score for each of the four tokens.
We can now multiply the scores by the value vectors. A value with a high score will constitute a large portion of the resulting vector after we sum them up.
The lower the score, the more transparent we’re showing the value vector. That’s to indicate how multiplying by a small number dilutes the values of the vector.
If we do the same operation for each path, we end up with a vector representing each token containing the appropriate context of that token. Those are then presented to the next sublayer in the transformer block (the feed-forward neural network):
Now that we’ve looked inside a transformer’s self-attention step, let’s proceed to look at masked self-attention. Masked self-attention is identical to self-attention except when it comes to step #2. Assuming the model only has two tokens as input and we’re observing the second token. In this case, the last two tokens are masked. So the model interferes in the scoring step. It basically always scores the future tokens as 0 so the model can’t peak to future words:
This masking is often implemented as a matrix called an attention mask. Think of a sequence of four words (“robot must obey orders”, for example). In a language modeling scenario, this sequence is absorbed in four steps – one per word (assuming for now that every word is a token). As these models work in batches, we can assume a batch size of 4 for this toy model that will process the entire sequence (with its four steps) as one batch.
In matrix form, we calculate the scores by multiplying a queries matrix by a keys matrix. Let’s visualize it as follows, except instead of the word, there would be the query (or key) vector associated with that word in that cell:
After the multiplication, we slap on our attention mask triangle. It set the cells we want to mask to -infinity or a very large negative number (e.g. -1 billion in GPT2):
Then, applying softmax on each row produces the actual scores we use for self-attention:
What this scores table means is the following:
Let’s get into more detail on GPT-2’s masked attention.
We can make the GPT-2 operate exactly as masked self-attention works. But during evaluation, when our model is only adding one new word after each iteration, it would be inefficient to recalculate self-attention along earlier paths for tokens that have already been processed.
In this case, we process the first token (ignoring
<s> for now).
GPT-2 holds on to the key and value vectors of the “
a” token. Every self-attention layer holds on to its respective key and value vectors for that token:
Now in the next iteration, when the model processes the word
robot, it does not need to generate query, key, and value queries for the “
a” token. It just reuses the ones it saved from the first iteration:
Let’s assume the model is processing the word
it. If we’re talking about the bottom block, then its input for that token would be the embedding of
it + the positional encoding for slot #9:
Every block in a transformer has its own weights (broken down later in the post). The first we encounter is the weight matrix that we use to create the queries, keys, and values.
Self-attention multiplies its input by its weight matrix (and adds a bias vector, not illustrated here).
The multiplication results in a vector that’s basically a concatenation of the query, key, and value vectors for the word
Multiplying the input vector by the attention weights vector (and adding a bias vector afterward) results in the key, value, and query vectors for this token.
In the previous examples, we dove straight into self-attention ignoring the “multi-head” part. It would be useful to shed some light on that concept now. Self-attention is conducted multiple times on different parts of the Q, K, V vectors. “Splitting” attention heads is simply reshaping the long vector into a matrix. The small GPT2 has 12 attention heads, so that would be the first dimension of the reshaped matrix:
In the previous examples, we’ve looked at what happens inside one attention head. One way to think of multiple attention-heads is like this (if we’re to only visualize three of the twelve attention heads):
We can now proceed to scoring – knowing that we’re only looking at one attention head (and that all the others are conducting a similar operation):
Now the token can get scored against all of keys of the other tokens (that were calculated in attention head #1 in previous iterations):
As we’ve seen before, we now multiply each value with its score, then sum them up, producing the result of self-attention for attention-head #1:
The way we deal with the various attention heads is that we first concatenate them into one vector:
But the vector isn’t ready to be sent to the next sublayer just yet. We need to first turn this Frankenstein’s-monster of hidden states into a homogenous representation.
We’ll let the model learn how to best map concatenated self-attention results into a vector that the feed-forward neural network can deal with. Here comes our second large weight matrix that projects the results of the attention heads into the output vector of the self-attention sublayer:
And with this, we have produced the vector we can send along to the next layer:
The fully-connected neural network is where the block processes its input token after self-attention has included the appropriate context in its representation. It is made up of two layers. The first layer is four times the size of the model (Since GPT2 small is 768, this network would have 768*4 = 3072 units). Why four times? That’s just the size the original transformer rolled with (model dimension was 512 and layer #1 in that model was 2048). This seems to give transformer models enough representational capacity to handle the tasks that have been thrown at them so far.
(Not shown: A bias vector)
The second layer projects the result from the first layer back into the model dimension (768 for the small GPT2). The result of this multiplication is the result of the transformer block for this token.
(Not shown: A bias vector)
That’s the most detailed version of the transformer block we’ll get into! You now pretty much have the vast majority of the picture of what happens inside a transformer language model. To recap, our brave input vector encounters these weight matrices:
And each block has its own set of these weights. On the other hand, the model has only one token embedding matrix and one positional encoding matrix:
If you want to see all the parameters of the model, then I have tallied them here:
They add up to 124M parameters instead of 117M for some reason. I’m not sure why, but that’s how many of them seem to be in the published code.
The decoder-only transformer keeps showing promise beyond language modeling. There are plenty of applications where it has shown success which can be described by similar visuals as the above. Let’s close this post by looking at some of these applications
An encoder is not required to conduct translation. The same task can be addressed by a decoder-only transformer:
This is the task that the first decoder-only transformer was trained on. Namely, it was trained to read a Wikipedia article (without the opening section before the table of contents), and to summarize it. The actual opening sections of the articles were used as the labels in the training datasest:
The paper trained the model against Wikipedia articles, and thus the trained model was able to summarize articles:
In Sample Efficient Text Summarization Using a Single Pre-Trained Transformer, a decoder-only transformer is first pre-trained on language modeling, then finetuned to do summarization. It turns out to achieve better results than a pre-trained encoder-decoder transformer in limited data settings.
The GPT2 paper also shows the results of summarization after pre-training the model on language modeling.
The Music Transformer uses a decoder-only transformer to generate music with expressive timing and dynamics. “Music Modeling” is just like language modeling – just let the model learn music in an unsupervised way, then have it sample outputs (what we called “rambling”, earlier).
You might be curious as to how music is represented in this scenario. Remember that language modeling can be done through vector representations of either characters, words, or tokens that are parts of words. With a musical performance (let’s think about the piano for now), we have to represent the notes, but also velocity – a measure of how hard the piano key is pressed.
A performance is just a series of these one-hot vectors. A midi file can be converted into such a format. The paper has the following example input sequence:
The one-hot vector representation for this input sequence would look like this:
I love a visual in the paper that showcases self-attention in the Music Transformer. I’ve added some annotations to it here:
“Figure 8: This piece has a recurring triangular contour. The query is at one of the latter peaks and it attends to all of the previous high notes on the peak, all the way to the beginning of the piece.” … “[The] figure shows a query (the source of all the attention lines) and previous memories being attended to (the notes that are receiving more softmax probabiliy is highlighted in). The coloring of the attention lines corresponds to different heads and the width to the weight of the softmax probability.”
If you’re unclear on this representation of musical notes, check out this video.
The tech world is abuzz with GPT3 hype. Massive language models (like GPT3) are starting to surprise us with their abilities. While not yet completely reliable for most businesses to put in front of their customers, these models are showing sparks of cleverness that are sure to accelerate the march of automation and the possibilities of intelligent computer systems. Let’s remove the aura of mystery around GPT3 and learn how it’s trained and how it works.
A trained language model generates text.
We can optionally pass it some text as input, which influences its output.
The output is generated from what the model “learned” during its training period where it scanned vast amounts of text.
Training is the process of exposing the model to lots of text. That process has been completed. All the experiments you see now are from that one trained model. It was estimated to cost 355 GPU years and cost $4.6m.
The dataset of 300 billion tokens of text is used to generate training examples for the model. For example, these are three training examples generated from the one sentence at the top.
You can see how you can slide a window across all the text and make lots of examples.
The model is presented with an example. We only show it the features and ask it to predict the next word.
The model’s prediction will be wrong. We calculate the error in its prediction and update the model so next time it makes a better prediction.
Repeat millions of times
Now let’s look at these same steps with a bit more detail.
GPT3 actually generates output one token at a time (let’s assume a token is a word for now).
Please note: This is a description of how GPT-3 works and not a discussion of what is novel about it (which is mainly the ridiculously large scale). The architecture is a transformer decoder model based on this paper https://arxiv.org/pdf/1801.10198.pdf
GPT3 is MASSIVE. It encodes what it learns from training in 175 billion numbers (called parameters). These numbers are used to calculate which token to generate at each run.
The untrained model starts with random parameters. Training finds values that lead to better predictions.
These numbers are part of hundreds of matrices inside the model. Prediction is mostly a lot of matrix multiplication. To shed light on how these parameters are distributed and used, we’ll need to open the model and look inside.
GPT3 is 2048 tokens wide. That is its “context window”. That means it has 2048 tracks along which tokens are processed.
Let’s follow the purple track. How does a system process the word “robotics” and produce “A”?
The important calculations of the GPT3 occur inside its stack of 96 transformer decoder layers.
See all these layers? This is the “depth” in “deep learning”.
Each of these layers has its own 1.8B parameter to make its calculations. That is where the “magic” happens. This is a high-level view of that process:
You can see a detailed explanation of everything inside the decoder in my blog post The Illustrated GPT2.
The difference with GPT3 is the alternating dense and sparse self-attention layers.
This is an X-ray of an input and response (“Okay human”) within GPT3. Notice how every token flows through the entire layer stack. We don’t care about the output of the first words. When the input is done, we start caring about the output. We feed every word back into the model.
In the React code generation example, the description would be the input prompt (in green), in addition to a couple of examples of description=>code, I believe. And the react code would be generated like the pink tokens here token after token.
My assumption is that the priming examples and the description are appended as input, with specific tokens separating examples and the results. Then fed into the model.
It’s impressive that this works like this. Because you just wait until fine-tuning is rolled out for the GPT3. The possibilities will be even more amazing.
Fine-tuning actually updates the model’s weights to make the model better at a certain task.