
Understanding Attention Mechanism in Sequence 2 Sequence Machine Translation



Recurrent Neural Networks (or more precisely LSTM/GRU) have been found to be very effective in solving complex sequence-related problems given a large amount of data. They have real-time applications in speech recognition, Natural Language Processing (NLP) problems, time series forecasting, etc.

The Sequence to Sequence (often abbreviated as seq2seq) model is a special class of Recurrent Neural Network architectures typically used (but not restricted) to solve complex language-related problems like Machine Translation, Question Answering, creating chatbots, Text Summarization, etc.


The purpose of this blog post is to give a detailed explanation of the sequence-to-sequence model and attention mechanism and give an intuitive understanding of how they solve these complex tasks. We will take the problem of Machine Translation (translating a text from one language to another, in our case from English to Marathi) as the running example in this post. However, the technical details apply to any sequence-to-sequence problem in general.

Since we are using Neural Networks to perform Machine Translation, this is more commonly called Neural Machine Translation (NMT).


This post assumes that you:

a. Know the fundamental concepts of Machine Learning and Neural Networks

b. Know high-school linear algebra and probability

c. Have a working knowledge of LSTM networks in Python and Keras

The Unreasonable Effectiveness of Recurrent Neural Networks (which explains how RNNs can be used to build language models) and Understanding LSTM Networks (which explains the working of LSTMs with solid intuition) are two brilliant blogs that I strongly suggest going through if you haven’t. The concepts explained in these blogs are used extensively in this post.

Encoder-Decoder Architecture

The most common architecture used to build Seq2Seq models is the Encoder-Decoder architecture. This is the one we will use for this post. Below is a very high-level view of this architecture.


Points to note:

a. Both the encoder and the decoder are typically LSTM models (or sometimes GRU models)

b. The encoder reads the input sequence and summarizes the information in what are called the internal state vectors (in the case of an LSTM, these are the hidden state and cell state vectors). We discard the outputs of the encoder and only preserve the internal states.

c. Decoder is an LSTM whose initial states are initialized to the final states of the Encoder LSTM. Using these initial states, the decoder starts generating the output sequence.

d. The decoder behaves a bit differently during the training and inference procedures. During training, we use a technique called teacher forcing, which helps to train the decoder faster. During inference, the input to the decoder at each time step is the output from the previous time step.

e. Intuitively, the encoder summarizes the input sequence into state vectors (sometimes also called Thought vectors), which are then fed to the decoder which starts generating the output sequence given the Thought vectors. The decoder is just a language model conditioned on the initial states.

Now we will understand all the above steps in detail by considering the example of translating an English sentence (input sequence) into its equivalent Marathi sentence (output sequence).

Encoder LSTM

This section provides a brief overview of the main components of the Encoder LSTM. I will keep this intuitive without going into the mathematics. So here’s what they are:

LSTM processing an input sequence of length ‘k’

The LSTM reads the input one element after the other. Thus if the input is a sequence of length ‘k’, we say that the LSTM reads it in ‘k’ time steps (think of this as a for loop with ‘k’ iterations).

Referring to the above diagram, below are the 3 main components of an LSTM:

a. Xi => Input sequence at time step i

b. hi and ci => LSTM maintains two states (‘h’ for hidden state and ‘c’ for cell state) at each time step. Combined together these are the internal states of the LSTM at time step i.

c. Yi => Output sequence at time step i

Important: Technically all of these components (Xi, hi, ci and Yi) are actually vectors of floating point numbers (explained below)
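The “for loop” view of these components can be sketched as below. Note that `lstm_step` is a hypothetical stand-in for the LSTM’s internal gate equations, not the real update rule; the sketch only shows how the states flow from one time step to the next:

```python
import numpy as np

def lstm_step(x, h, c):
    # Stand-in for the real LSTM gate equations; it only needs to consume
    # the input vector and the previous states and return the new states.
    c_new = c + np.tanh(x)
    h_new = np.tanh(c_new + h)
    return h_new, c_new

k, units = 5, 4                   # sequence length and number of LSTM units
X = np.random.rand(k, units)      # X1..Xk: one input vector per time step

h = np.zeros(units)               # h0: initial hidden state
c = np.zeros(units)               # c0: initial cell state
for i in range(k):                # 'k' iterations => 'k' time steps
    h, c = lstm_step(X[i], h, c)  # hi, ci: internal states at time step i

# (h, c) now summarize the whole sequence; in a real LSTM, Yi would be
# computed from hi at each step.
```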

Let’s try to map all of these in the context of our problem. Recall that our problem is to translate an English sentence to its Marathi equivalent. For the purpose of this blog, we will consider the below example:

Input sentence (English)=> “Rahul is a good boy”

Output sentence (Marathi) => “राहुल चांगला मुलगा आहे”

For now, just focus on the input i.e. the English sentence

The explanation for Xi:

Now a sentence can be seen as a sequence of either words or characters. For example in the case of words, the above English sentence can be thought of as a sequence of 5 words (‘Rahul’, ‘is’, ‘a’, ‘good’, ‘boy’). And in the case of characters, it can be thought of as a sequence of 19 characters (‘R’, ‘a’, ‘h’, ‘u’, ‘l’, ‘ ’, ……, ‘y’).

We will break the sentence by words as this scheme is more common in real-world applications. Hence the name ‘Word Level NMT’. So, referring to the diagram above, we have the following input:

X1 = ‘Rahul’, X2 = ‘is’, X3 = ‘a’, X4 = ‘good’, X5 = ‘boy’.

The LSTM will read this sentence word by word in 5 time steps, as follows.

Encoder LSTM

But one question that we must answer is how to represent each Xi (each word) as a vector.

There are various word embedding techniques that map (embed) a word into a fixed length vector. I will assume the reader to be familiar with the concept of word embeddings and won’t cover this topic in detail. However, we will use the built-in Embedding Layer of the Keras API to map each word into a fixed-length vector.
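Conceptually, an embedding layer is just a trainable lookup table: each integer word id selects one row of a matrix. Here is a minimal numpy sketch (the vocabulary, the indices, and the matrix values are made up for illustration; in the actual model Keras learns this matrix during training):

```python
import numpy as np

vocab_size, embedding_dim = 6, 3
# Each row is the vector for one word id (random here; learned in practice)
embedding_matrix = np.random.rand(vocab_size, embedding_dim)

word_index = {'rahul': 1, 'is': 2, 'a': 3, 'good': 4, 'boy': 5}  # 0 reserved for padding
sentence = ['rahul', 'is', 'a', 'good', 'boy']

# The Embedding layer maps each integer id to its row of the matrix
X = np.array([embedding_matrix[word_index[w]] for w in sentence])
# X has shape (5, 3): one fixed-length vector per word
```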

The explanation for hi and ci:

The next question is what is the role of the internal states (hi and ci) at each time step?

In very simple terms, they remember what the LSTM has read (learned) till now. For example:

h3, c3 => These two vectors will remember that the network has read “Rahul is a” till now. Basically, it’s the summary of information till time step 3, which is stored in the vectors h3 and c3 (thus called the states at time step 3).

Similarly, we can thus say that h5, c5 will contain the summary of the entire input sentence, since this is where the sentence ends (at time step 5). These states coming out of the last time step are also called the “Thought vectors” as they summarize the entire sequence in a vector form.

Then what about h0,c0? These vectors are typically initialized to zero as the model has not yet started to read the input.

Note: The size of both of these vectors is equal to the number of units (neurons) used in the LSTM cell.

The explanation for Yi:

Finally, what about Yi at each time step? These are the output (predictions) of the LSTM model at each time step.

But what type of a vector is Yi? More specifically in the case of word-level language models, each Yi is actually a probability distribution over the entire vocabulary which is generated by using a softmax activation. Thus each Yi is a vector of size “vocab_size” representing a probability distribution.

Depending on the context of the problem they might sometimes be used or sometimes discarded.

In our case, we have nothing to output until we have read the entire English sentence, because we only start generating the output sequence (the equivalent Marathi sentence) once the whole input has been read. Thus we will discard the Yi of the Encoder for our problem.
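As a quick sketch, the softmax that turns the final Dense layer’s raw scores into such a probability distribution looks like this (`vocab_size` and the logits here are made up for illustration):

```python
import numpy as np

def softmax(z):
    # Subtracting the max is a standard trick for numerical stability
    e = np.exp(z - z.max())
    return e / e.sum()

vocab_size = 10
logits = np.random.rand(vocab_size)  # raw scores from the final Dense layer
y_i = softmax(logits)                # Yi: probability distribution over the vocabulary
# y_i is non-negative everywhere and sums to 1
```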

Summary of the encoder:

We will read the input sequence (English sentence) word by word and preserve the internal states of the LSTM network generated after the last time step hk, ck (assuming the sentence has ‘k’ words). These vectors (states hk and ck) are called the encoding of the input sequence, as they encode (summarize) the entire input in a vector form. Since we will start generating the output once we have read the entire sequence, outputs (Yi) of the Encoder at each time step are discarded.

Moreover, you must also understand what type of vectors are Xi, hi, ci and Yi. What are their sizes (shapes) and what do they represent? If you have any confusion understanding this part, then you need to first strengthen your understanding of LSTM and language models.

Decoder LSTM — Training Mode

Unlike the Encoder LSTM, which plays the same role in both the training and inference phases, the Decoder LSTM behaves differently in each of these phases.

In this section, we’ll try to understand how to configure the Decoder during the training phase, while in the next section we’ll understand how to use it during inference.

Recall that given the input sentence “Rahul is a good boy”, the goal of the training process is to train (teach) the decoder to output “राहुल चांगला मुलगा आहे”. Just as the Encoder scanned the input sequence word by word, similarly the Decoder will generate the output sequence word by word.

For some technical reasons (explained later) we will add two tokens in the output sequence as follows:

Output sequence => “START_ राहुल चांगला मुलगा आहे _END”

Now consider the diagram below:

Decoder LSTM — Training Mode

The most important point is that the initial states (h0, c0) of the decoder are set to the final states of the encoder. This intuitively means that the decoder is trained to start generating the output sequence depending on the information encoded by the encoder. Obviously, the translated Marathi sentence must depend on the given English sentence.

In the first time step, we provide the START_ token so that the decoder starts generating the next token (the actual first word of Marathi sentence). And after the last word in the Marathi sentence, we make the decoder learn to predict the _END token. This will be used as the stopping condition during the inference procedure, basically, it will denote the end of the translated sentence and we will stop the inference loop (more on this later).

We use a technique called “Teacher Forcing”, wherein the input at each time step is the actual output (and not the predicted output) from the previous time step. This makes training the network faster and more efficient. To understand more about teacher forcing, refer to this blog.
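Teacher forcing amounts to feeding the decoder the ground-truth target shifted right by one step. A sketch with our example sentence:

```python
target = ['START_', 'राहुल', 'चांगला', 'मुलगा', 'आहे', '_END']

# Decoder input at step t is the TRUE token from step t-1 (not the prediction)
decoder_input  = target[:-1]   # ['START_', 'राहुल', 'चांगला', 'मुलगा', 'आहे']
decoder_target = target[1:]    # ['राहुल', 'चांगला', 'मुलगा', 'आहे', '_END']

for inp, tgt in zip(decoder_input, decoder_target):
    # At each time step the decoder sees `inp` and is trained to predict `tgt`
    pass
```

This is exactly the offset that the batch generator in the code walkthrough below implements when it shifts `decoder_target_data` by one time step.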

Finally, the loss is calculated on the predicted outputs from each time step and the errors are backpropagated through time in order to update the parameters of the network. Training the network over a longer period with a sufficiently large amount of data results in pretty good predictions (translations) as we’ll see later.

The entire training process (Encoder + Decoder) can be summarized in the below diagram:

Summary of the training process

Decoder LSTM — Inference Mode

Let’s now try to understand the setup required for inference. As already stated the Encoder LSTM plays the same role in reading the input sequence (English sentence) and generating the thought vectors (hk, ck). However, the decoder now has to predict the entire output sequence (Marathi sentence) given these thought vectors.

Let’s try to visually understand by taking the same example.

Input sequence => “Rahul is a good boy”

(Expected) Output Sequence => “राहुल चांगला मुलगा आहे”

Step 1: Encode the input sequence into the Thought Vectors:

Encode the input sequence into thought vectors

Step 2: Start generating the output sequence in a loop, word by word:

At t = 1

Decoder at t=1

At t = 2

Decoder at t = 2

At t = 3

Decoder at t = 3

At t = 4

Decoder at t = 4

At t = 5

Decoder at t = 5

Inference Algorithm:

a. During inference, we generate one word at a time. Thus the Decoder LSTM is called in a loop, every time processing only one time step.

b. The initial states of the decoder are set to the final states of the encoder.

c. The initial input to the decoder is always the START_ token.

d. At each time step, we preserve the states of the decoder and set them as initial states for the next time step.

e. At each time step, the predicted output is fed as input in the next time step.

f. We break the loop when the decoder predicts the _END token.

The entire inference procedure can be summarized in the below diagram:

Summary of the inference process

Code Walkthrough

Nothing beats the understanding developed when we actually implement the code, no matter how much effort is put into understanding the theory. That does not mean we skip the theory; rather, theory must always be followed by implementation.


Download and unzip the file from here.

Before we start building the models, we need to perform some data cleaning and preparation. Without going into too many details, I will assume the reader understands the below (self-explanatory) steps, which are usually part of any language processing project.

import re
import string
from string import digits

import numpy as np
import pandas as pd

lines = pd.read_table('mar.txt', names=['eng', 'mar'])

# Lowercase all characters
lines.eng=lines.eng.apply(lambda x: x.lower())
lines.mar=lines.mar.apply(lambda x: x.lower())

# Remove quotes
lines.eng=lines.eng.apply(lambda x: re.sub("'", '', x))
lines.mar=lines.mar.apply(lambda x: re.sub("'", '', x))
exclude = set(string.punctuation) # Set of all special characters

# Remove all the special characters
lines.eng=lines.eng.apply(lambda x: ''.join(ch for ch in x if ch not in exclude))
lines.mar=lines.mar.apply(lambda x: ''.join(ch for ch in x if ch not in exclude))

# Remove all numbers from text
remove_digits = str.maketrans('', '', digits)
lines.eng=lines.eng.apply(lambda x: x.translate(remove_digits))
lines.mar = lines.mar.apply(lambda x: re.sub("[२३०८१५७९४६]", "", x))

# Remove extra spaces
lines.eng=lines.eng.apply(lambda x: x.strip())
lines.mar=lines.mar.apply(lambda x: x.strip())
lines.eng=lines.eng.apply(lambda x: re.sub(" +", " ", x))
lines.mar=lines.mar.apply(lambda x: re.sub(" +", " ", x))

# Add start and end tokens to target sequences
lines.mar = lines.mar.apply(lambda x : 'START_ '+ x + ' _END')

Below we compute the vocabulary for both English and Marathi. We also compute the vocabulary sizes and the length of the maximum sequence for both languages. Finally, we create 4 Python dictionaries (two for each language) to convert a given token into an integer index and vice-versa.

# Vocabulary of English
all_eng_words = set()
for eng in lines.eng:
    for word in eng.split():
        if word not in all_eng_words:
            all_eng_words.add(word)

# Vocabulary of Marathi
all_marathi_words = set()
for mar in lines.mar:
    for word in mar.split():
        if word not in all_marathi_words:
            all_marathi_words.add(word)

# Max Length of source sequence
length_list = []
for l in lines.eng:
    length_list.append(len(l.split(' ')))
max_length_src = np.max(length_list)

# Max Length of target sequence
length_list = []
for l in lines.mar:
    length_list.append(len(l.split(' ')))
max_length_tar = np.max(length_list)

input_words = sorted(list(all_eng_words))
target_words = sorted(list(all_marathi_words))

# Calculate Vocab size for both source and target
num_encoder_tokens = len(all_eng_words)
num_decoder_tokens = len(all_marathi_words)
num_decoder_tokens += 1 # For zero padding

# Create word to token dictionary for both source and target
input_token_index = dict([(word, i+1) for i, word in enumerate(input_words)])
target_token_index = dict([(word, i+1) for i, word in enumerate(target_words)])

# Create token to word dictionary for both source and target
reverse_input_char_index = dict((i, word) for word, i in input_token_index.items())
reverse_target_char_index = dict((i, word) for word, i in target_token_index.items())

Then we make a 90–10 train and test split and write a Python generator function to load the data in batches as follows:

def generate_batch(X = X_train, y = y_train, batch_size = 128):
    ''' Generate a batch of data '''
    while True:
        for j in range(0, len(X), batch_size):
            encoder_input_data = np.zeros((batch_size, max_length_src),dtype='float32')
            decoder_input_data = np.zeros((batch_size, max_length_tar),dtype='float32')
            decoder_target_data = np.zeros((batch_size, max_length_tar, num_decoder_tokens),dtype='float32')
            for i, (input_text, target_text) in enumerate(zip(X[j:j+batch_size], y[j:j+batch_size])):
                for t, word in enumerate(input_text.split()):
                    encoder_input_data[i, t] = input_token_index[word] # encoder input seq
                for t, word in enumerate(target_text.split()):
                    if t<len(target_text.split())-1:
                        decoder_input_data[i, t] = target_token_index[word] # decoder input seq
                    if t>0:
                        # decoder target sequence (one hot encoded)
                        # does not include the START_ token
                        # Offset by one timestep
                        decoder_target_data[i, t - 1, target_token_index[word]] = 1.
            yield([encoder_input_data, decoder_input_data], decoder_target_data)

Then we define the model required for training as follows:

# Encoder
encoder_inputs = Input(shape=(None,))
enc_emb =  Embedding(num_encoder_tokens, latent_dim, mask_zero = True)(encoder_inputs)
encoder_lstm = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(enc_emb)

# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None,))
dec_emb_layer = Embedding(num_decoder_tokens, latent_dim, mask_zero = True)
dec_emb = dec_emb_layer(decoder_inputs)

# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(dec_emb, initial_state=encoder_states)

# Use a softmax to generate a probability distribution over the target vocabulary for each time step
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# Compile the model
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

You should be able to conceptually connect every line with the explanation I have provided in sections 4 and 5 above.

Let’s look at the model architecture generated from Keras’s plot_model utility.

Training the network

We train the network for 50 epochs with a batch size of 128. On a P4000 GPU, it takes slightly more than 2 hours to train.
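The training call itself is not shown above; assuming the `generate_batch` generator and the 90–10 split (`X_train`, `y_train`, `X_test`, `y_test`) from earlier, it would look roughly like the sketch below, using the `fit_generator` API (newer versions of Keras accept generators directly in `model.fit`):

```python
# Sketch of the training invocation; assumes `model`, `generate_batch`,
# and the train/test splits defined earlier in this post.
batch_size = 128
epochs = 50
train_samples = len(X_train)
val_samples = len(X_test)

model.fit_generator(
    generator=generate_batch(X_train, y_train, batch_size=batch_size),
    steps_per_epoch=train_samples // batch_size,
    epochs=epochs,
    validation_data=generate_batch(X_test, y_test, batch_size=batch_size),
    validation_steps=val_samples // batch_size)
```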

Inference Setup:

# Encode the input sequence to get the "thought vectors"
encoder_model = Model(encoder_inputs, encoder_states)

# Decoder setup
# Below tensors will hold the states of the previous time step
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

# Get the embeddings of the decoder sequence
dec_emb2= dec_emb_layer(decoder_inputs)

# To predict the next word in the sequence, set the initial states to the states from the previous time step
decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=decoder_states_inputs)
decoder_states2 = [state_h2, state_c2]

# A dense softmax layer to generate prob dist. over the target vocabulary
decoder_outputs2 = decoder_dense(decoder_outputs2)

# Final decoder model
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs2] + decoder_states2)

Finally, we generate the output sequence by invoking the above setup in a loop as follows:

def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)
    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1,1))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0] = target_token_index['START_']
    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)
        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += ' '+sampled_char
        # Exit condition: either hit max length or find stop token.
        if (sampled_char == '_END' or len(decoded_sentence) > 50):
            stop_condition = True
        # Update the target sequence (of length 1).
        target_seq = np.zeros((1,1))
        target_seq[0, 0] = sampled_token_index
        # Update states
        states_value = [h, c]
    return decoded_sentence

At this point, you must be able to conceptually connect every line of the code in the above two blocks with the explanation provided in section 6.

Results and Evaluation

The purpose here is to give an intuitive explanation of how to build basic sequence to sequence models using LSTMs, not to develop a top-quality language translator. So keep in mind that the results are not world-class (don’t start comparing with Google Translate), for many reasons. The most important reason is that the dataset is very small: only 33,000 pairs of sentences (yes, that is too few). If you want to improve the quality of the translations, I list some suggestions towards the end of this blog. However, for now, let’s see some results generated from the above model (they are not too bad either).

Evaluation on train dataset:

Input English sentence: it is a holiday tomorrow
Actual Marathi Translation:  उद्या सुट्टी आहे 
Predicted Marathi Translation:  उद्या नाताळ आहे

Input English sentence: i will give you this book
Actual Marathi Translation:  मी तुम्हाला हे पुस्तक देईन 
Predicted Marathi Translation:  मी तुला हे पुस्तक तुला दिलं

Input English sentence: this sentence has five words
Actual Marathi Translation:  या वाक्यात पाच शब्द आहेत 
Predicted Marathi Translation:  या पाच शब्द आहेत

Input English sentence: did you clean your room
Actual Marathi Translation:  तुझी खोली साफ केलीस का 
Predicted Marathi Translation:  तुझी खोली साफ केली का

Input English sentence: she wrapped herself in a blanket
Actual Marathi Translation:  त्यांनी स्वतःला एका चादरीत गुंडाळून घेतलं 
Predicted Marathi Translation:  तिने एका छोट्या चादर घातली

Input English sentence: he is still alive
Actual Marathi Translation:  तो अजूनही जिवंत आहे 
Predicted Marathi Translation:  ते अजूनही जिवंत आहेत

Input English sentence: you cant speak french can you
Actual Marathi Translation:  तुला फ्रेंच बोलता येत नाही ना 
Predicted Marathi Translation:  तुला फ्रेंच बोलता येत नाही का नाही माहीत

Evaluation on test dataset:

Input English sentence: my wife isnt beautiful yours is
Actual Marathi Translation:  माझी बायको सुंदर नाहीये तुझी आहे 
Predicted Marathi Translation:  माझी तुझी गर्लफ्रेंड सुंदर आहे भेटलो

Input English sentence: who lives in this house
Actual Marathi Translation:  या घरात कोण राहतं 
Predicted Marathi Translation:  या घरात कोणी असतो

Input English sentence: somethings happened to tom
Actual Marathi Translation:  टॉमला काहीतरी झालंय 
Predicted Marathi Translation:  टॉमला काहीतरी झालं आहे

Input English sentence: the dog jumped over a chair
Actual Marathi Translation:  कुत्र्याने खुर्चीवरून उडी मारली 
Predicted Marathi Translation:  एक कुत्र्याला दरवाजा बंद केला

Input English sentence: i can prove who the murderer is
Actual Marathi Translation:  खुनी कोण आहे हे मी सिद्ध करू शकतो 
Predicted Marathi Translation:  मी जिंकू शकतो हे मला माहीत आहे

Input English sentence: everyone could hear what tom said
Actual Marathi Translation:  टॉम जे म्हणाला ते सर्वांना ऐकू येत होतं 
Predicted Marathi Translation:  टॉमने काय म्हटलं म्हटलं तर खरं

Input English sentence: those are my books
Actual Marathi Translation:  ती माझी पुस्तकं आहेत 
Predicted Marathi Translation:  ती माझी पुस्तकं आहेत

What can we conclude?

Even though the results are not the best, they are not that bad either. They are certainly much better than what a randomly generated sequence would produce. In some sentences we can even note that the predicted words are not correct but are semantically quite close to the correct words.

Also, another point to be noticed is that the results on the training set are a bit better than the results on the test set, which indicates that the model might be over-fitting a bit.


If you are interested to improve the quality, you can try out the below measures:

a. Get much more data. Top-quality translators are trained on millions of sentence pairs.

b. Build more complex models like Attention.

c. Use dropout and other forms of regularization techniques to mitigate over-fitting.

d. Perform Hyper-parameter tuning. Play with learning rate, batch size, dropout rate, etc. Try using bidirectional Encoder LSTM. Try using multi-layered LSTMs.

e. Try using beam search instead of a greedy approach.

f. Try BLEU score to evaluate your model.

g. The list goes on.
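To give a feel for suggestion (e), here is a minimal, hedged sketch of beam search. The `next_probs` callable is hypothetical: it stands in for one decoder step, returning a probability distribution over the vocabulary given the partial sequence so far (in our model this would be one call to `decoder_model.predict`):

```python
import math

def beam_search(next_probs, start_id, end_id, beam_width=3, max_len=10):
    # Each hypothesis is (sequence of token ids, cumulative log-probability)
    beams = [([start_id], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_id:            # finished hypotheses carry over unchanged
                candidates.append((seq, score))
                continue
            probs = next_probs(seq)          # one decoder step: distribution over vocab
            for tok, p in enumerate(probs):
                if p > 0:
                    candidates.append((seq + [tok], score + math.log(p)))
        # Keep only the `beam_width` best hypotheses (vs. greedy's single best)
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
        if all(seq[-1] == end_id for seq, _ in beams):
            break
    return beams[0][0]   # best-scoring sequence overall
```

Unlike the greedy argmax in `decode_sequence`, this keeps several candidate translations alive at each step and can recover from a locally suboptimal first word.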

You can find the full code on the GitHub repo here.

Introduction to Attention Mechanism

Attention is one of the most influential ideas in the Deep Learning community. Even though this mechanism is now used in various problems like image captioning and others, it was initially designed in the context of Neural Machine Translation using Seq2Seq Models. We would be using attention to design a system that translates a given English sentence to Marathi.

So what’s wrong with seq2seq models?

A seq2seq model is normally composed of an encoder-decoder architecture, where the encoder processes the input sequence and encodes/compresses/summarizes the information into a context vector (also called the “thought vector”) of a fixed length. This representation is expected to be a good summary of the entire input sequence. The decoder is then initialized with this context vector, using which it starts generating the transformed output.

A critical and apparent disadvantage of this fixed-length context vector design is the system’s inability to remember longer sequences. Often the model has forgotten the earlier parts of the sequence by the time it has processed the whole of it. The attention mechanism was born to resolve this problem.

The central idea behind Attention

For illustrative purposes, we will borrow the same example that we used to explain Seq2Seq models.

Input (English) Sentence: “Rahul is a good boy”

Target (Marathi) Sentence: “राहुल चांगला मुलगा आहे”

The only change will be that instead of an LSTM layer that I used in the previous explanation, here I will use a GRU layer. The reason is that LSTM has two internal states (hidden state and cell state) and GRU has only one internal state (hidden state). This will help simplify the concept and explanation.

Recall the below diagram, in which I summarized the entire procedure of Seq2Seq modeling.

In the traditional Seq2Seq model, we discard all the intermediate states of the encoder and use only its final states (vector) to initialize the decoder. This technique works well for smaller sequences, however as the length of the sequence increases, a single vector becomes a bottleneck and it gets very difficult to summarize long sequences into a single vector. This observation was made empirically as it was noted that the performance of the system decreases drastically as the size of the sequence increases.

The central idea behind Attention is not to throw away those intermediate encoder states but to utilize all the states in order to construct the context vectors required by the decoder to generate the output sequence.

Why the name Attention?

Let’s name each of the intermediate states of the encoder as below:

Encoder GRU

Notice that since we are using a GRU instead of an LSTM, we only have a single state at each time step and not two states, which thus helps to simplify the illustration. Also, note that attention is useful, especially in the case of longer sequences but for the sake of simplicity we will consider the same above example for illustration.

Recall that these states (h1 to h5) are nothing but vectors of fixed length. To develop some intuition, think of these states as vectors that store local information within the sequence. For example:

h1 stores the information present at the start of the sequence (words like ‘Rahul’ and ‘is’) while h5 stores the information present in the later part of the sequence (words like ‘good’ and ‘boy’).

Let’s represent our Encoder GRU with the below simplified diagram:

Compact Representation of Encoder GRU

Now the idea is to utilize all of this local information collectively in order to decide the next word while decoding the target sentence. Imagine you are translating “Rahul is a good boy” to “राहुल चांगला मुलगा आहे”. Ask yourself: how do you do it in your mind?

When you predict “राहुल”, it’s obvious that this name is the result of the word “Rahul” present in the input English sentence regardless of the rest of the sentence. We say that while predicting “राहुल”, we pay more attention to the word “Rahul” in the input sentence.

Similarly while predicting the word “चांगला”, we pay more attention to the word “good” in the input sentence. Similarly while predicting the word “मुलगा”, we pay more attention to the word “boy” in the input sentence. And so on.

Hence the name “ATTENTION”.

As human beings, we are quickly able to understand these mappings between different parts of the input sequence and the corresponding parts of the output sequence. However, it’s not that straightforward for an artificial neural network to automatically detect these mappings. Thus the Attention mechanism was developed to “learn” these mappings through gradient descent and back-propagation.

How does Attention work?

Let’s get technical and dive into the nitty-gritty of the Attention mechanism.

Decoding at time step 1

Continuing the above example, let’s say we now want our decoder to start predicting the first word of the target sequence i.e. “राहुल”

At time step 1, we can break the entire process into five steps as below:

Decoding at time step 1

Before we start decoding, we first need to encode the input sequence into a set of internal states (in our case h1, h2, h3, h4, and h5). Now the hypothesis is that the next word in the output sequence is dependent on the current state of the decoder (decoder is also a GRU) as well as on the hidden states of the encoder. Thus at each time step, we consider these two things and follow the below steps:

Step 1 — Compute a score for each encoder state

Since we are predicting the first word itself, the decoder does not have any current internal state. For this reason, we will consider the last state of the encoder (i.e. h5) as the previous decoder state. Now using these two components (all the encoder states and the current state of the decoder), we will train a simple feed-forward neural network.


Recall we are trying to predict the first word in the target sequence i.e. “राहुल”. As per the idea behind attention, we do not need all the encoder states to predict this word, but we need those encoder states that store information about the word “Rahul” in the input sequence.

As discussed previously these intermediate encoder states store the local information of the input sequence. So it is highly likely that the information of the word “Rahul” will be present in the states, let’s say, h1 and h2.

Thus we want our decoder to pay more attention to the states h1 and h2 while paying less attention to the remaining states of the encoder. For this reason, we train a feed-forward neural network that will learn to identify relevant encoder states by generating a high score for the states for which attention is to be paid while a low score for the states which are to be ignored.

Let s1, s2, s3, s4, and s5 be the scores generated for the states h1, h2, h3, h4, and h5 respectively. Since we assumed that we need to pay more attention to the states h1 and h2 and ignore h3, h4, and h5 in order to predict “राहुल”, we expect the above neural network to generate scores such that s1 and s2 are high while s3, s4, and s5 are relatively low.
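A minimal numpy sketch of such a scoring network (an additive, Bahdanau-style scorer; all dimensions and weight matrices below are random stand-ins for illustration, not learned values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions, made up for illustration: 5 encoder states of size 4,
# and one decoder state of size 4.
hidden_size = 4
encoder_states = rng.normal(size=(5, hidden_size))  # h1..h5
decoder_state = rng.normal(size=(hidden_size,))     # previous decoder state (h5 at step 1)

# Hypothetical learned weights of the scoring network
# (they play the role of W1, W2 and V in the code later in this post).
W1 = rng.normal(size=(hidden_size, hidden_size))
W2 = rng.normal(size=(hidden_size, hidden_size))
V = rng.normal(size=(hidden_size, 1))

# Additive (Bahdanau-style) score: s_j = V^T tanh(W1·h_j + W2·d)
scores = np.tanh(encoder_states @ W1 + decoder_state @ W2) @ V

print(scores.ravel())  # one score per encoder state
```

In the real model these matrices are trained jointly with the rest of the network, which is what makes the scores meaningful.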

Step 2 — Compute the attention weights

Once these scores are generated, we apply a softmax on these scores to produce the attention weights e1, e2, e3, e4, and e5 as shown above. The advantage of applying softmax is as below:

a) All the weights lie between 0 and 1, i.e., 0 ≤ e1, e2, e3, e4, e5 ≤ 1

b) All the weights sum to 1, i.e., e1+e2+e3+e4+e5 = 1

Thus we get a nice probabilistic interpretation of the attention weights.

In our case, we would expect values like the below: (just for intuition)

e1 = 0.75, e2 = 0.2, e3 = 0.02, e4 = 0.02, e5 = 0.01

This means that while predicting the word “राहुल”, the decoder needs to put more attention on the states h1 and h2 (since values of e1 and e2 are high) while ignoring the states h3, h4, and h5 (since the values of e3, e4 and e5 are very small).
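The softmax step itself is easy to reproduce with numpy (the scores below are made-up values, chosen so the resulting weights land close to the intuition above):

```python
import numpy as np

scores = np.array([3.0, 1.7, -0.5, -0.6, -1.2])  # made-up scores s1..s5
weights = np.exp(scores) / np.exp(scores).sum()  # softmax -> attention weights e1..e5

print(weights.round(3))
```

Whatever the raw scores are, the resulting weights are always non-negative and sum to 1, which is exactly the probabilistic interpretation described above.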

Step 3 — Compute the context vector

Once we have computed the attention weights, we need to compute the context vector (thought vector) which will be used by the decoder in order to predict the next word in the sequence. It is calculated as follows:

context_vector = e1 * h1 + e2 * h2 + e3 * h3 + e4 * h4 + e5 * h5

Clearly, if the values of e1 and e2 are high and those of e3, e4 and e5 are low then the context vector will contain more information from the states h1 and h2 and relatively less information from the states h3, h4, and h5.
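This weighted sum can be sketched in a few lines of numpy (toy encoder states; the weights are the illustrative values from above):

```python
import numpy as np

weights = np.array([0.75, 0.2, 0.02, 0.02, 0.01])  # e1..e5 from the text
encoder_states = np.arange(20.0).reshape(5, 4)     # toy h1..h5, each of size 4

# context = e1*h1 + e2*h2 + e3*h3 + e4*h4 + e5*h5
context_vector = (weights[:, None] * encoder_states).sum(axis=0)

print(context_vector)  # ≈ [1.36 2.36 3.36 4.36]
```

Because e1 and e2 dominate, the context vector lies much closer to h1 and h2 than to the remaining states.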

Step 4 — Concatenate context vector with the output of the previous time step

Finally, the decoder uses the below two input vectors to generate the next word in the sequence

a) The context vector

b) The output word generated from the previous time step.

We simply concatenate these two vectors and feed the merged vector to the decoder. Note that for the first time step, since there is no output from the previous time step, we use a special <START> token for this purpose.
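The concatenation itself is trivial; a toy sketch (the vector sizes here are made up):

```python
import numpy as np

context_vector = np.ones(4)           # toy context vector of size 4
prev_output_embedding = np.zeros(3)   # toy embedding of the previous output word (or <START>)

# The decoder consumes the two vectors glued together
merged = np.concatenate([context_vector, prev_output_embedding])

print(merged.shape)  # (7,)
```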

Step 5 — Decoder Output

The decoder then generates the next word in the sequence (in this case, it is expected to generate “राहुल”). Along with the output, the decoder also generates an internal hidden state; let’s call it “d1”.

Decoding at time step 2

Now, in order to generate the next word “चांगला”, the decoder repeats the same procedure, which can be summarized in the below diagram (the changes are highlighted in green circles):

Decoding at time step 2

Decoding at time step 3

Decoding at time step 3

Decoding at time step 4

Decoding at time step 4

Decoding at time step 5

Decoding at time step 5

Once the decoder outputs the <END> token, we stop the generation process. Note that unlike the fixed context vector used for all the decoder time steps in the case of the traditional Seq2Seq models, here in the case of Attention, we compute a separate context vector for each time step by computing the attention weights every time. Thus using this mechanism our model is able to find interesting mappings between different parts of the input sequence and corresponding parts of the output sequence. Note that during the training of the network, we use teacher forcing in order to input the actual word rather than the predicted word from the previous time step.

Code Walkthrough

As in the case of any NLP task, after reading the input file, we perform the basic cleaning and preprocessing as follows:

# Set the file path
file_path = './mar.txt' # this might be different in your system

# read the file
lines = open(file_path, encoding='UTF-8').read().strip().split('\n')

# perform basic cleaning
exclude = set(string.punctuation) # Set of all special characters
remove_digits = str.maketrans('', '', string.digits) # Set of all digits

def preprocess_eng_sentence(sent):
    '''Function to preprocess English sentence'''
    sent = sent.lower()
    sent = re.sub("'", '', sent)
    sent = ''.join(ch for ch in sent if ch not in exclude)
    sent = sent.translate(remove_digits)
    sent = sent.strip()
    sent = re.sub(" +", " ", sent)
    sent = '<start> ' + sent + ' <end>'
    return sent

def preprocess_mar_sentence(sent):
    '''Function to preprocess Marathi sentence'''
    sent = re.sub("'", '', sent)
    sent = ''.join(ch for ch in sent if ch not in exclude)
    sent = re.sub("[२३०८१५७९४६]", "", sent)
    sent = sent.strip()
    sent = re.sub(" +", " ", sent)
    sent = '<start> ' + sent + ' <end>'
    return sent

# Generate pairs of cleaned English and Marathi sentences
sent_pairs = []
for line in lines:
    sent_pair = []
    eng, mar = line.split('\t')
    eng = preprocess_eng_sentence(eng)
    sent_pair.append(eng)
    mar = preprocess_mar_sentence(mar)
    sent_pair.append(mar)
    sent_pairs.append(sent_pair)

Create a class to map every word to an index and vice-versa for any given vocabulary:

# This class creates a word -> index mapping (e.g., "dad" -> 5) and vice-versa
# (e.g., 5 -> "dad") for each language
class LanguageIndex():
    def __init__(self, lang):
        self.lang = lang
        self.word2idx = {}
        self.idx2word = {}
        self.vocab = set()
        self.create_index()


    def create_index(self):
        for phrase in self.lang:
            self.vocab.update(phrase.split(' '))

        self.vocab = sorted(self.vocab)

        self.word2idx['<pad>'] = 0
        for index, word in enumerate(self.vocab):
            self.word2idx[word] = index + 1

        for word, index in self.word2idx.items():
            self.idx2word[index] = word

# Function to calculate maximum length of the sequence
def max_length(tensor):
    return max(len(t) for t in tensor)

We use the input pipeline to create the dataset and then load it later in mini-batches. To read more about the input pipeline in TensorFlow, go through the official documentation here and here.

def load_dataset(pairs, num_examples):
    # pairs => already created cleaned input, output pairs

    # index language using the class defined above    
    inp_lang = LanguageIndex(en for en, ma in pairs)
    targ_lang = LanguageIndex(ma for en, ma in pairs)
    # Vectorize the input and target languages
    # English sentences
    input_tensor = [[inp_lang.word2idx[s] for s in en.split(' ')] for en, ma in pairs]
    # Marathi sentences
    target_tensor = [[targ_lang.word2idx[s] for s in ma.split(' ')] for en, ma in pairs]
    # Calculate max_length of input and output tensor
    # Here, we'll set those to the longest sentence in the dataset
    max_length_inp, max_length_tar = max_length(input_tensor), max_length(target_tensor)
    # Padding the input and output tensor to the maximum length
    input_tensor = tf.keras.preprocessing.sequence.pad_sequences(input_tensor,
                                                                 maxlen=max_length_inp,
                                                                 padding='post')
    target_tensor = tf.keras.preprocessing.sequence.pad_sequences(target_tensor,
                                                                  maxlen=max_length_tar,
                                                                  padding='post')
    return input_tensor, target_tensor, inp_lang, targ_lang, max_length_inp, max_length_tar

# Create the tensors
input_tensor, target_tensor, inp_lang, targ_lang, max_length_inp, max_length_targ = load_dataset(sent_pairs, len(lines))

# Creating training and validation sets using a 90-10 split
input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(input_tensor, target_tensor, test_size=0.1, random_state = 101)

# Set the parameters of the model
BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 64  # mini-batch size used for training
N_BATCH = BUFFER_SIZE // BATCH_SIZE
embedding_dim = 256
units = 1024
vocab_inp_size = len(inp_lang.word2idx)
vocab_tar_size = len(targ_lang.word2idx)

# Create batch generator to be used by the model to load data in batches
dataset =, target_tensor_train)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)

Now using the model sub-classing API of TensorFlow, we define the model as follows. To read more about model sub-classing, read the official documentation here.

Note: Please read the comments in the below section of the code to get a better understanding of using the concepts we discussed above. Most of the important lines of the code point to the corresponding section of the explanation given above.

def gru(units):
    # If you have a GPU, we recommend using CuDNNGRU (provides a 3x speedup over GRU);
    # the code automatically does that.
    if tf.test.is_gpu_available():
        return tf.keras.layers.CuDNNGRU(units,
                                        return_sequences=True,
                                        return_state=True,
                                        recurrent_initializer='glorot_uniform')
    else:
        return tf.keras.layers.GRU(units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_activation='sigmoid',
                                   recurrent_initializer='glorot_uniform')

class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = gru(self.enc_units)
    def call(self, x, hidden):
        x = self.embedding(x)
        output, state = self.gru(x, initial_state = hidden)        
        return output, state
    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.enc_units))

class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = gru(self.dec_units)
        self.fc = tf.keras.layers.Dense(vocab_size)
        # used for attention
        self.W1 = tf.keras.layers.Dense(self.dec_units)
        self.W2 = tf.keras.layers.Dense(self.dec_units)
        self.V = tf.keras.layers.Dense(1)
    def call(self, x, hidden, enc_output):
        # enc_output shape == (batch_size, max_length, hidden_size)
        # hidden shape == (batch_size, hidden size)
        # hidden_with_time_axis shape == (batch_size, 1, hidden size)
        # we are doing this to perform addition to calculate the score
        hidden_with_time_axis = tf.expand_dims(hidden, 1)
        # score shape == (batch_size, max_length, 1)
        # we get 1 at the last axis because we are applying tanh(FC(EO) + FC(H)) to self.V
        # this is the step 1 described in the blog to compute scores s1, s2, ...
        score = self.V(tf.nn.tanh(self.W1(enc_output) + self.W2(hidden_with_time_axis)))
        # attention_weights shape == (batch_size, max_length, 1)
        # this is the step 2 described in the blog to compute attention weights e1, e2, ...
        attention_weights = tf.nn.softmax(score, axis=1)
        # context_vector shape after sum == (batch_size, hidden_size)
        # this is the step 3 described in the blog to compute the context_vector = e1*h1 + e2*h2 + ...
        context_vector = attention_weights * enc_output
        context_vector = tf.reduce_sum(context_vector, axis=1)
        # x shape after passing through embedding == (batch_size, 1, embedding_dim)
        x = self.embedding(x)
        # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
        # this is the step 4 described in the blog to concatenate the context vector with the output of the previous time step
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        # passing the concatenated vector to the GRU
        output, state = self.gru(x)
        # output shape == (batch_size * 1, hidden_size)
        output = tf.reshape(output, (-1, output.shape[2]))
        # output shape == (batch_size * 1, vocab)
        # this is the step 5 in the blog, to compute the next output word in the sequence
        x = self.fc(output)
        # return current output, current state and the attention weights
        return x, state, attention_weights
    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.dec_units))
# Create objects of Class Encoder and Class Decoder
encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)
decoder = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE)

Define Optimizer, Loss Function, and Checkpoints

optimizer = tf.train.AdamOptimizer()

def loss_function(real, pred):
    mask = 1 - np.equal(real, 0)
    loss_ = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=real, logits=pred) * mask
    return tf.reduce_mean(loss_)
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder)
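The mask in `loss_function` above zeroes out the loss on `<pad>` positions (index 0), so padding does not contribute to the gradient. A small numpy illustration (the token ids and per-token losses are hypothetical):

```python
import numpy as np

real = np.array([12, 7, 0, 0])                   # target ids; the trailing zeros are <pad>
per_token_loss = np.array([0.5, 1.5, 2.0, 3.0])  # hypothetical per-token losses

mask = 1 - np.equal(real, 0)                     # -> [1, 1, 0, 0]
masked_loss = per_token_loss * mask              # padding positions contribute 0

print(masked_loss.mean())  # (0.5 + 1.5) / 4 = 0.5
```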

Using Eager Execution, we train the network for 10 epochs. To read more about Eager Execution, refer to the official documentation here.


EPOCHS = 10
for epoch in range(EPOCHS):
    start = time.time()
    hidden = encoder.initialize_hidden_state()
    total_loss = 0
    for (batch, (inp, targ)) in enumerate(dataset):
        loss = 0
        with tf.GradientTape() as tape:
            enc_output, enc_hidden = encoder(inp, hidden)
            dec_hidden = enc_hidden
            dec_input = tf.expand_dims([targ_lang.word2idx['<start>']] * BATCH_SIZE, 1)       
            # Teacher forcing - feeding the target as the next input
            for t in range(1, targ.shape[1]):
                # passing enc_output to the decoder
                predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
                loss += loss_function(targ[:, t], predictions)
                # using teacher forcing
                dec_input = tf.expand_dims(targ[:, t], 1)
        batch_loss = (loss / int(targ.shape[1]))
        total_loss += batch_loss
        variables = encoder.variables + decoder.variables
        gradients = tape.gradient(loss, variables)
        optimizer.apply_gradients(zip(gradients, variables))
        if batch % 100 == 0:
            print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                         batch,
                                                         batch_loss.numpy()))
    # saving (checkpoint) the model every epoch = checkpoint_prefix)
    print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                        total_loss / N_BATCH))
    print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))

Inference setup and testing:

def evaluate(inputs, encoder, decoder, inp_lang, targ_lang, max_length_inp, max_length_targ):
    attention_plot = np.zeros((max_length_targ, max_length_inp))
    sentence = ''
    for i in inputs[0]:
        if i == 0:
            continue
        sentence = sentence + inp_lang.idx2word[i] + ' '
    sentence = sentence[:-1]
    inputs = tf.convert_to_tensor(inputs)
    result = ''

    hidden = [tf.zeros((1, units))]
    enc_out, enc_hidden = encoder(inputs, hidden)

    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([targ_lang.word2idx['<start>']], 0)

    # start decoding
    for t in range(max_length_targ): # limit the length of the decoded sequence
        predictions, dec_hidden, attention_weights = decoder(dec_input, dec_hidden, enc_out)
        # storing the attention weights to plot later on
        attention_weights = tf.reshape(attention_weights, (-1, ))
        attention_plot[t] = attention_weights.numpy()

        predicted_id = tf.argmax(predictions[0]).numpy()

        result += targ_lang.idx2word[predicted_id] + ' '

        # stop decoding if '<end>' is predicted
        if targ_lang.idx2word[predicted_id] == '<end>':
            return result, sentence, attention_plot
        # the predicted ID is fed back into the model
        dec_input = tf.expand_dims([predicted_id], 0)

    return result, sentence, attention_plot

def predict_random_val_sentence():
    actual_sent = ''
    k = np.random.randint(len(input_tensor_val))
    random_input = input_tensor_val[k]
    random_output = target_tensor_val[k]
    random_input = np.expand_dims(random_input,0)
    result, sentence, attention_plot = evaluate(random_input, encoder, decoder, inp_lang, targ_lang, max_length_inp, max_length_targ)
    print('Input: {}'.format(sentence[8:-6]))
    print('Predicted translation: {}'.format(result[:-6]))
    for i in random_output:
        if i == 0:
            continue
        actual_sent = actual_sent + targ_lang.idx2word[i] + ' '
    actual_sent = actual_sent[8:-7]
    print('Actual translation: {}'.format(actual_sent))
    attention_plot = attention_plot[:len(result.split(' '))-2, 1:len(sentence.split(' '))-1]
    sentence, result = sentence.split(' '), result.split(' ')
    sentence = sentence[1:-1]
    result = result[:-2]
    # use plotly to plot the heatmap
    trace = go.Heatmap(z = attention_plot, x = sentence, y = result, colorscale='Reds')
    iplot([trace])  # requires iplot from plotly.offline

# Finally, call the function multiple times to visualize random results from the test set
predict_random_val_sentence()

Visualizing the Results

If you are new to heat maps, this is how you can interpret the above plot:

Notice that the cell at the intersection of “father” and “बाबांनी” is pretty dark. This means when the decoder predicts the word “बाबांनी”, it is paying more attention to the input word “father” (which is what we wanted). Similarly while predicting the word “कॅमेरा”, the decoder pays a lot of attention to the input word “camera”. And so on.


The first thing to be noted is that the translation results are much better than the previous results. Secondly, the model is able to find the correct local mappings between the input and the output sequences which do match our intuition. Given more data and with more hyper-parameter tuning, the results and mappings will definitely improve by a good margin.

Using LSTM layers in place of GRU and adding a Bidirectional wrapper on the encoder will also help in improved performance.

Deep Learning models are generally considered black boxes, meaning that they do not have the ability to explain their outputs. However, Attention is one of the successful methods that help to make our model interpretable and explain why it does what it does.

The main disadvantage of the Attention mechanism is that it is time-consuming and hard to parallelize. To solve this problem, Google Brain came up with the “Transformer” model, which uses only Attention and gets rid of all the Convolutional and Recurrent layers, making it highly parallelizable and computationally efficient.


Amir Masoud Sefidian
Machine Learning Engineer
