This post gives a high-level explanation of what an attention mechanism is in deep learning, and details a few technical steps in the computation of attention.
If you’re looking for more equations or examples, the references provide many more details, in particular the recent review by Cho et al. [3].
Unfortunately, these models are not always straightforward to implement yourself, and only a few open-source implementations have been released so far.
Neural processes involving attention have been extensively studied in Neuroscience and Computational Neuroscience [1, 2]. A particularly well-studied aspect is visual attention: many animals focus on specific parts of their visual inputs to compute adequate responses. This principle has a large impact on neural computation, because we need to select the most pertinent pieces of information rather than use all the available information, a large part of which is irrelevant to computing the neural response.
A similar idea (focusing on specific parts of the input) has been applied in Deep Learning for speech recognition, translation, reasoning, and visual identification of objects.
Attention for Image Captioning
Let’s introduce an example to explain the attention mechanism. The task we want to achieve is image captioning: we want to generate a caption for a given image.
A « classic » image captioning system would encode the image using a pre-trained Convolutional Neural Network that produces a hidden state $h$. Then, it would decode this hidden state with a Recurrent Neural Network (RNN) and recursively generate each word of the caption. Such a method has been applied by several groups (see image below):
The problem with this method is that, when the model is trying to generate the next word of the caption, this word usually describes only a part of the image. Using the whole representation of the image to condition the generation of each word cannot efficiently produce different words for different parts of the image. This is exactly where an attention mechanism is helpful.
With an attention mechanism, the image is first divided into $n$ parts, and we compute with a Convolutional Neural Network (CNN) a representation $h_1, \dots, h_n$ of each part. When the RNN is generating a new word, the attention mechanism focuses on the relevant part of the image, so the decoder only uses specific parts of the image.
In the figure below (upper row), we can see, for each word of the caption, which part of the image (in white) is used to generate it.
For more examples, we can look at the « relevant » part of each of these images used to generate the underlined word.
Examples of attending the correct object. (Taken from [4])
We are now going to explain how an attention model works in a general setting. A comprehensive review of the applications of attention models [3] details the implementation of an attention-based Encoder-Decoder Network.
What is an attention model?
What is an attention model, in a general setting?
An attention model is a method that takes $n$ arguments $y_1, \dots, y_n$ (in the preceding examples, the $y_i$ would be the $h_i$) and a context $c$. It returns a vector $z$ which is supposed to be the « summary » of the $y_i$, focusing on the information linked to the context $c$. More formally, it returns a weighted arithmetic mean of the $y_i$, where the weights are chosen according to the relevance of each $y_i$ given the context $c$.
In the example presented before, the context is the beginning of the generated sentence, the $y_i$ are the representations of the parts of the image ($h_i$), and the output is a representation of the filtered image, with a filter putting the focus on the interesting part for the word currently being generated.
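As a compact summary of this definition, here is the weighted mean written out. The notation is an assumption made for this sketch: $y_1, \dots, y_n$ are the inputs, $c$ is the context, $s_i$ is the weight given to $y_i$, and $z$ is the output.

$$
z = \sum_{i=1}^{n} s_i \, y_i, \qquad s_i \ge 0, \qquad \sum_{i=1}^{n} s_i = 1,
$$

where each $s_i$ grows with the relevance of $y_i$ given $c$. For instance, with $n = 2$ and weights $(s_1, s_2) = (0.9, 0.1)$, the output is $z = 0.9\, y_1 + 0.1\, y_2$, which is almost a copy of $y_1$.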
One interesting feature of attention models is that the weights of the arithmetic mean are accessible and can be plotted. These are exactly the figures we showed before: a pixel is whiter when the weight of the corresponding part of the image is high.
But what exactly is this black box doing? A figure for the whole attention model would be this one:
This network may seem complicated, but we are going to explain it step by step.
First, we recognize the inputs: $c$ is the context, and the $y_i$ are the « parts of the data » we are looking at.
At the next step, the network computes $m_1, \dots, m_n$ with a tanh layer. This means that we compute an « aggregation » of the values of $y_i$ and $c$. An important remark here is that each $m_i$ is computed without looking at the other $y_j$ for $j \neq i$. They are computed independently.
Then, we compute each weight using a softmax. The softmax, as its name says, behaves almost like an argmax, but is differentiable. Let’s say that we have an argmax function such that $\mathrm{argmax}(x_1, \dots, x_n) = (0, \dots, 0, 1, 0, \dots, 0)$, where the only 1 in the output tells which input is the max. Then, the softmax is defined by $\mathrm{softmax}(x_1, \dots, x_n) = \left( \frac{e^{x_i}}{\sum_j e^{x_j}} \right)_i$. If one of the $x_i$ is much bigger than the others, then $\mathrm{softmax}(x_1, \dots, x_n)$ will be very close to $\mathrm{argmax}(x_1, \dots, x_n)$.
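To make the comparison between softmax and argmax concrete, here is a minimal sketch in plain Python. The function names `softmax` and `argmax_one_hot` and the example scores are illustrative choices for this sketch, not something taken from the references.

```python
import math

def softmax(xs):
    """Numerically stable softmax of a list of floats."""
    shift = max(xs)  # subtracting the max does not change the result
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def argmax_one_hot(xs):
    """One-hot vector with a 1 at the position of the largest input."""
    best = max(range(len(xs)), key=lambda i: xs[i])
    return [1.0 if i == best else 0.0 for i in range(len(xs))]

scores = [1.0, 2.0, 8.0]       # one score is clearly larger than the others
soft = softmax(scores)         # smooth weights that sum to 1
hard = argmax_one_hot(scores)  # exactly one 1, zeros elsewhere

print([round(p, 4) for p in soft])
print(hard)
```

With one dominant score, almost all of the softmax mass goes to the same position as the argmax, while every weight stays strictly positive, which is what keeps the operation differentiable.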
Here, the weights $s_i$ are the softmax of the $m_i$ projected on a learned direction. So the softmax can be thought of as the max of the « relevance » of the variables, according to the context.
The output $z$ is the weighted arithmetic mean of all the $y_i$, where each weight $s_i$ represents the relevance of the corresponding variable according to the context $c$.
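The three steps described above (aggregation with the context, softmax over projected scores, weighted mean) can be sketched as follows in plain Python. The parameter names `W_y`, `W_c` and `w`, the specific form `tanh(W_y y_i + W_c c)` of the aggregation, and the random values standing in for learned parameters are all assumptions made for this illustration.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, v):
    return [dot(row, v) for row in W]

def softmax(xs):
    shift = max(xs)
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(ys, c, W_y, W_c, w):
    """Return (z, s): the attended summary and the attention weights."""
    proj_c = matvec(W_c, c)  # the context projection is shared by all inputs
    scores = []
    for y in ys:
        # m mixes this input with the context only, never with another input
        m = [math.tanh(a + b) for a, b in zip(matvec(W_y, y), proj_c)]
        scores.append(dot(w, m))  # projection on the (assumed learned) direction w
    s = softmax(scores)
    # z is the weighted arithmetic mean of the inputs
    z = [sum(s_i * y[k] for s_i, y in zip(s, ys)) for k in range(len(ys[0]))]
    return z, s

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

n, d_y, d_c, d_m = 4, 3, 2, 5         # toy sizes chosen for the example
ys = rand_matrix(n, d_y)              # the n inputs
c = [rng.uniform(-1, 1) for _ in range(d_c)]
W_y = rand_matrix(d_m, d_y)           # random stand-ins for learned weights
W_c = rand_matrix(d_m, d_c)
w = [rng.uniform(-1, 1) for _ in range(d_m)]

z, s = soft_attention(ys, c, W_y, W_c, w)
print("weights:", [round(p, 3) for p in s])
print("summary:", [round(v, 3) for v in z])
```

Because the weights are positive and sum to 1, each coordinate of the summary lies between the smallest and largest value of that coordinate among the inputs.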
Another computation of « relevance »
The attention model presented above can be modified. First, the tanh layer can be replaced by any other network. The only important thing is that this function mixes $c$ and the $y_i$. One version in use computes only a dot product between $c$ and each $y_i$.
This version is even easier to understand: the attention model is « softly choosing » the variable most correlated with the context. As far as we know, both systems seem to produce comparable results.
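A minimal sketch of the dot-product variant, in plain Python, is given below. The function name `dot_product_attention` and the toy vectors are illustrative assumptions; note that this variant requires the inputs and the context to have the same dimension.

```python
import math

def softmax(xs):
    shift = max(xs)
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_attention(ys, c):
    """Soft attention where relevance is the dot product between each input and c."""
    scores = [sum(a * b for a, b in zip(y, c)) for y in ys]
    s = softmax(scores)
    z = [sum(s_i * y[k] for s_i, y in zip(s, ys)) for k in range(len(ys[0]))]
    return z, s

# Toy inputs: the first one points in the same direction as the context.
ys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
c = [4.0, 0.0]

z, s = dot_product_attention(ys, c)
print("weights:", [round(p, 4) for p in s])
print("summary:", [round(v, 4) for v in z])
```

Here the scores are 4, 0 and -4, so the input aligned with the context receives nearly all of the weight and the summary is close to that input.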
Another important modification is hard attention.
Soft Attention and Hard Attention
The mechanism we described previously is called « Soft attention » because it is a fully differentiable deterministic mechanism that can be plugged into an existing system, and the gradients are propagated through the attention mechanism at the same time they are propagated through the rest of the network.
Hard attention is a stochastic process: instead of using all the hidden states as an input for the decoding, the system samples a single hidden state $y_i$ with probability $s_i$. Since sampling is not differentiable, we estimate the gradient by Monte Carlo sampling in order to propagate it through this process.
A Hard Attention model. The output is a random choice of one of the $y_i$, with probability $s_i$.
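The sampling step of hard attention can be sketched as follows in plain Python. The function name `hard_attention`, the toy inputs and the fixed weights are assumptions for this illustration, and the gradient estimator itself is not shown. The sketch only checks one simple fact: averaged over many samples, the hard-attention output approaches the soft-attention output.

```python
import random

def hard_attention(ys, s, rng):
    """Sample one input with probability given by its weight; return (index, input)."""
    i = rng.choices(range(len(ys)), weights=s, k=1)[0]
    return i, ys[i]

ys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy inputs
s = [0.7, 0.2, 0.1]                        # toy attention weights (sum to 1)
rng = random.Random(0)

# Soft attention output: the weighted mean of all inputs, here about [0.8, 0.3].
soft_z = [sum(s_i * y[k] for s_i, y in zip(s, ys)) for k in range(2)]

# Monte Carlo average of many hard-attention outputs.
num_samples = 20000
mean = [0.0, 0.0]
for _ in range(num_samples):
    _, y = hard_attention(ys, s, rng)
    mean = [m + v / num_samples for m, v in zip(mean, y)]

print("soft output:      ", [round(v, 3) for v in soft_z])
print("mean hard output: ", [round(v, 3) for v in mean])
```

A single hard-attention output is always exactly one of the inputs; only its average over samples matches the soft output, which is why quantities involving it have to be estimated by sampling.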
Both systems have their pros and cons, but the trend is to focus on soft attention mechanisms, as the gradient can be computed directly instead of being estimated through a stochastic process.
Back to image captioning
Now, we are able to understand how the image captioning system presented before works.
Attention model for image captioning
We can recognise the figure of the « classic » model for image captioning, but with a new attention layer. What happens when we want to predict the next word of the caption? If we have predicted $t$ words, the hidden state of the LSTM is $h_t$. We select the « relevant » part of the image by using $h_t$ as the context. Then, the output of the attention model $z_t$, which is the representation of the image filtered so that only the relevant parts remain, is used as an input for the LSTM. Then, the LSTM predicts a new word and returns a new hidden state $h_{t+1}$.
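The decoding loop described above can be sketched as follows in plain Python. Several simplifications are assumed and are not the method of the cited papers: a toy tanh recurrent cell (`recurrent_step`) stands in for the LSTM, relevance is a plain dot product, random vectors stand in for the CNN features and the learned weights, and the word prediction itself is left out.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, v):
    return [dot(row, v) for row in W]

def softmax(xs):
    shift = max(xs)
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(parts, context):
    """Soft attention over image parts with dot-product relevance (assumed)."""
    s = softmax([dot(p, context) for p in parts])
    z = [sum(s_i * p[k] for s_i, p in zip(s, parts)) for k in range(len(parts[0]))]
    return z, s

def recurrent_step(h, z, W_h, W_z):
    """Toy tanh cell standing in for the LSTM: h_next = tanh(W_h h + W_z z)."""
    return [math.tanh(a + b) for a, b in zip(matvec(W_h, h), matvec(W_z, z))]

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

d, num_parts, num_words = 4, 6, 3
parts = rand_matrix(num_parts, d)  # stand-ins for the CNN features of each image part
W_h = rand_matrix(d, d)            # stand-ins for learned recurrent weights
W_z = rand_matrix(d, d)

h = [0.0] * d                      # hidden state before any word is predicted
attention_maps = []
for t in range(num_words):
    z, s = attend(parts, h)                # the hidden state is the context
    attention_maps.append(s)               # one attention map per generated word
    h = recurrent_step(h, z, W_h, W_z)     # the filtered image updates the state

for t, s in enumerate(attention_maps):
    print(t, [round(p, 3) for p in s])
```

With an all-zero initial state every score is zero, so the first attention map is uniform; the later maps differ because the hidden state has changed. Each stored map is what would be plotted as the white regions over the image.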
Learning to Align in Machine Translation
The work by Bahdanau et al. [5] proposes a neural translation model that learns to translate sentences from one language to another and introduces an attention mechanism.
Before explaining the attention mechanism, let us see how the vanilla neural translation model using an encoder-decoder works. The encoder, a Recurrent Neural Network (RNN, usually a GRU or an LSTM), is fed a sentence in English and produces a hidden state $h$. This hidden state conditions the decoder RNN to produce the right output sentence in French.
A model for translation without attention.
For translation, we have the same intuition as for image captioning. When we are generating a new word, we are usually translating a single word of the original language. An attention model makes it possible, for each new word, to focus on a part of the original text.
The only difference between this model and the image captioning model is that the $h_i$ are the successive hidden states of an RNN.
Attention model for Translation.
Instead of producing just a single hidden state corresponding to the whole sentence, the encoder produces $T$ hidden states $h_1, \dots, h_T$, each corresponding to a word. Each time the decoder RNN produces a word, it determines the contribution of each hidden state to take as input, usually mostly a single one (see figure below). The contribution is computed using a softmax: this means that attention weights $a_1, \dots, a_T$ are computed such that $\sum_{i=1}^{T} a_i = 1$, and all hidden states $h_i$ contribute to the decoding with weight $a_i$.
In our case, the attention mechanism is fully differentiable and does not require any additional supervision: it is simply added on top of an existing Encoder-Decoder.
This process can be seen as an alignment, because the network usually learns to focus on a single input word each time it produces an output word. This means that most of the attention weights are close to 0 (black) while a single one is activated (white). The image below shows the attention weights during the translation process, which reveals the alignment and makes it possible to interpret what the network has learnt (interpretability is usually a problem with RNNs!).
Word alignment in translation with an attention model. (Taken from [5])
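To illustrate how an alignment can be read off the attention weights, here is a small sketch in plain Python. The encoder states and decoder contexts are hand-built toy vectors, not the output of a trained model, and the word lists are only an example chosen to show a reordering between the two languages.

```python
import math

def softmax(xs):
    shift = max(xs)
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

source = ["European", "Economic", "Area"]
target = ["zone", "économique", "européenne"]

# Hand-built stand-ins for the encoder hidden states (one per source word).
encoder_states = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]
# Hand-built stand-ins for the decoder contexts (one per target word).
decoder_contexts = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]

# One row of attention weights per target word, one column per source word.
attention = []
for c in decoder_contexts:
    scores = [sum(a * b for a, b in zip(h, c)) for h in encoder_states]
    attention.append(softmax(scores))

# The alignment is the source word with the largest weight in each row.
alignment = []
for word, row in zip(target, attention):
    best = max(range(len(row)), key=lambda i: row[i])
    alignment.append((word, source[best]))

for row in attention:
    print([round(a, 3) for a in row])
print(alignment)
```

Each row of the matrix is what one row of the alignment figure displays: one weight close to 1 and the others close to 0, with the bright cell moving backwards here because the word order is reversed between the two languages.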
Attention without Recurrent Neural Networks
Up to now, we have only described attention models in an encoder-decoder framework (i.e. with RNNs). However, when the order of the inputs does not matter, it is possible to consider independent hidden states $h_j$. This is the case, for instance, in Raffel et al. [10], where the attention model is fully feed-forward. The same applies to the simple case of Memory Networks [6] (see next section).
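A feed-forward attention of this kind can be sketched as follows in plain Python: each hidden state is scored on its own, without any recurrent context, and the weighted mean gives a fixed-size summary. The scoring function `tanh(w · h + b)` and the values of `w` and `b` are assumptions standing in for learned parameters.

```python
import math

def softmax(xs):
    shift = max(xs)
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def feed_forward_attention(hs, w, b):
    """Context-free attention: the score of each state depends on that state only."""
    scores = [math.tanh(sum(wi * hi for wi, hi in zip(w, h)) + b) for h in hs]
    alphas = softmax(scores)
    summary = [sum(a * h[k] for a, h in zip(alphas, hs)) for k in range(len(hs[0]))]
    return summary, alphas

hs = [[0.1, 0.2], [2.0, 1.0], [0.0, -0.5]]  # toy independent hidden states
w, b = [1.0, 1.0], 0.0                      # hypothetical learned parameters

summary, alphas = feed_forward_attention(hs, w, b)
print("weights:", [round(a, 3) for a in alphas])
print("summary:", [round(v, 3) for v in summary])
```

Since every state is scored independently and the result is a weighted mean, shuffling the order of the hidden states leaves the summary unchanged, which matches the setting where input order does not matter.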
From Attention to Memory Addressing
NIPS 2015 hosted a very interesting (and packed!) workshop called RAM, for Reasoning, Attention and Memory. It included works on attention, but also on Memory Networks [6], Neural Turing Machines [7], Differentiable Stack RNNs [8] and many others. These models all have in common that they use a form of external memory which they can read from (and possibly write to).
Comparing and explaining these models is beyond the scope of this post, but the link between attention mechanisms and memory is interesting.
In Memory Networks for instance, we consider an external memory (a set of facts or sentences $x_i$) and an input $q$.
The network learns to address the memory, that is, to select which fact to focus on to produce the answer. This corresponds exactly to an attention mechanism over the external memory. In Memory Networks, the only difference is that the soft selection of the facts (blue Embedding A in the image below) is decoupled from the weighted sum of the embeddings of the facts (pink embedding C in the image). In Neural Turing Machines, and in many very recent memory-based QA models, a soft attention mechanism is used. These models will be the subject of a following post.
Memory Network. (Taken from [6])
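The memory addressing described above can be sketched as a single read operation in plain Python. The names `A` and `C` follow the embeddings mentioned in the text; the name `u` for the embedded input, the function name `memory_read`, and all numerical values are assumptions made for this toy example.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, v):
    return [dot(row, v) for row in W]

def softmax(xs):
    shift = max(xs)
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def memory_read(facts, u, A, C):
    """Address the memory with embedding A, then read it out with embedding C."""
    # Soft selection of the facts: match each A-embedded fact against the input.
    p = softmax([dot(matvec(A, x), u) for x in facts])
    # Weighted sum of the C-embedded facts, using the selection weights.
    o = [0.0] * len(C)
    for p_i, x in zip(p, facts):
        c_i = matvec(C, x)
        o = [acc + p_i * v for acc, v in zip(o, c_i)]
    return o, p

# Three toy facts, encoded as one-hot vectors over a vocabulary of size 3.
facts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[3.0, 0.0, 0.0], [0.0, 3.0, 0.0]]  # addressing embedding (toy values)
C = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]  # output embedding (toy values)
u = [1.0, 0.0]                          # embedded input, assumed already computed

o, p = memory_read(facts, u, A, C)
print("selection weights:", [round(v, 4) for v in p])
print("read-out:         ", [round(v, 4) for v in o])
```

The selection weights come from one embedding of the facts and the read-out from another, which is the only difference from the attention model presented earlier, where the same vectors are both scored and averaged.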
Attention mechanisms and other fully differentiable addressable memory systems are being extensively studied by many researchers right now. Even though they are still young and not yet deployed in real-world systems, they have been shown to beat the state of the art on many problems where the encoder-decoder framework held the previous record.
At Heuritech, we became interested in attention mechanisms a few months ago and organised a workshop to get our hands dirty and code an encoder-decoder with an attention mechanism. While we do not use attention mechanisms in production yet, we expect them to play an important role in advanced text understanding where some reasoning is necessary, in a similar manner to the recent work by Hermann et al. [9].
In a separate blog post, I will elaborate on what we’ve learnt during the workshop and the recent advances that were presented at the RAM workshop.
Léonard Blier and Charles Ollion
We thank Mickael Eickenberg and Olivier Grisel for their helpful remarks.
[1] Itti, Laurent, Christof Koch, and Ernst Niebur. « A model of saliency-based visual attention for rapid scene analysis. » IEEE Transactions on Pattern Analysis & Machine Intelligence 11 (1998): 1254-1259.
[2] Desimone, Robert, and John Duncan. « Neural mechanisms of selective visual attention. » Annual review of neuroscience 18.1 (1995): 193-222.
[3] Cho, Kyunghyun, Aaron Courville, and Yoshua Bengio. « Describing Multimedia Content using Attention-based Encoder–Decoder Networks. » arXiv preprint arXiv:1507.01053 (2015).
[4] Xu, Kelvin, et al. « Show, attend and tell: Neural image caption generation with visual attention. » arXiv preprint arXiv:1502.03044 (2015).
[5] Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. « Neural machine translation by jointly learning to align and translate. » arXiv preprint arXiv:1409.0473 (2014).
[6] Sukhbaatar, Sainbayar, Jason Weston, and Rob Fergus. « End-to-end memory networks. » Advances in Neural Information Processing Systems (2015).
[7] Graves, Alex, Greg Wayne, and Ivo Danihelka. « Neural Turing Machines. » arXiv preprint arXiv:1410.5401 (2014).
[8] Joulin, Armand, and Tomas Mikolov. « Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets. » arXiv preprint arXiv:1503.01007 (2015).
[9] Hermann, Karl Moritz, et al. « Teaching machines to read and comprehend. » Advances in Neural Information Processing Systems (2015).
[10] Raffel, Colin, and Daniel PW Ellis. « Feed-Forward Networks with Attention Can Solve Some Long-Term Memory Problems. » arXiv preprint arXiv:1512.08756 (2015).