*Foreword: The author is Hedi Ben Younes, PhD student at LIP6 / Heuritech. Multimodal fusion of text and image information is an important topic at Heuritech, as most of the media on the internet is composed of images, videos and text. The challenging Visual Question Answering task is an excellent benchmark for the fusion of text and image.*

In this blog post, I present work done by Hédi Ben-younes*, Rémi Cadène*, Matthieu Cord and Nicolas Thome [1]. The paper has been accepted at the International Conference on Computer Vision (ICCV) 2017, and will be presented at the poster session. To delve deeper into our work, you can:

- Read the article
- Use our PyTorch code

*Equal contribution

# Visual Question Answering

The goal of Visual Question Answering (VQA) is to build a system that can answer questions about images.

Solving the VQA task would significantly widen the possibilities of human-machine interfaces, allowing users to dynamically extract the information they need from a picture.

In the shorter term, and as pointed out in the foreword, VQA provides a benchmark for multimodal representation methods. We can use this task to develop methods for problems where the inputs are intrinsically multimodal, and where the output depends heavily on the combination of modalities.

To solve this problem, precise image and text models (*monomodal representations*) are required. But more importantly, high-level interactions between these two modalities have to be carefully encoded in the model in order to provide the correct answer.

This projection from the monomodal spaces to a *multimodal space* is supposed to model the relevant correlations between the two spaces. Besides, the model must be able to understand the full scene, focus its attention on the relevant visual regions and discard the information that is irrelevant to the question.

# MUTAN

## Global Framework

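We cast the visual question answering task as a classification problem. Given a question $q$ about an image $v$, we want the predicted answer $\hat{a}$ to match the correct one $a^\star$:

$$\hat{a} = \arg\max_{a \in \mathcal{A}} \; p(a \mid v, q; \Theta)$$

where $\mathcal{A}$ is the set of candidate answers and $\Theta$ denotes the model parameters.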

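To this end, we first represent the image and the question using powerful monomodal embeddings. We use ResNet-152 to produce an image embedding $\mathbf{v} \in \mathbb{R}^{d_v}$ and a GRU to yield a question embedding $\mathbf{q} \in \mathbb{R}^{d_q}$. The supervision is given by a vector $\mathbf{y} \in \mathbb{R}^{|\mathcal{A}|}$.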

We explore the problem of multimodal embedding: how do we learn a multimodal embedding from a composition of monomodal ones? In other words, if the model is

$$\mathbf{y} = f(\mathbf{q}, \mathbf{v})$$

what should we put in $f$?

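Bilinear models are an appealing solution to the fusion problem, since they encode a fully-parametrized interaction between the two embedding spaces. The general form of a bilinear projection between $\mathbf{q}$ and $\mathbf{v}$ is

$$\mathbf{y} = (\mathcal{T} \times_1 \mathbf{q}) \times_2 \mathbf{v}, \qquad \text{i.e.} \qquad \mathbf{y}[k] = \sum_{i,j} \mathcal{T}[i,j,k] \, \mathbf{q}[i] \, \mathbf{v}[j]$$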

We note that this full bilinear model introduces a tensor $\mathcal{T} \in \mathbb{R}^{d_q \times d_v \times |\mathcal{A}|}$, whose number of parameters is the product of the three dimensions. For a realistic answer vocabulary size $|\mathcal{A}|$, $\mathcal{T}$ becomes far too large both to store and to learn. To reduce the number of parameters, we add some structure to the tensor using the Tucker Decomposition [3]. It consists of writing the very large bilinear interaction as a smaller bilinear interaction between projections of the input representations.
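To make the size problem concrete, here is a minimal NumPy sketch of the full bilinear form on toy dimensions, together with a parameter count. The function names are hypothetical, and the "realistic" sizes used in the count (2400 for the question embedding, 2048 for the image embedding, 2000 answers) are illustrative assumptions rather than values taken from this post.

```python
import numpy as np


def full_bilinear(T, q, v):
    """Full bilinear fusion: y[k] = sum_{i,j} T[i, j, k] * q[i] * v[j]."""
    return np.einsum("ijk,i,j->k", T, q, v)


def n_params_full(d_q, d_v, n_answers):
    """Number of free parameters in the full interaction tensor."""
    return d_q * d_v * n_answers


# Toy sizes so the tensor fits comfortably in memory.
rng = np.random.default_rng(0)
d_q, d_v, n_answers = 4, 5, 3
T = rng.standard_normal((d_q, d_v, n_answers))
q = rng.standard_normal(d_q)   # question embedding
v = rng.standard_normal(d_v)   # image embedding
y = full_bilinear(T, q, v)     # one score per candidate answer

# Illustrative (assumed) sizes: the tensor alone has billions of entries.
big = n_params_full(d_q=2400, d_v=2048, n_answers=2000)
print(y.shape, f"{big:,} parameters")
```

With these assumed sizes the tensor holds close to ten billion entries, which is what motivates factorising it.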

*Tucker Decomposition*

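When we constrain $\mathcal{T}$ to have low Tucker ranks, it factorizes into three matrices $W_q \in \mathbb{R}^{d_q \times t_q}$, $W_v \in \mathbb{R}^{d_v \times t_v}$, $W_o \in \mathbb{R}^{t_o \times |\mathcal{A}|}$ and a small *core tensor* $\mathcal{T}_c \in \mathbb{R}^{t_q \times t_v \times t_o}$:

$$\mathcal{T}[i,j,k] = \sum_{a,b,c} \mathcal{T}_c[a,b,c] \, W_q[i,a] \, W_v[j,b] \, W_o[c,k]$$

and we can re-write the model as:

$$\tilde{\mathbf{q}} = W_q^\top \mathbf{q}, \qquad \tilde{\mathbf{v}} = W_v^\top \mathbf{v}, \qquad \mathbf{z} = (\mathcal{T}_c \times_1 \tilde{\mathbf{q}}) \times_2 \tilde{\mathbf{v}}, \qquad \mathbf{y} = W_o^\top \mathbf{z}$$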
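The following NumPy sketch illustrates this factorised form: project both inputs, apply a small bilinear interaction through the core tensor, then project to the answer space. It also checks numerically that the result matches the full bilinear form built from the same factors. All names and toy sizes are assumptions made for the illustration; this is not the authors' PyTorch code.

```python
import numpy as np


def tucker_fusion(core, W_q, W_v, W_o, q, v):
    """Bilinear fusion with a Tucker-factorised interaction tensor."""
    q_t = W_q.T @ q                                # (t_q,) projected question
    v_t = W_v.T @ v                                # (t_v,) projected image
    z = np.einsum("abc,a,b->c", core, q_t, v_t)    # (t_o,) small bilinear interaction
    return W_o.T @ z                               # (n_answers,) answer scores


rng = np.random.default_rng(0)
d_q, d_v, n_answers = 6, 7, 5      # toy input / output sizes
t_q, t_v, t_o = 3, 4, 2            # toy Tucker ranks
core = rng.standard_normal((t_q, t_v, t_o))
W_q = rng.standard_normal((d_q, t_q))
W_v = rng.standard_normal((d_v, t_v))
W_o = rng.standard_normal((t_o, n_answers))
q = rng.standard_normal(d_q)
v = rng.standard_normal(d_v)

y = tucker_fusion(core, W_q, W_v, W_o, q, v)

# Reference: rebuild the full tensor and apply the plain bilinear form.
T_full = np.einsum("abc,ia,jb,ck->ijk", core, W_q, W_v, W_o)
y_full = np.einsum("ijk,i,j->k", T_full, q, v)

n_tucker = core.size + W_q.size + W_v.size + W_o.size
print(np.allclose(y, y_full), n_tucker, T_full.size)
```

The factorised model never materialises the full tensor, so its cost is driven by the sizes of the three projections and of the core rather than by the product of the input and output dimensions.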

## Tensor sparsity

However, the construction of this decomposition restricts the dimensions of the core tensor to be relatively small, which might cause a bottleneck in the modeling. To reach higher dimensions, and thus reduce the bottleneck, we explore adding more structure to the core tensor $\mathcal{T}_c$. More precisely, we force each slice of the *core tensor* along its third mode, $\mathcal{T}_c[:,:,k]$, to have a fixed rank $R$.

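Applying this structural constraint on $\mathcal{T}_c$ simplifies the bilinear interaction in the Tucker Decomposition. Writing $\tilde{\mathbf{q}}$ and $\tilde{\mathbf{v}}$ for the projected question and image embeddings, it changes the expression of the fused vector $\mathbf{z}$, which becomes

$$\mathbf{z} = \sum_{r=1}^{R} (M_r^\top \tilde{\mathbf{q}}) \odot (N_r^\top \tilde{\mathbf{v}})$$

where $M_r$ and $N_r$ are matrices of size $t_q \times t_o$ and $t_v \times t_o$, and $\odot$ denotes the element-wise product.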
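As a sanity check on this simplification, the sketch below computes the fused vector as a sum of `R` element-wise products and compares it with the result obtained through an explicit core tensor whose slices are built from the same factors. The function names, the stacking of the factor matrices along a leading axis, and the toy sizes are assumptions made for the example.

```python
import numpy as np


def rank_constrained_fusion(M, N, q_t, v_t):
    """z = sum_r (M_r^T q_t) * (N_r^T v_t), with * the element-wise product.

    M has shape (R, t_q, t_o) and N has shape (R, t_v, t_o).
    """
    proj_q = np.einsum("rak,a->rk", M, q_t)   # (R, t_o)
    proj_v = np.einsum("rbk,b->rk", N, v_t)   # (R, t_o)
    return (proj_q * proj_v).sum(axis=0)      # (t_o,)


def core_from_factors(M, N):
    """Explicit core tensor whose k-th slice is sum_r M_r[:, k] N_r[:, k]^T."""
    return np.einsum("rak,rbk->abk", M, N)


rng = np.random.default_rng(0)
R, t_q, t_v, t_o = 2, 5, 6, 4      # toy sizes
M = rng.standard_normal((R, t_q, t_o))
N = rng.standard_normal((R, t_v, t_o))
q_t = rng.standard_normal(t_q)     # already-projected question
v_t = rng.standard_normal(t_v)     # already-projected image

z = rank_constrained_fusion(M, N, q_t, v_t)

# Same result through the explicit core tensor.
core = core_from_factors(M, N)
z_core = np.einsum("abk,a,b->k", core, q_t, v_t)
slice_ranks = [np.linalg.matrix_rank(core[:, :, k]) for k in range(t_o)]
print(np.allclose(z, z_core), slice_ranks)
```

Each slice of the explicit core is a sum of `R` outer products, so its rank cannot exceed `R`, and the core never needs to be stored in practice.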

## Unifying state-of-the-art VQA models

We show that the framework of Tucker Decompositions can be used to express some of the state-of-the-art fusion strategies for VQA (namely MLB [4] and MCB [5]). We invite the interested reader to read the article for details on this point.

## Adding visual attention

As has been done in previous work, we integrate our fusion strategy into a multi-glimpse attention mechanism.

*Attention mechanism with one glimpse*

Basically, we represent the image as a set of region vectors. Then, we use a MUTAN block to merge each region vector with the question, which yields a score for each region. These scores are used to pool the region vectors with a weighted sum, providing an attended visual embedding. This vector is then fused with the question embedding by another MUTAN block to produce the answer.
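The one-glimpse pipeline described above can be sketched as follows. This is a simplified illustration under stated assumptions: the fusion block is a plain rank-constrained bilinear fusion without the non-linearities or dropout a trained model would likely use, the scores are normalised with a softmax, and `init_block`, `fusion_block` and `one_glimpse` are hypothetical names.

```python
import numpy as np


def init_block(rng, d_q, d_v, t_q, t_v, t_o, d_out, R):
    """Random parameters for one fusion block (hypothetical helper)."""
    return {
        "W_q": rng.standard_normal((d_q, t_q)) * 0.1,
        "W_v": rng.standard_normal((d_v, t_v)) * 0.1,
        "M": rng.standard_normal((R, t_q, t_o)) * 0.1,
        "N": rng.standard_normal((R, t_v, t_o)) * 0.1,
        "W_o": rng.standard_normal((t_o, d_out)) * 0.1,
    }


def fusion_block(p, q, v):
    """Rank-constrained bilinear fusion of a question and one visual vector."""
    q_t = p["W_q"].T @ q
    v_t = p["W_v"].T @ v
    z = (np.einsum("rak,a->rk", p["M"], q_t)
         * np.einsum("rbk,b->rk", p["N"], v_t)).sum(axis=0)
    return p["W_o"].T @ z


def softmax(x):
    e = np.exp(x - x.max())   # shift for numerical stability
    return e / e.sum()


def one_glimpse(att_block, cls_block, q, regions):
    """Score regions, pool them, then fuse the pooled vector with the question."""
    scores = np.array([fusion_block(att_block, q, r)[0] for r in regions])
    alpha = softmax(scores)            # attention weights over regions
    v_att = alpha @ regions            # weighted-sum pooling, shape (d_v,)
    return fusion_block(cls_block, q, v_att), alpha


rng = np.random.default_rng(0)
d_q, d_v, n_regions, n_answers = 8, 10, 6, 5   # toy sizes
att_block = init_block(rng, d_q, d_v, t_q=4, t_v=4, t_o=3, d_out=1, R=2)
cls_block = init_block(rng, d_q, d_v, t_q=4, t_v=4, t_o=3, d_out=n_answers, R=2)
q = rng.standard_normal(d_q)
regions = rng.standard_normal((n_regions, d_v))

answer_scores, alpha = one_glimpse(att_block, cls_block, q, regions)
print(alpha.round(3), answer_scores.shape)
```

A multi-glimpse variant would compute several sets of attention weights from the same region scores and concatenate the resulting attended vectors before the final fusion.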

# Results

In a few words, an ensemble of 3 MUTAN models reaches the performance of an ensemble of 7 MLB models, which was the previously published state of the art. We further improve on this result with an ensemble of 5 models. Please read the article for more details on the comparison with other methods.

## Impact of rank sparsity

Besides comparing our model to the state-of-the-art, we are interested in understanding how it behaves. More precisely, we focus our study on understanding what the rank constraint can bring.

We carry out these experiments on a MUTAN model without attention. As we reduce the rank, we can increase the output dimension and limit the bottleneck effect. The plot shows that, for a fixed number of parameters in the fusion, performance is better when we reduce the rank and increase the output dimension.
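The trade-off between rank and output dimension can be illustrated with a little arithmetic. The sketch below counts the parameters of a rank-constrained fusion and finds the largest output dimension that fits a fixed budget for several ranks. Which parameters are counted as "the fusion", the budget and all the sizes are assumptions chosen for the example, not the settings used in the experiments.

```python
def fusion_params(d_q, d_v, n_answers, t_q, t_v, t_o, R):
    """Parameter count of a rank-constrained Tucker fusion (biases ignored)."""
    projections = d_q * t_q + d_v * t_v   # input projection matrices
    core = R * t_o * (t_q + t_v)          # R pairs of factor matrices
    output = t_o * n_answers              # output projection matrix
    return projections + core + output


def largest_output_dim(budget, d_q, d_v, n_answers, t_q, t_v, R):
    """Largest t_o whose parameter count stays within the budget."""
    fixed = d_q * t_q + d_v * t_v
    per_unit = R * (t_q + t_v) + n_answers
    return max((budget - fixed) // per_unit, 0)


# Assumed, illustrative sizes and budget.
sizes = dict(d_q=2400, d_v=2048, n_answers=2000, t_q=200, t_v=200)
budget = 10_000_000
t_o_by_rank = {R: largest_output_dim(budget, R=R, **sizes) for R in (20, 10, 5)}
for R, t_o in t_o_by_rank.items():
    print(R, t_o, fusion_params(t_o=t_o, R=R, **sizes))
```

Under these assumptions, each unit of output dimension costs fewer parameters when the rank is lower, so a smaller rank leaves room for a larger output dimension within the same budget.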

# Qualitative observations

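Introducing the rank constraint implies that we write the fused vector $\mathbf{z}$ as a sum of $R$ projections:

$$\mathbf{z} = \sum_{r=1}^{R} \mathbf{z}_r, \qquad \mathbf{z}_r = (M_r^\top \tilde{\mathbf{q}}) \odot (N_r^\top \tilde{\mathbf{v}})$$

where $\tilde{\mathbf{q}}$ and $\tilde{\mathbf{v}}$ are the projected question and image embeddings and $\odot$ is the element-wise product.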

We want to assess what kind of information these different projections have learnt. Are they complementary, redundant or specialized?

We train a MUTAN without attention, with $R = 20$, and measure its performance on the validation set. We then set all the $\mathbf{z}_r$ to 0 except one and measure the performance of this ablated system.

We do so for each one of the 20 projections. We compare the full system to the R ablated systems on different question types. In the plots below, the dotted line represents the performance of the full system, and each bar is the performance obtained when we keep only the corresponding projection.
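The ablation protocol can be sketched as below. Since the fused vector is a sum of per-rank contributions, zeroing all of them but one amounts to keeping that single contribution. The weights and labels here are random stand-ins, so the printed accuracies are meaningless; only the procedure is illustrated, and all names and sizes are assumptions.

```python
import numpy as np


def projections(M, N, q_t, v_t):
    """Per-rank contributions for a batch; returns shape (R, batch, t_o)."""
    return np.einsum("rak,na->rnk", M, q_t) * np.einsum("rbk,nb->rnk", N, v_t)


def accuracy(z, W_o, labels):
    """Top-1 accuracy of the answer scores z @ W_o."""
    return float(np.mean((z @ W_o).argmax(axis=1) == labels))


rng = np.random.default_rng(0)
R, t_q, t_v, t_o, n_answers, n_examples = 4, 5, 5, 6, 3, 50   # toy sizes
M = rng.standard_normal((R, t_q, t_o))
N = rng.standard_normal((R, t_v, t_o))
W_o = rng.standard_normal((t_o, n_answers))
q_t = rng.standard_normal((n_examples, t_q))            # projected questions
v_t = rng.standard_normal((n_examples, t_v))            # projected images
labels = rng.integers(0, n_answers, size=n_examples)    # random stand-in labels

z_r = projections(M, N, q_t, v_t)
full_acc = accuracy(z_r.sum(axis=0), W_o, labels)       # all projections kept
# Zeroing every contribution except one is the same as keeping that one alone.
ablated_acc = [accuracy(z_r[r], W_o, labels) for r in range(R)]
print(full_acc, ablated_acc)
```

To reproduce the per-question-type analysis, the same accuracies would be computed separately on the subset of validation examples belonging to each question type.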

Depending on the question type, we observe 3 different behaviors of the ranks.

For questions starting with "Is there", whose answer is almost always "yes" or "no", we observe that each rank has learnt enough to reach almost the same accuracy as the global system.

Other question types require information from all the latent projections, as in the case of "What is the man". This leads to cases where all projections perform equally, and significantly worse when taken individually than when combined to form the full model.

Finally, we observe that specific projections contribute more than others depending on the question type. For example, latent variable 16 performs well on "what room is", and is less informative for answering questions starting with "what sport is". The opposite behavior is observed for latent variable 17.

# Conclusion

The Tucker decomposition framework helps us understand what kind of structure is imposed on a bilinear model.

We have successfully applied it in the context of VQA, combining it with a low-rank constraint on the core tensor of the decomposition. It could be interesting to explore other kinds of structure in the elements of the decomposition, while delving deeper into the tensor analysis to get a more precise understanding of the expressivity that a given decomposition allows.

We would also like to apply the methods developed for VQA to other tasks requiring multimodal representations.

# References

[1] Hedi Ben-younes, Rémi Cadène, Matthieu Cord and Nicolas Thome. MUTAN: Multimodal Tucker Fusion for Visual Question Answering. arXiv preprint arXiv:1705.06676. *ICCV 2017.*

[2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual Question Answering. *ICCV 2015.*

[3] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. *SIAM Rev.* 51(3):455–500, Aug. 2009.

[4] J.-H. Kim, K.-W. On, J. Kim, J.-W. Ha, and B.-T. Zhang. Hadamard Product for Low-rank Bilinear Pooling. *ICLR 2017.*

[5] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. *EMNLP 2016.*