FROMAGe

Language models have proven to be flexible tools across a wide variety of domains. Despite their power, however, most existing language models struggle with tasks that require visual reasoning and cannot produce images at all. These limitations prevent users from relying on a single model and often force them to hunt for additional models that specialize in visual content.

In 2023, researchers at Carnegie Mellon University introduced FROMAGe, one of the first multimodal language models combining visual and language capabilities: multimodal dialogue, text generation, and contextual retrieval of images from a conversation. With this model, users can work through tasks in dialogue mode, and to make its answers clearer, the model will often illustrate its textual responses with images.

Example of the FROMAGe model in operation

Let's take a look inside

Now it's time to figure out how such a marvel works and how it is trained. FROMAGe combines a frozen OPT language model with a frozen CLIP image encoder. The architecture is not tied to OPT: any language model can be slotted in instead, which makes the approach flexible. Remarkably, the researchers found empirically that to enable something as complex as image output, it is enough to additionally train only three linear layers, by solving just two training tasks on roughly three million captioned images.
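To make this concrete, here is a minimal PyTorch-style sketch of the idea: a frozen language model, a frozen image encoder, and three trainable linear layers between them. The class and attribute names (FromageSketch, visual_to_lm, and so on) are illustrative assumptions, not the authors' actual code.

```python
import torch.nn as nn

class FromageSketch(nn.Module):
    """Illustrative sketch of the FROMAGe idea: only the linear
    mappings between the frozen components are trainable."""

    def __init__(self, lm, visual_encoder, vis_dim, lm_dim, ret_dim=256):
        super().__init__()
        self.lm = lm                          # frozen language model (e.g. OPT)
        self.visual_encoder = visual_encoder  # frozen image encoder (e.g. CLIP)
        for p in self.lm.parameters():
            p.requires_grad = False
        for p in self.visual_encoder.parameters():
            p.requires_grad = False

        # The only trainable pieces: three linear layers.
        self.visual_to_lm = nn.Linear(vis_dim, lm_dim)   # image feature -> LM input space
        self.text_to_ret = nn.Linear(lm_dim, ret_dim)    # [RET] hidden state -> retrieval space
        self.image_to_ret = nn.Linear(vis_dim, ret_dim)  # image feature -> retrieval space
```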

Image Description

The first training task is image captioning. During training, the model is fed either one image with its caption or two at once. Each image is encoded with the CLIP encoder and projected by a linear layer into the language model's input space, while the caption text is tokenized (OPT uses the GPT-2 tokenizer) and fed into the frozen language model. OPT then solves an ordinary next-token prediction problem, and the model's predictions are scored with a loss against the ground-truth caption.
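As a rough illustration, one captioning training step might look like the sketch below, assuming a HuggingFace-style language model and the visual_to_lm projection from the sketch above (all names are illustrative, not the authors' code):

```python
import torch
import torch.nn.functional as F

def captioning_step(model, image, caption_ids):
    """One image-captioning training step (illustrative sketch).

    image:       preprocessed image tensor
    caption_ids: token ids of the ground-truth caption, shape (1, T)
    """
    # Frozen CLIP feature, projected into the LM's embedding space.
    with torch.no_grad():
        vis_feat = model.visual_encoder(image)             # (1, vis_dim)
    vis_embed = model.visual_to_lm(vis_feat).unsqueeze(1)  # (1, 1, lm_dim)

    # Embed the caption tokens and prepend the image "token".
    tok_embeds = model.lm.get_input_embeddings()(caption_ids)  # (1, T, lm_dim)
    inputs = torch.cat([vis_embed, tok_embeds], dim=1)         # (1, 1+T, lm_dim)

    # Next-token prediction: the logits at position t predict token t+1,
    # so the image slot predicts the first caption token.
    logits = model.lm(inputs_embeds=inputs).logits             # (1, 1+T, vocab)
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),  # predictions
        caption_ids.reshape(-1),                      # ground-truth caption tokens
    )
    return loss
```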

The process of training the model on image captioning

It is interesting to note that although the model sees only one or two image-caption pairs per training example, at test time it easily handles long contexts containing five or six images with different captions at once.

Extracting images from text

The second training task is much harder to set up. Since the model must output both images and text, it cannot rely on the dedicated text encoders often used alongside language models. Instead, the authors work with an autoregressive language model, which attends only to past tokens and therefore cannot use the bidirectional attention such encoders enjoy.

Because of all these factors, a new special token, [RET], is added to the vocabulary; it tells the model that an image should be retrieved and output as part of the response.
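In HuggingFace terms, adding such a token might look like the following minimal sketch (the choice of checkpoint and the surrounding training setup are assumptions):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-6.7b")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-6.7b")

# Register [RET] as a new special token and grow the embedding table;
# in FROMAGe-style training, this new embedding stays trainable while
# the rest of the language model remains frozen.
tokenizer.add_special_tokens({"additional_special_tokens": ["[RET]"]})
model.resize_token_embeddings(len(tokenizer))
```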

The process of training a model to extract images from text

During training, the remaining two linear layers map the data into a shared retrieval space: one projects the hidden state of the [RET] token, the other projects the CLIP image feature. The similarity between the resulting text and visual embeddings is then compared, pulling matching pairs together with a contrastive loss.
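Here is a minimal sketch of such a contrastive (InfoNCE-style) retrieval loss over a batch, assuming the two projection layers from the architecture sketch above (all names illustrative):

```python
import torch
import torch.nn.functional as F

def retrieval_loss(ret_hidden, image_feats, model, temperature=0.07):
    """Symmetric contrastive retrieval loss (illustrative sketch).

    ret_hidden:  LM hidden states at the [RET] token, shape (B, lm_dim)
    image_feats: CLIP image features,                 shape (B, vis_dim)
    """
    # Project both sides into the shared retrieval space and normalize.
    text_emb = F.normalize(model.text_to_ret(ret_hidden), dim=-1)   # (B, d)
    img_emb = F.normalize(model.image_to_ret(image_feats), dim=-1)  # (B, d)

    # Cosine-similarity logits between every text and every image in the batch.
    logits = text_emb @ img_emb.t() / temperature                   # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)

    # Matching pairs sit on the diagonal; pull them together in both directions.
    loss_t2i = F.cross_entropy(logits, targets)
    loss_i2t = F.cross_entropy(logits.t(), targets)
    return (loss_t2i + loss_i2t) / 2
```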

Testing

Unfortunately, as of 2023 there was no standard benchmark for evaluating models of this type, so the authors devised their own evaluation methods, inspired by the CLIP model. Testing took place in three stages.

In the first stage, visual storytelling was tested: the model was fed a sequence of images and asked to continue the story with a picture.

According to the results, the model is inferior to CLIP on short contexts, but as the context grows, CLIP falls significantly behind FROMAGe. This is not surprising, because CLIP was trained on short texts. The problem of CLIP's quality degrading with longer inputs has long been known and is explored in the Long-CLIP paper.

At the second stage, the model was given questions about an image, to which FROMAGe had to respond with text or an image; in effect, this test emulated dialogue with an AI assistant. The results are mixed: FROMAGe loses to the language models ESPER and Flamingo, although those models can only answer in text form. On the other hand, it beats CLIP in quality, which the authors of the paper are proud of.

The last test focused on text generation: the model had to finish a story. The results showed that the more context the model is given, the more accurate and higher-quality its generated text becomes.

Based on the results of all three tests, the authors concluded that FROMAGe was a breakthrough success. However, by the fall of 2024, based on my own research, I noticed that the model itself was rarely used: newer analogs trained with the FROMAGe strategy were often preferred over it.

Summary

As a result, FROMAGe is capable of producing both images and text. The model demonstrates strong performance on various tasks involving image input and output, and convincingly demonstrates interactive capabilities such as multimodal dialogue. Most interestingly, to solve such a complex problem, three scientists needed to train only three linear layers! Thanks to this model, created in 2023, we can use nearly the full functionality of neural networks and receive both visual and textual answers when needed. Unfortunately, the model never became as prominent as the giants among language models such as GPT-4, LLaMA, and many others. Still, it has become a guiding light for teaching language models to visualize their responses.
