Pre-training the new CoCa model on multimodal data

Pioneering work in computer vision has shown that single-encoder models pre-trained on image classification capture general visual representations that transfer well to other tasks.

ML developers often begin with a generic foundation model that is trained at scale and whose capabilities transfer to a wide range of downstream tasks. In natural language processing, several popular foundation models, including BERT, T5, and GPT-3, are pre-trained on web-scale data and show strong multitask capabilities through zero-shot, few-shot, or transfer learning. Compared with training over-specialized individual models, pre-training foundation models for many tasks can amortize training costs and help overcome resource constraints when building large-scale models.

In computer vision, pioneering work has shown that single-encoder models pre-trained on image classification capture general visual representations that are effective for other downstream tasks. More recently, researchers have explored contrastive dual-encoder approaches (CLIP, ALIGN, Florence) and generative encoder-decoder approaches (SimVLM), both trained on noisy web-scale image-text pairs.

Dual-encoder models show remarkable zero-shot image classification ability but are less effective at joint vision-language understanding. Encoder-decoder methods, on the other hand, are good at image captioning and visual question answering but cannot perform retrieval-style tasks.
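To make the dual-encoder idea concrete, here is a minimal sketch of zero-shot classification. It assumes the image and one text prompt per class have already been embedded into a shared space, and it picks the class whose text embedding has the highest cosine similarity to the image embedding. The embeddings are made-up toy vectors standing in for encoder outputs, and the function names are hypothetical; they are not part of CLIP, ALIGN, or CoCa.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, class_text_embeddings):
    """Return the best-matching class prompt and all similarity scores."""
    scores = {
        label: cosine_similarity(image_embedding, text_embedding)
        for label, text_embedding in class_text_embeddings.items()
    }
    return max(scores, key=scores.get), scores


# Toy text embeddings, one per class prompt (assumed values, not real encoder outputs).
class_text_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
# Toy image embedding that lies closest to the "cat" prompt.
image_embedding = [0.8, 0.3, 0.1]

label, scores = zero_shot_classify(image_embedding, class_text_embeddings)
print(label)
```

No class-specific training happens here: adding a new class only requires embedding one more text prompt, which is why this setup suits classification and retrieval but does not by itself generate text.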

In the paper “CoCa: Contrastive Captioners are Image-Text Foundation Models”, we present a unified computer vision model called Contrastive Captioner (CoCa). Our model is a novel encoder-decoder that simultaneously produces aligned unimodal image and text embeddings and joint multimodal representations, making it flexible enough to be applied directly to all types of downstream tasks.

In particular, CoCa achieves state-of-the-art results on a range of vision and vision-language tasks spanning visual recognition, cross-modal alignment, and multimodal understanding. In addition, the model learns highly generic representations, so with zero-shot learning or frozen encoders it can perform as well as fully fine-tuned models.

An overview of Contrastive Captioners (CoCa) compared to single encoder, dual encoder, and encoder-decoder models.


We propose CoCa, a unified ML model that combines a contrastive loss and a captioning loss in a single training data stream consisting of image annotations and noisy image-text pairs, effectively merging the single-encoder, dual-encoder, and encoder-decoder paradigms.
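The combination of the two objectives can be sketched as a weighted sum of a batch-level contrastive term and a token-level captioning term. The code below is an illustrative sketch, not the authors' implementation: the function names, the temperature, the loss weights, and the choice to average the two contrastive directions are all assumptions made for this example, and the inputs are toy values.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Image-to-text and text-to-image cross-entropy, averaged (assumed convention).

    Pair i is the positive example; every other pairing in the batch is a negative.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    sim = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
            for j in range(n)] for i in range(n)]
    i2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    t2i = -sum(log_softmax([sim[i][j] for i in range(n)])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)


def captioning_loss(token_logits, target_tokens):
    """Mean negative log-likelihood of each target token given its logits."""
    nll = [-log_softmax(logits)[target]
           for logits, target in zip(token_logits, target_tokens)]
    return sum(nll) / len(nll)


def coca_style_loss(image_embs, text_embs, token_logits, target_tokens,
                    contrastive_weight=1.0, captioning_weight=1.0):
    """Weighted sum of both objectives; the weights are hyperparameters (assumed)."""
    return (contrastive_weight * contrastive_loss(image_embs, text_embs)
            + captioning_weight * captioning_loss(token_logits, target_tokens))


# Toy batch of two image-text pairs and a two-token caption over a 3-word vocabulary.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
text_embs = [[0.9, 0.1], [0.2, 0.8]]
token_logits = [[2.0, 0.1, 0.1], [0.1, 2.0, 0.1]]
target_tokens = [0, 1]

loss = coca_style_loss(image_embs, text_embs, token_logits, target_tokens)
print(round(loss, 4))
```

Because both terms are plain scalars computed from the same batch, they can be summed and differentiated together, which is what allows a single training step to serve both objectives.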

To this end, we present a new encoder-decoder architecture in which the encoder is a vision transformer (ViT) and the text decoder transformer is split into two parts: a unimodal text decoder and a multimodal text decoder.

We omit cross-attention in the unimodal decoder layers to encode text-only representations for the contrastive loss, and cascade multimodal decoder layers that cross-attend to the image encoder outputs to learn multimodal image-text representations for the captioning loss.
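The split decoder described above can be illustrated with a deliberately simplified forward pass. This sketch assumes single-head attention with no learned projections, no layer normalization, and no feed-forward blocks, and it leaves out how a single text embedding is pooled from the unimodal states. The layer counts and function names are hypothetical. Its only purpose is to show where cross-attention is present and where it is absent.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values, causal=False):
    """Single-head dot-product attention without learned projections (a simplification)."""
    d = len(queries[0])
    out = []
    for i, q in enumerate(queries):
        # With causal masking, position i may only look at positions <= i.
        visible = keys_values[: i + 1] if causal else keys_values
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                           for k in visible])
        out.append([sum(w * v[c] for w, v in zip(weights, visible))
                    for c in range(d)])
    return out


def add(xs, ys):
    """Element-wise residual connection over two sequences of vectors."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def unimodal_layer(text_states):
    """Causal self-attention only: this layer never sees the image."""
    return add(text_states, attend(text_states, text_states, causal=True))


def multimodal_layer(text_states, image_states):
    """Causal self-attention, then cross-attention to the image encoder outputs."""
    text_states = add(text_states, attend(text_states, text_states, causal=True))
    return add(text_states, attend(text_states, image_states))


def coca_style_decoder(text_states, image_states, n_unimodal=2, n_multimodal=2):
    for _ in range(n_unimodal):
        text_states = unimodal_layer(text_states)
    unimodal_text = text_states  # text-only states, used for the contrastive loss
    for _ in range(n_multimodal):
        text_states = multimodal_layer(text_states, image_states)
    return unimodal_text, text_states  # second output is used for the captioning loss


# Toy inputs: three text token vectors and two image token vectors (assumed values).
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_a = [[0.5, -0.5], [1.0, 2.0]]
image_b = [[-1.0, 0.3], [0.2, 0.1]]

uni_a, multi_a = coca_style_decoder(text, image_a)
uni_b, multi_b = coca_style_decoder(text, image_b)
print(uni_a == uni_b, multi_a == multi_b)
```

Swapping the image changes only the multimodal output, while the unimodal text states stay identical. That property is what lets the lower half of the decoder act as a standalone text encoder while the upper half reuses its output for captioning.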

This design maximizes the model's flexibility and versatility across a wide range of tasks. At the same time, the model can be trained efficiently with a single forward and backward pass for both training objectives, which keeps computational overhead minimal. As a result, it can be trained end-to-end from scratch at a training cost comparable to that of a simple encoder-decoder model.

An illustration of the forward pass used by CoCa for both the contrastive loss and the captioning loss.

Comparative results

The CoCa model can be fine-tuned directly on many tasks with minimal adaptation. As a result, it achieves a number of state-of-the-art results on popular vision and multimodal benchmarks, including:

  1. visual recognition: ImageNet, Kinetics-400/600/700 and MiT;

  2. cross-modal alignment: MS COCO, Flickr30K and MSR-VTT;

  3. multimodal understanding: VQA, SNLI-VE, NLVR2 and nocaps.

Comparison of CoCa with other image-text foundation models (without task-specific customization) and with several state-of-the-art task-specialized models.

Notably, CoCa achieves these results as a single model adapted to all tasks, while often being lighter than previous top-performing specialized models. For example, CoCa reaches 91.0% top-1 accuracy on ImageNet using less than half the parameters of previous state-of-the-art models. In addition, CoCa has strong generative capabilities for producing high-quality image captions.

Image classification scaling performance: fine-tuned ImageNet top-1 accuracy versus model size.