Self-Supervised Learning. Metrics and first pretext tasks
In the previous article, we looked at the general essence of SSL approaches and why they are used. It’s time to start taking them apart in more detail.
In this article, we will first go over the main training methods that rely exclusively on the internal structure of an image. Then we will define how to compare different methods when there is no target as such, and look at the metrics of the methods presented above. Spoiler: the described methods are far from the best at the moment, but they trace how the ideas in the field evolved.
Let me remind you that this is the second article from the series about SSL in Computer Vision.
Image recovery
Predicting the context of a piece of an image
📋C. Doersch, A. Gupta, A. Efros. Unsupervised Visual Representation Learning by Context Prediction (May 2015)
The basic idea is simple: if an object is present in the picture, it has some kind of stable structure. Therefore, after cutting 2 patches out of this structure, we can try to restore their relative position in the image.

Both patches, one from the center and one from the periphery, are run through 2 backbones with shared weights (AlexNet was used in the paper), and the network predicts the position of the peripheral patch relative to the central one (out of 8 possible options). The authors note that it is important to cut the patches at some distance from each other, otherwise the shared borders become a data leak. Another data leak was chromatic aberration in the image; the authors used special preprocessing to eliminate it.
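The sampling of a patch pair can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the patch size and the gap value are assumptions, and a nested list stands in for an image to keep the example dependency-free.

```python
import random

# Offsets (row, col) of the 8 neighbours around the central patch, in reading order.
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                    (0, -1),           (0, 1),
                    (1, -1),  (1, 0),  (1, 1)]


def sample_patch_pair(height, width, patch=4, gap=2, rng=random):
    """Return (center_box, neighbor_box, label); a box is (top, left, size)."""
    step = patch + gap  # the gap keeps the two patches from sharing a border
    # Choose a central position that leaves room for any of the 8 neighbours.
    top = rng.randrange(step, height - step - patch + 1)
    left = rng.randrange(step, width - step - patch + 1)
    label = rng.randrange(8)  # the class the network has to predict
    d_row, d_col = NEIGHBOR_OFFSETS[label]
    neighbor = (top + d_row * step, left + d_col * step, patch)
    return (top, left, patch), neighbor, label


def crop(image, box):
    """Cut a square patch out of an image stored as a list of rows."""
    top, left, size = box
    return [row[left:left + size] for row in image[top:top + size]]
```

Both crops would then go through the same backbone, and the label (0 to 7) is the classification target.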
Puzzle solving (Jigsaw)
📋M. Noroozi, P. Favaro. Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles (March 2016)
The idea is close to the previous method, but the authors went further: instead of predicting one position, the neural network has to predict how the patches were shuffled relative to each other. The closest analogy is a sliding-tile puzzle. For 9 patches, there are 9! = 362880 possible permutations. To simplify the task a bit, the authors chose a subset of 64 permutations. As in the previous setting, each patch is passed through a backbone with shared weights, and after a softmax, 1 out of 64 possible permutations is predicted.
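A small sketch of how such a classification target could be built. The permutation subset is drawn at random here, which is a simplifying assumption (the paper selects its subset more carefully), and all function names are hypothetical.

```python
import math
import random


def make_permutation_set(n_tiles=9, n_perms=64, seed=0):
    """Draw n_perms distinct random permutations of range(n_tiles)."""
    rng = random.Random(seed)
    perms = set()
    while len(perms) < n_perms:
        perms.add(tuple(rng.sample(range(n_tiles), n_tiles)))
    return sorted(perms)


def make_jigsaw_sample(tiles, perms, rng):
    """Shuffle the tiles with a randomly chosen permutation; the label is its index."""
    label = rng.randrange(len(perms))
    shuffled = [tiles[i] for i in perms[label]]
    return shuffled, label


print(math.factorial(9))  # 362880 possible orderings of 9 tiles
```

The network sees the shuffled tiles and has to output the index of the permutation, so the size of the subset is also the number of classes.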

One of the interesting parameters is the complexity of the task. The authors measure it with the Hamming distance: the number of positions in which a permutation differs from the base sequence (123..89). In the ablation study, the authors show that the metrics are best when the complexity is balanced: the set should contain both simple permutations (only 2 tiles moved) and complex ones (all 9 moved). The optimum is at an average Hamming distance of 8. They also offer the following slogan:
A good self-supervised task is neither simple nor ambiguous
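For reference, the Hamming distance used above as a complexity measure takes only a few lines. The helper names and the example permutations are illustrative assumptions.

```python
def hamming(p, q):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(p, q))


def mean_distance_to_base(perms, base):
    """Average Hamming distance of a permutation set to the base sequence."""
    return sum(hamming(p, base) for p in perms) / len(perms)


base = (1, 2, 3, 4, 5, 6, 7, 8, 9)
swapped = (2, 1, 3, 4, 5, 6, 7, 8, 9)  # two tiles exchanged: an easy puzzle
shifted = (2, 3, 4, 5, 6, 7, 8, 9, 1)  # every tile moved: a hard puzzle

print(hamming(base, swapped))  # 2
print(hamming(base, shifted))  # 9
```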
Image masking
📋D. Pathak, P. Krahenbuhl, J. Donahue et al. Context Encoders: Feature Learning by Inpainting (April 2016)
Here the task is also easy to grasp: mask part of the picture and teach the network to predict the masked part. The authors used a 2-component loss: a pixel-wise reconstruction (L2) loss between the restored piece and the original, plus an adversarial part (the letter A in GAN). The network consisted of an encoder and a decoder (essentially an autoencoder). One bonus of this approach is ready-made pretrained weights for segmentation (spoiler: not of the best quality).
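A minimal sketch of the masking step and of a reconstruction term restricted to the hidden region. The squared-error form, the central square mask and the helper names are assumptions made for illustration; the adversarial term is omitted entirely.

```python
def mask_center(image, size, fill=0.0):
    """Return (masked_copy, box) with a size x size central square blanked out."""
    height, width = len(image), len(image[0])
    top, left = (height - size) // 2, (width - size) // 2
    masked = [row[:] for row in image]  # copy, so the original stays intact
    for r in range(top, top + size):
        for c in range(left, left + size):
            masked[r][c] = fill
    return masked, (top, left, size)


def masked_l2(prediction, target, box):
    """Mean squared error computed only over the masked square."""
    top, left, size = box
    total = 0.0
    for r in range(top, top + size):
        for c in range(left, left + size):
            total += (prediction[r][c] - target[r][c]) ** 2
    return total / (size * size)
```

The encoder-decoder receives the masked image, and the loss compares its output with the original pixels inside the box.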

Image colorization
📋G. Larsson, M. Maire, G. Shakhnarovich. Learning Representations for Automatic Colorization (March 2016)
Another idea that lies on the surface: for each pixel, you can predict its color. To describe each pixel, the authors use hypercolumns, that is, the activation values corresponding to the pixel taken from each convolutional layer (see the picture below). The loss is the closeness of the predicted color to the true one in the Lab color space.
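To make the notion of a hypercolumn concrete, here is a rough sketch that gathers the activations of every layer at the location matching a given pixel. Nearest-neighbour lookup is an assumption chosen for brevity, and the data layout (layers as lists of 2D channels) is hypothetical.

```python
def hypercolumn(feature_maps, row, col, height, width):
    """Concatenate activations from every layer at the location of pixel (row, col).

    feature_maps: list of layers; each layer is a list of channels;
    each channel is a 2D list whose resolution may be lower than the image.
    height, width: size of the input image.
    """
    column = []
    for layer in feature_maps:
        for channel in layer:
            f_height, f_width = len(channel), len(channel[0])
            # Map image coordinates to the coarser feature-map coordinates.
            r = row * f_height // height
            c = col * f_width // width
            column.append(channel[r][c])
    return column
```

The resulting vector is the per-pixel descriptor from which the color is predicted.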

Rotation prediction
📋S. Gidaris, P. Singh, N. Komodakis. Unsupervised Representation Learning by Predicting Image Rotations (March 2018)
The logic proposed by the authors of the article is as follows: in order to predict the rotation angle of an image effectively, the network must be able to capture the semantics of the image (a tit sitting on a branch: the head is on top, the feet are at the bottom).
Therefore, even this basic problem statement, predicting the rotation, leads to a deep understanding of the semantics of the scene. The whole setup has essentially been described already: we rotate the picture by 0, 90, 180 or 270 degrees and predict the rotation angle as a classification task. Interestingly, 4 angles work better than 8 (the same ones plus those offset by 45 degrees).
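Generating the training samples for this task is straightforward. The sketch below uses nested lists instead of image tensors, and the function names are assumptions.

```python
def rot90(image, k):
    """Rotate a 2D list counter-clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        # Transpose, then reverse the row order: one counter-clockwise quarter turn.
        image = [list(row) for row in zip(*image)][::-1]
    return image


def make_rotation_batch(image):
    """Return the four rotated copies together with their class labels 0..3."""
    return [(rot90(image, k), k) for k in range(4)]
```

Each image thus yields four samples, and the label is simply the number of quarter turns.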

Key SSL Metrics
For obvious reasons, we can’t just apply classification or regression metrics or whatever – there’s no target to compare predictions against. Even if we measure the quality on pretext tasks, then, firstly, they are all different, and secondly, we are not very interested in the quality on them. After all, the goal is to pretrain a high-quality feature extractor and reuse it in a downstream task.
Therefore, it is necessary to measure the quality of embeddings obtained from the feature extractor. Let’s go through some measurement methods.
Clustering and Similarity of Nearest Neighbors
We would like embeddings to form a meaningful representation of the content of the picture. Recall the canonical word2vec examples, where adding vectors yields an addition of meanings. With such image representations, one could build entire services around image search with multi-factor filters. However, individual components of the embedding space often do not carry an isolated meaning.
We can then assume that similar images lie close together in the embedding space and cluster them: each cluster should contain similar images. However, without labels this approach is more qualitative than quantitative and cannot serve as a reliable metric.
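A nearest-neighbour lookup of this kind can be sketched as follows. Cosine similarity is one common choice and is an assumption here, as are the function names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(query, embeddings, k=3):
    """Indices of the k embeddings most similar to the query, best first."""
    order = sorted(range(len(embeddings)),
                   key=lambda i: cosine(query, embeddings[i]),
                   reverse=True)
    return order[:k]
```

Looking at the images behind the returned indices gives a quick visual check of whether neighbours in the embedding space are semantically similar.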
Predictive power of embeddings
It is logical to tie the metric to the task we want to solve. Historically, this is classification, and historically it is on the ImageNet dataset. The resulting metric is defined by the linear evaluation protocol (how good this metric is will be discussed in the last article of the series):
Pretrain the model with the SSL method on ImageNet
Map each ImageNet image to the embedding produced by the frozen feature extractor
Train a linear model on top of the embeddings to predict the class label
Evaluate the quality as usual on ImageNet (top-1 or top-5 accuracy)
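The last two steps above can be sketched in plain Python, assuming the embeddings have already been computed by the frozen feature extractor. The softmax-regression trainer, its hyperparameters and all names are illustrative assumptions, not a reference implementation of the protocol.

```python
import math


def logits(weights, bias, x):
    """Class scores of a linear model for one embedding."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def train_linear_probe(embeddings, labels, n_classes, lr=0.5, epochs=200):
    """Fit softmax regression on fixed embeddings with per-sample gradient steps."""
    dim = len(embeddings[0])
    weights = [[0.0] * dim for _ in range(n_classes)]
    bias = [0.0] * n_classes
    for _ in range(epochs):
        for x, y in zip(embeddings, labels):
            scores = logits(weights, bias, x)
            top = max(scores)  # subtract the max for numerical stability
            exps = [math.exp(s - top) for s in scores]
            total = sum(exps)
            for k in range(n_classes):
                # Gradient of cross-entropy w.r.t. logit k: softmax_k - 1[k == y].
                grad = exps[k] / total - (1.0 if k == y else 0.0)
                bias[k] -= lr * grad
                for j in range(dim):
                    weights[k][j] -= lr * grad * x[j]
    return weights, bias


def top_k_accuracy(weights, bias, embeddings, labels, k=1):
    """Share of samples whose true class is among the k highest scores."""
    hits = 0
    for x, y in zip(embeddings, labels):
        scores = logits(weights, bias, x)
        best = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)[:k]
        hits += y in best
    return hits / len(labels)
```

Only the linear layer is trained; the feature extractor never receives gradients. In a real evaluation, the accuracy would be measured on held-out embeddings rather than on the training ones.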
Strictly speaking, methods should also be compared on the same architecture (usually ResNet-50). However, SSL methods designed purely for transformers have begun to appear, and ResNet simply does not fit there anymore.
In addition to the Linear evaluation protocol, there are also protocols for other datasets and other types of tasks.
Besides fully freezing the weights, there is also a protocol with fine-tuning on a small labeled fraction of the dataset (few-shot learning). Usually it is 1%, 5% or 10% of ImageNet. The fine-tuned models are compared with supervised models (trained on the same number of samples), with models pretrained on a different dataset, and with other SSL methods.
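Selecting such a labeled fraction might look like the sketch below. Sampling the same share from every class is an assumption (published splits may be constructed differently), and the function name is hypothetical.

```python
import random
from collections import defaultdict


def sample_label_subset(labels, fraction, seed=0):
    """Pick about `fraction` of the indices from every class, at least one per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        count = max(1, round(len(indices) * fraction))
        chosen.extend(rng.sample(indices, count))
    return sorted(chosen)
```

The model is then fine-tuned only on the returned indices, while the supervised baseline is trained from scratch on the same subset.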
Quality of SSL Methods for Image Recovery
The methods described above were the first attempts in the field of SSL and are currently far from the top solutions, so their metrics are low. The table below shows the accuracy under the linear evaluation protocol for various layers of the convolutional network, compared with supervised learning and with random initialization. (Interestingly, a randomly initialized network might be expected to give an accuracy of about 0.3 percent. Here, however, the quality is much higher. This suggests that the structure of stacked convolutions itself generates a useful signal.)

An interesting feature is the drop in quality when using features from the last layer of the convolution block (Conv5) relative to the previous one (Conv4). This dependence is clearly visible in the graph below.

The main reason for this behavior is specialization in the last convolutional block: it overfits to the given SSL task. This problem will be addressed in subsequent articles.
Instead of a conclusion
In this article, we looked at the main SSL methods that use only a single image to train the model. We also covered the main metrics for assessing the quality of SSL solutions.
In the next article, let’s see what you can squeeze out of the picture and the box of augmentations!