Autoencoders in simple terms

Autoencoders are a basic machine learning technique on top of which more complex models are built, such as the diffusion models behind Stable Diffusion. So what is an autoencoder?

Autoencoders are called autoencoders because they automatically learn to encode data into a more compact or low-dimensional representation and then decode it back to its original form. This term can be broken down as follows:

  • Auto: means that the process is automatic. The model learns on its own, without explicit labels or human intervention to extract features. During training, it automatically discovers how best to represent the data in a low-dimensional form.

  • Encoder: The first part of the model, the encoder, compresses (encodes) the input data into a smaller form. This step reduces the dimensionality of the input, essentially learning a more compact version of the data.

  • Decoder: The second part of the model, the decoder, attempts to recover the original input from the encoded representation. The goal is for the output to be as similar as possible to the original input, showing that the encoding preserves the essential features.

So, first of all, an autoencoder is a type of neural network used for unsupervised learning. But not just any network: one that can encode and decode data, much like a ZIP archiver compresses and decompresses files. In machine learning, autoencoders are used to reduce dimensionality, compress data, and remove noise from images (more on this later).

However, it does this more cleverly than a ZIP archiver. It is able to identify the most important features of the data (the so-called latent, or hidden, features) and remember those instead of the full data, so that it can later restore something close to the original from this rough description. For images, for example, it might remember the outlines of objects or their positions relative to each other. This enables an interesting kind of lossy compression. It works something like this:

Except that the values in the latent features are actually discrete, so the process is closer to this:

Note that this is just an illustrative example; what the network actually recognizes will be different. Also, not a single octopus was harmed during these experiments.

This approach is important for diffusion models because it allows them to work in a lower-dimensional space (called the latent space), which is much faster than working directly with high-resolution images. That is, instead of performing the denoising process directly on pixels, the image is first compressed into the latent space using an autoencoder, and the diffusion process runs in this lower-dimensional space. After diffusion, the decoder reconstructs the high-resolution image from the latent representation. If you think about it, this is quite logical: for a diffusion model to work, you only need the main features of the original image, not every smallest detail, so why waste time and computing resources on them?
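To make that pipeline concrete, here is a toy sketch in PyTorch. The `encoder` and `decoder` below are crude stand-ins (simple resampling, not trained networks), and the shapes are my illustrative choices; the denoising network itself is omitted, since the point is only that it would operate on the small latent tensor rather than the full image:

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for the pipeline described above; a real encoder/decoder
# is a trained network, not simple resampling.
def encoder(img):                      # hypothetical: pixels -> latents
    return F.avg_pool2d(img, 8)        # crude 8x spatial downsampling

def decoder(lat):                      # hypothetical: latents -> pixels
    return F.interpolate(lat, scale_factor=8.0)

image = torch.rand(1, 3, 512, 512)
latents = encoder(image)                            # (1, 3, 64, 64): 64x fewer values
noisy = latents + 0.1 * torch.randn_like(latents)   # diffusion noise is added in latent space
restored = decoder(noisy)                           # decoder maps latents back to pixel space
print(latents.shape, restored.shape)
```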

The idea behind the autoencoder is a very simple trick. If we artificially limit the number of nodes in the network and train it to reconstruct the original data, we force the network to learn a compressed representation, simply because it will not have enough nodes to memorize every feature of the data. It will have to discard most of the irrelevant attributes. This, of course, is only possible if there is some structure in the data (such as correlations between input features), because then this structure can be learned and exploited as the data passes through the network's bottleneck.

Creating a bottleneck in a neural network

This network can be trained by minimizing the reconstruction error, which measures the difference between the original data and its reconstruction after decompression.
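Here is a minimal sketch of such a bottleneck autoencoder; the layer sizes (784 → 128 → 32, i.e. flattened 28×28 images) are my illustrative choices, not prescribed by anything above:

```python
import torch
import torch.nn as nn

# A minimal bottleneck autoencoder for flattened 28x28 images (sizes illustrative).
class Autoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(784, 128), nn.ReLU(),
            nn.Linear(128, 32),              # the bottleneck: 784 values squeezed into 32
        )
        self.decoder = nn.Sequential(
            nn.Linear(32, 128), nn.ReLU(),
            nn.Linear(128, 784), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
x = torch.rand(64, 784)                  # a dummy batch of flattened images
x_hat = model(x)
loss = nn.functional.mse_loss(x_hat, x)  # the reconstruction error to minimize
loss.backward()
```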

Here's an example of how the compression stage works:

Encoding

At each compression step, we reduce the image size by half, but double the number of channels the network can use to store latent features.
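One common way to implement this "halve the resolution, double the channels" pattern is with strided convolutions. The sketch below is illustrative; the channel counts (64/128/256) and the 256×256 input are my assumptions:

```python
import torch
import torch.nn as nn

# Each stride-2 convolution halves the spatial size; the channel count doubles
# between stages (sizes illustrative).
encoder = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),    # 256x256x3  -> 128x128x64
    nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),  # 128x128x64 -> 64x64x128
    nn.ReLU(),
    nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1), # 64x64x128  -> 32x32x256
)
latents = encoder(torch.rand(1, 3, 256, 256))
print(latents.shape)  # torch.Size([1, 256, 32, 32])
```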

Decoding works in reverse order:

Decoding
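The decoder can mirror the encoder sketch above with transposed convolutions, doubling the resolution and halving the channel count at each step (again, the sizes are illustrative):

```python
import torch
import torch.nn as nn

# The mirror of the encoder above: each transposed convolution doubles the
# spatial size while the channel count halves between stages.
decoder = nn.Sequential(
    nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1), # 32x32x256  -> 64x64x128
    nn.ReLU(),
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),  # 64x64x128  -> 128x128x64
    nn.ReLU(),
    nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1),    # 128x128x64 -> 256x256x3
    nn.Sigmoid(),
)
restored = decoder(torch.rand(1, 256, 32, 32))
print(restored.shape)  # torch.Size([1, 3, 256, 256])
```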

Please note that at the end we get a restored image, not the original. They will be similar, but not identical! Yes, the goal of training is to make them as similar as possible (that is, to minimize the reconstruction error), but some details will inevitably be lost.

An ideal autoencoder model finds a balance between:

1. Sufficient sensitivity to the inputs to reconstruct them accurately.

2. Sufficient insensitivity to avoid simply memorizing or overfitting the training data.

In most cases, this is achieved by specifying a loss function with two components: one term that encourages the model to be sensitive to the input data (for example, a reconstruction error) and another term that discourages memorization or overfitting (for example, a regularization term). This is an important observation: we must make sure the autoencoder does not simply find an efficient way to memorize the training data. We want the latent features it finds to be useful for data beyond the training set, as in the sketch below.
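In code, the combined objective is just a weighted sum. This is a generic sketch where `penalty` is a placeholder for whichever regularizer is chosen (concrete variants are shown in the sections below), and the weight `beta` is illustrative:

```python
import torch
import torch.nn.functional as F

beta = 1e-3                                   # illustrative regularization weight
x = torch.rand(64, 784)                       # dummy input batch
x_hat = torch.rand(64, 784)                   # stand-in for the model's reconstruction
penalty = x_hat.abs().mean()                  # placeholder for any regularization term
loss = F.mse_loss(x_hat, x) + beta * penalty  # sensitivity + insensitivity in one objective
```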

There are other ways to create a bottleneck in a network besides limiting the number of nodes.

Sparse autoencoders

We can also constrain the network by limiting the number of neurons that can fire at the same time. This essentially forces the network to use separate hidden-layer nodes for specific features of the input data (an idea somewhat similar to how different areas of the brain process different types of information).

Sparse autoencoders use regularization techniques that encourage hidden units (neurons) in the network to maintain a certain level of sparsity, that is, only a small part of them should be active (i.e. have a non-zero output) at any given time. Here are the main types of regularization used in sparse autoencoders:

Regularization using KL-divergence:

The most common method for regularizing sparse autoencoders is to impose a constraint on the sparsity of hidden unit activations using Kullback-Leibler (KL) divergence.

The idea is to compare the average activation of a hidden unit to a desired level of sparsity, usually denoted by a small value (e.g. 0.05). KL divergence penalizes deviations from this desired level of sparsity.

This is achieved by adding a sparsity penalty term to the overall cost function. The cost function becomes a combination of the reconstruction error and the sparsity penalty.

The desired sparsity level is often denoted as p (a small value such as 0.05), and the average activation of hidden unit j across the training examples as p̂_j.

The general cost function J with regularization via KL-divergence looks like this:

$$J = J_{\text{reconstruction}} + \beta \sum_{j=1}^{n_h} \mathrm{KL}\big(p \,\|\, \hat{p}_j\big), \qquad \mathrm{KL}\big(p \,\|\, \hat{p}_j\big) = p \log\frac{p}{\hat{p}_j} + (1 - p)\log\frac{1 - p}{1 - \hat{p}_j}$$

Where:

  • J_reconstruction — the reconstruction error (for example, mean squared error).

  • β — a weight controlling the strength of the sparsity penalty.

  • n_h — the number of hidden units.
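A sketch of this penalty in PyTorch. The function name is mine, and I assume the hidden activations lie in (0, 1), e.g. after a sigmoid, so their batch mean can be read as an activation rate:

```python
import torch

def kl_sparsity_penalty(activations, p=0.05, eps=1e-8):
    # activations: (batch, n_h) hidden-unit outputs in (0, 1), e.g. after a sigmoid
    p_hat = activations.mean(dim=0).clamp(eps, 1 - eps)  # average activation per unit
    kl = p * torch.log(p / p_hat) + (1 - p) * torch.log((1 - p) / (1 - p_hat))
    return kl.sum()                                      # summed over the n_h hidden units

hidden = torch.sigmoid(torch.randn(64, 32))              # dummy hidden activations
penalty = kl_sparsity_penalty(hidden)                    # add beta * penalty to J_reconstruction
```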

L1 regularization:

L1 regularization encourages sparsity by penalizing the absolute value of the weights, causing many weights to shift toward zero.

By adding the sum of the absolute values of the weights to the cost function, this form of regularization effectively encourages the model to use fewer connections, leading to sparse activations in the hidden layer.
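A minimal sketch of adding an L1 weight penalty to the loss; the λ value and the tiny stand-in model are my illustrative choices:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(784, 784)          # stand-in for a tiny autoencoder
x = torch.rand(16, 784)
l1_lambda = 1e-4                     # illustrative penalty weight
l1_penalty = sum(p.abs().sum() for p in model.parameters())  # sum of |w| over all weights
loss = F.mse_loss(model(x), x) + l1_lambda * l1_penalty
```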

L2 regularization:

L2 regularization, also known as weight decay, discourages large weights by penalizing the sum of squares of the weights.

Although L2 regularization does not directly provide sparsity, it helps prevent overfitting and can complement other methods that promote sparsity, such as KL divergence or L1 regularization.
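In PyTorch, L2 regularization is usually enabled through the optimizer's weight_decay argument rather than added to the loss by hand (the values are illustrative):

```python
import torch

model = torch.nn.Linear(784, 784)   # stand-in for an autoencoder
# weight_decay applies an L2 penalty on the weights inside each update step
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
```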

Activity regularization:

This method directly penalizes neuron activations. A term is added to the loss function that penalizes non-zero activations, often calculated as the L1-norm of activations.

By minimizing the sum of activations, such regularization encourages most neurons to remain inactive.
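A sketch of such an activity penalty, computed as the L1 norm of the hidden activations themselves (the weight γ is illustrative):

```python
import torch

gamma = 1e-3                                       # illustrative penalty weight
hidden = torch.relu(torch.randn(64, 32))           # dummy hidden-layer activations
activity_penalty = gamma * hidden.abs().sum(dim=1).mean()  # L1 norm of activations
# this term is then added to the reconstruction loss, as in the variants above
```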

Denoising autoencoders

Another idea is to add noise to the original image and use the noisy version as input, while comparing the output against the clean original to compute the error. The model thus learns, in effect, to remove noise from the image; as a side effect, it can no longer simply memorize the input, because the input and the target output are not the same. This is what the learning process looks like:

This, as stated above, forces the model to remember only the important features and to ignore the noise (and other unimportant details).
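One training step of a denoising autoencoder might look like this sketch; the noise level (0.3) and the tiny model are my assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(784, 32), nn.ReLU(), nn.Linear(32, 784))
x_clean = torch.rand(64, 784)
x_noisy = x_clean + 0.3 * torch.randn_like(x_clean)  # corrupt the input
x_hat = model(x_noisy)                               # the model sees only the noisy version
loss = F.mse_loss(x_hat, x_clean)                    # ...but is scored against the clean one
loss.backward()
```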

Variational autoencoders (VAE)

These are the ones used in diffusion models. The basic idea: in conventional autoencoders, features are stored as discrete values in the feature layers, whereas variational autoencoders (VAE) use a probability distribution for each latent feature. This enables some interesting capabilities, which I will describe in a separate article.

Limitations
Because autoencoders learn to compress data by identifying patterns and relationships (i.e., correlations between input features) in the training data, they are generally only effective at reconstructing data similar to what they were trained on.

Also, autoencoders are rarely used for data compression alone, since they are usually inferior to hand-crafted algorithms designed for specific types of data, such as audio or images.

Although autoencoders can be used to encode text, they are less commonly used compared to more modern architectures such as transformers (BERT, GPT, etc.) because autoencoders may have difficulty processing complex language structures or long sequences, especially without mechanisms like attention that help capture long-range dependencies.

Conclusion
Autoencoders are fundamental building blocks in machine learning and AI, offering a versatile approach to tasks such as data compression, dimensionality reduction, and noise removal. By learning to encode data into compact, low-dimensional representations and then decode them back, autoencoders can effectively capture important features of the input data while discarding less significant details. This ability makes them valuable in applications ranging from image processing to data preprocessing, and as components of more complex architectures such as latent diffusion models.


I am a co-founder of the AI integrator Raft.

I share my experience in my Telegram channel.

All the best and stay positive!
