# Theoretical considerations for image compression using neural networks

I was prompted to write this article by a recent (on a historical scale) experiment in image compression using Stable Diffusion. Having skimmed commonplace sources such as Wikipedia, I found that the problem of “a beautiful, but completely fictional picture” is already well known, yet the most obvious solution that comes to mind does not, for some reason, appear in them.

The reasons, as far as I can guess, are that it has either long since been dismissed as ineffective (too low a compression ratio), or never been worked out because it has too many variations, and “not all yogurts are created equal”.

Fortunately, this article is intended for the sandbox, so I can safely state my own thoughts: before it claims anyone’s attention, it must be approved by more knowledgeable specialists.

The problem of neural networks trained to transmit an image through a “bottleneck” is also inherent in natural neural networks.

Many parents who let children of a similar age play “on their own” quickly discover that the children have developed a private “childish language”, understandable only to them. Obviously, the development of this “language” does not go beyond the names of toys and other elementary everyday things, so even if it wanted to, such a language would be poorly suited for a full life (let us leave aside guesses about how far it could or could not develop during the years when a person is still capable of acquiring language). The problem is solved radically: the children are retrained into the common language, and sometimes this is already quite difficult.

The same solution suggests itself here. Instead of the “childish language” that the left and right halves of the neural network invent between themselves in order to communicate through the “bottleneck” (indicated by a bold question mark), we abandon the latent parameter space altogether, along with the right half of the network. Simply take, as the target “language” the compression network is trained to produce, some deterministic compression format that is easy to decompress with ordinary mathematics (no neural networks), but whose compression is so nontrivial that only a sufficiently powerful neural network can solve it. (In fact, for some reason I am confident that everything up to this point is obvious and well known, but in the comments to recent articles nobody could offhand point to examples, so who knows…) Now let us move on to the nuances for which I am writing this article.
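To make the idea concrete, here is a minimal numpy sketch (the toy 2×2 orthonormal “format” and all names are my own illustration, not an established scheme): the decoder is fixed, deterministic math, and “training” the encoder means minimizing reconstruction error *through* that fixed decoder. A closed-form linear fit stands in for gradient descent on a real network.

```python
import numpy as np

# Deterministic decoder: a fixed orthonormal basis (2x2 Haar-like),
# standing in for any "plain math" decompression format.
s = 1 / np.sqrt(2)
D = np.array([[s, s], [s, -s]])           # rows are basis vectors

def decode(codes):
    """Plain-math decompression -- no neural network involved."""
    return codes @ D

# "Training" = fit an encoder that minimizes reconstruction error
# through the fixed decoder; here a linear least-squares fit replaces
# gradient descent on a real network.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))             # toy "images"
W, *_ = np.linalg.lstsq(X, X @ np.linalg.pinv(D), rcond=None)

codes = X @ W                             # the learned "language" = format coefficients
recon = decode(codes)
print(np.allclose(recon, X))              # True: the encoder learned the fixed format
```

The point of the sketch is only the training topology: the loss is computed after a decoder that the network cannot reshape into a private “childish language”.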

Nuance one. A compression algorithm that tries to squeeze everything into a fixed size is, to put it mildly, of limited usefulness. The actual amount of information in an image can vary by orders of magnitude, from “beautiful clouds” to “a page from the Voynich manuscript” (indeed, in the experiment it was not only the house, which the network replaced with a completely different house, that suffered the most crushing fiasco, but also the inscriptions on the drawings). To handle this, a new input parameter must be added when training the network: the “required size” into which the network must fit. The output signals fed to the decompressor are simply cut off at number N, and everything from N upward is discarded (some edge effects are possible here, affecting the adequacy of the last, (N-1)-th, output signal). At the same time, it is reasonable to divide the final error between the restored image and the original by N: this avoids demanding the impossible from the network and keeps it from “oversharpening” on attempts to save hopelessly over-compressed images at the expense of developing the skill of producing a really good image where that is possible. And it would probably be no less reasonable to use each training image repeatedly with different size requirements, teaching the network to compress it at different quality levels, from maximum to minimum.
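A sketch of the budget mechanism just described (function names are mine; the truncation at N and the division of the error by N follow the text):

```python
import numpy as np

def truncate_to_budget(outputs, n):
    """Keep only the first n output signals; everything from n
    upward is discarded (here: zeroed before decompression)."""
    clipped = np.array(outputs, dtype=float)
    clipped[n:] = 0.0
    return clipped

def budget_scaled_loss(original, restored, n):
    """Reconstruction error divided by the budget n, so the network
    is not punished for failing at hopeless compression ratios."""
    return float(np.mean((np.asarray(original) - restored) ** 2) / n)

rng = np.random.default_rng(1)
coeffs = rng.normal(size=16)              # toy "image" description
for n in (4, 12, 16):                     # the same image at several budgets
    restored = truncate_to_budget(coeffs, n)
    print(n, budget_scaled_loss(coeffs, restored, n))
```

The loop mirrors the suggestion to reuse each image with different “required size” values during training.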

Nuance two. Choosing the evaluation algorithm is far from elementary. What exactly should be taken as the final error to be minimized? What should the penalty function be during training? RMS error is, of course, about as indicative as the “average temperature across the hospital”, and clearly not worth taking seriously. Maximum error? Too much weight on a single pixel of noise.

Outlier-resistant estimators (sometimes going by the truly nightmarish phrase “robust estimators”)?

Okay, let’s say we have found a good noise estimator. But another problem lies in wait for us here: deformations. Shifting a one-pixel line by half a pixel is, generally speaking, not fatal for the image (being at the limit of resolution anyway, the line is not very informative). Yet from the point of view of pixel noise, a line that changes by as much as half the dynamic range along its entire length, vanishing from its original place and reappearing in a new one, is no smaller an error than the complete absence of the line, and a larger one than a line mutilated into a dotted state.
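For the “robust estimator” option, a minimal illustration using the standard Huber penalty (a well-known robust loss, chosen here purely as an example; it does nothing about the deformation problem above): one wildly wrong pixel dominates the mean-square error but has bounded influence on the robust one.

```python
import numpy as np

def huber(err, delta=1.0):
    """Robust penalty: quadratic near zero, linear for large
    residuals, so one noise pixel cannot dominate the score."""
    a = np.abs(err)
    return float(np.where(a <= delta, 0.5 * a**2, delta * (a - 0.5 * delta)).mean())

residuals = np.zeros(100)
residuals[0] = 10.0                 # a single wildly wrong pixel

print(np.mean(residuals**2))        # MSE: 1.0, dominated by one pixel
print(huber(residuals))             # Huber: ~0.095, bounded influence
```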

In any case, without being armed with a sufficiently good “smart estimator”, there is definitely no point in rushing at this task. In terms of pixel noise, replacing a “C” with a “B” costs almost nothing.

Nuance three. There is plenty of “devil in the details” in choosing a particular coding system. Let us set aside such an obvious option as fractal encoders (I mention them for a reason: they have their own fandom that knows all the pitfalls thoroughly, and now that fandom has access to the piano). I will analyze an option that is an order of magnitude simpler, yet still hardly amenable to traditional methods: frequency analysis restricted to arbitrary regions. This is something like a nightmarish hybrid of wavelets and image vectorization: the image must be split into a stack of additive layers, each defined by a contour (mark it in green) and filled with some set of harmonics (mark them in red). Nightmarish in the sense that our vision solves this problem instantly (“here is one branch of a tree, the leaves arranged such-and-such; here is a second; here are ripples on the water; here are pebbles”: these are precisely the separate regions of the image that can be outlined), but solving it without neural networks is a mathematical nightmare, especially considering that for maximum efficiency the regions can and should partially overlap.

What imprint does such a decoder leave on training? First, the amount of information needed to store a contour depends on the complexity of its shape. Second, on the number of harmonics. For example, it is most efficient to cover a grid of identical buildings with one complex contour in which complex harmonics set the repeating pattern (an ideal grid of identical buildings does not occur in nature, but as an example it will do). This will, of course, be a large volume of information, but it will immediately cover most of the image. To encourage the network to create the most efficient contours, we give it a generous margin of outputs, allowing it to specify many Bezier curves (or however we choose to define contours) and harmonics. Then, according to the required degree of compression, we quantize the outputs (both harmonics and contours) and estimate the amount of information required for each.
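A toy sketch of the quantize-then-cost step (the bit-cost formula is an ad-hoc stand-in for a real entropy coder, and all names are my own invention):

```python
import numpy as np

def quantize(values, step):
    """Snap raw network outputs to a grid; a coarser step means
    fewer distinct levels and therefore fewer bits."""
    return np.round(np.asarray(values) / step) * step

def cost_bits(values, step):
    """Crude information cost: bits needed to index each quantized
    level (a placeholder for a real entropy coder)."""
    levels = np.abs(np.round(np.asarray(values) / step)).astype(int)
    return int(np.sum(np.ceil(np.log2(levels + 2))))

rng = np.random.default_rng(2)
harmonics = rng.normal(size=8)            # raw harmonic outputs for one contour
coarse = quantize(harmonics, 1.0)         # what actually gets stored

print(cost_bits(harmonics, 0.1))          # fine quantization: more bits
print(cost_bits(harmonics, 1.0))          # coarse quantization: fewer bits
```

The same two functions would apply to the contour parameters (e.g. Bezier control points), giving each candidate layer a size estimate for the selection step.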

Then, instead of taking the first N contours, as in the “spherical vacuum” example above, we sort the contours by contrast (i.e., by influence on the image) and select them in decreasing order, until the required volume N is reached (plus or minus one whole contour description). Only the resulting quantized contours, filled with their quantized harmonics, are fed to the decompressor and then to the “smart error estimate”, teaching the neural network to make the best use of the very intricate relationships between the levels at its outputs and the amount of information that must be fitted in (and at the same time protecting us from edge effects of training on the last, (N-1)-th, output signal).
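The greedy selection step can be sketched as follows (the dict layout and all numbers are invented for illustration; “contrast” stands for influence on the image and “bits” for the cost estimated earlier):

```python
def select_contours(contours, budget_bits):
    """Pick the most influential (highest-contrast) contours first,
    until the bit budget runs out (+/- one contour description)."""
    chosen, used = [], 0
    for c in sorted(contours, key=lambda c: c["contrast"], reverse=True):
        if used + c["bits"] > budget_bits:
            break                     # stop at the first contour that no longer fits
        chosen.append(c)
        used += c["bits"]
    return chosen, used

contours = [
    {"name": "sky",   "contrast": 0.2, "bits": 40},
    {"name": "house", "contrast": 0.9, "bits": 120},
    {"name": "text",  "contrast": 0.7, "bits": 300},
]
chosen, used = select_contours(contours, 200)
print([c["name"] for c in chosen], used)   # ['house'] 120
```

Note that only what survives this selection ever reaches the decompressor and the error estimate, so the network is trained end-to-end against the budgeted result.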

As is easy to see, this differs greatly from the general case above. The space of output parameters is rigidly divided into regions (enough of them to cover the highest-quality compression of the most complex image), where each region splits into a contour encoded one way or another (the number of signals allocated corresponds to the most complex contour) and harmonic coefficients (also sized for the worst case). The total number of outputs will surely turn out comparable to the number of inputs, but since cases where everything simultaneously follows the worst-case scenario are degenerate or outright impossible, this should not be embarrassing: after quantization, sorting, and discarding the least contrasting contours (I assume it is better to ignore region size entirely here, so that the relative importance of small and large details is decided by the network itself), the number of “survivors” will shrink by a couple of orders of magnitude. Moreover, even if this “nightmare” codec does not show a high compression ratio, any workable result would already say a great deal about the ability of neural networks to master the most terrifying ways of encoding an image.

Thus, if someone knows the issues of “smart” error estimation much better than I do, and can also imagine principles by which contours could be described in a way accessible to a neural network (and, moreover, quantized with a “coarsening” of the contour), I wish that brave naturalist good luck.

And if these considerations have led anyone who has dealt with fractal compression to similar thoughts in their own domain, I yield them the rostrum.

Criticism is welcome (except for pointing out that all of this is nothing more than an acrobatic IMHO-nautics competition on the uneven bars; that circumstance is already known and listed under “known bugs”, so there is no point repeating it). Prompt criticism is especially welcome, while it is not yet too late to correct mistaiks, typoes, and incorrect use of terminology.