How to take TensorFlow and mix two pictures into one

You may have seen pictures in which two images are mixed: one is visible up close, the other from afar. For example, Einstein and Madonna.

I don’t know how the originals were made, but I tried to do something similar with TensorFlow.

The general idea is to use machine learning machinery to fit the image itself. Please note: I am not training a model that turns two pictures into one. The “model” here is just an array of pixel colors, nothing else. During “training” we take two sample pictures and improve the result’s similarity to both. For the first sample, the error function is the color difference between it and the trained picture. For the second, it is the difference between the blurred sample and the blurred trained picture.

The Fourier transform is so last century; new problems require innovative solutions.

The code and source pictures are on GitHub.

To run it you need tensorflow, numpy and Pillow. The latest version of TensorFlow requires Python 3.7 or newer.

To get the result in a couple of minutes on a CPU, I reduced the pictures to 512×512 pixels. I took them from Wikipedia: a chess set and Lenna.

[Images: the chess set and Lenna]

Let’s load them and convert them to floats in the range from 0 to 1:

from PIL import Image
import numpy as np

lena = Image.open("imgs/Lenna512.png")
chess = Image.open("imgs/ChessSet512.png")
imgs = [np.array(i) / 255.0 for i in [lena, chess]]

Gamma correction

The brightness of light perceived by the eye can differ by many orders of magnitude. In images with one byte per channel there are 256 brightness values, and they map non-linearly to the physical number of photons. For example, two pixels with a brightness of 127 give less light than one pixel with a brightness of 254.

To get the physical brightness there is gamma correction: raise the stored value to the power of 2.2. The result is proportional to the number of photons coming from the screen. The reverse transformation is just as simple: raise to the power of 1/2.2.

Why is all this needed? If I want the perceived color difference, I can simply take the difference in ordinary RGB space.

But when calculating how the image looks from afar (blurred), the brightness values have to be converted into linear space (proportional to the number of photons), and the Gaussian blur has to be done there.
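
As a minimal sketch (later in the article these conversions are simply inlined as ** 2.2 and ** (1.0 / 2.2); the helper names here are made up, not from the repository):

def to_linear(img):
    # Display value in [0, 1] -> value proportional to physical light.
    return img ** 2.2

def to_display(img):
    # Back from linear light to the usual display values in [0, 1].
    return img ** (1.0 / 2.2)

# Two pixels at brightness 127 give less light than one pixel at 254:
# 2 * to_linear(127 / 255) ~= 0.43, while to_linear(254 / 255) ~= 0.99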

Gaussian blur

When you blur the picture, each pixel turns into a speck. In mathematical language, a speck is called a kernel, and the process itself is called a convolution. The kernel looks something like this:

C e^{-\frac{dx^2 + dy^2}{2\sigma^2}} = C e^{-\frac{dx^2}{2\sigma^2}} e^{-\frac{dy^2}{2\sigma^2}}

C is a normalizing constant chosen so that the sum of all elements (or the integral over the area) equals one and the overall brightness of the image does not change. For the same reason it is necessary to apply gamma correction and work with a physically correct amount of light.

A nice property of this kernel is that the convolution can be done first along one axis and then along the other, so it is enough to compute the kernel for the one-dimensional case:

import math

def make_gauss_blur_kernel(size: int, sigma: float) -> np.ndarray:
    # One-dimensional Gaussian kernel, normalized so that the values sum to 1.
    result = np.zeros(shape=[size], dtype=float)
    center = (size - 1) // 2
    div = 2 * (sigma ** 2)
    for i in range(size):
        x2 = (center - i) ** 2
        result[i] = math.exp(-x2 / div)
    return result / np.sum(result)

make_gauss_blur_kernel(size=11, sigma=2)

Mathematically, the kernel is infinite, but we will limit its size. For example, for sigma = 2 and a kernel size of 11, it will look like this:

array([0.00881223, 0.02714358, 0.06511406, 0.12164907, 0.17699836, 0.20056541, 0.17699836, 0.12164907, 0.06511406, 0.02714358, 0.00881223])

Tensorflow 2.0

In the old TensorFlow the model graph was static, which imposed restrictions. In version 2.0 they borrowed the trick from PyTorch: the graph is built dynamically right while the error is being computed, after which you can simply take the gradients.

The magic looks like this:

with tf.GradientTape() as tape:
    # computations
    loss = ...
    
gradient = tape.gradient(loss, trainable_variables)    

This is exactly what we need.

Create the model

from typing import Optional
import tensorflow as tf

class MyModel:
    def __init__(self, img_h: int, img_w: int, gauss_kernel_size: int, gauss_sigma: float, image_source: Optional[np.ndarray] = None):
        if image_source is None:
            image_source = np.zeros(shape=(1, img_h, img_w, 3), dtype=float)
        self.trainable_image = tf.Variable(initial_value=image_source, trainable=True)
        gauss_blur_kernel = make_gauss_blur_kernel(gauss_kernel_size, gauss_sigma)
        # Kernel shapes: (1, size, 3, 1) for the x axis, (size, 1, 3, 1) for the y axis.
        self.gauss_kernel_x = tf.constant(gauss_blur_kernel[np.newaxis, :, np.newaxis, np.newaxis] * np.ones(shape=(1, 1, 3, 1)))
        self.gauss_kernel_y = tf.constant(gauss_blur_kernel[:, np.newaxis, np.newaxis, np.newaxis] * np.ones(shape=(1, 1, 3, 1)))

    ...

trainable_image – the variables. Our “trained” picture consists of them, and we will adjust them so that it resembles the two pictures we need. We also create constants for the convolution kernels along the x and y axes. We do not train these; they have already been obtained in a clever way.

For convolutions, a four-dimensional kernel is used:

  1. Y-axis of the image
  2. X-axis of the image
  3. number of input channels (3)
  4. number of output channels per input channel (1, each color goes into itself)

Both numpy and tensorflow have the idea of broadcasting. For example, an array with dimensions (1, 256, 512, 1) can be interpreted as (N, 256, 512, C): as if the size along the first and last axes were arbitrary, with the values simply repeated along those axes. Broadcasting in the two libraries sometimes works differently: the convolution function wants a kernel with dimensions (size_y, size_x, 3, 1) and for some reason refuses an array of shape (size_y, size_x, 1, 1), so I had to multiply by a numpy array of ones with shape (1, 1, 3, 1). If we used different convolution kernels for different colors this dimension would be useful, but in our case everything is the same.
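
A quick sanity check of the resulting shapes (purely illustrative, not from the repository):

k = make_gauss_blur_kernel(size=15, sigma=3)
kernel_x = k[np.newaxis, :, np.newaxis, np.newaxis] * np.ones(shape=(1, 1, 3, 1))
kernel_y = k[:, np.newaxis, np.newaxis, np.newaxis] * np.ones(shape=(1, 1, 3, 1))
print(kernel_x.shape, kernel_y.shape)  # (1, 15, 3, 1) (15, 1, 3, 1)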

I use per-channel (depthwise) convolutions, so that when blurring, each color channel affects only itself and turns into a new channel with the same color. Such convolutions are computed faster than regular ones. There is also no Fourier transform inside, so the cost of the convolution grows linearly with the kernel size; that is why a convolution with a 15×15 square kernel takes several times longer than two convolutions with 1×15 and 15×1 kernels.
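
For reference, a minimal sketch of what the gauss_blur method used below could look like with these kernels (the version in the repository may differ):

class MyModel:
    ...

    def gauss_blur(self, image: tf.Tensor) -> tf.Tensor:
        # Separable blur: convolve along x, then along y.
        # depthwise_conv2d makes each channel affect only itself.
        blurred_x = tf.nn.depthwise_conv2d(image, self.gauss_kernel_x, strides=[1, 1, 1, 1], padding="SAME")
        return tf.nn.depthwise_conv2d(blurred_x, self.gauss_kernel_y, strides=[1, 1, 1, 1], padding="SAME")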

class MyModel:
    ...

    def run(self, img_precise: np.ndarray, img_blurred: np.ndarray, m_precise = 1, m_blurred = 1) -> Report:
        with tf.GradientTape() as tape:
            # next code will be here

The training step takes the two pictures and a weight for the importance of each of the two error terms.

The trained variables can have any value, be it -1 or 9000. But for the colors of the picture I want values in the range from 0 to 1. For this we apply the sigmoid: near zero it grows more or less linearly, but at large input values the growth slows down and the result never exceeds one.

trainable_image01 = tf.math.sigmoid(self.trainable_image)
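
The sigmoid itself, for reference:

\sigma(x) = \frac{1}{1 + e^{-x}}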

For the first picture (the one that should be sharp up close), the loss is simply the sum of squared differences over every channel of every pixel. In this formulation each pixel effectively learns independently, and a learning rate of 0.1 works perfectly well.

Instead of the sum one could use reduce_mean(), but then the gradients would be smaller by a factor of the picture’s area (512×512) and would have to be multiplied by something like 10^5.

loss_precise = tf.reduce_sum(tf.square(trainable_image01 - img_precise[np.newaxis, :, :, :]))

For the second image we blur (convolve) both the trained picture and the sample, and then compare them pixel by pixel in the same way. And let’s not forget gamma correction before the convolution and the inverse after it:

blurred = self.gauss_blur(trainable_image01 ** 2.2) ** (1.0 / 2.2)
blurred_label = self.gauss_blur(img_blurred[np.newaxis, :, :, :] ** 2.2) ** (1.0 / 2.2)
loss_gauss = tf.reduce_sum(tf.square(blurred - blurred_label))

Let’s multiply the errors by the coefficients and add them. Depending on the ratio of the coefficients, the result will tend more towards one picture or another.

loss = loss_precise * m_precise + loss_gauss * m_blurred

Finally, get the gradients:

gradient = tape.gradient(loss, self.trainable_image)

There are various optimizers for training, but I implemented plain gradient descent by hand:

def apply_gradient(self, grad: np.ndarray, lr: float):
    self.trainable_image.assign(self.trainable_image.numpy() - lr * grad)

Let’s call this function many, many times. The code in the article is for illustration; the runnable version can be found on GitHub.

class MyModel:
    ...
    def train(self, steps_count: int, print_loss_steps: int, lr: float, **run_kwargs) -> Report:
        for i in range(steps_count):
            r = self.run(**run_kwargs)
            self.apply_gradient(r.gradient, lr)
            if i % print_loss_steps == print_loss_steps - 1:
                print(f"{i}: loss = {r.loss}, precise = {r.loss_precise}, gauss = {r.loss_gauss}")
        return r

Now everything is ready to train the model:

model = MyModel(512, 512, gauss_kernel_size=15, gauss_sigma=3)
r = model.train(steps_count=200, print_loss_steps=50, lr=0.3, img_precise=imgs[0], img_blurred=imgs[1], m_precise=0.1, m_blurred=1.0)  
Image.fromarray(np.uint8(r.image * 255.0))

I tried different ratios (0.1, 0.3, 1.0).

[Result pictures for the different ratios]

Why all this?

Just because I can. I love the ability to define an arbitrary error function and train the model without thinking about how to get the result analytically.

If it seems to you that I am hammering nails with a microscope and that TensorFlow is not meant for this at all, that is not quite so. The library makes it easy to compute gradients, and I use it however I want. Machine learning does not have to happen on clusters with top GPUs and terabyte datasets.

A similar effect can be achieved with the Fourier transform, cutting out the high frequencies from one picture and the low frequencies from the other. But there are caveats: you can end up with brightness values below 0 or above 1, and I don’t know how to reconcile the logarithmic perception of brightness by the human eye with the need to do Gaussian blur in linear color space. I tried it and did not like the result. There will be no proofs.
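
For the curious, a rough sketch of that frequency-splitting idea (not the author’s experiment; the cutoff value is an arbitrary guess):

def split_frequencies(img: np.ndarray, cutoff: float):
    # Split an (H, W, 3) image into low- and high-frequency parts via FFT.
    h, w = img.shape[:2]
    fy = np.fft.fftfreq(h)[:, np.newaxis]
    fx = np.fft.fftfreq(w)[np.newaxis, :]
    low_mask = (fx ** 2 + fy ** 2) < cutoff ** 2
    spectrum = np.fft.fft2(img, axes=(0, 1))
    low = np.fft.ifft2(spectrum * low_mask[:, :, np.newaxis], axes=(0, 1)).real
    return low, img - low

low, _ = split_frequencies(imgs[1], cutoff=0.02)   # coarse features of the chess set
_, high = split_frequencies(imgs[0], cutoff=0.02)  # fine details of Lenna
mixed = np.clip(low + high, 0.0, 1.0)              # clipping hides out-of-range values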

The training variant is orders of magnitude slower than Fourier and takes several minutes on my laptop. In my opinion that is not a problem, since I spent far more time writing the code. I don’t need to generate thousands of pictures; one or two is enough.

The initial state of the trained model is a gray picture. If performance matters, you can take one of the source pictures, or even the result of the Fourier frequency experiments, as a first approximation. But for a single picture, writing and debugging that code takes longer than training from scratch, and the result is about the same.
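
A hypothetical warm start might look like this (the logit undoes the sigmoid applied to trainable_image; this is not in the repository):

# Initialize the variables so that sigmoid(trainable_image) ~= the sharp picture.
p = np.clip(imgs[0], 0.01, 0.99)                    # avoid infinities at 0 and 1
init = np.log(p / (1.0 - p))[np.newaxis, :, :, :]   # logit, shape (1, 512, 512, 3)
model = MyModel(512, 512, gauss_kernel_size=15, gauss_sigma=3, image_source=init)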
