Portrait image harmonization

Hello! In this post, the RnD CV team at SberDevices introduces our approach to making portrait images more realistic (in scientific terms, portrait image harmonization). We have conducted a series of studies and are ready to present:

  • PHNet — a new neural network architecture that solves the problem of portrait harmonization;

  • FFHQH – a large-scale dataset of 70,000 instances, based on the FFHQ dataset from Nvidia.

Our research can be applied in various products and services: in SberJazz for background replacement and portrait harmonization, and in GigaChat when adding the “static image” and “video” modalities to the service. Our developments can also be useful for image enhancement and for image manipulation tasks where objects need to be replaced, removed, or added in a photo, for example in Kandinsky's picture-mixing mode. Finally, our solution can be used to create synthetic datasets.

In this article we will cover the harmonization task and the distinctive aspects of portrait harmonization, and share details about the model architecture, the datasets used, and the experiments performed. At the end of the article we present examples of the model's output and the resulting metrics.

We also evaluated the best open-source solutions for the harmonization task: we trained them on our FFHQH dataset and showed that the PHNet architecture is state-of-the-art (best) for portrait harmonization.

Description of the task

First, let's discuss the classic problem of image harmonization.

Harmonization means correcting the color characteristics not only of an object already present in a photograph, but also of an object transferred from one background to another; such a transfer inevitably introduces inconsistency between the object and the new background.

When an object is added to a new background B, the foreground F of the photo usually does not match that background in contrast, brightness, saturation, and other color characteristics. A mask M is used to extract the object from the image.

Example from the HFlickr dataset

The first photo shows the region F that is inconsistent with the background B; the second shows the mask M of the region F; the third is the harmonized result.

The task is to transform this object so that it becomes harmonized (realistic) with respect to the region B. The harmonized image is a linear combination of the following form:

  I_{harmonized} = f(I_{composite}, M) \cdot M + I_{composite} \cdot (1 - M),

where I_{composite} is the RGB image containing the unharmonized object, f is an algorithm or a neural network, and f(I_{composite}, M) is the image with a harmonized, natural-looking object predicted by f. The object is then extracted from the predicted image using the mask.
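In code this blend is a one-liner; below is a minimal PyTorch sketch, where `model` stands for any harmonization network f:

```python
import torch

def blend(model, i_composite, mask):
    """Compose the final image: network prediction inside the mask,
    original pixels everywhere else.

    i_composite: (B, 3, H, W) RGB image with the unharmonized object
    mask:        (B, 1, H, W) binary foreground mask M
    """
    prediction = model(i_composite, mask)          # f(I_composite, M)
    return prediction * mask + i_composite * (1 - mask)
```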

The portrait harmonization problem is a subtype of the classical harmonization problem in which the object is a portrait of a person. It is more complex not only technically but also due to the lack of data: there are practically no publicly available datasets for this task, and the public FFHQ dataset consists of photographs in which human faces occupy a large area. As a result, the background, which carries the information about how to transform the foreground object, takes up much less space.

Examples from Nvidia's FFHQ dataset

Architecture

We developed a neural network called PHNet (Patch Harmonization Network) to solve the problem of portrait harmonization.

PHNet is a U-Net architecture that uses a Patch-based Normalization (PN) module in the decoder blocks and a Patch-based Feature Extraction (PFE) module at the beginning. These blocks are explained below.

PHNet Architecture

The PN block uses statistical color-transfer methods, computing the following global statistics:

\mu_{glob}(I) = \frac{\sum\limits_{x \in I} x}{N}, \quad \sigma_{glob}(I) = \sqrt{\frac{\sum\limits_{x \in I} (x - \mu_{glob}(I))^2}{N}}

Here N is the number of pixels in the input tensor I \in R^{C_i\times H_i\times W_i}. In stages 4–6 of the decoder, the PN block receives the tensor I from the previous decoder stage together with a downscaled version of the mask \hat{M} \in R^{1\times H_i\times W_i}.

To obtain the tensors characterizing the foreground F and background B regions of the image, I is multiplied pixel-wise by M and 1 - M respectively, giving F \in R^{C_i\times H_i\times W_i} and B \in R^{C_i\times H_i\times W_i}.
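In code, the split and the global statistics above could look like this (a minimal sketch; the eps term is our addition for numerical stability):

```python
import torch

def split_and_global_stats(feat, mask, eps=1e-5):
    """feat: decoder tensor I, shape (B, C, H, W); mask: resized M, (B, 1, H, W)."""
    fg = feat * mask                                      # F = I * M
    bg = feat * (1 - mask)                                # B = I * (1 - M)

    n = feat.shape[-2] * feat.shape[-1]                   # N pixels
    mu = feat.sum(dim=(-2, -1), keepdim=True) / n         # mu_glob(I)
    var = ((feat - mu) ** 2).sum(dim=(-2, -1), keepdim=True) / n
    sigma = (var + eps).sqrt()                            # sigma_glob(I)
    return fg, bg, mu, sigma
```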

Harmonization is then performed in two stages, local and global, which produce the normalized tensors \hat{F}_{loc}, \hat{F}_{glob} and \hat{B}_{loc} used below.

The result of the PN block is:

N(F, B, M) = \tilde F \cdot M + \tilde B\cdot(1 - M),

where \tilde F = w_0 \cdot \hat{F}_{loc} + w_1 \cdot \hat{F}_{glob} + w_2 and \tilde B = \gamma_0 \cdot \hat{B}_{loc} + \gamma_1 \cdot B + \gamma_2.
w_0, w_1, w_2, \gamma_0, \gamma_1, \gamma_2 are trainable parameters.
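Putting the pieces together, a PN block can be sketched as below. The exact construction of \hat{F}_{loc}, \hat{F}_{glob} and \hat{B}_{loc} is not spelled out in this post, so the local and global branches here (background-statistics transfer and patch-mean transfer via average pooling) are our assumptions, not the authors' exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PatchNorm(nn.Module):
    """Sketch of the PN block: N(F, B, M) = F~ * M + B~ * (1 - M)."""

    def __init__(self, patch_size=4, eps=1e-5):
        super().__init__()
        self.patch_size = patch_size
        self.eps = eps
        # Trainable scalars w_0, w_1, w_2 and gamma_0, gamma_1, gamma_2.
        self.w = nn.Parameter(torch.tensor([1.0, 1.0, 0.0]))
        self.gamma = nn.Parameter(torch.tensor([1.0, 1.0, 0.0]))

    def _region_stats(self, x, region_mask):
        # Mean and std of x over the pixels selected by region_mask.
        n = region_mask.sum(dim=(-2, -1), keepdim=True).clamp(min=1.0)
        mu = (x * region_mask).sum(dim=(-2, -1), keepdim=True) / n
        var = (((x - mu) * region_mask) ** 2).sum(dim=(-2, -1), keepdim=True) / n
        return mu, (var + self.eps).sqrt()

    def forward(self, feat, mask):
        fg, bg = feat * mask, feat * (1 - mask)
        mu_f, sd_f = self._region_stats(feat, mask)
        mu_b, sd_b = self._region_stats(feat, 1 - mask)

        # Assumed global branch: renormalize F with background statistics.
        fg_glob = (fg - mu_f) / sd_f * sd_b + mu_b

        # Assumed local branch: the same transfer with patch-wise means
        # (H and W are assumed divisible by patch_size).
        k = self.patch_size
        local_mu = F.interpolate(F.avg_pool2d(feat, k),
                                 size=feat.shape[-2:], mode='nearest')
        fg_loc = fg - local_mu + mu_b
        bg_loc = bg - local_mu + mu_b

        f_tilde = self.w[0] * fg_loc + self.w[1] * fg_glob + self.w[2]
        b_tilde = self.gamma[0] * bg_loc + self.gamma[1] * bg + self.gamma[2]
        return f_tilde * mask + b_tilde * (1 - mask)
```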

The motivation behind the PN block is to capture the visual style of the background region and inject this information into the foreground. This differs from batch normalization layers, which normalize both regions with the same mean and variance.

Visualization of the Patch-based Normalization (PN) block

In a series of experiments we found that using PN modules in three decoder blocks is optimal in terms of metrics. They are shown in the “PHNet Architecture” figure in decoder blocks 4, 5, and 6.

The PFE module is the already familiar PN block, but with only the w_0 and w_1 parameters trainable, while all the others remain fixed. The output of the PFE block is adaptively average-pooled, and the resulting coefficients reweight the channels in the SE block.
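The reweighting step can be illustrated with a standard squeeze-and-excitation sketch; the reduction ratio and activation choices are our assumptions:

```python
import torch
import torch.nn as nn


class SEReweight(nn.Module):
    """Channel reweighting driven by the pooled PFE output (assumed form)."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, feat, pfe_out):
        s = pfe_out.mean(dim=(-2, -1))             # adaptive average pooling
        s = self.fc(s).unsqueeze(-1).unsqueeze(-1)  # per-channel coefficients
        return feat * s                             # reweight the features
```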

The intuition behind this block comes from experiments with I_{composite}, where the foreground can be manipulated to blend into the background to the point of almost disappearing, while still retaining color-independent attributes, as shown in the “PHNet Architecture” figure.

iHarmony4

The most widely used dataset for the harmonization task is iHarmony4. It consists of four datasets: HCOCO, HAdobe5k, Hday2night, and HFlickr. As mentioned above, harmonizing an image means changing the color characteristics of a foreground object.

In the case of iHarmony4, composite images were obtained by applying color transformations to the foreground regions of real photographs. In the foreground, the iHarmony4 dataset includes objects such as animals, full-length people, interior items, and other objects, but no portraits.

In total, the dataset contains 73,146 triplets <I_{real}, M, I_{composite}>, of which 7,404 form the test (aka validation) split.

FFHQH

To create our own portrait dataset, we used the FFHQ dataset, which contains only portrait images and no masks. So we needed to create not only \{I_{composite}\} but also the masks \{M\}.

The process of obtaining I_{composite} and M is as follows:

  • Predicting portrait masks of people with the StyleMatte matting neural network applied to the real images;

  • Binarizing the resulting masks at the standard threshold probability of 0.5. This threshold was chosen based on the histogram of the mask distribution; at this stage the mask M is formed;

  • Applying augmentations to the real images within the mask region M:

    • ColorJitter: brightness (0.5), contrast (0.4), saturation (0.06), hue (0.05);

    • RandomPosterize (p=0.5; range=6);

    • RandomAdjustSharpness (p=0.5; range=4);

    The range of augmentation parameters was carefully chosen to avoid strong and unnatural color distortions. The augmentations used are heterogeneous in terms of color transformations;

  • Thus the background remains untouched, while the person in the portrait is changed and becomes inconsistent with the background, thereby forming I_{composite} (a sketch of this pipeline is shown below).
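A sketch of this generation pipeline using torchvision (the StyleMatte mask prediction is taken as an input here, since it is a separate model; the augmentation parameters are the ones listed above):

```python
import numpy as np
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.5, contrast=0.4,
                           saturation=0.06, hue=0.05),
    transforms.RandomPosterize(bits=6, p=0.5),
    transforms.RandomAdjustSharpness(sharpness_factor=4, p=0.5),
])

def make_triplet(i_real: Image.Image, soft_mask: np.ndarray):
    """Build <I_real, M, I_composite> from a real photo and a matting mask."""
    m = (soft_mask > 0.5).astype(np.float32)[..., None]   # binarize at 0.5

    augmented = np.asarray(augment(i_real), dtype=np.float32)
    real = np.asarray(i_real, dtype=np.float32)

    # Augment only the person inside the mask; keep the background untouched.
    composite = (augmented * m + real * (1 - m)).astype(np.uint8)
    return real.astype(np.uint8), m, composite
```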

Examples from the FFHQH dataset

The FFHQH dataset was divided into 3 samples: 60,000 triplets for training, 5,000 for validation, and 5,000 for testing.

Experiments and results

Data
We trained two models and conducted experiments on iHarmony4 and FFHQH.

Image preprocessing
We did not experiment with augmentations and used only horizontal flipping in all experiments. Before being fed to PHNet, images were scaled to the range [0, 1].
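In torchvision terms the whole preprocessing is just the following (a sketch; in practice the same flip must be applied synchronously to the image, the mask, and the target):

```python
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),  # the only augmentation we used
    transforms.ToTensor(),                   # scales pixels to [0, 1]
])
```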

Metrics
PSNR is used as the main metric in this task:

PSNR = 10 \cdot \log_{10}\left(\frac{MAX^2}{MSE}\right)

Two important remarks are in order. First, the authors of other works compute metrics differently.

Some works use 255 as the MAX value, while others use the maximum pixel value of I_{real} (pixels are encoded from 0 to 255, but the maximum is not always present in a given photo). We also saw an implementation where MAX is the difference between the maximum and minimum of I_{composite}.

The choice of MAX matters because it directly affects the PSNR value. The paperswithcode page lists the top models for the harmonization task: for example, the authors of HDNet use 255 as MAX, while DucoNet and a number of other works use the pixel maximum of I_{real}.
In our experiments we also set MAX to the maximum pixel value of I_{real}, as this is the most common convention.

The second remark is that some authors use non-binary masks to compute metrics. We disagree with this, since three of the four datasets in iHarmony4 contain binary masks, and it is unclear why metrics should be computed with non-binary ones. We tried to clarify this with the authors of four papers but received no answer.

PSNR and MSE (on which the former depends) can be computed not over all pixels of the image but only over those belonging to the foreground region. Such metrics are denoted fPSNR and fMSE.
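To make our convention concrete, here is how we read these definitions (MAX taken as the pixel maximum of I_{real}, masks binary; a sketch):

```python
import numpy as np

def psnr(pred, real, mask=None):
    """PSNR with MAX = pixel maximum of I_real; pass a mask for fPSNR.

    pred, real: float arrays of the same shape;
    mask: boolean array of the same shape, True on the foreground.
    """
    max_val = real.max()
    if mask is None:
        mse = np.mean((pred - real) ** 2)                 # plain MSE
    else:
        diff = np.where(mask, pred - real, 0.0)
        mse = (diff ** 2).sum() / np.count_nonzero(mask)  # fMSE
    return 10 * np.log10(max_val ** 2 / mse)
```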

Loss functions
We use three loss functions: MSE normalized by the foreground area F (FN-MSE), a gradient loss, and a PSNR loss:

PSNR\_Loss = \frac{100}{\log_{10}\left(\frac{MAX}{\sqrt{MSE}}\right)}
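Under these definitions, FN-MSE and the PSNR loss might be implemented as follows (a sketch; the eps terms are our additions for numerical stability):

```python
import torch

def fn_mse(pred, target, mask, eps=1e-6):
    """MSE normalized by the foreground area F (FN-MSE)."""
    sq_err = ((pred - target) ** 2) * mask      # error inside F only
    n_fg = mask.sum() * pred.shape[1]           # foreground pixels * channels
    return sq_err.sum() / (n_fg + eps)

def psnr_loss(pred, target, max_val, eps=1e-6):
    """PSNR_Loss = 100 / log10(MAX / sqrt(MSE))."""
    mse = torch.mean((pred - target) ** 2)
    return 100.0 / torch.log10(max_val / (mse.sqrt() + eps))
```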

Let's move on to the results.

Comparison of models on the iHarmony4 test set

The best results are shown in bold.

The table shows that PHNet is not an ideal model for the conventional harmonization task, even though it achieved the best metric on three of the four datasets. This is because PHNet does not specialize in images whose mask occupies a small area: 78% of the masks in iHarmony4 cover less than 15% of the area of I_{real}.

Broken down by the area that the masks (objects) occupy in iHarmony4, the situation is as follows:

Comparison of models with different foreground ratios

This table confirms that PHNet is not as good for small objects as it is for large ones.

And finally the results on our FFHQH dataset:

Comparison of models on the FFHQH test set

All the listed models were trained on our FFHQH dataset. We used the training parameters reported by the authors in their papers, adapted their repositories, and kept the checkpoint with the best validation PSNR.

All FFHQH masks have an area greater than 40%, which makes the task non-trivial and difficult, but as the results show, PHNet handled it with a large margin over its competitors. The final model has 39.9 million parameters and weighs 153 MB. It runs at 1.01 FPS on an Intel H470 CPU in a single thread and at 34.49 FPS on an NVIDIA Tesla V100.

Models were trained and compared at a resolution of 256×256.

Examples of work

We present two types of visualizations: a comparison of PHNet with other models on the two datasets, and a demonstration of its output on FFHQH.

Visual comparison at FFHQH

Ground Truth stands for I_{real}.

Visual comparison on iHarmony4

Column 4: PHNet predictions

Conclusion

In this article we discussed the image harmonization problem and how our team at SberDevices solved it. We created FFHQH, the first large-scale portrait harmonization dataset, and made it publicly available.

Team of authors

Efremyan Karen, Petrova Elizaveta, Kaskov Evgeniy, Kapitanov Alexander

Links

Our team also has a Telegram channel where we talk about the results of our work and share ideas and failures. Subscribe!
