Semantic segmentation based on the U-Net architecture and determination of the distance between objects

Hi all!

Returning to the everyday use of neural networks: initially the idea was to improve the free-parking-space detection model from my previous article (Determining a free parking space using Computer Vision), so that it could segment the road and the sidewalk and exclude cars standing on the lawn from the parking spaces (there were a few angry comments about this).

However, while thinking it over, I decided to build a separate semantic segmentation model instead, writing the neural network by hand and training it on my own data. The essence of the model is as follows:
A model based on the U-Net architecture segments various objects (cat, chair, table, plate with meatballs, etc.), and when two segmented objects come close to each other (cat and plate), the model signals this via a Telegram bot.

Great, the task is set, now the implementation!

# Libraries
import os
import glob
import cv2

import pandas as pd
import numpy as np
import requests

import tensorflow as tf

from skimage import measure
from skimage.io import imread, imsave, imshow
from skimage.transform import resize
from skimage.morphology import dilation, disk
from skimage.draw import polygon, polygon_perimeter

from livelossplot.tf_keras import PlotLossesCallback

The first thing I need to train a model from scratch is a labeled dataset. I took over 40 staged photos and annotated them using the Supervisely service.

This is not the only image annotation service (there is, for example, the very interesting CVAT). The trial version of Supervisely is quite enough to annotate about 50 photos with 6-8 classes in one day. The service is very convenient and offers many tools for high-quality annotation.

After an hour of fun (well, not that fun) annotation work, here’s what I got:

Labeled image (7 classes + 1 background)

I tried my best, don’t judge too harshly 🙂

So, the dataset is ready; let’s build the final dataset and split it into train and test (the classic way). After augmentation we end up with more than 2,000 images, of which 1,800 go into the training sample and the rest into the test sample.

# Train sample size
train_size = 1800

# Split into train and test
train_dataset = dataset.take(train_size).cache()
test_dataset = dataset.skip(train_size).take(len(dataset) - train_size).cache()

train_dataset = train_dataset.batch(BATCH_SIZE)
test_dataset = test_dataset.batch(BATCH_SIZE)

What does the model see after augmentation
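The augmentation that turns ~40 source photos into 2,000+ samples is not shown above. As a minimal sketch of one such transform (my own assumption of how it could look, the real pipeline is in the repo), a random flip applied identically to an image and its mask:

def augment(image, mask):
    # Apply the same random horizontal flip to the image and its mask
    flip = tf.random.uniform(()) > 0.5
    image = tf.cond(flip, lambda: tf.image.flip_left_right(image), lambda: image)
    mask = tf.cond(flip, lambda: tf.image.flip_left_right(mask), lambda: mask)
    return image, mask

dataset = dataset.map(augment, num_parallel_calls=tf.data.AUTOTUNE)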

Next, let’s build the architecture of the neural network. I chose the classic U-Net architecture, which has proven itself excellent at semantic segmentation tasks.

Classic U-Net architecture

I deliberately did not wrap the layers into block functions, so that anyone can read the architecture layer by layer. It turned out something like this:

My version of U-Net

In essence, U-Net consists of two parts: an encoder and a decoder. The input is a three-channel image, 256×256 in our case; the encoder then downsamples it, extracting at each level a feature map describing the objects (shapes, sizes, colors, etc.).

def unet_model(image_size, output_classes):

    # Input layer
    input_layer = tf.keras.layers.Input(shape=image_size + (3,))
    conv_1 = tf.keras.layers.Conv2D(64, 4, 
                                    activation=tf.keras.layers.LeakyReLU(),
                                    strides=2, padding='same', 
                                    kernel_initializer="glorot_normal",
                                    use_bias=False)(input_layer)
    # Downsampling (encoder)
    conv_1_1 = tf.keras.layers.Conv2D(128, 4, 
                                      activation=tf.keras.layers.LeakyReLU(), 
                                      strides=2,
                                      padding='same', 
                                      kernel_initializer="glorot_normal",
                                      use_bias=False)(conv_1)
    batch_norm_1 = tf.keras.layers.BatchNormalization()(conv_1_1)

    #2
    conv_2 = tf.keras.layers.Conv2D(256, 4, 
                                    activation=tf.keras.layers.LeakyReLU(), 
                                    strides=2,
                                    padding='same', 
                                    kernel_initializer="glorot_normal",
                                    use_bias=False)(batch_norm_1)
    batch_norm_2 = tf.keras.layers.BatchNormalization()(conv_2)

    #3
    conv_3 = tf.keras.layers.Conv2D(512, 4, 
                                    activation=tf.keras.layers.LeakyReLU(), 
                                    strides=2,
                                    padding='same', 
                                    kernel_initializer="glorot_normal",
                                    use_bias=False)(batch_norm_2)
    batch_norm_3 = tf.keras.layers.BatchNormalization()(conv_3)

    #4
    conv_4 = tf.keras.layers.Conv2D(512, 4, 
                                    activation=tf.keras.layers.LeakyReLU(), 
                                    strides=2,
                                    padding='same', 
                                    kernel_initializer="glorot_normal",
                                    use_bias=False)(batch_norm_3)
    batch_norm_4 = tf.keras.layers.BatchNormalization()(conv_4)

    #5
    conv_5 = tf.keras.layers.Conv2D(512, 4, 
                                    activation=tf.keras.layers.LeakyReLU(), 
                                    strides=2,
                                    padding='same', 
                                    kernel_initializer="glorot_normal",
                                    use_bias=False)(batch_norm_4)
    batch_norm_5 = tf.keras.layers.BatchNormalization()(conv_5)

    #6
    conv_6 = tf.keras.layers.Conv2D(512, 4, 
                                    activation=tf.keras.layers.LeakyReLU(), 
                                    strides=2,
                                    padding='same', 
                                    kernel_initializer="glorot_normal",
                                    use_bias=False)(batch_norm_5)

Then we upsample and, at each step, concatenate with the corresponding feature maps from the first stage (the skip connections).

    # Upsampling (decoder)
    #1
    up_1 = tf.keras.layers.Concatenate()([tf.keras.layers.Conv2DTranspose(512, 4, activation='relu', strides=2,
                                                                          padding='same',
                                                                          kernel_initializer="glorot_normal",
                                                                          use_bias=False)(conv_6), conv_5])
    batch_up_1 = tf.keras.layers.BatchNormalization()(up_1)

    # Add Dropout to combat overfitting
    batch_up_1 = tf.keras.layers.Dropout(0.25)(batch_up_1)

    #2
    up_2 = tf.keras.layers.Concatenate()([tf.keras.layers.Conv2DTranspose(512, 4, activation='relu', strides=2,
                                                                          padding='same',
                                                                          kernel_initializer="glorot_normal",
                                                                          use_bias=False)(batch_up_1), conv_4])
    batch_up_2 = tf.keras.layers.BatchNormalization()(up_2)
    batch_up_2 = tf.keras.layers.Dropout(0.25)(batch_up_2)

    #3
    up_3 = tf.keras.layers.Concatenate()([tf.keras.layers.Conv2DTranspose(512, 4, activation='relu', strides=2,
                                                                          padding='same',
                                                                          kernel_initializer="glorot_normal",
                                                                          use_bias=False)(batch_up_2), conv_3])
    batch_up_3 = tf.keras.layers.BatchNormalization()(up_3)
    batch_up_3 = tf.keras.layers.Dropout(0.25)(batch_up_3)

    #4
    up_4 = tf.keras.layers.Concatenate()([tf.keras.layers.Conv2DTranspose(256, 4, activation='relu', strides=2,
                                                                          padding='same',
                                                                          kernel_initializer="glorot_normal",
                                                                          use_bias=False)(batch_up_3), conv_2])
    batch_up_4 = tf.keras.layers.BatchNormalization()(up_4)

    #5
    up_5 = tf.keras.layers.Concatenate()([tf.keras.layers.Conv2DTranspose(128, 4, activation='relu', strides=2,
                                                                          padding='same',
                                                                          kernel_initializer="glorot_normal",
                                                                          use_bias=False)(batch_up_4), conv_1_1])
    batch_up_5 = tf.keras.layers.BatchNormalization()(up_5)

    #6
    up_6 = tf.keras.layers.Concatenate()([tf.keras.layers.Conv2DTranspose(64, 4, activation='relu', strides=2,
                                                                          padding='same',
                                                                          kernel_initializer="glorot_normal",
                                                                          use_bias=False)(batch_up_5), conv_1])
    batch_up_6 = tf.keras.layers.BatchNormalization()(up_6)

    # Output layer
    output_layer = tf.keras.layers.Conv2DTranspose(output_classes, 4, activation='sigmoid', strides=2,
                                                   padding='same',
                                                   kernel_initializer="glorot_normal")(batch_up_6)

    model = tf.keras.Model(inputs=input_layer, outputs=output_layer)
    return model

As the loss we will use a mixture of binary cross-entropy and Dice (Dice works well for segmentation, and binary cross-entropy provides good convergence).

# Binary crossentropy + 0.25 * DICE
# Keras passes the arguments to a loss function as (y_true, y_pred)
def dice_bce_loss(y_true, y_pred):
    total_loss = 0.25 * dice_loss(y_true, y_pred) + tf.keras.losses.binary_crossentropy(y_true, y_pred)
    return total_loss
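The dice_loss referenced here is not shown in the snippet; a common soft-Dice formulation that fits this signature could look roughly like this (a sketch, not necessarily the exact version from the repo):

def dice_loss(y_true, y_pred, smooth=1e-6):
    # Soft Dice: 1 - 2*|intersection| / (|y_true| + |y_pred|)
    y_true = tf.cast(y_true, tf.float32)
    y_pred = tf.cast(y_pred, tf.float32)
    intersection = tf.reduce_sum(y_true * y_pred)
    denominator = tf.reduce_sum(y_true) + tf.reduce_sum(y_pred)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)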

Having roughly understood how the chosen network works, let’s train the model (a sketch of the training setup is below). After that, I show examples of the model’s output after 5, 10 and 25 epochs of training.
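A sketch of the training setup (the optimizer, learning rate and metric are my assumptions; the full code is on GitHub):

# 8 output channels: 7 classes + 1 background
model = unet_model((256, 256), output_classes=8)

model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss=dice_bce_loss,
              metrics=['accuracy'])

history = model.fit(train_dataset,
                    validation_data=test_dataset,
                    epochs=10,
                    callbacks=[PlotLossesCallback()])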

5 epochs

10 epochs

25 epochs

You can see that after 5 epochs the model is still undertrained, while after 25 epochs it overfits. As a result, I decided to stop at the model trained for 10 epochs with dropout.

The training process over 10 epochs; after the 8th epoch the model reaches a plateau

Static objects that differ in color and shape (such as the tabletop or the plate) the model recognizes perfectly, but if the mask of a red cat overlaps a red tabletop, the model starts to get confused; most likely it would also mix up classes under a different arrangement of objects. Additional training on a better dataset would solve this problem, but for our everyday research purposes the result is good enough.

So, the first part of the work is done. We now have a model that segments the following classes in the frame: tabletop, table legs, chair, chair legs, cat, plate with meatballs, background.

Next, we will make a small feature to measure the distance between class objects.

Euclidean distance

Everything is simple here: we take a point in the middle of the plate object and the nearest edge point of the cat’s mask contour, and measure the distance between the two points. How to calculate the distance between two points probably needs no explanation (I’ll tell you anyway: with the Euclidean distance).

# Euclidean distance between two points (row, col)
def distance_between_p(p1, p2):
    dis = ((p2[0] - p1[0]) ** 2 + (p2[1] - p1[1]) ** 2) ** 0.5
    return dis
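How the two points can be pulled out of the predicted masks, as a sketch (assuming pred is the model output of shape 256×256×8 and the channel indices of the plate and cat classes are known; the helper name is my own):

def closest_points(pred, plate_idx, cat_idx, threshold=0.5):
    # Binarize the two channels we care about
    plate_mask = pred[..., plate_idx] > threshold
    cat_mask = pred[..., cat_idx] > threshold

    # Center of the plate: mean coordinate of its mask pixels
    plate_center = np.argwhere(plate_mask).mean(axis=0)

    # Contour of the cat mask and the contour point closest to the plate center
    cat_contour = np.concatenate(measure.find_contours(cat_mask.astype(float), 0.5))
    distances = [distance_between_p(plate_center, p) for p in cat_contour]
    return plate_center, cat_contour[int(np.argmin(distances))], min(distances)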

We define several “levels” of triggering: the cat is on the table, the cat has its muzzle in the meatballs (these are configurable parameters and depend on the specific situation). Here is what our model sees on the technical side:

Segmentation and distance calculation between objects

This is where poor segmentation gets in the way. There may be objects in the frame that the model mistakes for another cat, for example a piece of the red wall in the corner of the frame (or part of the table), and the distance is then calculated to this incorrectly segmented object.

Mistakes of bad segmentation

In our case, one of the crutches I put together can help, for example this one: the model reacts to a sharp change in the distance between the two objects (if the distance between the plate and the cat was 100 and suddenly grew five-fold, we treat it as bad segmentation).
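A sketch of that crutch (the five-fold factor is taken from the example above):

def looks_like_bad_segmentation(prev_distance, new_distance, factor=5):
    # A sudden jump in the measured distance is treated as a segmentation error
    return prev_distance is not None and new_distance > prev_distance * factor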

But really, the proper fix is to go back to the first stage of development and retrain the model on better data.

Let’s add my favorite part: after each triggered security “level”, a Telegram message is sent 🙂
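The notification itself can be sent through the Telegram Bot API with a plain requests call; BOT_TOKEN, CHAT_ID and the two distance thresholds below are placeholders for your own values:

BOT_TOKEN = '...'  # token issued by @BotFather
CHAT_ID = '...'    # id of the chat to notify

PLATE_ALERT_DISTANCE = 40   # example threshold: muzzle in the meatballs
TABLE_ALERT_DISTANCE = 150  # example threshold: cat on the table

def send_alert(text):
    # sendMessage method of the Telegram Bot API
    url = f'https://api.telegram.org/bot{BOT_TOKEN}/sendMessage'
    requests.get(url, params={'chat_id': CHAT_ID, 'text': text})

# distance is the value measured between the plate and the cat (see above)
if distance < PLATE_ALERT_DISTANCE:
    send_alert('Alert: the cat has its muzzle in the meatballs!')
elif distance < TABLE_ALERT_DISTANCE:
    send_alert('Warning: the cat is on the table!')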

The final result of the model

That’s basically it. The result is a model that detects a cat in the frame and reacts when it approaches specially designated (and segmented) objects. Overall it is a fairly simple task; the difficulties lie in the quality of the annotated dataset and in choosing the right model settings. Even such a primitive model can solve serious problems, for example a cat eating flowers in a room: the point is not so much saving the flowers as saving the furball from poisonous plants (aloe is deadly for cats, yet many people keep it). By setting up a web camera at home and annotating the apartment, you can let the program watch the fluffy one in your absence. And you can add many other functions on top.

The main purpose of this article is to introduce everyone to a basic semantic segmentation model built on the U-Net architecture. You can find the full code on my GitHub page (https://github.com/Mazepov/Cat_Segmentation).

That’s all from me. Many more interesting neural-network projects (from my head and not only) lie ahead!

PS The cat was eventually fed, don’t worry 🙂
