Do computers dream of building houses? Or how to make neural networks detect renovations in apartments and improve advertisements

What does building a house look like to most people? A pit, sand, cement, some blocks, people and machinery scurrying around, noise and dust for a couple of years, and the house is ready. In reality, it hasn't worked that way for a long time. Or rather, it has, but that, as they say, is only the frontend. Construction has long since become not a physical but a cyberphysical process, so it has a backend too. It includes working with data at every stage, from planning to assessing the quality of finishing, using neural networks to analyze sales listings, building economic models, and much more. In short, creating a house is an IT project that begins long before construction and doesn't end after handover to residents, because data collection and processing continue during operation. My name is Alexey, I'm the technical lead of the Computer Vision direction in the Data Science team at Samolet, and I'll tell you all about it.

You've surely heard of Samolet, most likely as a property developer or builder. In fact, the company consists of many business units, and only some of them are directly responsible for construction. The rest, including mine, prepare the "soil" for it, analyze data along the way, help with assessment after construction, and even with sales on the secondary market. Today I'll briefly show what interesting things happen in construction from an IT point of view.


Not property development, but software development

Historically, companies that build houses are called developers, and so are those who develop a project from scratch: plan it, create the house, the infrastructure, and so on. In our circles, of course, a developer is someone who writes code. At Samolet things worked out so that the company employs many developers of both kinds. We embarked on the path of digitalization long ago and are systematically transforming from a development company into an IT company, building products that have little to do with square meters. See for yourself: more than 1,500 software developers work here daily across 8 business units.

These days no construction starts without pre-prepared projects, within which the input data is studied, the necessary materials are calculated, risks are worked through, and so on. A data-driven approach is used at every stage in the company, and although it isn't visible from the outside, the technological part of the process is almost bigger than the physical one. For that we need a lot of data. Its collection and analysis is handled by a dedicated data directorate in the "Strategic Support Center" business unit, which has built a single data warehouse and processes the most valuable pieces of information from all directions, from the acquisition of a construction site to the daily work of the management company operating the building. That is the department I work in.

Data is not only about reporting, but also about predictive analytics.

70% of success is automating the decision-making process.

One of the most interesting things my team and I do, in my opinion, is projects related to Computer Vision. We work closely with recognition and classification systems, and even with generative networks. Believe me, there is plenty of room to maneuver here. Moreover, over the past few years the number of convenient libraries and tools, and their accuracy, have grown exponentially, which lets us quickly test hypotheses and implement complex projects that were pure fantasy just a couple of years ago.

Figure: number and quality of models for working with images over the past 10 years. Source: https://paperswithcode.com/sota/image-classification-on-imagenet

I won't dwell on the rather boring classification of images for sales listings, where we determine whether the room in a photo is a kitchen, a bathroom or a living space. Instead I'll cover the more interesting cases where we use computer vision. Our stack is fairly large and deserves a separate post someday; for now I'll just note that the main tools at the moment are Python, Spark, SQL, Hive and Airflow, and for computer vision, of course, PyTorch (about 90% of tasks) and TensorFlow (about 10%). Depending on the task, the toolset may change, expand or be supplemented, so this list is accurate only as of today.

We recognize documents and control construction

Construction control

Previously, dedicated inspectors monitored the state of construction (they still do, of course), but now we can hand auxiliary functions over to computer vision algorithms. For example, they let us monitor compliance with the technical process: that the masonry is laid correctly, that the glazing is installed properly. Using installed cameras, camera traps and machine learning models, we can track more than 30 parameters (16 of which are classes related to interior finishing).

Just as importantly, we also monitor safety. When the monolith has been erected but there is no masonry or windows yet, there must be safety nets on each floor. From snapshots of the video stream, the model can determine whether a net is present. It can also verify that workers on site are wearing helmets and that other safety requirements are observed. If a violation appears somewhere, the model promptly notifies the team so they can fix it. A minimal sketch of such a frame check is shown below.
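To give a feel for how such a check can look, here is a minimal sketch. It assumes a binary classifier already fine-tuned on site footage ("safety net present" vs "missing"); the model file, stream URL and the one-frame-per-second sampling are illustrative, not our production setup.

import cv2
import torch
from torchvision import transforms

# Hypothetical fine-tuned binary classifier exported with TorchScript;
# in practice any CNN backbone fine-tuned on site footage would do.
model = torch.jit.load("safety_net_classifier.pt").eval().to("cuda")

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def net_present(frame_bgr) -> bool:
    """Return True if the safety net is detected on the frame."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    batch = preprocess(rgb).unsqueeze(0).to("cuda")
    with torch.no_grad():
        logits = model(batch)
    return logits.softmax(dim=1)[0, 1].item() > 0.5  # class 1 = "net present"

# Sample roughly one frame per second from the camera stream.
cap = cv2.VideoCapture("rtsp://camera/stream")
fps = int(cap.get(cv2.CAP_PROP_FPS)) or 25
frame_idx = 0
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % fps == 0 and not net_present(frame):
        print("Violation: no safety net detected, notifying the site team")
    frame_idx += 1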

Document recognition

Of course, we deal with a huge amount of project documentation. Unfortunately, not all suppliers, contractors and partners have adopted electronic document management, and some documents still exist on paper. And part of what is electronic is not always classified.

So we developed an internal system that recognizes the type of a document by its appearance and semantics, in order to classify it correctly and route it to the right department for processing. Inside, of course, there is a neural network that works not only with document files but also with scans and even photographs. We obtain embeddings (vectors) of the document's form and of the document's text (OCR methods plus text models), and from them we predict the document type. This is not the easiest task: the attentive reader will notice that documents of the TTN type are recognized only 47% of the time. The problem is that this document type is very similar to the UPD, both in form and in content. There aren't many such documents at the moment, so the problem is not critical, but we are thinking about how to solve it.
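A simplified sketch of the idea follows: a visual embedding of the page and a text embedding of the OCR output are concatenated and fed to a plain classifier. Our real network and pipeline are in-house; the public model names, the Tesseract OCR step, and train_paths/train_labels here are stand-in assumptions.

import numpy as np
import pytesseract
from PIL import Image
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

img_encoder = SentenceTransformer("clip-ViT-B-32")           # form/appearance embeddings
txt_encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def doc_features(path: str) -> np.ndarray:
    page = Image.open(path).convert("RGB")
    form_emb = img_encoder.encode(page)                       # how the page looks
    text = pytesseract.image_to_string(page, lang="rus+eng")  # OCR step
    text_emb = txt_encoder.encode(text)                       # what the page says
    return np.concatenate([form_emb, text_emb])

# 'train_paths'/'train_labels' are assumed to exist: scans labeled with
# document types such as "UPD" or "TTN".
X = np.stack([doc_features(p) for p in train_paths])
clf = LogisticRegression(max_iter=1000).fit(X, train_labels)

print(clf.predict(doc_features("incoming_scan.jpg")[None, :]))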

Something more complicated: monitoring repairs and improving photos

Repair monitoring

Besides supervising construction, we can also supervise finishing work. The mechanism is similar, except that instead of stationary cameras and camera traps, the cameras here are carried by technical supervision employees. For them everything looks the same as always: they walk into each apartment and check whether the sockets are in, whether the laminate is laid, and so on, but we have improved this process. Here is how it works:

  1. A technical supervision officer films the apartment using a mobile application we developed;

  2. Videos are sent from mobile devices to the server over the BitTorrent protocol, so uploads can resume after an unstable connection;

  3. The video is automatically cut into frames, and the frames are processed on GPUs;

  4. Based on the processing results, a per-frame report is generated;

  5. The percentage of completed finishing is calculated as the ratio of frames containing the required object to frames where that class is expected (see the sketch after this list).
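For illustration, a minimal sketch of step 5. The input structures (frame_predictions, expected_classes) are assumed shapes, not our actual data model:

from collections import Counter

# frame_predictions: per-frame model outputs for one apartment, e.g.
#   [{"room": "kitchen", "classes": {"socket": True, "laminate": False}}, ...]
# expected_classes: per room type, which finishing classes must be present.
def finishing_progress(frame_predictions, expected_classes):
    present = Counter()
    total = Counter()
    for frame in frame_predictions:
        for cls in expected_classes.get(frame["room"], []):
            total[cls] += 1                          # frames where the class should be
            present[cls] += bool(frame["classes"].get(cls))
    return {cls: present[cls] / total[cls] for cls in total if total[cls]}

progress = finishing_progress(frame_predictions, expected_classes)
print({cls: f"{share:.0%}" for cls, share in progress.items()})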

To make this work, we created an algorithm which, using a model (we used convnext_large_in22k, i.e. ConvNeXt Large), determines the state of the finishing from a photo or video and displays it on screen after processing. We also built a convenient admin control panel to oversee the work. Below is a rough sketch of how such a model can be assembled.
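The sketch below uses timm with the model name as given in this post (newer timm versions alias it to convnext_large.fb_in22k); the class count, single-label head and frame_pil input are illustrative assumptions, and the fine-tuning loop itself is omitted.

import timm
import torch

NUM_CLASSES = 16  # the interior finishing classes mentioned above
model = timm.create_model("convnext_large_in22k", pretrained=True,
                          num_classes=NUM_CLASSES)  # head replaced for our classes

# Build the preprocessing that matches the pretrained weights.
cfg = timm.data.resolve_data_config({}, model=model)
transform = timm.data.create_transform(**cfg)

# ... standard fine-tuning loop on labeled frames goes here ...

model.eval()
with torch.no_grad():  # frame_pil: a PIL image of one video frame
    logits = model(transform(frame_pil).unsqueeze(0))
print(logits.softmax(dim=1).argmax(dim=1))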

Then we collected the data, carried out preprocessing (see the Jupyter notebook) for all sites, trained the model, classified everything inside the apartments and got the following picture.

Naturally, every apartment has living rooms and bathrooms, and many have corridors. The classification of finishes depended on the training sample. We took the basic parameters, but you can of course extend them if your business needs it. For labeling, we shot 5 hours of video at two sites, which turned out to be enough. At 30 FPS, 5 hours = 300 minutes = 18,000 seconds × 30 frames = 540,000 images, from which you can select the required number of examples for the dataset. We selected 10,000 images for the pilot.

After the initial labeling and training of the model, we were ready to get results on new data. Next you need to evaluate per-class statistics, look at intersections (where two mutually exclusive classes co-occur), do additional labeling and retrain the model on the new data, and possibly adjust the classes (add/remove). The first runs of the neural network on new data showed very good results.

Architecture | Number of parameters | FPS | F1 score
--- | --- | --- | ---
convnext_tiny | 28 million | 69 | 0.762
vit_giant_patch14_224_clip_laion2b | 18 million | 18 | 0.830

As you can see, the F1 metric (the harmonic mean of precision and recall) for the two suitable models is close to 0.8, meaning that in roughly 80% of cases the model correctly determines what is in the image. Reaching 100% is impossible, and an error rate around 20% is considered a good result here, especially since this tool only raises warnings, which a person then checks. For reference, the metric itself is trivial to compute:
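A tiny check, assuming y_true and y_pred are label arrays for one class (one-vs-rest):

from sklearn.metrics import f1_score, precision_score, recall_score

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
# F1 is exactly the harmonic mean of precision and recall:
assert abs(f1 - 2 * p * r / (p + r)) < 1e-6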

As a result, we get a summary table of data about the site; the information is stored in the corporate Hadoop warehouse for transfer to other Samolet systems and further work by other teams. The data is needed first of all by production engineers and deputy construction directors to monitor dynamics, and by company management to spot lagging sites, monitor the progress of interior finishing, and take management decisions when sites fall behind the plan. An example of a report is shown below, and the full report can be viewed as a PDF.

This process let us reduce inspection time, speed up the work, cut fines and the time spent dealing with contractors, and lower incident costs. According to our calculations, the savings amount to 10-20 million rubles per site. Take a look at how a process that used to take dozens of man-hours looks now.

Improving photos in ads

The Samolet Plus service approached our division with a request for a solution that could automatically correct images, for example when users create listings to sell apartments: the images should contain no duplicates, no blurry photos, and so on. This work covers several problems at once. For each one we either used an out-of-the-box solution, if it provided the necessary and sufficient quality, or did our own research and took the most suitable method:

  • Removing duplicates (we take photo vectors with the clip-ViT-B-32 transformer architecture and estimate the cosine similarity between them)

  • Rotating a photo to the correct orientation (using an EfficientNet_V2_S classifier)

  • Removing watermarks (we tried different solutions but haven't found the ideal one yet)

  • Increasing photo resolution (using the EDSR model in Python; more about that solution below)

  • General photo improvement: brightness, contrast, sharpness, etc. (we took DPED, Deep Convolutional Networks, as a base, but modified it and built an interface for convenience)

We noticed that when publishing advertisements for the sale or rental of an apartment, users often upload low-quality photos or a large number of identical or almost identical images. To avoid this problem, we trained our models to detect duplicates, as well as similar photos taken even from slightly different angles.

Thus, by computing the cosine similarity (the "closeness" of two vectors the neural network extracts from images), we can remove absolute (exact) and visual duplicates, as well as rotated and flipped copies of existing images. Of each group of duplicates, the most informative photo remains: the one with a better angle, more detail, higher quality/resolution, and so on. As a result we get a sensible set of photos for each apartment, without information the buyer or tenant doesn't need.

from PIL import Image
import numpy as np
from sentence_transformers import SentenceTransformer, util


def duplicates(
    img_arrays,
    similarity_threshold: float = 0.9,
    batch_size: int = 64,
    device: str = "cuda:0",
) -> list:
    """Return a 0/1 flag per image: 1 means the image duplicates an earlier one."""
    if len(img_arrays) == 0:
        return None
    elif len(img_arrays) == 1:
        return [0]

    # Embed all images with CLIP in batches.
    model = SentenceTransformer('clip-ViT-B-32')
    img_embs = model.encode(
        [Image.fromarray(img) for img in img_arrays],
        batch_size=batch_size,
        convert_to_tensor=True,
        device=device,
    )

    img_data = dict(enumerate(img_embs))
    img_unique = [0]  # the first image is unique by definition

    # Greedily keep an image only if it is not too similar to any kept one.
    for img_to_compare in list(img_data.keys())[1:]:
        threshold_overcome = False
        for img_compared in img_unique:
            metric = util.cos_sim(img_data[img_to_compare], img_data[img_compared])
            if metric >= similarity_threshold:
                threshold_overcome = True
                break
        if not threshold_overcome:
            img_unique.append(img_to_compare)

    is_duplicate = np.ones(len(img_arrays), dtype=int)
    is_duplicate[img_unique] = 0
    return is_duplicate.tolist()

After that, using models (we used the DPED project, which applies a deep learning approach to convert smartphone photos into DSLR-quality images), we can correct poor lighting in photos, increase contrast and even sharpen images. As a rough feel for this step, below is a naive non-neural baseline; the DPED-based model effectively replaces such hand-picked factors with learned corrections.
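A minimal baseline with PIL (the enhancement factors here are arbitrary illustrations, not tuned values):

from PIL import Image, ImageEnhance

# Naive brightness/contrast/sharpness pass; a learned model like DPED
# picks the corrections per image instead of using fixed factors.
img = Image.open("path/to/img.jpg")
img = ImageEnhance.Brightness(img).enhance(1.15)  # +15% brightness
img = ImageEnhance.Contrast(img).enhance(1.10)    # +10% contrast
img = ImageEnhance.Sharpness(img).enhance(1.30)   # mild sharpening
img.save("path/to/img_enhanced.jpg")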

At the end of the processing pipeline we use the EDSR (Enhanced Deep Super-Resolution) solution to upscale the original image, because users don't always take photos of suitable quality. Compared with x4 bicubic interpolation (the naive approach), the neural network implementation gives a visibly better, usable result. And the implementation is as simple as it gets: look at the code, which does everything we need using only the cv2 library.

import cv2
from cv2 import dnn_superres  # requires opencv-contrib-python

img_orig = cv2.imread('path/to/img.jpg')

scale_coef = 4  # upscale factor; pretrained EDSR weights exist for x2/x3/x4
sr = dnn_superres.DnnSuperResImpl_create()
# Pretrained weights: https://github.com/Saafke/EDSR_Tensorflow/tree/master/models/
sr.readModel(f'path/to/EDSR_x{scale_coef}.pb')
sr.setModel("edsr", scale_coef)

# Run inference on the GPU if OpenCV was built with CUDA support.
sr.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
sr.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

img_upscaled_edsr = sr.upsample(img_orig)

# Naive baseline for comparison: plain x4 bicubic interpolation.
img_upscaled_bicubic = cv2.resize(
    img_orig, None, fx=scale_coef, fy=scale_coef, interpolation=cv2.INTER_CUBIC
)

One of the tasks we still have to solve is removing watermarks from photos. No, we don't want to use illegally obtained photos under any circumstances, but situations vary: a user's smartphone stamps the gadget's name on the photo, or a photo-editing app or cloud storage service overlays its logo. This only gets in the way of model training. Solutions without neural networks (cv2 tools) work poorly or not at all: their main idea is to "find" the watermark through differences in contrast, color gamut and other guesses about pixel-level differences. The result is a heavily distorted image; see the example below.

Original

Option 1 (contrast distortion)

Option 2 (geometry distortion)

Open machine learning solutions often require a mask of the watermark (its shape and appearance), which is not always obtainable; ideally we want to remove an arbitrary watermark without any prior information about it. Another group of solutions only finds the region around the watermark but doesn't actually remove it, so extra models and resources are needed to process it further. The rest simply don't work, or work poorly.


We have tested various options with Unet, Watermark-Removal-PyTorch, deep-image-prior and other solutions, but haven't found a suitable ready-made or semi-ready one. We also tried building our own solution on U-Net and GAN architectures, with similarly unsatisfying results. We could have spent more time polishing these approaches, and most likely would have succeeded, but instead we turned to other sources and obtained high-quality, watermark-free photos for further development, so the need for a solution disappeared. That happens too: the problem gets solved not head-on but by other means. Fortunately, we are always ready to experiment and look for solutions in different places.

What’s next? Generative models!

I won't get too far ahead of myself, but I'll briefly mention one of the key topics we're working on. Yes, like the rest of the world, we've become fascinated by generative models. We already know how to generate a design from an apartment plan, modify it, even transfer a style, and so on, but of course there are certain difficulties.

For example, there are nuances in training: with GANs there is the "mode collapse" problem, when the model starts generating a very limited set of images and cannot reproduce the full variety of the training data. There are also quality issues: even with proper setup and training, the generated images can be less realistic than we'd like, sometimes with unnatural artifacts or distortions. And we cannot fully control the generation process; or rather we can, but only to a limited extent, and for now it is rather difficult and time-consuming.

Still, we are actively working with GANs, because these models can produce the content we need far cheaper and easier than humans can. We can also use such content to augment existing datasets or improve data quality. Using generative models to build interactive systems that respond to user input and generate relevant content in real time is still a dream, though. In the examples below we use Stable Diffusion + ControlNet; ready-made solutions didn't get us there, but with modifications to the base libraries we achieved a very good result. It's as if our "dream" has already come true, yet there is still room to grow. A minimal public-weights sketch of this kind of pipeline follows.
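This sketch uses the stock diffusers pipeline with public checkpoints (the MLSD line-conditioned ControlNet suits room layouts); our production pipeline relies on modified base libraries and in-house weights, and the prompt and file names here are illustrative.

import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Straight-line (room layout) conditioning via the MLSD ControlNet.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-mlsd", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

layout = load_image("apartment_plan_edges.png")  # conditioning image (line/edge map)
result = pipe(
    "modern kitchen interior, bright, photorealistic",
    image=layout,
    num_inference_steps=30,
).images[0]
result.save("generated_interior.png")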

We are also using NeRF models to generate full-fledged 3D, which, as you can imagine, is a killer feature for construction and apartment design. Neural Radiance Fields (NeRF) is a fully connected neural network that can generate novel views of complex 3D scenes from a partial set of 2D images; you could call it photogrammetry backed by artificial intelligence. NeRF is, of course, computationally intensive, and processing complex scenes can take hours or days, but we are currently testing different options and looking for ways to optimize performance.

Instead of a conclusion

We are now doing our best to bring this future closer and hope to show you something soon. As you can see, our work has almost nothing to do with construction. More precisely, for us the house is the foundation, the source of data, and at the same time the end result; but to reach it we use many different mechanisms and tools to study the data, process it correctly, and hand it to those who can use it well. Building a house at Samolet is not only, and not so much, the construction itself as the study, analysis, and enormous preparation and post-processing around it, so much so that most of it can be called an IT project in which construction is just one of the sprints. Tell us in the comments which of the stages I mentioned you'd like to read more about.
