Breaking down the baseline solution for linking aerial images to the terrain from Digital Breakthrough

In the Hare on HaKhatons channel, I talk about the tasks of the All-Russian Championship of Digital Breakthrough, explain the baseline, and give tips that will help you climb higher in the rankings. This article covers a case from the Moscow Institute of Physics and Technology on linking aerial photographs to the terrain.

This article is special because it contains a fixed version of a baseline that did not work at first. The solution below now scores well enough for 9th place on the leaderboard!

Spoiler: at the end of the article there are tips to improve the basic solution.

Digital Breakthrough

I think everyone already knows what Digital Breakthrough is. Still, let me remind you that this year the main topic is artificial intelligence. And the season is in full swing!

Although some of the events have already passed, there are 19 more regional championships, 5 district hackathons and 3 all-Russian championships ahead of the participants. I advise you to join me and other participants so as not to miss the opportunity to win cash prizes and cool trips, as well as gain experience in a variety of tasks.


In the modern world, a huge number of tasks are solved with the help of satellite photographs and aerial photographs. Often, the speed and quality of interpretation of this data determines how quickly fires, floods and other emergencies are detected. Machine vision technologies are just beginning to be used in solving such problems, but the need for their use is constantly growing.

Solving this problem makes it possible to link images to geographic coordinates quickly, which can speed up geodetic work, help find missing people faster, and help monitor deforestation. And this is only a short list of the cases where aerial photographs need to be linked to the terrain.

The participants of the championship are asked to find the location and orientation of a photo on a very large georeferenced image of the area.

The task

The purpose of the task is to find the location and orientation of the image on the substrate.

To better understand the context of the task, participants should familiarize themselves with the following terms:

  • substrate – an extremely large image (in both height and width) that is georeferenced to the area, i.e. the coordinates of each pixel are known or can be calculated. As a rule, the image covers a large area of land (square kilometers or more);

  • aerial photograph – an image from a satellite or an unmanned aerial vehicle, taken with the camera pointing vertically down. It has a significantly lower resolution than the substrate and is essentially a photograph taken with a regular camera. Its main feature is that it was taken at a different time than the substrate: in a different season or even a different year, and possibly from a different height;

  • overlap – an arrangement of images in which the same area of terrain is visible in two or more aerial photographs. Mutual orientation of multi-temporal images of different resolutions involves comparing the images and georeferencing them, traditionally through manual comparison with the map by an operator.
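To make "the coordinates of each pixel can be calculated" more concrete, here is a minimal sketch. It assumes the georeferencing is given as a six-parameter affine transform in the convention used by GDAL. The function name pixel_to_geo and all numeric values are hypothetical; the article does not describe how the competition substrate is actually georeferenced.

```python
def pixel_to_geo(col, row, gt):
    """Map a pixel (col, row) to map coordinates using an affine geotransform
    gt = (x_origin, x_per_col, x_per_row, y_origin, y_per_col, y_per_row)."""
    x = gt[0] + col * gt[1] + row * gt[2]
    y = gt[3] + col * gt[4] + row * gt[5]
    return x, y

# Hypothetical north-up raster: origin at (500000, 6200000), 0.5 m per pixel
gt = (500000.0, 0.5, 0.0, 6200000.0, 0.0, -0.5)
top_left = pixel_to_geo(0, 0, gt)
bottom_right = pixel_to_geo(10496, 10496, gt)
```

With such a transform, a position predicted in substrate pixels turns into map coordinates with two multiply-adds per axis.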


The data are aerial photographs of a fixed size:

  • train/img — a folder containing 800 photos of the training set;

  • train/json – a folder of JSON files with the following fields:

    • left_top – coordinate of the upper left corner of the photo relative to the substrate;

    • right_top – coordinate of the upper right corner;

    • left_bottom – coordinate of the lower left corner;

    • right_bottom – coordinate of the lower right corner;

    • angle – angle of rotation.

  • test/ – a folder containing 400 photos for prediction;

  • original.tiff – the substrate, with a resolution of 10496 x 10496 pixels.
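As a quick illustration of the label format, here is a small sketch that parses one label and computes the photo's center the same way the baseline metric does, as the midpoint of the left_top and right_bottom corners. The helper name label_center is mine, and the label values are copied from the sample .json shown at the end of this article, so treat them as illustrative only.

```python
import json

def label_center(label):
    """Center of the photo on the substrate: midpoint of left_top and right_bottom."""
    (x1, y1), (x2, y2) = label["left_top"], label["right_bottom"]
    return (x1 + x2) / 2, (y1 + y2) / 2

# Illustrative label, not taken from the real training set
raw = ('{"left_top": [7000, 4000], "right_top": [6000, 4000], '
       '"left_bottom": [7000, 3000], "right_bottom": [6000, 3000], "angle": 178}')
label = json.loads(raw)
center = label_center(label)
angle = label["angle"]
```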

What you should pay attention to

The pictures were taken at different times and under different weather conditions. For example, part of the surface may be hidden behind clouds. Note also that there are few photos for training; you can expand the training set by cutting photos out of the substrate yourself.
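Cutting extra training photos from the substrate could look roughly like the sketch below. It uses only NumPy and nearest-neighbor sampling. The function name random_rotated_crop, the patch size, and the direction in which the angle is measured are my assumptions; check them against the real labels before mixing such crops into the training set. A random array stands in for original.tiff.

```python
import numpy as np

def random_rotated_crop(substrate, size, rng):
    """Cut a size x size patch rotated by a random angle around a random center.
    Returns (patch, (cx, cy), angle_in_degrees)."""
    h, w = substrate.shape[:2]
    # Keep the center far enough from the border that any rotation fits
    margin = int(np.ceil(size * np.sqrt(2) / 2)) + 1
    cx = int(rng.integers(margin, w - margin))
    cy = int(rng.integers(margin, h - margin))
    angle = float(rng.uniform(0, 360))

    # Integer pixel offsets of the patch grid relative to its center
    offs = np.arange(size) - size // 2
    dx, dy = np.meshgrid(offs, offs)
    t = np.deg2rad(angle)
    # Rotate the grid, then look up the nearest substrate pixel for each cell
    src_x = np.rint(cx + dx * np.cos(t) - dy * np.sin(t)).astype(int)
    src_y = np.rint(cy + dx * np.sin(t) + dy * np.cos(t)).astype(int)
    return substrate[src_y, src_x], (cx, cy), angle

rng = np.random.default_rng(0)
# Stand-in for the real substrate loaded from original.tiff
substrate = rng.integers(0, 256, size=(512, 512, 3), dtype=np.uint8)
patch, center, angle = random_rotated_crop(substrate, 64, rng)
```

Crops like this share the substrate's season and lighting, while the real photos do not, so they are a supplement to the 800 training photos rather than a replacement.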


Metrics

For such a specific task, a custom metric was developed that measures the difference between the predicted and true values of the photo's center and rotation angle.

Metric calculation formula (later corrected so that a difference of 10 degrees or of 350 degrees counts as a 10 degree error, not 350)
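The angle correction mentioned in the caption can be sketched as follows. This is my reading of the correction, not the organizers' code, and the function name angle_error is hypothetical.

```python
def angle_error(pred_deg, true_deg):
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    diff = abs(pred_deg - true_deg) % 360
    return min(diff, 360 - diff)

err_a = angle_error(350, 0)   # 10
err_b = angle_error(10, 0)    # 10
err_c = angle_error(190, 10)  # 180
```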

Solution details

Solution methodology

In the baseline, the problem is treated as regression. The target values are 4 coordinates and the rotation angle of the image relative to the original.
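For regression, it helps to keep all five targets on a similar scale. The baseline code further down multiplies the model outputs by 10496 and 360 to get back to pixels and degrees, so the targets are the same values divided by those constants. A minimal sketch of that scaling (the names encode_target and decode_target are mine):

```python
SUBSTRATE_SIZE = 10496  # substrate width and height in pixels
FULL_TURN = 360         # degrees

def encode_target(left_top, right_bottom, angle):
    """Scale pixel coordinates and the angle to roughly [0, 1]."""
    return [left_top[0] / SUBSTRATE_SIZE, left_top[1] / SUBSTRATE_SIZE,
            right_bottom[0] / SUBSTRATE_SIZE, right_bottom[1] / SUBSTRATE_SIZE,
            angle / FULL_TURN]

def decode_target(target):
    """Invert encode_target: back to pixels and degrees."""
    coords = [v * SUBSTRATE_SIZE for v in target[:4]]
    return coords + [target[4] * FULL_TURN]

t = encode_target((5248, 2624), (7872, 5248), 90)
restored = decode_target(t)
```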

The solution scheme will be as follows:

  1. Installing and importing all libraries

  2. Data preprocessing

  3. Creating loaders (DataLoader) to feed data into the model

  4. Helper Functions for Model Training

  5. Building and training the model

  6. Testing the resulting solution

What libraries do we need

Let’s start by importing all the required libraries. We use torch as the framework for training the neural network.

# General-purpose libraries
import pandas as pd
import numpy as np
import glob
from tqdm import tqdm
import os
from sklearn.model_selection import train_test_split
import json
from math import sin, cos

# Building and training the model
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset
from torchvision import datasets, models, transforms

# Working with images
import cv2
from PIL import Image

# Visualization
import matplotlib.pyplot as plt
from IPython.display import clear_output

Converting the initial dataset

At this stage, the data stored in json files is converted into a pandas dataframe.

json_dir = "/content/json/"

# Collect one row per JSON file, then build the dataframe in a single call
# (DataFrame.append was removed in pandas 2.0)
rows = []
for _, _, files in os.walk(json_dir):
    for x in files:
        if x.endswith(".json"):
            with open(json_dir + x) as f:
                data = json.load(f)
            rows.append({'id': x.split(".")[0] + ".img",
                         'left_top_x': data["left_top"][0],
                         'left_top_y': data["left_top"][1],
                         'right_bottom_x': data["right_bottom"][0],
                         'right_bottom_y': data["right_bottom"][1],
                         'angle': data["angle"]})

data_df = pd.DataFrame(rows, columns=['id', 'left_top_x', 'left_top_y', 'right_bottom_x', 'right_bottom_y', 'angle'])

Transformed dataset

Data loader

To train a neural network efficiently, data is fed to it in batches. A batch holds several data instances. For example, in the baseline the batch_size parameter is 16, so each batch fed into the model contains 16 data instances. Below is one way to implement such a data loader.

First, let’s write a class in which the data is directly loaded and converted into the desired format.

class ImageDataset(Dataset):
    def __init__(self, data_df, transform=None):

        self.data_df = data_df
        self.transform = transform

    def __getitem__(self, idx):
        # get the image name and its labels, scaled to [0, 1]:
        # coordinates by the substrate size, the angle by 360
        row = self.data_df.iloc[idx]
        image_name = row['id']
        labels = [row['left_top_x'] / 10496,
                  row['left_top_y'] / 10496,
                  row['right_bottom_x'] / 10496,
                  row['right_bottom_y'] / 10496,
                  row['angle'] / 360]

        # read the image
        image = cv2.imread(f"/content/train/{image_name}")
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        image = Image.fromarray(image)
        # transform it, if necessary
        if self.transform:
            image = self.transform(image)
        return image, torch.tensor(labels).float()

    def __len__(self):
        return len(self.data_df)

Next, we set the augmentations that will be used when training the model.

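train_transform = transforms.Compose([
    # ToTensor must come before Normalize, which operates on tensors
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                          std=[0.229, 0.224, 0.225]),
])

valid_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                          std=[0.229, 0.224, 0.225]),
])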

Let’s look at the amount of data and divide it into training and validation parts.

from os import listdir

print("Training set size:", len(listdir("/content/train")))
print("Test set size:", len(listdir("/content/test")))

Training set size: 800
Test set size: 400

# split the dataset into train and validation parts to monitor quality
train_df, valid_df = train_test_split(data_df, test_size=0.2, random_state=43)

Let’s pass each of the samples to the class created earlier, then wrap the result in DataLoader, a class that already exists in the torch library.

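train_dataset = ImageDataset(train_df, train_transform)
valid_dataset = ImageDataset(valid_df, valid_transform)

# batch_size=16 as stated above; shuffle=True for the training loader is an assumption
train_loader = torch.utils.data.DataLoader(train_dataset,
                                           batch_size=16,
                                           shuffle=True)

valid_loader = torch.utils.data.DataLoader(valid_dataset,
                                           batch_size=16,
                                           # shuffle=True,
                                           )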
Helper functions

To train the model, we need functions for computing the metric, plotting the training curves, and running the training itself.

Below is the metric calculation function. An important note: the leaderboard uses the corrected metric; what exactly was corrected is described in the “Metrics” section.

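def compute_metric(data_true, data_pred, outImageW = 10496, outImageH = 10496):
    # both inputs are [left_top_x, left_top_y, right_bottom_x, right_bottom_y, angle]
    # in pixels and degrees

    # center of the true box
    x_center_true = np.array((data_true[0] + data_true[2])/2).astype(int)
    y_center_true = np.array((data_true[1] + data_true[3])/2).astype(int)

    # offset between the true and the predicted centers
    x_metr = x_center_true - np.array((data_pred[0] + data_pred[2])/2).astype(int)
    y_metr = y_center_true - np.array((data_pred[1] + data_pred[3])/2).astype(int)

    # note: as written, the angle term is added to the score rather than subtracted,
    # and the angle difference does not wrap around at 360 degrees
    metr =  1 - 0.7 * (abs(x_metr)/outImageH + abs(y_metr)/outImageW)/2 + 0.3 *abs(data_pred[4] - data_true[4])/359
    return metr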

A function for plotting the training curves.

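def plot_history(train_history, val_history, title="loss"):
    plt.title(title)
    plt.plot(train_history, label="train", zorder=1)
    # one validation point per epoch, placed at the last train step of that epoch
    steps = list(range(0, len(train_history) + 1, int(len(train_history) / len(val_history))))[1:]
    plt.scatter(steps, val_history, marker="+", s=180, c="orange", label="val", zorder=2)
    plt.xlabel('train steps')
    # legend and show are assumed here so that each call renders a complete figure
    plt.legend(loc='best')
    plt.show()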

The training loop is written by hand, without ready-made functions (such as TrainEpoch). This gives clearer control over the training process and makes it easier to customize.

def train(res_model, criterion, optimizer, train_dataloader, test_dataloader, NUM_EPOCH=15):
    train_loss_log = []
    val_loss_log = []
    train_acc_log = []
    val_acc_log = []
    for epoch in tqdm(range(NUM_EPOCH)):
        train_loss = 0.
        train_size = 0
        train_pred = []

        for imgs, labels in train_dataloader:

            imgs = imgs.cuda()
            labels = labels.cuda()

            # forward pass, loss, backward pass and weight update
            optimizer.zero_grad()
            y_pred = res_model(imgs)
            loss = criterion(y_pred, labels)
            loss.backward()
            optimizer.step()

            train_loss += loss.item()
            train_size += y_pred.size(0)
            train_loss_log.append((loss.item() / y_pred.size(0)) * 100)

            # rescale to pixels and degrees to compute the metric
            y_pred = y_pred.detach()
            y_pred[:, :4] = y_pred[:, :4] * 10496
            y_pred[:, -1] = y_pred[:, -1] * 360

            labels[:, :4] = labels[:, :4] * 10496
            labels[:, -1] = labels[:, -1] * 360

            for label, pr in zip(labels, y_pred):
                train_pred.append(compute_metric(label.cpu().detach().numpy(), pr.cpu().detach().numpy()))



        val_loss = 0.
        val_size = 0
        val_pred = []
        with torch.no_grad():
            for imgs, labels in test_dataloader:
                imgs = imgs.cuda()
                labels = labels.cuda()
                pred = res_model(imgs)
                loss = criterion(pred, labels)

                pred[:, :4] = pred[:, :4] * 10496
                pred[:, -1] = pred[:, -1] * 360

                labels[:, :4] = labels[:, :4] * 10496
                labels[:, -1] = labels[:, -1] * 360
                val_loss += loss.item()
                val_size += pred.size(0)

                for label, pr in zip(labels, pred):
                    val_pred.append(compute_metric(label.cpu().detach().numpy(), pr.cpu().detach().numpy()))

        val_loss_log.append((val_loss / val_size)*100)

        plot_history(train_loss_log, val_loss_log, 'loss')

        print('Train loss:', (train_loss / train_size)*100)
        print('Val loss:', (val_loss / val_size)*100)
        print('Train metric:', (np.mean(train_pred)))
        print('Val metric:', (np.mean(val_pred)))
    return train_loss_log, train_acc_log, val_loss_log, val_acc_log

Model Training

As the model we use resnet50 pretrained on the ImageNet dataset, with an output layer of size 5, since we want to predict 5 parameters. The loss function is MSELoss, which is commonly used in regression problems.


# Load the model

model = models.resnet50(pretrained=True)
model.fc = nn.Linear(2048, 5)

model = model.cuda()
criterion = torch.nn.MSELoss()

optimizer = torch.optim.Adam(model.fc.parameters(), lr=0.001)

Let’s start training and observe the changes in losses.

# the number of epochs is left at the function's default
train_loss_log, train_acc_log, val_loss_log, val_acc_log = train(model, criterion, optimizer, train_loader, valid_loader)

Model training logs

Model validation

Let’s calculate the metric on the validation set. We get 0.98. Keep in mind, however, that we trained on a small dataset, so the final metric on the leaderboard may differ.

total_metric = []

for imgs, labels in valid_loader:
    imgs = imgs.cuda()
    labels = labels.cpu().detach().numpy()            
    pred = model(imgs)
    pred = pred.cpu().detach().numpy()    

    pred[:, :4] = pred[:, :4] * 10496
    pred[:, -1] = pred[:, -1] * 360

    labels[:, :4] = labels[:, :4] * 10496
    labels[:, -1] = labels[:, -1] * 360
    for label, pr in zip(labels, pred):
        total_metric.append(compute_metric(label, pr))
total_metric = np.mean(total_metric)
print('Valid metric:', total_metric)

Valid metric: 0.9801261833663446

Let’s create predictions on the test dataset

First you need to write a class for loading test data, similar to what was written for training.

class TestImageDataset(Dataset):
    def __init__(self, files, transform=None):

        self.files = files
        self.transform = transform

    def __getitem__(self, idx):

        image_name = self.files[idx]

        # read the image
        image = cv2.imread(f"/content/test/{image_name}")
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        image = Image.fromarray(image)
        # transform it, if necessary
        if self.transform:
            image = self.transform(image)
        return image
    def __len__(self):
        return len(self.files)

Next, we collect the names of all test files and declare a dataloader with a batch size of 16, since there are 400 test images, and 400%16==0.


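test_images_dir = "/content/test/"

# os.walk yields (dirpath, dirnames, filenames); keep the file names of the top-level folder
for _, _, test_files in os.walk(test_images_dir):
    break

test_dataset = TestImageDataset(test_files, valid_transform)

# no shuffling, so that predictions stay aligned with test_files
test_loader = torch.utils.data.DataLoader(test_dataset,
                                          batch_size=16,
                                          # shuffle=True,
                                          )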

We collect predictions in a list.

indexes = [x.split('.')[0] for x in test_files]
preds = []

for imgs in test_loader:
    imgs = imgs.cuda()
    pred = model(imgs)
    pred = pred.cpu().detach().numpy()
    # rescale to pixels and degrees, clipping to the valid ranges
    pred[:, :4] = np.clip(pred[:, :4] * 10496, 0, 10496)
    pred[:, -1] = np.clip(pred[:, -1] * 360, 0, 360)
    # keep one prediction row per test image
    preds.extend(pred)

We write down the received predictions in the format necessary for the submission. After that, you can compress all .json files into an archive and upload it to the platform.

sub_dir = "/content/submission/"
if not os.path.exists(sub_dir):
    os.makedirs(sub_dir)

for indx, pred in zip(indexes, preds):

    pred = [int(x) for x in pred]

    # build the four corners from the predicted left_top and right_bottom
    left_top = [pred[0], pred[1]]
    right_top = [pred[2], pred[1]]
    left_bottom = [pred[0], pred[3]]
    right_bottom = [pred[2], pred[3]]
    res = {
        'left_top': left_top,
        'right_top': right_top,
        'left_bottom': left_bottom,
        'right_bottom': right_bottom,
        'angle': pred[4]
    }

    with open(sub_dir+indx+'.json', 'w') as f:
        json.dump(res, f)
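Once the files are written, the archiving step mentioned above could be done straight from Python. This is a sketch using the standard zipfile module; the function name zip_submission is mine, and the flat layout (all .json files at the top level of the archive) is an assumption about what the platform expects. The demo below runs on a temporary folder with two dummy files.

```python
import os
import tempfile
import zipfile

def zip_submission(sub_dir, archive_path):
    """Pack every .json file from sub_dir into a flat zip archive."""
    names = sorted(f for f in os.listdir(sub_dir) if f.endswith(".json"))
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in names:
            # arcname keeps the files at the top level of the archive
            zf.write(os.path.join(sub_dir, name), arcname=name)
    return names

# Demo on a temporary folder with two dummy prediction files
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = os.path.join(tmp, "submission")
    os.makedirs(demo_dir)
    for i in (1, 2):
        with open(os.path.join(demo_dir, f"{i}.json"), "w") as f:
            f.write("{}")
    archive = os.path.join(tmp, "submission.zip")
    packed = zip_submission(demo_dir, archive)
    with zipfile.ZipFile(archive) as zf:
        listed = sorted(zf.namelist())
```

For the baseline itself the call would be something like zip_submission("/content/submission/", "/content/submission.zip").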

An example of what each .json file contains.

{
  "left_top": [7000, 4000],
  "right_top": [6000, 4000],
  "left_bottom": [7000, 3000],
  "right_bottom": [6000, 3000],
  "angle": 178
}

Recommendations for improving the solution

  • The first option to improve the solution is to increase the number of epochs and retrain the model.

  • You can also try to change the architecture of the model to a more complex one.

  • Try to create an ensemble of models and apply the TTA (Test Time Augmentation) method.

  • Change the size of the input images and explore the possibility of improving the augmentations used.

  • Expand the dataset from the provided substrate or from third-party resources.

  • Consider approaches to the problem other than regression.
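For the ensemble and TTA tip, one detail is easy to get wrong: angles cannot be averaged like ordinary numbers, because the plain mean of 350 and 10 degrees is 180 instead of 0. Below is a sketch that averages several predictions for one image, using the arithmetic mean for coordinates and the circular mean for the angle. The function name average_predictions and the sample numbers are hypothetical. With TTA, each prediction first has to be mapped back to the frame of the original photo, which is not shown here.

```python
import math

def average_predictions(preds):
    """Average several [x1, y1, x2, y2, angle_deg] predictions for one image.
    Coordinates use the arithmetic mean, the angle uses the circular mean."""
    n = len(preds)
    coords = [sum(p[i] for p in preds) / n for i in range(4)]
    s = sum(math.sin(math.radians(p[4])) for p in preds)
    c = sum(math.cos(math.radians(p[4])) for p in preds)
    angle = math.degrees(math.atan2(s, c)) % 360
    return coords + [angle]

# Two hypothetical predictions whose angles sit on both sides of 0 degrees
avg = average_predictions([[100, 200, 300, 400, 350],
                           [110, 210, 310, 410, 10]])
```

Here the averaged angle comes out at 0 degrees modulo 360 (up to floating-point rounding), not at 180.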


The case is very interesting due to the non-standard formulation of the problem. There is room for experimentation with approaches. And if you manage to find an effective approach, then you can also get a cash prize of up to 250 thousand rubles!

You can ask any questions you have in the channel Hare by HaKhatons.

Good luck to everyone at the championships and hackathons!
