Analysis of the baseline solution for the Digital Breakthrough task of linking aerial images to the terrain
This article is special because it contains a fixed version of a baseline that did not work in its original form. The solution below now scores 9th place on the leaderboard!
Spoiler: at the end of the article there are tips to improve the basic solution.
Digital Breakthrough
I think everyone already knows what Digital Breakthrough is. Still, let me remind you that this year the main topic is artificial intelligence, and the season is in full swing!
Although some of the events have already taken place, 19 regional championships, 5 district hackathons, and 3 all-Russian championships are still ahead. I advise you to join me and the other participants so as not to miss the chance to win cash prizes and great trips, as well as to gain experience on a wide variety of tasks.
Introduction
In the modern world, a huge number of tasks are solved with the help of satellite photographs and aerial photographs. Often, the speed and quality of interpretation of this data determines how quickly fires, floods and other emergencies are detected. Machine vision technologies are just beginning to be used in solving such problems, but the need for their use is constantly growing.
Solving this problem will make it possible to quickly link images to geographic coordinates, which in the future can speed up geodetic work, help search for missing people, and monitor deforestation. And this is just a short list of where linking aerial photographs to the terrain is needed.
Participants of the championship are asked to find the location and orientation of an image on an extremely large (in height and width) image georeferenced to the area.
The task
The purpose of the task is to find the location and orientation of the image on the substrate.
To better understand the context of the task, participants should familiarize themselves with the following terms:
Substrate – an extremely large image (in height and width) with georeferencing to the area, i.e. the coordinates of each pixel are known or can be calculated. As a rule, the image covers a large area of land (square kilometers or more).
Aerial view – an image from a satellite or an unmanned aerial vehicle, with the camera pointing vertically down. It has a significantly lower resolution than the substrate; essentially, it is a photograph taken with a regular camera. The main feature is that the aerial photograph was taken at a different time than the substrate: in a different season, in a completely different year, or at a different altitude.
Overlap – a mutual position of images in which the same area of the terrain is visible on two or more aerial photographs. Mutual orientation of multi-temporal images of different resolutions implies comparing the images and obtaining their georeferencing through manual comparison with the map by an operator.
Data
The data are aerial photographs of a fixed size:
train/img — a folder containing 800 photos of the training set;
train/json – a folder with json files containing the following values:
left top — coordinate of the upper left corner of the photo relative to the background;
right top – coordinate of the upper right corner;
left bottom – coordinate of the lower left corner;
right bottom — coordinate of the lower right corner;
angle – angle of rotation.
test/ – a folder containing 400 photos for prediction;
original.tiff – a substrate with a resolution of 10496 × 10496 pixels (a quick sanity check of it follows below).
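As a quick sanity check, the substrate can be opened and its dimensions verified; a minimal sketch, assuming the same /content/ layout used in the rest of the code in this article:

import cv2

# load the substrate and verify its size
substrate = cv2.imread("/content/original.tiff")
print(substrate.shape)  # expected: (10496, 10496, 3)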
What you should pay attention to
The pictures were taken at different times and under different weather conditions; for example, part of the surface may be hidden behind clouds. It is also worth noting that there are not enough photos for training: you can expand the training set by cutting photos from the substrate yourself, as shown in the sketch below.
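To illustrate the last point, here is a minimal sketch of cutting extra training samples from the substrate. It is a deliberate simplification: the crops are axis-aligned (angle 0) with a hypothetical fixed SAMPLE_SIZE, and they are saved as .png, while the real aerial photos are also rotated, so a full version would rotate each crop and recompute its corners.

import json
import random
import cv2

# a minimal sketch: cut axis-aligned crops from the substrate and save them
# in the same format as the provided training data; SAMPLE_SIZE, the output
# paths and the .png extension are assumptions for illustration
SAMPLE_SIZE = 1000
substrate = cv2.imread("/content/original.tiff")
h, w = substrate.shape[:2]

for i in range(100):
    x = random.randint(0, w - SAMPLE_SIZE)
    y = random.randint(0, h - SAMPLE_SIZE)
    crop = substrate[y:y + SAMPLE_SIZE, x:x + SAMPLE_SIZE]
    # note: the dataset conversion step later builds ids as "<stem>.img",
    # so its extension handling would need adjusting for mixed-in .png files
    cv2.imwrite(f"/content/train/extra_{i}.png", crop)
    labels = {
        "left_top": [x, y],
        "right_top": [x + SAMPLE_SIZE, y],
        "left_bottom": [x, y + SAMPLE_SIZE],
        "right_bottom": [x + SAMPLE_SIZE, y + SAMPLE_SIZE],
        "angle": 0,  # no rotation is emulated in this simplified version
    }
    with open(f"/content/json/extra_{i}.json", "w") as f:
        json.dump(labels, f)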
Metrics
For this specific task, a custom metric has been developed that measures the difference between the predicted and true values of the photo's center and rotation angle.
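The exact formula is given on the platform; reading it off the compute_metric function shown later in this article, it can be written as

metric = 1 − 0.7 · (|Δx|/W + |Δy|/H)/2 − 0.3 · |Δα|/359,

where Δx and Δy are the errors of the predicted photo center in pixels, Δα is the rotation angle error in degrees, and W = H = 10496 is the substrate size. Both error terms lower the score, so a perfect prediction scores exactly 1.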

Solution details
Solution methodology
In the basic solution, it is proposed to solve the problem as a regression. Target values in this case are 4 coordinates and the angle of rotation of the image relative to the original.
The solution scheme will be as follows:
Installing and importing all libraries
Data preprocessing
Creating loaders (DataLoader) to feed data into the model
Helper Functions for Model Training
Building and training the model
Testing the resulting solution
What libraries do we need
Let’s start by importing all the required libraries. We use torch as the framework for training the neural network.
# General libraries
import pandas as pd
import numpy as np
import glob
from tqdm import tqdm
import os
from sklearn.model_selection import train_test_split
import json
from math import sin, cos

# For building and training the model
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset
from torchvision import datasets, models, transforms

# For working with images
import cv2
from PIL import Image

# For visualization
import matplotlib.pyplot as plt
from IPython.display import clear_output
Converting the initial dataset
At this stage, the data stored in json files is converted into a pandas dataframe.
json_dir = "/content/json/"
rows = []
for _, _, files in os.walk(json_dir):
    for x in files:
        if x.endswith(".json"):
            with open(json_dir + x) as f:
                data = json.load(f)
            rows.append({'id': x.split(".")[0] + ".img",
                         'left_top_x': data["left_top"][0],
                         'left_top_y': data["left_top"][1],
                         'right_bottom_x': data["right_bottom"][0],
                         'right_bottom_y': data["right_bottom"][1],
                         'angle': data["angle"]})
# DataFrame.append was removed in pandas 2.0, so the frame is built from a list of dicts
data_df = pd.DataFrame(rows)
data_df.head(5)

Data loader
For effective training of a neural network, data must be fed in batches. A batch holds several data instances: for example, in the baseline solution the batch_size parameter is 16, so each batch fed into the model contains 16 instances. Below is one possible implementation of such a data loader.
First, let’s write a class in which the data is loaded and converted into the desired format.
class ImageDataset(Dataset):
    def __init__(self, data_df, transform=None):
        self.data_df = data_df
        self.transform = transform

    def __getitem__(self, idx):
        # fetch the image name and its labels; the corner coordinates are
        # normalized by the substrate size (10496) and the angle by 360,
        # so all five targets lie in [0, 1] and match the denormalization
        # (* 10496 and * 360) used during training
        image_name = self.data_df.iloc[idx]['id']
        labels = [self.data_df.iloc[idx]['left_top_x'] / 10496,
                  self.data_df.iloc[idx]['left_top_y'] / 10496,
                  self.data_df.iloc[idx]['right_bottom_x'] / 10496,
                  self.data_df.iloc[idx]['right_bottom_y'] / 10496,
                  self.data_df.iloc[idx]['angle'] / 360]
        # read the image
        image = cv2.imread(f"/content/train/{image_name}")
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        image = Image.fromarray(image)
        # transform it, if necessary
        if self.transform:
            image = self.transform(image)
        return image, torch.tensor(labels).float()

    def __len__(self):
        return len(self.data_df)
Next, we set the augmentations that will be used when training the model.
train_transform = transforms.Compose([
    # note: a random crop is a questionable choice for coordinate regression,
    # since the target corners describe the full photo; kept from the baseline
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
# validation and inference should be deterministic, so a plain resize
# replaces the random crop here
valid_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
Let’s look at the amount of data and divide it into training and validation parts.
from os import listdir
print("Training set size:", len(listdir("/content/train")))
print("Test set size:", len(listdir("/content/test")))
Training set size: 800
Test set size: 400
# split the dataset into train and validation to track the quality
train_df, valid_df = train_test_split(data_df, test_size=0.2, random_state=43)
Let’s pass each of the splits to the previously created dataset class, and then wrap it in DataLoader, a class that already exists in torch.
train_dataset = ImageDataset(train_df, train_transform)
valid_dataset = ImageDataset(valid_df, valid_transform)

train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=16,
                                           shuffle=True,
                                           pin_memory=True,
                                           num_workers=2)
valid_loader = torch.utils.data.DataLoader(dataset=valid_dataset,
                                           batch_size=16,
                                           shuffle=False,
                                           pin_memory=True,
                                           num_workers=2)
Helper functions
To train the model, we need functions for computing the metric, plotting the training curves, and the training loop itself.
Below is the metric calculation function. An important note: the leaderboard is scored with the corrected metric; what exactly was corrected is described in the “Metrics” section.
def compute_metric(data_true, data_pred, outImageW=10496, outImageH=10496):
    x_center_true = np.array((data_true[0] + data_true[2]) / 2).astype(int)
    y_center_true = np.array((data_true[1] + data_true[3]) / 2).astype(int)
    x_metr = x_center_true - np.array((data_pred[0] + data_pred[2]) / 2).astype(int)
    y_metr = y_center_true - np.array((data_pred[1] + data_pred[3]) / 2).astype(int)
    # both error terms are subtracted: any center or angle error lowers
    # the score, and a perfect prediction scores exactly 1
    metr = 1 - 0.7 * (abs(x_metr) / outImageH + abs(y_metr) / outImageW) / 2 - 0.3 * abs(data_pred[4] - data_true[4]) / 359
    return metr
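A quick sanity check of the function: a prediction identical to the ground truth should score exactly 1.

# sanity check: a perfect prediction scores exactly 1.0
sample = [7000, 4000, 6000, 3000, 178]
print(compute_metric(sample, sample))  # -> 1.0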
The function for visualizing the training curves.
def plot_history(train_history, val_history, title="loss"):
    plt.figure()
    plt.title('{}'.format(title))
    plt.plot(train_history, label="train", zorder=1)
    # place one validation point at the end of each epoch's worth of train steps
    steps = list(range(0, len(train_history) + 1, int(len(train_history) / len(val_history))))[1:]
    plt.scatter(steps, val_history, marker="+", s=180, c="orange", label="val", zorder=2)
    plt.xlabel('train steps')
    plt.legend(loc="best")
    plt.grid()
    plt.show()
The model training process is written by hand, without ready-made wrappers (such as TrainEpoch). This gives clearer control over the training process and the ability to customize it.
def train(model, criterion, optimizer, train_dataloader, test_dataloader, NUM_EPOCH=15):
    train_loss_log = []
    val_loss_log = []
    train_acc_log = []
    val_acc_log = []
    for epoch in tqdm(range(NUM_EPOCH)):
        # training phase
        model.train()
        train_loss = 0.
        train_size = 0
        train_pred = []
        for imgs, labels in train_dataloader:
            optimizer.zero_grad()
            imgs = imgs.cuda()
            labels = labels.cuda()
            y_pred = model(imgs)
            loss = criterion(y_pred, labels)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
            train_size += y_pred.size(0)
            train_loss_log.append(loss.item() / y_pred.size(0) * 100)
            # denormalize detached copies to compute the competition metric
            y_pred = y_pred.detach().clone()
            labels = labels.detach().clone()
            y_pred[:, :4] = y_pred[:, :4] * 10496
            y_pred[:, -1] = y_pred[:, -1] * 360
            labels[:, :4] = labels[:, :4] * 10496
            labels[:, -1] = labels[:, -1] * 360
            for label, pr in zip(labels, y_pred):
                train_pred.append(compute_metric(label.cpu().numpy(), pr.cpu().numpy()))
        train_acc_log.append(train_pred)
        # validation phase
        val_loss = 0.
        val_size = 0
        val_pred = []
        model.eval()
        with torch.no_grad():
            for imgs, labels in test_dataloader:
                imgs = imgs.cuda()
                labels = labels.cuda()
                pred = model(imgs)
                loss = criterion(pred, labels)
                val_loss += loss.item()
                val_size += pred.size(0)
                pred[:, :4] = pred[:, :4] * 10496
                pred[:, -1] = pred[:, -1] * 360
                labels[:, :4] = labels[:, :4] * 10496
                labels[:, -1] = labels[:, -1] * 360
                for label, pr in zip(labels, pred):
                    val_pred.append(compute_metric(label.cpu().numpy(), pr.cpu().numpy()))
        val_loss_log.append((val_loss / val_size) * 100)
        val_acc_log.append(val_pred)
        clear_output()
        plot_history(train_loss_log, val_loss_log, 'loss')
        print('Train loss:', (train_loss / train_size) * 100)
        print('Val loss:', (val_loss / val_size) * 100)
        print('Train metric:', np.mean(train_pred))
        print('Val metric:', np.mean(val_pred))
    return train_loss_log, train_acc_log, val_loss_log, val_acc_log
Model Training
We use resnet50 pretrained on the imagenet dataset, with an output layer of size 5, since we want to predict 5 parameters. The loss function is MSELoss, which is standard for regression problems.
torch.cuda.empty_cache()
# load the pretrained model and replace its head with a 5-output regression layer
model = models.resnet50(pretrained=True)
model.fc = nn.Linear(2048, 5)
model = model.cuda()
criterion = torch.nn.MSELoss()
# note that only the parameters of the new head (model.fc) are optimized
optimizer = torch.optim.Adam(model.fc.parameters(), lr=0.001)
Let’s start training and observe the changes in losses.
train_loss_log, train_acc_log, val_loss_log, val_acc_log = train(model,
                                                                 criterion,
                                                                 optimizer,
                                                                 train_loader,
                                                                 valid_loader,
                                                                 15)

Model validation
Let’s calculate the metric on the validation dataset. We get a metric of about 0.98. However, do not forget that we trained on a small dataset, and the final metric on the leaderboard may differ.
model.eval()
total_metric = []
with torch.no_grad():
    for imgs, labels in valid_loader:
        imgs = imgs.cuda()
        labels = labels.cpu().detach().numpy()
        pred = model(imgs)
        pred = pred.cpu().detach().numpy()
        pred[:, :4] = pred[:, :4] * 10496
        pred[:, -1] = pred[:, -1] * 360
        labels[:, :4] = labels[:, :4] * 10496
        labels[:, -1] = labels[:, -1] * 360
        for label, pr in zip(labels, pred):
            total_metric.append(compute_metric(label, pr))
total_metric = np.mean(total_metric)
print('Valid metric:', total_metric)
Valid metric: 0.9801261833663446
Let’s create predictions on the test dataset
First you need to write a class for loading test data, similar to the one written for training.
class TestImageDataset(Dataset):
    def __init__(self, files, transform=None):
        self.files = files
        self.transform = transform

    def __getitem__(self, idx):
        image_name = self.files[idx]
        # read the image
        image = cv2.imread(f"/content/test/{image_name}")
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        image = Image.fromarray(image)
        # transform it, if necessary
        if self.transform:
            image = self.transform(image)
        return image

    def __len__(self):
        return len(self.files)
Next, we collect the names of all test files and declare a DataLoader with a batch size of 16: there are 400 test images, and 400 % 16 == 0.
test_images_dir = "/content/test/"
for _, _, test_files in os.walk(test_images_dir):
    break

test_dataset = TestImageDataset(test_files, valid_transform)
# shuffle must stay off so predictions line up with the test_files order
test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=16,
                                          shuffle=False,
                                          pin_memory=True,
                                          num_workers=2)
We collect predictions in a list.
indexes = [x.split('.')[0] for x in test_files]
preds = []
model.eval()
with torch.no_grad():
    for imgs in test_loader:
        imgs = imgs.cuda()
        pred = model(imgs)
        pred = pred.cpu().detach().numpy()
        # denormalize and clip the predictions to the substrate bounds
        pred[:, :4] = np.clip(pred[:, :4] * 10496, 0, 10496)
        pred[:, -1] = np.clip(pred[:, -1] * 360, 0, 360)
        preds.extend(list(pred))
We write the received predictions in the format required for submission. After that, you can compress all .json files into an archive and upload it to the platform.
sub_dir = "/content/submission/"
if not os.path.exists(sub_dir):
    os.makedirs(sub_dir)

for indx, pred in zip(indexes, preds):
    pred = [int(x) for x in pred]
    # reconstruct all four corners from the predicted pair of opposite corners
    left_top = [pred[0], pred[1]]
    right_top = [pred[2], pred[1]]
    left_bottom = [pred[0], pred[3]]
    right_bottom = [pred[2], pred[3]]
    res = {
        'left_top': left_top,
        'right_top': right_top,
        'left_bottom': left_bottom,
        'right_bottom': right_bottom,
        'angle': pred[4]
    }
    with open(sub_dir + indx + '.json', 'w') as f:
        json.dump(res, f)
An example of what each .json file contains.
{
    "left_top": [7000, 4000],
    "right_top": [6000, 4000],
    "left_bottom": [7000, 3000],
    "right_bottom": [6000, 3000],
    "angle": 178
}
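The archive itself can be created, for example, with shutil (assuming the platform accepts a plain zip of the .json files):

import shutil

# pack all generated .json files into /content/submission.zip
shutil.make_archive("/content/submission", "zip", sub_dir)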
Recommendations for improving the solution
The first option to improve the solution is to increase the number of epochs and retrain the model.
You can also try to change the architecture of the model to a more complex one.
Try to create an ensemble of models and apply TTA (Test Time Augmentation); see the sketch after this list.
Change the size of the input images and explore the possibility of improving the augmentations used.
Expand the dataset from the provided substrate or from third-party resources.
Think about other approaches to solving the problem besides regression.
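As an illustration of the TTA item above: for this regression target, the safest variant is to average predictions over several stochastic passes with photometric-only augmentations, since flips and rotations would also require mapping the predicted coordinates back. A minimal sketch under these assumptions:

# a minimal TTA sketch: average model predictions over several stochastic
# passes with a mild photometric augmentation; geometry-changing
# augmentations are deliberately avoided, because they would require
# transforming the predicted coordinates back
tta_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ColorJitter(brightness=0.1, contrast=0.1),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
tta_dataset = TestImageDataset(test_files, tta_transform)
tta_loader = torch.utils.data.DataLoader(tta_dataset, batch_size=16,
                                         shuffle=False, num_workers=2)

N_PASSES = 5
model.eval()
tta_runs = []
with torch.no_grad():
    for _ in range(N_PASSES):
        run = []
        for imgs in tta_loader:
            run.append(model(imgs.cuda()).cpu().numpy())
        tta_runs.append(np.concatenate(run))
# average over the passes, then denormalize exactly as before
preds = np.mean(tta_runs, axis=0)
preds[:, :4] = np.clip(preds[:, :4] * 10496, 0, 10496)
preds[:, -1] = np.clip(preds[:, -1] * 360, 0, 360)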
Results
The case is very interesting due to the non-standard formulation of the problem, and there is plenty of room for experimenting with approaches. And if you manage to find an effective approach, you can also win a cash prize of up to 250 thousand rubles!
Any questions you have can be asked in the Hare by HaKhatons channel.
Good luck to everyone at the championships and hackathons!