Working with YOLOv8: detection, segmentation and tracking of objects, as well as preparing your own dataset and training a model

If you think that getting started with neural networks is difficult, then this material is for you!

So, YOLO (You Only Look Once) is a neural network designed to work with objects in images. It can solve the following problems:

  • Detection – finding objects in an image and determining their locations

  • Segmentation – dividing an image into areas that relate to each object

  • Classification – determining what is in the image

  • Finding key points of the body – determining a person’s posture

  • Object tracking – stream processing in which a location history is kept and can be used for each object

This article will also cover:

  • Tracking-based motion prediction

  • Creating your own dataset for additional training of a model for detecting new objects

Preparing to work with YOLO

A distinctive feature of YOLO is that you can start using the neural network with minimal Python programming skills.
To install YOLO on your computer, run in the console:

pip install ultralytics

After this, all the necessary modules will be installed and you can proceed to work.
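
To make sure everything works, a minimal three-line check is enough (a sketch that assumes you have some test image, here called test.png, in the working directory):

from ultralytics import YOLO

model = YOLO('yolov8n.pt')   # the weights are downloaded automatically on first use
results = model('test.png')  # run inference on a single image
print(results[0].boxes)      # print the detected bounding boxes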

Detection of objects in photos

Object detection – determining the location of objects and their classes in the image.

For example, let's take a picture and define the objects in it:

Many different objects

To use the YOLO neural network, let's write a script:

from ultralytics import YOLO
import cv2
import numpy as np
import os

# Load the YOLOv8 model
model = YOLO('yolov8n.pt')

# List of colors for the different classes
colors = [
    (255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0), (0, 255, 255),
    (255, 0, 255), (192, 192, 192), (128, 128, 128), (128, 0, 0), (128, 128, 0),
    (0, 128, 0), (128, 0, 128), (0, 128, 128), (0, 0, 128), (72, 61, 139),
    (47, 79, 79), (47, 79, 47), (0, 206, 209), (148, 0, 211), (255, 20, 147)
]

# Function for processing an image
def process_image(image_path):
    # Load the image
    image = cv2.imread(image_path)
    results = model(image)[0]
    
    # Get the original image and the results
    image = results.orig_img
    classes_names = results.names
    classes = results.boxes.cls.cpu().numpy()
    boxes = results.boxes.xyxy.cpu().numpy().astype(np.int32)

    # Prepare a dictionary for grouping the results by class
    grouped_objects = {}

    # Draw the boxes and group the results
    for class_id, box in zip(classes, boxes):
        class_name = classes_names[int(class_id)]
        color = colors[int(class_id) % len(colors)]  # Pick a color for the class
        if class_name not in grouped_objects:
            grouped_objects[class_name] = []
        grouped_objects[class_name].append(box)

        # Draw the box on the image
        x1, y1, x2, y2 = box
        cv2.rectangle(image, (x1, y1), (x2, y2), color, 2)
        cv2.putText(image, class_name, (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)

    # Save the modified image
    new_image_path = os.path.splitext(image_path)[0] + '_yolo' + os.path.splitext(image_path)[1]
    cv2.imwrite(new_image_path, image)

    # Save the data to a text file
    text_file_path = os.path.splitext(image_path)[0] + '_data.txt'
    with open(text_file_path, 'w') as f:
        for class_name, details in grouped_objects.items():
            f.write(f"{class_name}:\n")
            for detail in details:
                f.write(f"Coordinates: ({detail[0]}, {detail[1]}, {detail[2]}, {detail[3]})\n")

    print(f"Processed {image_path}:")
    print(f"Saved bounding-box image to {new_image_path}")
    print(f"Saved data to {text_file_path}")


process_image('test.png')

At the beginning of the code we select the model (it is downloaded automatically the first time you run the script). In addition to the smallest one, yolov8n.pt, several more are available:
yolov8n.pt yolov8s.pt yolov8m.pt yolov8l.pt yolov8x.pt

Each successive model is larger and slower, but it also identifies some objects much better.

Comparative characteristics of the models, as well as the possibility of running them on various devices, are described in the article by Stepan Zhdanov:
https://habr.com/ru/articles/822917/
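
To get a rough feel for the speed difference on your own hardware, a simple (and unscientific) timing loop like this sketch can help; test.png is a placeholder image name, and the first call to each model also includes the weight download and warm-up:

import time
import cv2
from ultralytics import YOLO

image = cv2.imread('test.png')  # placeholder image
for weights in ['yolov8n.pt', 'yolov8s.pt', 'yolov8m.pt', 'yolov8l.pt', 'yolov8x.pt']:
    model = YOLO(weights)
    start = time.perf_counter()
    results = model(image)[0]
    print(f"{weights}: {len(results.boxes)} objects in {time.perf_counter() - start:.2f} s")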

As a result of the neural network's work, we obtain a boxes object that contains the coordinates of the objects found in the image, as well as their classes (person, car, bus, traffic light, etc.).
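
These are the same fields the script above relies on; assuming the model from that script and any image loaded with cv2.imread, they can be inspected directly:

results = model(image)[0]
print(results.names)       # mapping of class indices to class names
print(results.boxes.cls)   # class index of each detected object
print(results.boxes.conf)  # confidence of each detection
print(results.boxes.xyxy)  # corner coordinates (x1, y1, x2, y2) of each box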

So, after executing this script we see the result:

In addition, for further processing, the data about the objects in the image is saved to a text file of the following form:

car:
Coordinates: (842, 681, 1180, 894)
Coordinates: (254, 849, 524, 971)
Coordinates: (49, 620, 425, 857)
stop sign:
Coordinates: (407, 560, 470, 626)
Coordinates: (267, 494, 341, 557)
traffic light:
Coordinates: (334, 157, 451, 426)
Coordinates: (938, 97, 1031, 312)
Coordinates: (86, 481, 130, 602)
person:
Coordinates: (578, 711, 710, 990)
Coordinates: (715, 723, 750, 798)
Coordinates: (715, 852, 864, 976)
Coordinates: (241, 897, 385, 1012)
truck:
Coordinates: (52, 620, 425, 859)

Detection of objects in a video stream

After practicing with a single image, let's move on to video processing. In general, it is not much more complicated, because a video is just a sequence of images! The only distinctive feature of the following code is that specific codecs are used to write video in the MP4 format.

from ultralytics import YOLO
import cv2
import numpy as np

# Load the YOLOv8 model
model = YOLO('yolov8n.pt')

# List of colors for the different classes
colors = [
    (255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0), (0, 255, 255),
    (255, 0, 255), (192, 192, 192), (128, 128, 128), (128, 0, 0), (128, 128, 0),
    (0, 128, 0), (128, 0, 128), (0, 128, 128), (0, 0, 128), (72, 61, 139),
    (47, 79, 79), (47, 79, 47), (0, 206, 209), (148, 0, 211), (255, 20, 147)
]

# Open the source video file
input_video_path = "input.mp4"
capture = cv2.VideoCapture(input_video_path)

# Read the video parameters
fps = int(capture.get(cv2.CAP_PROP_FPS))
width = int(capture.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(capture.get(cv2.CAP_PROP_FRAME_HEIGHT))

# Set up the output file
output_video_path = "detect.mp4"
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
writer = cv2.VideoWriter(output_video_path, fourcc, fps, (width, height))

while True:
    # Capture a frame
    ret, frame = capture.read()
    if not ret:
        break

    # Process the frame with the YOLO model
    results = model(frame)[0]

    # Get the object data
    classes_names = results.names
    classes = results.boxes.cls.cpu().numpy()
    boxes = results.boxes.xyxy.cpu().numpy().astype(np.int32)
    # Draw boxes and labels on the frame
    for class_id, box, conf in zip(classes, boxes, results.boxes.conf):
        if conf>0.5:
            class_name = classes_names[int(class_id)]
            color = colors[int(class_id) % len(colors)]
            x1, y1, x2, y2 = box
            cv2.rectangle(frame, (x1, y1), (x2, y2), color, 2)
            cv2.putText(frame, class_name, (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)

    # Write the processed frame to the output file
    writer.write(frame)

# Release resources and close the windows
capture.release()
writer.release()

To detect objects using a web camera, just replace one line

capture = cv2.VideoCapture(input_video_path)

with the index of the camera to connect to:

capture = cv2.VideoCapture(0)
Full code example:
https://github.com/stepanburmistrov/YoloV8/blob/main/Detection_Camera.py
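
For reference, a minimal sketch of such a webcam loop (it assumes the model and the drawing code from the video example above):

capture = cv2.VideoCapture(0)  # 0 is the index of the default camera

while True:
    ret, frame = capture.read()
    if not ret:
        break

    results = model(frame)[0]
    # ...draw the boxes exactly as in the video example above...

    cv2.imshow('YOLOv8 webcam', frame)
    if cv2.waitKey(1) == 27:  # Esc to exit
        break

capture.release()
cv2.destroyAllWindows()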

Image segmentation

Segmentation is the division of an image into regions belonging to different classes. One of the most popular applications of this process is removing the background around a person in a photo or video.

To begin with, let's take a photo (a frame from the previous video) and apply the YOLOv8 model designed for segmentation. As with detection, there is a choice of model sizes:
yolov8n-seg.pt yolov8s-seg.pt yolov8m-seg.pt yolov8l-seg.pt yolov8x-seg.pt

import cv2
import numpy as np
from ultralytics import YOLO
import os

# Load the YOLOv8 segmentation model
model = YOLO('yolov8x-seg.pt')

colors = [
    (255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0), (0, 255, 255),
    (255, 0, 255), (192, 192, 192), (128, 128, 128), (128, 0, 0), (128, 128, 0),
    (0, 128, 0), (128, 0, 128), (0, 128, 128), (0, 0, 128), (72, 61, 139),
    (47, 79, 79), (47, 79, 47), (0, 206, 209), (148, 0, 211), (255, 20, 147)
]

def process_image(image_path):
    # Check that the folder for saving results exists
    if not os.path.exists('results'):
        os.makedirs('results')
    
    # Load the image
    image = cv2.imread(image_path)
    image_orig = image.copy()
    h_or, w_or = image.shape[:2]
    image = cv2.resize(image, (640, 640))
    results = model(image)[0]
    
    classes_names = results.names
    classes = results.boxes.cls.cpu().numpy()
    masks = results.masks.data.cpu().numpy()

    # Overlay the masks on the image
    for i, mask in enumerate(masks):
        color = colors[int(classes[i]) % len(colors)]
        
        # Resize the mask before creating the colored mask
        mask_resized = cv2.resize(mask, (w_or, h_or))
        
        # Create the colored mask
        color_mask = np.zeros((h_or, w_or, 3), dtype=np.uint8)
        color_mask[mask_resized > 0] = color

        # Save the mask of each class to a separate file
        mask_filename = os.path.join('results', f"{classes_names[classes[i]]}_{i}.png")
        cv2.imwrite(mask_filename, color_mask)

        # Overlay the mask on the original image
        image_orig = cv2.addWeighted(image_orig, 1.0, color_mask, 0.5, 0)


    # Save the modified image
    new_image_path = os.path.join('results', os.path.splitext(os.path.basename(image_path))[0] + '_segmented' + os.path.splitext(image_path)[1])
    cv2.imwrite(new_image_path, image_orig)
    print(f"Segmented image saved to {new_image_path}")

process_image('segmentation_test.png')

As a result, we get this image:

Image segmentation using YoloV8

The masks of each detected class are also saved in the results folder for convenient inspection.

Now let's work on removing the background – for example, we will replace it with solid green. Later, this will help other photo or video editing programs blur the edges effectively and produce a very decent result!

Example code that solves this problem:

import cv2
import numpy as np
from ultralytics import YOLO
import os

model = YOLO('yolov8n-seg.pt')

# Color for highlighting objects of the "person" class
person_color = (0, 255, 0)  # Green

def process_image(image_path):
    frame = cv2.imread(image_path)
    if frame is None:
        print("Ошибка: не удалось загрузить изображение")
        return

    image_orig = frame.copy()
    h_or, w_or = frame.shape[:2]
    image = cv2.resize(frame, (640, 640))
    results = model(image)[0]

    classes = results.boxes.cls.cpu().numpy()
    masks = results.masks.data.cpu().numpy()

    # Create a green background
    green_background = np.zeros_like(image_orig)
    green_background[:] = person_color

    # Overlay the masks on the image
    for i, mask in enumerate(masks):
        class_name = results.names[int(classes[i])]
        if class_name == 'person':
            color_mask = np.zeros((640, 640, 3), dtype=np.uint8)
            resized_mask = cv2.resize(mask, (640, 640), interpolation=cv2.INTER_NEAREST)
            color_mask[resized_mask > 0] = person_color

            color_mask = cv2.resize(color_mask, (w_or, h_or), interpolation=cv2.INTER_NEAREST)

            mask_resized = cv2.resize(mask, (w_or, h_or), interpolation=cv2.INTER_NEAREST)
            green_background[mask_resized > 0] = image_orig[mask_resized > 0]

    # Save the processed image with the '_removed_BG' suffix
    base_name, ext = os.path.splitext(image_path)
    output_path = f"{base_name}_removed_BG{ext}"
    cv2.imwrite(output_path, green_background)
    print(f"Processed image saved to {output_path}")

    cv2.imshow('Processed Image', green_background)  # Show the processed image
    cv2.waitKey(0)
    cv2.destroyAllWindows()

# Path to the image to be processed
image_path = "test.jpg"
process_image(image_path)

Removing the background around a person using the YOLOv8 neural network

An important note about the code: for high-quality and accurate markup, the image must be resized to 640×640 px before being fed to the neural network.
Otherwise, the object masks may come back slightly offset.

Another example of use is the automatic creation of stickers:
https://github.com/stepanburmistrov/YoloV8/blob/main/Segmentation_Image2Sticker.py

Creating a sticker using YoloV8
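
The idea behind that script, roughly, is to turn the person mask into an alpha channel and save a transparent PNG. A simplified sketch (not the exact code from the repository; mask, image_orig, w_or and h_or are taken from the segmentation examples above):

mask_resized = cv2.resize(mask, (w_or, h_or), interpolation=cv2.INTER_NEAREST)
alpha = (mask_resized > 0).astype(np.uint8) * 255        # 255 where the person is
sticker = cv2.cvtColor(image_orig, cv2.COLOR_BGR2BGRA)   # add an alpha channel
sticker[:, :, 3] = alpha                                 # make the background transparent
cv2.imwrite('sticker.png', sticker)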

Classification

Classification is an image processing task in which the entire image is assigned to a specific class.

Detection and classification solve different problems and have their own characteristics and areas of application. Here are the main reasons why you should not always use detection instead of classification:

  1. Difficulty of the task:

    • Classification: Determines which class the entire object or image belongs to. For example, classifying a photograph as “dog” or “cat”.

    • Detection: Finds objects within an image and determines their classes and locations (for example, where the dog is in the photo and where the cat is).

  2. Resources and computing power:

    • Classification: Typically requires less computational resources because the entire object or image is analyzed without having to identify its parts.

    • Detection: More computationally expensive as it requires image analysis to find objects and determine their boundaries.

  3. Execution speed:

    • Classification: Faster because it performs one class definition operation for the entire image.

    • Detection: Slower because it requires repeated image analysis to find all the objects and classify them.

  4. Complexity of implementation:

    • Classification: Easier to implement and configure, as it is trained on smaller and more structured data.

    • Detection: More complex to implement, requires more data and training time, especially if you need to detect objects of different sizes and shapes.

When to use classification

  1. Holistic definition of an object's class: If you need to determine the class of the entire object or image, rather than its parts. For example, determine what is shown in the photograph (a dog or a cat).

  2. Limited computing resources: In conditions where computing resources are limited and fast data processing is required.

  3. A more precise definition of “subclasses”: for example, you can find letters in an image using detection, and then identify the specific character more accurately using classification!

When to use detection

  1. Multiple objects in an image: If there may be several objects of different classes in the image, and you need to determine their location and classes. For example, detecting cars and pedestrians on the street.

  2. Analysis of complex scenes: When you need to analyze complex scenes where it is important not only to determine which objects are present, but also where they are located.

  3. Real time application: In tasks where it is necessary to track objects in real time, for example, in video surveillance systems.

Example code for one image:

import cv2
import numpy as np
from ultralytics import YOLO
import os

model = YOLO('yolov8n-cls.pt')


def process_image(img):
    # Process the frame with the model
    results = model(img)[0]

    # Display the classification results on the image
    if results.probs is not None:
        # Access the top prediction
        top1_idx = results.probs.top1  # Index of the class with the highest probability
        top1_conf = results.probs.top1conf.item()  # Probability of that class
        class_name = results.names[top1_idx]  # Get the class name by its index

        # Draw the class and probability on the frame
        label = f"{class_name}: {top1_conf:.2f}"
        cv2.putText(img, label, (50, 50),
                    cv2.FONT_HERSHEY_SIMPLEX, 2,
                    (255, 0, 0), 3)

    return img

image = cv2.imread("dog.jpg")
image = process_image(image)
cv2.imwrite('result.jpg', image)

Not quite a collie, but similar! Models for classification have a lot to learn
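
If the top-1 guess is not enough, the probs object also exposes the five best candidates (a sketch assuming the results object from process_image; the attribute names are as in recent ultralytics releases and are worth checking against your installed version):

top5_idx = results.probs.top5       # indices of the five most likely classes
top5_conf = results.probs.top5conf  # their probabilities
for idx, conf in zip(top5_idx, top5_conf):
    print(f"{results.names[idx]}: {float(conf):.2f}")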

Finding key points of the body – determining human pose

There are many ways to use this model; here are some of them:

  • Athlete training: Helping analyze athletes' movements to improve their technique

  • Rehabilitation: Monitoring and adjusting patient movements during rehabilitation exercises.

  • Fall detection: Automatically detects falls and other hazardous situations for the elderly or industrial workers.

  • Animation: Creating realistic movements for animated characters in films and video games.

  • Virtual and augmented reality: Realistic tracking of user movements to create interactive VR and AR applications.

  • Customer behavior analysis: Studying the movement and behavior of customers in stores to optimize product display and improve service.

  • Digital mirrors: Virtual clothing try-on, allowing customers to see how they would look in different outfits without having to physically try them on.

Example code for processing an image (using the previous examples, it is easy to adapt it to a video file or a camera):

from ultralytics import YOLO
import cv2
import numpy as np
import os

# Load the YOLOv8-Pose model
model = YOLO('yolov8n-pose.pt')

# Dictionary of colors for the different classes
colors = {
    'white': (255, 255, 255),
    'red': (0, 0, 255),
    'blue': (255, 0, 0)
}

def draw_skeleton(image, keypoints, confs, pairs, color):
    for (start, end) in pairs:
        if confs[start] > 0.5 and confs[end] > 0.5:
            x1, y1 = int(keypoints[start][0]), int(keypoints[start][1])
            x2, y2 = int(keypoints[end][0]), int(keypoints[end][1])
            if (x1, y1) != (0, 0) and (x2, y2) != (0, 0):  # Ignore points at (0, 0)
                cv2.line(image, (x1, y1), (x2, y2), color, 2)

def process_image(image_path):
    # Load the image
    image = cv2.imread(image_path)
    if image is None:
        print("Ошибка: не удалось загрузить изображение")
        return

    # Process the image with the model
    results = model(image)[0]

    # Check whether any objects were detected
    if hasattr(results, 'boxes') and hasattr(results.boxes, 'cls') and len(results.boxes.cls) > 0:
        classes_names = results.names
        classes = results.boxes.cls.cpu().numpy()
        boxes = results.boxes.xyxy.cpu().numpy().astype(np.int32)

        # Process the keypoints
        if results.keypoints:
            keypoints = results.keypoints.data.cpu().numpy()
            confs = results.keypoints.conf.cpu().numpy()
            
            for i, (class_id, box, kp, conf) in enumerate(zip(classes, boxes, keypoints, confs)):
                draw_box=False
                if draw_box:
                    class_name = classes_names[int(class_id)]
                    color = colors['white']
                    x1, y1, x2, y2 = box
                    cv2.rectangle(image, (x1, y1), (x2, y2), color, 2)
                    cv2.putText(image, class_name, (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)

                # Visualize the keypoints with their indices
                for j, (point, point_conf) in enumerate(zip(kp, conf)):
                    if point_conf > 0.5:  # Filter by confidence
                        x, y = int(point[0]), int(point[1])
                        if (x, y) != (0, 0):  # Ignore points at (0, 0)
                            cv2.circle(image, (x, y), 5, colors['blue'], -1)
                            cv2.putText(image, str(j), (x + 5, y - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.5, colors['blue'], 2)

                # Draw the skeleton
                draw_skeleton(image, kp, conf, [(5, 7), (7, 9), (6, 8), (8, 10)], colors['white']) # Arms
                draw_skeleton(image, kp, conf, [(11, 13), (13, 15), (12, 14), (14, 16)], colors['red']) # Legs
                draw_skeleton(image, kp, conf, [(5, 11), (6, 12)], colors['blue']) # Torso

    # Save and display the results
    output_path = os.path.splitext(image_path)[0] + "_pose_detected.jpg"
    cv2.imwrite(output_path, image)
    print(f"Сохранено изображение с результатами: {output_path}")

    cv2.imshow('YOLOv8-Pose Detection', image)
    cv2.waitKey(0)
    cv2.destroyAllWindows()

# Path to the image to be processed
image_path = "d.jpg"
process_image(image_path)

Object tracking

Object tracking is the process of recording and analyzing the movements of objects in a video or image sequence. This process involves assigning unique identifiers to each detected object and tracking its location over time. Thanks to this approach, it is possible to solve many problems in various fields.

Applications of object tracking:

  • Security and video surveillance: Detect suspicious behavior, automatically track people to identify suspicious activities, identify and alert about missing or unattended objects.

  • Transport and logistics: Traffic management, vehicle monitoring to optimize traffic flow, autonomous vehicles and collision avoidance.

  • Retail: Analysis of customer behavior to optimize product placement, improve user experience and anti-theft systems to monitor suspicious activity.

  • Sports and performance analysis: Analyzing sporting events, tracking players for detailed analysis of their actions and strategies, using tracking to improve athletes' technique.

  • Medicine and healthcare: Rehabilitation, tracking patients' movements to monitor their progress, analyzing gait and other movements to identify disorders and diseases.

  • Robotics and Human-Computer Interaction: Robot navigation, ensuring safe movement of robots in a dynamic environment, gesture control of devices and applications.

Example code for video processing:

from collections import defaultdict
import cv2
import numpy as np
from ultralytics import YOLO

# Load the YOLOv8 model
model = YOLO("yolov8x.pt")

# Open the video file
video_path = "input.mp4"
cap = cv2.VideoCapture(video_path)

# Check that the video opened successfully
if not cap.isOpened():
    print(f"Error opening {video_path}")
    exit()


# Get the video FPS
fps = cap.get(cv2.CAP_PROP_FPS)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

# Set up the VideoWriter for saving the output video
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
out = cv2.VideoWriter('out.mp4', fourcc, fps, (width, height))

# Dictionary for storing the track history of each object
track_history = defaultdict(lambda: [])

# Loop for processing each frame of the video
while cap.isOpened():
    # Read a frame from the video
    success, frame = cap.read()

    if not success:
        print("Конец видео")
        break

    # Run YOLOv8 object tracking on the frame, keeping the tracks between frames
    results = model.track(frame, persist=True)

    # Check whether any objects were found
    if results[0].boxes is not None and results[0].boxes.id is not None:
        # Get the box coordinates and track identifiers
        boxes = results[0].boxes.xywh.cpu()  # xywh box coordinates
        track_ids = results[0].boxes.id.int().cpu().tolist()  # track identifiers

        # Visualize the results on the frame
        annotated_frame = results[0].plot()

        # Draw the tracks
        for box, track_id in zip(boxes, track_ids):
            x, y, w, h = box  # center coordinates and size of the box
            track = track_history[track_id]
            track.append((float(x), float(y)))  # add the object's center coordinates to its history
            if len(track) > 30:  # limit the history length to 30 frames
                track.pop(0)

            # Draw the track lines
            points = np.hstack(track).astype(np.int32).reshape((-1, 1, 2))
            cv2.polylines(annotated_frame, [points], isClosed=False, color=(230, 230, 230), thickness=10)

        # Display the annotated frame
        cv2.imshow("YOLOv8 Tracking", annotated_frame)
        out.write(annotated_frame)  # write the frame to the output video
    else:
        # If no objects were detected, just display the frame
        cv2.imshow("YOLOv8 Tracking", frame)
        out.write(frame)  # write the frame to the output video

    # Break the loop when the 'Esc' key is pressed
    if cv2.waitKey(1) == 27:
        break

# Release the video capture and close all OpenCV windows
cap.release()
out.release()  # close the output video file
cv2.destroyAllWindows()

Now that tracking is working, it becomes possible to analyze the collected data. Let's predict the location of an object after a given amount of time:

To do this, we will use NumPy, which can fit a linear regression to a given set of points. This will help us average the motion of an object and predict its future coordinates.

Here is the complete code of the predict_position function, which uses the method of least squares to find the line that best describes the last points of an object's path, and extrapolates along that line to predict the future position:

def predict_position(track, future_time, fps):
    if len(track) < 2:
        return track[-1]

    N = min(len(track), 25)
    track = np.array(track[-N:])

    times = np.arange(-N + 1, 1)

    A = np.vstack([times, np.ones(len(times))]).T
    k_x, b_x = np.linalg.lstsq(A, track[:, 0], rcond=None)[0]
    k_y, b_y = np.linalg.lstsq(A, track[:, 1], rcond=None)[0]

    future_frames = future_time * fps
    future_x = k_x * future_frames + b_x
    future_y = k_y * future_frames + b_y

    return future_x, future_y
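
One possible way to use this function inside the tracking loop from the previous section (illustrative; the two-second horizon is an arbitrary choice):

# Inside the tracking loop, after the track history has been updated:
future_x, future_y = predict_position(track, 2, fps)  # predict 2 seconds ahead
cv2.circle(annotated_frame, (int(future_x), int(future_y)), 8, (0, 0, 255), -1)
cv2.line(annotated_frame,
         (int(track[-1][0]), int(track[-1][1])),
         (int(future_x), int(future_y)), (0, 0, 255), 2)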

The same videos, but now processed with prediction:

Code for processing video prediction

Creating your own dataset

And now the most interesting part: how to train the YOLOv8 neural network to handle not only cars and dogs, but also the objects you actually need to work with. For this purpose, the developers provide the ability to further train (fine-tune) the model.
Why additional training? Because not the entire model is trained, but only the last layers of the neural network. This allows you to train the model quickly and effectively on a limited dataset that you can realistically annotate by hand! Let's begin!

The task is to go the whole way and get a high-quality result, which can later be improved by increasing the complexity of the objects and the conditions in which they appear!
Meet our hero – the “Cube” (which is actually a rectangular parallelepiped). We will teach our model to recognize it!

— I’ll answer the obvious question right away: “Why not just OpenCV? It would be quite enough here.”
— Yes, and in fact we will use it to automatically annotate the dataset. But not for the detection itself, because the task is to train the model to identify the objects we need!

We shoot a video with the object, also capturing some frames where it is absent, i.e. just the background!

The next step is to split the video into separate images.

import cv2
import os
import time

# Path to the video file
video_path = "000.mp4"
# Folder for saving the images
output_folder = "output_images"
# Interval between frames (every n-th frame will be saved)
frame_interval = 1  # Can be changed to 2, 5, etc.

os.makedirs(output_folder, exist_ok=True)

# Open the video file
cap = cv2.VideoCapture(video_path)
if not cap.isOpened():
    print(f"Error opening the video file: {video_path}")
    exit()

frame_count = 0
saved_count = 0

while True:
    ret, frame = cap.read()
    if not ret:
        break

    if frame_count % frame_interval == 0:
        # Get the current time as a timestamp
        timestamp = int(time.time() * 1000)  # Use milliseconds for better precision
        output_path = os.path.join(output_folder, f'{timestamp}_frame_{saved_count:05d}.jpg')
        cv2.imwrite(output_path, frame)
        print(f"Сохранено: {output_path}")
        saved_count += 1

    frame_count += 1

cap.release()
print("Разделение видео на фотографии завершено.")

And we get a huge number of photographs. How many do you actually need? It is hard to give a definite answer, because it depends on the complexity of the object and the conditions in which it will be detected.
One thing is certain: the more diverse the frames and the surrounding environment are (if, of course, this is required by the conditions of further operation), the better the trained model will work later!

Now you need to mark up the data, i.e. indicate where the desired object is located in each frame.

Automated markup

Since this problem is being solved in “greenhouse” conditions, the markup can be done automatically using OpenCV. Let's take one of the frames of our dataset and use a script to pick ranges in the HSV color model that separate the cube from the background:

import cv2
import numpy as np

def nothing(*arg):
    pass

cv2.namedWindow( "result" ) 
cv2.namedWindow( "settings" )


cv2.createTrackbar('h1', 'settings', 0, 180, nothing)
cv2.createTrackbar('s1', 'settings', 0, 255, nothing)
cv2.createTrackbar('v1', 'settings', 0, 255, nothing)
cv2.createTrackbar('h2', 'settings', 180, 180, nothing)
cv2.createTrackbar('s2', 'settings', 255, 255, nothing)
cv2.createTrackbar('v2', 'settings', 255, 255, nothing)

while True:
    img = cv2.imread('000.jpg')
    h,w,_=img.shape
    img=cv2.resize(img,(w//5,h//5))
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV )
 
    # read the trackbar values
    h1 = cv2.getTrackbarPos('h1', 'settings')
    s1 = cv2.getTrackbarPos('s1', 'settings')
    v1 = cv2.getTrackbarPos('v1', 'settings')
    h2 = cv2.getTrackbarPos('h2', 'settings')
    s2 = cv2.getTrackbarPos('s2', 'settings')
    v2 = cv2.getTrackbarPos('v2', 'settings')
    h_min = np.array((h1, s1, v1), np.uint8)
    h_max = np.array((h2, s2, v2), np.uint8)
    img_bin = cv2.inRange(hsv, h_min, h_max)
    cv2.imshow('result', img_bin)
    cv2.imshow('original', img)
    ch = cv2.waitKey(5)
    if ch == 27:
        break
cv2.destroyAllWindows()

Now you need to write down the resulting values in the following format:
lower_hsv = np.array([64, 54, 167])
upper_hsv = np.array([180, 255, 255])

The following script will apply this filter to all images and save the coordinates of the found object.
IMPORTANT! This script only covers the simple case – a single object of a single class per frame, in good conditions. A script for manual markup is given further below!

import cv2
import os
import numpy as np

input_folder="output_images"
output_folder="dataset/train"
output_images_folder = os.path.join(output_folder, 'images')
output_labels_folder = os.path.join(output_folder, 'labels')

os.makedirs(output_images_folder, exist_ok=True)
os.makedirs(output_labels_folder, exist_ok=True)

lower_hsv = np.array([89, 71, 120])
upper_hsv = np.array([180, 255, 255])

def find_mask(image, lower_hsv, upper_hsv):
    hsv_image = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv_image, lower_hsv, upper_hsv)
    return mask

def find_bounding_rect(mask):
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if contours:
        contour = max(contours, key=cv2.contourArea)
        x, y, w, h = cv2.boundingRect(contour)
        return x, y, x + w, y + h
    else:
        return None

def normalize_coordinates(x1, y1, x2, y2, img_width, img_height):
    x_center = (x1 + x2) / 2 / img_width
    y_center = (y1 + y2) / 2 / img_height
    width = abs(x2 - x1) / img_width
    height = abs(y2 - y1) / img_height
    return x_center, y_center, width, height

for filename in os.listdir(input_folder):
    if filename.endswith(('.jpg', '.jpeg', '.png')):
        image_path = os.path.join(input_folder, filename)
        image = cv2.imread(image_path)
        resized_image = cv2.resize(image, (640,640))
        mask = find_mask(resized_image, lower_hsv, upper_hsv)
        bounding_rect = find_bounding_rect(mask)

        if bounding_rect is not None: 
            x1, y1, x2, y2 = bounding_rect
            x_center, y_center, width, height = normalize_coordinates(x1, y1, x2, y2, 640, 640)

            # Save the image
            output_image_path = os.path.join(output_images_folder, filename)
            cv2.imwrite(output_image_path, resized_image)

            # Save the label
            label_filename = os.path.splitext(filename)[0] + '.txt'
            label_file_path = os.path.join(output_labels_folder, label_filename)
            with open(label_file_path, 'w') as f:
                f.write(f"0 {x_center} {y_center} {width} {height}\n")

            print(f"Processed and saved {filename}")

print("Подготовка датасета завершена.")

During processing, the images are resized to 640×640 pixels before being passed to the neural network. For each image, a TXT file is also written containing information about the object's location.
It is important that the coordinates are specified not in pixels but as values from 0 to 1, i.e. as the position of the point relative to the frame size.
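
For example (the numbers are purely illustrative): a box with corners (100, 200) and (300, 400) in a 640×640 image turns into the label line 0 0.3125 0.46875 0.3125 0.3125:

x1, y1, x2, y2, img_w, img_h = 100, 200, 300, 400, 640, 640
x_center = (x1 + x2) / 2 / img_w   # 0.3125
y_center = (y1 + y2) / 2 / img_h   # 0.46875
width = (x2 - x1) / img_w          # 0.3125
height = (y2 - y1) / img_h         # 0.3125
print(f"0 {x_center} {y_center} {width} {height}")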

The folder structure at this stage looks like this:

dataset/
├── train/
│   ├── images/
│   │   ├── img1.jpg
│   │   ├── img2.jpg
│   │   ├── ...
│   ├── labels/
│   │   ├── img1.txt
│   │   ├── img2.txt
│   │   ├── ...
├── output_images/
│   ├── img1.jpg
│   ├── img2.jpg
│   ├── ...
├── Dataset_video2images.py
└── Dataset_HSV_Markup.py

After the markup, it is worth checking how well the data has been labeled!

import cv2
import os

# Folders with the images and labels
images_path = "dataset/train/images"
labels_path = "dataset/train/labels"

# Folder for saving images with the drawn rectangles
output_folder = "checked_images"
os.makedirs(output_folder, exist_ok=True)

# Colors for the classes (add more colors if there are more classes)
colors = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0)]

# Read all image and label files
images = [f for f in os.listdir(images_path) if f.endswith(('.jpg', '.jpeg', '.png'))]

# Function for converting coordinates from normalized values to pixels
def denormalize_coordinates(x_center, y_center, width, height, img_width, img_height):
    x_center *= img_width
    y_center *= img_height
    width *= img_width
    height *= img_height
    x1 = int(x_center - width / 2)
    y1 = int(y_center - height / 2)
    x2 = int(x_center + width / 2)
    y2 = int(y_center + height / 2)
    return x1, y1, x2, y2

# Process the images
for image_file in images:
    image_path = os.path.join(images_path, image_file)
    label_file = os.path.splitext(image_file)[0] + '.txt'
    label_path = os.path.join(labels_path, label_file)

    # Check that the label file exists
    if not os.path.exists(label_path):
        print(f"Label file not found for image: {image_file}")
        continue

    # Load the image
    image = cv2.imread(image_path)
    img_height, img_width = image.shape[:2]

    # Read the label file and draw the rectangles
    with open(label_path, 'r') as f:
        for line in f:
            cls, x_center, y_center, width, height = map(float, line.strip().split())
            x1, y1, x2, y2 = denormalize_coordinates(x_center, y_center, width, height, img_width, img_height)
            color = colors[int(cls) % len(colors)]
            cv2.rectangle(image, (x1, y1), (x2, y2), color, 2)
            cv2.putText(image, f'class {int(cls)}', (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.9, color, 2)

    # Save the image with the drawn rectangles
    output_path = os.path.join(output_folder, image_file)
    cv2.imwrite(output_path, image)
    print(f"Saved checked image: {output_path}")

print("Проверка разметки завершена.")

We check and make sure that the checked_images folder contains correctly labeled data. The boxes around the cubes are drawn from the label files, which guarantees that correct data will go into the neural network for training.

Manual markup

To work with more complex data and many classes, manual markup is required. There are, of course, online services for this, but that is not our way. We will write our own tool:

import cv2
import os
import yaml
import shutil

# Path to the folder with the source images
full_images_path = "output_images"
# Path to the folder for saving the processed data
dataset_path = "dataset"
train_images_path = os.path.join(dataset_path, 'train', 'images')
train_labels_path = os.path.join(dataset_path, 'train', 'labels')
valid_images_path = os.path.join(dataset_path, 'valid', 'images')
valid_labels_path = os.path.join(dataset_path, 'valid', 'labels')
test_images_path = os.path.join(dataset_path, 'test', 'images')
test_labels_path = os.path.join(dataset_path, 'test', 'labels')
ready_images_path = os.path.join(full_images_path, 'ready')

os.makedirs(train_images_path, exist_ok=True)
os.makedirs(train_labels_path, exist_ok=True)
os.makedirs(valid_images_path, exist_ok=True)
os.makedirs(valid_labels_path, exist_ok=True)
os.makedirs(test_images_path, exist_ok=True)
os.makedirs(test_labels_path, exist_ok=True)
os.makedirs(ready_images_path, exist_ok=True)


window_name="Annotation Tool"
current_class = 0
colors = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0)]
annotations = []

# Function for resizing the image
def resize_image(image, size=(640, 640)):
    return cv2.resize(image, size)

# Mouse event handling
drawing = False
ix, iy = -1, -1

def draw_rectangle(event, x, y, flags, param):
    global ix, iy, drawing, annotations, current_class
    
    if event == cv2.EVENT_LBUTTONDOWN:
        drawing = True
        ix, iy = x, y
    elif event == cv2.EVENT_MOUSEMOVE:
        if drawing:
            image = param['original_image'].copy()
            for annotation in annotations:
                cls, x1, y1, x2, y2 = annotation
                cv2.rectangle(image, (x1, y1), (x2, y2), colors[cls], 2)
            cv2.rectangle(image, (ix, iy), (x, y), colors[current_class], 2)
            cv2.imshow(window_name, image)
    elif event == cv2.EVENT_LBUTTONUP:
        drawing = False
        annotations.append((current_class, ix, iy, x, y))
        image = param['original_image'].copy()
        for annotation in annotations:
            cls, x1, y1, x2, y2 = annotation
            cv2.rectangle(image, (x1, y1), (x2, y2), colors[cls], 2)
        cv2.imshow(window_name, image)
    elif event == cv2.EVENT_RBUTTONDOWN:  # Delete the last box
        if annotations:
            removed_annotation = annotations.pop()
            image = param['original_image'].copy()  # Go back to the original image
            for annotation in annotations:
                cls, x1, y1, x2, y2 = annotation
                cv2.rectangle(image, (x1, y1), (x2, y2), colors[cls], 2)
            cv2.imshow(window_name, image)
        else:
            image = param['original_image'].copy()
            cv2.imshow(window_name, image)

# Update the data.yaml file
def update_data_yaml():
    data_yaml_path="data.yaml"
    data = {
        'train': 'dataset/train/images',
        'val': 'dataset/valid/images',
        'test': 'dataset/test/images',
        'nc': 4,
        'names': ['class0', 'class1', 'class2', 'class3']
    }
    with open(data_yaml_path, 'w') as f:
        yaml.dump(data, f, default_flow_style=None, sort_keys=False)
    print(f"Updated {data_yaml_path}")

# Load and process the images
for filename in os.listdir(full_images_path):
    if filename.endswith(('.jpg', '.jpeg', '.png')):
        image_path = os.path.join(full_images_path, filename)
        image = cv2.imread(image_path)
        image = resize_image(image)
        original_image = image.copy()
        annotations = []

        cv2.namedWindow(window_name)
        cv2.setMouseCallback(window_name, draw_rectangle, param={'original_image': original_image})

        while True:
            image_with_annotations = original_image.copy()
            for annotation in annotations:
                cls, x1, y1, x2, y2 = annotation
                cv2.rectangle(image_with_annotations, (x1, y1), (x2, y2), colors[cls], 2)
            cv2.imshow(window_name, image_with_annotations)
            key = cv2.waitKey(1) & 0xFF

            if key == ord(' '):  # Press space to save
                # Save the image
                output_image_path = os.path.join(train_images_path, filename)
                cv2.imwrite(output_image_path, original_image)
                print(f"Saved image to {output_image_path}")

                # Save the label data
                label_filename = os.path.splitext(filename)[0] + '.txt'
                label_file_path = os.path.join(train_labels_path, label_filename)
                with open(label_file_path, 'w') as f:
                    for annotation in annotations:
                        cls, x1, y1, x2, y2 = annotation
                        x_center = (x1 + x2) / 2 / 640
                        y_center = (y1 + y2) / 2 / 640
                        width = abs(x2 - x1) / 640
                        height = abs(y2 - y1) / 640
                        f.write(f"{cls} {x_center} {y_center} {width} {height}\n")
                print(f"Saved labels to {label_file_path}")

                # Update data.yaml
                update_data_yaml()

                # Move the processed image
                ready_image_path = os.path.join(ready_images_path, filename)
                shutil.move(image_path, ready_image_path)
                print(f"Moved image to {ready_image_path}")
                break
            elif key == 27:  # Press Esc to skip the image
                print("Skipped image")
                break
            elif key in [ord(str(i)) for i in range(10)]:  # Class selection
                current_class = int(chr(key))
                print(f"Selected class: {current_class}")

        cv2.destroyAllWindows()

Brief instructions:
— Select one of the 4 classes using the keys 0, 1, 2, 3 on the keyboard
— Draw boxes around the objects with the left mouse button
— Delete the last box – right mouse button
— Go to the next frame – space

Now that all the data has been marked up, all that remains is to prepare a few files and you can start training!

Final preparations

For quality training and testing, all the data must be divided into 3 parts:
— train (training data) – 70%
— test (test data, for evaluating the model after training) – 20%
— valid (validation data, for checking progress during training) – 10%

Doing this manually is inconvenient, so we automate the process using a script:

import os
import shutil
import random

# Parameters for splitting the data
test_percent = 0.2  # Fraction of the data for testing
valid_percent = 0.1  # Fraction of the data for validation

# Path to the data folder
dataset_path = "dataset"
train_images_path = os.path.join(dataset_path, 'train', 'images')
train_labels_path = os.path.join(dataset_path, 'train', 'labels')
valid_images_path = os.path.join(dataset_path, 'valid', 'images')
valid_labels_path = os.path.join(dataset_path, 'valid', 'labels')
test_images_path = os.path.join(dataset_path, 'test', 'images')
test_labels_path = os.path.join(dataset_path, 'test', 'labels')

os.makedirs(valid_images_path, exist_ok=True)
os.makedirs(valid_labels_path, exist_ok=True)
os.makedirs(test_images_path, exist_ok=True)
os.makedirs(test_labels_path, exist_ok=True)

# Get all image files and the corresponding labels
images = [f for f in os.listdir(train_images_path) if f.endswith(('.jpg', '.jpeg', '.png'))]
labels = [f for f in os.listdir(train_labels_path) if f.endswith('.txt')]

# Sort so that images and labels correspond to each other
images.sort()
labels.sort()

# Check that the numbers of images and labels match
if len(images) != len(labels):
    print("The number of images does not match the number of labels.")
    exit()

# Shuffle the data
data = list(zip(images, labels))
random.shuffle(data)
images, labels = zip(*data)

# Split the data
num_images = len(images)
num_test = int(num_images * test_percent)
num_valid = int(num_images * valid_percent)
num_train = num_images - num_test - num_valid

# Move files to the corresponding folders
def move_files(file_list, source_image_dir, source_label_dir, dest_image_dir, dest_label_dir):
    for file in file_list:
        image_path = os.path.join(source_image_dir, file)
        label_path = os.path.join(source_label_dir, os.path.splitext(file)[0] + '.txt')
        shutil.move(image_path, os.path.join(dest_image_dir, file))
        shutil.move(label_path, os.path.join(dest_label_dir, os.path.splitext(file)[0] + '.txt'))

# Move the test data
move_files(images[:num_test], train_images_path, train_labels_path, test_images_path, test_labels_path)

# Move the validation data
move_files(images[num_test:num_test + num_valid], train_images_path, train_labels_path, valid_images_path, valid_labels_path)

# The remaining data stays in the train folder

print(f"Перемещено {num_test} изображений в папку test.")
print(f"Перемещено {num_valid} изображений в папку valid.")
print(f"Осталось {num_train} изображений в папке train.")

Another important step is creating a data.yaml file with information about the folders, files and class names of the future model. The file structure looks like this:

train: dataset/train/images 
val: dataset/valid/images 
test: dataset/test/images

nc: 4 
names: [class0, class1, class2, class3]

Now everything is ready and the project structure looks like this:

dataset/
├── train/
│   ├── images/
│   │   ├── img1.jpg
│   │   ├── img2.jpg
│   │   ├── ...
│   ├── labels/
│   │   ├── img1.txt
│   │   ├── img2.txt
│   │   ├── ...
├── test/
│   ├── images/
│   │   ├── img3.jpg
│   │   ├── img4.jpg
│   │   ├── ...
│   ├── labels/
│   │   ├── img3.txt
│   │   ├── img4.txt
│   │   ├── ...
├── valid/
│   ├── images/
│   │   ├── img5.jpg
│   │   ├── img6.jpg
│   │   ├── ...
│   ├── labels/
│   │   ├── img5.txt
│   │   ├── img6.txt
│   │   ├── ...
├── output_images/
│   ├── img1.jpg
│   ├── img2.jpg
│   ├── ...
├── data.yaml
├── Dataset_video2images.py
└── Dataset_HSV_Markup.py

Training

There are just a few parameters that you can and should work with at the very beginning:
epochs = 500 — number of training epochs. Selected individually for the task. During training, YOLO stops automatically if there is no improvement in results for several epochs.

batch = 64 – the size of the “batch of images” transmitted at a time to the neural network. Varies depending on the amount of available memory.

imgsz = 640 – image size

import os
from ultralytics import YOLO

current_dir = os.path.dirname(os.path.abspath(__file__))

data_path = os.path.join(current_dir, 'data.yaml')
model = YOLO(os.path.join(current_dir, 'yolov8n.pt'))


epochs = 500
batch = 64
imgsz = 640
if __name__ == '__main__':
    results = model.train(data=data_path,
                      epochs=epochs, 
                      batch=batch, 
                      imgsz=imgsz, 
                      name="red",
                      device="cuda")

Let's start training and wait!

After the process is completed (the time may vary greatly depending on many parameters), the runs folder will contain graphs and other data from the training process, as well as the model files best.pt and last.pt.

All that remains is to replace the model file in the detection script with best.pt and, hurray, you can use it!
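
For example, assuming the run kept the name "red" from the training script, the weights usually end up in runs/detect/red/weights/ (the exact path may differ depending on your ultralytics version and settings):

from ultralytics import YOLO

model = YOLO('runs/detect/red/weights/best.pt')  # path assumes the run name "red"
results = model('test.png')[0]                   # any image with your new object
print(results.boxes.xyxy.cpu().numpy())          # coordinates of the detected cubes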
