Object Detection in an Image Using YOLOv5 and YOLOv8 Models

Author of the article: Victoria Lyalikova

Object detection is a popular computer vision task that involves identifying and localizing objects in images or videos. It is part of many applications, such as self-driving cars, robotics, and video surveillance. Over the years, many algorithms and methods have been developed to find objects in images and determine their positions. The best results on such tasks are achieved with convolutional neural networks.

One of the most popular neural network architectures for such tasks is YOLO (You Only Look Once), introduced in 2015. Since then, many versions of the algorithm have appeared. The latest releases of the network support tasks such as image classification, detection, and segmentation.

We will consider the YOLO architecture only for the task of detecting objects in an image. In this case, the goal of the algorithm is to predict the class of an object and draw a bounding box that defines the location of the object in the input image.

One way to solve detection problems is to split an image into square regions and then classify each region for the presence of an object as well as for the class of that object. In this scheme the image is effectively processed twice: once to find the regions that contain objects, and once to classify those objects. YOLO uses a different approach. The original image is compressed into a square grid, and each cell of the grid stores information about the presence of an object and its class in the corresponding part of the image. For each cell, class probabilities are output; cells with a class probability above a threshold are selected and used to determine the location of the object in the image. In other words, YOLO looks at the image only once, which significantly increases processing speed while keeping detection accuracy high. At the output of such a network, we want to get approximately the following image:

As a result, we get an image with the detected objects marked and the probability of each object belonging to the predicted class.

Below we will look at examples of object detection with the YOLOv5 and YOLOv8 models in two scenarios:

  1. Let's assume that we have one or more images and our task is to detect as many objects in them as the model allows.

  2. Let's assume that we have a dataset and we want to train a YOLO model on this dataset.

Both versions of YOLO are free, come from the same developer (Ultralytics), perform well in comparative tests, and are implemented on top of the PyTorch deep learning library. YOLOv8 is positioned as the newer, more modern model compared to version 5, although both versions are actively developed.

Getting Started with YOLO and Preparing a Dataset

YOLO models come pre-trained on the COCO dataset and are distributed as .pt files. There are three types of models (classification, detection, segmentation) and five model sizes for each type (nano, small, medium, large, xlarge):

The COCO dataset is a huge collection of images covering 80 classes. So, if you have no special requirements, you can simply run a ready-made model without any additional training and evaluate the result. Here are the names of the COCO classes:

names: {0: person, 1: bicycle, 2: car, 3: motorcycle, 4: airplane, 5: bus, 6: train, 7: truck, 8: boat, 9: traffic light, 10: fire hydrant, 11: stop sign, 12: parking meter, 13: bench, 14: bird, 15: cat, 16: dog, 17: horse, 18: sheep, 19: cow, 20: elephant, 21: bear, 22: zebra, 23: giraffe, 24: backpack, 25: umbrella, 26: handbag, 27: tie, 28: suitcase, 29: frisbee, 30: skis, 31: snowboard, 32: sports ball, 33: kite, 34: baseball bat, 35: baseball glove, 36: skateboard, 37: surfboard, 38: tennis racket, 39: bottle, 40: wine glass, 41: cup, 42: fork, 43: knife, 44: spoon, 45: bowl, 46: banana, 47: apple, 48: sandwich, 49: orange, 50: broccoli, 51: carrot, 52: hot dog, 53: pizza, 54: donut, 55: cake, 56: chair, 57: couch, 58: potted plant, 59: bed, 60: dining table, 61: toilet, 62: tv, 63: laptop, 64: mouse, 65: remote, 66: keyboard, 67: cell phone, 68: microwave, 69: oven, 70: toaster, 71: sink, 72: refrigerator, 73: book, 74: clock, 75: vase, 76: scissors, 77: teddy bear, 78: hair drier, 79: toothbrush}

Any neural network training begins with preparing training data. Let's talk a little about where you can get training data and how to present it in the format needed to train YOLO network models.

To train a model, we need to prepare a dataset of images with annotated objects and then split this data into training and validation (and, optionally, test) sets.

The dataset structure for training YOLO models should follow this format:

├── dataset
│   ├── test
│   │   ├── images
│   │   └── labels
│   ├── train
│   │   ├── images
│   │   └── labels
│   └── val
│       ├── images
│       └── labels
├── dataset.yaml
└── yolov5x.pt

The dataset should be divided into two folders: train (training) and val (validation); an optional test folder (for testing the model) may also be present. Each of these folders contains two subfolders: images, with the pictures, and labels, with text annotation files describing the objects in those pictures in YOLO format. In this format, each line of an annotation file looks like:

{object_class_id} {x_center} {y_center} {width} {height}

All coordinates must be normalized so that they fall into the range from 0 to 1. The formulas used to calculate them are

x_center = (box_x_left + box_width/2) / image_width
y_center = (box_y_top + box_height/2) / image_height
width = box_width / image_width
height = box_height / image_height

where box_x_left and box_y_top are the pixel coordinates of the top-left corner of the object's bounding box, and box_width and box_height are its size in pixels.
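
For illustration, here is a minimal sketch in plain Python of converting a pixel-space bounding box into a YOLO annotation line using these formulas (the example numbers are made up):

def to_yolo_line(class_id, box_x_left, box_y_top, box_width, box_height, image_width, image_height):
    # Apply the formulas above: center coordinates and sizes normalized to the range 0..1
    x_center = (box_x_left + box_width / 2) / image_width
    y_center = (box_y_top + box_height / 2) / image_height
    width = box_width / image_width
    height = box_height / image_height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# Hypothetical example: a 200x300 px box at (50, 100) in a 640x480 image, object class 0
print(to_yolo_line(0, 50, 100, 200, 300, 640, 480))
# -> 0 0.234375 0.520833 0.312500 0.625000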

If there are multiple objects in the image, then the annotation text file might look like this

1 0.589869281 0.490361446 0.326797386 0.527710843
0 0.323529412 0.585542169 0.189542484 0.351807229

The first line contains the bounding box for one object, the second line contains the bounding box for the second object. Of course, the file may contain more lines, depending on the number of objects in the image.

Annotation text files must have the same names as the image files and have a .txt extension.

Next, our dataset must also contain a file with the .yaml extension, which contains the paths to the training and validation samples, as well as the number of classes and their labels.

train: ../train/images
val: ../val/images
test: ../test/images
nc: 2
names: ['cat','dog']

The first lines specify the paths to the training, validation, and (optionally) test images. The nc line specifies the number of classes in the dataset, and names lists the class names in the correct order. The indices of these names are the numbers used when annotating the images, and the same indices are returned by the model when detecting objects with the predict method. The .yaml file is passed to the model during training.

And here is the main question: how to get such a data set?

  • Take a ready-made publicly available dataset. For example, you can use kaggle.com, but finding a set in the correct format will take some time.

  • You can also find a large selection of free computer vision datasets at https://universe.roboflow.com/. What is convenient about this service is that a dataset can be downloaded in different formats.

The disadvantage is that there are many duplicates and poor-quality labels.

  • Collect a dataset yourself, which will then, unfortunately, have to be labeled manually. Even if a ready-made but unannotated dataset exists, it still has to be labeled. The good news is that there are many convenient services for visual annotation of images for machine learning tasks: simply enter something like “image annotation software for machine learning” into a search engine to get a list of them, and many of these tools work entirely online. Labeling, of course, must match the network model we are going to use, since objects can be labeled with bounding boxes, points, segmentation areas, and so on. One of the great online tools for this is Roboflow Annotate. Using this service, we simply upload images, draw bounding boxes on them, and set a class for each box. The tool then automatically creates the annotation files, splits the data into training and validation sets, creates a YAML descriptor file, and lets us export and download the annotated data as a ZIP file.

Object detection

  1. YOLOv8

First, install the YOLOv8 package. To do this, run the following command in Jupyter Notebook:

%pip install ultralytics

The Ultralytics package has a YOLO class that is used to create neural network models. Next, we import this module

from ultralytics import YOLO

Now everything is ready to create a neural network model.

model = YOLO('yolov8m.pt')

The first time you run this code, the yolov8m.pt file, which contains the model and pre-trained weights, will be downloaded from the Ultralytics server to the current folder, and then a model object will be created. You can now train this model, detect objects with it, and export it for use in production.

There are convenient methods for all these tasks:

  • train – used to train a model on your image dataset.

  • predict – used to make predictions on a given image, such as finding the bounding boxes of all objects that the model can find in that image.

  • export — used to export this model from the default PyTorch format to the specified one. This mode offers a versatile set of options for exporting the trained model to various formats, allowing it to be deployed on various platforms and devices.
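
For example, exporting the model to ONNX could look like the following minimal sketch (ONNX is just one of the available target formats):

# Assuming the model object created above with YOLO('yolov8m.pt')
onnx_path = model.export(format='onnx')  # converts the model to ONNX and returns the path of the created file
print(onnx_path)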

Now we can take any image and detect in it the objects whose classes the YOLO model saw during training.

results = model.predict('boy_dog.jpg')

The predict method accepts many different types of input, including a path to a single image, an array of image paths, a Python PIL Image object, and more.
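
For illustration, here are a few variants of calling predict on the model object created above (file names such as img1.jpg are placeholders, and conf is an optional confidence threshold):

from PIL import Image

model.predict('boy_dog.jpg')                        # a single image path
model.predict(['img1.jpg', 'img2.jpg'])             # a list of image paths (placeholder names)
model.predict(Image.open('boy_dog.jpg'), conf=0.5)  # a PIL image, keeping only boxes with confidence >= 0.5

Each call returns a list of Results objects, one per input image.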

result = results[0]

result contains the detected objects and convenient properties for working with these objects. The most important is the boxes array, which contains information about the detected classes, the probability with which these classes were accepted, and the coordinates of the bounding boxes for each class in the image.

result.boxes

Result

  • cls – class number (one of 80 object types with ID from 0 to 79. COCO object classes were defined above).

  • conf – the model's confidence for this class.

  • xywh – rectangle coordinates in xywh format.

  • xywhn – rectangle coordinates in xywh format, normalized to the original size.

  • xyxy – rectangle coordinates as an array [x1,y1,x2,y2]

  • xyxyn – rectangle coordinates normalized to the original image size

That is, in our image, two classes with numbers 0 and 16 were found, which correspond to the person and dog classes.

We can rewrite the code and get the information in a more convenient form.

def print_box(box):
    class_name, cords, conf = box
    print("Object type:", class_name)
    print("Coordinates:", cords)
    print("Probability:", conf)
    print("---")

for box in result.boxes:
    print_box([
        result.names[box.cls[0].item()],           # class name looked up by class id
        [round(x) for x in box.xyxy[0].tolist()],  # bounding box corners [x1, y1, x2, y2]
        round(box.conf[0].item(), 2),              # confidence rounded to 2 decimals
    ])

Result:

If we want to get images with bounding boxes, we can do it this way

from PIL import Image

for i, r in enumerate(results):
    img_bgr = r.plot()                            # annotated image as a BGR numpy array
    im_rgb = Image.fromarray(img_bgr[..., ::-1])  # convert BGR to RGB for PIL
    im_rgb.show()                                 # display the annotated image
    im_rgb.save(f"results{i}.jpg")                # save the annotated image to disk

2. YOLOv5

If we work with the YOLOv5 model, we just need to clone the YOLOv5 repository from GitHub and work with the network from there. In this case, the sequence of actions is as follows.

  1. Set the directory where we will clone the repository

%cd D:/yolov5

  2. Clone the repository from GitHub

!git clone https://github.com/ultralytics/yolov5.git

  3. Import the PyTorch library

import torch

  4. Go to the cloned folder and install the dependencies from requirements.txt

%cd D:/yolov5/yolov5
!pip install -r requirements.txt

The following commands are used for model detection, training and export tasks:

!python detect.py --weights yolov5x.pt --source img.jpg

The input image can be just a file, a screenshot, a folder, a video, etc. The weights are specified depending on the model we want to work with (nano, small, medium, large, xlarge).

!python train.py --img 640 --epochs 10 --data coco128.yaml --weights yolov5s.pt

You need to specify the image size, number of epochs for training, dataset, pre-trained weights.

!python export.py --weights yolov5s.pt --include torchscript 

You need to specify the model to be exported and the target format. This command exports the model to the TorchScript format. Other supported formats include ONNX, OpenVINO, TensorRT engine, CoreML, and more.

Now let's move on to loading an image and getting the results. We use the detect.py script, assuming we already have a project directory set up.

!python detect.py --weights yolov5x.pt --source D:/yolov_test1.jpg

The command line arguments are as follows

--source: path to input file

--weights: path to the file with yolov5 weights that we want to use for detection, in this case yolov5x.pt

--project: optional argument used to specify the folder for saving the results, which in turn will contain an exp subfolder.

We get the following result

Here you can see information about the network parameters used (weights, the dataset on which the network was trained, the number of layers in the network, etc.) and the detection results obtained (how many and what objects were detected). All results are saved in the folder …runs\detect\exp.

There is no need for any additional code to draw bounding boxes or determine the class name, everything is already done for us. We have a ready image with detected classes. If we feed a folder with images to the input, we will also get the detection results in the runs\detect\exp directory.
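
As an aside, YOLOv5 can also be used directly from Python code, without the CLI scripts, via torch.hub; here is a rough sketch (using the same test image path as above):

import torch

# Load a pre-trained YOLOv5 model from the Ultralytics repository (weights are downloaded on first run)
model = torch.hub.load('ultralytics/yolov5', 'yolov5x', pretrained=True)

results = model('D:/yolov_test1.jpg')  # run inference on an image path
results.print()                        # print a summary of the detected objects
results.save()                         # save annotated images to runs/detect/exp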

Training

  1. YOLOv8

So far we have used models pre-trained on well-known objects, but in practice you will most likely need a solution to detect specific objects for a specific business problem. Let's say we have a labeled dataset and want to train our own model to detect objects of a certain type.

I took a dataset from universe.roboflow.com for detecting balloons of three colors. The set is convenient because its directory structure is organized specifically for training YOLO models. The set contains train, val and test folders, and each of these folders contains labels and images folders; there is also a data.yaml file.

In my case the .yaml file looks like this

train: ../train/images
val: ../val/images
test: ../test/images

nc: 3
names: ['blue','green','red']

There are 744 images for training and 71 for validation. The dataset is there, now it needs to be loaded into the model. For training, we will use the train method:

model.train(data="data.yaml", epochs = 20)

We assume that the network model has already been loaded using the command

model = YOLO('yolov8m.pt')

The .yaml file is the only required parameter, but it is also worth setting the number of training epochs (the default is 100). Training has many other hyperparameters and configuration options that we will not dwell on here; they affect the performance, speed, and accuracy of the model. Key training parameters include the batch size, learning rate, and weight_decay (the L2 regularization parameter). The choice of optimizer, loss function, and composition of the training dataset can also significantly affect the training process. Each training epoch consists of a training stage and a validation stage.
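
For illustration, here is a hedged sketch of a train call with a few of these hyperparameters set explicitly (the values are arbitrary examples, not recommendations):

model.train(
    data="data.yaml",       # dataset descriptor file
    epochs=20,              # number of training epochs
    imgsz=640,              # input image size
    batch=16,               # batch size
    lr0=0.01,               # initial learning rate
    weight_decay=0.0005,    # L2 regularization
    optimizer="SGD",        # could also try "Adam" or "AdamW"
)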

During the training phase, the train method performs the following actions.

  • Extracts a random batch of images from the training dataset (the number of images is specified using the batch size parameter).

  • Runs these images through the model and produces the resulting bounding boxes of all detected objects and their classes.

  • Passes the result to a loss function, which is used to compare the obtained results with the correct result from the annotation files for these images. The loss function calculates the amount of error.

  • The result of the loss function is passed to the optimizer to adjust the model weights based on the magnitude of the error in the right direction to reduce the error in the next cycle. The default optimizer is SGD (Stochastic Gradient Descent), but you can try others, such as Adam, to see the difference.

During the validation phase, the train method does the following:

  • Extracts images from the validation dataset.

  • Passes them through the model and gets the detected bounding boxes for those images.

  • Compares the obtained result with the true values for these images from the annotation text files.

  • Calculates the accuracy of a model based on the difference between actual and expected results.

The progress and results for each epoch are displayed on the screen.

….

The training phase involves calculating the error value in the loss function, so the most valuable metrics here are box_loss and cls_loss.

  • box_loss shows the error in the predicted bounding boxes.

  • cls_loss shows the error in the predicted object classes.

Why are the losses broken down into multiple metrics? Because the model could correctly determine the bounding box around an object but incorrectly determine the class of the object in that box. The most valuable quality metric is mAP50-95, the Mean Average Precision (mAP) averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05. This metric shows how accurately the neural network was able to detect objects in the images from the validation set. If the model is training and improving, the accuracy should increase from epoch to epoch.
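
If you want to run validation separately and read these metrics from code, a minimal sketch (assuming the trained model object and the same data.yaml) could look like this:

# Run validation on the val split described in data.yaml and inspect the accuracy metrics
metrics = model.val(data="data.yaml")
print(metrics.box.map)    # mAP50-95
print(metrics.box.map50)  # mAP50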

If after the last epoch we have not received acceptable accuracy, then we can increase the number of epochs. We can also try to adjust other network parameters. Each case requires its own solution.

In addition to these metrics, a lot of statistics are written to disk during training. When training begins, a runs/detect/train subfolder is created in the current folder, and after each epoch various log files are written to it. It is worth looking into this folder; there is a lot of interesting information in it. This directory is created inside the project directory unless a different location is specified. Also, the model after each epoch is written to runs/detect/train/weights/last.pt, and the model with the highest accuracy is written to runs/detect/train/weights/best.pt. So, after training is complete, we get the best.pt file to use.

Contents of the /runs/detect/train subfolder.

Contents of the file results.png

Now let's look at our results after training YOLOv8. Don't forget to load the new model first.
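
A minimal sketch of that step (the paths are placeholders; adjust them to wherever your training run saved best.pt and to your own test image):

from ultralytics import YOLO

# Load the weights produced by our training run instead of the pre-trained COCO model
model = YOLO('runs/detect/train/weights/best.pt')

# Detect balloons on a new image and save the annotated result to disk (under runs/detect/ by default)
results = model.predict('balloons_test.jpg', save=True)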

This result was obtained after 20 training epochs. Training took about 6 hours on a computer with an AMD Ryzen 7 3700X 8-core processor at 3.59 GHz and 32 GB of RAM. In the first image, one blue balloon is still missed, and for one red balloon the model is a little hesitant, since the confidence is not very high, only 58%. You can continue training or try changing the model parameters.

  2. YOLOv5

Training is done with the train.py script with the following arguments:

--data: path to training dataset

--img: the size of the image fed to yolo input

--epochs: number of epochs

--weights: starting weights for training

--batch: batch size, i.e. the number of images simultaneously fed to the yolo input

There are other settings that you can read about on the project page.

In this case, I took a dataset from universe.roboflow.com on the accessibility of the urban environment. It has 5 classes into which obstacles encountered on the streets are divided: makes it difficult to pass (dificulta el paso), inaccessible due to construction (inaccesible por obras), requires a ramp (necesita rampa), inaccessible ramp (rampa inaccesible), unstable ground (suelo inestable).

Examples of images

The .yaml file looks like this

train: ../train/images
val: ../valid/images
test: ../test/images

nc: 5
names: ['dificulta el paso', 'inaccesible por obras', 'necesita rampa', 'rampa inaccesible', 'suelo inestable']

There are 7126 images for training and 918 for validation. Let's try to run training for 10 epochs just for the sake of example. Of course, to train a model on such a dataset, more epochs are needed.

!python train.py --img 416 --batch 16 --epochs 10 --data C:/yolov5/data1/data.yaml --weights C:/yolov5/data1/yolov5x.pt

After 10 epochs the results are of course not yet satisfactory, but my goal was not to achieve high detection rates.

Training proceeds in the same way as described above for the YOLOv8 model. All training results are likewise saved in the runs/train/exp directory; for each new training run, folders …/exp1, …/exp2, etc. are created.

Contents of the runs/train/exp subfolder

The results.png file looks like this

Now let's look at the detection result, only now with the --weights parameter we load our best.pt model obtained after training. The script must be run from the directory that contains detect.py.

!python detect.py --weights runs/train/exp8/weights/best.pt --source C:/streets1.jpg

I would say that our model did a good job on these images, even with only 10 training epochs.

Conclusion

The purpose of this article was to introduce readers who are just starting out in computer vision and object detection to the operation of the modern YOLOv5 and YOLOv8 neural network models. I did not set out to solve a complex problem, obtain new results, or compare the models. We considered the task of detecting objects with pre-trained YOLOv5 and YOLOv8 models, tried to train the models to detect objects on our own datasets, and discussed how to create your own dataset. When training such models, the main thing is patience, since the training process is not fast unless you own a GPU or rent a cloud service with one.


Finally, we invite everyone to open lessons on computer vision at Otus:

  • July 10: YOLO detector family: from v1 to v10. Let's look at object detection, follow the evolution of the YOLO detector family from the very first to the most current version (v10). Registration via link

  • July 23: Human pose estimation – an overview of approaches, models and solutions. You will learn: the formulation of the problem of human pose estimation, an overview of the main directions for solving this problem and what popular frameworks exist. Registration via link
