Object Detection: Recognize and Rule. Part 1

Computer vision technologies make life and business easier, cheaper, and safer. According to various expert estimates, this market will only grow in the coming years, which drives the development of these technologies toward better performance and quality. One of the most popular areas is Object Detection – locating objects in an image or a video stream.

The days when object detection was solved exclusively with classical machine learning (cascades, SVMs, and so on) have passed – approaches based on Deep Learning now dominate this area. In 2014, an approach was proposed that significantly influenced subsequent research and development in the field: the R-CNN model. Its later improvements (Fast R-CNN and Faster R-CNN) made it one of the most accurate detectors, which is why it is still used to this day.

Besides R-CNN, there are many other approaches to finding objects: the YOLO family, SSD, RetinaNet, CenterNet, and others. Some offer an alternative approach, while others refine the current one to improve performance. A discussion of almost any of them could fill a separate article, given their abundance of features and tricks 🙂

For study, I propose a series of articles analyzing two-stage Object Detection models. Understanding how they work conveys the basic ideas used in other implementations. In this post we will consider the most basic, and accordingly the first, of them – R-CNN.


Bounding box – coordinates that bound a certain region of the image, most often as a rectangle. It can be represented by 4 coordinates in two formats: centered ($c_{x}, c_{y}, w, h$) and regular ($x_{min}, y_{min}, x_{max}, y_{max}$).
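Converting between the two formats is a one-liner in each direction. A minimal sketch (the function names are mine, chosen only for illustration):

```python
def centered_to_corners(cx, cy, w, h):
    """Convert a centered box (c_x, c_y, w, h) to (x_min, y_min, x_max, y_max)."""
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def corners_to_centered(x_min, y_min, x_max, y_max):
    """Convert a corner box back to the centered (c_x, c_y, w, h) format."""
    w, h = x_max - x_min, y_max - y_min
    return (x_min + w / 2, y_min + h / 2, w, h)
```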


Hypothesis (Proposal), P – a specific region of the image (defined by a bounding box) in which an object is presumed to be located.

End-to-end training – training in which raw images are fed to the network input and ready answers are produced at the output.

IoU (Intersection over Union) – a metric of the degree of overlap between two bounding boxes.
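The metric is the area of the boxes' intersection divided by the area of their union. A minimal sketch for boxes in the regular $(x_{min}, y_{min}, x_{max}, y_{max})$ format:

```python
def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Intersection rectangle (may be empty).
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    iw = max(0.0, ix_max - ix_min)
    ih = max(0.0, iy_max - iy_min)
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Identical boxes give 1.0, disjoint boxes give 0.0.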


One of the first approaches applicable to locating an object in an image is R-CNN (Region-based Convolutional Neural Network). Its architecture consists of several successive steps, illustrated in Figure 1:

  1. Defining a set of hypotheses.
  2. Extracting features from the candidate regions with a convolutional neural network and encoding them into a vector.
  3. Classifying the object within a hypothesis based on the vector from step 2.
  4. Refining (adjusting) the coordinates of the hypothesis.
  5. Repeating from step 2 until all hypotheses from step 1 have been processed.

Consider each step in more detail.

Hypothesis Search

Given an input image, the first step is to break it down into small hypotheses of different sizes. The authors of the paper use Selective Search: at a high level, it builds a set of hypotheses (the object class does not matter yet) based on segmentation, finding object boundaries from pixel intensity, color differences, contrast, and texture. At the same time, the authors note that any similar algorithm can be used. In this way, roughly 2,000 different regions are selected, which partially overlap each other. For more accurate post-processing, each hypothesis is then expanded by 16 pixels in all 4 directions – as if adding context.


  • Input: original image.
  • Output: a set of hypotheses of different sizes and aspect ratios.
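The 16-pixel expansion step can be sketched as a small helper (an illustration of the idea, not the authors' code; the clipping-to-image-borders behavior is my assumption):

```python
def expand_proposal(box, pad=16, img_w=None, img_h=None):
    """Expand a proposal (x_min, y_min, x_max, y_max) by `pad` pixels on
    every side, clipping to the image borders when sizes are given."""
    x_min, y_min, x_max, y_max = box
    x_min, y_min = x_min - pad, y_min - pad
    x_max, y_max = x_max + pad, y_max + pad
    if img_w is not None:
        x_min, x_max = max(0, x_min), min(img_w, x_max)
    if img_h is not None:
        y_min, y_max = max(0, y_min), min(img_h, y_max)
    return (x_min, y_min, x_max, y_max)
```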

Image encoding

Each hypothesis from the previous step is fed to the input of the convolutional neural network independently and separately from the others. The network used is AlexNet without its last softmax layer. Its main task is to encode the incoming image into a vector representation, which is extracted from the last fully connected layer, FC7. The output is thus a 4096-dimensional vector representation.

You may notice that AlexNet's input has dimensions 3 × 227 × 227, while a hypothesis can have almost any size and aspect ratio. This problem is sidestepped by simply compressing or stretching the input to the required size.


  • Input: each hypothesis proposed in the previous step.
  • Output: vector representation for each hypothesis.


After obtaining a vector characterizing the hypothesis, its further processing becomes possible. To determine which object is located in the proposed region, the authors use classic SVM-based classification with separating hyperplanes (Support Vector Machine, which can be trained with the hinge loss). There should be $N_{c} + 1$ separate models (here $N_{c}$ denotes the number of defined object classes, and one is added to separately detect the background), trained according to the OvR principle (One vs. Rest – one against all, one of the methods of multiclass classification). In effect, a binary classification problem is solved: is a specific object class present inside the proposed region or not? The output is thus an $(N_{c} + 1)$-dimensional vector expressing the confidence for each object class contained in the hypothesis (the background is conventionally denoted by the zero class, $C_{i} = 0$).


  • Input: the vector of each of the proposed hypotheses from the penultimate layer of the network (in the case of AlexNet, this is FC7).
  • Output: after running each hypothesis in turn, we obtain a matrix of dimension $2000 × (N_{c} + 1)$ representing the class of the object for each hypothesis.
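The OvR scheme can be sketched with scikit-learn's hinge-loss linear SVM standing in for the paper's classifiers (assumptions: synthetic 16-dimensional vectors stand in for real 4096-d FC7 features, and three object classes plus background are used for brevity):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
DIM, N_CLASSES = 16, 3  # stand-ins; real FC7 vectors are 4096-d

# Synthetic "FC7 vectors": class 0 is background, 1..N_CLASSES are objects,
# each class shifted by its own mean so the toy problem is separable.
X = rng.normal(size=(200, DIM)) + np.repeat(np.arange(4), 50)[:, None]
y = np.repeat(np.arange(4), 50)

# One binary hinge-loss SVM per class (background included), OvR style.
svms = {c: LinearSVC(loss="hinge", max_iter=10_000).fit(X, (y == c).astype(int))
        for c in range(N_CLASSES + 1)}

def score_hypotheses(vectors):
    """Return an (n_hypotheses, N_CLASSES + 1) confidence matrix."""
    return np.stack([svms[c].decision_function(vectors)
                     for c in range(N_CLASSES + 1)], axis=1)

scores = score_hypotheses(X[:5])  # confidences for 5 hypotheses
```

With ~2,000 hypotheses per image, stacking these per-class confidences yields exactly the $2000 × (N_{c} + 1)$ matrix described above.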

Specification of hypotheses coordinates

The hypotheses obtained in step 1 do not always contain correct coordinates (for example, an object may be "cropped" badly), so it makes sense to correct them. According to the authors, this adds another 3-4% to the metrics. Hypotheses containing an object (whose presence was determined at the classification step) are therefore additionally processed by linear regression. Hypotheses with the "background" class do not need this additional processing, because there is in fact no object there.

Each object class has its own characteristic sizes and aspect ratios, so, logically, it is recommended to use a separate regressor for each class.

Unlike the previous step, the best results here are achieved not with the FC7 vector as input, but with the feature maps extracted from the last MaxPooling layer (in AlexNet, $MaxPool_{5}$, of dimension 256 × 6 × 6). The explanation is as follows: the vector stores information about the presence of an object with some of its characteristic details, while the feature map best preserves information about the location of objects.


  • Input: the feature map from the last MaxPooling layer for each hypothesis containing any object other than the background.
  • Output: corrections to the coordinates of the hypothesis's bounding box.
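The R-CNN paper parameterizes these corrections as scale-invariant shifts of the center and log-space corrections of the width and height. A sketch of applying a regressor's output to a proposal in centered format (the function name is mine):

```python
import math

def apply_deltas(proposal, deltas):
    """Apply a class-specific regressor output (d_x, d_y, d_w, d_h) to a
    proposal given in centered format (p_x, p_y, p_w, p_h)."""
    p_x, p_y, p_w, p_h = proposal
    d_x, d_y, d_w, d_h = deltas
    return (p_w * d_x + p_x,      # shift x by a fraction of the width
            p_h * d_y + p_y,      # shift y by a fraction of the height
            p_w * math.exp(d_w),  # rescale width in log-space
            p_h * math.exp(d_h))  # rescale height in log-space
```

Zero deltas leave the proposal unchanged, which makes regression toward small corrections a well-behaved target.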

Helper Tricks

Before proceeding to the details of model training, let us consider two auxiliary tricks that we will need later.

Labeling positive and negative hypotheses

In supervised learning, a certain balance between classes is always necessary; the lack of it can lead to poor classification accuracy. For example, if in a two-class dataset the first class occurs in only a few percent of cases, it is hard for the network to learn to detect it, because it can be interpreted as an outlier. Object Detection tasks have exactly this problem: in an image with a single object, only a few hypotheses (out of ~2,000) contain that very object (
