Benchmark OpenCV on STM32

Image processing has become a part of everyday life: nobody is surprised by face recognition or road-marking detection anymore. The most common library for these purposes at the moment is OpenCV… Today OpenCV is focused primarily on large platforms. Although the high-end models of modern microcontrollers have resources comparable to a Pentium II, running OpenCV on them is still very rare, even exotic.

Some time ago, we showed that it is fundamentally possible to use OpenCV on STM32 (and other microcontrollers of a similar class). Our goal then was only to demonstrate that the library can run on such hardware, so although performance was very low, we did not investigate why. Since then we have fixed the obvious shortcomings of the first solution, which allowed us to achieve acceptable performance. This article presents performance measurements for various OpenCV examples on the STM32F7 platform.

All examples in the article are based on Embox and can be reproduced independently by following the instructions from the repository with examples… We built the examples for the board with the -Os optimization flag, and all of them run with the cache enabled. The images can be placed on an SD card, but in the examples we store them in the QSPI flash on the demo board, which keeps the instructions for reproducing the results simpler.

Edge detection

Let’s start with the same example that was used in our previous work, namely edge detection. The example uses the Canny algorithm.

We provide the output log from running the edges application on Embox so that the performance can be compared with our previous work. For the other applications, we provide only tables with the measurement results.

An example of the analyzed image

Output for 512×269 image

root@embox:(null)#edges fruits.png 20
Image: 512x269; Threshold=20
Detection time: 0 s 116 ms
Framebuffer: 800x480 32bpp

Output for 512×480 image

root@embox:(null)#edges fruits.png 20
Image: 512x480; Threshold=20
Detection time: 0 s 254 ms
Framebuffer: 800x480 32bpp

Results

Image                Time from ROM (ms)   Time from QSPI (ms)
fruits.png 512×269   116                  120
fruits.png 512×480   254                  260

K-means

This example from the OpenCV distribution must find clusters of points and draw a circle of the corresponding color around each of them.

To estimate how densely the points are distributed, OpenCV uses the concept of “compactness”:

compactness: It is the sum of squared distance from each point to their corresponding centers.

In other words, compactness indicates how tightly the points are concentrated around their cluster centers.
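As a minimal sketch of that definition, the value can be computed as shown below. This is plain Python for illustration, not the OpenCV C++ sample, and the names `compactness`, `points`, `labels`, and `centers` are hypothetical.

```python
def compactness(points, labels, centers):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    total = 0.0
    for (x, y), label in zip(points, labels):
        cx, cy = centers[label]
        total += (x - cx) ** 2 + (y - cy) ** 2
    return total


# Two points assigned to one center at (1, 0): each is 1 unit away.
points = [(0.0, 0.0), (2.0, 0.0)]
labels = [0, 0]
centers = [(1.0, 0.0)]
print(compactness(points, labels, centers))  # 2.0
```

For a fixed number of points, a smaller value means that the points sit closer to their centers.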

As input, kmeans.cpp generates a 480 x 480 image with several clusters of dots of different colors. The center of each such cluster is chosen at random, and points are added to the cluster in accordance with the normal distribution.
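The way such input can be generated is sketched below in plain Python. This is an illustration under stated assumptions, not the code of kmeans.cpp: the function name `make_clusters`, the spread `sigma`, the number of points per cluster, and the seed are all hypothetical choices.

```python
import random


def make_clusters(cluster_count, points_per_cluster, size=480, sigma=24.0, seed=None):
    """Return (points, labels): Gaussian blobs around uniformly random centers."""
    rng = random.Random(seed)
    points, labels = [], []
    for label in range(cluster_count):
        # Center chosen uniformly at random inside the size x size image.
        cx, cy = rng.uniform(0, size), rng.uniform(0, size)
        for _ in range(points_per_cluster):
            # Points scattered around the center with a normal distribution.
            points.append((rng.gauss(cx, sigma), rng.gauss(cy, sigma)))
            labels.append(label)
    return points, labels


points, labels = make_clusters(cluster_count=3, points_per_cluster=100, seed=1)
print(len(points), sorted(set(labels)))  # 300 [0, 1, 2]
```

Because the centers and points are random, every run produces a different layout, which is consistent with the compactness and timing differing from row to row in the table.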

Compactness   Time from ROM (ms)   Time from QSPI (ms)
733589        34                   98
160406        6                    18
331447        14                   38
706280        13                   36
399182        8                    25

Squares

Recognition of geometric shapes, in particular rectangles, is also a standard example in the OpenCV library.

An example of the analyzed image

Results for 400×300 images:

Image      Time from ROM (ms)   Time from QSPI (ms)
pic1.png   1312                 1668
pic2.png   4893                 7268
pic3.png   1263                 1571
pic4.png   2351                 3590
pic5.png   1235                 1515
pic6.png   1575                 2202

Facedetect

Face detection was the original goal of our research. We wanted to evaluate how well such algorithms work on boards of this class. We used the standard facedetect example with a set of five images. The example uses Haar cascade detection (https://docs.opencv.org/4.5.2/db/d28/tutorial_cascade_classifier.html).

An example of the analyzed image

For 256×256 images:

Image                     Time from ROM (ms)   Time from QSPI (ms)
seq_256x256/img_000.png   3389                 3801
seq_256x256/img_001.png   4015                 4454
seq_256x256/img_002.png   4016                 4464
seq_256x256/img_003.png   3315                 3717
seq_256x256/img_004.png   3526                 3952

For 480×480 images:

Image                     Time from ROM (ms)   Time from QSPI (ms)
seq_480x480/img_000.png   14406                16149
seq_480x480/img_001.png   14784                16578
seq_480x480/img_002.png   15106                16904
seq_480x480/img_003.png   12695                14352
seq_480x480/img_004.png   14655                16446

Peopledetect

Increasing the complexity further, we decided to see how well detecting people in an image works. The peopledetect example can be used for this.

Sample image

Results

Image                     Time from ROM (ms)   Time from QSPI (ms)
basketball2.png 640×480   40347                52587

QR code

QR codes are a widely used example of pattern recognition.

Sample image taken at random from the Internet

Results

This example did not fit into the internal memory, so the results are only from QSPI.

Image                Time from ROM (ms)   Time from QSPI (ms)
qrcode_600x442.png   n/a                  3092

Features of work on microcontrollers

We found several interesting things when working with OpenCV on microcontrollers. First, code runs faster from the internal memory than from the external QSPI flash, even with the cache enabled.
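To put a number on that difference, the relative slowdown can be computed from the ROM and QSPI columns of the tables in this article. The sketch below is plain Python; the helper name `qspi_slowdown_percent` and the selection of rows are ours, chosen only for illustration.

```python
def qspi_slowdown_percent(rom_ms, qspi_ms):
    """Extra run time from QSPI relative to internal memory, in percent."""
    return (qspi_ms - rom_ms) / rom_ms * 100.0


# (ROM ms, QSPI ms) pairs copied from the tables in this article.
measurements = {
    "edges 512x269": (116, 120),
    "kmeans (first row)": (34, 98),
    "squares pic2.png": (4893, 7268),
    "facedetect 256x256 img_000.png": (3389, 3801),
    "peopledetect basketball2.png": (40347, 52587),
}

for name, (rom_ms, qspi_ms) in measurements.items():
    print(f"{name}: +{qspi_slowdown_percent(rom_ms, qspi_ms):.1f}%")
```

For these rows the penalty ranges from about 3% (edges) to almost 190% (the first k-means run), so the cost of running from QSPI depends strongly on the example.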

The second observation, which in our opinion is also related to the cache, is that performance depends on where the code is placed. We found that minor code changes, such as adding a command that is never called in the main algorithm, can increase or decrease performance by 5 percent or more.

Third, the amount of internal memory is fairly limited (2 MB). We were unable to quickly get the QR code recognition example running from internal memory.

Another important feature relates to the ARM Cortex-M cores. We used cores with support for SIMD instructions. This technology increases performance by performing several arithmetic operations at once on values packed into a single register. To assess whether this helps for our tasks, we ran measurements on Linux with and without SIMD support and found that in some examples SIMD gives a performance increase of 80%.
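The idea of packing several values into one register can be illustrated with a small emulation. The sketch below adds four unsigned 8-bit lanes held in two 32-bit words, which is the kind of result a packed-add instruction such as UADD8 produces in a single step. We assume here that this class of 32-bit packed arithmetic is what is meant; the function name `packed_add_u8` is hypothetical, and the Python loop only emulates the result, not the speed.

```python
def packed_add_u8(a, b):
    """Add four unsigned 8-bit lanes packed in two 32-bit words, lane by lane.

    Each lane wraps modulo 256 and never carries into its neighbour, which is
    what distinguishes a packed (SIMD) add from an ordinary 32-bit add.
    """
    result = 0
    for lane in range(4):
        shift = 8 * lane
        x = (a >> shift) & 0xFF
        y = (b >> shift) & 0xFF
        result |= ((x + y) & 0xFF) << shift
    return result


print(hex(packed_add_u8(0x01020304, 0x10203040)))  # 0x11223344
print(hex(packed_add_u8(0x00FF0000, 0x00010000)))  # 0x0 (lane wraps, no carry)
```

On hardware the four additions happen in one instruction instead of four loop iterations, which is where the speed-up comes from.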

However, for our processor, gcc cannot automatically generate code that uses SIMD instructions; they are available only as intrinsic functions. In other words, these instructions have to be inserted manually. OpenCV supports this approach, and SIMD support can be implemented for a custom architecture. At the moment, however, OpenCV is designed to work only with wide data types (128 bits and more). Therefore, the performance gain from SIMD on STM32 was not evaluated in this work. We hope this will be a direction for future research.

Analysis of results

These results indicate that such complex software as OpenCV can be used on microcontrollers. A number of examples were launched and all worked successfully. However, the performance is noticeably lower than that of the host platforms.

The use of OpenCV on microcontrollers depends heavily on the task to be solved. Most of the basic algorithms run fast enough that the delay is barely noticeable: the edge detection algorithm finished in a fraction of a second, which may be quite enough for an autonomous robot. Complex algorithms such as QR code processing can also be used, but the pros and cons of the solution have to be weighed. On the one hand, 3 seconds is a long time for recognition; on the other hand, for some purposes it may be enough.
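To relate these times to a processing rate, the time per image can be converted into images per second. The sketch below is plain Python with a hypothetical helper `frames_per_second`; it takes the edge detection times from the ROM column and ignores everything outside the measured detection step (image loading, capture, and display), so it is an upper bound rather than a measured frame rate.

```python
def frames_per_second(ms_per_frame):
    """Upper bound on frame rate if each frame takes ms_per_frame to process."""
    return 1000.0 / ms_per_frame


# Edge detection times in ms (ROM column) from the results in this article.
for size, ms in (("512x269", 116), ("512x480", 254)):
    print(f"{size}: {frames_per_second(ms):.1f} frames/s")
```

This gives roughly 8.6 frames per second for the 512×269 image and 3.9 for the 512×480 image.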

Therefore, we would assume that such platforms are not yet powerful enough for recognizing complex objects, for example detecting a person. The delay is very noticeable compared with processing the same image on the host. One should keep in mind, however, that the host in this comparison was a 64-bit Intel i7 with 8 cores running at a fundamentally different clock frequency, so its power consumption is also completely different. Besides, the microcontroller used in the comparison is not the most powerful one available: the STM32 family alone has the H7 series, which is twice as powerful.

The results can be seen in the video

Reproduction of results

You can reproduce the results presented in the article. This requires two repositories: the main Embox repository and the repository with sample images and ready-made configurations for the STM32F769i-discovery board… Follow the instructions in the README file from the repository with examples to reproduce the results.

You can also use other boards; for this you need to assemble the required configuration. In addition, you can experiment with other images or place the images on an SD card, which also only requires changing the Embox configuration.

P.S. This article was first published in English on embedded.com.
