Benchmark OpenCV on STM32

Today, image processing has become a part of our lives. No one is surprised by face recognition or road-marking detection. The most common library for these purposes at the moment is OpenCV… Today OpenCV is focused primarily on large platforms. Although high-end modern microcontrollers have resources comparable to a Pentium II, running OpenCV on them is still very rare, even exotic.

Some time ago, we showed that it is fundamentally possible to use OpenCV on STM32 (and other microcontrollers of a similar class). At that time our goal was only to demonstrate that the library can run on such hardware platforms. Therefore, although performance was very low, we did not investigate the reasons. Since then we have fixed the obvious shortcomings of the first solution, which allowed us to achieve acceptable performance. This article presents the results of performance measurements for various examples of using OpenCV on the STM32F7 platform.

All examples given in the article are based on Embox and can be reproduced independently by following the instructions from the repository with examples… We also used the -Os optimization flag for our examples on the board. All examples run with the cache enabled. The images can be located on the SD card, but in the examples we store them in the QSPI flash on the demo board, which keeps the instructions for reproducing the results simpler.

Edge detection

Let’s start with the same example that was used in our previous work, namely edge detection. The example uses the Canny algorithm.

We provide the output log from running the edges application on Embox, so that the performance can be compared with our previous work. For the other applications, we provide only tables with the measurement results.

An example of the analyzed image

Output for 512×269 image

root@embox:(null)#edges fruits.png 20
Image: 512x269; Threshold=20
Detection time: 0 s 116 ms
Framebuffer: 800x480 32bpp

Output for 512×480 image

root@embox:(null)#edges fruits.png 20
Image: 512x480; Threshold=20
Detection time: 0 s 254 ms
Framebuffer: 800x480 32bpp
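
The article does not show how the edges application measures its "Detection time" line. As a minimal sketch, assuming a monotonic clock is available through `std::chrono::steady_clock`, a measurement of this kind could look like the following. The names `format_elapsed` and `time_call` are hypothetical helpers, not part of Embox or OpenCV.

```cpp
#include <cassert>
#include <chrono>
#include <string>

// Formats a duration in the same style as the log lines above, e.g. "0 s 116 ms".
// (Hypothetical helper; the real application may format its output differently.)
std::string format_elapsed(std::chrono::milliseconds elapsed) {
    assert(elapsed.count() >= 0);
    const long long total_ms = elapsed.count();
    return std::to_string(total_ms / 1000) + " s " +
           std::to_string(total_ms % 1000) + " ms";
}

// Runs `work` once and returns the elapsed time on a monotonic clock.
template <typename Work>
std::chrono::milliseconds time_call(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
}
```

In an application, the call to the edge detector would be wrapped in a lambda and passed to `time_call`, and the result printed with `format_elapsed`.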

Results

| Image | Time from ROM (ms) | Time from QSPI (ms) |
|---|---|---|
| fruits.png 512×269 | 116 | 120 |
| fruits.png 512×480 | 254 | 260 |

K-means

This example from the OpenCV distribution finds clusters of points and draws a circle of the corresponding color around each of them.

To estimate how densely the points of a cluster are distributed, OpenCV uses the concept of “compactness”:

compactness: It is the sum of squared distance from each point to their corresponding centers.

In other words, compactness is an indicator of how tightly the points are concentrated around the center of their cluster: the smaller the value, the tighter the clusters.
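
The definition quoted above can be written out directly. The following is a minimal sketch that computes the sum of squared distances without depending on OpenCV; `Point2` is a hypothetical stand-in for an OpenCV point type, and `compactness` is an illustrative function, not the library's internal implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A 2-D point; a stand-in for an OpenCV point type so the sketch is self-contained.
struct Point2 {
    double x;
    double y;
};

// Sum of squared distances from each point to the center of its cluster.
// labels[i] is the index in `centers` of the cluster that points[i] belongs to.
double compactness(const std::vector<Point2>& points,
                   const std::vector<std::size_t>& labels,
                   const std::vector<Point2>& centers) {
    assert(points.size() == labels.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < points.size(); ++i) {
        assert(labels[i] < centers.size());
        const Point2& c = centers[labels[i]];
        const double dx = points[i].x - c.x;
        const double dy = points[i].y - c.y;
        sum += dx * dx + dy * dy;
    }
    return sum;
}
```

For four points that each lie exactly one unit from their cluster center, the function returns 4.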

As input, kmeans.cpp generates a 480×480 image with several clusters of dots of different colors. The center of each cluster is chosen at random, and points are added to the cluster according to a normal distribution.
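
The input generation described above can be sketched with the C++ standard library instead of OpenCV's own random number generator. The uniform choice of the center, the standard deviation `sigma`, and the names `random_center` and `make_cluster` are assumptions made for illustration; the article does not state the exact parameters that kmeans.cpp uses.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

struct Point2 {
    double x;
    double y;
};

// Picks a center uniformly at random inside a size x size image (assumed distribution).
Point2 random_center(double size, std::mt19937& rng) {
    assert(size > 0.0);
    std::uniform_real_distribution<double> coord(0.0, size);
    const double x = coord(rng);
    const double y = coord(rng);
    return {x, y};
}

// Draws `count` points around `center`; each coordinate is normally
// distributed with standard deviation `sigma` (an illustrative parameter).
std::vector<Point2> make_cluster(Point2 center, double sigma,
                                 std::size_t count, std::mt19937& rng) {
    assert(sigma > 0.0);
    std::normal_distribution<double> dist_x(center.x, sigma);
    std::normal_distribution<double> dist_y(center.y, sigma);
    std::vector<Point2> points;
    points.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        const double x = dist_x(rng);
        const double y = dist_y(rng);
        points.push_back({x, y});
    }
    return points;
}
```

Points generated this way may fall outside the image when the center is close to an edge, so a real drawing routine would need to clip them.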

| Compactness | Time from ROM (ms) | Time from QSPI (ms) |
|---|---|---|
| 733589 | 34 | 98 |
| 160406 | 6 | 18 |
| 331447 | 14 | 38 |
| 706280 | 13 | 36 |
| 399182 | 8 | 25 |

Squares

Recognition of geometric shapes, in particular rectangles, is also a standard example in the OpenCV library.

An example of the analyzed image

Results for 400×300 images:

| Image | Time from ROM (ms) | Time from QSPI (ms) |
|---|---|---|
| pic1.png | 1312 | 1668 |
| pic2.png | 4893 | 7268 |
| pic3.png | 1263 | 1571 |
| pic4.png | 2351 | 3590 |
| pic5.png | 1235 | 1515 |
| pic6.png | 1575 | 2202 |

Facedetect

Facial recognition was the original goal of our research. We wanted to evaluate how well such algorithms work on boards of this class. We used the standard facedetect example with a set of five images. The example uses Haar cascade detection (https://docs.opencv.org/4.5.2/db/d28/tutorial_cascade_classifier.html).

An example of the analyzed image

For 256×256 images:

| Image | Time from ROM (ms) | Time from QSPI (ms) |
|---|---|---|
| seq_256x256 / img_000.png | 3389 | 3801 |
| seq_256x256 / img_001.png | 4015 | 4454 |
| seq_256x256 / img_002.png | 4016 | 4464 |
| seq_256x256 / img_003.png | 3315 | 3717 |
| seq_256x256 / img_004.png | 3526 | 3952 |

For 480×480 images:

| Image | Time from ROM (ms) | Time from QSPI (ms) |
|---|---|---|
| seq_480x480 / img_000.png | 14406 | 16149 |
| seq_480x480 / img_001.png | 14784 | 16578 |
| seq_480x480 / img_002.png | 15106 | 16904 |
| seq_480x480 / img_003.png | 12695 | 14352 |
| seq_480x480 / img_004.png | 14655 | 16446 |

Peopledetect

Increasing the complexity, we decided to try detecting people in an image. The peopledetect example is suitable for this.

Sample image

Results

| Image | Time from ROM (ms) | Time from QSPI (ms) |
|---|---|---|
| basketball2.png 640×480 | 40347 | 52587 |

QR code

QR codes are a widely used example of pattern recognition.

Sample image taken at random from the Internet

Results

This example did not fit into the internal memory, so the results are only from QSPI.

| Image | Time from ROM (ms) | Time from QSPI (ms) |
|---|---|---|
| qrcode_600x442.png | n/a | 3092 |

Features of work on microcontrollers

We found several interesting things while working with OpenCV on microcontrollers. First, code executed from internal memory runs faster than code executed from the external QSPI flash, even with the cache enabled.
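
The size of this effect can be read off the tables above as a relative overhead. The helper below is a small illustrative calculation (the name `qspi_overhead_percent` is hypothetical). With the measured values it gives roughly 3.4% for edges on the 512×269 image (116 ms against 120 ms), roughly 30% for peopledetect (40347 ms against 52587 ms), and roughly 188% for the first k-means run (34 ms against 98 ms).

```cpp
#include <cassert>

// Relative slowdown, in percent, of running from QSPI flash
// compared with running from internal ROM.
double qspi_overhead_percent(double rom_ms, double qspi_ms) {
    assert(rom_ms > 0.0);
    return (qspi_ms - rom_ms) / rom_ms * 100.0;
}
```

The spread between examples suggests that the penalty depends strongly on the workload, although the article does not investigate why.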

Second, and in our opinion also related to the cache, performance depends on the placement of the code. We found that minor code changes, such as adding a command that is not called in the main algorithm, can increase or decrease performance by 5 percent or more.

Third, the amount of internal memory is fairly limited (2 MB). We were unable to quickly get the QR code recognition example running from internal memory.

Another important feature relates to the ARM Cortex-M cores. We used cores with support for SIMD instructions. This technology increases performance by performing several arithmetic operations at once on values packed into a single register. To assess how much this helps for our tasks, we carried out measurements on Linux with and without SIMD support and found that in some examples the use of SIMD gives a performance increase of 80%.
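
To illustrate the idea of several operations in one register, here is a portable sketch that adds four 8-bit values packed into a 32-bit word using ordinary integer arithmetic. It is not the code OpenCV uses, and the names `add_u8x4` and `add_bytes` are hypothetical. On a core with the DSP extension, a single instruction can do the same job (CMSIS exposes it as the `__UADD8` intrinsic), so the sketch only shows what such an instruction computes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Adds four unsigned 8-bit lanes packed in 32-bit words; each lane wraps modulo 256.
// The low 7 bits of every lane are added first, so no carry can cross a lane
// boundary; the top bit of each lane is then fixed up with XOR.
constexpr std::uint32_t add_u8x4(std::uint32_t a, std::uint32_t b) {
    return ((a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu)) ^ ((a ^ b) & 0x80808080u);
}

static_assert(add_u8x4(0x01020304u, 0x10203040u) == 0x11223344u, "plain lane add");
static_assert(add_u8x4(0xFF000001u, 0x01000001u) == 0x00000002u, "lanes wrap independently");

// Element-wise wrapping add of two byte arrays, four elements per step.
void add_bytes(const std::uint8_t* a, const std::uint8_t* b,
               std::uint8_t* out, std::size_t n) {
    assert(n % 4 == 0);  // keeps the sketch short: no scalar tail loop
    for (std::size_t i = 0; i < n; i += 4) {
        std::uint32_t wa = 0;
        std::uint32_t wb = 0;
        std::memcpy(&wa, a + i, 4);  // load four bytes into one word
        std::memcpy(&wb, b + i, 4);
        const std::uint32_t sum = add_u8x4(wa, wb);
        std::memcpy(out + i, &sum, 4);
    }
}
```

Because every lane is treated identically, the result does not depend on the byte order of the processor.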

However, for our processor, gcc cannot automatically generate code that uses SIMD instructions. Support exists only in the form of intrinsic functions; in other words, you need to insert these instructions manually. OpenCV supports this approach: you can implement SIMD support for a custom architecture. But at the moment OpenCV is designed to work only with wide data types (128 bits and more). Therefore, within the framework of this work, the improvement in performance when using SIMD on STM32 was not evaluated. We hope this will be a direction for future research.

Analysis of results

These results indicate that software as complex as OpenCV can be used on microcontrollers. A number of examples were launched and all of them worked successfully. However, the performance is noticeably lower than on host platforms.

Whether OpenCV is usable on a microcontroller depends heavily on the task to be solved. Most of the basic algorithms run too quickly for the eye to notice a delay. The edge detection algorithm, for example, finished in a fraction of a second; this performance may be quite enough for an autonomous robot. Complex algorithms such as QR code processing can be used, but the pros and cons of the solution must be weighed. On the one hand, 3 seconds is a long time for recognition, but on the other hand, for some purposes it may be enough.

We therefore assume that such platforms are not yet powerful enough for recognizing complex objects, for example, identifying a person. The delay is very noticeable compared to the recognition of the same image on the host. But one should also take into account that the comparison was made against a 64-bit Intel Core i7 with 8 cores and a much higher clock frequency, so the power consumption of that platform is completely different. Besides, the microcontroller used in the comparison is not the most powerful one. Even within the STM32 family there is the H7 series, which is twice as powerful.
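
To put these latencies in terms of throughput, a per-frame time can be converted to frames per second. The helper below is an illustrative calculation (the name `frames_per_second` is hypothetical), and it assumes the measured time is the only cost per frame. With the ROM timings from the tables above, edge detection at 116 ms corresponds to about 8.6 frames per second and at 254 ms to about 3.9, while QR code processing at 3092 ms corresponds to about 0.32.

```cpp
#include <cassert>

// Converts a per-frame processing time in milliseconds to frames per second.
double frames_per_second(double frame_time_ms) {
    assert(frame_time_ms > 0.0);
    return 1000.0 / frame_time_ms;
}
```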

The results can be seen in the video

Reproduction of results

You can reproduce the results obtained in the article. This requires two repositories: the main Embox repository and the repository with sample images and ready-made configurations for the STM32F769i-discovery board… By following the instructions in the README file of the examples repository, you can reproduce the results.

You can also use other boards; for this you need to assemble the required configuration. In addition, you can experiment with other images or place images on the SD card, which also only requires changing the Embox configuration.

P.S. This article was first published in English on embedded.com
