High-load IPC between C++ and Python


Background

A few years ago, Auriga was commissioned by a well-known medical startup to develop a solution for processing multiple video streams in parallel. The data were critical to the success of minimally invasive surgery and were the surgeon's only source of information. The result of processing each frame was a strip one pixel wide. The required data transmission characteristics were: synchronous processing of the parallel data streams at a total rate of 30 frames per second, with at most 400 ms per frame from the device driver to the doctor's display.

There were the following restrictions:

  • The stream was produced by a highly specialized hardware device whose driver was available only for Windows

  • Frame processing algorithms were implemented by the customer’s team in Python version 3.7

Solution concept

During operation, the analog device generated 16 data streams. The device driver wrote them in raw form to a ring buffer in non-blocking mode.

Device initialization and data preprocessing were handled by a program implemented in C++. It was responsible for starting and stopping the device, working with the driver, reordering and rejecting frames based on information from the stream's service channels and the application configuration, and passing the data stream on to the next component.

All computationally heavy mathematics (Hilbert transforms, Fourier transforms, adaptive filtering, and other transformations) was carried out in Python modules using the NumPy and SciPy libraries. Because of the GIL, we could not rely on true multithreading, so a separate instance of the Python handler had to be run as a separate OS process for each data stream.

The resulting, heavily decimated data was transferred to a PyQt UI.
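As an illustration of that layout only (not of the transport discussed in the rest of the article), a minimal sketch with one worker process per stream might look like the following; process_frame and the queue-based hand-off are hypothetical placeholders for the customer's math modules and for the real IPC:

# Sketch only: one OS process per data stream to sidestep the GIL.
# `process_frame` is a hypothetical stand-in for the customer's NumPy/SciPy
# pipeline, and the multiprocessing queues stand in for the real transport.
import multiprocessing as mp

import numpy as np


def process_frame(raw: bytes) -> np.ndarray:
    # Placeholder for the real Hilbert/Fourier/adaptive-filtering math.
    samples = np.frombuffer(raw, dtype=np.uint16).astype(np.float64)
    return np.abs(np.fft.rfft(samples))


def stream_worker(stream_id, frames):
    # Each worker owns its own interpreter, so heavy NumPy work on one
    # stream never competes with another stream for the GIL.
    while True:
        raw = frames.get()
        if raw is None:                 # sentinel: shut the worker down
            break
        result = process_frame(raw)
        # ...hand `result` off to the UI stage...


if __name__ == "__main__":
    queues = [mp.Queue(maxsize=5) for _ in range(16)]
    workers = [
        mp.Process(target=stream_worker, args=(i, q), daemon=True)
        for i, q in enumerate(queues)
    ]
    for w in workers:
        w.start()
    # ...the receiving side would feed queues[i].put(frame_bytes) here...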

First iteration of the solution

At that point the math took about 150 ms per frame, and the remaining 250 ms had to cover all the other stages from the driver to the image on the screen. Performance tests of several message queues (ActiveMQ, Mosquitto, RabbitMQ, ZeroMQ) showed insufficient stability and transfer speed: independent streams became desynchronized, and by design it was then necessary to discard the entire slice of remaining frames from the other streams, which caused noticeable glitches in the visualization and could not provide the level of quality required of a medical device.

An initial estimate of the transmitted data called for receiving and forwarding a total stream of 25 Mbit/s with a frame size of 100 kilobytes.

Such a stream could easily be handled by a TCP socket connection, which is easy to implement on both sides of the stream, in both C++ and Python. Test runs showed acceptable performance and stability. However, ongoing research on the optical part of the device required transmitting and processing more data.
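For reference, the Python side of such a transport is only a few lines: a loop that reads fixed-size frames from a socket. The host and port below are arbitrary placeholders, and the frame size is the 100 KB figure from the initial estimate:

# Minimal sketch of a Python-side TCP receiver for fixed-size frames.
# Host/port are placeholders; FRAME_SIZE matches the initial 100 KB estimate.
import socket

FRAME_SIZE = 100 * 1024
HOST, PORT = "127.0.0.1", 5000


def recv_exact(sock: socket.socket, size: int) -> bytes:
    """Block until exactly `size` bytes have been received."""
    chunks, remaining = [], size
    while remaining:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("sender closed the connection")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


with socket.create_connection((HOST, PORT)) as sock:
    while True:
        frame = recv_exact(sock, FRAME_SIZE)
        # ...pass `frame` to the math process...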

It became necessary to significantly increase the detail of the data being read, raising the size of one frame from 100 kilobytes to 1.1 megabytes, which increased the data flow to 26 Gbit/s and the processing time of one frame to 250–300 ms. Meanwhile, on the customer's test stand, a TCP socket showed a maximum speed of 1.7 Gbit/s in synthetic tests.

The order-of-magnitude mismatch between the available and required data rates resulted in instant TCP buffer overflows and subsequent cascading packet loss. A new solution had to be found.

Second iteration

Named and anonymous pipes were the next candidate transfer medium. A synthetic test on the stand showed a speed of about 21.6 Gbit/s, which came close to the requirements. However, technical difficulties appeared as soon as implementation started.
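A synthetic throughput test of this kind fits in a short script. The sketch below uses multiprocessing.Pipe, which on Windows is backed by a named pipe; the frame size and count are illustrative, and absolute numbers will differ from stand to stand:

# Rough synthetic throughput test for a pipe transport. On Windows,
# multiprocessing.Pipe is backed by a named pipe; frame size and count
# are illustrative, and results depend heavily on the machine.
import multiprocessing as mp
import time

FRAME_SIZE = 1_100_000   # ~1.1 MB, the enlarged frame size
N_FRAMES = 1_000


def sender(conn):
    payload = bytes(FRAME_SIZE)
    for _ in range(N_FRAMES):
        conn.send_bytes(payload)
    conn.close()


if __name__ == "__main__":
    rx, tx = mp.Pipe(duplex=False)          # rx reads, tx writes
    proc = mp.Process(target=sender, args=(tx,))
    proc.start()
    tx.close()                              # parent keeps only the read end

    start = time.perf_counter()
    received = 0
    for _ in range(N_FRAMES):
        received += len(rx.recv_bytes())
    elapsed = time.perf_counter() - start
    proc.join()

    print(f"{received * 8 / elapsed / 1e9:.2f} Gbit/s")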

Named Pipe Problems for Large Data Streams

The rate at which data arrived exceeded the rate at which it was consumed, which led to uncontrolled growth of the anonymous pipes' buffers, limited only by available memory. During daily stress tests RAM ran out first, then the system began to actively use swap until the swap was exhausted, at which point the test stand froze.

For the same reason, the time a data packet took to pass through the pipe was unpredictable in operation, anywhere from 50 to 150 ms, which again led to desynchronization of the channels and data loss.

The processes doing the math heavily loaded the 40-core Intel Xeon Gold processor. The operating system distributed them across the cores so that each frame handler got its own core, and the few remaining cores were left to the needs of the operating system.

Against this background, another problem appeared that had not been seen before: 10–15 minutes after launch, the load on all busy cores jumped from ~80% to 100%, and the processing time of one frame doubled, from an acceptable 300 ms to 600–700 ms.

Intel VTune and the Windows Performance Toolkit with WPR (Windows Performance Recorder) and WPA (Windows Performance Analyzer) were used to investigate this issue.

Analysis of the captured event trace showed a sharp increase in the time spent in KeZeroPages calls, which helped us understand what was happening. Many thanks to the article Hidden Costs of Memory Allocation:

When freeing a region of memory previously allocated to a process, the operating system fills it with zeros for security reasons.

This is done by a low-priority system process which, as we observed, runs on the very last core and quietly does its job, but only as long as it manages to zero all the freed memory in time.

As soon as it stops keeping up, the zeroing starts to be performed in the context of the process that used the memory.

In total we had:

  • large frame size

  • stream buffering in memory

  • many NumPy calls that created temporary arrays holding the data packet being processed

  • heavy load on the cores

The combination of these factors meant that the system page-zeroing process could not keep up with the work in the background and handed it over to the processes' own cores, increasing the load even further because of frequent context switches on those cores.

Refactoring the NumPy code significantly reduced the amount of "dirty" memory produced, but did not completely solve the problem. Fast and economical data transfer was required at the junction of C++ and Python, where the amount of transferred data was greatest. The alternative of porting the math from Python to C++ fit neither the time constraints nor the project budget.
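The gist of such a refactoring is to reuse preallocated work buffers instead of letting every intermediate expression allocate (and later free, and therefore re-zero) a fresh array. A simplified before/after sketch, with made-up shapes and a made-up filtering step:

# Simplified illustration of the NumPy refactoring: every temporary array is
# memory the OS eventually has to zero again. Shapes and the filtering step
# are invented for the example; the real pipeline was more involved.
import numpy as np

SHAPE = (1024, 512)
window = np.hanning(SHAPE[1])

# Before: each line allocates a new temporary array.
def process_allocating(frame: np.ndarray) -> np.ndarray:
    centered = frame - frame.mean()      # temporary
    windowed = centered * window         # temporary
    envelope = np.abs(windowed)          # temporary
    return envelope

# After: a single work buffer, allocated once, is reused for every frame.
_work = np.empty(SHAPE, dtype=np.float64)

def process_in_place(frame: np.ndarray) -> np.ndarray:
    np.subtract(frame, frame.mean(), out=_work)
    np.multiply(_work, window, out=_work)
    np.abs(_work, out=_work)
    return _work

FFT-based steps still allocate their output internally, so this approach reduces rather than eliminates temporaries, which is consistent with the refactoring easing but not fully solving the problem.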

Data transfer requirements

Implementing your own IPC protocol is a task that does not come up often but is well understood; the only exotic twist here was the two different programming languages. Based on the results of the research, we arrived at the following requirements (a minimal sketch of the resulting buffer layout follows the list):

  • Using a ring buffer in shared memory

  • Fixed message size, set at application startup

  • Exclusive publisher / exclusive subscriber

  • Non-blocking write, discarding messages on buffer overflow

  • Blocking read

  • No security requirements: work in an isolated system
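Seen from Python, such a buffer can be as simple as the sketch below. The section name, header format, and helper names are illustrative assumptions; the fixed message size and the 5-slot depth match the figures mentioned in this article, and mmap with a tagname maps a named shared-memory section that the C++ side can open under the same name:

# Sketch of the shared-memory ring buffer layout (Python view). Section name,
# header format, and helper names are illustrative assumptions.
import mmap
import struct

MSG_SIZE = 1_100_000               # fixed message size, set at startup
N_SLOTS = 5                        # ring depth
HEADER_SIZE = 8                    # one 64-bit monotonically growing write index
TAG = "frame_ring_0"               # hypothetical name of the shared section

buf = mmap.mmap(-1, HEADER_SIZE + N_SLOTS * MSG_SIZE, tagname=TAG)


def write_slot(index: int, payload: bytes) -> None:
    """Copy one fixed-size message into its slot, then publish the index."""
    assert len(payload) == MSG_SIZE, "messages have a fixed size"
    offset = HEADER_SIZE + (index % N_SLOTS) * MSG_SIZE
    buf[offset:offset + MSG_SIZE] = payload
    buf[0:HEADER_SIZE] = struct.pack("<Q", index)


def read_slot(index: int) -> bytes:
    """Copy one fixed-size message out of its slot."""
    offset = HEADER_SIZE + (index % N_SLOTS) * MSG_SIZE
    return bytes(buf[offset:offset + MSG_SIZE])

Which slot gets written or read at any moment is governed by the semaphores described in the next section.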

Custom IPC

Our own IPC implementation used the PyWin32 package to work with semaphores through win32api, which allowed the same semaphore to be accessed from independent applications. To coordinate access to a single buffer located in shared memory, two semaphores were used: one for writing and one for reading.

Test runs on the stand showed that a 5-element buffer was enough to smooth out the jitter of the transmitted data. In the rare cases when there was no free space in the buffer, the data being sent was discarded without blocking the writer.
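One plausible way to wire the two semaphores is sketched below, entirely in Python via win32event from PyWin32; in the real system the writer side lived in C++ and opened the same named objects through the Win32 API. The semaphore names and the index bookkeeping are assumptions, and write_slot/read_slot come from the layout sketch above:

# Sketch of the two-semaphore coordination around the shared ring buffer.
# `free_slots` counts empty slots, `used_slots` counts filled ones; the
# object names are hypothetical.
import win32event

N_SLOTS = 5

free_slots = win32event.CreateSemaphore(None, N_SLOTS, N_SLOTS, "frame_ring_free")
used_slots = win32event.CreateSemaphore(None, 0, N_SLOTS, "frame_ring_used")


def try_publish(index: int, payload: bytes) -> bool:
    """Non-blocking write: discard the message if the buffer is full."""
    if win32event.WaitForSingleObject(free_slots, 0) != win32event.WAIT_OBJECT_0:
        return False                               # no free slot -> drop frame
    write_slot(index, payload)                     # copy into shared memory
    win32event.ReleaseSemaphore(used_slots, 1)     # signal the reader
    return True


def consume(index: int) -> bytes:
    """Blocking read: wait until at least one message is available."""
    win32event.WaitForSingleObject(used_slots, win32event.INFINITE)
    payload = read_slot(index)                     # copy out of shared memory
    win32event.ReleaseSemaphore(free_slots, 1)     # free the slot for the writer
    return payload

With the counts initialized to N_SLOTS and 0, a full buffer makes the zero-timeout wait on free_slots time out immediately, which gives the non-blocking "drop on overflow" behavior, while the reader blocks on used_slots until the writer releases it.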

The simplicity of both the protocol and the transmitted data made it possible to achieve an average transmission rate of one frame per 5 ms. The synthetic test showed a maximum throughput of about 84.7 Gbit/s, covering the requirements for data volume and delivery time with a large margin.

Conclusions

Starting from the typical, recommended interprocess data transfer mechanisms, the solution evolved into a protocol of our own. Prototyping and writing our own synthetic tests allowed us to avoid multiple development iterations and unpleasant surprises during tests on the target device. It is also worth noting that the peculiarities of each OS have a critical impact on final performance in high-load tasks; the subsequent port of the solution to a Linux-like OS went without surprises. Attempts to change the processor configuration in the BIOS and to manually control the assignment of cores to processes did not give a noticeable result either. In the end, specific tasks require specific solutions.
