Monitoring AI systems. Part 2

In the life of an AI system, medical or any other, there are unfortunate moments.

Some of these situations are unexpected errors. Yes, every developer understands that sooner or later something will go wrong, but it always happens in a different way and sometimes at the most inopportune moment.

For example, an incorrectly filled body part tag in a DICOM file, combined with incorrect behavior of the image-filtering model, can lead to a pneumothorax being found in a foot:

Light tread


Some of the mistakes are entirely expected. It is not for nothing that the registration certificate of any medical AI product lists certain characteristics (accuracy, sensitivity, specificity), and they are not equal to one. Moreover, errors usually cluster well into categories. For example, one of the first versions of our ischemia detection system often gave false positives on cystic glial changes.

False alarm

And sometimes problems arise not because of the AI system itself, but because the properties of the input data change. A common case is connecting new equipment to the same model.

On the left is a typical input to the FLG model; on the right, an image from a newly connected device

Monitoring the operation of the ML system on the manufacturer’s side helps in such situations. Properly organized monitoring lets you respond to incidents quickly, learn about unexpected business events (for example, when nobody told us that a new hospital had been connected), generate new hypotheses for improvement, and select data for additional labeling.

Let’s figure out where exactly problems may arise, what type of monitoring can help in each case, and what tools can be used.

Technical monitoring

First of all, any ML system is software. Accordingly, it is characterized by all the problems that may arise in the operation of the software:

  • Errors during code execution – in other words, plain bugs.

  • Problems with hardware – the video card burned out, the cleaning lady pulled out the wire.

  • Increased load – the average or peak number of requests has grown, and accordingly the system’s speed (latency and/or throughput) has dropped.

That is, we need tools that let us receive alerts about bugs, track processing time and the utilization of various hardware resources, and collect statistics on request-processing errors broken down by type. Classic problems mean classic tools can be used.

Tools in a vacuum can only make things worse – we need a clear process for who uses them and how. For example:

  • Automatic alerts – should fire only when there is a real chance that a reaction is needed.

  • Regular monitoring – a designated person or team performs a review at a set frequency.

  • Situational monitoring – here the tools are used to analyze and find the causes of problems reported by external or internal clients of the service.

Examples of technical monitoring tools

Sentry has been my favorite tool since 2014, especially when paired with the messenger integration. Each team has its own channel for alerts in Mattermost, and right there you can discuss the error and what actions need to be taken. I would especially like to note the Performance section, which allows you to analyze the operating time of different system components.
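To make this concrete, here is a minimal sketch of how error and performance reporting with Sentry can be wired into a Python service; the DSN value and the process_study function are placeholders, not our actual configuration.

```python
# A minimal sketch of Sentry integration; the DSN and process_study() are
# placeholders rather than real project values.
import sentry_sdk

sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder DSN
    traces_sample_rate=0.2,  # sample a fraction of requests for the Performance section
)

def handle_request(study):
    # Wrapping processing in a transaction makes its duration visible
    # in Performance alongside other system components.
    with sentry_sdk.start_transaction(op="inference", name="process_study"):
        try:
            return process_study(study)  # hypothetical processing function
        except Exception as exc:
            sentry_sdk.capture_exception(exc)  # the alert ends up in the team channel
            raise
```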

Sentry

Another classic is Grafana, which can ingest data from a variety of sources – for example, Prometheus and Elasticsearch. It is suitable both for hardware monitoring and for collecting statistics on predictions.
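As an illustration of what might feed such a stack, here is a sketch of exporting a few technical metrics with the prometheus_client library; the metric names and the run_model call are invented for this example.

```python
# A sketch of exporting technical metrics to Prometheus (and on to Grafana);
# metric names and run_model() are illustrative only.
import time

from prometheus_client import Counter, Histogram, start_http_server

LATENCY = Histogram("study_processing_seconds",
                    "Time spent processing a single study")
ERRORS = Counter("study_processing_errors_total",
                 "Processing errors by type", ["error_type"])

def handle_study(study):
    start = time.monotonic()
    try:
        return run_model(study)  # hypothetical inference call
    except ValueError:
        ERRORS.labels(error_type="bad_input").inc()
        raise
    except Exception:
        ERRORS.labels(error_type="unexpected").inc()
        raise
    finally:
        LATENCY.observe(time.monotonic() - start)

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes /metrics on this port
```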

Grafana

Monitoring outliers and data quality

The next delicate point is the equipment and how it is operated at the time of the study. We can use different terms here – outliers, out-of-distribution (OoD) samples – but the essence does not change: sometimes our system receives data that we do not expect to see and on which it was not trained, or was barely trained. Such data may appear for a variety of reasons:

  • Equipment operation – defects, artifacts, incorrect settings.

  • Laboratory assistant’s work – incorrect positioning of the patient, incorrectly filled in tags.

  • The patient himself – failure to follow the instructions of the laboratory assistant, anatomical features, the presence of foreign bodies.

There are two main ways to detect such data samples:

  • Uncertainty estimation – assessing the model’s confidence in its prediction. There are many methods, from the simplest (calculating the variance of the predicted class probabilities) to complex ones that require changing the model architecture or running the study through the network several times.

  • Logging and anomaly detection – we log various values that appear during the operation of our system, for example DICOM tags, intermediate values (volumes of segmented target organs, image brightness) and the system’s predictions, and run some form of anomaly detection on top – at the very least, alerts on value thresholds. A small sketch of both approaches follows right after this list.
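A minimal sketch of both approaches, assuming a classification model that returns softmax probabilities and a record of values logged during processing; the thresholds, field names, and expected ranges here are assumptions made up for the illustration.

```python
# A sketch of the two approaches above: a simple uncertainty estimate from the
# predicted class probabilities and threshold-based checks on logged values.
# Thresholds, field names, and expected ranges are illustrative assumptions.
import numpy as np

UNCERTAINTY_VAR_THRESHOLD = 0.01   # low variance across classes ~ "flat" prediction
BRIGHTNESS_RANGE = (0.05, 0.95)    # expected mean image brightness, normalized

def check_uncertainty(probs: np.ndarray) -> bool:
    """Flag a study if the class probabilities are too flat (uncertain)."""
    # For a confident prediction one class dominates and the variance across
    # classes is high; a near-uniform distribution has variance close to zero.
    return float(np.var(probs)) < UNCERTAINTY_VAR_THRESHOLD

def check_logged_values(record: dict) -> list[str]:
    """Threshold-based anomaly checks over values logged during processing."""
    problems = []
    if not (BRIGHTNESS_RANGE[0] <= record["mean_brightness"] <= BRIGHTNESS_RANGE[1]):
        problems.append("image brightness out of expected range")
    if record["lung_volume_ml"] <= 0:
        problems.append("segmented lung volume is non-positive")
    if record.get("body_part") not in {"CHEST", "LUNG"}:
        problems.append(f"unexpected body part tag: {record.get('body_part')}")
    return problems

# Usage: if any check fires, send an alert (e.g. via Sentry) with the study ID.
record = {"mean_brightness": 0.5, "lung_volume_ml": 4200, "body_part": "FOOT"}
probs = np.array([0.4, 0.35, 0.25])
if check_uncertainty(probs) or check_logged_values(record):
    print("alert: study needs a closer look")
```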

Here, in addition to the development teams, the role of the medical consultant attached to the team is very important. In many cases, only a doctor can help you understand what is actually shown in the picture.

From the tooling point of view, the same Sentry does an excellent job here – if one of the rules fires while processing a study, we send an alert. Depending on the situation, processing either continues or an error is immediately returned to the client.
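One possible shape for such a rule-triggered alert, assuming Sentry is already initialized as in the example above; the rule name and the attached fields are illustrative.

```python
# A sketch of sending a rule-triggered alert to Sentry with study context;
# the rule names and attached fields are illustrative.
import sentry_sdk

def report_rule_violation(study_id: str, rule: str, details: str) -> None:
    with sentry_sdk.push_scope() as scope:
        scope.set_tag("rule", rule)            # e.g. "brightness_out_of_range"
        scope.set_extra("study_id", study_id)  # lets us look the study up later
        scope.set_extra("details", details)
        sentry_sdk.capture_message(f"Data quality rule fired: {rule}", level="warning")
```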

By clicking on such an alert, you can go to Sentry and find the study ID, which in turn can be entered into our internal platform built with Streamlit. It will find the study across all our databases and provide a link to the web viewer.
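The lookup page itself can be only a few lines of Streamlit; find_study and the viewer link below stand in for our internal components.

```python
# A sketch of a study lookup page in Streamlit; find_study() and viewer_url
# stand in for internal components.
import streamlit as st

study_id = st.text_input("Study ID from the Sentry alert")
if study_id:
    study = find_study(study_id)  # hypothetical search across our databases
    if study is None:
        st.warning("Study not found in any known database")
    else:
        st.write(study.metadata)  # DICOM tags, intermediate values, predictions
        st.markdown(f"[Open in web viewer]({study.viewer_url})")
```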

Examples of patient-related and equipment-related outliers

Sometimes the study is neither defective nor particularly rare – the model simply never encountered such images during training.

A CT scan taken with the patient on their side led to errors for some pathologies

Our platform, which I mentioned above, also allows you to filter and sort studies by different conditions – for example, analyze studies with a certain pathology or with the greatest model uncertainty.

Drift monitoring

The distribution of data at the level of a medical institution or an entire region, of course, is not static. The hospital may install new equipment, a new institution or an entire region may be connected to our system, the clinic may suddenly start doing X-rays of children. Ideally, the ML team should know about these business events in advance, but this, alas, does not always happen, so it is imperative to have technical methods for detecting changes in your arsenal.

During intensive monitoring after connecting new clients, all sorts of non-standard cases are often identified that need to be analyzed separately.

On the left, the image shows frames that in some cases extend onto the breast area; on the right, an X-ray of a child, which our system is not designed to work with.

The simplest automatic tool is alerts about important business events, for example, connecting a new client to the application.

Deeper monitoring includes, for example, tracking the appearance of new equipment or changes in the share of particular equipment in the flow of studies.
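A sketch of such a check: compare the manufacturer distribution in the current window of studies against a reference window and flag new manufacturers or large shifts in share. The column name and the ten-percentage-point threshold are assumptions for the example.

```python
# A sketch of tracking equipment drift: compare the share of each manufacturer
# in the current period against a reference period. Column names and the
# threshold are assumptions for this example.
import pandas as pd

SHARE_SHIFT_THRESHOLD = 0.10   # alert if a manufacturer's share moves by >10 p.p.

def equipment_drift(reference: pd.DataFrame, current: pd.DataFrame) -> list[str]:
    ref_share = reference["manufacturer"].value_counts(normalize=True)
    cur_share = current["manufacturer"].value_counts(normalize=True)

    alerts = []
    # Manufacturers that were never seen in the reference period.
    for name in cur_share.index.difference(ref_share.index):
        alerts.append(f"new equipment manufacturer in the flow: {name}")
    # Manufacturers whose share changed substantially.
    for name in cur_share.index.intersection(ref_share.index):
        if abs(cur_share[name] - ref_share[name]) > SHARE_SHIFT_THRESHOLD:
            alerts.append(f"share of {name} changed from "
                          f"{ref_share[name]:.0%} to {cur_share[name]:.0%}")
    return alerts
```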

In October a new manufacturer of X-ray machines appears, and the share of another grows significantly

Monitoring the quality of ML model performance

Last but not least: the model (or models) itself, or individual components of the system (preprocessing, postprocessing), may work incorrectly, which leads to errors – false positives and false negatives. Of course, we receive feedback from clients – from some often, from others very rarely – but it is highly desirable to have a picture of the system’s online quality. This is especially critical in the period right after a new release.

Naturally, at this stage the central role belongs to the consulting doctor, who can review the system’s output on a random sample or on studies selected by some filter.

The tools here are more or less the same, except that on top of them the results are compiled into a presentable report.
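A sketch of how such a review batch might be assembled: a small random sample plus the studies with the highest model uncertainty for the period, exported for the consulting doctor. The column names are assumptions.

```python
# A sketch of selecting studies for a doctor's review: a random sample plus
# the studies with the highest model uncertainty. Column names are assumptions.
import pandas as pd

def build_review_batch(df: pd.DataFrame, n_random: int = 30, n_uncertain: int = 20) -> pd.DataFrame:
    random_part = df.sample(n=min(n_random, len(df)), random_state=42)
    uncertain_part = df.nlargest(n_uncertain, "uncertainty")
    batch = pd.concat([random_part, uncertain_part]).drop_duplicates(subset="study_id")
    return batch[["study_id", "prediction", "uncertainty", "viewer_url"]]

# The resulting table can be exported and then turned into the review report:
# build_review_batch(period_df).to_csv("review_batch.csv", index=False)
```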

Working with incidents

It would be useful to recall the main stages of working with incidents. Unfortunately, medical AI systems are still very constrained here – to release a full-fledged update of the system, you have to go through the procedure of updating the registration certificate. There are currently discussions about simplifying this procedure for certified developers; there is a very reasonable proposal from the FDA on this.

Current issues

The main problems with monitoring now are related to deployment in the regions.

Firstly, most regions insist on local deployment inside a closed network. Such machines most often have no Internet access, and it is often not easy for us to reach them ourselves. This greatly limits timely monitoring and essentially turns the system into a black box even for its own developers.

Secondly, in the regions the share of difficult studies is much higher – with incorrect patient positioning, equipment defects, and incorrectly filled in tags. This significantly increases the load on developers and leads to incorrect operation of AI systems.

At the same time, I want to give credit where it is due – some regions are very willing to meet us halfway, listen to feedback, and fix mistakes. In any case, things here are much better than in India =)

If you want to learn even more about organizing ML development processes, subscribe to our Telegram channel Cooking ML
