How we won the CLEF 2024 competition on medical image generation
Problem statement
Diagnostic models based on the analysis of photographs or text descriptions of symptoms are now being actively developed all over the world. However, their training requires large amounts of data. Collecting such data is often difficult because it requires interaction with multiple healthcare institutions and obtaining permission to process confidential patient information.
Extensive datasets for popular imaging modalities such as X-ray and MRI are publicly available. But for some diseases there is simply not enough data. In addition, such data must be annotated by qualified specialists.
One solution to this problem is the development and use of generative neural networks to create training data. To motivate researchers to work in this direction, the ImageCLEFmed MEDVQA‑GI competition was organized as part of the CLEF 2024 conference.
CLEF (Conference and Labs of the Evaluation Forum) is a European conference on artificial intelligence that has existed for 25 years. Its main goal is to stimulate research in various relevant areas of AI. To this end, several competitions are organized every year ahead of the conference, in which researchers solve various problems related to developing models and assessing their quality. Participants submit their solutions along with papers describing how they were obtained, which are then presented at the conference; participation is open and free for everyone. MEDVQA‑GI is one such competition.
The challenge we and the other teams were solving was to develop a model for generating artificial images that simulate the results of endoscopic examinations of the stomach and intestines, such as gastroscopy and colonoscopy. The final model was expected to generate images from text queries specifying the part of the body examined, the instrument used during the procedure, and the type of disease to be depicted in the picture.
Participants were provided with a unique set of colonoscopy images that had not previously been used for similar tasks. The main quality metric was FID (Fréchet Inception Distance). Otherwise, participants were given complete freedom in choosing tools and approaches to solve the problem.
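For context, FID (the lower, the better) compares the statistics of Inception features extracted from real and generated images. Below is a minimal sketch of how it can be computed with torchmetrics; the tensors here are random placeholders rather than competition data.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# FID compares Inception-v3 feature statistics of real vs. generated images.
fid = FrechetInceptionDistance(feature=2048, normalize=True)  # normalize=True: float images in [0, 1]

real_images = torch.rand(64, 3, 299, 299)  # placeholder for real dataset images
fake_images = torch.rand(64, 3, 299, 299)  # placeholder for generated images

fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print(f"FID: {fid.compute().item():.3f}")  # lower is better
```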
Data
To develop the model, we were provided with two datasets: a training set and a test set. The training set contained 20,241 text-image pairs, where the texts were examples of the queries our model had to learn to process. The test set contained only texts, which were to be used for generating images with the final model.
The big problem with this dataset was its lack of diversity: there were only 2,000 unique images, and of the 20,241 texts only 483 were unique. That is, each image corresponded on average to about 10 texts. Of the 5,000 texts in the test set, only 260 were unique, and all of them also appeared in the training set.
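These statistics are easy to reproduce. A small sketch, assuming the captions are stored in CSV files with a `caption` column (the actual layout of the competition files may differ):

```python
import pandas as pd

# Hypothetical file names and column; adjust to the actual competition data layout.
train = pd.read_csv("train_captions.csv")
test = pd.read_csv("test_captions.csv")

print(len(train), train["caption"].nunique())        # 20,241 texts, 483 unique
print(len(test), test["caption"].nunique())          # 5,000 texts, 260 unique
print(test["caption"].isin(train["caption"]).all())  # every test text also occurs in train
```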
Obviously, this is a fairly small and not very diverse dataset, and its suitability for training a high-quality model seemed questionable. I will describe a bit later how we tried to deal with this.
Our solution
We began our research with the assumption that large pre-trained generative models would be best suited to producing the desired images. Such models are trained on large and diverse datasets, which helps them capture complex structures and produce high-quality images.
But in their original form, these models cannot create specialized medical images, since they are focused on general tasks and are not sufficiently adapted to narrow areas. Here are examples of images that were generated by some popular models based on queries from our dataset:
Therefore, we decided to fine-tune a large model using LoRA, which is currently the most popular method for this.
The Kandinsky team suggested using their model: it delivers high generation quality and ranks among the best in the world, while its weights are publicly available on HuggingFace, which greatly simplifies working with it.
Kandinsky 2.2
We tried the Kandinsky 2.2 model first, because it provides good quality while being less demanding to fine-tune than the newer Kandinsky 3.0 model.
The Kandinsky 2.2 architecture consists of the following parts:
The Image Prior is a diffusion mapping model that generates a visual CLIP embedding from a text prompt or its text CLIP embedding, while staying within the latent visual space. The CLIP model used is CLIP‑ViT‑G.
The Image Decoder is a U‑Net diffusion model for direct image generation.
Sber‑MoVQGAN is a version of VQGAN modified by the developers, which demonstrated good quality in their experiments.
The Image Prior is used to synthesize a visual embedding from the given text, and this embedding is then used when training the Image Decoder. Thus, the reverse diffusion process learns to restore the latent representation of an image not only from the text but also from the visual embedding, which has a positive effect on the final quality.
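To make the two-stage scheme concrete, here is a minimal inference sketch using the public diffusers checkpoints; the prompt is an illustrative example rather than an actual query from the competition data.

```python
import torch
from diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline

# Stage 1: the Prior maps the text prompt to a visual CLIP embedding.
prior = KandinskyV22PriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
).to("cuda")
# Stage 2: the Decoder (U-Net + MoVQGAN) turns that embedding into an image.
decoder = KandinskyV22Pipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16
).to("cuda")

prompt = "colonoscopy image showing a polyp in the sigmoid colon"  # illustrative prompt
image_emb, negative_emb = prior(prompt, guidance_scale=1.0).to_tuple()

image = decoder(
    image_embeds=image_emb,
    negative_image_embeds=negative_emb,
    height=512,
    width=512,
    num_inference_steps=50,
).images[0]
image.save("generated.png")
```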
You can read more about Kandinsky 2.2 in the developers' article Kandinsky 2.2 – a new step towards photorealism.
The main difficulty in fine-tuning Kandinsky 2.2 was the need to fine-tune both of its parts in parallel in order to get the best results and to compute the metrics correctly. The figure below shows an example of generations obtained when only one of the two parts of the model is fine-tuned.
FID values computed on such images do not allow an objective assessment of the model's performance in the context of the problem being solved. For this reason, the fine-tuning process was organized in three stages.
In the first stage, the Decoder model was trained with a LoRA rank of 32 for 40 epochs; no metrics were calculated at this point. In the second stage, using the already trained Decoder, we ran experiments with the Prior model. The main parameter tuned in these experiments was the LoRA rank: we swept it starting from 4, then from 8 to 64 in steps of 8, and additionally considered rank 128. The graph below shows the results of these experiments. When assessing model quality, we relied mainly on the FID metric.
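To illustrate this setup, here is a sketch of how LoRA adapters of different ranks can be attached with the peft library. The target module names correspond to the attention projections in the diffusers U-Net; treat them and the helper itself as an illustration rather than our exact training script.

```python
from diffusers import UNet2DConditionModel
from peft import LoraConfig, get_peft_model

# Decoder U-Net from the public Kandinsky 2.2 checkpoint.
unet = UNet2DConditionModel.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder", subfolder="unet"
)

def add_lora(model, rank: int):
    # Attach LoRA to the attention projections; only these low-rank matrices are trained.
    config = LoraConfig(
        r=rank,
        lora_alpha=rank,
        target_modules=["to_q", "to_k", "to_v", "to_out.0"],
        lora_dropout=0.0,
    )
    return get_peft_model(model, config)

# Stage 1: Decoder U-Net with a fixed LoRA rank of 32.
decoder_unet = add_lora(unet, rank=32)

# Stage 2: sweep the LoRA rank of the Prior while the Decoder stays fixed.
prior_ranks = [4] + list(range(8, 65, 8)) + [128]  # 4, 8, 16, ..., 64, 128
```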
In general, the metrics improved steadily as the rank increased. However, the change in FID between ranks 64 and 128 was quite small, which, given the large gap between these two rank values, we took as a signal that increasing the rank further would not bring significant gains.
The next step was to fine-tune the Decoder model once more, this time together with the Prior model that had LoRA of rank 128 attached. We selected the LoRA rank for the Decoder in a similar way. The results are presented in the graph:
The influence of the rank on final image quality turned out to be less pronounced in this case. Based on the data obtained, we can conclude that it is the Prior model that has the greatest impact on the quality of the generated images.
Kandinsky 3.0
After a fairly successful start with Kandinsky 2.2, the logical continuation of our experiments was to fine-tune the newer version of this model, Kandinsky 3.0, which, according to the developers, should give better results when fine-tuned with LoRA.
In Kandinsky 3.0, the developers abandoned the two-stage generation used in Kandinsky 2.2. Instead, they opted for the more classical approach of passing encoded text directly into the model. This was made possible by the emergence of new large language models that understand text much better than the CLIP text encoder. As a result, the new architecture consists of three parts:
FLAN‑UL2 is a large language model based on the T5 architecture. Kandinsky 3.0 uses only the Encoder of this model.
U‑Net with a modified architecture consisting mainly of BigGAN‑deep blocks, which allows the depth of the architecture to be doubled while keeping the total number of parameters unchanged.
Sber‑MoVQGAN is the same decoder as in version 2.2.
These architectural changes made it possible to significantly simplify both training and fine-tuning, since only the U‑Net needs to be trained while all other components are kept frozen. Details can be found in the article Kandinsky 3.0 – a new model for generating images from text.
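A minimal sketch of loading Kandinsky 3.0 from the public HuggingFace checkpoint with diffusers, freezing everything except the U-Net before fine-tuning:

```python
import torch
from diffusers import Kandinsky3Pipeline

# fp16 variant of the public checkpoint.
pipe = Kandinsky3Pipeline.from_pretrained(
    "kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16
).to("cuda")

# Only the U-Net is trained; the FLAN-UL2 encoder and MoVQGAN stay frozen.
pipe.text_encoder.requires_grad_(False)
pipe.movq.requires_grad_(False)

image = pipe(
    "colonoscopy image showing a polyp in the sigmoid colon",  # illustrative prompt
    num_inference_steps=25,
).images[0]
```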
As with Kandinsky 2.2, we first experimented with different LoRA rank values during training. The figure below shows examples of the resulting images. As you can see, they clearly lack photorealism: the images have an unusual texture, reminiscent of drawings on paper or renders from a 3D editor. This had a negative impact on the FID values, which varied in the range of 90–110.
We ran additional experiments and tried changing various network and training parameters, but we could not improve the images. The developers also tried to help us, but to no avail. So we decided to move on.
MSDM
During our research, we found the article Diffusion Probabilistic Models beat GANs on Medical Images. In it, the authors also generate medical images, training a small diffusion model, which they call Medfusion, from scratch. Their model showed good quality despite being significantly smaller than pre-trained models (about 400 million parameters in total).
We decided to try to repeat their success. However, their model was not designed for text-to-image generation; it generated images conditioned only on class labels. So we adapted it to our task, proposing our own modification of their architecture, which we called MSDM (Medical Synthesis with Diffusion Models).
Architecturally, this model is broadly similar to Kandinsky and is also conceptually based on Stable Diffusion. Our modification consisted mainly of adding cross-attention blocks to its layers so that the model could take the context from the prompt into account.
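A simplified sketch of the kind of cross-attention block we mean; the dimensions and the residual wiring are illustrative and not the exact MSDM code.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Lets flattened U-Net features attend to the text (prompt) embeddings."""

    def __init__(self, dim: int, context_dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(
            embed_dim=dim, num_heads=num_heads,
            kdim=context_dim, vdim=context_dim, batch_first=True,
        )

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # x:       (batch, h*w, dim)              -- flattened feature map
        # context: (batch, seq_len, context_dim)  -- text encoder output
        out, _ = self.attn(self.norm(x), context, context)
        return x + out  # residual connection
```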
This model immediately demonstrated a noticeably better FID value than Kandinsky (~35). However, it was during its training that the weaknesses of the dataset became most apparent: the resulting images noticeably lacked diversity. Moreover, the model had a strong tendency to simply reproduce images from the training set.
To address this, we tried to augment the data. This was not easy, since in this dataset the color and position of objects in the images matter a great deal, so applying the most common augmentations, such as rotation or color changes, would be incorrect. We therefore approached the problem from a different angle and tried to diversify the texts as much as possible so that the model would memorize specific prompts less. To do this, we used GPT-3.5 Turbo to generate paraphrases for each prompt in the dataset. In this way, we managed to increase the number of unique texts from ~500 to ~11,000.
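A sketch of how such paraphrases can be generated through the OpenAI API; the system prompt and parameters here are illustrative, not the exact ones we used.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def paraphrase(text: str, n: int = 20) -> list[str]:
    # Ask the model for n rephrasings that keep the medical content intact.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Rephrase the medical image description, keeping the anatomy, "
                        "instrument and findings unchanged."},
            {"role": "user", "content": text},
        ],
        n=n,
        temperature=0.9,
    )
    return [choice.message.content.strip() for choice in response.choices]

variants = paraphrase("generate an image containing a polyp")  # illustrative prompt
```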
Next, we tried different ways of adding the paraphrases to training: completely replacing the original texts with them, mixing them in with a certain probability, and simply adding them to the originals. The last approach turned out to be the best in terms of metrics.
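The three strategies can be sketched as a single caption-selection helper; the names and the mixing probability are illustrative.

```python
import random

def pick_caption(original: str, paraphrases: list[str], mode: str) -> str:
    if mode == "replace":  # always use a paraphrase instead of the original text
        return random.choice(paraphrases)
    if mode == "mix":      # use a paraphrase with a certain probability
        return random.choice(paraphrases) if random.random() < 0.5 else original
    if mode == "extend":   # originals and paraphrases form one common pool
        return random.choice([original] + paraphrases)
    raise ValueError(f"unknown mode: {mode}")
```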
In addition to the enriched texts, we increased the dropout values in the model and also added context dropout and self-conditioning. Together, this helped improve the situation, and the images generated by the model immediately became noticeably more diverse.
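Context dropout here refers to randomly replacing the text conditioning with a null embedding during training (the idea behind classifier-free guidance), so the model relies less on memorized prompts. A minimal sketch, with an illustrative probability:

```python
import torch

def apply_context_dropout(text_emb: torch.Tensor,
                          null_emb: torch.Tensor,
                          p: float = 0.1) -> torch.Tensor:
    # text_emb: (batch, seq_len, dim); null_emb broadcasts to the same shape.
    drop = torch.rand(text_emb.size(0), 1, 1, device=text_emb.device) < p
    return torch.where(drop, null_emb.to(text_emb.dtype), text_emb)
```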
Results
As our final solution, we submitted three sets of images from the following models: Kandinsky 2.2 + LoRA with rank 128, and two versions of our MSDM model, one trained without the paraphrased texts and one with them.
The organizers tested them using several datasets hidden from us and sent the results, which can be seen in the table:
| Model Name | Dataset Type | FID (↓) | IS (↑) | IS (med) (↑) |
|---|---|---|---|---|
| Kandinsky 2.2 + LoRA, rank 128 | single | 0.086 | 1.624 | 1.633 |
| Kandinsky 2.2 + LoRA, rank 128 | multi-center | 0.064 | 1.624 | 1.633 |
| Kandinsky 2.2 + LoRA, rank 128 | both | 0.066 | 1.624 | 1.633 |
| MSDM | single | 0.114 | 1.791 | 1.792 |
| MSDM | multi-center | 0.117 | 1.791 | 1.792 |
| MSDM | both | 0.114 | 1.791 | 1.792 |
| MSDM + paraphrases | single | 0.125 | 1.773 | 1.775 |
| MSDM + paraphrases | multi-center | 0.121 | 1.773 | 1.775 |
| MSDM + paraphrases | both | 0.119 | 1.773 | 1.775 |
As you can see, the metric values in the tests differ significantly from those we observed during the experiments. It is also interesting that the Kandinsky 2.2 model showed the best results among the three models on the test set, while its results were worse during the experiments.
While we do not yet have a complete understanding of the reasons for these differences in metrics, we will continue to cooperate with the organizers and plan to conduct additional experiments to better understand the results obtained. However, our metric values turned out to be the best among the teams that submitted solutions, which means we were the winners of the competition!
We made the following conclusions for ourselves.
Firstly, both approaches we tested are quite viable and solve our problem well. The 460-million-parameter model (MSDM) did not show much worse metric values than the 4.6-billion-parameter model (Kandinsky 2.2). However, even though MSDM is smaller overall, training it from scratch still takes more time and resources than fine-tuning a large model with LoRA, so we still consider LoRA fine-tuning the more effective and promising method for such problems. Secondly, data augmentation really does help improve results significantly: you can and should look for different ways to diversify your data, and our approach of paraphrasing the texts proved very effective.
In the future, we plan to test our approach on other datasets and models in order to study in more detail how LoRA parameters affect the final result.