Testing Pixtral 12B and LLaMA 3.2 11B on the popular Tesla P100 and P40
It is worth explaining that inference is not only the process of applying a trained model to new data, but also the software that manages this process. Such programs are often called “inference engines”. They are responsible for how data enters the neural network and how results are output from it. The effectiveness of the inference engine directly affects the performance of the neural network as a whole.
Moreover, many inference solutions act as package managers for various neural networks, and they often include model quantization and other optimization tools. In our research we found that the popular llama.cpp does not support working with images, which is critical for multimodal models, and Ollama did not have the models we needed.
Attempts to use Mistral.rs on our EndeavourOS system (based on Arch Linux) and vLLM via Docker were also unsuccessful. As a result, we wrote our own inference scripts, relying on examples from the HuggingFace community, since without an inference engine a neural network is just a multi-gigabyte file that does nothing on its own. It is also worth noting that during testing we discovered an interesting graphical interface, also distributed as open source: ComfyUI. We had previously assumed it was only suitable for image-generation models like Stable Diffusion, but it turned out to work with language models as well.
Separately, we note that we focused on testing quantized versions of the networks, 4-bit and 8-bit. The reason is that most consumer video cards have 12-16 GB of memory, while the non-quantized versions of Pixtral and LLaMA need about 24 GB of VRAM. Within our test that would limit us to the P40 with its 24 GB, leaving the P100 out of the comparison entirely, since a full-size model would not fit into its 16 GB.
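A rough back-of-the-envelope estimate of the weights alone explains these numbers; this is only an approximation, since real usage is higher due to activations, the KV cache and quantization overhead:

# Approximate VRAM needed just to store the weights of a 12B-parameter model
params = 12e9
for name, bytes_per_param in {"FP16/BF16": 2, "INT8": 1, "4-bit": 0.5}.items():
    print(f"{name}: ~{params * bytes_per_param / 1024**3:.1f} GiB")
# FP16/BF16: ~22.4 GiB, INT8: ~11.2 GiB, 4-bit: ~5.6 GiB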
There are no tests without a test bench
Let us tell you about our bench for testing neural networks. Instead of a regular server, we used an open test stand for greater flexibility when changing configurations. Here are the characteristics of our test system:
Motherboard: Supermicro H11SSL-I (Rev 2.0)
Processor: AMD EPYC™ 7502 (32 cores / 64 threads, 2.5GHz-3.35GHz, 180W, 128MB L3)
Cooling system: 4U tower type with TDP 240W
RAM: 128GB (8 x 16GB, 3200 MHz, ECC REG)
System storage: Samsung PM9A1 1TB
Video card 1: NVIDIA Tesla P40 (24GB GDDR5)
Video card 2: NVIDIA Tesla P100 (16GB HBM2)
Power supply: Cougar BXM 1000 [CGR BX-1000]
About Python and pip
At ServerFlow we focus on best practices for working with Python and its package manager pip. But unfortunately, many online guides for Linux and Windows recommend installing packages in a system-wide environment, which can lead to serious problems.
Imagine that your operating system is an apartment building and Python is one of its important residents. Installing packages via pip into the shared system environment is like letting this tenant rewire the building's shared utilities without any oversight: sooner or later it leads to conflicts and breakages. The problem is especially acute on Linux, where Python often performs critical system functions. Changes to system-wide Python packages can break the operating system itself, much like damaging the foundation of our imaginary building. This is why we try to avoid this approach.
Arch Linux, the distribution we use, approaches this issue with extreme caution, even though it has a reputation for not stopping users from shooting themselves in the foot. It is worth noting that by default Arch Linux does not ship pip at all, and after you install it via `sudo pacman -S python-pip`, Arch will not let you use it outside a venv virtual environment. It is as if the building's management company forbade residents from altering the shared utilities on their own, while letting them arrange their apartments however they like. This Arch Linux approach may seem unusual, given that almost all guides and tutorials on the Internet ignore venv, but it significantly improves the stability and security of the system in the long run.
Pixtral 12B
Let's start by installing pip on our server. First, let's connect to it via SSH and forward port 7860 (the default Gradio port, as we will see below) to our local machine in advance, to make it easier to reach the WebUI of our models.
ssh -L 7860:127.0.0.1:7860 -p 47645 serverflow@SSH_server_IP
Let's install pip using the Pacman package manager.
sudo pacman -S python-pip
And we will create and immediately activate venv.
python3 -m venv pixtral
source pixtral/bin/activate
And install all the necessary dependencies.
pip install --upgrade pip
pip install torch transformers bitsandbytes accelerate gradio huggingface_hub numpy pillow requests
Then let's log in to huggingface-cli. For this you must already have a HuggingFace account and a token created there with all permissions, which you then paste into the console on the SSH server. You can do without HuggingFace and keep everything fully local, but for our purposes that would be an unnecessary complication.
huggingface-cli login
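Alternatively, you can log in directly from Python via the huggingface_hub library; a small sketch, where the token string is just a placeholder:

from huggingface_hub import login

# Paste your HuggingFace access token here (placeholder value)
login(token="hf_xxxxxxxxxxxxxxxxxxxx")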
As we said earlier, since we could not find or launch a ready-made inference solution that suited us, we decided to write our own script in Python.
main_log.py
import gradio as gr
from transformers import LlavaForConditionalGeneration, AutoProcessor, BitsAndBytesConfig, logging as transformers_logging
import torch
from PIL import Image
import requests
from io import BytesIO
import logging
import time
# Enable transformers logging at INFO level to see tokens per second and other performance info
transformers_logging.set_verbosity_info()
# Optionally, configure your own logger if you want additional control
logging.basicConfig(level=logging.INFO)
# Define the quantization config
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
)
# Model and processor ID
model_id = "Ertugrul/Pixtral-12B-Captioner-Relaxed"
# Load the model with 4-bit quantization
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config
)
# Load the processor
processor = AutoProcessor.from_pretrained(model_id)
# Define image resizing function
def resize_image(image, target_size=768):
    """Resize the image to have the target size on the shortest side."""
    width, height = image.size
    if width < height:
        new_width = target_size
        new_height = int(height * (new_width / width))
    else:
        new_height = target_size
        new_width = int(width * (new_height / height))
    return image.resize((new_width, new_height), Image.LANCZOS)
# Define the Gradio inference function
def process_input(text_prompt, image_url):
    # Fetch the image from the URL
    try:
        response = requests.get(image_url)
        response.raise_for_status()  # Ensure the request was successful
        image = Image.open(BytesIO(response.content))
    except requests.exceptions.RequestException as e:
        return f"Failed to load the image from the URL: {e}", ""
    image = resize_image(image, 768)  # Resize for optimal processing
    # Prepare conversation with the user prompt and image
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": f"{text_prompt}\n"},
                {"type": "image"}
            ],
        }
    ]
    PROMPT = processor.apply_chat_template(conversation, add_generation_prompt=True)
    inputs = processor(text=PROMPT, images=image, return_tensors="pt").to("cuda")
    # Start time for inference
    start_time = time.time()
    # Generate response using the model with specified parameters
    with torch.no_grad():
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            generate_ids = model.generate(
                **inputs,
                max_new_tokens=256,           # Equivalent to n_predict
                temperature=0.7,              # As specified
                top_k=40,                     # As specified
                top_p=0.5,                    # Nucleus sampling
                repetition_penalty=1.176470,  # Repetition penalty
                no_repeat_ngram_size=256      # Approximate repeat_last_n with repetition window
            )
    # End time for inference
    end_time = time.time()
    inference_time = end_time - start_time  # Calculate inference time
    # Decode the output
    output_text = processor.batch_decode(
        generate_ids[:, inputs.input_ids.shape[1]:],
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True
    )[0]
    # Add inference time to output (the <img> tag simply echoes the input image back into the UI)
    return f"Model output:\n{output_text}\n\nInference time: {inference_time:.2f} seconds", f'<img src="{image_url}" alt="Input image" width="300">'
# Gradio interface setup
with gr.Blocks() as demo:
    title = gr.Markdown("## Image captioning with the Pixtral model")
    with gr.Row():
        text_input = gr.Textbox(label="Text prompt", placeholder="For example: Describe the image.")
        image_input = gr.Textbox(label="Image URL", value="https://huggingface.co/spaces/aixsatoshi/Pixtral-12B/resolve/main/llamagiant.jpg")
    # Initial image display
    result_output = gr.Textbox(label="Model output", lines=8, max_lines=20)
    image_output = gr.HTML('<img src="https://huggingface.co/spaces/aixsatoshi/Pixtral-12B/resolve/main/llamagiant.jpg" alt="Input image" width="300">')
    submit_button = gr.Button("Run inference")
    submit_button.click(process_input, inputs=[text_input, image_input], outputs=[result_output, image_output])
demo.launch()
The transformers library loads the pretrained model weights into video memory, using 4-bit quantization to reduce the memory footprint. LlavaForConditionalGeneration.from_pretrained() initializes the model, and AutoProcessor prepares the input data.
During inference, the input image and text are converted into tensors by processor() and passed to the model. model.generate() starts an autoregressive generation process: the model iteratively predicts the next token*, using the attention mechanism to take into account the image context and the previously generated tokens. PyTorch performs the matrix operations efficiently on the GPU, applying the model weights to the input data.
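To make "autoregressive" concrete, here is a deliberately simplified greedy-decoding sketch of what generate() does under the hood. This is an illustration only, not the actual HuggingFace implementation (which also handles sampling, temperature, repetition penalties, KV caching and, for multimodal models, the image tensors on the first step):

import torch

def greedy_generate(model, input_ids, max_new_tokens=256, eos_token_id=None):
    """Toy autoregressive decoding loop: predict one token, append it, repeat."""
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(input_ids=input_ids).logits              # (batch, seq_len, vocab_size)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # most likely next token
        input_ids = torch.cat([input_ids, next_token], dim=-1)      # feed it back as new context
        if eos_token_id is not None and (next_token == eos_token_id).all():
            break  # stop once the model emits its end-of-sequence token
    return input_ids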
*Tokens are the pieces into which the model splits input and output text. They are not necessarily whole words: tokens can be fragments of words, individual characters, or even punctuation. For example, the phrase "neural network" might be split into tokens such as "neur", "al" and "network". The model works with sequences of these tokens, not with whole words.
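As a quick illustration, you can inspect how a HuggingFace tokenizer splits a phrase; a small sketch, where the exact pieces you get depend on the model's vocabulary:

from transformers import AutoTokenizer

# Any checkpoint with a tokenizer works here; we reuse the Pixtral repo from the script above
tokenizer = AutoTokenizer.from_pretrained("Ertugrul/Pixtral-12B-Captioner-Relaxed")

text = "neural network"
tokens = tokenizer.tokenize(text)                       # list of token strings, e.g. ['▁neural', '▁network']
ids = tokenizer.encode(text, add_special_tokens=False)  # the same pieces as integer IDs

print(tokens)
print(ids)
print(tokenizer.decode(ids))                            # decoding the IDs restores the original text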
As for Gradio, this library creates a simple web interface that allows you to enter text and image URLs through the browser. Gradio automatically handles HTTP requests, calls the process_input() function when a button is clicked, and displays the results.
Running Pixtral 12B 4bit on P100 –
CUDA_VISIBLE_DEVICES=0 python main_log.py
In the console you should see a link to WebUI, just click on it –
* Running on local URL: http://127.0.0.1:7860
Hurray, WebUI has launched! And this means that we continue.
To monitor the load on the GPU we used the nvtop utility, which is easy to install via the Pacman package manager on Arch Linux – `sudo pacman -S nvtop`.
Nvtop is an analogue of the Windows Task Manager, but specialized for video cards; for those more familiar with Linux, it is essentially similar to htop, which is used to monitor the CPU. It lets you watch GPU resource usage in real time.
Using nvtop, we observed that the Pixtral 12B model quantized to 4 bits consumes about 9 GB of video memory when idle. During active inference, memory usage rises to ~10.7 GB. As we said earlier, a non-quantized model would need about 24 GB.
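If you prefer to check memory usage from inside the script rather than in nvtop, PyTorch exposes the same information. A small sketch; the numbers will differ slightly from nvtop, since the CUDA context itself also takes some memory:

import torch

def print_vram_usage(device=0):
    """Print PyTorch's allocator statistics and the raw driver-level memory usage."""
    allocated = torch.cuda.memory_allocated(device) / 1024**3  # tensors currently held by PyTorch
    reserved = torch.cuda.memory_reserved(device) / 1024**3    # memory held by the caching allocator
    free, total = torch.cuda.mem_get_info(device)              # free/total bytes reported by the driver
    print(f"allocated: {allocated:.2f} GiB, reserved: {reserved:.2f} GiB, "
          f"used on device: {(total - free) / 1024**3:.2f} of {total / 1024**3:.2f} GiB")

print_vram_usage(0)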
Inference for our 256-token request took 41.36 seconds. With recognizing the Russian text in the image, the quantized model of course struggles, but it did a good job of recognizing what was happening in the picture and described it in sufficient detail.
Running Pixtral 12B 4bit on P40
Now let's try the P40; for this we simply change the CUDA_VISIBLE_DEVICES parameter from 0 to 1 –
CUDA_VISIBLE_DEVICES=1 python main_log.py
The model obviously recognizes English text better, but the surprising part is the inference time: it changed very little and, after several runs, settled at a stable ~40.35 seconds for 256 tokens.
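For reference, a quick back-of-the-envelope throughput estimate from these timings, assuming all 256 new tokens were generated on each run:

# Rough tokens-per-second estimates from our measured runs (256 new tokens per request)
for gpu, seconds in {"P100": 41.36, "P40": 40.35}.items():
    print(f"{gpu}: {256 / seconds:.2f} tok/s")
# P100: ~6.19 tok/s, P40: ~6.34 tok/s -- effectively the same speed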
Pixtral 12B 8bit
We next decided to test the model with 8-bit quantization, expecting to see approximately a twofold increase in memory consumption compared to the 4-bit version. The results turned out to be close to predictions: in standby mode, the model occupied 13.4 GB of video memory, and in inference mode it took up 15.5 GB, which is ~1.44 times more than required with 4-bit quantization.
Since the P100 has only 16 GB of memory, we decided to play it safe and start testing with the P40, which has a more comfortable 22.5 GB. This should have given us the necessary margin to evaluate the trade-off between the increased accuracy of 8-bit quantization and increased memory consumption.
But first, let's change the quantization config in the script. Instead of
# Define the quantization config
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_quant_type="nf4"
)
we will use:
# Define the quantization config for 8-bit with bfloat16 compute dtype
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,                     # Switch to 8-bit quantization
    bnb_8bit_compute_dtype=torch.bfloat16  # Use bfloat16 for compute
)
Pixtral 12B 8bit on P40
Results from the 8-bit version of the model were mixed. The quality of text generation improved slightly: the model recognized text in images more accurately and formulated more coherent descriptions. However, the inference time increased to 121 seconds, three times longer than the 4-bit version.
Pixtral 12B 8bit on P100
When we switched to the P100, expecting to see a drop in performance, we were in for a surprise. Contrary to predictions, this card completed the task in 119 seconds – slightly faster than the P40. Although the difference of two seconds is within the margin of error, it still makes you wonder what could be the matter.
Le Chat
Let's now move on to the free service from Mistral AI, the developers of the Pixtral model. In a chatbot with an interface similar to ChatGPT, the Pixtral 12B model is available without quantization, i.e. in 16-bit floating point, as opposed to integer INT4 or INT8 quantization.
Results from the full-size Pixtral 12B model in Le Chat showed improved text generation quality. The description turned out to be more accurate and coherent, although, as in the case of quantized models, it is not ideal. Generation time was approximately 3 seconds, which is significantly faster compared to our previous tests on local hardware.
LLaMA 3.2 11B
In addition to our self-written Python + Gradio inference, our attention was drawn to another interesting solution: the ComfyUI-PixtralLlamaMolmoVision extension for the ComfyUI platform. This extension can work not only with LLaMA 3.2 11B, but also with Pixtral 12B and Molmo.
Although the extension is limited to pre-quantized models, it was ideal for our purposes. While we could theoretically adapt the extension to our needs or develop our own solution from scratch, our main task was to test the models on the video cards at hand, not to dive into the intricacies of AI development.
In addition, using ComfyUI opens up access to a huge community formed mainly around generative models for creating images; however, as we found out, ComfyUI is also perfectly suitable for working with text.
Getting started with ComfyUI is the same as with Pixtral: we connect to our SSH server, except that this time we forward port 8188 in advance (`-L 8188:127.0.0.1:8188`). Then we need to download ComfyUI itself –
git clone https://github.com/comfyanonymous/ComfyUI.git
Then create a venv inside the ComfyUI folder and install all the dependencies into it, taking into account the fact that we have NVIDIA cards.
python -m venv ComfyUI
source ComfyUI/bin/activate
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124
pip install -r requirements.txt
Let's check that everything works, run `python main.py` and follow the link that appears –
To see the GUI go to: http://127.0.0.1:8188
Next, install ComfyUI-Manager; first, clone its repository into ComfyUI/custom_nodes –
git clone https://github.com/ltdrdata/ComfyUI-Manager.git
Now restart ComfyUI.
In the new interface:
Go to the Custom Nodes Manager section
Find and install ComfyUI-PixtralLlamaMolmoVision
In the Install Missing Custom Nodes section, download the missing components
Download the model to the ComfyUI/models/LLM folder using the link from the repository
Restart ComfyUI
Download workspace template `pixtral_caption_workflow.json` from the extension repository or add custom nodes
To test LLaMA:
Use pixtral_caption_workflow.json as a basis
Replace nodes with Llama Vision Model and Generate Text with Llama Vision
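If you would rather queue runs from a script than click through the browser, ComfyUI also exposes a small HTTP API. A minimal sketch, assuming ComfyUI is listening on 127.0.0.1:8188 and that you have exported your workflow via the "Save (API Format)" menu item to a file we will call workflow_api.json:

import json
import requests

COMFYUI_URL = "http://127.0.0.1:8188"

# Workflow exported from the ComfyUI web interface in API format (hypothetical file name)
with open("workflow_api.json", "r", encoding="utf-8") as f:
    workflow = json.load(f)

# Queue the workflow; ComfyUI returns a prompt_id that can later be polled via the /history endpoint
response = requests.post(f"{COMFYUI_URL}/prompt", json={"prompt": workflow})
response.raise_for_status()
print(response.json())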
Running LLaMA 11B 4bit on P100
Now let's start testing: launch ComfyUI again, but specifying the desired video card via CUDA_VISIBLE_DEVICES, starting with the P100.
CUDA_VISIBLE_DEVICES=0 python main.py
When idle, the model took up 7.58 GB of video memory; during inference this grew to 8.92 GB.
The LLaMA testing results pleasantly surprised us. The model completed the task in just 18.3 seconds, significantly faster than we expected. The quality of the output also turned out to be high.
However, we noticed an interesting feature: both LLaMA and Pixtral are prone to excessive talkativeness. They strive to describe the entire image, even when we only ask for textual information. When generating a large amount of text, repetitions are also observed, despite the use of the repetition_penalty parameter.
It is curious that in the Le Chat interface Pixtral behaves with more restraint. This suggests we may be missing some key configuration parameter; perhaps Le Chat adds a cleverer system prompt to the user's request to avoid repetition. If you have ideas on this, we will be happy to discuss them in the comments.
Generated 114 tokens in 18.208 s (6.261 tok/s)
The image features an orc with a humorous caption that reads, "I didn't choose da ork life, da ork life chose me." The orc
is depicted wearing glasses and holding a drink, adding to the comedic tone of the image. The use of the phrase "da ork life" instead of "the orc life" adds a playful touch, implying that the orc has been thrust into this lifestyle against its will. Overall, the image is a lighthearted and humorous take on the idea of being forced into a particular role or identity.
Prompt executed in 18.30 seconds
Running LLaMA 11B 4bit on P40
CUDA_VISIBLE_DEVICES=1 python main.py
And look at the result:
Generated 31 tokens in 61.683 s (0.503 tok/s)
На изображении написано: "Вы повстречали духа воды и получаете +5 к удаче до конца недели."
Prompt executed in 66.07 seconds
During testing we discovered an interesting quirk of LLaMA: the model did an excellent job of recognizing Russian text (the output above translates to: "The image says: 'You have met a water spirit and gain +5 to luck until the end of the week.'") when the request to it was also in Russian. However, when asked in English, it did not recognize the Russian text, describing it as "an unknown foreign language". Also, as you can see, ComfyUI displays the Russian output incorrectly, so it was copied from the console where ComfyUI was launched.
But the most surprising result was the significant difference in performance between the NVIDIA Tesla P100 and P40 video cards. The P100 turned out to be 3.6-4.7 times faster than the P40, despite the similar number of CUDA cores and the fact that both cards were released around the same time. This was contrary to expectations, given that the P40 has more memory (24 GB versus 16 GB for the P100). However, the P100 used faster HBM2 memory, while the P40 was equipped with GDDR5. Although, from our understanding of the theory of machine learning, memory speed should not have such a strong effect during inference.
Our guess is that one factor behind such a difference in performance is the differing level of support for half-precision (FP16) operations and the overall floating-point architecture. The P100 most likely has a much more efficient implementation of these operations, which turns out to be critical for neural network workloads; indeed, the GP100 chip in the P100 can execute FP16 at twice its FP32 rate, while the GP102 chip in the P40 runs FP16 at only a tiny fraction of its FP32 throughput.
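One way to test this hypothesis without any neural network at all is a raw matrix-multiplication benchmark in FP32 and FP16. A minimal sketch; run it once with CUDA_VISIBLE_DEVICES=0 and once with CUDA_VISIBLE_DEVICES=1 to compare the two cards:

import torch

def bench_matmul(dtype, size=4096, iters=50):
    """Time repeated (size x size) matrix multiplications on the current GPU."""
    a = torch.randn(size, size, device="cuda", dtype=dtype)
    b = torch.randn(size, size, device="cuda", dtype=dtype)
    for _ in range(5):               # warm-up so kernel launch overhead does not skew the timing
        torch.matmul(a, b)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.matmul(a, b)
    end.record()
    torch.cuda.synchronize()
    ms = start.elapsed_time(end) / iters       # average milliseconds per matmul
    tflops = 2 * size**3 / (ms / 1000) / 1e12  # 2*n^3 floating-point operations per matmul
    print(f"{dtype}: {ms:.2f} ms per matmul, ~{tflops:.2f} TFLOPS")

bench_matmul(torch.float32)
bench_matmul(torch.float16)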
Conclusion
Even though this was an unusual experience for us, everything turned out to be not so scary. Yes, there were nuances, especially the ones left behind the scenes; the most time went into trying to launch the ready-made inference solutions, Mistral.rs and vLLM. And the inexplicable performance difference between the P100 and P40 remains a mystery: for some reason the nominally weaker P100 turned out to be on par in Pixtral and several times faster in LLaMA. We will try to find answers to these and other questions in the following articles of the series.
If you have any guesses as to why this happened, advice on neural network inference, or suggestions for tests on these or other video cards you would like to see, we are waiting for you in the comments.